Construction of an animal skeleton keypoint detection model introducing Transformer and scale fusion

张飞宇; 王美丽; 王正超

doi:10.11975/j.issn.1002-6819.2021.23.021

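Summary: An animal's posture and behavior are closely linked to its health, and detecting skeletal keypoints is a prerequisite for animal pose recognition and abnormal behavior analysis. To address the low accuracy and poor robustness of existing keypoint detection methods on animal skeletons, this study proposes an animal skeleton keypoint detection model that incorporates a Transformer encoder. First, an improved Transformer encoder is introduced into the feature extraction layer of the HRNet network to capture the spatial constraint relationships between keypoints, which yields better detection performance on a small-scale sheep dataset. Second, a multi-scale information fusion module is introduced to improve the model's ability to learn features of different dimensions, so that the model can be applied to more practical scenarios. To verify the effectiveness and generalization of the model, the study collected and annotated a sheep skeleton keypoint dataset and added the Siberian tiger dataset ATRW to form the training set. The experimental results show that the model achieves accuracies of 77.1% and 89.7% on the sheep and Siberian tiger keypoint datasets, respectively, outperforming the comparison models with a lower computational cost. Detection takes 14 ms per image, which meets the requirement for real-time detection. Cross-domain tests on datasets of cattle, horses, and other animals also detect skeletal keypoints well, and the interpretability of the Transformer encoder is analyzed. The study can provide effective technical support for the accurate detection of animal skeleton keypoints.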

Abstract: With the rapid development of intelligent agriculture and animal husbandry, accurate and rapid recognition of animal posture and abnormal behavior has become necessary for effective disease prevention in large-scale breeding. The success of human pose estimation can be attributed to large-scale datasets and complex deep learning models. However, compared with human pose estimation, only a few studies have addressed animal pose estimation. In this study, a key point detection model for animal skeletons was proposed that uses a Transformer encoder and scale fusion to improve accuracy and robustness. First, an improved Transformer encoder was introduced into the feature extraction layer of the HRNet network to capture the spatial constraint relationships between key points, which gave better detection performance on the small-scale sheep dataset. In the Transformer encoder, a sine position embedding module was introduced to make better use of spatial position relationships, and the Hardswish activation function was used to speed up convergence during training. Second, a multi-scale information fusion module was introduced to improve the model's ability to learn features of different dimensions, so that the model can be applied to more practical scenarios. A distribution-aware coordinate representation strategy was adopted to reduce the quantization error that arises when converting between coordinates and small-scale heat maps during encoding and decoding, and the mean square error was used as the loss function. Furthermore, a key point dataset of sheep skeletons was collected and annotated to verify the effectiveness and generalization of the model, and the Siberian tiger dataset ATRW was added to form the training set.
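The sine position embedding and the Hardswish activation named above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code assumes the common 2D sinusoidal scheme (half of the channels encode the row index, half the column index) and the standard Hardswish definition x * ReLU6(x + 3) / 6. The function names and the `dim` and `temperature` parameters are illustrative choices, not the authors' implementation.

```python
import math


def hardswish(x: float) -> float:
    """Standard Hardswish: x * ReLU6(x + 3) / 6."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


def sine_position_embedding(height, width, dim, temperature=10000.0):
    """Return a height x width grid of dim-length position vectors.

    Assumed layout: the first dim/2 channels encode the row index y and
    the last dim/2 channels encode the column index x, each as
    interleaved sin/cos pairs at geometrically spaced frequencies.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    # One wavelength scale per sin/cos pair.
    scales = [temperature ** (2 * i / half) for i in range(half // 2)]

    def encode(pos):
        vec = []
        for s in scales:
            vec.append(math.sin(pos / s))
            vec.append(math.cos(pos / s))
        return vec

    return [[encode(y) + encode(x) for x in range(width)]
            for y in range(height)]


if __name__ == "__main__":
    pe = sine_position_embedding(height=3, width=4, dim=8)
    print(len(pe), len(pe[0]), len(pe[0][0]))
    print([round(hardswish(v), 4) for v in (-4.0, -1.0, 0.0, 1.0, 4.0)])
```

In a real network these vectors would be added to the flattened feature map before it enters the encoder, so that self-attention can distinguish spatial locations; a tensor library would normally replace the nested lists used here.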
The experimental results showed that the model achieved accuracies of 77.1% and 89.7% on the sheep and Siberian tiger key point datasets, respectively, outperforming the comparison models with a smaller amount of computation. The detection time for a single image was 14 ms, which meets the demand for real-time detection. Cross-domain tests on datasets of other animals, such as cattle and horses, also detected skeletal key points well, and the interpretability of the Transformer encoder was analyzed. The network captured global constraint relationships at a higher resolution while retaining fine-grained local feature information. Overall, the model outperformed the comparison models, achieving better accuracy with fewer parameters and less computation. Consequently, the improved model performed better on small-scale datasets and low-resolution inputs, which makes it particularly suitable for practical applications. Experiments on a variety of animals demonstrated the cross-domain and generalization ability of the model. The findings can provide effective technical support for accurately detecting the key points of animal skeletons in animal behavior analysis for intelligent animal husbandry.
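The heat-map encoding, mean-square-error loss, and sub-pixel decoding described in the abstract can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes an unnormalized Gaussian heat map and refines the integer argmax with a second-order Taylor expansion of the log heat map, which is the core idea behind distribution-aware decoding. The smoothing step that such methods usually apply to predicted heat maps is omitted, and all function names and the `sigma` default are hypothetical.

```python
import math


def encode_heatmap(cx, cy, width, height, sigma=2.0):
    """Render a Gaussian heat map centred on a (possibly fractional) key point."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def mse_loss(pred, target):
    """Mean squared error over all heat-map pixels."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for p_row, t_row in zip(pred, target)
               for p, t in zip(p_row, t_row)) / n


def decode_heatmap(heatmap):
    """Return a sub-pixel (x, y) estimate of the heat-map peak."""
    height, width = len(heatmap), len(heatmap[0])
    # Integer location of the maximum response.
    my, mx = max(((y, x) for y in range(height) for x in range(width)),
                 key=lambda p: heatmap[p[0]][p[1]])
    # Finite differences need a full 3x3 neighbourhood.
    if not (0 < mx < width - 1 and 0 < my < height - 1):
        return float(mx), float(my)

    def log_h(x, y):
        return math.log(max(heatmap[y][x], 1e-10))

    # Gradient and Hessian of the log heat map at the argmax.
    gx = (log_h(mx + 1, my) - log_h(mx - 1, my)) / 2.0
    gy = (log_h(mx, my + 1) - log_h(mx, my - 1)) / 2.0
    dxx = log_h(mx + 1, my) - 2.0 * log_h(mx, my) + log_h(mx - 1, my)
    dyy = log_h(mx, my + 1) - 2.0 * log_h(mx, my) + log_h(mx, my - 1)
    dxy = (log_h(mx + 1, my + 1) - log_h(mx + 1, my - 1)
           - log_h(mx - 1, my + 1) + log_h(mx - 1, my - 1)) / 4.0
    det = dxx * dyy - dxy * dxy
    if abs(det) < 1e-12:
        return float(mx), float(my)
    # Newton step: offset = -H^{-1} g.
    ox = -(dyy * gx - dxy * gy) / det
    oy = -(dxx * gy - dxy * gx) / det
    return mx + ox, my + oy


if __name__ == "__main__":
    target = encode_heatmap(5.3, 4.6, width=16, height=12)
    print(decode_heatmap(target))
    print(mse_loss(target, target))
```

Taking only the integer argmax would quantize the key point to the heat-map grid, and that error grows when the heat map is much smaller than the input image. The Taylor step uses the values around the peak to estimate where the underlying Gaussian is actually centred.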

Construction of the animal skeletons keypoint detection model based on transformer and scale fusion