Behavior recognition of lactating sows using improved AVSlowFast audio-video fusion model

李泊; 陈天明; 朱佳颖

doi:10.11975/j.issn.1002-6819.202312135

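Summary: Automatic behavior monitoring of lactating sows is important for safeguarding sow health and detecting abnormal states in time. To integrate the information carried by visual and auditory signals when recognizing sow behavior, this study proposes a method for recognizing key behaviors of lactating sows based on multimodal fusion of audio and video features. First, the three-branch AVSlowFast model is adopted as the backbone network. Its slow video pathway, fast video pathway, and audio pathway extract behavior-related features from the visual and auditory modalities, and multi-level lateral connections fuse the audio-visual information in depth. On this basis, a Gaussian context transformer (GCT) channel attention module is introduced at the late stage of feature fusion. Without adding model parameters, it further optimizes the fusion of the high-dimensional multimodal three-dimensional features and improves the accuracy of behavior recognition. Audio and video data of lactating sows were collected in a real farming environment for the experiments. The results show that the improved AVSlowFast audio-video fusion model recognized six key behaviors (eating, nursing, sleeping, fence-hitting, drinking, and daily activities) with an average precision of 94.3% and an average recall of 94.6%. Compared with the single-modal behavior recognition method based on SlowFast, the proposed method significantly improved the average F1 score over the six behaviors by 12.7%, providing an effective approach to multimodal behavior monitoring of livestock and poultry.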
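
The idea of a channel attention step that adds no learnable parameters can be illustrated with a minimal sketch. It follows the general Gaussian context transformer recipe (global average pooling, normalization across channels, a fixed Gaussian excitation) and is not the authors' implementation. The function name `gct_channel_attention`, the flattened list-of-lists feature layout, and the constants `c = 2.0` and `eps = 1e-5` are assumptions made for illustration.

```python
import math


def gct_channel_attention(channels, c=2.0, eps=1e-5):
    """Illustrative parameter-free Gaussian channel attention.

    channels: list of C lists; each inner list holds the flattened
              (time x height x width) activations of one feature channel.
    c:        fixed standard deviation of the Gaussian (assumed value).
    Returns the rescaled channels and the per-channel attention weights.
    """
    # 1) Global average pooling: one scalar descriptor per channel.
    pooled = [sum(ch) / len(ch) for ch in channels]

    # 2) Normalize the descriptors across channels (zero mean, unit variance).
    mean = sum(pooled) / len(pooled)
    var = sum((p - mean) ** 2 for p in pooled) / len(pooled)
    z = [(p - mean) / math.sqrt(var + eps) for p in pooled]

    # 3) Gaussian excitation: channels close to the mean get weights near 1,
    #    channels far from the mean are damped. Nothing here is learned.
    weights = [math.exp(-(v ** 2) / (2 * c ** 2)) for v in z]

    # 4) Rescale every activation of a channel by that channel's weight.
    scaled = [[w * x for x in ch] for w, ch in zip(weights, channels)]
    return scaled, weights


# Hypothetical toy input: three channels with two activations each.
scaled, weights = gct_channel_attention([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```

Because the weights depend only on channel statistics and the fixed constant `c`, this step adds no trainable parameters, which matches the property highlighted in the summary above.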

Abstract: Automatic behavior recognition of lactating sows can greatly benefit pig farming. However, recognition accuracy remains limited for behaviors with similar visual characteristics. In this study, an audio-video fusion model was proposed for classifying the behaviors of lactating sows. A three-branch deep neural network (AVSlowFast) was employed as the backbone. The Gaussian context transformer (GCT) attention mechanism was introduced to optimize the model without increasing the number of parameters. The experiment was conducted at Lihua Pig Farm in Changzhou City, Jiangsu Province, China, from August 1, 2023 to September 10, 2023. Ten Landrace sows were randomly selected as the research objects, with significant differences in their litter environments and farrowing houses. All of the sows were within three days postpartum. A camera and a sound recorder were used to collect video and audio data, respectively, and the dataset was constructed from these recordings. The sow behaviors were then manually labelled into six classes: nursing, eating, drinking, sleeping, fence-hitting, and daily activities. Three behavior recognition models with different input features were compared to verify the benefit of audio-visual fusion. Specifically, MFCC-Vision Transformer was tested with audio features, SlowFast with visual features, and AVSlowFast with audio-visual multimodal features. The results showed that the multimodal model (AVSlowFast) identified the six types of sow behaviors with markedly higher accuracy than the two single-modal models, MFCC-Vision Transformer and SlowFast. Notably, AVSlowFast performed best on behaviors with similar visual features, such as eating, drinking, and fence-hitting.
Nevertheless, the multimodal approach recognized sleeping behavior slightly less accurately than the vision-only model. The reason was that sleeping behavior often lacks distinct audio features, so the added audio information did not help. Attention mechanisms (such as SENet and GCT) were then introduced to improve recognition performance, especially for sleeping behavior. With the improved model, the accuracy of sleeping behavior recognition increased. The attention mechanisms effectively adjusted the weights of the feature channels during iterative training, thus mitigating the interference caused by audio signals. Compared with SENet-AVSlowFast, GCT-AVSlowFast achieved an average precision of 94.3% and an average recall of 94.6%. The average F1-score of behavior recognition was significantly improved by 12.7% compared with the single-modal model (SlowFast). Finally, because GCT-AVSlowFast achieves this performance without additional model parameters, it is suitable for deployment in resource-limited pig farm environments. The findings can also provide an effective approach to implementing multimodal behavior monitoring in livestock and poultry.
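
For readers unfamiliar with the reported metrics, the sketch below shows one common way to obtain class-averaged precision, recall, and F1-score from a confusion matrix. It assumes unweighted (macro) averaging over the behavior classes, which the abstract does not specify. The function name `macro_metrics` and the example counts are hypothetical and are not data from the study.

```python
def macro_metrics(confusion):
    """Macro-averaged precision, recall and F1 from a confusion matrix.

    confusion[i][j] is the number of clips whose true class is i and
    whose predicted class is j.
    """
    n = len(confusion)
    precisions, recalls, f1_scores = [], [], []
    for k in range(n):
        true_positive = confusion[k][k]
        predicted_as_k = sum(confusion[i][k] for i in range(n))  # column sum
        actually_k = sum(confusion[k])                           # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    # Unweighted mean over classes (macro averaging, an assumption here).
    return sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n


# Hypothetical two-class example; the counts are invented for illustration.
avg_precision, avg_recall, avg_f1 = macro_metrics([[8, 2], [1, 9]])
```

With six behavior classes the input would be a 6 x 6 matrix, one row and one column per behavior.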
