Cattle Re-Identification in Natural Scenes Using a CNN-Transformer Hybrid Model

    • Abstract: On large-scale dairy farms, individual identification is a prerequisite for behavior monitoring and refined management, and computer-vision-based cattle identification is a current research focus in smart livestock farming. To improve the accuracy of cross-camera cattle identification in natural scenes, this paper proposes a cattle re-identification algorithm based on a hybrid convolutional neural network (CNN) and Transformer model (CNN-Transformer). The CNN branch extracts local features such as texture details, while the Transformer branch uses a global self-attention mechanism to capture holistic appearance features and model long-range dependencies. A cross-dimension multi-scale feature fusion module fuses features at three corresponding semantic levels of the Transformer and CNN branches, balancing spatial structure with semantic expression and enabling dynamic interaction between global and local features. In addition, a Token-SE attention module is constructed at the eighth layer (semantic level) of the Transformer branch to strengthen channel selectivity and sharpen the model's focus on key semantic features. A dataset of 21 cows in a calving area was collected with 11 cameras, comprising 7,371 images covering different viewpoints, postures, and occlusions. Re-identification experiments show that the proposed model achieves 86.2%, 93.1%, 95.7%, and 45.1% on Rank-1, Rank-5, Rank-10, and mAP, respectively, gains of 8.6%, 6.0%, 1.7%, and 5.5% over the baseline Transformer model. Attention heatmaps, t-SNE feature-embedding visualization, Top-10 retrieval results, and feature-distance heatmaps further verify that the proposed model learns stronger features for cross-camera re-identification in complex environments, and it can serve as a technical reference for cattle re-identification in complex natural scenes.

       

      Abstract: In large-scale dairy farms, reliable individual identification of cows is a fundamental prerequisite for behavior monitoring, precision feeding, and fine-grained health management. Vision-based cow re-identification, which aims to recognize the same individual across different cameras and time periods, has become an important research topic in smart livestock farming. However, in natural barn environments, cross-camera cow re-identification remains challenging due to high inter-individual similarity, large intra-individual variations in posture and viewpoint, frequent occlusions, and complex illumination changes. To address these issues, this paper proposes a cow re-identification algorithm based on a hybrid convolutional neural network and Transformer model (CNN-Transformer).

      The proposed method adopts a dual-branch backbone: the CNN branch extracts local texture details such as hair, spots, and body edges, while the Transformer branch employs global self-attention to capture holistic body shape and spot-distribution patterns, modeling long-range dependencies across the entire cow body. Both branches are trained within a unified re-identification framework using a combination of cross-entropy loss and triplet loss, encouraging compact intra-class clustering and large inter-class separation in the embedding space.

      To enhance the complementarity between global and local representations, a cross-dimension multi-scale feature fusion module is designed and inserted at three semantic levels of the backbone. At the shallow and intermediate stages, feature maps from the Transformer and CNN branches are first rescaled to a consistent spatial resolution and aligned along the channel dimension; the module then performs multi-scale pooling and cross-channel rearrangement, enabling global semantic cues to guide the selection of informative local textures while suppressing noisy or redundant local patterns caused by cluttered backgrounds or partial occlusions. At the final stage, the module fuses the terminal outputs of both branches, where the Transformer branch aggregates global semantic information and the CNN branch contributes rich local spatial detail, producing a unified feature map that jointly encodes overall body structure, spot patterns, and multi-scale contextual cues. This fused feature map is fed into a global average pooling and normalization pipeline to obtain a discriminative identity descriptor, which is used for cosine-similarity-based Query-Gallery retrieval.

      In addition, a Token-SE attention module is introduced at the eighth semantic layer of the Transformer branch to enhance channel-wise selectivity. The module first aggregates token-wise responses into a compact descriptor for each feature channel, then passes the descriptor through a bottleneck of fully connected dimensionality-reduction and expansion layers with a non-linear activation. A sigmoid function normalizes the learned channel-importance weights, which rescale the original Transformer feature channels. In this way, channels highly correlated with cow-specific appearance cues, such as stable spot patterns and body contours, are emphasized, while channels dominated by illumination changes or background noise are suppressed.
      Ablation studies show that the Token-SE module effectively strengthens the Transformer branch's focus on discriminative semantic information and works synergistically with the CNN branch and the cross-dimension multi-scale fusion module.

      The experimental dataset was collected in the calving area of a working dairy farm, where 11 fixed surveillance cameras continuously monitored 21 cows under natural conditions, yielding 7,371 annotated images with diverse viewpoints, postures, and occlusion patterns. Images from 10 cows are used for training and images from the remaining 11 cows for testing. Following the standard re-identification protocol, the test data are organized into a Query set and a Gallery set, and performance is measured by Rank-1, Rank-5, and Rank-10 accuracy and mean Average Precision (mAP). On this dataset, the proposed CNN-Transformer hybrid model achieves Rank-1, Rank-5, Rank-10, and mAP of 86.2%, 93.1%, 95.7%, and 45.1%, respectively, outperforming the baseline Transformer model by 8.6%, 6.0%, 1.7%, and 5.5%. These results demonstrate that the CNN branch, the cross-dimension multi-scale feature fusion module, and the Token-SE attention module jointly improve the modeling of global and local features as well as the robustness to cross-view and cross-camera variations.

      Qualitative analyses further validate the method. Attention heatmaps indicate that the improved model focuses on key regions such as the head, back, and characteristic spot areas; t-SNE visualization of the feature embeddings shows better inter-class separability and intra-class compactness than the baseline; and Top-10 retrieval examples together with pairwise distance heatmaps under different illumination, occlusion, and appearance-similarity conditions show that the model still retrieves the correct individual in challenging scenarios. Overall, the proposed CNN-Transformer hybrid model and its associated modules offer a practical solution for cow re-identification in complex natural farm environments and a useful reference for designing and deploying intelligent monitoring systems in large-scale dairy farms.
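To make the cross-branch fusion step concrete, the following is a minimal PyTorch sketch of aligning one CNN stage with one Transformer stage, assuming patch tokens that can be reshaped into a 2-D grid. It shows only the rescale-align-fuse skeleton described in the abstract; the paper's multi-scale pooling and cross-channel rearrangement are omitted, and all module and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossBranchFusion(nn.Module):
    """Illustrative skeleton: align a CNN feature map with Transformer
    patch tokens at one semantic level, then fuse them."""

    def __init__(self, cnn_dim: int, vit_dim: int, out_dim: int):
        super().__init__()
        # 1x1 convolutions align both branches along the channel dimension
        self.proj_cnn = nn.Conv2d(cnn_dim, out_dim, kernel_size=1)
        self.proj_vit = nn.Conv2d(vit_dim, out_dim, kernel_size=1)
        self.fuse = nn.Conv2d(2 * out_dim, out_dim, kernel_size=1)

    def forward(self, cnn_feat, tokens, grid_hw):
        # cnn_feat: (B, C_cnn, H, W); tokens: (B, N, C_vit) with N == h * w
        h, w = grid_hw
        vit_feat = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
        # rescale the token map to the CNN map's spatial resolution
        vit_feat = F.interpolate(vit_feat, size=cnn_feat.shape[-2:],
                                 mode="bilinear", align_corners=False)
        # concatenate the aligned maps and fuse; in the full module, global
        # cues from the Transformer would gate the local CNN textures
        fused = torch.cat([self.proj_cnn(cnn_feat), self.proj_vit(vit_feat)], dim=1)
        return self.fuse(fused)
```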
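The Token-SE module is described in enough detail to sketch directly: token-wise aggregation into a per-channel descriptor, a fully connected reduce-expand bottleneck with a non-linearity, sigmoid-normalized weights, and channel rescaling. The reduction ratio and the mean-pooling aggregation below are assumptions not fixed by the abstract.

```python
import torch
import torch.nn as nn

class TokenSE(nn.Module):
    """Squeeze-and-Excitation over Transformer tokens (minimal sketch)."""

    def __init__(self, dim: int, reduction: int = 16):  # ratio is assumed
        super().__init__()
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim // reduction),  # dimensionality reduction
            nn.ReLU(inplace=True),             # non-linear activation
            nn.Linear(dim // reduction, dim),  # expansion back to dim
            nn.Sigmoid(),                      # normalize channel weights
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim); average over tokens -> one descriptor per channel
        desc = tokens.mean(dim=1)                     # (B, dim)
        weights = self.bottleneck(desc).unsqueeze(1)  # (B, 1, dim)
        return tokens * weights                       # rescale feature channels
```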
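The abstract states that both branches are trained with a combination of cross-entropy and triplet loss, a standard re-identification objective. A sketch under that assumption follows; the margin value and the equal weighting of the two terms are illustrative choices, not values given by the paper.

```python
import torch.nn as nn

# Standard re-ID objective: identity classification (cross-entropy)
# plus metric learning (triplet). Margin 0.3 is an assumed value.
id_loss = nn.CrossEntropyLoss()
tri_loss = nn.TripletMarginLoss(margin=0.3)

def reid_loss(logits, labels, anchor, positive, negative):
    # logits: (B, num_train_ids) classifier outputs over training identities;
    # anchor/positive/negative: (B, D) embeddings mined within the mini-batch
    return id_loss(logits, labels) + tri_loss(anchor, positive, negative)
```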
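Evaluation follows the standard Query-Gallery protocol with Rank-k accuracy and mAP. The NumPy sketch below shows how these metrics are typically computed from cosine similarities; it is a single-shot simplification, and any per-camera filtering the paper applies is omitted.

```python
import numpy as np

def evaluate(query, q_ids, gallery, g_ids, ks=(1, 5, 10)):
    """Rank-k accuracy and mAP for cosine-similarity retrieval.

    query: (Q, D) and gallery: (G, D) embedding matrices;
    q_ids, g_ids: integer identity labels as NumPy arrays.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    order = np.argsort(-(q @ g.T), axis=1)          # best match first
    matches = g_ids[order] == q_ids[:, None]        # boolean hit matrix
    rank_k = {k: float(matches[:, :k].any(axis=1).mean()) for k in ks}
    aps = []
    for row in matches:                             # Average Precision per query
        hits = np.flatnonzero(row)
        if hits.size:                               # precision at each correct hit
            aps.append(np.mean((np.arange(hits.size) + 1) / (hits + 1)))
    return rank_k, float(np.mean(aps))              # (Rank-k dict, mAP)
```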

       
