QIN Lifeng, ZHOU Xinyi, GAO Yannian, et al. Cattle Re-Identification in Natural Scenes Using a CNN-Transformer Hybrid Model[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2026, 42(5): 1-13. DOI: 10.11975/j.issn.1002-6819.202508021

    Cattle Re-Identification in Natural Scenes Using a CNN-Transformer Hybrid Model

    In large-scale dairy farms, reliable individual identification of cows is a fundamental prerequisite for behavior monitoring, precision feeding, and fine-grained health management. Vision-based cow re-identification, which aims to recognize the same individual across different cameras and time periods, has become an important research topic in smart livestock farming. However, in natural barn environments, cross-camera cow re-identification remains challenging due to high inter-individual similarity, large intra-individual variations in posture and viewpoint, frequent occlusions, and complex illumination changes. To address these issues, this paper proposes a cow re-identification algorithm based on a hybrid Convolutional Neural Network–Transformer (CNN–Transformer) model. The proposed method adopts a dual-branch backbone, where the CNN branch focuses on extracting local texture details such as hair, spots, and body edges, while the Transformer branch employs global self-attention to capture holistic body shape and spot distribution patterns, thereby modeling long-range dependencies across the entire cow body. Both branches are trained within a unified re-identification framework using a combination of cross-entropy loss and triplet loss to encourage compact intra-class clustering and large inter-class separation in the embedding space. To enhance the complementarity between global and local representations, a cross-dimension multi-scale feature fusion module is designed and inserted at three semantic levels of the backbone. At shallow and intermediate stages, feature maps from the Transformer and CNN branches are first rescaled to a consistent spatial resolution and aligned along the channel dimension. 
The fusion module then performs multi-scale pooling and cross-channel rearrangement, enabling global semantic cues to guide the selection of informative local textures while suppressing noisy or redundant local patterns caused by cluttered backgrounds or partial occlusions. At the final stage, the module fuses the terminal outputs of both branches, where the Transformer branch aggregates global semantic information and the CNN branch concentrates rich local spatial details, producing a unified feature map that jointly encodes overall body structure, spot patterns, and multi-scale contextual cues. This fused feature map is subsequently fed into a global average pooling and normalization pipeline to obtain a discriminative identity descriptor, which is used for cosine-similarity-based Query–Gallery retrieval in the re-identification setting. In addition, a Token-SE attention module is introduced at the eighth semantic layer of the Transformer branch to further enhance channel-wise selectivity. The module first aggregates token-wise responses to obtain a compact channel descriptor for each feature channel, and then passes it through a bottleneck structure composed of dimensionality reduction and expansion fully connected layers with non-linear activation. A Sigmoid function is applied to normalize the learned channel importance weights, which are then used to rescale the original Transformer feature channels. In this way, channels that are highly correlated with cow-specific appearance cues, such as stable spot patterns and body contours, are emphasized, while channels dominated by illumination changes or background noise are suppressed. Experimental ablation studies show that the Token-SE module effectively strengthens the Transformer branch’s focus on discriminative semantic information and works synergistically with the CNN branch and the cross-dimension multi-scale fusion module. 
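The Token-SE module described above follows the familiar squeeze-and-excitation recipe, applied over Transformer tokens: aggregate token-wise responses into a per-channel descriptor, pass it through a reduce/expand bottleneck, squash with a Sigmoid, and rescale the original channels. The following NumPy sketch illustrates that mechanism only; the function name, the reduction ratio `r`, and the random weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def token_se(tokens, w_reduce, w_expand):
    """Squeeze-and-excitation over Transformer tokens (illustrative sketch).

    tokens: (N, C) array of N token embeddings with C channels.
    w_reduce: (C, C//r) bottleneck reduction weights.
    w_expand: (C//r, C) bottleneck expansion weights.
    """
    # Squeeze: aggregate token-wise responses into one descriptor per channel.
    desc = tokens.mean(axis=0)                          # (C,)
    # Bottleneck: dimensionality reduction -> non-linearity -> expansion.
    hidden = np.maximum(desc @ w_reduce, 0.0)           # (C//r,)
    # Sigmoid normalizes the learned channel-importance weights to (0, 1).
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w_expand)))   # (C,)
    # Excitation: rescale each original Transformer feature channel.
    return tokens * gate

rng = np.random.default_rng(0)
N, C, r = 16, 32, 4                      # token count, channels, reduction ratio (assumed)
tokens = rng.standard_normal((N, C))
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
out = token_se(tokens, w1, w2)
print(out.shape)  # (16, 32)
```

Because the gate lies strictly in (0, 1), the module can only attenuate channels, never amplify them; discriminative channels are kept near their original magnitude while noisy ones are damped.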
The experimental dataset was collected in a real dairy farm calving area using 11 fixed surveillance cameras continuously monitoring 21 cows under natural conditions. The resulting dataset contains 7,371 annotated cow images with diverse viewpoints, postures, and occlusion patterns. Among them, images from 10 cows are used for training and images from the remaining 11 cows are used for testing. During evaluation, the test data are organized according to the standard re-identification protocol into a Query set and a Gallery set, and performance is measured by Rank-1, Rank-5, and Rank-10 accuracy, together with mean Average Precision (mAP). On this dataset, the proposed CNN–Transformer hybrid model achieves Rank-1, Rank-5, Rank-10, and mAP of 86.2%, 93.1%, 95.7%, and 45.1%, respectively, outperforming the baseline Transformer model by 8.6%, 6.0%, 1.7%, and 5.5%. These results demonstrate that introducing the CNN branch, the cross-dimension multi-scale feature fusion module, and the Token-SE attention module significantly improves the joint modeling of global and local features, as well as the robustness to cross-view and cross-camera variations. Furthermore, qualitative analyses are conducted to validate the effectiveness of the proposed method. Attention heatmaps indicate that the improved model increasingly focuses on key regions such as the head, back, and characteristic spot areas. t-SNE visualization of feature embeddings reveals better inter-class separability and intra-class compactness compared with the baseline. Top-10 retrieval examples and pairwise distance heatmaps under different illumination, occlusion, and appearance-similarity conditions show that the proposed model can still correctly retrieve the target individual in challenging scenarios. 
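The evaluation protocol above (cosine-similarity Query–Gallery retrieval scored by Rank-k accuracy and mAP) is standard in re-identification and can be sketched as follows. This is an illustrative NumPy implementation, not the authors' evaluation code; the function name and the toy data are assumptions.

```python
import numpy as np

def rank_k_and_map(query, gallery, q_ids, g_ids, ks=(1, 5, 10)):
    """Rank-k accuracy and mAP for cosine-similarity Query-Gallery retrieval.

    query: (Q, D) and gallery: (G, D) identity descriptors;
    q_ids / g_ids: integer identity labels for each descriptor.
    """
    # L2-normalize so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    order = np.argsort(-(q @ g.T), axis=1)          # best match first
    hits = g_ids[order] == q_ids[:, None]           # (Q, G) correct-match flags
    # Rank-k: fraction of queries with a correct match in the top k.
    rank_k = {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
    # mAP: average precision per query, then mean over all queries.
    aps = []
    for row in hits:
        rel = np.flatnonzero(row)                   # ranks of correct matches
        if rel.size:
            aps.append(np.mean(np.arange(1, rel.size + 1) / (rel + 1)))
    return rank_k, float(np.mean(aps))

# Toy 2-D descriptors: two identities, well separated in the embedding space.
g_ids = np.array([0, 0, 1, 1])
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
q_ids = np.array([0, 1])
query = np.array([[1.0, 0.05], [0.05, 1.0]])
rank_k, mAP = rank_k_and_map(query, gallery, q_ids, g_ids)
print(rank_k[1], mAP)  # 1.0 1.0 -- every query retrieves its identity first
```

Note that mAP penalizes every correct gallery image that is ranked low, which is why it typically sits well below Rank-1 on datasets with many gallery images per identity, as in the 45.1% mAP versus 86.2% Rank-1 reported above.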
Overall, the proposed CNN–Transformer hybrid model and its associated modules provide a promising technical solution for cow identity re-identification in complex natural farm environments and offer a useful reference for the design and deployment of practical intelligent monitoring systems in large-scale dairy farms.
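The training objective mentioned in the abstract combines cross-entropy loss with a triplet loss. A common way to realize the triplet term is batch-hard mining, sketched below in NumPy; the batch-hard strategy, the margin value, and the toy data are assumptions for illustration, since the abstract does not specify these details.

```python
import numpy as np

def batch_hard_triplet(emb, labels, margin=0.3):
    """Batch-hard triplet loss on L2-normalized embeddings (illustrative).

    Each anchor is compared against its farthest positive (same identity)
    and its nearest negative (different identity); the loss penalizes
    anchors whose nearest negative is not at least `margin` farther away,
    encouraging compact intra-class clusters and large inter-class gaps.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Pairwise Euclidean distances between all embeddings in the batch.
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=2)
    same = labels[:, None] == labels[None, :]
    pos = same & ~np.eye(len(labels), dtype=bool)   # exclude the anchor itself
    hardest_pos = np.where(pos, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return float(np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean())

labels = np.array([0, 0, 1, 1])
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss = batch_hard_triplet(emb, labels)
print(loss)  # 0.0 -- the two classes are already separated by more than the margin
```

When an identity's embeddings drift toward another identity's cluster, the hardest-negative distance shrinks and the loss becomes positive, pushing the classes apart again.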