王璨, 武新慧, 张燕青, 王文俊. 基于移位窗口Transformer网络的玉米田间场景下杂草识别[J]. 农业工程学报, 2022, 38(15): 133-142. DOI: 10.11975/j.issn.1002-6819.2022.15.014
    Wang Can, Wu Xinhui, Zhang Yanqing, Wang Wenjun. Recognizing weeds in maize fields using shifted window Transformer network[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2022, 38(15): 133-142. DOI: 10.11975/j.issn.1002-6819.2022.15.014

    基于移位窗口Transformer网络的玉米田间场景下杂草识别

    Recognizing weeds in maize fields using shifted window Transformer network

    • Abstract: To address the poor accuracy and real-time performance of crop and weed recognition in real, complex field scenes, the susceptibility to overlap and occlusion, and the difficulty of obtaining pixel-level annotation data in quantity, this study proposes an efficient recognition method based on the shifted window Transformer (Swin Transformer) network, which rapidly segments weeds on the basis of crop semantic segmentation. First, a maize semantic segmentation model was established, introducing a Swin Transformer backbone and adopting the Unified Perceptual Parsing network as its efficient semantic segmentation framework. The Swin Transformer backbone was then modified by adjusting the network parameters, generating four improved models, and the best model structure was determined through a combined accuracy-speed comparison. Finally, based on the segmented maize morphology, an improved image morphological processing combination was established to recognize and segment all weed regions in real time. Test results show that among the four improved models, Swin-Tiny-UN achieved the best accuracy-speed balance, with a mean intersection over union of 94.83%, a mean pixel accuracy of 97.18%, and an inference speed of 18.94 frames/s. For video data simulating practical application, the average correct detection rate was 95.04% and the average detection time per frame was 5.51×10⁻² s. The method effectively achieves real-time, accurate recognition and fine segmentation of maize and weeds, and can provide a theoretical reference for the development of intelligent weeding equipment.

       

      Abstract: Weeds are one of the main factors affecting crop growth at the seedling stage, and timely weeding is a necessary measure to safeguard crop yield. Intelligent field weeding equipment is also a promising component of unmanned farm systems at the current stage of intelligent agriculture, and its development demands effective recognition of crops and weeds. Previous research has focused mainly on object detection and semantic segmentation using deep learning. Object detection still performs poorly when crops and weeds overlap in complex field scenes, because the generated anchor boxes overlap over large areas and the different target regions cannot be further separated. Semantic segmentation, in turn, requires pixel-level annotation for training, and such data samples are not easy to obtain; its weak real-time performance is also unfavorable for practical application. In this study, an improved model based on the shifted window Transformer (Swin Transformer) network was proposed to enhance the accuracy and real-time performance of crop and weed recognition. The procedure was as follows. 1) A semantic segmentation model of maize was established for real, complex field scenes. The backbone of the model was the Swin Transformer architecture, denoted Swin-Base. Instead of full self-attention, the Swin Transformer adopts a shifted-window partitioning configuration that significantly enhances its modeling ability: self-attention is computed locally within non-overlapping windows of the partitioned image patches, while cross-window connections are still allowed. The computational complexity of the backbone is linear in the image size, which raises the inference speed of the model.
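The shifted-window attention described in step 1) can be illustrated with a small sketch. This is not the authors' implementation: the projection matrices, multi-head splitting, and the attention mask normally applied after the cyclic shift are omitted for brevity, and the window size is an arbitrary choice.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping (ws*ws, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def window_attention(windows):
    """Self-attention computed independently inside each window.
    Q = K = V = the raw tokens (learned projections omitted)."""
    scale = windows.shape[-1] ** -0.5
    attn = np.einsum('bnc,bmc->bnm', windows, windows) * scale
    attn = np.exp(attn - attn.max(-1, keepdims=True))   # stable softmax
    attn /= attn.sum(-1, keepdims=True)
    return np.einsum('bnm,bmc->bnc', attn, windows)

def swin_block(x, ws=4, shift=0):
    """One (shifted-)window attention step: optional cyclic shift,
    window partition, local attention, then reverse both."""
    H, W, C = x.shape
    if shift:
        # cyclic shift lets tokens near window borders attend across windows
        x = np.roll(x, (-shift, -shift), axis=(0, 1))
    out = window_attention(window_partition(x, ws))
    out = (out.reshape(H // ws, W // ws, ws, ws, C)
              .transpose(0, 2, 1, 3, 4).reshape(H, W, C))
    if shift:
        out = np.roll(out, (shift, shift), axis=(0, 1))
    return out

feat = np.random.rand(8, 8, 16).astype(np.float32)
y = swin_block(feat, ws=4, shift=0)      # regular window configuration
y_shift = swin_block(y, ws=4, shift=2)   # shifted configuration (ws // 2)
```

Because each output token is a convex combination of tokens in its window, the cost of attention grows linearly with the number of windows, i.e. with image size, rather than quadratically as in global self-attention.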
The hierarchical feature representation constructed by the Swin Transformer supports the model's dense prediction at the pixel level. 2) The Unified Perceptual Parsing Network (UperNet) was used as an efficient semantic segmentation framework. Its feature extractor is a Feature Pyramid Network (FPN) built on the Swin Transformer backbone: the multi-level features produced by the Swin Transformer supply the corresponding pyramid levels of the FPN. A Pyramid Pooling Module (PPM) adds an effective global prior feature expression, and fusing the hierarchical semantic information yields better segmentation performance. The Swin Transformer backbone and the UperNet framework were combined into one model through an encoder-decoder structure, denoted Swin-Base-UN. 3) The structure of the Swin-Base backbone was then modified to raise inference speed: the numbers of hidden-layer channels, attention heads, and Swin Transformer blocks were adjusted to trade the network parameter count and computational cost against accuracy. Four improved models were generated in this way: Swin-Large-UN, Swin-Small-UN, Swin-Tiny-UN, and Swin-Nano-UN, whose model size and computational complexity are about 2, 1/2, 1/4, and 1/8 times those of Swin-Base-UN, respectively. 4) Taking the segmentation of the maize morphological region as the basis, an improved image morphological processing combination was established to recognize and segment all weed regions in real time. Because the maize segmentation result is reused to segment the weeds, weed pixel annotation could be removed from the model's training data, saving a large amount of pixel-level annotation compared with the original semantic segmentation approach. The performance of all models was then compared in training, validation, and testing.
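Step 4) — recovering weed regions by removing the predicted maize regions from an overall vegetation mask and cleaning the result morphologically — might look like the following sketch. The vegetation index (ExG), the threshold, and the particular operator combination are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np
from scipy import ndimage

def excess_green_mask(rgb, thresh=0.1):
    """Vegetation mask via the common ExG index (2G - R - B); threshold is an assumption."""
    rgb = rgb.astype(np.float32) / 255.0
    exg = 2 * rgb[..., 1] - rgb[..., 0] - rgb[..., 2]
    return exg > thresh

def weed_mask(rgb, maize_mask, min_area=20):
    """Weeds = vegetation pixels not covered by the (dilated) maize mask,
    cleaned by morphological opening and small-region removal."""
    veg = excess_green_mask(rgb)
    # dilate the crop mask so leaf edges are not misread as weeds
    crop = ndimage.binary_dilation(maize_mask, iterations=2)
    weeds = veg & ~crop
    weeds = ndimage.binary_opening(weeds)        # remove speckle noise
    labels, n = ndimage.label(weeds)             # drop tiny connected regions
    for i in range(1, n + 1):
        region = labels == i
        if region.sum() < min_area:
            weeds[region] = False
    return weeds
```

Because the semantic segmentation model only has to label the crop, everything green that is not crop falls out as weed for free, which is what removes the need for weed-level pixel annotation.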
Consequently, Swin-Tiny-UN was determined to be the best model, achieving the optimal balance between accuracy and speed. Specifically, its mean Intersection over Union (mIoU) and mean Pixel Accuracy (mPA) on the test set were 94.83% and 97.18%, respectively, which is 3.27 and 4.71 percentage points higher than ResNet-101-UN with a traditional Convolutional Neural Network (CNN) backbone, and its inference speed reached 18.94 frames/s. The best model was thus superior to the traditional one in region segmentation accuracy, pixel recognition accuracy, and inference speed. The image segmentation results show that the improved model can accurately recognize and segment maize and weeds in complex field scenes. For video stream data simulating field operation, the average correct detection rate of the improved model was 95.04%, and the average detection time per frame was 5.51×10⁻² s, indicating high accuracy and real-time performance under practical application conditions. The findings can provide a strong reference for the development of intelligent weeding equipment.
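The reported mIoU and mPA follow the standard definitions over a pixel-wise confusion matrix; a minimal sketch of how such figures are computed:

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Pixel-wise confusion matrix: rows = ground truth, columns = prediction."""
    idx = target.reshape(-1) * num_classes + pred.reshape(-1)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_mpa(pred, target, num_classes):
    """mIoU = mean of per-class TP / (FP + FN + TP); mPA = mean of per-class TP / GT pixels.
    (Assumes every class occurs at least once, so no division by zero.)"""
    cm = confusion_matrix(pred, target, num_classes).astype(np.float64)
    tp = np.diag(cm)
    iou = tp / (cm.sum(0) + cm.sum(1) - tp)   # per-class Intersection over Union
    pa = tp / cm.sum(1)                       # per-class pixel accuracy
    return iou.mean(), pa.mean()

# toy example: 2 classes (background/maize), 4 pixels
pred = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 1])
miou, mpa = miou_mpa(pred, target, 2)
```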

       
