Zhao Dean, Wu Rendi, Liu Xiaoyang, Zhao Yuyan. Apple positioning based on YOLO deep convolutional neural network for picking robot in complex background[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2019, 35(3): 164-173. DOI: 10.11975/j.issn.1002-6819.2019.03.021

Apple positioning based on YOLO deep convolutional neural network for picking robot in complex background

• Abstract: To improve the working efficiency and environmental adaptability of apple-picking robots, so that they can recognize and locate occluded, overlapping and bagged fruit around the clock under different lighting conditions, this paper proposes an apple localization method based on the YOLOv3 (you only look once) deep convolutional neural network. The method traverses the whole image with a single convolutional neural network (one-stage) and regresses the class and position of the target, achieving direct end-to-end object detection and detecting apples in complex environments while balancing efficiency and accuracy. On the validation set, the trained model achieved an mAP (mean average precision) of 87.71%, a precision of 97%, a recall of 90%, and an IOU (intersection over union) of 83.61%. The actual detection performance of YOLOv3 and Faster RCNN on apples was compared under different fruit numbers, shooting times, growth stages and lighting conditions, and the differences among the 4 algorithms were analyzed with F1 as the evaluation metric. The experimental results show that the F1 of YOLOv3 on dense apples is 4.45 percentage points higher than that of YOLOv2, and in the other environments nearly 5 percentage points higher than Faster RCNN and nearly 10 percentage points higher than HOG+SVM (histogram of oriented gradient + support vector machine). The feasibility of the algorithm was also verified on different hardware: the detection time for one image was 16.69 ms on a GPU and 105.21 ms on a CPU, and the frame rate on actual detection video reached 60 frames/s and 15 frames/s respectively. This research provides a theoretical basis for robots to recognize apples in complex environments quickly, over long periods, and with high efficiency.
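For reference, the precision, recall and IOU figures quoted above follow the standard object-detection definitions. The following minimal Python sketch shows how box IOU and the F1 value are computed; the detection counts below are hypothetical, chosen only so that the formulas roughly reproduce the reported 97% precision and 90% recall, and are not the paper's actual counts:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def prf1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Hypothetical counts: with P = 0.97 and R = 0.90 as in the abstract,
# F1 = 2PR/(P+R) ≈ 0.93.
print(prf1(tp=90, fp=3, fn=10))
```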


Abstract: Automatic recognition of apples is one of the key capabilities of an apple harvest robot, and fast apple recognition improves the efficiency of the picking robot. In a real orchard scene, the recognition conditions are complex: daytime and night-time illumination, overlapping apples, occlusion, bagged fruit, backlighting, reflected light and dense clusters of apples, so a highly robust and fast visual recognition scheme is required. This paper proposes a fast and stable apple recognition scheme based on an improved YOLOv3. A single convolutional neural network (one-stage) traverses the entire image, dividing it into sub-regions of equal size and predicting the class of the target and its bounding box in each sub-region; non-maximum suppression then merges the candidate boxes into the final bounding box of each target, returning its category and position. To improve detection efficiency, a VGG-like network was used to replace the original residual backbone of YOLOv3, compressing the 53-layer network into a 13-layer network and reducing the model size without affecting the detection performance. Taking into account the size of the smallest apple in images of dense apples, the anchor boxes of 3 different sizes were reduced to 2, shrinking the final prediction tensor while ensuring that the smallest anchor could still cover the smallest target. The procedure was as follows. First, the data set was manually annotated: 400 images for the training set and 115 images for the validation set, containing 1 158 apple samples in total. To increase the generalization ability of the model, the data set was augmented by adjusting the hue, saturation and exposure of the images, generating 51 500 images in total; the initial anchor sizes were then computed with K-means. Second, the model was trained, saving a weight file every 100 iterations; the mean average precision (mAP) of each weight file was computed on the validation set in batches, the model with the highest mAP was selected, and a suitable confidence threshold was chosen to balance precision, recall and intersection over union (IOU). The trained model reached an mAP of 87.71%, a precision of 97%, a recall of 90%, and an IOU of 83.61%. Third, the performance of the model under different fruit numbers, illumination angles, fruit growth stages and shooting times was verified on an additional experimental data set of 336 images containing 1 410 apple samples, comparing HOG+SVM, Faster RCNN, YOLOv2 and YOLOv3 with the F1 value as the evaluation index. The experimental results showed that YOLOv3 performed significantly better than YOLOv2 on images of dense apples, and better than Faster RCNN and HOG+SVM in the other environments. Finally, the detection speed of the algorithm was verified in different hardware environments: the detection time for one image was 16.69 ms on a GPU (60 frames/s on live video) and 105.21 ms on a CPU (15 frames/s on live video). Since the fruit is localized only at the beginning of the picking process and does not need to be refreshed frequently while picking, these detection times are sufficient. This research provides a reference for robots to locate apples in complex environments quickly, over long periods, and with high efficiency.
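The abstract states that the initial anchor sizes were computed with K-means over the annotated box dimensions. The paper itself includes no code, but the usual YOLO-style procedure clusters (width, height) pairs under a 1 − IoU distance; a minimal sketch under that assumption follows, with hypothetical sample box sizes:

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes and anchors comparing width/height only
    (both treated as if centered at the same point)."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0] * boxes[:, 1]
    union = union[:, None] + anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=2, iters=100, seed=0):
    """Cluster (w, h) pairs with 1 - IoU as the distance metric."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # nearest anchor = max IoU
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors

# Hypothetical (w, h) pairs in pixels from labelled apple boxes.
boxes = np.array([[34, 36], [52, 50], [28, 30], [60, 58], [40, 42], [25, 27]], float)
print(kmeans_anchors(boxes, k=2))  # 2 anchors, matching the reduced anchor count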

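Similarly, the non-maximum suppression step described in the abstract, which merges overlapping candidate boxes into one final box per target, is commonly implemented as a greedy loop over score-sorted boxes. A minimal sketch of that standard procedure (the 0.45 IoU threshold is an assumption, not a value from the paper):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression on [x1, y1, x2, y2] boxes."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of the remaining boxes with the kept box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou < iou_thresh]  # drop boxes that overlap the kept one
    return keep
```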