Mushroom category recognition model based on multimodal bilinear pooling mechanism

    • Abstract: Mushrooms are rich in nutritional value, but the large number of mushroom species and the visual similarity of some species against complex backgrounds make them difficult to identify. To address this problem, this study built MCB-Net (multimodal compact bilinear network), a bilinear pooling model based on multimodal information fusion, which integrates multi-source information for more accurate mushroom category recognition. First, a multimodal compact bilinear pooling algorithm was adopted to establish an image-text cross-modal feature fusion mechanism, focusing attention on key regions through semantic enhancement. Second, a multi-scale feature pyramid module was constructed, using a cross-level feature fusion strategy to strengthen multi-scale representation and improve recognition accuracy for small targets. Third, a global second-order pooling module was introduced to strengthen the discrimination of visually similar mushrooms through higher-order statistical feature extraction. Finally, an adaptive dynamic loss function was introduced to dynamically optimize the weights of the image and text modalities based on gradient-conflict analysis, alleviating modality competition. On a 59-class mushroom dataset with real, complex field backgrounds, the precision, recall, F1-score, and accuracy of MCB-Net were 98.60%, 98.53%, 98.56%, and 98.60%, respectively, which were 6.70, 7.61, 7.62, and 6.67 percentage points higher than those of the best single-modal convolutional neural network model (MobileViT), and 5.39, 5.59, 5.65, and 5.66 percentage points higher than those of the best multimodal model (MobileViT_BERT), showing that the model performs better in complex field scenes. This study can provide theoretical support for mushroom category recognition in complex scenes.
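To make the fusion step concrete, the following is a minimal sketch of the standard Count Sketch + FFT formulation of compact bilinear pooling that the MCBP module is described as using. It is not the authors' code: the class name, input feature dimensions, and the 8000-dimensional output size are illustrative assumptions.

```python
import torch

class CompactBilinearPooling(torch.nn.Module):
    """Count Sketch + FFT approximation of bilinear (outer-product) pooling."""

    def __init__(self, dim_img, dim_txt, dim_out=8000):
        super().__init__()
        self.dim_out = dim_out
        # Fixed random hash h and sign s define one Count Sketch per modality.
        for name, d in (("img", dim_img), ("txt", dim_txt)):
            self.register_buffer(f"h_{name}", torch.randint(dim_out, (d,)))
            self.register_buffer(f"s_{name}", torch.randint(2, (d,)).float() * 2 - 1)

    def _sketch(self, x, h, s):
        # Scatter the signed features into the dim_out-dimensional sketch.
        out = x.new_zeros(x.size(0), self.dim_out)
        return out.index_add_(1, h, x * s)

    def forward(self, img, txt):  # img: (B, dim_img), txt: (B, dim_txt)
        # Circular convolution of the two sketches, computed in the
        # frequency domain, approximates the flattened outer product.
        fx = torch.fft.rfft(self._sketch(img, self.h_img, self.s_img), dim=1)
        fy = torch.fft.rfft(self._sketch(txt, self.h_txt, self.s_txt), dim=1)
        z = torch.fft.irfft(fx * fy, n=self.dim_out, dim=1)
        # Signed square root and L2 normalization stabilize the fused feature.
        z = torch.sign(z) * torch.sqrt(z.abs() + 1e-10)
        return torch.nn.functional.normalize(z, dim=1)
```

The FFT turns the expensive circular convolution of the two sketches into an element-wise product, which is what makes the full bilinear image-text interaction tractable.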

       

      Abstract: Mushrooms have drawn increasing attention for their rich nutritional, medicinal, and economic value. However, identification of mushroom species is hindered by substantial differences in growth habits, intra-class variation across developmental stages, and inter-class morphological similarity among certain species. Conventional identification often relies on expert visual inspection and is therefore prone to human error and oversight. Deep learning-based approaches have emerged as an advanced artificial-intelligence tool for mushroom classification, but most existing models are single-modal image recognition algorithms that cover only a narrow range of species with limited accuracy, and single-modal data often fail to fully characterize the target features. Multimodal learning, by contrast, can extract complementary information from diverse data sources to achieve more robust feature representations. In this study, MCB-Net (multimodal compact bilinear network), a mushroom species identification model built on a multimodal compact bilinear pooling mechanism, was proposed. The MO106 dataset was reorganized to obtain 14 836 images of 59 common mushroom species, which were divided into training, validation, and test sets at an 8:1:1 ratio. A multimodal compact bilinear pooling (MCBP) module was integrated to exploit complementary semantic information: the image and text representations were projected by Count Sketch random projection, which approximates the full bilinear interaction at a much lower dimensionality; the fused features were then computed with Fast Fourier Transform (FFT)-based convolution and inverse FFT operations, followed by normalization to strengthen the representations. In multimodal frameworks there are inherent disparities between the image modality (focused on local texture) and the text modality (dependent on semantic description), which lead to imbalanced convergence rates and contribution biases during training. Conventional static loss weighting suffers from three critical limitations: 1) modal dominance fixation, in which image features disproportionately guide early-stage optimization and suppress text-modality learning; 2) noise sensitivity, in which fixed weights amplify annotation errors (e.g., textual misdescriptions); and 3) an inability to adapt modality reliance to the training phase (e.g., prioritizing images for rapid convergence at first and text for decision-boundary refinement later). Consequently, an adaptive dynamic loss (ADL) function was proposed to adjust the modal loss weights dynamically according to the gradient-conflict intensity, with learnable parameters that mitigate inter-modal competition and down-weight high-error modalities to accelerate convergence toward the global optimum. Furthermore, a multi-scale feature pyramid (MSFP) module was constructed to fuse hierarchical features, and a global second-order pooling (GSoP) module was introduced to emphasize discriminative characteristics among visually similar species. In ablation experiments, the baseline model without the MCBP, GSoP, and ADL components achieved an accuracy of 96.36%. Introducing the GSoP module alone improved the accuracy to 96.73%, since focusing on higher-order statistics of similar target features benefits recognition. Introducing the MCBP module alone improved the accuracy to 97.42%: by effectively combining image-modal and text-modal features, it markedly improved the recognition rate under complex backgrounds. When the MCBP, GSoP, and ADL modules were combined, the model reached its highest accuracy of 98.60%. Experimental results demonstrate that MCB-Net achieved 98.60% accuracy on the 59-class dataset, significantly outperforming existing approaches. This work provides an effective technical solution for precise mushroom identification in complex environments, with promising applications in ecological, pharmaceutical, and food-safety monitoring.
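The abstract specifies that the ADL weights follow the gradient-conflict intensity between the modalities but does not give the exact rule, so the sketch below shows one plausible formulation under that assumption: the cosine similarity of the two modality gradients on the shared backbone measures conflict, and the dominant modality is damped when the gradients oppose each other. The function name and weighting rule are hypothetical.

```python
import torch

def adl_weights(loss_img, loss_txt, shared_params):
    """Derive per-modality loss weights from gradient-conflict intensity."""
    g_img = torch.autograd.grad(loss_img, shared_params, retain_graph=True)
    g_txt = torch.autograd.grad(loss_txt, shared_params, retain_graph=True)
    gi = torch.cat([g.reshape(-1) for g in g_img])
    gt = torch.cat([g.reshape(-1) for g in g_txt])
    # Negative cosine similarity means the modalities pull the shared
    # backbone in opposing directions; clamp to a conflict score in [0, 1].
    conflict = torch.clamp(-torch.nn.functional.cosine_similarity(gi, gt, dim=0), min=0.0)
    # Illustrative rule: under conflict, damp the currently dominant
    # (larger-gradient) modality so the weaker one is not suppressed.
    ni, nt = gi.norm(), gt.norm()
    w_img = 1.0 - conflict * ni / (ni + nt + 1e-12)
    w_txt = 1.0 - conflict * nt / (ni + nt + 1e-12)
    return w_img.detach(), w_txt.detach()

# Usage inside a training step:
#   w_i, w_t = adl_weights(loss_img, loss_txt, list(backbone.parameters()))
#   (w_i * loss_img + w_t * loss_txt).backward()
```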
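The MSFP module is described only as cross-level hierarchical feature fusion; a minimal top-down pyramid in the usual feature-pyramid style illustrates the idea. The stage channel widths are assumptions, not values from the paper.

```python
import torch

class MultiScalePyramid(torch.nn.Module):
    """Top-down fusion of three backbone stages (assumed channel widths)."""

    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.laterals = torch.nn.ModuleList(
            torch.nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = torch.nn.ModuleList(
            torch.nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels)

    def forward(self, c3, c4, c5):
        # Coarse, semantically strong maps are upsampled and added to finer
        # ones, so small targets keep both detail and high-level semantics.
        p5 = self.laterals[2](c5)
        p4 = self.laterals[1](c4) + torch.nn.functional.interpolate(p5, scale_factor=2)
        p3 = self.laterals[0](c3) + torch.nn.functional.interpolate(p4, scale_factor=2)
        return [conv(p) for conv, p in zip(self.smooth, (p3, p4, p5))]
```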
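Likewise, the GSoP idea can be sketched as covariance-based channel attention: second-order (covariance) statistics of reduced feature maps reweight the original channels, which is what helps separate visually similar species. This is a simplified variant; the reduction width and class name are assumptions.

```python
import torch

class GSoPAttention(torch.nn.Module):
    """Covariance (second-order) statistics turned into channel attention."""

    def __init__(self, channels, reduced=64):
        super().__init__()
        self.reduce = torch.nn.Conv2d(channels, reduced, kernel_size=1)
        self.fc = torch.nn.Linear(reduced * reduced, channels)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        z = self.reduce(x).flatten(2)                    # (B, r, HW)
        z = z - z.mean(dim=2, keepdim=True)              # center per channel
        cov = torch.bmm(z, z.transpose(1, 2)) / (h * w)  # (B, r, r) covariance
        attn = torch.sigmoid(self.fc(cov.flatten(1)))    # channel weights
        return x * attn.view(b, c, 1, 1)
```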

       
