Abstract:
Mushrooms have drawn increasing attention owing to their rich nutritional, medicinal, and economic value. However, the identification of mushroom species is hindered by substantial differences in growth habits, intra-class variation across developmental stages, and inter-class morphological similarity among certain species. Conventional identification often relies on expert visual inspection, which is prone to human error and oversight. Deep learning-based approaches have emerged as a promising tool for mushroom classification. However, existing models mostly employ single-modal image recognition algorithms, which yield low accuracy and cover only a narrow range of species. Single-modal data can also fail to fully characterize the target features. In contrast, multimodal learning can extract complementary information from diverse data sources to achieve more robust feature representations. In this study, MCB_Net (multimodal compact bilinear network), a mushroom species identification model based on a multimodal compact bilinear pooling mechanism, was proposed. The MO106 dataset was reconstructed to comprise 14 836 images of 59 common mushroom species, which were divided into training, validation, and test sets at an 8:1:1 ratio. A multimodal compact bilinear pooling (MCBP) module was integrated to fuse the complementary semantic information of the image and text modalities. The image and text representations were projected into a higher-dimensional space via random projection using the Count Sketch method, which approximates the full bilinear (outer-product) feature at a reduced dimensionality. Subsequently, Fast Fourier Transform (FFT)-based convolution and inverse FFT operations were applied to fuse the two sketched representations efficiently, followed by normalization to strengthen the visual representations.
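The Count Sketch and FFT steps described above can be illustrated with a minimal NumPy sketch of compact bilinear pooling. This is an illustration under stated assumptions, not the MCB_Net implementation: the function names, the feature sizes, and the signed square-root plus L2 normalization are hypothetical choices, since the abstract does not specify which normalization was used.

```python
import numpy as np


def make_sketch_params(in_dim, out_dim, rng):
    """Draw the random hash indices and signs that define one Count Sketch."""
    h = rng.integers(0, out_dim, size=in_dim)   # bucket index for each input dimension
    s = rng.choice([-1.0, 1.0], size=in_dim)    # random sign for each input dimension
    return h, s


def count_sketch(x, h, s, out_dim):
    """Project x to out_dim by adding each signed entry x[i] * s[i] into bucket h[i]."""
    sketch = np.zeros(out_dim)
    np.add.at(sketch, h, s * x)                 # unbuffered add handles repeated indices
    return sketch


def fft_circular_conv(a, b):
    """Circular convolution of two equal-length real vectors via FFT and inverse FFT."""
    n = a.shape[0]
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=n)


def compact_bilinear_pool(img_feat, txt_feat, params_img, params_txt, out_dim):
    """Fuse an image vector and a text vector (hypothetical helper).

    The circular convolution of the two sketches equals a Count Sketch of the
    outer product of the inputs, so the outer product is never formed explicitly.
    """
    sk_img = count_sketch(img_feat, *params_img, out_dim)
    sk_txt = count_sketch(txt_feat, *params_txt, out_dim)
    fused = fft_circular_conv(sk_img, sk_txt)
    # Assumed normalization: signed square root followed by L2 normalization.
    fused = np.sign(fused) * np.sqrt(np.abs(fused))
    norm = np.linalg.norm(fused)
    return fused / norm if norm > 0 else fused


# Usage with made-up sizes: a 512-d image feature and a 300-d text feature
# are fused into a 4096-d vector.
rng = np.random.default_rng(0)
out_dim = 4096
img_feat = rng.standard_normal(512)
txt_feat = rng.standard_normal(300)
params_img = make_sketch_params(512, out_dim, rng)
params_txt = make_sketch_params(300, out_dim, rng)
fused_feat = compact_bilinear_pool(img_feat, txt_feat, params_img, params_txt, out_dim)
```

In this example the explicit outer product would have 512 × 300 = 153 600 entries, whereas the fused vector has 4096, which is the sense in which the sketch reduces dimensionality.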
In multimodal frameworks, inherent disparities between the image modality (focused on local texture) and the text modality (dependent on semantic description) lead to imbalanced convergence rates and contribution biases during training. Conventional static loss weighting also suffers from three critical limitations: 1) modal dominance fixation, where image features disproportionately guide early-stage optimization and thereby suppress learning of the text modality; 2) noise sensitivity, where fixed weights amplify annotation errors (e.g., textual misdescriptions); 3) phase-dependent modality reliance, where training requires dynamic adjustments (e.g., prioritizing images for rapid convergence initially, then text for boundary refinement) that fixed weights cannot provide. Consequently, an adaptive dynamic loss (ADL) function was proposed to dynamically adjust the modal loss weights according to the gradient conflict intensity. Learnable parameters were employed to mitigate inter-modal competition and the influence of high-error modalities, thereby accelerating the search for the global optimum. Furthermore, a multi-scale feature pyramid (MSFP) module was used to fuse hierarchical features, and a global second-order pooling (GSoP) module was used to emphasize the discriminative characteristics among visually similar species. In the ablation experiments, the baseline model without MCBP, GSoP, and ADL achieved an accuracy of 96.36%. Introducing the GSoP module alone raised the accuracy to 96.73%, indicating that focusing on the features of similar targets has a positive impact on recognition accuracy. Introducing the MCBP module alone raised the accuracy to 97.42%; the recognition rate under complex backgrounds improved markedly because the image and text features were effectively combined. When the GSoP module was combined with ADL, the model reached its highest accuracy of 98.60%.
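The idea of adjusting modal loss weights according to gradient conflict can be illustrated with a small NumPy sketch. The abstract does not give the ADL formula, so everything here is an assumption made for illustration: the conflict measure (a rescaled cosine similarity between the two modal gradients), the softmax weighting rule, the function names, and the fixed `temperature`, which stands in for the learnable parameters mentioned above.

```python
import numpy as np


def gradient_conflict(g_img, g_txt, eps=1e-12):
    """Hypothetical conflict intensity in [0, 1]: 0 if the gradients align, 1 if opposed."""
    cos = float(g_img @ g_txt) / (np.linalg.norm(g_img) * np.linalg.norm(g_txt) + eps)
    return 0.5 * (1.0 - cos)


def adaptive_weights(loss_img, loss_txt, conflict, temperature=1.0):
    """Hypothetical weighting rule, not the paper's ADL.

    With no conflict both modalities get weight 0.5. As conflict grows,
    the modality with the higher loss is progressively down-weighted.
    """
    losses = np.array([loss_img, loss_txt], dtype=float)
    logits = -conflict * losses / temperature
    w = np.exp(logits - logits.max())           # numerically stable softmax
    return w / w.sum()


def combined_loss(loss_img, loss_txt, g_img, g_txt, temperature=1.0):
    """Weighted sum of the two modal losses under the assumed rule."""
    conflict = gradient_conflict(g_img, g_txt)
    w_img, w_txt = adaptive_weights(loss_img, loss_txt, conflict, temperature)
    return w_img * loss_img + w_txt * loss_txt


# Usage with made-up values: opposed gradients and a noisier text loss
# shift weight toward the image modality.
g_img = np.array([1.0, 0.0, 2.0])
g_txt = -g_img
weights = adaptive_weights(0.4, 1.6, gradient_conflict(g_img, g_txt))
```

In a trained model the temperature, or the logits themselves, would be learned jointly with the network rather than fixed as in this sketch.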
Experimental results demonstrated that MCB_Net achieved 98.60% accuracy on the 59-class dataset, significantly outperforming existing approaches. This work provides an effective technical solution for precise mushroom identification in complex environments, with promising applications in ecological, pharmaceutical, and food safety monitoring.