Establishment of the mining model for sweet molecules in food

任海斌; 冯宝龙; 范蓓; 贺斌彬; 李知陆; 王清华; 高飞; 王玉堂

doi:10.11975/j.issn.1002-6819.2021.19.035

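Summary: The food industry has long been actively searching for new sweet molecules, but traditional discovery methods are time-consuming, laborious, and inefficient. Based on the assumption that a molecule's sweetness is related to its structure, this study used data from the literature, patents, and databases to build a dataset of sweet and non-sweet molecules and a dataset of molecular sweetness values. Random forest and support vector machine algorithms were used to build qualitative structure-activity relationship models that predict whether a molecule is sweet; principal component regression, k-nearest neighbor regression, random forest regression, and partial least squares regression were used to build quantitative structure-activity relationship models that predict the sweetness of sweet molecules. The random forest model gave the best classification, with an area under the receiver operating characteristic curve of 0.987 and an accuracy of 0.966; the random forest regression model gave the best sweetness prediction, with a coefficient of determination of 0.82 and a root mean square error of 0.60. Applying the two models together to a food composition database identified 542 food molecules with potential as sweeteners.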

Abstract: Sweetness is one of the most important tastes for food flavor and quality, and the food industry has been actively searching for sweet molecules that can be used to produce new sweeteners. However, traditional discovery methods are time-consuming, laborious, and inefficient, and cannot keep pace with rapid economic development and market demand. An effective and reliable strategy for discovering sweet molecules is therefore needed. Machine learning combined with structure-activity relationship modeling can be used to predict sweet molecules accurately. In this study, based on the assumption that a molecule's sweetness is related to its structure, a new database of sweeteners, non-sweeteners, and sweetness values was established. MOE software was used to compute molecular descriptors that characterize the properties of the molecules. These descriptors were then filtered through neighborhood variance screening, collinearity removal (dropping highly correlated descriptors), and principal component contribution rate screening. The dataset was split into a training set (80%) for model construction and a test set (20%) for model validation. Random forest and support vector machine algorithms were used to build qualitative structure-activity relationship models for identifying potential sweet molecules. The area under the receiver operating characteristic curve (AUC) and accuracy were used as evaluation metrics; higher values of both indicate better classification, and the best-performing model was selected. Subsequently, principal component regression, k-nearest neighbor regression, random forest regression, and partial least squares regression were used to build quantitative structure-activity relationship models for predicting the sweetness of sweet molecules. The coefficient of determination (R²) and root mean square error (RMSE) were used as evaluation metrics for the quantitative models; a higher R² and a lower RMSE indicate a better model, and the best model was selected by comparing performance. The selected models were then applied to the food composition database (FooDB) to predict possible sweet food ingredients and their sweetness. As a result, a publicly accessible, manually curated, and continuously updated dataset of sweeteners, non-sweeteners, and sweetness values was established. A random forest model was built to identify sweet molecules; its accuracy on the test set was 0.966 and its area under the ROC curve was 0.987, indicating excellent predictive ability. A random forest model was also built to predict sweetness, with an R² of 0.82 and an RMSE of 0.60. Combining the qualitative and quantitative models, 542 potential sweetener molecules, including lycopene, were discovered in the food composition database. All data and code are available at https://gitee.com/wang_lab/EMMSM to support further extension. The new model shows broad applicability and high practical value in the search for new sweet molecules.
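As a rough illustration of the evaluation workflow described in the abstract (an 80/20 split, random forest versus support vector machine scored by AUC and accuracy, and regressors scored by R² and RMSE), the following Python sketch uses scikit-learn on synthetic data. It is not the authors' code. The synthetic features stand in for the MOE descriptors, the function names `compare_classifiers` and `compare_regressors` are hypothetical, all hyperparameters are arbitrary, and only two of the four regression algorithms are shown.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def _positive_scores(model, X):
    """Return a ranking score for the positive class (probability if available)."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    return model.decision_function(X)


def compare_classifiers(X, y, seed=0):
    """Hypothetical helper: 80/20 split, then (AUC, accuracy) per classifier."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        # SVMs are scale-sensitive, so descriptors are standardized first.
        "svm": make_pipeline(StandardScaler(), SVC()),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, _positive_scores(model, X_te))
        acc = accuracy_score(y_te, model.predict(X_te))
        results[name] = (auc, acc)
    return results


def compare_regressors(X, y, seed=0):
    """Hypothetical helper: 80/20 split, then (R^2, RMSE) per regressor."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        rmse = float(np.sqrt(mean_squared_error(y_te, pred)))
        results[name] = (r2_score(y_te, pred), rmse)
    return results


if __name__ == "__main__":
    # Synthetic stand-ins for the sweet/non-sweet and sweetness datasets.
    Xc, yc = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=0)
    for name, (auc, acc) in compare_classifiers(Xc, yc).items():
        print(f"{name}: AUC={auc:.3f}, accuracy={acc:.3f}")

    Xr, yr = make_regression(n_samples=400, n_features=20, noise=5.0, random_state=0)
    for name, (r2, rmse) in compare_regressors(Xr, yr).items():
        print(f"{name}: R2={r2:.3f}, RMSE={rmse:.3f}")
```

In such a workflow the model with the highest AUC and accuracy would be kept as the classifier, and the one with the highest R² and lowest RMSE as the sweetness regressor. The numbers this sketch prints come from synthetic data and say nothing about the results reported above.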

Establishment of the mining model for sweet molecules in food