A pesticide molecule generation model based on latent diffusion sampling and fragment heterogeneous graph

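    • Abstract: In recent years, deep generative models have shown great potential in pesticide discovery and de novo molecular design, but existing methods often struggle with fragmented molecular structures, insufficient diversity, and difficulty accounting for target-specific activity. To address these problems, this study proposes a pesticide molecule generation model that fuses latent diffusion sampling with fragment heterogeneous graphs. First, a heterogeneous graph neural network is combined with a variational autoencoder to map the atom-level topology and fragment-level semantics of molecules into a unified latent space. Second, a latent diffusion model is introduced, whose iterative noising-denoising optimization mitigates the mode collapse of generative models. Finally, a prefix-tuning strategy is used for target-directed generation against two typical pesticide targets, insect acetylcholinesterase (AChE) and plant acetolactate synthase (ALS). Experimental results show that, in the AChE-targeted generation task, the validity, novelty, and uniqueness of the generated molecules reached 100.00%, 100.00%, and 98.50%, respectively. In addition, the distributions of physicochemical properties such as the logarithm of the partition coefficient (LogP), topological polar surface area (TPSA), and molecular weight (MW) were highly consistent with those of real pesticide molecules. Molecular docking showed that 62.81% of the generated molecules had a binding affinity below −7.0 kcal/mol with the AChE target protein (PDB: 6XYU) and reproduced the interaction patterns with key amino acid residues (e.g., GLU-485 and TYR-498). The method can efficiently generate candidate pesticide molecules that are structurally sound, have favorable properties, and show potential biological activity, providing a new computational paradigm for overcoming the data-scarcity bottleneck in pesticide research and accelerating the creation of new pesticides.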
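The validity, novelty, and uniqueness figures quoted above can be illustrated with a minimal sketch. It assumes the commonly used definitions (validity = valid / generated, uniqueness = distinct valid / valid, novelty = distinct valid molecules absent from the training set / distinct valid); the paper may define them differently. The function name `generation_metrics` and the `is_valid` callback are hypothetical, and the toy parenthesis check only stands in for a real chemistry parser. Inputs are assumed to be canonicalized SMILES strings.

```python
from typing import Callable, Dict, Iterable


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Compute validity, uniqueness, and novelty under the assumed definitions."""
    generated = list(generated)
    # Keep only molecules accepted by the (hypothetical) validity check.
    valid = [smiles for smiles in generated if is_valid(smiles)]
    # Distinct valid molecules; assumes the strings are already canonical.
    unique = set(valid)
    # Distinct valid molecules that do not appear in the training set.
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_is_valid(smiles: str) -> bool:
    """Toy stand-in for a chemistry parser: balanced parentheses only."""
    return smiles.count("(") == smiles.count(")")


example = generation_metrics(
    generated=["CCO", "CCO", "c1ccccc1", "C(("],
    training=["CCO"],
    is_valid=toy_is_valid,
)
```

In practice the validity check would be delegated to a cheminformatics toolkit, and the SMILES strings would be canonicalized before they are compared.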

       

      Abstract: The discovery of novel pesticide molecular structures is the primary driver for overcoming pest resistance and ensuring sustainable agricultural development. Traditional computer-aided drug design methodologies often rely on restricted chemical libraries and human expertise, which significantly limits the exploration of the extensive chemical space. Although deep generative models have demonstrated considerable potential in de novo design, existing frameworks frequently encounter difficulties such as structural instability, insufficient chemical diversity, and suboptimal alignment with target-specific biological activities. This study aimed to develop a computational paradigm that integrates multi-scale structural representation and latent space optimization to generate pesticide candidates characterized by high chemical rationality and potent bioactivity. A novel molecular generation model was developed by integrating latent diffusion sampling with fragment-based heterogeneous graphs. The methodology involved constructing a Heterogeneous Graph Neural Network (HGNN) coupled with a Variational Autoencoder (VAE) to map atomic-level topology and fragment-level semantics into a unified latent space. Molecular fragmentation was performed using the Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS) algorithm to ensure chemical validity. The encoder used three Graph Convolutional Network (GCN) layers with a hidden dimension of 300 to process atomic, bond, pharmacophore, and reaction features. Subsequently, a Latent Diffusion Model (LDM) employing a one-dimensional U-Net architecture with six residual layers was introduced to implement a "noising-denoising" mechanism through 1,000 training steps, effectively alleviating mode collapse. Finally, a prefix-tuning strategy was integrated into a four-head Transformer-based decoder to guide conditional generation toward specific targets, including insect Acetylcholinesterase (AChE) and plant Acetolactate Synthase (ALS). Systematic evaluations demonstrated the superior performance of the proposed framework. In the AChE-targeted generation task, the generated molecules achieved a validity rate of 100.00%, a novelty rate of 100.00%, and a uniqueness rate of 98.50%. On the benchmark datasets QM9 and ZINC, the model attained New/Sample metrics of 64.2% and 98.5%, respectively, significantly outperforming baselines such as MolGPT and GeoBFN. Ablation studies confirmed that synergistic modeling of atom-level and fragment-level views was essential for capturing fine-grained topology and high-order semantics. The distributions of physicochemical properties, including the Logarithm of Partition Coefficient (LogP), Topological Polar Surface Area (TPSA), and Molecular Weight (MW), showed high consistency with those of real-world pesticides. Molecular docking revealed that 62.81% of the candidates exhibited a binding affinity lower than −7.0 kcal/mol with the AChE protein (PDB: 6XYU). Furthermore, the model reproduced critical interaction patterns with essential residues, such as Glutamic Acid 485 and Tyrosine 498, with hydrogen bond lengths ranging from 2.4 to 3.3 Å. Prefix-tuning required only 8,576 trainable parameters, significantly reducing training time while avoiding overfitting. The proposed model successfully integrated multi-scale representation and latent diffusion to enhance molecular diversity and innovation. The results indicated that the framework effectively captured target-specific structure-activity relationships while maintaining high chemical rationality. This research provides a scalable tool for targeted bioactive molecule design, offering a new paradigm to overcome data scarcity and accelerate the discovery of environmentally friendly agrochemicals.
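The "noising" half of the noising-denoising mechanism described in the abstract can be sketched as a standard DDPM-style forward process applied to a latent vector. This is an illustration only, not the authors' implementation: the linear schedule, the `beta_start` and `beta_end` values, and all function names are assumptions, and the 1,000 steps in the abstract are read here as diffusion timesteps. The learned denoiser (the one-dimensional U-Net) is not shown.

```python
import math
import random
from typing import List


def linear_beta_schedule(
    num_steps: int = 1000, beta_start: float = 1e-4, beta_end: float = 0.02
) -> List[float]:
    """Linearly spaced noise variances (schedule values are assumptions)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """Running product of (1 - beta_t), usually written alpha-bar_t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(
    z0: List[float], t: int, alpha_bars: List[float], rng: random.Random
) -> List[float]:
    """Sample z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t]
    return [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in z0
    ]


betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]  # toy latent vector standing in for a VAE encoding
z_early = add_noise(z0, 0, alpha_bars, rng)    # almost unchanged
z_late = add_noise(z0, 999, alpha_bars, rng)   # close to pure Gaussian noise
```

Under this schedule the signal coefficient shrinks toward zero by the last step, so the denoiser is trained to recover latent structure from nearly pure noise, and generation then runs the process in reverse from a Gaussian sample.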

       
