Small Molecule Optimization with Large Language Models
Abstract
The article presents two language models, Chemlactica and Chemma, that have been fine-tuned on a novel corpus of 110M molecules with computed properties. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. The authors introduce a novel optimization algorithm that leverages their language models to optimize molecules for arbitrary properties given limited access to a black box oracle. This approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization, and achieves state-of-the-art performance on multiple molecular optimization benchmarks.
Q&A
[01] Training Corpus
1. What is the training corpus used for the language models?
- The authors constructed a comprehensive SQL database using PubChem dumps, encompassing information on 110M molecules, their similar molecule pairs, experimental properties, and bioassays.
- They computed key molecular properties such as the synthetic accessibility score (SAS), the quantitative estimate of drug-likeness (QED), molecular weight (MW), topological polar surface area (TPSA), the calculated partition coefficient (CLogP), and various structural features.
- The dataset was transformed into a corpus of JSONL files, with each molecule represented as a single JSON object containing its identifiers, computed properties, similarity data, synonyms, and experimental properties.
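A single corpus record might look like the sketch below. The field names and property values are illustrative assumptions, not the paper's exact schema; the point is that each molecule serializes to one self-contained JSON object per JSONL line.

```python
import json

# Hypothetical sketch of one corpus record; field names and values are
# illustrative, not the paper's exact schema.
record = {
    "SMILES": "CC(=O)Oc1ccccc1C(=O)O",  # aspirin
    "CID": 2244,
    "SAS": 1.31,
    "QED": 0.55,
    "WEIGHT": 180.16,
    "TPSA": 63.6,
    "CLOGP": 1.31,
    "similar": [{"SMILES": "CC(=O)Oc1ccccc1C(=O)OC", "similarity": 0.86}],
    "synonyms": ["aspirin", "acetylsalicylic acid"],
}

line = json.dumps(record)   # one molecule -> one JSONL line
parsed = json.loads(line)   # round-trips losslessly
```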
2. How was the training corpus structured to enhance the models' versatility?
- The authors developed a template system using paired tags to delimit each property and data point, randomizing the property order and alternating the position of the primary molecule to enable the models to learn complex relationships between molecular structures and properties.
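The templating idea can be sketched as follows. The tag names and the two-property selection here are assumptions for illustration; the essential mechanics are the paired tags, the randomized property order, and the varying position of the primary molecule.

```python
import random

# Hypothetical tag-based template; tag names are illustrative.
def to_training_text(record, rng):
    props = [("QED", record["QED"]), ("TPSA", record["TPSA"])]
    rng.shuffle(props)  # randomize the order of properties
    pieces = [f"[{name}]{value}[/{name}]" for name, value in props]
    smiles = f"[START_SMILES]{record['SMILES']}[END_SMILES]"
    # alternate the position of the primary molecule within the sample
    pieces.insert(rng.randrange(len(pieces) + 1), smiles)
    return "".join(pieces)

rng = random.Random(0)
text = to_training_text({"SMILES": "CCO", "QED": 0.41, "TPSA": 20.23}, rng)
```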
[02] Model Training and Evaluation
1. How were the language models Chemlactica and Chemma selected and trained?
- The authors chose Galactica and Gemma as the base models for continued pretraining because they demonstrated strong general-purpose performance and domain-specific knowledge.
- Chemlactica (125M and 1.3B parameters) and Chemma (2B parameters) were trained using the Adam optimizer with cross-entropy loss and a causal language modeling objective.
- The training was conducted at Yerevan State University and on the Nebius.ai cloud, leveraging PyTorch's Fully Sharded Data Parallel (FSDP) and Flash Attention for optimization.
2. How were the models evaluated on property prediction and conditional generation tasks?
- For property prediction, the authors prompted the models with the SMILES string of a molecule and the property to predict, then calculated the Root Mean Square Error (RMSE) between the predicted and actual property values.
- For conditional generation, the authors sampled target property values and prompted the models to generate molecules with the specified properties, again calculating the RMSE between the generated and target values.
- The authors also conducted an ablation study on the sampling techniques used during generation, demonstrating the impact of various components like Chain-of-Thought, repetition penalty, and undesired token suppression.
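The RMSE metric used in both evaluations above can be sketched in a few lines; the numeric values below are toy data, not results from the paper.

```python
import math

# Evaluation sketch: RMSE between model-predicted and reference
# property values (toy numbers, not the paper's results).
def rmse(preds, targets):
    assert len(preds) == len(targets) and preds
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
    )

err = rmse([0.52, 0.61, 0.48], [0.50, 0.65, 0.45])
```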
3. How did the models perform on the calibration analysis?
- The authors developed a suite of multiple-choice property prediction questions based on the training data format to assess the calibration of their models.
- Chemlactica and Chemma demonstrated robust calibration, with a near-linear relationship between the assigned probabilities and the correct outcomes across all computed properties.
- This suggests that the perplexity scores generated by the models can serve as reliable confidence indicators for molecular data predictions.
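A calibration check of this kind can be sketched by bucketing the probability the model assigns to its chosen answer and comparing each bucket's midpoint to the empirical accuracy inside it; for a well-calibrated model the two track each other nearly linearly. The toy data below is an assumption for illustration.

```python
# Calibration sketch: bucket assigned probabilities and compare each
# bucket's midpoint to the observed accuracy within it.
def calibration_bins(probs, correct, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(probs, correct):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append(c)
    # (bin midpoint, empirical accuracy) for non-empty bins
    return [((i + 0.5) / n_bins, sum(b) / len(b))
            for i, b in enumerate(bins) if b]

# Toy data: confidences of a well-calibrated model roughly match accuracy.
pairs = calibration_bins([0.95, 0.9, 0.55, 0.5, 0.1], [1, 1, 1, 0, 0])
```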
[03] Molecular Optimization Algorithm
1. What is the key innovation of the proposed molecular optimization algorithm?
- The algorithm maintains a pool of high-performing molecules and iteratively generates new candidates using the language models, which can be viewed as a genetic algorithm where traditional crossover/mutation operations are replaced by language model generation.
- It also incorporates explicit oracle modeling and fine-tuning of the language models to learn the relationship between molecular structure and oracle scores, enabling more targeted generation.
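The pool-based loop can be sketched as below. This is a simplified stand-in, not the authors' implementation: `generate_candidate` abstracts away prompting the language model with molecules drawn from the pool, and the oracle-modeling and fine-tuning steps are omitted.

```python
import heapq
import random

# Simplified sketch of a pool-based optimization loop; generate_candidate
# stands in for prompting the language model with pool molecules.
def optimize(oracle, generate_candidate, budget=200, pool_size=10, seed=0):
    rng = random.Random(seed)
    pool = []    # min-heap of (score, molecule); worst is evicted first
    seen = set()
    for _ in range(budget):
        parents = [m for _, m in rng.sample(pool, min(2, len(pool)))]
        mol = generate_candidate(parents, rng)
        if mol in seen:
            continue         # spend oracle calls only on unique molecules
        seen.add(mol)
        heapq.heappush(pool, (oracle(mol), mol))
        if len(pool) > pool_size:
            heapq.heappop(pool)   # drop the lowest-scoring molecule
    return max(pool)              # (best score, best molecule)

# Toy run: "molecules" are integers, the oracle rewards proximity to 42.
best_score, best_mol = optimize(lambda m: -abs(m - 42),
                                lambda parents, rng: rng.randrange(100))
```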
2. How does the algorithm handle the need for generating thousands of unique molecules?
- The authors implement a dynamic temperature scheduling strategy, where the sampling temperature starts at 1 and linearly increases to 1.5 as the number of oracle evaluations grows, promoting the generation of more diverse molecules over time.
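The linear schedule described above is straightforward to state in code (function and parameter names here are illustrative):

```python
# Dynamic temperature schedule: ramp linearly from 1.0 to 1.5 as the
# oracle-evaluation budget is consumed.
def sampling_temperature(n_evals, budget, t_start=1.0, t_end=1.5):
    frac = min(max(n_evals / budget, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)
```

Higher temperatures flatten the sampling distribution, so later generations drift further from the pool and the search keeps producing unique molecules.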
3. What are the key hyperparameters of the optimization algorithm, and how were they tuned?
- The authors identified and froze the less sensitive hyperparameters, then focused on tuning the more sensitive ones (pool size, number of similar molecules, fine-tuning tolerance level, and fine-tuning peak learning rate) using a grid search.
- The tuning was performed on the perindopril_mpo and zaleplon_mpo tasks from the PMO benchmark, with the best-performing configuration applied across all benchmarks.
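The grid search over the sensitive hyperparameters can be sketched as follows; the candidate values below are illustrative assumptions, not the grids used in the paper.

```python
from itertools import product

# Grid-search sketch over the sensitive hyperparameters named above;
# the candidate values are illustrative, not the paper's grids.
grid = {
    "pool_size": [10, 25, 50],
    "num_similars": [1, 2],
    "tolerance": [3, 5],
    "peak_lr": [1e-4, 1e-5],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
# each config would then be scored on perindopril_mpo and zaleplon_mpo,
# and the best-performing one applied across all benchmarks
```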
[04] Experiments
1. How did the proposed optimization algorithm perform on the Practical Molecular Optimization (PMO) benchmark?
- The authors' approach, powered by the Chemlactica-125M, Chemlactica-1.3B, and Chemma-2B models, outperformed existing methods on the PMO benchmark, achieving an average improvement of 8% over the previous best method.
- The authors also demonstrated the ability to further improve performance by incorporating additional property information into the prompts used during optimization.
2. How did the models perform on the multi-property optimization with docking task?
- On this benchmark, which evaluates the models' ability to generate viable drug candidates, the authors' approach consistently achieved the highest performance for the generative yield metric across all evaluated receptors (DRD2, MK2, and AChE).
- The smaller Chemlactica-125M model demonstrated superior performance in terms of oracle burden, while the larger Chemma-2B model excelled in generative yield, suggesting a trade-off between exploration and exploitation.
3. What was the impact of numerical precision on the optimization performance?
- The authors conducted an ablation study comparing 32-bit floating point precision with bfloat16 precision, finding that lower precision significantly degraded the optimization performance.
- This was attributed to cascading effects: small errors in the logit values produce sub-optimal generated molecules, which re-enter the pool and the fine-tuning data, setting off a self-reinforcing degradation of the optimization process.