Ranking protein-protein models with large language models and graph neural networks
Abstract
The article discusses the use of large language models and graph neural networks to rank protein-protein interaction (PPI) models. It introduces DeepRank-GNN-esm, a deep learning tool that can analyze PPI interfaces and predict the quality of PPI models in terms of the fraction of native intermolecular contacts (fnat). The article provides detailed instructions on how to use DeepRank-GNN-esm, including installation, data preprocessing, running the computations, and analyzing the results. It also covers how to customize the deep learning architecture to meet individual research needs.
Q&A
[01] Installation and Setup
1. What are the software requirements for using DeepRank-GNN-esm?
- ESMFold software package for protein language model embedding calculation
- DeepRank-GNN-esm software package for fnat computation
- PDB-tools for preprocessing input PDB files
- Conda for environment management
- Jupyter Notebook (optional) for interactive computing
2. What are the hardware requirements for using DeepRank-GNN-esm?
- A Linux computer with multiple CPUs and/or a GPU is preferable. The tool was tested on CentOS 7, RockyLinux, Ubuntu 20.04 LTS, and openSUSE Tumbleweed with CUDA versions 11.6, 12.5, 12.4, and 12.3, respectively.
[02] Data Preparation
1. What preprocessing step is required for the input PDB files?
- Renumbering the residue indexes for every chain in the PDB file is required to ensure correct matching between the calculated ESM-2 embeddings and their corresponding residue nodes in the graph.
2. How can the PDB files be renumbered using the provided script?
- The pdb_renumber.py script can be used to renumber all the chains in the protein starting from residue '1' and generate a new PDB file in the user-defined output directory.
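To make the renumbering step concrete, the following is a minimal sketch of per-chain renumbering based on the fixed-column PDB format. It is not the pdb_renumber.py script itself: the function names are hypothetical, only ATOM and HETATM records are handled, and insertion codes are blanked after renumbering.

```python
from collections import defaultdict

# Only coordinate records are renumbered in this sketch.
COORDINATE_RECORDS = ("ATOM", "HETATM")


def renumber_pdb_lines(lines, start=1):
    """Renumber the residues of every chain so that each chain begins at `start`."""
    next_number = defaultdict(lambda: start)  # next free residue number per chain
    assigned = {}  # (chain, original resSeq + insertion code) -> new number
    renumbered = []
    for line in lines:
        if line.startswith(COORDINATE_RECORDS) and len(line) >= 27:
            chain = line[21]             # column 22: chain identifier
            key = (chain, line[22:27])   # columns 23-27: resSeq + insertion code
            if key not in assigned:
                assigned[key] = next_number[chain]
                next_number[chain] += 1
            # Right-align the new number in columns 23-26 and blank column 27.
            line = f"{line[:22]}{assigned[key]:>4d} {line[27:]}"
        renumbered.append(line)
    return renumbered


def renumber_pdb_file(source, destination, start=1):
    """Read `source`, renumber every chain, and write the result to `destination`."""
    with open(source) as handle:
        lines = handle.readlines()
    with open(destination, "w") as handle:
        handle.writelines(renumber_pdb_lines(lines, start))
```

Because every chain restarts at 1, the i-th residue node of a chain lines up with the i-th position of that chain's sequence, which is the matching the embeddings rely on.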
[03] Running DeepRank-GNN-esm
1. What are the two execution modes available for DeepRank-GNN-esm?
- Command-line interface tool
- Python deep learning architecture
2. How do the two execution modes differ in terms of performance and use cases?
- The command-line interface tool is recommended for scoring a small number of PDB structures (fewer than 100), as it consolidates all functionalities into one user-friendly command.
- The Python deep learning architecture is more efficient for scoring a larger number of PDB structures, as it can run predictions in parallel and use multiple CPUs for data loading.
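For a handful of models, the command-line mode can be driven from a short Python loop. The sketch below rests on assumptions not stated in this summary: the executable name deeprank-gnn-esm-predict and its argument order (PDB file, then the two chain IDs) should be checked against the tool's own help output, and the helper names are hypothetical.

```python
import shutil
import subprocess

# Assumption: name of the command-line tool; verify against your installation.
DEFAULT_EXECUTABLE = "deeprank-gnn-esm-predict"


def build_command(pdb_file, chain_a, chain_b, executable=DEFAULT_EXECUTABLE):
    """Build the argument list for scoring one model (assumed argument order)."""
    return [executable, str(pdb_file), chain_a, chain_b]


def score_models(pdb_files, chain_a, chain_b, executable=DEFAULT_EXECUTABLE):
    """Score each model in turn; suited to small batches only."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} was not found on PATH")
    for pdb_file in pdb_files:
        # check=True stops the loop as soon as one model fails to score.
        subprocess.run(build_command(pdb_file, chain_a, chain_b, executable), check=True)
```

Each call starts a new process for a single model, which is one reason this route fits small batches better than large ones.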
[04] Analyzing the Results
1. What information is contained in the output folder/files from the command-line interface tool?
- The output folder contains the preprocessed input protein, the extracted protein sequences, the calculated ESM-2 embeddings, the generated interface graph, and the prediction output files with the predicted fnat values.
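As a reminder of what the predicted fnat value estimates, the sketch below computes fnat directly from two sets of interface contacts. This is only an illustration of the quantity: DeepRank-GNN-esm predicts fnat without access to the native complex. The contact tuples and the function name are made up for the example.

```python
def fnat(model_contacts, native_contacts):
    """Fraction of native intermolecular contacts that the model reproduces."""
    native = set(native_contacts)
    if not native:
        raise ValueError("the native complex has no intermolecular contacts")
    return len(native & set(model_contacts)) / len(native)


# Each contact is (chain 1, residue number, chain 2, residue number).
native = {("A", 10, "B", 55), ("A", 11, "B", 55), ("A", 12, "B", 60), ("A", 40, "B", 61)}
model = {("A", 10, "B", 55), ("A", 12, "B", 60), ("A", 90, "B", 61)}

score = fnat(model, native)  # 2 of the 4 native contacts are reproduced
print(score)  # 0.5
```

A value near 1 means the model recovers almost all native interface contacts, while a value near 0 means it recovers almost none.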
2. How can the node features and edge features of the generated interface graph be accessed?
- The node features and edge features can be accessed using Python code that reads the HDF5 format of the generated interface graph.
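One way to explore such a file is to walk its group hierarchy and list every dataset with its shape and type. The sketch below uses h5py and makes no assumption about group or feature names inside the graph file. The function names and the example file name are hypothetical.

```python
def describe(node, path="/"):
    """Return (path, shape, dtype) for every dataset found below `node`."""
    rows = []
    for name, child in node.items():
        child_path = f"{path.rstrip('/')}/{name}"
        if hasattr(child, "items"):
            # Group-like object: descend into it.
            rows.extend(describe(child, child_path))
        else:
            # Dataset-like object: record its shape and element type.
            shape = getattr(child, "shape", None)
            dtype = str(getattr(child, "dtype", ""))
            rows.append((child_path, shape, dtype))
    return rows


def describe_file(hdf5_path):
    """Open an HDF5 file read-only and list all datasets it contains."""
    import h5py  # third-party package, imported here so describe() works without it

    with h5py.File(hdf5_path, "r") as handle:
        return describe(handle)
```

Calling describe_file("graph.hdf5") on a generated graph file (the file name here is a placeholder) and printing the rows shows where the node and edge feature arrays are stored, after which a single dataset can be read with handle[path][()].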
[05] Customizing the Deep Learning Network
1. What are the key steps to retrain DeepRank-GNN-esm for customized tasks?
- Construct and format training and evaluation data sets
- Define the training tasks (regression or classification)
- Define the learning parameters (batch size, learning rate)
- Load the network
- Retrain the network
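The choices made in the first three steps can be gathered in one validated configuration object before any training code runs. The sketch below is not part of the DeepRank-GNN-esm API: the class, its field names, and the default values are hypothetical and only show how the data sets, the task type, and the learning parameters fit together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainConfig:
    """Hypothetical container for the choices made before retraining."""

    train_database: str            # path to the formatted training data set
    eval_database: str             # path to the formatted evaluation data set
    target: str = "fnat"           # training target stored with each graph
    task: str = "regression"       # "regression" or "classification"
    batch_size: int = 64           # illustrative default
    learning_rate: float = 1e-3    # illustrative default

    def __post_init__(self):
        # Fail early on settings that no training run could use.
        if self.task not in ("regression", "classification"):
            raise ValueError(f"unknown task: {self.task!r}")
        if self.batch_size <= 0:
            raise ValueError("batch_size must be a positive integer")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")


config = RetrainConfig("train.hdf5", "eval.hdf5", task="classification", batch_size=128)
```

Validating the settings up front means a typo in the task name is caught before the network is loaded rather than partway through training.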
2. What aspects of the deep learning architecture can be customized?
- Advanced users can customize the graphs, the neural network layers, and the training targets.