ESM-IF1 is an inverse folding model that predicts amino acid sequences from backbone atom coordinates. It was trained on 12 million AlphaFold2-predicted structures and is built on an autoregressive transformer architecture with invariant geometric layers. On structurally held-out backbones, ESM-IF1 achieves 51% native sequence recovery overall and up to 72% recovery for buried residues. GPU-accelerated inference supports protein design, multi-state conformations, and complex binding interfaces in bioengineering and synthetic biology workflows.
Generate
Generate protein sequences from a provided protein backbone using the ESM-IF1 inverse folding model.
- POST /api/v3/esm-if1/generate/
Generate endpoint for ESM-IF1.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, required) — Configuration parameters:
chain (string, max length: 1, default: “A”) — Chain identifier from the input PDB structure
num_samples (int, range: 1-3, default: 1) — Number of sequences to generate per input structure
temperature (float, range: 0.0-8.0, default: 0.6) — Sampling temperature for sequence generation
multichain_backbone (bool, default: False) — Indicates if input backbone includes multiple chains
items (array of objects, min: 1, max: 1) — Input data items:
pdb (string, min length: 1, max length: 100000, required) — Protein backbone structure in PDB format
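The constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch, assuming the limits are exactly those in the schema; `validate_request` and `MAX_PDB_CHARS` are hypothetical names introduced here, not part of the API.

```python
MAX_PDB_CHARS = 100_000  # "max length: 100000" for pdb in the schema above


def validate_request(payload: dict) -> list[str]:
    """Return a list of schema violations; an empty list means none were found."""
    errors = []

    params = payload.get("params")
    if not isinstance(params, dict):
        errors.append("params must be an object")
        params = {}

    # Missing keys fall back to the documented defaults.
    chain = params.get("chain", "A")
    if not isinstance(chain, str) or len(chain) > 1:
        errors.append("chain must be a string of at most 1 character")

    num_samples = params.get("num_samples", 1)
    if isinstance(num_samples, bool) or not isinstance(num_samples, int) or not 1 <= num_samples <= 3:
        errors.append("num_samples must be an int in the range 1-3")

    temperature = params.get("temperature", 0.6)
    if isinstance(temperature, bool) or not isinstance(temperature, (int, float)) or not 0.0 <= temperature <= 8.0:
        errors.append("temperature must be a number in the range 0.0-8.0")

    if not isinstance(params.get("multichain_backbone", False), bool):
        errors.append("multichain_backbone must be a bool")

    items = payload.get("items")
    if not isinstance(items, list) or len(items) != 1:
        errors.append("items must contain exactly 1 object")
    else:
        pdb = items[0].get("pdb") if isinstance(items[0], dict) else None
        if not isinstance(pdb, str) or not 1 <= len(pdb) <= MAX_PDB_CHARS:
            errors.append(f"pdb must be a string of 1 to {MAX_PDB_CHARS} characters")

    return errors
```

The server remains the authority on what is valid; a check like this only catches the obvious cases early and avoids a round trip that would end in a 400 response.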
Example request:
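A minimal sketch of how such a request could be assembled with the Python standard library, using the headers and fields documented above. The host in `BASE_URL` is a placeholder (the page gives only the path), and `build_generate_request` is a hypothetical helper name. The function builds the request object without sending it.

```python
import json
import urllib.request

BASE_URL = "https://YOUR_BIOLM_HOST"  # assumption: replace with the real API host
ENDPOINT = "/api/v3/esm-if1/generate/"


def build_generate_request(pdb_text, api_key, chain="A", num_samples=1,
                           temperature=0.6, multichain_backbone=False,
                           base_url=BASE_URL):
    """Build (but do not send) a POST request for the generate endpoint."""
    payload = {
        "params": {
            "chain": chain,
            "num_samples": num_samples,
            "temperature": temperature,
            "multichain_backbone": multichain_backbone,
        },
        # The schema allows exactly one item per request.
        "items": [{"pdb": pdb_text}],
    }
    return urllib.request.Request(
        url=base_url + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` to perform the call; the response body is then JSON in the shape described under Response.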
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
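One way a client might act on these status codes is sketched below. The retry policy (retrying anything other than 200 or 400, with a linearly growing wait) is an assumption made for illustration, not documented behavior, and `call_with_retry` is a hypothetical helper. The `send` argument is any callable that performs the request and returns a `(status_code, body)` pair.

```python
import time


def call_with_retry(send, max_attempts=3, wait_seconds=1.0):
    """Call send() until it reports 200 or the attempts run out."""
    status = None
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status == 200:
            return body
        if status == 400:
            # Invalid input will not succeed on retry; surface it immediately.
            raise ValueError(f"invalid input: {body}")
        if attempt < max_attempts:
            # Assumed policy: wait a little longer before each new attempt.
            time.sleep(wait_seconds * attempt)
    raise RuntimeError(f"request failed after {max_attempts} attempts (last status {status})")
```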
Response
results (array of objects) — One result per input item, in the order requested:
(array of objects, length: num_samples) — Generated sequence samples for the input backbone structure:
sequence (string) — Generated amino acid sequence (single-letter amino acid codes)
recovery (float, range: 0.0 - 1.0) — Fraction of residues matching the native sequence (sequence recovery accuracy)
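The recovery value is described as the fraction of residues matching the native sequence. A minimal sketch of that calculation follows, assuming the designed and native sequences have equal length and are compared position by position; how the service treats gaps or missing residues is not stated here, and `sequence_recovery` is a hypothetical name.

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue equals the native one."""
    if not native:
        raise ValueError("native sequence must not be empty")
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)
```

For example, `sequence_recovery("MKTA", "MKSA")` gives 0.75, since three of the four positions agree.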
Example response:
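A minimal sketch of reading a response, assuming the nesting described above: `results` holds one entry per input item, and each entry is a list of `num_samples` objects with `sequence` and `recovery`. The JSON below is a placeholder with made-up values that only illustrates the shape; it is not model output. `best_sample` is a hypothetical helper.

```python
import json

# Placeholder response in the documented shape; sequences and scores are invented.
EXAMPLE_RESPONSE = """
{"results": [[
  {"sequence": "MKTAYIAKQR", "recovery": 0.4},
  {"sequence": "MRSAYLAKQK", "recovery": 0.5}
]]}
"""


def best_sample(response_text: str) -> dict:
    """Return the sample with the highest recovery for the single input item."""
    data = json.loads(response_text)
    samples = data["results"][0]  # one input item per request
    return max(samples, key=lambda sample: sample["recovery"])
```

Ranking by recovery is only one possible choice; for design work a lower-recovery sample may be just as useful, so downstream filtering usually matters more than this single number.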
Performance
ESM-IF1 is designed for inverse folding tasks, focusing on predicting protein sequences from given backbone structures.
Inference runs in a GPU-accelerated environment on NVIDIA T4 GPUs.
Each request is processed with a batch size of 1, so one backbone structure is handled per call.
The model handles protein backbones of up to 500 amino acids, which covers a wide range of protein design applications.
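On structurally held-out backbones, ESM-IF1 reaches 51% native sequence recovery overall and up to 72% for buried residues. ESMFold is not a meaningful baseline for this metric, because it predicts structure from sequence rather than sequence from structure.
AlphaFold2 likewise addresses the complementary task of structure prediction, so it is not an alternative for inverse folding; a structure predictor can instead be used to check whether designed sequences fold back to the target backbone.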
Training on predicted structures in addition to experimental ones improves the model's ability to generalize across protein design tasks.
Perplexity and sequence recovery both improve when the model is trained on a mix of experimental and predicted data.
Typical completion time is on the order of seconds per request.
The architecture is built specifically for inverse folding and balances speed and accuracy, which makes it a good fit for targeted, fixed-backbone sequence design.
Applications
Designing novel protein sequences given backbone coordinates, enabling rapid exploration of protein variants optimized for stability or function; valuable for biotech companies performing de novo protein engineering or developing protein therapeutics; limitations include lower accuracy in predicting surface-exposed residues due to their inherent flexibility.
Optimizing protein-protein interfaces by predicting sequences that enhance binding affinity, useful for engineering biologics with improved therapeutic efficacy; particularly valuable for companies designing protein-based therapeutics or biosensors; accuracy depends on availability of accurate structural information for the binding partners.
Predicting mutational effects on protein stability and binding affinity through zero-shot scoring of sequence variants, enabling rapid screening and prioritization of candidate mutations; beneficial for protein engineering teams developing stable protein scaffolds or affinity-enhanced binding domains; performance may degrade for mutations involving large structural rearrangements or insertions.
Multi-state protein sequence design by conditioning on multiple backbone conformations, enabling design of proteins that adopt desired conformations under different conditions; useful for developing protein switches or biosensors that respond to environmental changes; less effective for highly flexible or disordered regions lacking clear structural conformations.
Infilling partially masked protein backbone structures to generate plausible sequences for missing regions, enabling completion of partial protein structures from experimental data; valuable for companies performing structural biology studies or protein fragment optimization; limited accuracy for longer masked spans (greater than 30 residues) or highly flexible loop regions.
Limitations
Batch Size: The maximum number of input items per request is limited to 1.
Maximum Input Length: Input PDB structures must not exceed max_pdb_str_len characters; longer structures must be truncated or split.
Chain Selection: The algorithm predicts sequences for a single chain specified by the chain parameter (default "A"); multichain predictions require setting multichain_backbone to True but may yield reduced accuracy.
Sequence Length: The algorithm is trained and optimized for protein backbones up to 500 amino acids; longer sequences may result in degraded performance and increased inference time.
The model is trained on AlphaFold2-predicted structures and may produce suboptimal results for backbones significantly different from those in the training data, such as highly flexible regions, disordered proteins, or structures with novel folds.
ESM-IF1 is designed for fixed-backbone inverse folding tasks; it does not perform well for tasks requiring backbone flexibility, side-chain packing optimization, or predicting the effects of large insertions and deletions.
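The length limits above can be checked before a structure is submitted. The sketch below assumes standard fixed-column PDB `ATOM` records, a character limit of 100000 (the maximum `pdb` length in the request schema), and a 500-residue guideline. Dropping side-chain atoms is one possible way to shrink an oversized file, shown here as an assumption rather than a documented recommendation; all function and constant names are hypothetical.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}
MAX_PDB_CHARS = 100_000  # maximum pdb string length from the request schema
MAX_RESIDUES = 500       # backbone length the model is optimized for


def backbone_only(pdb_text: str, chain: str = "A") -> str:
    """Keep only backbone ATOM records of one chain (other chains are dropped)."""
    kept = []
    for line in pdb_text.splitlines():
        # Fixed columns: atom name is 13-16, chain ID is column 22 (1-indexed).
        if line.startswith("ATOM") and len(line) > 21:
            if line[21] == chain and line[12:16].strip() in BACKBONE_ATOMS:
                kept.append(line)
    return "\n".join(kept + ["END"])


def count_residues(pdb_text: str, chain: str = "A") -> int:
    """Count residues in a chain by their distinct C-alpha atoms."""
    residues = set()
    for line in pdb_text.splitlines():
        if (line.startswith("ATOM") and len(line) >= 27
                and line[21] == chain and line[12:16].strip() == "CA"):
            residues.add(line[22:27])  # residue number plus insertion code
    return len(residues)


def check_limits(pdb_text: str, chain: str = "A") -> list[str]:
    """Return human-readable warnings; an empty list means no limit is exceeded."""
    problems = []
    if len(pdb_text) > MAX_PDB_CHARS:
        problems.append(f"PDB text has {len(pdb_text)} characters (limit {MAX_PDB_CHARS})")
    n_residues = count_residues(pdb_text, chain)
    if n_residues == 0:
        problems.append(f"no C-alpha atoms found for chain {chain!r}")
    elif n_residues > MAX_RESIDUES:
        problems.append(f"chain {chain!r} has {n_residues} residues (guideline {MAX_RESIDUES})")
    return problems
```

Because `backbone_only` keeps a single chain, it is not suitable as written when `multichain_backbone` is set to True; in that case the chain filter would need to be relaxed.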
How We Use It
ESM-IF1 enables BioLM users to perform rapid, structure-guided protein sequence optimization and design by predicting amino acid sequences from backbone coordinates. Combined with other predictive models (e.g., masked language models, embedding-based filtering, and 3D structure-derived metrics), ESM-IF1 supports iterative rounds of protein engineering, antibody maturation, and enzyme optimization. Its standardized API lets users fold this step into existing engineering workflows and turn structure-to-sequence predictions into concrete design decisions.
Accelerates iterative protein optimization by integrating predictive and generative modeling pipelines.
Facilitates informed selection of candidate sequences for synthesis based on predicted stability, binding affinity, and other biophysical properties.
References
Hsu, C., Verkuil, R., Liu, J., Lin, Z., Hie, B., Sercu, T., Lerer, A., & Rives, A. (2022). Learning inverse folding from millions of predicted structures. bioRxiv.
