LigandMPNN is a GPU-accelerated, structure-conditioned protein sequence design model that incorporates explicit atomic context from small molecules, nucleic acids, metals, and optional sidechain coordinates. Given fixed backbone and ligand coordinates in a PDB input, the API generates amino acid sequences in an autoregressive manner, with per-residue probabilities, overall and ligand-contact confidence scores, and optional sidechain repacking. Typical uses include active-site remodeling, small-molecule binder and sensor optimization, and redesign of complexes around bound ligands.
Generate¶
This endpoint gensats for LigandMPNN.
- POST /api/v3/ligand-mpnn/generate/¶
Generate endpoint for LigandMPNN.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, required) — Configuration parameters:
temperature (float, default: 0.1) — Sampling temperature
fixed_residues (array of strings, default: []) — Residue identifiers to keep fixed; each entry formatted as [ChainID][ResidueNumber][OptionalInsertionCode]
redesigned_residues (array of strings, default: []) — Residue identifiers to redesign; each entry formatted as [ChainID][ResidueNumber][OptionalInsertionCode]
bias_AA (object, default: {}) — Global per–amino-acid logit biases; keys are single-letter amino acid codes, values are floats
bias_AA_per_residue (object, default: {}) — Per-residue amino-acid logit biases; keys are residue specs [ChainID][ResidueNumber][OptionalInsertionCode], values are objects mapping single-letter amino acid codes to floats
omit_AA (string, default: “”) — Concatenated single-letter amino acid codes to globally omit
omit_AA_per_residue (object, default: {}) — Per-residue amino acids to omit; keys are residue specs [ChainID][ResidueNumber][OptionalInsertionCode], values are strings of single-letter amino acid codes
symmetry_residues (array of arrays of strings, default: []) — Groups of symmetry-linked residues; each inner array contains residue specs [ChainID][ResidueNumber][OptionalInsertionCode]
symmetry_weights (array of arrays of floats, default: []) — Per-residue symmetry weights; each inner array must match the length of the corresponding entry in symmetry_residues
homo_oligomer (boolean, default: False) — Flag indicating homo-oligomer sequence sharing across chains
chains_to_design (array of strings, default: []) — Chain identifiers to include in sequence design; each entry must match a chain ID present in the PDB
parse_these_chains_only (array of strings, default: []) — Chain identifiers to parse from the PDB; each entry must match a chain ID present in the PDB
parse_atoms_with_zero_occupancy (boolean, default: False) — Whether to include atoms with zero occupancy from the PDB when parsing
number_of_batches (int, range: 1–1, default: 1) — Number of batches of designs to generate per request
batch_size (int, range: 1–2, default: 1) — Number of designs to generate per batch
repack_everything (boolean, default: False) — Whether to repack all residues when side-chain packing is enabled
pack_side_chains (boolean, default: False) — Whether to run side-chain packing and include side-chain-packed outputs
number_of_packs_per_design (int, range: 1–8, default: 1) — Number of independent side-chain packing runs per design
sc_num_samples (int, range: 1–64, default: 16) — Number of side-chain samples per packing run
sc_num_denoising_steps (int, range: 1–10, default: 3) — Number of denoising steps used in side-chain sampling
force_hetatm (boolean, default: False) — Whether to treat all HETATM records as ligand or context atoms for side-chain packing
pack_with_ligand_context (boolean, default: True) — Whether to include ligand atomic context during side-chain packing
fasta_seq_separation (string, default: “:”) — Separator string used between chain sequences when writing FASTA-formatted sequences
file_ending (string, default: “”) — Optional suffix string associated with output file naming
zero_indexed (int, default: 0) — Flag indicating residue index origin in associated filenames
pdb_path (null, fixed) — Unused; must be null
redesigned_residues_multi (null, fixed) — Unused; must be null
fixed_residues_multi (null, fixed) — Unused; must be null
bias_AA_per_residue_multi (null, fixed) — Unused; must be null
omit_AA_per_residue_multi (null, fixed) — Unused; must be null
save_stats (null, fixed) — Unused; must be null
verbose (boolean, default: True) — Verbosity flag for internal logging
ligand_mpnn_use_side_chain_context (null, fixed) — Unused; must be null
ligand_mpnn_use_atom_context (boolean, default: True) — Whether to use ligand atomic context in the design computation
ligand_mpnn_cutoff_for_score (float, default: 8.0) — Distance cutoff in ångströms for including ligand atoms in scoring
items (array of objects, min: 1, max: 1, required) — Input structures:
pdb (string, required, min length: 1, max length: max_pdb_str_len) — PDB-formatted structure text containing ATOM and optional HETATM records
Example request:
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
sequence (string, length ≤ 1,024) — Designed amino-acid sequence
pdb (string) — Designed structure in PDB format
overall_confidence (float) — Mean model confidence score over redesigned residues, range: 0.0–100.0
ligand_confidence (float) — Mean model confidence score over residues within the ligand/context cutoff, range: 0.0–100.0
seq_rec (float, range: 0.0–1.0) — Native sequence recovery fraction over redesigned residues
log_probs (array of arrays of floats, shape: [L, 21]) — Autoregressive per-position log-probabilities over 21 amino-acid types
Inner array (float[21]) — log-probabilities from
log_softmaxover 21 residue types for one designed position
sampling_probs (array of arrays of floats, shape: [L, 21]) — Per-position categorical sampling probabilities over 21 amino-acid types
Inner array (float[21], sum: 1.0) — Softmax probabilities used for temperature-based sampling at one position
pdb_packed (object, optional) — Sidechain-packed structures by design index for side_chain model outputs
<design_id> (string) — Sidechain-packed structure for one design instance in PDB format
Example response:
Performance¶
Model complexity and runtime:
LigandMPNN uses the same message-passing backbone as ProteinMPNN with additional protein–ligand and intraligand graph encoders, increasing parameter count from ~1.7M to ~2.6M
Despite this, decoding remains linear in protein length; CPU benchmarks from the reference implementation scale from ~0.6 s (ProteinMPNN) to ~0.9 s (LigandMPNN) per 100 residues, so BioLM’s deployment has only a small constant-factor overhead relative to ProteinMPNN
Within the MPNN family (ProteinMPNN, soluble/membrane variants, LigandMPNN), LigandMPNN is slightly slower per residue but still far lighter than structure-prediction models such as AlphaFold2 or ESMFold, which require MSAs or large sequence databases
Accuracy at ligand-contact sites (versus ProteinMPNN and Rosetta, held-out PDB complexes):
Small-molecule contacts (side-chain atoms within 5 Å of non-protein atoms): native sequence recovery ≈ 63.3 % for LigandMPNN versus ≈ 50.4 % for both ProteinMPNN and Rosetta (genpot)
Nucleotide contacts: ≈ 50.5 % for LigandMPNN, versus ≈ 34.0 % for ProteinMPNN and ≈ 35.2 % for Rosetta with a DNA-optimized energy function
Metal contacts: ≈ 77.5 % for LigandMPNN, versus ≈ 40.6 % for ProteinMPNN and ≈ 36.0 % for a Rosetta metal-optimized protocol; away from ligand-contact positions, LigandMPNN’s global sequence recovery is comparable to ProteinMPNN
Side-chain packing performance near ligands (within 5 Å of context atoms):
χ1 recovery (torsion within 10° of crystal) improves consistently over Rosetta and over a protein-only MPNN side-chain model: small-molecule sites ≈ 86.1 % (LigandMPNN) vs 83.3 % (Protein-only MPNN) vs 76.0 % (Rosetta); nucleotide sites ≈ 71.4 % vs 65.6 % vs 66.2 %; metal sites ≈ 79.3 % vs 76.7 % vs 68.6 %
For deeper torsions (χ3, χ4), all methods degrade in accuracy, but LigandMPNN still outperforms Rosetta and slightly improves on protein-only MPNN near ligands
Ligand-aware packing in the API leverages this model, enabling more reliable evaluation and repacking of residues around small molecules, nucleic acids and metals than ProteinMPNN-based packing alone
Relative efficiency and deployment behavior:
Compared to Rosetta-based sequence-design and packing protocols, LigandMPNN inference avoids combinatorial Monte Carlo searches and is ≈ 250× faster on similar CPUs in the reference implementation; BioLM’s GPU-backed deployment preserves this large speed advantage at scale
Within BioLM pipelines, LigandMPNN is typically a minor contributor to end-to-end latency: it is significantly cheaper and faster per call than structure-prediction models (AlphaFold2, ESMFold, NanobodyBuilder2) while providing 10–40 percentage points higher native sequence recovery in ligand-contact regions than ProteinMPNN
Training with ~0.1 Å Gaussian coordinate noise improves robustness to non-crystallographic backbones (e.g., RFdiffusion-like designs, AlphaFold2/ESMFold hallucinations), so performance degrades less on generated structures than methods that assume ideal crystallographic geometry
Applications¶
Structure-based optimization of small-molecule binders around fixed ligand poses, using LigandMPNN to redesign residues within ~5 Å of a bound compound to improve shape complementarity, hydrogen bonding, and pocket pre-organization; this is valuable for biotech and platform companies that already have scaffold backbones (for example from RFdiffusion, RoseTTAFold All-Atom, or Rosetta) and need to rapidly generate and prioritize binder variants for specific drugs, metabolites, or imaging agents without hand-tuning force-field parameters
Design and maturation of protein-based small-molecule sensors, where LigandMPNN is used to refine or re-pack binding sites around pre-positioned ligands to improve affinity and specificity while maintaining a given global fold; this is useful for engineering biosensor proteins for diagnostics, process monitoring, or cell-based reporting systems, since the model explicitly accounts for non-protein atoms and predicts sidechain conformations rather than relying on purely backbone-based inverse folding
Sequence design of protein–DNA interfaces for sequence-specific DNA binders, by conditioning on the atomic coordinates of a target DNA duplex and a candidate protein backbone to generate residues that recognize bases in the major groove; this enables industrial teams building programmable DNA-binding domains (for synthetic regulation, genome targeting, or DNA diagnostics) to move beyond generic DNA affinity and toward more sequence-selective binders with experimentally validated design patterns, while still requiring downstream structure prediction and experimental validation for off-target assessment
Context-aware redesign of metal-binding sites (for example Zn, Fe, or other transition metals), leveraging LigandMPNN’s explicit encoding of metal identity and geometry to propose coordinating residues and sidechain conformations around an already positioned metal ion; this is useful for stabilizing metal-dependent scaffolds (such as catalytic or structural metal sites) in bioprocess-compatible formats, but it is not optimal for elements that are extremely rare or absent in PDB training data, where additional physics-based modeling or element mapping is required
Rescue and affinity improvement of existing structure-based designs that show weak or no binding, by feeding in the original backbone and ligand pose and using LigandMPNN to locally respecify binding-site residues and sidechains; this is particularly valuable for teams with legacy Rosetta or in-house designs that partially failed in the lab, allowing systematic sequence-level optimization and sidechain repacking around the ligand at GPU-accelerated throughput instead of re-running expensive Monte Carlo packing or manually re-engineering pockets
Limitations¶
Sequence length and input size: Each design request is limited to backbones with at most
1024residues per chain (MPNNParams.max_sequence_len). Very large assemblies (for example, multi‑megadalton complexes) may need to be split and designed in pieces. The request payloaditemsmust contain valid PDB text initems[0].pdb(only one item is allowed per request for LigandMPNN); malformed coordinates or missing backbone atoms (N, CA, C, O) will cause validation to fail.Batching and throughput: LigandMPNN design is optimized for small batches. The top‑level parameters
number_of_batchesandbatch_sizeare both capped at1(MPNNParams.num_batchesandMPNNParams.batch_size), so each API call can only generate one batch of one design per input structure. TheLigandMPNNGenerateRequest.itemslist may contain at mostMPNNParams.items_batch_sizePDBs (1) per call; to design larger libraries you must parallelize across multiple API calls.Ligand and atom context assumptions: LigandMPNN expects small molecules, nucleotides, metals, or selected side chains encoded as
HETATM(or side‑chain atoms whenligand_mpnn_use_atom_context/force_hetatmare used) with reasonable geometry near the binding site. Only the closest 25 ligand atoms per residue are used internally, so very large ligands (for example, long DNA or RNA segments) may be only partially “seen” at each position. Unusual elements that are rare or absent in the PDB training data should be mapped to chemically similar elements before submission; otherwise, scores and designs around those atoms can be unreliable.Design scope and control: Residue‑level control via
fixed_residues,redesigned_residues,symmetry_residues, and chain selection viachains_to_design/parse_these_chains_onlyis validated against the input PDB: chain IDs and residue indices must exist and lie within the detected chain lengths, or the request will fail. The model is sequence‑design only: it assumes the input backbone (and ligand coordinates) are fixed and close to a realizable structure, and it does not relax or repair poor backbones or clashes; for backbone generation or large‑scale conformational search, diffusion‑based or structure‑prediction models are more appropriate.Scoring and side‑chain modeling: The API returns per‑residue probabilities and overall confidence scores (for example,
overall_confidence,ligand_confidence,log_probs,sampling_probs), but these are not physical binding free energies and should not be interpreted as quantitative ΔG values. The optional side‑chain packing model (controlled viapack_side_chains/repack_everythingand relatedsc_*parameters) improves local packing but still struggles with higher‑order torsions (chi3/chi4) and cannot replace full all‑atom refinement when exact hydrogen‑bonding networks or metal coordination geometry are critical.Non‑optimal use cases: LigandMPNN is specialized for sequence design on pre‑specified backbones with local ligand/atom context. It is not the best choice when you primarily need: (a) de novo backbone generation (use backbone diffusion or structure‑hallucination models instead), (b) high‑throughput ranking of millions of sequences without explicit structures (sequence‑ or embedding‑based models scale better), (c) global stability optimization without nonprotein context (standard ProteinMPNN or sequence‑based fitness models are typically more efficient), or (d) detailed, physics‑based binding energy decomposition (classical docking or Rosetta‑style all‑atom energy functions are more appropriate downstream of LigandMPNN designs).
How We Use It¶
LigandMPNN enables structure-based sequence design around small molecules, nucleotides and metals as a programmable service, turning docked or experimentally determined complexes into ranked sequence hypotheses that are ready for synthesis and testing. In BioLM workflows it integrates upstream with backbone-generation and docking (for example RFdiffusion-style backbones, ligand docking or curated crystal structures) and downstream with sequence-embedding, stability and developability predictors and experimental design loops. Standardized, GPU-backed APIs allow teams to run scalable campaigns where candidate binding pockets are generated, sequences are locally redesigned around the ligand with explicit atomic context and optional sidechain repacking, then filtered and prioritized under project-specific constraints such as manufacturability, IP considerations or assay cost.
Integrates with other structure-generation, docking and scoring models to form end-to-end binder, sensor and enzyme design workflows.
Supports iterative design–build–test–learn cycles where experimental data refines which ligand-contact variants and pockets are explored in subsequent rounds.
References¶
Dauparas, J., Lee, G. R., Pecoraro, R., An, L., Anishchenko, I., Glasscock, C., & Baker, D. (2025). Atomic context-conditioned protein sequence design using LigandMPNN. Nature Methods. https://doi.org/10.1038/s41592-025-02626-1
