Soluble ProteinMPNN is a structure-conditioned sequence design service for generating soluble analogs of integral membrane protein folds. Given a single backbone in PDB format (one structure per request), it applies a ProteinMPNN model retrained on soluble structures to redesign residues, lowering surface hydrophobicity while preserving the target topology. The API exposes residue-level constraints, per-residue amino acid biases/omissions, symmetry specifications, and soluble/membrane labeling, supporting design of GPCR-like and other complex folds for enzyme engineering, ligand binding, and biophysical studies.
Generate
This endpoint generates sequence designs with Soluble ProteinMPNN.
- POST /api/v3/soluble-mpnn/generate/
Generate endpoint for Soluble ProteinMPNN.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, required) — Configuration parameters:
temperature (float, default: 0.1) — Sampling temperature
fixed_residues (array of strings, default: []) — Residue specifications kept fixed during sequence generation
redesigned_residues (array of strings, default: []) — Residue specifications selected for redesign during sequence generation
bias_AA (object, default: {}) — Global amino acid sampling bias; keys are single-letter amino acid codes, values are float bias weights
bias_AA_per_residue (object, default: {}) — Per-residue amino acid sampling bias; keys are residue specifications, values are objects mapping single-letter amino acid codes to float bias weights
omit_AA (string, default: “”) — String of single-letter amino acid codes globally excluded from sampling
omit_AA_per_residue (object, default: {}) — Per-residue amino acid exclusions; keys are residue specifications, values are strings of single-letter amino acid codes
symmetry_residues (array of arrays of strings, default: []) — Groups of residue specifications constrained together for symmetry
symmetry_weights (array of arrays of floats, default: []) — Weight values associated with each group in symmetry_residues
homo_oligomer (boolean, default: False) — Homooligomer design mode flag
chains_to_design (array of strings, default: []) — Chain identifiers to include in design
parse_these_chains_only (array of strings, default: []) — Chain identifiers to restrict PDB parsing to
parse_atoms_with_zero_occupancy (boolean, default: False) — Whether atoms with zero occupancy are included when parsing the PDB
number_of_batches (int, range: 1-1, default: 1) — Number of batches per request
batch_size (int, range: 1-2, default: 1) — Number of designs generated per batch
repack_everything (boolean, default: False) — Whether all residues are repacked in side-chain modeling
pack_side_chains (boolean, default: False) — Whether side-chain packing is enabled
number_of_packs_per_design (int, range: 1-8, default: 1) — Number of side-chain packing runs per design
sc_num_samples (int, range: 1-64, default: 16) — Number of side-chain sampling trajectories
sc_num_denoising_steps (int, range: 1-10, default: 3) — Number of denoising steps in side-chain sampling
force_hetatm (boolean, default: False) — Whether HETATM records are included in side-chain modeling
pack_with_ligand_context (boolean, default: True) — Whether side-chain packing uses ligand context
fasta_seq_separation (string, default: “:”) — Separator used when concatenating sequences in FASTA-formatted outputs
file_ending (string, default: “”) — File suffix used internally
zero_indexed (int, default: 0) — Residue index origin flag
pdb_path (null, fixed: null) — Unused field
redesigned_residues_multi (null, fixed: null) — Unused field
fixed_residues_multi (null, fixed: null) — Unused field
bias_AA_per_residue_multi (null, fixed: null) — Unused field
omit_AA_per_residue_multi (null, fixed: null) — Unused field
save_stats (null, fixed: null) — Unused field
verbose (boolean, default: True) — Logging verbosity flag
ligand_mpnn_use_side_chain_context (null, fixed: null) — Unused field
ligand_mpnn_use_atom_context (boolean, default: True) — Whether ligand-aware scoring uses atom-level context
ligand_mpnn_cutoff_for_score (float, default: 8.0) — Distance cutoff in angstroms for ligand scoring
global_transmembrane_label (string, allowed: “membrane” | “soluble”, default: “soluble”) — Global transmembrane environment label applied to all residues
transmembrane_buried (array of strings, default: null) — Residue specifications labeled as buried transmembrane
transmembrane_interface (array of strings, default: null) — Residue specifications labeled as transmembrane interface
items (array of objects, min: 1, max: 1, required) — Input structures:
pdb (string, min length: 1, max length: max_pdb_str_len, required) — PDB-formatted structure string containing ATOM and/or HETATM records
Example request:
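As a minimal sketch of a request, the snippet below assembles the JSON body and headers described above using only the Python standard library. The host in BASE_URL is a hypothetical placeholder (the documentation gives only the path), the helper name build_generate_request is illustrative, and the parameter values are arbitrary choices within the documented ranges. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical placeholder host; substitute the real API base URL.
BASE_URL = "https://api.example.invalid"
ENDPOINT = "/api/v3/soluble-mpnn/generate/"


def build_generate_request(pdb_text, api_key, base_url=BASE_URL):
    """Build (but do not send) a POST request for the generate endpoint."""
    payload = {
        "params": {
            "temperature": 0.1,          # documented default
            "batch_size": 2,             # documented maximum
            "number_of_batches": 1,      # only allowed value
            "omit_AA": "C",              # illustrative: exclude cysteine globally
            "global_transmembrane_label": "soluble",
        },
        # Exactly one structure per request (min: 1, max: 1).
        "items": [{"pdb": pdb_text}],
    }
    return urllib.request.Request(
        url=base_url + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )


# A one-atom stand-in for illustration; a real call needs a full backbone.
EXAMPLE_PDB = (
    "ATOM      1  CA  ALA A   1       0.000   0.000   0.000"
    "  1.00  0.00           C"
)
request = build_generate_request(EXAMPLE_PDB, "YOUR_API_KEY")
body = json.loads(request.data)
print(request.get_method(), request.full_url)
print(sorted(body["params"]))
```

Passing the resulting object to urllib.request.urlopen would send it once BASE_URL points at a real host.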
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
sequence (string) — Designed amino acid sequence using single-letter codes, length L
pdb (string) — Designed structure in PDB format corresponding to sequence
overall_confidence (float) — Model confidence score for the design, unitless
ligand_confidence (float) — Model confidence score related to ligand or atom context, unitless
seq_rec (float) — Sequence recovery percentage relative to the input, range: 0.0–100.0
log_probs (array of arrays of floats, shape: [L, 20]) — Per-position log-probabilities for 20 amino acids, values in natural logarithm space
sampling_probs (array of arrays of floats, shape: [L, 20]) — Per-position sampling probabilities for 20 amino acids, each inner array sums to 1.0
pdb_packed (object, optional) — Side-chain repacked structures for side_chain models
<key> (string) — Context identifier such as a chain ID or design label
<value> (string) — Repacked structure in PDB format for the corresponding key
Example response:
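The sketch below shows one way a client might sanity-check a single entry of results, assuming it has already been decoded from JSON into a Python dict. It verifies the documented [L, 20] shapes and computes a per-position Shannon entropy from sampling_probs, which does not depend on the column order of the 20 amino acids (that order is not stated here). The mock_result values are hand-made for illustration and are not real model output; summarize_result is a hypothetical helper name.

```python
import math


def summarize_result(result):
    """Check array shapes and compute per-position sampling entropy (nats)."""
    length = len(result["sequence"])
    for name in ("log_probs", "sampling_probs"):
        rows = result[name]
        if len(rows) != length or any(len(row) != 20 for row in rows):
            raise ValueError(f"{name} does not have shape [{length}, 20]")
    # Entropy is 0 for a fully determined position and ln(20) for a uniform one.
    entropies = [
        -sum(p * math.log(p) for p in row if p > 0.0)
        for row in result["sampling_probs"]
    ]
    return {
        "length": length,
        "seq_rec": result["seq_rec"],
        "entropies": entropies,
        # 0-based index into the sequence string, not a PDB residue number.
        "most_uncertain_index": max(range(length), key=entropies.__getitem__),
    }


# Hand-made two-residue result for illustration; not real model output.
mock_result = {
    "sequence": "AG",
    "seq_rec": 50.0,
    "log_probs": [[math.log(0.05)] * 20, [0.0] + [-30.0] * 19],
    "sampling_probs": [[0.05] * 20, [1.0] + [0.0] * 19],
}
summary = summarize_result(mock_result)
print(summary["length"], summary["most_uncertain_index"])
```

High-entropy positions are ones where the model sampled from a broad distribution, which can be a useful hint about where to add biases or fixed residues in a follow-up request.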
Performance
Model behavior and training:
- Soluble ProteinMPNN is a ProteinMPNN variant retrained on a PDB subset filtered to exclude annotated transmembrane structures. This biases designs toward soluble-like surface composition while retaining core packing and fold-conditional behavior comparable to the base ProteinMPNN on soluble backbones.
Backbone-conditional structural accuracy:
- For experimentally validated AF2seq → Soluble ProteinMPNN designs from Goverde et al. (e.g., CLF_4, RPF_9, GLF_18/32, TBF_24), design models compared with X-ray structures show backbone Cα RMSD ≈ 0.73–1.05 Å and full-atom RMSD ≈ 1.3–2.1 Å, indicating high side-chain and global-fold fidelity under fixed-backbone design conditions.
Solubility bias and experimental outcomes vs standard ProteinMPNN:
- Relative to generic ProteinMPNN, the soluble variant reduces the fraction of apolar residues on solvent-exposed surfaces for membrane-like or amphipathic backbones, shifting surface hydrophobicity toward naturally soluble profiles. In AF2seq → Soluble ProteinMPNN pipelines this supports higher empirical rates of soluble, monomeric, often hyperthermostable designs (many with Tm > 90 °C).
Performance relative to other BioLM MPNN-family models:
- On generic soluble backbones, fixed-backbone design metrics (e.g., sequence recovery, AF2/RoseTTAFold reprediction confidence) are similar to base ProteinMPNN. On membrane-like topologies, the soluble model empirically yields a better trade-off between maintaining the target fold and lowering surface hydrophobicity. In full design–validation workflows it is typically not the runtime bottleneck; downstream structure-prediction models usually are.
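Two of the quantities mentioned above, sequence recovery and apolar residue content, can be approximated from sequences alone. The sketch below is a rough illustration under stated assumptions: the apolar set AVILMFW is one common choice (definitions vary), apolar_fraction is computed over the whole sequence rather than over solvent-exposed positions (which would need a structure and an accessibility calculation), and the service's seq_rec value may be computed differently. The sequences are made up.

```python
# Assumed apolar residue set; other definitions include or exclude e.g. P, G, C, Y.
APOLAR = frozenset("AVILMFW")


def sequence_recovery(designed, reference):
    """Percentage of aligned positions where the designed residue matches."""
    if len(designed) != len(reference) or not designed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(d == r for d, r in zip(designed, reference))
    return 100.0 * matches / len(reference)


def apolar_fraction(sequence):
    """Fraction of residues in the assumed apolar set (whole sequence)."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    return sum(aa in APOLAR for aa in sequence) / len(sequence)


# Made-up eight-residue sequences for illustration only.
native = "LLVAIFGW"
design = "LEKAIRGW"
print(sequence_recovery(design, native))                 # 62.5
print(apolar_fraction(native), apolar_fraction(design))  # 0.875 0.5
```

A drop in apolar fraction alongside moderate recovery is the qualitative pattern described above, though a surface-restricted calculation would be the more faithful measure.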
Applications
Designing soluble GPCR-like scaffolds for high-throughput ligand screening and SAR campaigns by generating water-soluble 7TM-like helical bundles that preserve overall GPCR topology while removing membrane-facing hydrophobics. This enables pharma and biotech teams to run fragment screens, SPR/biophysical assays, or display-based selections using robust, easily expressed scaffolds instead of native membrane GPCRs. These designs are structural surrogates and are not expected to reproduce native signaling or G protein coupling.
Engineering soluble analogues of other membrane folds (for example, claudin-like and rhomboid-like helices) as tractable surrogates for difficult membrane targets, enabling biophysical characterization, structural studies, and binder discovery against GPCR-adjacent or junction-forming topologies in standard aqueous expression systems. This is useful when native membrane proteins are unstable, aggregate in detergents, or require lipid environments that slow down lead discovery. Catalytic or transport activities of the original membrane proteins are typically not preserved.
Creating de novo soluble helical scaffolds with predefined complex topologies for small-molecule and peptide binder design by using Soluble ProteinMPNN to generate low-hydrophobicity surfaces on integral-membrane-like backbones while maintaining long-range packing. This supports industrial platforms that need panels of stable, monomeric, non-natural scaffolds (low similarity to natural sequences) to host binding pockets, switchable elements, or conformational sensors directed at intracellular signaling components, rather than reproducing known natural folds.
Developing soluble “mock targets” for assay development, QC, and manufacturability screening by designing highly thermostable soluble variants of topologies normally found only in the membrane that are compatible with bacterial or yeast expression. This lets companies de-risk downstream screening and analytical workflows before committing to cell-based or membrane reconstitution assays. These analogues approximate the structural environment but not full native activity.
Integrating into automated protein engineering pipelines for fold-space exploration and IP generation, where Soluble ProteinMPNN is used to systematically sample sequence space around integral membrane-like topologies under soluble constraints (reduced surface hydrophobicity, cysteines masked by default), enabling creation of proprietary, highly stable scaffolds with novel sequences and folds. This is valuable for organizations building platform IP, but not optimal when preserving native membrane protein function is the primary objective.
Limitations
Maximum sequence length and structure size: Each input pdb is limited to backbones where every designed chain has at most MPNNParams.max_sequence_len = 1024 residues. Very large complexes or concatenated multi-domain models that exceed this per-chain limit are not supported and should be split into smaller design units before calling the API.
Batching and throughput constraints: The request-wide params.batch_size and params.number_of_batches are capped at MPNNParams.batch_size = 2 and MPNNParams.num_batches = 1 respectively, and items is limited to a single MPNNGenerateRequestItem per call (min_items=1, max_items=1). This API is optimized for iterative design on a small number of backbones per request rather than bulk redesign of large structure libraries in a single call.
Backbone quality and topology dependence: Soluble ProteinMPNN assumes the input pdb already represents a physically reasonable soluble backbone; it does not repair distorted structures, add missing regions, or change global topology. It works best on high-quality, globular backbones and may perform poorly on highly flexible, low-confidence, or poorly packed models. For generating or refining backbones (e.g., AF2- or diffusion-based backbone sampling), other algorithms should be used upstream before sequence design.
Solubility-focused sequence priors: This model is trained on soluble proteins (MPNNModelTypes.SOLUBLE) and biases surface composition away from transmembrane-like hydrophobicity. It is not suitable when you want to preserve native membrane-protein surface properties (e.g., for embedding in lipid bilayers); in those cases, use other ProteinMPNN variants (such as MPNNModelTypes.PROTEIN or membrane-labeled models) rather than this soluble design API.
Design scope and labels: Residue-level control is limited to what you specify in params.fixed_residues, params.redesigned_residues, params.bias_AA, and related per-residue options (such as params.bias_AA_per_residue, params.omit_AA, and params.omit_AA_per_residue). The soluble model does not consume or enforce per-residue membrane labels (transmembrane_buried, transmembrane_interface), so it is not optimal when you need fine-grained control over membrane vs. solvent exposure at individual positions.
Use cases where other models are preferable: Soluble ProteinMPNN is not an embedding model, does not predict structures, and does not generate entirely new folds. It is suboptimal for: (1) large-scale screening where sequence ranking or structure prediction is the bottleneck; (2) tasks requiring evolutionary signal or fitness prediction; and (3) design of explicitly membrane-embedded proteins or complexes, where diffusion-based backbone generators, structure-prediction-guided design pipelines, or membrane-aware ProteinMPNN variants are generally more appropriate.
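The size and batching limits in this section can be checked client-side before a request is sent. The following is a minimal sketch, assuming residues are counted as distinct (residue number, insertion code) pairs on CA atoms in ATOM records, read from the standard fixed PDB columns; the server's own counting may differ. The helper names (chain_lengths, check_limits, make_ca_line) are hypothetical, and the limits are copied from the values quoted above.

```python
from collections import defaultdict

MAX_SEQUENCE_LEN = 1024  # per designed chain, as stated above
MAX_BATCH_SIZE = 2
MAX_NUM_BATCHES = 1


def chain_lengths(pdb_text):
    """Count residues per chain from CA atoms in ATOM records."""
    residues = defaultdict(set)
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and len(line) > 26 and line[12:16].strip() == "CA":
            # Column 22 is the chain ID; columns 23-27 are resSeq + insertion code.
            residues[line[21]].add(line[22:27])
    return {chain: len(ids) for chain, ids in residues.items()}


def check_limits(pdb_text, batch_size=1, number_of_batches=1):
    """Return a list of human-readable problems; empty means no limit is hit."""
    problems = []
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        problems.append(f"batch_size must be in 1-{MAX_BATCH_SIZE}")
    if not 1 <= number_of_batches <= MAX_NUM_BATCHES:
        problems.append(f"number_of_batches must be in 1-{MAX_NUM_BATCHES}")
    for chain, count in sorted(chain_lengths(pdb_text).items()):
        if count > MAX_SEQUENCE_LEN:
            problems.append(f"chain {chain} has {count} residues (> {MAX_SEQUENCE_LEN})")
    return problems


def make_ca_line(serial, chain, resseq):
    """Format a synthetic CA-only ATOM record (coordinates are dummies)."""
    return (f"ATOM  {serial:5d}  CA  ALA {chain}{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


small = "\n".join(
    [make_ca_line(i, "A", i) for i in range(1, 4)]
    + [make_ca_line(i + 3, "B", i) for i in range(1, 3)]
)
too_long = "\n".join(make_ca_line(i, "A", i) for i in range(1, 1026))

print(chain_lengths(small))                # {'A': 3, 'B': 2}
print(check_limits(small, batch_size=3))   # one batch_size problem
print(check_limits(too_long))              # one chain-length problem
```

Because the API accepts one structure per call, a check like this is cheap to run in a loop over a backbone library before issuing any requests.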
How We Use It
Soluble ProteinMPNN enables redesign of complex or membrane-derived backbones into soluble, expressible proteins that can be evaluated in silico alongside structure prediction models, 3D scoring, and downstream property predictors. Through scalable, standardized APIs, teams integrate soluble sequence redesign into loops that explore GPCR-like and other membrane folds as soluble scaffolds, generate soluble analogues of challenging targets, and iteratively filter redesigned variants for developability, stability, and biophysical profiles before synthesis.
Integrates with backbone-generating and structure-validation models (e.g., AlphaFold-style workflows) to expand accessible soluble fold space for ligand binding, enzymology, and display systems.
Connects to BioLM predictive tools (such as structure-derived metrics, charge and hydrophobicity profiles, and thermostability proxies) so soluble variants can be triaged, prioritized, and tracked across multi-round engineering campaigns.
References
Goverde, C. A., Pacesa, M., Dornfeld, L. J., Georgeon, S., Rosset, S., Dauparas, J., Schellhaas, C., Kozlov, S., Baker, D., Ovchinnikov, S., & Correia, B. E. (2023). Computational design of soluble analogues of integral membrane protein structures. bioRxiv. https://doi.org/10.1101/2023.05.09.540044
