Pro4S Classification is a multimodal protein solubility predictor that fuses ESM-based sequence embeddings, AlphaFold-derived 3D structure graphs, and MaSIF-style surface descriptors using GNNs and contrastive learning. The API takes amino acid sequences plus a required single-chain PDB or mmCIF structure (up to 2048 residues, batch size ≤ 4), runs GPU-accelerated inference, and returns a 0–1 solubility score with optional class labels. Pro4S reaches AUC ≈ 0.83 on external benchmarks and supports enzyme and antibody developability assessment, expression screening, and filtering of de novo designs.
Predict
Predict protein solubility scores and labels from an amino acid sequence plus a 3D structure in PDB or mmCIF format, for a specified chain.
- POST /api/v3/pro4s-classification/predict/
Predict endpoint for Pro4S Classification.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
batch_size (int, default: 4) — Maximum number of items processed per request
max_sequence_len (int, default: 2048) — Maximum allowed sequence length in residues
items (array of objects, min: 1, max: 4) — Input sequences and structures:
sequence (string, min length: 1, max length: 2048, required) — Amino acid sequence using unambiguous residue codes
pdb (string, min length: 1, max length: 5000000, conditionally required) — PDB-format structure content; required if cif is not provided
cif (string, min length: 1, max length: 5000000, conditionally required) — mmCIF-format structure content; required if pdb is not provided
chain_id (string, length: 1, default: “A”) — Chain identifier corresponding to the provided structure
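The constraints above can be checked client-side before a request is sent. The following is a minimal sketch, not the service's own validation: the helper name validate_item is hypothetical, the 20-letter alphabet is an assumed reading of "unambiguous residue codes", and the "exactly one of pdb or cif" rule follows the Limitations section of this page.

```python
from typing import Any

MAX_SEQUENCE_LEN = 2048          # documented maximum sequence length
MAX_STRUCTURE_CHARS = 5_000_000  # documented maximum length of pdb / cif
# Assumption: "unambiguous residue codes" means the 20 standard amino acids.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def validate_item(item: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the item looks valid."""
    problems: list[str] = []

    seq = item.get("sequence")
    if not isinstance(seq, str) or not 1 <= len(seq) <= MAX_SEQUENCE_LEN:
        problems.append("sequence must be a string of 1 to 2048 characters")
    elif not set(seq.upper()) <= STANDARD_AA:
        problems.append("sequence contains non-standard residue codes")

    # Exactly one structure field should be supplied and non-empty.
    has_pdb = bool(item.get("pdb"))
    has_cif = bool(item.get("cif"))
    if has_pdb == has_cif:
        problems.append("provide exactly one of pdb or cif")
    for key in ("pdb", "cif"):
        value = item.get(key)
        if value and (not isinstance(value, str) or len(value) > MAX_STRUCTURE_CHARS):
            problems.append(f"{key} must be a string of at most 5000000 characters")

    # chain_id defaults to "A" when omitted.
    chain_id = item.get("chain_id", "A")
    if not isinstance(chain_id, str) or len(chain_id) != 1 or chain_id.isspace():
        problems.append("chain_id must be exactly one non-blank character")

    return problems
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in an item at once.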
Example request:
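A minimal Python sketch of building this request with the standard library. Only the path, headers, and field names come from this page. The host in BASE_URL, the helper names build_predict_request and send, and the placeholder sequence and structure text are hypothetical and need to be replaced with real values.

```python
import json
import urllib.request

# Hypothetical host: this page documents only the path, not the base URL.
BASE_URL = "https://api.example.invalid"
ENDPOINT = "/api/v3/pro4s-classification/predict/"


def build_predict_request(items, api_key, base_url=BASE_URL):
    """Build (but do not send) a POST request for the predict endpoint."""
    body = json.dumps({"items": items}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON body (performs network I/O)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder inputs for illustration; substitute a real sequence and the
# text of a matching single-chain structure file.
item = {
    "sequence": "MKTAYIAKQR",
    "pdb": "<contents of a PDB file>",
    "chain_id": "A",
}
request = build_predict_request([item], api_key="YOUR_API_KEY")
```

Calling send(request) would perform the actual POST; it is kept separate so the payload can be inspected or logged before any network traffic occurs.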
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
solubility_score (float, range: 0.0-1.0) — Predicted normalized solubility score
solubility_label (string, optional) — Predicted discrete solubility class label when available
Example response:
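A minimal Python sketch of consuming a response with the shape described above. It relies on the documented guarantee that results come back in request order. The helper name pair_results is hypothetical, and the payload at the end is illustrative only: its score is a made-up value, not model output.

```python
import json


def pair_results(items, response_body):
    """Attach each result to the input item at the same position."""
    results = response_body["results"]
    if len(results) != len(items):
        raise ValueError("expected one result per input item")

    paired = []
    for item, result in zip(items, results):
        score = float(result["solubility_score"])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of documented range: {score}")
        paired.append(
            {
                "sequence": item["sequence"],
                "solubility_score": score,
                # The label is optional, so fall back to None when absent.
                "solubility_label": result.get("solubility_label"),
            }
        )
    return paired


# Illustrative payload only; the score below is invented for the example.
example_body = json.loads('{"results": [{"solubility_score": 0.5}]}')
paired = pair_results([{"sequence": "MKTAYIAKQR"}], example_body)
```

Because solubility_label is optional, downstream code should be written to work from solubility_score alone.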
Performance
Predictive accuracy (classification vs BioLMSol/SoluProt/SWI-like sequence-only models):
- On curated E. coli expression datasets (NetSolP/PLM_Sol protocol), Pro4S attains AUROC ≈ 0.72–0.73 and MCC ≈ 0.33, typically improving MCC by ~40–50 % over sequence-only solubility classifiers such as SoluProt and NetSolP-like BioLMSol variants (MCC ≈ 0.18–0.23)
- On external low-homology test sets (<25 % identity to training), Pro4S maintains AUROC ≈ 0.83 where legacy sequence-based tools (e.g., SoluProt, SWI, PLM_Sol-style BioLMSol) plateau around AUROC 0.62–0.74, indicating better generalization to unseen folds and designed sequences
Predictive accuracy (regression vs structure-aware BioLMSol-style models):
- On the standard eSOL benchmark, Pro4S reaches R² ≈ 0.56 with RMSE ≈ 0.21, exceeding previous graph-based structural solubility models in the BioLM family (GraphSol/HybridGCN/GATSol-like regressors at R² ≈ 0.48–0.52) by ~0.04–0.08 absolute R²
- On cross-species external data (S. cerevisiae PURE system), Pro4S attains R² ≈ 0.43, slightly but consistently higher than prior graph-based BioLMSol variants (R² ≈ 0.36–0.42), showing that adding surface information yields measurable off-distribution gains
Expression / de novo design screening vs other BioLM solubility/developability predictors:
- On EGFR de novo design data, Pro4S-based variants achieve AUROC ≈ 0.76–0.78 with MCC up to ≈ 0.53–0.58, compared with ≈ 0.34–0.47 MCC for the strongest non-Pro4S baselines (e.g., GATSol-like graph models and Protein-Sol/SWI-like sequence indices)
- In practical terms, fine-tuned Pro4S can reduce the fraction of non-expressing designs by ~50 % while retaining ~97 % of highly expressed sequences; more stringent sequence-only indices (e.g., SWI-style) remove slightly more non-expressers but at the cost of discarding ~25 % of high-expression designs
Architectural and deployment behavior vs other BioLM models:
- Multimodal fusion (sequence + backbone structure + surface graph) and contrastive alignment with ESM embeddings typically add ≈ 0.01–0.02 to AUROC and ≈ 0.02–0.03 to R² over comparable BioLMSol models that omit surface or contrastive components, and they reduce overfitting on <25 % identity external sets
- Pro4S uses an ESM-3B-scale sequence encoder plus shallow structure/surface GNNs; inference cost per item is modestly higher (<2×) than sequence-only BioLMSol classifiers with the same backbone, and broadly similar to other GNN-heavy structural predictors (e.g., GraphSol/GATSol) while remaining substantially cheaper than full 3D generative or structure-prediction models (ProteinMPNN, AlphaFold2, ESMFold) that must generate coordinates rather than consume precomputed structures
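MCC (Matthews correlation coefficient) is quoted throughout this section alongside AUROC. For readers who want to reproduce it on their own labeled expression data, the following is a minimal sketch of the standard definition computed from confusion-matrix counts. The function name mcc is hypothetical, and returning 0.0 for a zero denominator is a common convention rather than anything specified on this page.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: undefined cases (an empty row or column) map to 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect agreement), and unlike accuracy it stays informative when soluble and insoluble classes are imbalanced.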
Applications
Up-front solubility screening of therapeutic protein leads (e.g., enzymes, cytokines, scaffolds) before expression campaigns, using Pro4S classification or regression scores to prioritize constructs with higher predicted aqueous solubility and reduce non-expressing or strongly aggregating variants going into wet-lab testing; particularly valuable when E. coli or cell-free expression capacity and material costs are the main bottlenecks
Solubility-aware protein engineering workflows, where rational or generative variant libraries are ranked by Pro4S scores to filter out poorly soluble designs while retaining variants with higher predicted expressibility, helping process-development teams avoid candidates likely to fail downstream in purification or formulation due to aggregation; the multimodal sequence–structure–surface architecture enables more sensitive detection of surface-exposed hydrophobic patches than sequence-only models
Optimization of expression constructs and domain architectures for industrial proteins (e.g., multi-domain catalysts, binding proteins, fusion constructs) by comparing Pro4S predictions across alternative domain boundaries, linker sequences, and tag configurations for a given chain in a supplied PDB/mmCIF structure, enabling teams to down-select designs more likely to express in soluble form in microbial or cell-free systems rather than relying on extensive tag-based rescue strategies
Developability risk assessment for protein-based products during lead selection, using Pro4S regression scores as one quantitative feature in a broader developability panel (e.g., stability, aggregation propensity, charge distribution) to flag candidates with intrinsically low predicted solubility that may cause formulation or manufacturing issues; most appropriate when assessing proteins intended for standard aqueous conditions similar to those represented in the training data
Solubility-guided de novo protein design campaigns, where large numbers of designed sequences (e.g., from generative models or combinatorial design) with associated 3D structures are pre-filtered with Pro4S to enrich for designs with higher likelihood of expression in screening systems, leveraging the model’s learned alignment between solubility and expression levels; not optimal for projects where solubility in non-aqueous media or highly non-standard formulation environments is the primary objective, since the model is trained on aqueous, near-physiological conditions
Limitations
Maximum Sequence Length: Each sequence must be an unambiguous amino-acid string of length 1 to 2048 characters. Longer proteins must be truncated or split upstream, which can remove N-/C-terminal context that influences solubility (for example signal peptides or disordered tails).
Batch Size and Throughput: Each request can include 1 to 4 entries in items. Large libraries must be sharded across multiple calls and results merged client-side. The model is relatively heavy (ESM + GNNs) and is better suited for ranking or triaging candidate sets than for brute-force screening of tens of millions of sequences in a tight loop.
Structure Requirements: For every item, either pdb or cif must be provided (but not both), and chain_id must be exactly one non-blank character (for example "A"). Structures are expected to be reasonably accurate monomeric models for the specified chain; unusual oligomeric states, membrane insertions, large missing regions, poor coordinate quality, or severe misfolds can degrade accuracy. The model was trained mainly on AlphaFold2-like monomeric structures and may underperform on complexes or highly flexible/disordered proteins.
Score Semantics and Calibration: solubility_score is a continuous value in [0.0, 1.0] and solubility_label (when present) is a binary classification derived from this score using a fixed internal threshold. Scores are not guaranteed to be proportional to experimental yield or expression titer, and thresholds tuned on E. coli solubility and cell-free expression benchmarks may not transfer directly to other hosts, constructs, tags, or assay conditions. Users should validate task- and system-specific cutoffs.
Scope of Biological Validity: Pro4S is optimized for soluble expression in recombinant E. coli-like settings and trained mainly on natural or near-natural, globular proteins with well-defined structures. Predictions are less reliable for very short peptides, large multi-domain fusions, intrinsically disordered or highly flexible proteins, membrane proteins, heavily engineered or non-natural sequences, or constructs dominated by tags/linkers. For sequence-only, very high-throughput filtering without structural data, simpler sequence-based solubility models or heuristics may be more cost-effective.
Pipeline Positioning and Complementarity: Pro4S is best used as a mid- to late-stage filter when reasonably confident structures are available and candidate counts are moderate. It is not intended to replace upstream generative design, expression-system–specific modeling, aggregation- or stability-focused predictors, or final structural/biophysical characterization. For projects driven primarily by other developability liabilities (for example stability, aggregation hotspots, immunogenicity), Pro4S should be combined with complementary models as part of an integrated pipeline.
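The batch-size limit above means larger libraries have to be split into requests of at most four items and the results merged client-side. The following is a minimal sketch of that sharding. The names chunked and predict_all are hypothetical, and predict_batch stands in for whatever function the caller uses to POST one batch and return its results list.

```python
from typing import Any, Callable, Iterator, Sequence

MAX_ITEMS_PER_REQUEST = 4  # documented maximum number of entries in items


def chunked(items: Sequence[Any], size: int = MAX_ITEMS_PER_REQUEST) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_all(
    items: Sequence[Any],
    predict_batch: Callable[[Sequence[Any]], list],
) -> list:
    """Run predict_batch over every chunk and merge results in input order."""
    merged: list = []
    for batch in chunked(items):
        batch_results = predict_batch(batch)
        if len(batch_results) != len(batch):
            raise ValueError("expected one result per item in the batch")
        merged.extend(batch_results)
    return merged
```

Because each response preserves request order, extending a single list batch by batch keeps the merged results aligned with the original item list.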
How We Use It
Pro4S classification enables rapid, structure-informed solubility screening as a standard decision point in protein engineering campaigns, so teams can down-select designs before expression and scale experimental work on the most developable candidates. Via scalable, standardized APIs, Pro4S integrates with generative design models, antibody and enzyme optimization workflows, and downstream developability predictors (for example, aggregation or stability models), allowing data scientists and ML engineers to treat solubility as a first-class feature in automated pipelines. In practice, Pro4S classification scores are combined with BioLM sequence embeddings, structure-based metrics, and lab results from expression rounds to drive multi-parameter optimization, refine project-specific models, and iteratively improve hit quality and success rates in wet-lab validation.
Used early in de novo and variant libraries to filter low-solubility designs, increasing expression success and reducing wasted cloning and assay capacity.
Combined with Pro4S regression and fine-tuned variants to prioritize clones for multi-round optimization of antibodies, enzymes, and other biologics, aligning in silico ranking with observed expression and manufacturability.
References
Qian, J., Yang, L., Wang, R., & Qi, Y. Pro4S: prediction of protein solubility by fusing sequence, structure, and surface. Manuscript in preparation / submitted.
