SoluProt predicts the probability that a protein will be expressed in a soluble form in E. coli. It uses a gradient boosting classifier trained on a curated, length-balanced subset of the TargetTrack database and evaluated on an independent NESG benchmark (AUC 0.62, 58.5% accuracy on a balanced 3,100-sequence test set). The API accepts unambiguous amino acid sequences (20–5,000 residues, up to 100 per request) and returns a calibrated solubility probability and a binary call, enabling sequence ranking and filtering in high-throughput enzyme discovery and protein engineering workflows.
Predict
Predict probability of soluble expression in E. coli for multiple protein sequences
- POST /api/v3/soluprot/predict/
Predict endpoint for SoluProt.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
batch_size (integer, range: 1-100, default: 100) — Maximum number of sequences allowed per request
max_sequence_len (integer, range: 20-5000, default: 5000) — Maximum allowed sequence length in amino acids
min_sequence_len (integer, range: 20-5000, default: 20) — Minimum allowed sequence length in amino acids
items (array of objects, min: 1, max: 100) — Input sequences:
sequence (string, min length: 20, max length: 5000, required) — Protein sequence using unambiguous standard amino acid codes
Example request:
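A minimal sketch of assembling such a request in Python with the standard library. The path, headers, and field names come from the description above; `BASE_URL`, the helper `build_request`, and the sample sequence are hypothetical placeholders, and the check assumes upper-case one-letter codes.

```python
import json
import urllib.request

# Assumption: BASE_URL is a placeholder; substitute the real API host.
BASE_URL = "https://example.invalid"
ENDPOINT = BASE_URL + "/api/v3/soluprot/predict/"

# The 20 unambiguous standard amino acid one-letter codes (upper case assumed).
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def build_request(sequences, api_key):
    """Validate sequences against the documented limits and build a POST request."""
    if not 1 <= len(sequences) <= 100:
        raise ValueError("items must contain between 1 and 100 sequences")
    for seq in sequences:
        if not 20 <= len(seq) <= 5000:
            raise ValueError("sequence length must be 20-5000 residues")
        if set(seq) - STANDARD_AA:
            raise ValueError("sequence contains non-standard residue codes")
    payload = {"items": [{"sequence": seq} for seq in sequences]}
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )


# Placeholder sequence for illustration only (not a real protein).
request = build_request(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"], "YOUR_API_KEY")
body = json.loads(request.data)
print(request.get_method(), request.full_url)
print(body)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

The request object is only constructed here, not sent, so the sketch runs without network access or a valid key.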
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
soluble (float, range: 0.0-1.0) — Predicted probability of soluble expression in Escherichia coli
is_soluble (boolean) — Binary soluble expression classification at threshold ≥ 0.5
Example response:
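A sketch of reading a response with the documented shape. The JSON literal below is illustrative only: its numbers are invented to show the fields and are not real SoluProt output, and the input labels are placeholders.

```python
import json

# Illustrative response body: values are made up, not real model output.
raw = (
    '{"results": ['
    '{"soluble": 0.73, "is_soluble": true}, '
    '{"soluble": 0.41, "is_soluble": false}]}'
)

# Placeholder labels standing in for the two submitted sequences.
inputs = ["first_input", "second_input"]
results = json.loads(raw)["results"]

# Results come back in request order, so they can be zipped with the inputs.
paired = list(zip(inputs, results))
for label, res in paired:
    # is_soluble is documented as the score thresholded at >= 0.5.
    assert res["is_soluble"] == (res["soluble"] >= 0.5)
    print(label, res["soluble"], res["is_soluble"])
```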
Performance
Predictive task and outputs:
Estimates the probability that a protein will be both expressed and recovered in soluble form in E. coli, based solely on primary sequence
Model class: gradient boosting machine over 96 engineered sequence features (composition, physicochemical descriptors, transmembrane propensity, homology to soluble E. coli proteins)
Benchmark predictive accuracy (independent NESG-based SoluProt test set; 3,100 sequences; balanced classes; ≤25% global identity to training set):
Area under ROC curve (AUC): 0.62
Accuracy: 58.5% at probability threshold 0.5; Matthews correlation coefficient (MCC): 0.17
Confusion matrix: 939 TP, 873 TN, 677 FP, 611 FN
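The accuracy and MCC figures can be checked directly against the confusion matrix. The short sketch below recomputes them with the standard definitions; sensitivity and specificity are derived here for context and are not quoted in the text above.

```python
import math

# Confusion-matrix counts as reported for the 3,100-sequence test set.
tp, tn, fp, fn = 939, 873, 677, 611

total = tp + tn + fp + fn
accuracy = (tp + tn) / total

# Matthews correlation coefficient, standard binary definition.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

sensitivity = tp / (tp + fn)  # fraction of soluble proteins called soluble
specificity = tn / (tn + fp)  # fraction of insoluble proteins called insoluble

print(f"n={total} accuracy={accuracy:.3f} mcc={mcc:.2f}")
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```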
Comparative accuracy versus other sequence-based solubility predictors on the same test set:
SoluProt achieved the highest AUC (0.62) and accuracy (58.5%) among 12 tools evaluated (including PROSO II, SWI, CamSol, DeepSol, SKADE)
The deep learning methods DeepSol and SKADE underperformed SoluProt (AUC ≈ 0.51–0.55; accuracies 52.9% and 49.2%, respectively) despite substantial overlap between their training sets and the SoluProt test data, which suggests that SoluProt's curated training set generalizes better
Practical enrichment and role in BioLM workflows:
Ranking sequences by SoluProt score and keeping only the top 10% yields ~49.7% more truly soluble proteins than random selection on a balanced set (232 vs. 155), making it effective for hit prioritization even with <60% absolute accuracy
Relative to BioLM transformer-based solubility models (for example BioLMSol, TemBERTure variants), SoluProt is narrower in scope but typically cheaper and faster per sequence and provides better enrichment for E. coli soluble-expression campaigns; stability-oriented models (for example TemBERTure Regression, ESM2StabP) are complementary because they estimate thermodynamic stability rather than expression outcome
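The enrichment figure quoted above follows from simple arithmetic. The sketch below assumes the top 10% is taken from the 3,100-sequence balanced test set (310 sequences, of which 155 would be soluble under random selection); the variable names are illustrative.

```python
# Balanced test set: half of the 3,100 sequences are soluble.
test_set_size = 3100
n_soluble = 1550
top_percent = 10

selected = test_set_size * top_percent // 100               # sequences kept
expected_random = selected * n_soluble / test_set_size      # soluble hits by chance
observed = 232                                              # soluble hits reported for the top 10%

enrichment = observed / expected_random
relative_gain_pct = (enrichment - 1) * 100
precision_in_top = observed / selected

print(f"selected={selected} expected_random={expected_random:.0f}")
print(f"enrichment={enrichment:.3f} gain={relative_gain_pct:.1f}%")
print(f"precision within selection={precision_in_top:.3f}")
```

Under these assumptions, roughly three quarters of the selected top fraction is soluble even though overall accuracy at the 0.5 threshold is below 60%.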
Applications
Pre-screening enzyme variant libraries for soluble overexpression in E. coli before DNA synthesis or assembly, using SoluProt probabilities to focus cloning and screening on variants with higher predicted soluble expression and to deprioritize sequences likely to form inclusion bodies or fail to express
Selecting soluble orthologs or homologs from metagenomic or public sequence databases as starting points for industrial enzyme development (e.g., hydrolases, oxidoreductases, transferases), by ranking thousands of candidates with SoluProt scores to reduce failed E. coli expression campaigns in discovery or scale-up
Prioritizing soluble protein constructs (e.g., domain boundaries, truncations, tag-only fusions) for structural biology and assay reagent production, by comparing SoluProt outputs across construct designs to reduce the number of E. coli expression trials required to obtain milligram-scale soluble protein for biophysical or functional assays
Filtering computationally designed or ML-generated enzyme sequences for predicted soluble expression in E. coli, using SoluProt as a downstream gate in in silico design pipelines to discard low-solubility designs before ordering DNA, with the understanding that the gradient-boosted model is trained specifically on E. coli intracellular expression and does not capture alternative hosts or fermentation conditions
Portfolio-level triage of protein engineering projects that assume E. coli intracellular expression (e.g., early feasibility for new industrial biocatalysts), by aggregating SoluProt predictions across proposed targets to estimate relative expression risk and prioritize programs, noting that the model is not intended for membrane proteins, secreted proteins, or non-E. coli hosts
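Most of the applications above reduce to ranking candidates by score and keeping a top fraction. A minimal sketch of that step, with a hypothetical helper `select_top_fraction` and invented variant names and scores:

```python
import math


def select_top_fraction(scored, fraction):
    """Return the highest-scoring fraction of (name, score) pairs, best first."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical variants with made-up scores, for illustration only.
scored = [
    ("variant_a", 0.31),
    ("variant_b", 0.82),
    ("variant_c", 0.57),
    ("variant_d", 0.12),
    ("variant_e", 0.66),
    ("variant_f", 0.44),
]

shortlist = select_top_fraction(scored, 0.5)
print([name for name, _ in shortlist])
```

Selecting by rank rather than by the fixed 0.5 threshold keeps the shortlist size predictable, which matters when cloning capacity is the limiting resource.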
Limitations
Input size and format: Each request must provide one or more protein sequences in items, where each element is a SoluProtSequenceItem with a single sequence field containing only unambiguous amino acid codes. Individual sequences must be between 20 and 5000 residues (inclusive); shorter or longer inputs are rejected.
Batching constraints: The items list in SoluProtPredictRequest must contain at least 1 and at most 100 sequences per call. Larger datasets must be split across multiple requests and batched client-side.
Output interpretation: For each input sequence, the API returns a SoluProtPredictResult with soluble (a float between 0 and 1) and is_soluble (a boolean using a fixed 0.5 decision threshold). The soluble score is calibrated for ranking and prioritization rather than as an absolute probability of successful expression; use it comparatively within a dataset, not as a hard pass/fail guarantee.
Organism and expression-system specificity: SoluProt is trained to predict soluble overexpression in Escherichia coli only. Predictions may be misleading for other hosts (for example, yeast, insect, mammalian, or cell-free systems) or for conditions that differ substantially from typical E. coli pipelines (for example, unusual fusion tags, extreme temperatures, specialized chaperone systems).
Structural and protein-type coverage: The model operates on primary sequence and was trained after filtering out transmembrane proteins, very short fragments, and sequences with undefined residues. It is not optimized for membrane proteins, secreted or complex multidomain constructs, heavily disordered fusions, or antibody-like scaffolds; for these, structure-aware or domain-specific models are often more appropriate, or SoluProt should be used only as an early coarse filter.
Predictive performance and role in pipelines: On an independent balanced test set, reported accuracy is about 58.5% with substantial false positives and false negatives. SoluProt is best used as an enrichment step in large-scale pipelines (for example, ranking many candidates and selecting a top fraction for experimental testing), not as the sole decision-maker for individual high-value constructs; for critical designs, pair it with other biophysical, structural, and task-specific models.
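The size and batching limits above can be handled client-side before any request is made. A minimal sketch, assuming upper-case one-letter codes; the helpers `is_acceptable` and `make_batches` and the synthetic dataset are hypothetical.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
MAX_BATCH = 100  # documented maximum number of items per request


def is_acceptable(seq):
    """Check the documented length range and the unambiguous amino acid alphabet."""
    return 20 <= len(seq) <= 5000 and not (set(seq) - STANDARD_AA)


def make_batches(sequences, batch_size=MAX_BATCH):
    """Split acceptable sequences into request-sized batches; return (batches, rejected)."""
    if not 1 <= batch_size <= MAX_BATCH:
        raise ValueError("batch_size must be between 1 and 100")
    accepted = [s for s in sequences if is_acceptable(s)]
    rejected = [s for s in sequences if not is_acceptable(s)]
    batches = [
        accepted[i:i + batch_size] for i in range(0, len(accepted), batch_size)
    ]
    return batches, rejected


# Synthetic inputs: 250 valid sequences, one too short, one with an ambiguous code (X).
dataset = ["ACDEFGHIKLMNPQRSTVWY" * 2] * 250 + ["ACDE", "ACDEFGHIKLMNPQRSTVWYX"]
batches, rejected = make_batches(dataset)
print([len(b) for b in batches], len(rejected))
```

Each batch can then be sent as the items list of one request, and the rejected sequences reported back to the user instead of triggering a 400 response.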
How We Use It
BioLM uses SoluProt as a standardized solubility filter in protein design and optimization workflows to prioritize E. coli–expressible candidates before investing in cloning, expression, and assay campaigns. SoluProt scores are combined with BioLM sequence-embedding models, structural predictors, and developability/liability metrics in multi-parameter ranking schemes for enzyme discovery, antibody engineering, and iterative lead optimization. Via scalable APIs, SoluProt predictions plug into automated ML pipelines that generate, score, and narrow thousands of variants per cycle, so experimental teams focus on constructs with higher probability of soluble expression while data and ML teams refine sequence design models using feedback from lab results.
In enzyme mining and engineering, SoluProt integrates with generative sequence models and functional predictors to exclude low-solubility sequences early, raising expression-screen hit rates and reducing re-synthesis cycles.
In antibody and biologics optimization, SoluProt complements structure-based aggregation and stability metrics with an orthogonal, sequence-level solubility signal that feeds into composite developability scores for portfolio and project decisions.
References
Hon, J., Marusiak, M., Martinek, T., Kunka, A., Zendulka, J., Bednar, D., & Damborsky, J. (2021). SoluProt: prediction of soluble protein expression in Escherichia coli. Bioinformatics, 37(1), 23–28. https://doi.org/10.1093/bioinformatics/btaa1102
