ProGen2 is a large-scale autoregressive protein language model suite for protein sequence generation and fitness prediction. BioLM provides scalable API access to multiple ProGen2 variants (OAS, Medium, Large, BFD90) for modern protein engineering, antibody/nanobody design, and synthetic biology workflows.
Generate
This endpoint generates protein sequences from a given context using ProGen2.
- POST /api/v3/progen2-oas/generate/
- Request Headers:
  - Content-Type – application/json
  - Authorization – Token YOUR_API_KEY
- Request JSON Object:
  - params (object) – Generation parameters:
    - temperature (float) – Sampling temperature (default: 0.8, range: 0.0–8.0)
    - top_p (float) – Nucleus sampling probability (default: 0.9, range: 0.0–1.0)
    - num_samples (int) – Number of sequences to generate per context (default: 1, range: 1–3)
    - max_length (int) – Maximum length of generated sequence (default: 128, range: 12–512)
  - items (array of objects, max. 1) – List of input contexts:
    - context (string, length 1–512) – Protein sequence context (unambiguous amino acid codes only)
Example request:
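A minimal Python sketch of building and sending a request. The field names, defaults, and ranges follow the schema above. The base URL, the helper names `build_payload` and `generate`, the example context, and the 20-letter alphabet used to check for "unambiguous" residues are assumptions made for illustration and are not part of the documented API.

```python
import json
import urllib.request

# Assumption: this section gives only the path, so the host below is a placeholder.
BASE_URL = "https://biolm.ai"
ENDPOINT = "/api/v3/progen2-oas/generate/"

# Assumption: "unambiguous" is taken to mean the 20 standard amino acids.
STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")


def _check_range(name, value, low, high):
    """Raise ValueError if value falls outside the documented inclusive range."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def build_payload(context, temperature=0.8, top_p=0.9, num_samples=1, max_length=128):
    """Validate inputs against the documented ranges and return the request body."""
    _check_range("context length", len(context), 1, 512)
    unsupported = set(context) - STANDARD_AAS
    if unsupported:
        raise ValueError(f"context contains unsupported residues: {sorted(unsupported)}")
    _check_range("temperature", temperature, 0.0, 8.0)
    _check_range("top_p", top_p, 0.0, 1.0)
    _check_range("num_samples", num_samples, 1, 3)
    _check_range("max_length", max_length, 12, 512)
    return {
        "params": {
            "temperature": temperature,
            "top_p": top_p,
            "num_samples": num_samples,
            "max_length": max_length,
        },
        # The schema allows at most one context per request.
        "items": [{"context": context}],
    }


def generate(payload, api_key, base_url=BASE_URL):
    """POST the payload with the documented headers and return the decoded JSON."""
    request = urllib.request.Request(
        base_url + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Illustrative context: a short antibody heavy-chain framework prefix.
payload = build_payload("EVQLVESGGGLVQPGGSLRLSCAAS", num_samples=3, max_length=256)
print(json.dumps(payload, indent=2))
# response = generate(payload, api_key="YOUR_API_KEY")  # needs network access and a valid token
```

Validating on the client side before sending avoids spending a round trip on a request that would come back as 400 Bad Request.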
- Status Codes:
  - 200 OK – Success. Returns generated sequences and their log-likelihoods.
  - 400 Bad Request – Invalid input (parameter value, context, or batch size).
  - 401 Unauthorized – Missing or invalid API token.
  - 500 Internal Server Error – Unexpected server-side failure.
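One way a client might act on these status codes is sketched below. Retrying only on 500 with exponential backoff is a design choice made for this example, not documented API behavior, and `call_with_retry`, `RequestRejected`, and the `send` callable are hypothetical names.

```python
import time


class RequestRejected(Exception):
    """Raised for 400/401 responses, which retrying will not fix."""


def call_with_retry(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns 200, retrying only on 500.

    send is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status in (400, 401):
            # The request itself or the token is wrong; surface it immediately.
            raise RequestRejected(f"HTTP {status}: {body}")
        if status != 500:
            raise RuntimeError(f"unexpected status {status}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x ... base_delay
    raise RuntimeError(f"still failing after {max_attempts} attempts")


# Demonstration with a stand-in for the real HTTP call: fails once, then succeeds.
responses = iter([(500, "Internal server error"), (200, {"results": []})])
body = call_with_retry(lambda: next(responses), sleep=lambda seconds: None)
print(body)
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.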
- Response JSON Object:
  - results (array of arrays) – For each input context, a list of generated sequence objects:
    - sequence (string) – Generated protein sequence
    - ll_sum (float) – Sum of log-likelihoods for the sequence
    - ll_mean (float) – Mean log-likelihood per token
Example response:
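A short Python sketch of reading a response with the documented shape. The sequences and scores in `example_response` are invented placeholders, not real model output, and `GeneratedSequence` and `parse_results` are hypothetical helper names. The perplexity property assumes the log-likelihoods are natural logarithms, which this section does not state.

```python
import math
from dataclasses import dataclass


@dataclass
class GeneratedSequence:
    sequence: str
    ll_sum: float
    ll_mean: float

    @property
    def perplexity(self):
        # Assumption: natural-log likelihoods, so perplexity = exp(-ll_mean).
        return math.exp(-self.ll_mean)


def parse_results(response):
    """Return one list of GeneratedSequence objects per input context."""
    return [
        [GeneratedSequence(s["sequence"], s["ll_sum"], s["ll_mean"]) for s in per_context]
        for per_context in response["results"]
    ]


# Placeholder response with the documented shape; all values are invented.
example_response = {
    "results": [
        [
            {"sequence": "EVQLVESGGGLVQPGG", "ll_sum": -16.0, "ll_mean": -1.0},
            {"sequence": "EVQLVESGGGLVKPGG", "ll_sum": -24.0, "ll_mean": -1.5},
        ]
    ]
}

for candidate in parse_results(example_response)[0]:
    print(candidate.sequence, round(candidate.perplexity, 2))
```

Because the schema allows one context per request, the outer `results` list is expected to hold a single inner list.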
Performance
Batch size: 1 context per request (per schema)
Max sequence length: 512 residues (input context and output)
Compute: OAS and Medium run on 2 CPUs with 8 GB RAM; Large and BFD90 run on 4 CPUs with 16 GB RAM; a T4 GPU is used for Medium, Large, and BFD90
Typical latency: on the order of seconds per request
Applications
De novo protein sequence generation from motifs or partial sequences
Antibody/nanobody library design (e.g., CDR diversification)
Enzyme engineering and synthetic protein design
Fitness landscape exploration and zero-shot fitness prediction
Large-scale protein library curation for screening
Limitations
Input context: Only unambiguous amino acids (no ambiguous or non-standard codes)
Max sequence length: 512 residues (input and output)
Batch size: 1 context per request
No structure conditioning: Sequence-only generation (no structure input)
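Given the limits of one context per request and at most 3 samples per request, collecting a larger library means issuing several requests. The sketch below shows one way to split the work into request bodies; `plan_requests` is a hypothetical helper and the contexts are illustrative.

```python
def plan_requests(contexts, samples_per_context, max_samples_per_request=3, **params):
    """Split work into payloads with one context and at most 3 samples each."""
    payloads = []
    for context in contexts:
        remaining = samples_per_context
        while remaining > 0:
            n = min(remaining, max_samples_per_request)
            payloads.append({
                # Extra keyword arguments (temperature, top_p, ...) pass through unchanged.
                "params": {**params, "num_samples": n},
                "items": [{"context": context}],
            })
            remaining -= n
    return payloads


payloads = plan_requests(["EVQLVESGGG", "QVQLQESGPG"], samples_per_context=7, temperature=1.0)
print(len(payloads), [p["params"]["num_samples"] for p in payloads])
```

Each payload can then be sent independently, so the requests can be issued sequentially or in parallel depending on the rate limits that apply to your account.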
How BioLM Uses ProGen2
BioLM enables scalable, programmatic access to ProGen2 for advanced protein design workflows:
Antibody/nanobody library design: Generate diverse CDR or full-length antibody/nanobody sequences for library construction, using context motifs or frameworks as prompts.
Enzyme and protein engineering: Design novel enzymes or functional proteins by seeding with active site motifs or partial sequences, generating variants for screening.
Fitness landscape exploration: Sample and score large numbers of variants for zero-shot fitness prediction, prioritizing candidates for experimental validation.
Dataset curation: Generate synthetic protein libraries for ML training, benchmarking, or augmentation of sparse datasets.
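The sample-and-score workflow described above can be sketched as a small post-processing step: pool generated sequences, drop duplicates, and rank by `ll_mean`. Using mean log-likelihood as a stand-in for fitness is a heuristic and does not guarantee function. `prioritize` is a hypothetical helper, and the pooled entries are invented placeholders rather than real model output.

```python
def prioritize(candidates, top_k=10, min_length=0):
    """Deduplicate by sequence, then return the top_k by mean log-likelihood."""
    best = {}
    for candidate in candidates:
        sequence = candidate["sequence"]
        if len(sequence) < min_length:
            continue
        # Keep the highest-scoring entry when a sequence appears more than once.
        if sequence not in best or candidate["ll_mean"] > best[sequence]["ll_mean"]:
            best[sequence] = candidate
    return sorted(best.values(), key=lambda c: c["ll_mean"], reverse=True)[:top_k]


# Placeholder pool, e.g. flattened from several responses; all values are invented.
pooled = [
    {"sequence": "EVQLVESGGGLVQPGG", "ll_sum": -16.0, "ll_mean": -1.0},
    {"sequence": "EVQLVESGGGLVKPGG", "ll_sum": -24.0, "ll_mean": -1.5},
    {"sequence": "EVQLVESGGGLVQPGG", "ll_sum": -16.0, "ll_mean": -1.0},
    {"sequence": "EVQLLESGGGLVQPGG", "ll_sum": -12.8, "ll_mean": -0.8},
]

for candidate in prioritize(pooled, top_k=2):
    print(candidate["sequence"], candidate["ll_mean"])
```

In practice the ranked list would feed into further filters (length, motif presence, developability checks) before experimental validation.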
References
Nijkamp, E., Ruffolo, J. A., Weinstein, E. N., Naik, N., & Madani, A. (2023). ProGen2: Exploring the Boundaries of Protein Language Models. bioRxiv. https://doi.org/10.1101/2023.06.21.546019
