ESM C 300M is a GPU-accelerated protein language model for predicting protein structure, function, and evolutionary relationships from amino acid sequences. Trained on evolutionary-scale data, it generates embeddings that capture structural and functional properties for tasks such as annotation transfer, variant effect prediction, and protein engineering. ESM C 300M supports single-sequence and batch processing, with outputs provided in standard numerical and file formats via API endpoints.
Predict
Predict properties or scores for input sequences
- POST /api/v3/esmc-300m/predict/
Predict endpoint for ESM C 300M.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
include (array of strings, default: [“mean”]) — Types of embeddings or logits to include in the response. Allowed values: “mean”, “per_token”, “logits”.
items (array of objects, min: 1, max: 5) — Input sequences:
sequence (string, required, min length: 1, max length: 2048) — Protein sequence using standard amino acid codes (ACDEFGHIKLMNPQRSTVWY, plus B, U, Z, O).
Example request:
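A minimal sketch of assembling a request that respects the limits listed above (1 to 5 items, sequences of 1 to 2048 residues, the listed amino acid codes). Only the path, headers, and field names come from this page; the host name and the helper build_predict_request are assumptions for illustration.

```python
import json

# Assumption: the host is not stated on this page; substitute your own base URL.
BASE_URL = "https://biolm.ai"
PREDICT_PATH = "/api/v3/esmc-300m/predict/"

# Limits and allowed values as documented for the request body.
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" + "BUZO")
ALLOWED_INCLUDE = {"mean", "per_token", "logits"}
MAX_ITEMS = 5
MAX_LENGTH = 2048


def build_predict_request(sequences, include=("mean",), api_key="YOUR_API_KEY"):
    """Return (url, headers, body) for a predict call. Hypothetical helper."""
    if not 1 <= len(sequences) <= MAX_ITEMS:
        raise ValueError(f"expected 1 to {MAX_ITEMS} sequences, got {len(sequences)}")
    for seq in sequences:
        if not 1 <= len(seq) <= MAX_LENGTH:
            raise ValueError(f"sequence length {len(seq)} outside 1..{MAX_LENGTH}")
        bad = set(seq) - ALLOWED_RESIDUES
        if bad:
            raise ValueError(f"unsupported residue codes: {sorted(bad)}")
    unknown = set(include) - ALLOWED_INCLUDE
    if unknown:
        raise ValueError(f"unsupported include values: {sorted(unknown)}")

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Token {api_key}",
    }
    body = {
        "params": {"include": list(include)},
        "items": [{"sequence": seq} for seq in sequences],
    }
    return BASE_URL + PREDICT_PATH, headers, json.dumps(body)


if __name__ == "__main__":
    url, headers, body = build_predict_request(["MKTAYIAKQR"], include=["mean", "logits"])
    print(url)
    print(body)
```

Validating on the client side avoids spending a round trip on a request that would come back as 400 Bad Request.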
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
embeddings (array of floats, size: 1536, optional) — Mean embedding vector representing the entire protein sequence.
per_token_embeddings (array of arrays of floats, shape: [sequence_length, 1536], optional) — Embedding vectors for each residue position in the input protein sequence.
logits (array of arrays of floats, shape: [sequence_length, 29], optional) — Logit scores for each residue position over the 29-token amino acid vocabulary.
vocab_tokens (array of strings, length: 29, optional) — Ordered list of amino acid vocabulary tokens corresponding to the logits array.
results (array of objects) — One result per input item, in the order requested:
pdb (string) — Predicted 3D atomic structure in standard PDB format.
mean_plddt (float, range: 0.0–100.0) — Mean predicted Local Distance Difference Test (pLDDT) confidence score across all residues.
results (array of objects) — One result per input item, in the order requested:
log_prob (float, range: negative infinity to 0.0) — Log probability of the input sequence under the ESM3 model.
Example response:
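As an illustration of how the logits and vocab_tokens fields fit together, the sketch below converts each position's logits into probabilities with a softmax and reports the most probable vocabulary token. It assumes a result object shaped as described above; the function names and the toy values are made up for the example, and the toy vocabulary has 3 entries rather than 29.

```python
import math


def softmax(row):
    """Numerically stable softmax over one position's logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def top_tokens(result):
    """Return (token, probability) for the most probable token at each position."""
    vocab = result["vocab_tokens"]
    best_per_position = []
    for row in result["logits"]:
        if len(row) != len(vocab):
            raise ValueError("logits row length does not match vocab_tokens")
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        best_per_position.append((vocab[best], probs[best]))
    return best_per_position


if __name__ == "__main__":
    # Toy result: values are invented purely to exercise the code.
    toy_result = {
        "vocab_tokens": ["A", "C", "D"],
        "logits": [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]],
    }
    for token, prob in top_tokens(toy_result):
        print(token, round(prob, 3))
```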
Encode
Generate embeddings for input sequences
- POST /api/v3/esmc-300m/encode/
Encode endpoint for ESM C 300M.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
include (array of strings, default: [“mean”]) — Types of embeddings or logits to include in the response. Allowed values: “mean”, “per_token”, “logits”.
items (array of objects, min: 1, max: 5) — Input sequences:
sequence (string, min length: 1, max length: 2048, required) — Protein sequence using unambiguous amino acid codes.
Example request:
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
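The sketch below shows one way to send an encode request with Python's standard library and react to the three documented status codes. The host name is an assumption (it is not given on this page), and encode and describe_status are hypothetical helper names; treating 500 as retryable is a design choice of this sketch, not documented behaviour.

```python
import json
import urllib.error
import urllib.request

# Assumption: the host is not stated on this page; substitute your own base URL.
BASE_URL = "https://biolm.ai"
ENCODE_PATH = "/api/v3/esmc-300m/encode/"


def describe_status(code):
    """Map the documented status codes to a short, actionable message."""
    if code == 200:
        return "ok"
    if code == 400:
        return "invalid input: fix the request before retrying"
    if code == 500:
        return "internal server error: retrying later may succeed"
    return f"unexpected status {code}"


def encode(sequences, include=("mean",), api_key="YOUR_API_KEY", timeout=60):
    """POST sequences to the encode endpoint and return the results list."""
    payload = {
        "params": {"include": list(include)},
        "items": [{"sequence": seq} for seq in sequences],
    }
    request = urllib.request.Request(
        BASE_URL + ENCODE_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_key}",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))["results"]
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        raise RuntimeError(describe_status(err.code)) from err
```

A call such as encode(["MKTAYIAKQR"], include=["mean", "per_token"]) would then return one result object per input sequence, in request order.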
Response
results (array of objects) — One result per input item, in the order requested:
embeddings (array of floats, size: 1536, optional) — Mean protein embeddings for the input sequence.
per_token_embeddings (array of arrays of floats, shape: [sequence_length, 1536], optional) — Embeddings for each residue position in the input sequence.
logits (array of arrays of floats, shape: [sequence_length, 29], optional) — Logits for each residue position over the 29-token amino acid vocabulary.
vocab_tokens (array of strings, size: 29, optional) — Amino acid vocabulary tokens corresponding to the logits array.
Example response:
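One common use of the mean embedding is similarity search, for example transferring an annotation from the closest labelled sequence. The sketch below is a plain cosine-similarity lookup over already-retrieved embeddings; the function names and the two-dimensional toy vectors are illustrative only and are not part of the API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest_label(query, references):
    """Return (label, similarity) of the reference embedding closest to the query.

    references maps a label to its mean embedding.
    """
    scored = ((label, cosine_similarity(query, emb)) for label, emb in references.items())
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy vectors standing in for mean embeddings; the values are invented.
    references = {"kinase": [1.0, 0.1], "protease": [0.1, 1.0]}
    print(nearest_label([0.9, 0.2], references))
```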
Performance
ESM C 300M provides rapid inference for protein sequence embedding and representation tasks, optimized for high-throughput applications.
GPU-accelerated deployment on NVIDIA L4 GPUs ensures efficient processing, enabling typical completion times of approximately 2-4 seconds per batch.
Achieves comparable embedding quality to ESM-2 650M with significantly faster inference, making it ideal for large-scale sequence analysis and screening tasks.
For embedding and representation tasks, ESM C 300M outperforms smaller models such as ESM-2 150M and ESM-2 35M, providing improved accuracy and representational power.
While ESM C 300M is optimized for speed and throughput, larger models like ESM-2 650M and ESM3 Open Small may offer slightly higher accuracy at the cost of longer inference times.
BioLM’s optimized deployment architecture ensures consistent performance and scalability, enabling reliable integration into automated bioinformatics pipelines and large-scale protein engineering workflows.
Input: Protein sequences (single-letter amino acid codes); Output: Sequence embeddings, per-residue embeddings, sequence logits.
For structural prediction tasks, ESM C 300M embeddings can be used as inputs to downstream models such as ESMFold or AlphaFold2, providing efficient preprocessing and improved overall pipeline performance.
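The quoted figure of roughly 2 to 4 seconds per batch, combined with the limit of 5 sequences per request, gives a rough way to budget a large job. The sketch below assumes requests are sent one after another and that every batch falls within the quoted range; estimate_runtime is a hypothetical helper, and actual timings will vary with sequence length and load.

```python
import math


def estimate_runtime(n_sequences, batch_size=5, seconds_per_batch=(2.0, 4.0)):
    """Return (n_requests, low_seconds, high_seconds) for sequential requests."""
    if n_sequences < 0 or batch_size < 1:
        raise ValueError("n_sequences must be >= 0 and batch_size >= 1")
    n_requests = math.ceil(n_sequences / batch_size)
    low, high = seconds_per_batch
    return n_requests, n_requests * low, n_requests * high


if __name__ == "__main__":
    requests_needed, low, high = estimate_runtime(1000)
    print(f"{requests_needed} requests, about {low:.0f} to {high:.0f} seconds")
```

With the defaults, 1000 sequences need 200 requests, or roughly 400 to 800 seconds when sent sequentially.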
Applications
Designing novel fluorescent proteins with low sequence similarity to known variants, enabling researchers to create unique labeling tools for microscopy or biosensor applications; for example, generating new GFP variants with distinct spectral properties for multiplexed imaging experiments; not optimal for designing proteins requiring post-translational modifications beyond chromophore formation.
Engineering structurally diverse protein scaffolds around predefined functional motifs, enabling biotech companies to rapidly prototype therapeutic candidates or molecular sensors; for instance, creating novel ligand-binding proteins by scaffolding known active sites onto new protein architectures; not suitable for tasks requiring precise control over protein dynamics or conformational flexibility.
Generating protein sequences with specified secondary structure and solvent accessibility constraints, allowing researchers to design stable, globular proteins for therapeutic or industrial applications; for example, designing thermostable proteins for biocatalysis or robust scaffolds for peptide display; not optimal for predicting detailed protein-protein interaction interfaces without additional structural context.
Rational protein editing to alter specific structural or functional properties, enabling precise modifications such as exposing buried epitopes or compressing protein size; for instance, redesigning a protease to maintain catalytic activity while significantly reducing sequence length for improved manufacturability; limited applicability when extensive domain rearrangements or highly flexible regions must be engineered.
Predicting protein function keywords from sequence alone, providing rapid functional annotation for large-scale protein screening workflows; for example, enabling biotech companies to prioritize candidate proteins based on predicted enzymatic or binding activities without experimental characterization; less accurate for proteins with novel or poorly characterized functions lacking close homologs in training data.
Limitations
Maximum Sequence Length: The algorithm accepts sequences of up to 2048 amino acids. Longer sequences must be truncated or split into smaller segments.
Batch Size: The maximum batch size per request is 5 sequences. Larger batches must be divided into multiple requests.
Model Variant: Currently, only the open-small (1.4B parameters) variant is available, which may have lower accuracy and representational power compared to larger ESM3 variants described in the publication (7B and 98B).
Input Sequence Constraints: Input sequences must contain only unambiguous amino acid letters. Ambiguous or unknown residues are not supported and must be removed or replaced.
Prediction Accuracy: While ESM3-open delivers strong performance for general protein structure and function prediction tasks, it has reduced accuracy on viral proteins and select agent toxins due to intentional data filtering during training. For tasks specifically involving viral proteins or toxins, alternative algorithms or models may be more appropriate.
Output Types: The include parameter in encoding requests supports the options mean, per_token, and logits. When requesting logits, the response includes the additional fields logits and vocab_tokens. However, requesting logits significantly increases response size and processing time, so it should be used only when necessary.
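The length and batch-size limits above can be handled on the client side. The sketch below splits over-long sequences into consecutive, non-overlapping segments of at most 2048 residues and groups the segments into requests of at most 5 items. The helper names and the source/part bookkeeping fields are assumptions for illustration; only the sequence field would be sent to the API.

```python
def split_sequence(seq, max_len=2048):
    """Split a sequence into consecutive segments of at most max_len residues."""
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


def make_batches(sequences, max_len=2048, max_items=5):
    """Group all segments into batches of at most max_items.

    Each entry records which input it came from (source) and its position
    within that input (part), so results can be reassembled afterwards.
    """
    segments = []
    for source, seq in enumerate(sequences):
        for part, chunk in enumerate(split_sequence(seq, max_len)):
            segments.append({"source": source, "part": part, "sequence": chunk})
    return [segments[i:i + max_items] for i in range(0, len(segments), max_items)]


if __name__ == "__main__":
    inputs = ["M" * 5000] + ["ACDEFGHIK"] * 4
    for batch in make_batches(inputs):
        print([(item["source"], item["part"], len(item["sequence"])) for item in batch])
```

Non-overlapping segments lose context at the cut points; whether overlapping windows or truncation suits a task better depends on the downstream use.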
How We Use It
ESM C 300M enables researchers to accelerate protein engineering workflows by rapidly generating and optimizing novel protein sequences and structures. Its generative capabilities facilitate targeted exploration of sequence space, enabling iterative refinement and multi-round optimization of enzymes and antibodies. The model integrates seamlessly with predictive tools such as ESMFold, sequence embedding algorithms, and biophysical property predictors, allowing teams to efficiently identify and rank promising protein candidates for synthesis and lab testing.
Accelerates iterative protein design and optimization cycles
Integrates with predictive models and downstream ranking tools for targeted engineering outcomes
References
Hayes, T., Rao, R., Akin, H., Sofroniew, N. J., Oktay, D., Lin, Z., Verkuil, R., Tran, V. Q., Deaton, J., Wiggert, M., Badkundri, R., Shafkat, I., Gong, J., Derry, A., Molina, R. S., Thomas, N., Khan, Y., Mishra, C., Kim, C., Bartie, L. J., Nemeth, M., Hsu, P. D., Sercu, T., Candido, S., & Rives, A. (2024). Simulating 500 million years of evolution with a language model. bioRxiv.