ESM C 600M is a GPU-accelerated protein language model trained on evolutionary-scale protein sequence data (UniRef, MGnify, JGI), using a masked language modeling objective. With 600 million parameters (36 layers, 1152 hidden units, 18 attention heads), it generates biologically meaningful embeddings for protein representation learning. The API supports high-throughput embedding generation to facilitate protein design, functional annotation, and bioinformatics analysis workflows.
Predict¶
Predict masked residues in input sequences
- POST /api/v3/esmc-600m/predict/¶
Predict endpoint for ESM C 600M.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
repr_layers (array of integers, default: [-1]) — Indices of representation layers to return embeddings from
include (array of strings, default: ["mean"]) — Types of embeddings or logits to include; allowed values: "mean", "per_token", "logits"
items (array of objects, min: 1, max: 8) — Input sequences for prediction:
sequence (string, min length: 1, max length: 2048, required) — Protein sequence using extended amino acid codes plus the "<mask>" token; must contain at least one "<mask>" token
Example request:
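A minimal request sketch in Python (the biolm.ai host, the example sequence, and the BIOLM_API_KEY environment variable are illustrative assumptions; the path, headers, and payload fields follow the spec above):

```python
import os
import requests

# Endpoint path and headers as documented above; the host is an assumption.
url = "https://biolm.ai/api/v3/esmc-600m/predict/"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Token {os.environ['BIOLM_API_KEY']}",
}
# One item with a single masked position; the model predicts the residue at "<mask>".
payload = {
    "items": [
        {"sequence": "MSILVTRPSPAGEEL<mask>SRLRTLGQVAWHFPLIEF"}
    ]
}
resp = requests.post(url, headers=headers, json=payload, timeout=60)
resp.raise_for_status()
result = resp.json()
```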
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
logits (array of arrays of floats, shape: [num_masked_positions, vocab_size]) — Raw logits for masked positions; vocab_size: 33 (20 standard amino acids + additional tokens)
sequence_tokens (array of strings, size: sequence_length) — Tokenized input sequence including predicted tokens
vocab_tokens (array of strings, size: 33) — Vocabulary tokens corresponding to logits indices
Example response:
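An illustrative response shape (field names follow the schema above; all values are truncated placeholders, not real model output):

```python
# Hypothetical predict response for one input with a single masked position.
example_response = {
    "results": [
        {
            # shape: [num_masked_positions, vocab_size]; vocab_size = 33
            "logits": [[-3.21, 0.87, 4.52]],          # truncated: 33 floats per row
            # tokenized input sequence with the masked position filled in
            "sequence_tokens": ["M", "S", "I", "L"],  # truncated
            # 33 vocabulary tokens aligned with the logits columns
            "vocab_tokens": ["<cls>", "<pad>", "A"],  # truncated
        }
    ]
}
```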
Encode¶
Generate embeddings for input sequences
- POST /api/v3/esmc-600m/encode/¶
Encode endpoint for ESM C 600M.
- Request Headers:
Content-Type – application/json
Authorization – Token YOUR_API_KEY
Request
params (object, optional) — Configuration parameters:
repr_layers (array of integers, default: [-1]) — Layers from which embeddings are extracted
include (array of strings, default: ["mean"]) — Types of embeddings or logits to include; allowed values: "mean", "per_token", "logits"
items (array of objects, min: 1, max: 8) — Input sequences:
sequence (string, min length: 1, max length: 2048, required) — Protein sequence using the extended amino acid alphabet plus the "-" character
Example request:
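A request sketch mirroring the predict example (host and key handling are again assumptions; the params request the mean embedding from the last layer, matching the defaults above):

```python
import os
import requests

url = "https://biolm.ai/api/v3/esmc-600m/encode/"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Token {os.environ['BIOLM_API_KEY']}",
}
# Mean embedding from the last layer for a single sequence.
payload = {
    "params": {"repr_layers": [-1], "include": ["mean"]},
    "items": [{"sequence": "MSILVTRPSPAGEELVSRLRTLGQVAWHFPLIEF"}],
}
resp = requests.post(url, headers=headers, json=payload, timeout=60)
resp.raise_for_status()
encoded = resp.json()
```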
- Status Codes:
200 OK – Successful response
400 Bad Request – Invalid input
500 Internal Server Error – Internal server error
Response
results (array of objects) — One result per input item, in the order requested:
embeddings (array of objects, optional) — Mean embeddings per requested layer:
layer (int) — Layer index as requested (e.g. -1 for last layer)
embedding (array of floats, size: 960 for the 300m model, 1152 for the 600m model) — Mean embedding vector for the sequence
per_token_embeddings (array of objects, optional) — Per-token embeddings per requested layer:
layer (int) — Layer index as requested (e.g. -1 for last layer)
embeddings (array of arrays of floats, shape: [sequence_length, embedding_size]) — Embedding vectors for each token position in the sequence; embedding_size is 960 for the 300m model, 1152 for the 600m model
logits (array of arrays of floats, optional, shape: [sequence_length, vocab_size]) — Logit scores for each token position; vocab_size = 33 (20 standard amino acids + additional tokens); scores are unbounded floats
vocab_tokens (array of strings, optional, size: 33) — Vocabulary tokens corresponding to logits indices; included only when logits are requested
Example response:
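An illustrative response shape for the request above (values are placeholders; the 600m mean embedding vector contains 1152 floats):

```python
# Hypothetical encode response for one sequence with include=["mean"].
example_response = {
    "results": [
        {
            "embeddings": [
                {
                    "layer": -1,                           # last layer, as requested
                    "embedding": [0.0123, -0.0456, 0.78],  # truncated: 1152 floats for 600m
                }
            ]
        }
    ]
}
```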
Performance¶
ESM C 600M provides significantly improved inference speed and memory efficiency compared to previous-generation ESM-2 models:
Matches the predictive accuracy of the ESM-2 3B (3 billion parameter) model with approximately 5x fewer parameters.
Approaches the predictive performance of the far larger ESM-2 15B model, enabling high-quality predictions at a fraction of the computational cost.
GPU-accelerated inference on NVIDIA T4 GPUs ensures reliable, consistent performance for protein embedding and masked sequence prediction tasks.
Optimized transformer architecture (Pre-LN, rotary embeddings, SwiGLU activations, bias-free linear layers and layer norms) significantly reduces memory footprint and improves inference throughput relative to previous ESM architectures (see the schematic after this list).
Ideal for high-throughput protein representation learning, embedding generation, and masked residue prediction tasks, providing state-of-the-art accuracy with significantly lower computational overhead compared to larger ESM-2 models.
ESM C 600M offers superior scalability and throughput relative to BioLM’s ESM-2 650M model, delivering substantially higher predictive accuracy at comparable inference speeds.
Recommended for applications where near state-of-the-art predictive performance is required, but computational resources or inference latency constraints preclude the use of larger models such as ESM-2 3B or 15B.
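As a schematic of the SwiGLU feed-forward block named in the list above (weight names, shapes, and the 4x expansion are illustrative assumptions, not the model's actual implementation):

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Bias-free SwiGLU feed-forward: a gated up-projection followed by a
    # down-projection, with no bias terms, per the bullet above.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Toy shapes: hidden size 1152 (as documented for ESM C 600M).
d, d_ff = 1152, 4608
x = np.random.randn(10, d).astype(np.float32)
out = swiglu_ffn(
    x,
    np.random.randn(d, d_ff).astype(np.float32) * 0.02,
    np.random.randn(d, d_ff).astype(np.float32) * 0.02,
    np.random.randn(d_ff, d).astype(np.float32) * 0.02,
)
```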
Input and Output Types:
Encoding Tasks:
Input: Amino acid sequences (single-letter code, up to 2048 residues per sequence).
Output: Embeddings (mean or per-token representations), logits over amino acid vocabulary.
Masked Sequence Prediction Tasks:
Input: Amino acid sequences containing one or more masked positions ("<mask>" tokens).
Output: Predicted logits for masked positions, sequence tokens, and vocabulary tokens.
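For example, turning masked-position logits from the predict response into residue probabilities is a softmax over the vocabulary axis (a sketch; `result` is the parsed response from the predict example above):

```python
import numpy as np

res = result["results"][0]
logits = np.array(res["logits"])   # [num_masked_positions, 33]
vocab = res["vocab_tokens"]        # 33 token strings

# Numerically stable softmax over the vocabulary dimension.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# Most likely residue for each masked position.
predicted = [vocab[i] for i in probs.argmax(axis=-1)]
```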
Applications¶
Protein embedding generation to support downstream predictive modeling tasks, enabling researchers to rapidly screen protein libraries for desirable functional properties, such as thermostability or catalytic efficiency; useful in enzyme engineering pipelines but not optimal for direct antibody affinity maturation.
Unsupervised clustering of protein sequence space to identify novel functional families, facilitating discovery of previously unknown enzyme classes or protein scaffolds; valuable for biotech companies exploring novel biocatalysts but less effective for high-resolution structural predictions.
Representation learning for protein variant effect prediction, enabling accurate identification of beneficial mutations that enhance protein stability or activity; beneficial for protein engineering teams optimizing industrial enzymes, though not suitable for antibody specificity optimization.
Protein similarity search and retrieval based on learned embeddings, allowing rapid identification of homologous sequences with desired properties across large databases; practical for enzyme discovery and protein design workflows, but not recommended for precise antibody-antigen binding predictions (see the sketch after this list).
Generating protein sequence embeddings for downstream machine learning models aimed at predicting protein-protein interactions, helping biotech companies accelerate the design of novel protein therapeutics or scaffolds; effective for general protein engineering applications but limited in predicting highly specific antibody-antigen interactions.
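As a sketch of the embedding-based retrieval described above (assumes mean embeddings from the encode endpoint have been stacked into an [n, 1152] array; the function name is hypothetical):

```python
import numpy as np

def top_k_similar(query_vec, library, k=5):
    # Cosine similarity between one query embedding and a library of
    # mean embeddings; returns indices and scores of the top-k matches.
    q = query_vec / np.linalg.norm(query_vec)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    scores = lib @ q
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]
```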
Limitations¶
Maximum Sequence Length: Input sequences are limited to 2048 amino acids; longer sequences must be truncated or processed in segments (see the sketch after this list).
Batch Size: The maximum allowed batch size per request is 8 sequences; larger datasets must be split into multiple requests.
GPU Type: Inference is performed on T4 GPUs; performance may vary depending on computational complexity and batch size.
ESM C 600M is optimized for representation learning and embeddings; it does not perform generative tasks or structure prediction directly.
The model is trained exclusively on protein sequences; it may not accurately represent non-natural amino acids or sequences containing ambiguous residues.
For antibody-specific structural predictions (e.g., CDR3 loops), specialized models like NanobodyBuilder or ABodybuilder typically yield better accuracy than general-purpose models such as ESM C 600M.
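A sketch of working within these limits: split long sequences into windows of at most 2048 residues and submit items in batches of at most 8 (the window overlap is an assumption for downstream stitching, not an API feature):

```python
def windows(seq, size=2048, overlap=128):
    # Split a long sequence into overlapping segments of at most `size` residues.
    step = size - overlap
    return [seq[i:i + size] for i in range(0, max(len(seq) - overlap, 1), step)]

def batches(items, size=8):
    # Yield batches of at most `size` items, one batch per API request.
    for i in range(0, len(items), size):
        yield items[i:i + size]
```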
How We Use It¶
BioLM’s integration of ESM C 600M into its workflows accelerates protein engineering and research by enabling advanced protein sequence analysis and representation learning. The model supports effective protein design and optimization by capturing biologically meaningful properties directly from sequence. Combined with BioLM’s existing tools and pipelines, it lets researchers efficiently filter and rank candidate proteins, streamlining the design process. Its capabilities are further amplified when used alongside other BioLM services, such as sequence embeddings and predictive models, which together provide a robust framework for protein engineering.
ESM C 600M facilitates scalable and standardized API access, enabling rapid deployment in diverse bioengineering tasks.
The model’s integration within BioLM workflows supports multi-round protein optimization, enhancing experimental accuracy and reducing time to market for new bioproducts.
References¶
ESM Team (2024). ESM Cambrian: Revealing the mysteries of proteins with unsupervised learning. EvolutionaryScale Website. December 4, 2024.
