Free 2k Requests / mo.
Predict structure using Meta's SOTA language model.
Create viable sequences from the suite of models.
Mature molecules like antibodies and enzymes.
Learn how to use our AI tools.
Try out different models & apps.
Get started with BioLM
View our library of models
Learn more about the API
Use the Python SDK
Finetune models yourself
ESM1v N1 API ESM1v N2 API ESM1v N3 API
Learn how to perform an in-silico deep mutational scan with three of the five ESM-1v models. This method is suitable for finding the most likely mutations 1x - 3x distance from a parent sequence. For going beyond 3x mutations, other methods such as inverse-folding or finetuning a generative model are more appropriate and less time-bound.
While ESM2 has similar performance to ESM-1v, the latter is still a great option for performing an in silico deep mutational scan. ESM-1v is a collection of five models, trained on sequences from UniRef, each with different seeds. Ensembling these five models' scores can produce better results than a single ESM-1v model, for such broad topics like identifying stabilizing mutations.
Since these models were only trained on functional molecules, they are capable of assessing whether a new molecule might also be functional, or whether it has a disastrous mutation.
Imagine only learning and speaking English your whole life - could you identify whether a new English word is English? Probably. What about whether a Korean word is Korean? Or whether a French word is French? You might not be able to identify a Korean word as specifically Korean since you were only exposed to English; but you can tell that some foreign words are at least not-English. The idea is similar with this zero-shot model, ESM-1v: how likely is a sequence functional, since the model was only exposed to functional sequences.
In order to use the BioLM API, you need to have a token. You can get one from the User API Tokens page.
Paste the API token you generated in the cell below, as the value of the variable BIOLMAI_TOKEN.
BIOLMAI_TOKEN
BIOLMAI_TOKEN = " " # YOUR TOKEN GOES HERE!!!
We do this by masking each position and requesting the scores for each residue at that position. For a sequence of even a couple hundred residues, this is thousands of predictions from each of the 5 GPU-based models. BioLM has APIs to get these results at scale much faster than it would take to spin up these models on multiple GPUs alone.
# Install or import `requests` package try: # Install packages to make API requests in JLite import micropip await micropip.install('requests') await micropip.install('pyodide-http') # Patch requests for in-browser support import pyodide_http pyodide_http.patch_all() await micropip.install('pandas') await micropip.install('matplotlib') await micropip.install('seaborn') except ModuleNotFoundError: pass # Won't be using micropip outside of JLite import requests
# Let's use the sequence from the paper # https://www.biorxiv.org/content/10.1101/2021.07.09.450648v2.full.pdf WILDTYPE = "ASKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK" print("Sequence length: {}".format(len(WILDTYPE)))
Sequence length: 238
We can copy working Python code for making a request to BioLM.ai's ESM-1v API:
def esm1v_pred(seq, model_num=1): """Get un-masking predictions from any of the 5 ESM-1v models via REST API. From API docs: https://api.biolm.ai/#54d3367a-b5d1-4f18-aa44-23ae3914a446 """ SLUG = f'esm1v-n{model_num}' url = f'https://biolm.ai/api/v2/{SLUG}/predict/' assert '<mask>' in seq assert 5 >= model_num >= 1 data = { "items": [ { "sequence": seq } ] } headers = { "Content-Type": "application/json", "Authorization": f"Token {BIOLMAI_TOKEN.strip()}", } # Make the request! response = requests.post( url=url, headers=headers, json=data, ) return response
Now test our function on the first residue and first model.
result = esm1v_pred('<mask>' + WILDTYPE[1:], model_num=1) # First ESM-1v model pred = result.json() pred
{'results': [[{'token': 20, 'token_str': 'M', 'score': 0.90923011302948, 'sequence': 'M S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 4, 'token_str': 'L', 'score': 0.00893514882773161, 'sequence': 'L S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 7, 'token_str': 'V', 'score': 0.007254290860146284, 'sequence': 'V S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 6, 'token_str': 'G', 'score': 0.0071580721996724606, 'sequence': 'G S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 12, 'token_str': 'I', 'score': 0.006673584226518869, 'sequence': 'I S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 8, 'token_str': 'S', 'score': 0.00605518976226449, 'sequence': 'S S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 15, 'token_str': 'K', 'score': 0.005998397711664438, 'sequence': 'K S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 11, 'token_str': 'T', 'score': 0.00533581618219614, 'sequence': 'T S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 5, 'token_str': 'A', 'score': 0.00529194762930274, 'sequence': 'A S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 18, 'token_str': 'F', 'score': 0.005088561214506626, 'sequence': 'F S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 13, 'token_str': 'D', 'score': 0.005014363676309586, 'sequence': 'D S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 9, 'token_str': 'E', 'score': 0.004764892160892487, 'sequence': 'E S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 17, 'token_str': 'N', 'score': 0.004458808805793524, 'sequence': 'N S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 19, 'token_str': 'Y', 'score': 0.004456241149455309, 'sequence': 'Y S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 14, 'token_str': 'P', 'score': 0.0035370574332773685, 'sequence': 'P S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 10, 'token_str': 'R', 'score': 0.0032345277722924948, 'sequence': 'R S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 16, 'token_str': 'Q', 'score': 0.0027003914583474398, 'sequence': 'Q S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 21, 'token_str': 'H', 'score': 0.0017103749560192227, 'sequence': 'H S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 22, 'token_str': 'W', 'score': 0.0015011748764663935, 'sequence': 'W S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}, {'token': 23, 'token_str': 'C', 'score': 0.0014352151192724705, 'sequence': 'C S K G E E L F T G V V P I L V E L D G D V N G H K F S V S G E G E G D A T Y G K L T L K F I C T T G K L P V P W P T L V T T F S Y G V Q C F S R Y P D H M K R H D F F K S A M P E G Y V Q E R T I F F K D D G N Y K T R A E V K F E G D T L V N R I E L K G I D F K E D G N I L G H K L E Y N Y N S H N V Y I M A D K Q K N G I K V N F K I R H N I E D G S V Q L A D H Y Q Q N T P I G D G P V L L P D N H Y L S T Q S A L S K D P N E K R D H M V L L E F V T A A G I T H G M D E L Y K'}]]}
from IPython.display import JSON JSON(pred)
<IPython.core.display.JSON object>
We can see that "M", or Methionine is the most likely starting AA. Methionine is a common starting AA amongst many organisms.
"M"
Next, We can get predictions from the other ESM-1v models, then ensemble the likelihood scores for best results.
res = esm1v_pred('<mask>' + WILDTYPE[1:], model_num=2) # Second ESM-1v model pred = res.json() JSON(pred)
You can observe the probabilities for each token (amino acid residue) differing between the two models.
Loading the data into a DF is simple.
import pandas as pd
pd.DataFrame.from_dict(pred['results'][0], orient='columns')
Let's scan the whole sequence now, using several of the five ESM-1v models. This should be fairly fast sequentially. Let's simply loop through each model and position. For a final workflow, we would want to average predictions from all five models (the ESM-1v all endpoint does this directly), but for the sake of speed, we'll only use three here.
import copy results = [] def f(idx, seq, mdl): if idx % 20 == 0: print("Submitting twenty positions from {}/{} with ESM-1v Model {}".format(idx, len(seq), mdl)) # Mask that position seq_masked = list(copy.copy(seq)) seq_masked[idx] = '<mask>' seq_masked = ''.join(seq_masked) r = esm1v_pred(seq_masked, model_num=mdl) pred = r.json() # Request the probabilities of each AA at that mask r_df = pd.DataFrame.from_dict(pred['results'][0], orient='columns') r_df['model_num'] = mdl r_df['mask_pos'] = idx return r_df # We'll just query the first 3 models for now, that's enough for mdl in range(1, 4): ops = [f(i, WILDTYPE, mdl) for i in range(len(WILDTYPE))] results = ops
Submitting twenty positions from 0/238 with ESM-1v Model 1 Submitting twenty positions from 20/238 with ESM-1v Model 1 Submitting twenty positions from 40/238 with ESM-1v Model 1 Submitting twenty positions from 60/238 with ESM-1v Model 1 Submitting twenty positions from 80/238 with ESM-1v Model 1 Submitting twenty positions from 100/238 with ESM-1v Model 1 Submitting twenty positions from 120/238 with ESM-1v Model 1 Submitting twenty positions from 140/238 with ESM-1v Model 1 Submitting twenty positions from 160/238 with ESM-1v Model 1 Submitting twenty positions from 180/238 with ESM-1v Model 1 Submitting twenty positions from 200/238 with ESM-1v Model 1 Submitting twenty positions from 220/238 with ESM-1v Model 1 Submitting twenty positions from 0/238 with ESM-1v Model 2 Submitting twenty positions from 20/238 with ESM-1v Model 2 Submitting twenty positions from 40/238 with ESM-1v Model 2 Submitting twenty positions from 60/238 with ESM-1v Model 2 Submitting twenty positions from 80/238 with ESM-1v Model 2 Submitting twenty positions from 100/238 with ESM-1v Model 2 Submitting twenty positions from 120/238 with ESM-1v Model 2 Submitting twenty positions from 140/238 with ESM-1v Model 2 Submitting twenty positions from 160/238 with ESM-1v Model 2 Submitting twenty positions from 180/238 with ESM-1v Model 2 Submitting twenty positions from 200/238 with ESM-1v Model 2 Submitting twenty positions from 220/238 with ESM-1v Model 2 Submitting twenty positions from 0/238 with ESM-1v Model 3 Submitting twenty positions from 20/238 with ESM-1v Model 3 Submitting twenty positions from 40/238 with ESM-1v Model 3 Submitting twenty positions from 60/238 with ESM-1v Model 3 Submitting twenty positions from 80/238 with ESM-1v Model 3 Submitting twenty positions from 100/238 with ESM-1v Model 3 Submitting twenty positions from 120/238 with ESM-1v Model 3 Submitting twenty positions from 140/238 with ESM-1v Model 3 Submitting twenty positions from 160/238 with ESM-1v Model 3 Submitting twenty positions from 180/238 with ESM-1v Model 3 Submitting twenty positions from 200/238 with ESM-1v Model 3 Submitting twenty positions from 220/238 with ESM-1v Model 3
results[0] # Look at the first result DF we made
Concatenate the resulting DFs together, horizontally.
results_df = pd.concat(results, axis=0) results_df.shape
(4760, 6)
results_df.sample(5) # Sample some of the results to get a preview
Let's quickly fill in the WT character for each row so we can plot the results.
results_df['wt_token_str'] = results_df.mask_pos.apply(lambda x: WILDTYPE[x]) results_df.sample(3)
Lastly, make a DF displaying a heatmap of the most probable stabilizing variants from the above in silico deep mutational scan.
heatmap_df = [] # Each row represents the full sequence, with a single token_str at each position for residue in results_df.token_str.drop_duplicates().to_list(): tmp_row = [] for pos in range(len(WILDTYPE)): token_position_score = results_df.query('token_str == @residue & mask_pos == @pos').score.iloc[0] tmp_row.append(token_position_score) heatmap_df.append(tmp_row)
heatmap_df = pd.DataFrame(heatmap_df) heatmap_df.shape
(20, 238)
heatmap_df.head(1).iloc[0][0]
0.8598923683166504
from matplotlib import pyplot as plt import seaborn as sns fig = plt.figure(figsize=[15, 4]) ax = sns.heatmap(heatmap_df) _ = ax.set_xticklabels(list(WILDTYPE)[::int(len(WILDTYPE)/50)]) _ = ax.set_yticklabels(results_df.token_str.drop_duplicates()) _ = ax.set(xlabel='Wildtype Residue', ylabel='Variant')
This is a quick depiciton of the results. What you want to look for in the DF is mutations where the score is greater than the score of the wild-type residue at the same location. These can be extracted with some quick slicing of results_df, which we'll leave up to you to explore. There appear to be several mutations with greater likelihood that the wild-type sequence.
results_df
Check out additional tutorials at jupyter.biolm.ai, or head over to our BioLM Documentation to explore additional models and functionality.