Learn how to perform an in-silico deep mutational scan with three of the five ESM-1v models. This method is suitable for finding the most likely mutations 1x - 3x distance from a parent sequence. For going beyond 3x mutations, other methods such as inverse-folding or finetuning a generative model are more appropriate and less time-bound.
Predict Stabilizing Variants via Deep Mutational Scan¶
While ESM2 has similar performance to ESM-1v, the latter is still a great option for performing an in silico deep mutational scan. ESM-1v is a collection of five models, trained on sequences from UniRef, each with different seeds. Ensembling these five models' scores can produce better results than a single ESM-1v model, for such broad topics like identifying stabilizing mutations.
Since these models were only trained on functional molecules, they are capable of assessing whether a new molecule might also be functional, or whether it has a disastrous mutation.
Imagine only learning and speaking English your whole life - could you identify whether a new English word is English? Probably. What about whether a Korean word is Korean? Or whether a French word is French? You might not be able to identify a Korean word as specifically Korean since you were only exposed to English; but you can tell that some foreign words are at least not-English. The idea is similar with this zero-shot model, ESM-1v: how likely is a sequence functional, since the model was only exposed to functional sequences.
Set Your API Token¶
In order to use the BioLM API, you need to have a token. You can get one from the User API Tokens page.
Paste the API token you generated in the cell below, as the value
of the variable BIOLMAI_TOKEN
.
BIOLMAI_TOKEN = " " # YOUR TOKEN GOES HERE!!!
We do this by masking each position and requesting the scores for each residue at that position. For a sequence of even a couple hundred residues, this is thousands of predictions from each of the 5 GPU-based models. BioLM has APIs to get these results at scale much faster than it would take to spin up these models on multiple GPUs alone.
# Install or import `requests` package
try:
# Install packages to make API requests in JLite
import micropip
await micropip.install('requests')
await micropip.install('pyodide-http')
# Patch requests for in-browser support
import pyodide_http
pyodide_http.patch_all()
await micropip.install('pandas')
await micropip.install('matplotlib')
await micropip.install('seaborn')
except ModuleNotFoundError:
pass # Won't be using micropip outside of JLite
import requests
# Let's use the sequence from the paper
# https://www.biorxiv.org/content/10.1101/2021.07.09.450648v2.full.pdf
WILDTYPE = "ASKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
print("Sequence length: {}".format(len(WILDTYPE)))
We can copy working Python code for making a request to BioLM.ai's ESM-1v API:
def esm1v_pred(seq, model_num=1):
"""Get un-masking predictions from any of the 5 ESM-1v models via REST API.
From API docs: https://api.biolm.ai/#54d3367a-b5d1-4f18-aa44-23ae3914a446
"""
SLUG = f'esm1v-n{model_num}'
url = f'https://biolm.ai/api/v2/{SLUG}/predict/'
assert '<mask>' in seq
assert 5 >= model_num >= 1
data = {
"items": [
{
"sequence": seq
}
]
}
headers = {
"Content-Type": "application/json",
"Authorization": f"Token {BIOLMAI_TOKEN.strip()}",
}
# Make the request!
response = requests.post(
url=url,
headers=headers,
json=data,
)
return response
Now test our function on the first residue and first model.
result = esm1v_pred('<mask>' + WILDTYPE[1:], model_num=1) # First ESM-1v model
pred = result.json()
pred
from IPython.display import JSON
JSON(pred)
We can see that "M"
, or Methionine is the most likely starting AA. Methionine is a common starting AA amongst many organisms.
Next, We can get predictions from the other ESM-1v models, then ensemble the likelihood scores for best results.
res = esm1v_pred('<mask>' + WILDTYPE[1:], model_num=2) # Second ESM-1v model
pred = res.json()
JSON(pred)
You can observe the probabilities for each token (amino acid residue) differing between the two models.
Loading the data into a DF is simple.
import pandas as pd
pd.DataFrame.from_dict(pred['results'][0], orient='columns')
Let's scan the whole sequence now, using several of the five ESM-1v models. This should be fairly fast sequentially. Let's simply loop through each model and position. For a final workflow, we would want to average predictions from all five models (the ESM-1v all endpoint does this directly), but for the sake of speed, we'll only use three here.
import copy
results = []
def f(idx, seq, mdl):
if idx % 20 == 0:
print("Submitting twenty positions from {}/{} with ESM-1v Model {}".format(idx, len(seq), mdl))
# Mask that position
seq_masked = list(copy.copy(seq))
seq_masked[idx] = '<mask>'
seq_masked = ''.join(seq_masked)
r = esm1v_pred(seq_masked, model_num=mdl)
pred = r.json()
# Request the probabilities of each AA at that mask
r_df = pd.DataFrame.from_dict(pred['results'][0], orient='columns')
r_df['model_num'] = mdl
r_df['mask_pos'] = idx
return r_df
# We'll just query the first 3 models for now, that's enough
for mdl in range(1, 4):
ops = [f(i, WILDTYPE, mdl) for i in range(len(WILDTYPE))]
results = ops
results[0] # Look at the first result DF we made
Concatenate the resulting DFs together, horizontally.
results_df = pd.concat(results, axis=0)
results_df.shape
results_df.sample(5) # Sample some of the results to get a preview
Let's quickly fill in the WT character for each row so we can plot the results.
results_df['wt_token_str'] = results_df.mask_pos.apply(lambda x: WILDTYPE[x])
results_df.sample(3)
Lastly, make a DF displaying a heatmap of the most probable stabilizing variants from the above in silico deep mutational scan.
heatmap_df = [] # Each row represents the full sequence, with a single token_str at each position
for residue in results_df.token_str.drop_duplicates().to_list():
tmp_row = []
for pos in range(len(WILDTYPE)):
token_position_score = results_df.query('token_str == @residue & mask_pos == @pos').score.iloc[0]
tmp_row.append(token_position_score)
heatmap_df.append(tmp_row)
heatmap_df = pd.DataFrame(heatmap_df)
heatmap_df.shape
heatmap_df.head(1).iloc[0][0]
from matplotlib import pyplot as plt
import seaborn as sns
fig = plt.figure(figsize=[15, 4])
ax = sns.heatmap(heatmap_df)
_ = ax.set_xticklabels(list(WILDTYPE)[::int(len(WILDTYPE)/50)])
_ = ax.set_yticklabels(results_df.token_str.drop_duplicates())
_ = ax.set(xlabel='Wildtype Residue', ylabel='Variant')
This is a quick depiciton of the results. What you want to look for in the DF is mutations where the score is greater than the score of the wild-type residue at the same location. These can be extracted with some quick slicing of results_df
, which we'll leave up to you to explore. There appear to be several mutations with greater likelihood that the wild-type sequence.
Next Steps¶
Check out additional tutorials at jupyter.biolm.ai, or head over to our BioLM Documentation to explore additional models and functionality.
See more use-cases and APIs on your BioLM Console Catalog.¶
BioLM hosts deep learning models and runs inference at scale. You do the science.¶
Enzyme Engineering | Antibody Engineering | Biosecurity |
Single-Cell Genomics | DNA Sequence Modelling | Finetuning |
Contact us to learn more.¶