BioLMTox Toxin Similarity

Support biosecurity screening with the BioLMTox embedding endpoint: this guide embeds two toxin sequences and compares them by cosine similarity.

from IPython.display import JSON  # Helpful UI for JSON display
import time  # To time the API call
from biolmai import BioLM  # Client used to make calls to BioLM.ai
import csv  # To read example data
import numpy as np
from numpy.linalg import norm

lines = []
with open('data/protein/data/PLA2.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        lines.append(row)
print(lines)

[['label', 'sequence'], ['toxin', 'MHPAHLLVLLAVCVSLLGASDIPPLPLNLAQFGFMIRCANGGSRSPLDYTDYGCYCGKGGRGTPVDDLDRCCQVHDECYGEAEKRLGCSPFVTLYSWKCYGKAPSCNTKTDCQRFVCNCDAKAAECFARSPYQKKNWNINTKARCK'], ['toxin', 'MRTLWIMAVLLVGVEGSLVELGKMILQETGKNPVTSYGAYGCNCGVLGRGKPKDATDRCCYVHKCCYKKLTDCNPKKDRYSYSWKDKTIVCGENNSCLKELCECDKAVAICLRENLDTYNKKYKNNYLKPFCKKADPC']]

Take the two example toxin sequences from the parsed rows (row 0 is the header):

SEQ1 = lines[1][1]
SEQ2 = lines[2][1]

print("Sequence length 1: {}".format(len(SEQ1)))
print("Sequence length 2: {}".format(len(SEQ2)))

Sequence length 1: 146
Sequence length 2: 138
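The cells above select sequences by row and column position. As an alternative, the minimal sketch below reads the same two-column layout by column name with `csv.DictReader`, which drops the header row automatically. The in-memory CSV text, its short sequences, and the `load_sequences` helper are illustrative assumptions, not part of the original notebook.

```python
import csv
import io

# Stand-in for the CSV file: same 'label,sequence' header as the example data,
# but with short made-up sequences (assumption, for illustration only).
SAMPLE_CSV = "label,sequence\ntoxin,MHPAHLL\ntoxin,MRTLWIM\n"


def load_sequences(handle):
    """Return (label, sequence) pairs from a CSV with 'label' and 'sequence' columns."""
    reader = csv.DictReader(handle)
    return [(row["label"], row["sequence"]) for row in reader]


records = load_sequences(io.StringIO(SAMPLE_CSV))
sequences = [seq for _, seq in records]
print(sequences)
```

With a real file, the same helper could be called on the handle returned by `open(path, newline='')`.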

SEQ1 is UniProt entry Q45Z47 (https://www.uniprot.org/uniprotkb/Q45Z47/entry) and SEQ2 is UniProt entry P82114 (https://www.uniprot.org/uniprotkb/P82114/entry). Both are snake venom proteins.

Define Endpoint Params

Let's send both sequences to the BioLM API in a single secure REST request. The model runs on GPU, so the embeddings come back quickly.

# Make the request - let's time it!
start = time.time()
result = BioLM(entity="biolmtox2", action="encode", type="sequence", items=[SEQ1, SEQ2])
end = time.time()
print(f"BioLMTox prediction took {end - start:.4f} seconds.")

# If you wish to view the full result, you can expand the tree in the cell below
JSON(result)

BioLMTox prediction took 0.5072 seconds.

<IPython.core.display.JSON object>

The response is a list with one entry per input sequence. Each entry holds a `mean_representation` embedding vector of fixed length 640.
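Before computing similarities, it can help to confirm that the response has the expected shape. The sketch below stacks the embeddings into one NumPy matrix and checks the dimension. It runs on a mock response built from random numbers, so `mock_result`, `stack_embeddings`, and `EMBED_DIM` are hypothetical names; only the `mean_representation` key and the length of 640 come from this guide.

```python
import numpy as np

EMBED_DIM = 640  # embedding length stated in this guide

# Hypothetical stand-in for the API response: one dict per input sequence,
# each with a "mean_representation" vector (random values, not real embeddings).
rng = np.random.default_rng(0)
mock_result = [
    {"mean_representation": rng.normal(size=EMBED_DIM).tolist()}
    for _ in range(2)
]


def stack_embeddings(result, dim=EMBED_DIM):
    """Stack per-sequence embeddings into an (n_sequences, dim) float array."""
    matrix = np.asarray(
        [item["mean_representation"] for item in result], dtype=float
    )
    if matrix.ndim != 2 or matrix.shape[1] != dim:
        raise ValueError(f"expected shape (n, {dim}), got {matrix.shape}")
    return matrix


embeddings = stack_embeddings(mock_result)
print(embeddings.shape)
```

Passing the real `result` in place of `mock_result` should work the same way, assuming the response layout described above.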

# Define similarity measure
def cos_similarity(a, b):
    return np.dot(a,b)/(norm(a)*norm(b))

# convert sequence embeddings to numpy arrays
em_1 = np.asarray(result[0]["mean_representation"])
em_2 = np.asarray(result[1]["mean_representation"])

# compute similarity measures
em_similarity = cos_similarity(em_1, em_2)
print(f'sequence embedding cosine similarity:\n{em_similarity}')

sequence embedding cosine similarity:
0.9651337988386686

The cosine similarity between the two toxin embeddings is high (about 0.965), as expected: one sequence is Phospholipase A2 OS2 and the other is Basic phospholipase A2 homolog MjTX-I. Both belong to the phospholipase A2 family and come from snake venom.
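When more than two sequences are embedded, the same measure can be computed for every pair at once. The sketch below normalizes each row to unit length and takes a matrix product, which gives the full cosine similarity matrix. The `cosine_similarity_matrix` helper and the small 2-D demo vectors are illustrative assumptions chosen so the result is easy to check by hand; they are not BioLMTox embeddings.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Return the (n, n) matrix of pairwise cosine similarities between rows."""
    x = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("cosine similarity is undefined for zero vectors")
    unit = x / norms       # scale every row to unit length
    return unit @ unit.T   # dot products of unit vectors are cosines


# Toy vectors (assumption): two orthogonal axes and their diagonal.
demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sim = cosine_similarity_matrix(demo)
print(np.round(sim, 3))
```

Entry `sim[i, j]` matches what `cos_similarity` returns for rows `i` and `j`, so the diagonal is 1 and the matrix is symmetric.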

Next Steps

Check out additional tutorials at jupyter.biolm.ai, or head over to our BioLM Documentation to explore additional models and functionality.
