{
"cells": [
{
"cell_type": "markdown",
"id": "9f9b5d3b-8fb0-4c4c-a7e9-32b0faa736df",
"metadata": {},
"source": [
"# Exploring related sequences and structures\n",
"\n",
"Ensuring an effective and global response to potential pandemic emergencies requires the development of broad-spectrum antiviral therapeutics, capable of targeting multiple protein variants. Advancing these therapeutics requires designing ligands that can effectively interact with multiple protein targets. \n",
"In response to this need, we have developed the `drugforge-spectrum` tool. This tool simplifies the design of broad-spectrum antivirals by enabling users to: 1) search for all the related sequences of a protein target of interest, and 2) generate and align the structures of these related targets for further visualization and testing."
]
},
{
"cell_type": "markdown",
"id": "9092aeac",
"metadata": {},
"source": [
"\n",
"This tutorial will guide you through the process of querying a protein sequence on BLAST to find related proteins and align their sequences and structures.\n",
"The notebook is divided as follows:\n",
"\n",
"1. Runing a BLAST search on a query protein sequence.\n",
"\n",
"2. Performing a multi-sequence alignment between the reference and a sub-set of the matching sequences, based on a selection criteria.\n",
"\n",
"3. Preparing a job on ColabFold to generate structures of the related proteins.\n",
"\n",
"4. Generating a PyMol session with the structures of reference and related proteins.\n",
"\n",
"5. How to do all the above steps in the command-line!"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "99d35e83",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import pandas as pd\n",
"\n",
"from drugforge.spectrum.blast import get_blast_seqs\n",
"from drugforge.spectrum.seq_alignment import Alignment, do_MSA\n",
"\n",
"from drugforge.data.testing.test_resources import fetch_test_file # fetch test files\n",
"\n",
"import warnings\n",
"warnings.simplefilter(\"ignore\", ResourceWarning)\n",
"from Bio import BiopythonDeprecationWarning\n",
"warnings.simplefilter(\"ignore\", BiopythonDeprecationWarning)\n",
"\n",
"# Print info logs\n",
"import logging\n",
"logging.basicConfig(\n",
" level=logging.INFO, \n",
" format='%(levelname)s: %(message)s', \n",
" force=True \n",
")"
]
},
{
"cell_type": "markdown",
"id": "20e0d4b7",
"metadata": {},
"source": [
"We will save our results from this run on a separate directory, because we expect a lot of output files! "
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "04b7bb09",
"metadata": {},
"outputs": [],
"source": [
"output_dir = \"tutorial_results\"\n",
"output_dir = Path(output_dir)\n",
"output_dir.mkdir(parents=True, exist_ok=True) "
]
},
{
"cell_type": "markdown",
"id": "e35d3ecd",
"metadata": {},
"source": [
"## BLAST search on a query protein sequence"
]
},
{
"cell_type": "markdown",
"id": "5cfd8a00",
"metadata": {},
"source": [
"We start with an input fasta file, containing the sequence of the protein of interest. For example, the fasta `sars-cov2-mpro.fasta` contains the sequence of one of the chains in SARS-CoV-2 Mpro."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "103ac266",
"metadata": {},
"outputs": [],
"source": [
"input_fasta = output_dir/\"sars-cov2-mpro.fasta\"\n",
"\n",
"with open(input_fasta, \"w\") as f:\n",
" f.write(\">SARS-CoV-2 Mpro Chain A, B\\n\")\n",
" f.write(\n",
" \"SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIR\" \\\n",
" \"KSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNG\" \\\n",
" \"SPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGN\" \\\n",
" \"FYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYE\" \\\n",
" \"PLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQ\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "69f1c0c9",
"metadata": {},
"source": [
"Now, time to run the BLAST search! \n",
"Our code call the `Bio.Blast.NCBIWWW` module from BioPython, which queries from the NCBI BLAST server. There's a few parameters from BLAST we are interested in, because they control the number of matches we get from our search:\n",
"\n",
"- `nalign`: The number of database sequences BLAST will show alignments for.\n",
"- `nhits`: The number of hit sequences to return.\n",
"- `e_val`: Expect value (E) for saving hits.\n",
"\n",
"In general, we want to maximize the number of sequence matches! In the next step we will learn how to apply other filters to our BLAST output, but we wold like to have as big of a pool to select from as possible. \n",
"\n",
"We will use the defaults `nalign=500` and `nhits=100` for this tutorial to limit the size of our output, but we recommend using >1000 for the alignments and a similar number for the hits for production runs (by definition, keep `nalign`>`nhits`). The BLAST default for `e_val` is 10, which is a good value for exploratory searches. You can improve the quality of your search by decreasing the expect value, at the risk of limiting the size of your output. "
]
},
{
"cell_type": "markdown",
"id": "c65bed62",
"metadata": {},
"source": [
"We will also save the output of the BLAST run to a CSV file for book-keeping."
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "c46d35ea",
"metadata": {},
"outputs": [],
"source": [
"save_csv = \"blast_log.csv\""
]
},
{
"cell_type": "markdown",
"id": "4a949aa5",
"metadata": {},
"source": [
"And now we are ready to run our search! \n",
"_(This should take a few minutes because we have to submit and retrieve our search to/from the BLAST servers)_"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "54bf7992-718e-45cd-94ee-f198baa44d4a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"BLAST search with 500 alignments, expect 10, 100 hitlist_size and 100 descriptions\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" query | \n",
" ID | \n",
" description | \n",
" sequence | \n",
" host | \n",
" organism | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" SARS-CoV-2 Mpro Chain A, B | \n",
" ref|YP_009725295.1| | \n",
" ORF1a polyprotein [Severe acute respiratory sy... | \n",
" SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS... | \n",
" | \n",
" | \n",
" 100.00 | \n",
"
\n",
" \n",
" | 1 | \n",
" SARS-CoV-2 Mpro Chain A, B | \n",
" ref|YP_009724389.1| | \n",
" ORF1ab polyprotein [Severe acute respiratory s... | \n",
" SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS... | \n",
" | \n",
" | \n",
" 100.00 | \n",
"
\n",
" \n",
" | 2 | \n",
" SARS-CoV-2 Mpro Chain A, B | \n",
" ref|YP_009725301.1| | \n",
" 3C-like proteinase [Severe acute respiratory s... | \n",
" SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS... | \n",
" | \n",
" | \n",
" 100.00 | \n",
"
\n",
" \n",
" | 3 | \n",
" SARS-CoV-2 Mpro Chain A, B | \n",
" ref|YP_009944365.1| | \n",
" ORF1a polyprotein [SARS coronavirus Tor2] | \n",
" SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTA... | \n",
" | \n",
" | \n",
" 96.08 | \n",
"
\n",
" \n",
" | 4 | \n",
" SARS-CoV-2 Mpro Chain A, B | \n",
" ref|NP_828849.7| | \n",
" ORF1ab polyprotein [SARS coronavirus Tor2] | \n",
" SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTA... | \n",
" | \n",
" | \n",
" 96.08 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" query ID \\\n",
"0 SARS-CoV-2 Mpro Chain A, B ref|YP_009725295.1| \n",
"1 SARS-CoV-2 Mpro Chain A, B ref|YP_009724389.1| \n",
"2 SARS-CoV-2 Mpro Chain A, B ref|YP_009725301.1| \n",
"3 SARS-CoV-2 Mpro Chain A, B ref|YP_009944365.1| \n",
"4 SARS-CoV-2 Mpro Chain A, B ref|NP_828849.7| \n",
"\n",
" description \\\n",
"0 ORF1a polyprotein [Severe acute respiratory sy... \n",
"1 ORF1ab polyprotein [Severe acute respiratory s... \n",
"2 3C-like proteinase [Severe acute respiratory s... \n",
"3 ORF1a polyprotein [SARS coronavirus Tor2] \n",
"4 ORF1ab polyprotein [SARS coronavirus Tor2] \n",
"\n",
" sequence host organism score \n",
"0 SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS... 100.00 \n",
"1 SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS... 100.00 \n",
"2 SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS... 100.00 \n",
"3 SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTA... 96.08 \n",
"4 SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTA... 96.08 "
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"matches_df = get_blast_seqs(\n",
" input_fasta,\n",
" output_dir,\n",
" input_type=\"fasta\",\n",
" save_csv=save_csv,\n",
" nalign=500,\n",
" nhits=100,\n",
" e_val_thresh=10,\n",
" verbose=False,\n",
" xml_file=\"results.xml\"\n",
")\n",
"matches_df.head()"
]
},
{
"cell_type": "markdown",
"id": "ef72d6f5",
"metadata": {},
"source": [
"Alternatively, we could already have a PDB file with our reference's crystal structure. We can initialize the BLAST search with the PDB file and generate a `fasta` file from our PDB structure, by changing the `input_type` to `\"pdb\"`. \n",
" \n",
"_**WARNING:** This run will take longer than the previous one, because BLAST discourages multiple queries of the same sequence within a short time frame. You can skip this cell if you'd like, the rest of the notebook will run!_"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6cfe00d5",
"metadata": {},
"outputs": [],
"source": [
"input_pdb = fetch_test_file(\"Mpro-P2660_0A_bound.pdb\") # SARS-Cov-2 MPRO structure\n",
"'''\n",
"matches_df_pdb = get_blast_seqs(\n",
" input_pdb,\n",
" output_dir,\n",
" input_type=\"pdb\",\n",
" save_csv=save_csv,\n",
" nalign=500,\n",
" nhits=100,\n",
" e_val_thresh=10,\n",
" verbose=False,\n",
" xml_file=\"results_pdb.xml\"\n",
")\n",
"matches_df_pdb.head()\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "70307791",
"metadata": {},
"source": [
"## Aligning the sequences from our BLAST search\n",
"Now that we have a list of protein sequencies that are related to our protein of interest, we want to visualize the similarities between the sequences. We will do this by aligning and visualizing multiple sequences from the BLAST output."
]
},
{
"cell_type": "markdown",
"id": "48a26094",
"metadata": {},
"source": [
"### Filtering our sequences based on a keyword match\n",
"The number of sequences returned by BLAST may get quite large, making it hard to find similarities between the protein sequences. Additionally, we may not even be interested in all of these! Some entries often share the same sequence (for example, diferent chains of the same protein), and others may simply not be relevant to us. For example, we could be interested in viral proteins that infect humans or we could be interested in MERS-CoV variants specifically. \n",
"\n",
"Information about a given sequence record can be found in the description provided by BLAST. The easiest way to filter our sequence is then to search for a specific keyword in this description, such as \"human\" or \"MERS\"."
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "50c83e04",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 ORF1a polyprotein [Severe acute respiratory sy...\n",
"1 ORF1ab polyprotein [Severe acute respiratory s...\n",
"2 3C-like proteinase [Severe acute respiratory s...\n",
"3 ORF1a polyprotein [SARS coronavirus Tor2]\n",
"4 ORF1ab polyprotein [SARS coronavirus Tor2]\n",
" ... \n",
"95 ORF1ab polyprotein [Lucheng Rn rat coronavirus]\n",
"96 nsp5 [Canada goose coronavirus]\n",
"97 ORF 1ab polyprotein [Beluga whale coronavirus ...\n",
"98 polyprotein ORF1a [Camel alphacoronavirus]\n",
"99 polyprotein ORF1ab [Camel alphacoronavirus]\n",
"Name: description, Length: 100, dtype: object"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"matches_df['description']"
]
},
{
"cell_type": "markdown",
"id": "b52d6853",
"metadata": {},
"source": [
"Once we decide the keyword for our filter, we proceed to align sequences based on our filter. For this we define an `Alignment` object, which will take care of both tasks, given a query sequence (_\"sars-cov2-mpro\"_) and the DataFrame we obtained from the BLAST search:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "b0e4500b",
"metadata": {},
"outputs": [],
"source": [
"query = matches_df[\"query\"][0]\n",
"alignment = Alignment(matches_df, query, output_dir)"
]
},
{
"cell_type": "markdown",
"id": "850b062c",
"metadata": {},
"source": [
"From this alignment object, we will filter and align the resulting sequences. The MAFFT multi-sequence aligment tool will take care of the latter. Both steps will be run by the `do_MSA()` function, which takes the `Alignment` object and the selection parameter. We also set the parameter `n_chains=1`, because our PDB structure contains only one chain. This parameter is not relevant for the visualization of the sequences (all BLAST entries correspond to a single chain), but will be useful for the next structure-alignment step of this tutorial. "
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "5f923dd9",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: A fasta file tutorial_results/sars_tutorial.fasta have been generated with the selected sequences\n",
"INFO: A fasta file tutorial_results/sars_tutorial_alignment.fasta have been generated with the multi-seq alignment\n",
"INFO: A csv file tutorial_results/sars_tutorial.csv have been generated with the selected sequences\n",
"INFO: The multi-sequence alignment returns the following matches:\n",
"INFO: group: 181/307\n",
"INFO: exact: 89/307\n",
"INFO: none: 37/307\n",
"INFO: Aligning 5 sequences of lenght 307\n",
"INFO: A html file tutorial_results/sars_tutorial_alignment.html have been generated with the aligned sequences\n"
]
}
],
"source": [
"sel_key = \"human\"\n",
"file_prefix = \"sars_tutorial\"\n",
"alignment = do_MSA(\n",
" alignment=alignment, \n",
" select_mode=sel_key, \n",
" file_prefix=file_prefix, \n",
" plot_width=2000, \n",
" n_chains=2,\n",
" color_by_group=True,\n",
" start_alignment_idx=1,\n",
" max_mismatch=2,\n",
" custom_order=\"\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "4a599c3f",
"metadata": {},
"source": [
"This step should be completed pretty fast! You should see the following files in your `output_dir`:\n",
"- A `fasta` file with the ID, description and sequence of the entries selected via the provided keyword.\n",
"- An `.html` file with a colored representation of the multi-sequence alignment (mismatches in red, and group matches up to `max_mismatches` in yellow). The width of the plot can be adjusted via `plot_width` option.\n",
"- A `fasta` file with the aligned sequences (with “-” representing gaps inserted)\n",
"- A CSV file with the ID and sequence of the entries selected via the keyword. We will use this output in the next step of this tutorial."
]
},
{
"cell_type": "markdown",
"id": "8f72e391",
"metadata": {},
"source": [
"### Sequences can also be selected based on the virus host, using the Entrez database\n",
"\n",
"Entrez is a molecular biology database system that provides integrated access to nucleotide and protein sequence data. Importantly, it provides information on the organism and host of a given viral protein sequence. We are now going to query the Entrez database to filter our sequences based on viral host information. "
]
},
{
"cell_type": "markdown",
"id": "a2bb34c1",
"metadata": {},
"source": [
"We retrieve Entrez host information when we process the BLAST search, and the _host_ and _organism_ information will be included in the BLAST CSV output. In the previous call to the `get_blast_seqs()` function, however, no query was made to the Entrez database. The function will perform the database search only if a valid email address is provided via the `email` parameter. This email address is required by the Entrez API, and is used as way of controlling how much a single user queries the database. \n",
"\n",
"We have to re-run the function, but there is not need to run the entire BLAST search from scratch! If you look through your results folder, you will notice a new file `results.xml` was saved. This file contains a backup of the most recent BLAST search and can be used to retrieve the results from the run. This is really convenient as BLAST will take longer each time we submit the (same) sequence, so after a few queries the run will take 10+ mins. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fa524fb5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" query | \n",
" ID | \n",
" description | \n",
" sequence | \n",
" host | \n",
" organism | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" SARS-CoV-2 Mpro Chain A, B | \n",
" ref|YP_009725295.1| | \n",
" ORF1a polyprotein [Severe acute respiratory sy... | \n",
" SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS... | \n",
" Homo sapiens | \n",
" Severe acute respiratory syndrome coronavirus 2 | \n",
" 100.00 | \n",
"
\n",
" \n",
" | 1 | \n",
" SARS-CoV-2 Mpro Chain A, B | \n",
" ref|YP_009724389.1| | \n",
" ORF1ab polyprotein [Severe acute respiratory s... | \n",
" SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS... | \n",
" Homo sapiens | \n",
" Severe acute respiratory syndrome coronavirus 2 | \n",
" 100.00 | \n",
"
\n",
" \n",
" | 2 | \n",
" SARS-CoV-2 Mpro Chain A, B | \n",
" ref|YP_009725301.1| | \n",
" 3C-like proteinase [Severe acute respiratory s... | \n",
" SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS... | \n",
" Homo sapiens | \n",
" Severe acute respiratory syndrome coronavirus 2 | \n",
" 100.00 | \n",
"
\n",
" \n",
" | 3 | \n",
" SARS-CoV-2 Mpro Chain A, B | \n",
" ref|YP_009944365.1| | \n",
" ORF1a polyprotein [SARS coronavirus Tor2] | \n",
" SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTA... | \n",
" Homo sapiens; patient #2 with severe acute res... | \n",
" SARS coronavirus Tor2 | \n",
" 96.08 | \n",
"
\n",
" \n",
" | 4 | \n",
" SARS-CoV-2 Mpro Chain A, B | \n",
" ref|NP_828849.7| | \n",
" ORF1ab polyprotein [SARS coronavirus Tor2] | \n",
" SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTA... | \n",
" Homo sapiens; patient #2 with severe acute res... | \n",
" SARS coronavirus Tor2 | \n",
" 96.08 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" query ID \\\n",
"0 SARS-CoV-2 Mpro Chain A, B ref|YP_009725295.1| \n",
"1 SARS-CoV-2 Mpro Chain A, B ref|YP_009724389.1| \n",
"2 SARS-CoV-2 Mpro Chain A, B ref|YP_009725301.1| \n",
"3 SARS-CoV-2 Mpro Chain A, B ref|YP_009944365.1| \n",
"4 SARS-CoV-2 Mpro Chain A, B ref|NP_828849.7| \n",
"\n",
" description \\\n",
"0 ORF1a polyprotein [Severe acute respiratory sy... \n",
"1 ORF1ab polyprotein [Severe acute respiratory s... \n",
"2 3C-like proteinase [Severe acute respiratory s... \n",
"3 ORF1a polyprotein [SARS coronavirus Tor2] \n",
"4 ORF1ab polyprotein [SARS coronavirus Tor2] \n",
"\n",
" sequence \\\n",
"0 SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS... \n",
"1 SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS... \n",
"2 SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTS... \n",
"3 SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTA... \n",
"4 SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTA... \n",
"\n",
" host \\\n",
"0 Homo sapiens \n",
"1 Homo sapiens \n",
"2 Homo sapiens \n",
"3 Homo sapiens; patient #2 with severe acute res... \n",
"4 Homo sapiens; patient #2 with severe acute res... \n",
"\n",
" organism score \n",
"0 Severe acute respiratory syndrome coronavirus 2 100.00 \n",
"1 Severe acute respiratory syndrome coronavirus 2 100.00 \n",
"2 Severe acute respiratory syndrome coronavirus 2 100.00 \n",
"3 SARS coronavirus Tor2 96.08 \n",
"4 SARS coronavirus Tor2 96.08 "
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"input_presaved = output_dir / \"results.xml\" \n",
"\n",
"matches_df = get_blast_seqs(\n",
" input_presaved,\n",
" output_dir,\n",
" input_type=\"pre-calc\",\n",
" save_csv=save_csv,\n",
" verbose=False,\n",
" email=\"your.email@email.com\" # Uncomment with your own email!!\n",
")\n",
"matches_df.head()"
]
},
{
"cell_type": "markdown",
"id": "9dc30b3a",
"metadata": {},
"source": [
"To select based on Entrez information we will use a specific format:\n",
" \n",
"`host: OR organism: `\n",
" \n",
"For example, `host: Homo sapiens OR organism: human` is a valid `sel_key` input that will look for sequences that infect humans by making sure either the corresponding *host* or *organism* entry contain the human keywords. **It is important to specify both** because sometimes Entrez entries do not contain host information but will specify the organism. \n",
"Let's run it!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6afeb41b",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: A fasta file tutorial_results/sars_tutorial_human_host.fasta have been generated with the selected sequences\n",
"INFO: A fasta file tutorial_results/sars_tutorial_human_host_alignment.fasta have been generated with the multi-seq alignment\n",
"INFO: A csv file tutorial_results/sars_tutorial_human_host.csv have been generated with the selected sequences\n",
"INFO: The multi-sequence alignment returns the following matches:\n",
"INFO: group: 128/310\n",
"INFO: exact: 86/310\n",
"INFO: none: 96/310\n",
"INFO: Aligning 7 sequences of lenght 310\n",
"INFO: A html file tutorial_results/sars_tutorial_human_host_alignment.html have been generated with the aligned sequences\n"
]
}
],
"source": [
"query = matches_df[\"query\"][0]\n",
"alignment = Alignment(matches_df, query, output_dir)\n",
"\n",
"sel_key = \"host: Homo sapiens OR organism: human\"\n",
"file_prefix = \"sars_tutorial_human_host\"\n",
"alignment = do_MSA(\n",
" alignment=alignment, \n",
" select_mode=sel_key, \n",
" file_prefix=file_prefix, \n",
" plot_width=2000, \n",
" n_chains=2,\n",
" color_by_group=True,\n",
" start_alignment_idx=1,\n",
" max_mismatch=2,\n",
" custom_order=\"\"\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "0fc8d66a",
"metadata": {},
"source": [
"## Using ColabFold to generate PDB structures of our sequence matches\n",
"\n",
"So far we have went from a reference protein sequence, to find all related viral proteins that infect humans, and visualizing how much the sequences relate to each other. To analyze how these differences translate into structural differences, we will feed the sequences into [ColabFold](https://github.com/sokrypton/ColabFold/tree/main?tab=readme-ov-file) to generate their corresponding structures with respect to a common template (in this case, the initial reference protein). Then, we will align and visualize the generated sequences with the reference and save a PyMOL session that will allow us to point to specific structural differences. "
]
},
{
"cell_type": "markdown",
"id": "03ab282e",
"metadata": {},
"source": [
"You can install ColabFold in your local machine by following the instructions [here](https://github.com/YoshitakaMo/localcolabfold). The command-line `colabfold_batch` program takes as input a CSV file containing the ID and sequence of the structures we wish to fold, as well as the path to the directory all results will be stored. \n",
"Optionally, we can also provide ColabFold with the PDB of a structure to use as template, as well as adjust the number of times the folded structure will be recycled, and the number of repetitions we want to generate (_e.g._, for improving the statistics of our model). \n",
"\n",
"The fold model can also be specified via `--model-type`. For monomers, _\"alphafold2_ptm\"_ is often appropiate, while multimers (`n_chains`>1 in the previous step) can be folded using the _\"alphafold2_multimer_v3\"_ model. We will run a job with 3 recycles, a single repetition and a monomer model."
]
},
{
"cell_type": "markdown",
"id": "b8fd0d2f",
"metadata": {},
"source": [
"The following is an example bash script for running ColabFold with the CSV file generated in the previous step, and the reference PDB saved in the directory `template_dir/`:\n",
"\n",
" csv_fn=\"/sars_tutorial_human_host.csv\"\n",
" template_dir=\"template_dir\" # Ref pdb must be saved here\n",
" out_dir=\"sars_cf_results\"\n",
" \n",
" colabfold_batch $csv_fn $out_dir \\\n",
" --templates --custom-template-path $template_dir \\\n",
" --num-recycle 3 --num-models 1 --model-type \"alphafold2_multimer_v3\""
]
},
{
"cell_type": "markdown",
"id": "1383f919",
"metadata": {},
"source": [
"## Saving the aligned reference and related sequences on a PyMol session"
]
},
{
"cell_type": "markdown",
"id": "c5398b21",
"metadata": {},
"source": [
"The following section assumes you have completed the run on ColabFold and have generated structures for all the sequences saved in the `.csv` file, created previously."
]
},
{
"cell_type": "markdown",
"id": "5d00f472",
"metadata": {},
"source": [
"We will have all the files generated by ColabFold in a folder, for each molecule in the CSV file. If you chose to generate multiple repetitions of each protein fold (via the `-random-seed` option), you should have multiple PDB files per sequence, all following the same format: _`\"_unrelaxed_rank_001__model_1_seed_.pdb\"`_, where a `` is assigned to each repetition and `` is the model used for the fold.\n",
"\n",
"Let's define a function that will go through each `*.pdb` file in the directory, align and calculate the RMSD of each with respect to the reference and save a PyMOL session of the reference and folded proteins. If we have multiple seed repetitions saved, only those with the least RMSD with respect to the reference will be saved. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "3e5f386d",
"metadata": {},
"outputs": [],
"source": [
"from drugforge.spectrum.calculate_rmsd import (\n",
" save_alignment_pymol,\n",
" select_best_colabfold,\n",
")\n",
"def gen_alignment_vis(sequence_df, cf_folder, ref_pdb, save_dir, pymol_save):\n",
" \n",
" aligned_pdbs = []\n",
" seq_labels = []\n",
" for index, row in sequence_df.iterrows():\n",
" # iterate over each csv entry\n",
" mol = row[\"id\"]\n",
" results = Path(cf_folder)\n",
" final_pdb = save_dir / f\"{mol}_aligned.pdb\"\n",
"\n",
" # The output file format in ColabFold depends on the folding method used. \n",
" # Here the PDB file format is defined assuming the AlphaFold2_ptm model \n",
"\n",
" # Select best seed repetition\n",
" min_rmsd, min_file = select_best_colabfold(\n",
" results_dir=results,\n",
" seq_name=mol,\n",
" pdb_ref=input_pdb, # reference pdb\n",
" chain=\"A\",\n",
" final_pdb=final_pdb,\n",
" fold_model=\"alphafold2_multimer_v3\",\n",
" )\n",
"\n",
" aligned_pdbs.append(min_file)\n",
" seq_labels.append(mol)\n",
"\n",
" session_save = save_dir / pymol_save\n",
" # Save the alignment in a PyMOL session. Here we display only the chain \"A\", but both can be displayed by setting align_chain=\"both\"\n",
" save_alignment_pymol(\n",
" aligned_pdbs, seq_labels, ref_pdb, session_save, align_chain=\"A\", color_by_rmsd=True\n",
" )\n",
"\n",
" return "
]
},
{
"cell_type": "markdown",
"id": "d4bc7543",
"metadata": {},
"source": [
"If you followed the previous guide, all of the ColabFold results should be saved on the folder `sars_cf_results/`. For this tutorial, we will fetch previously prepared ColabFold PDBs, but feel free to use your own. Let's also use the PDB for SARS-CoV-2 introduced in the beginning of the notebook as reference. "
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "0e6a1407",
"metadata": {},
"outputs": [],
"source": [
"# grab cf_results\n",
"\n",
"def all_cfold_dir_fns():\n",
" return [\n",
" \"spectrum_dir/YP_009725295_1_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb\",\n",
" \"spectrum_dir/YP_009047217_1_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb\",\n",
" \"spectrum_dir/YP_009555250_1_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb\",\n",
" \"spectrum_dir/NP_835346_1_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb\",\n",
" \"spectrum_dir/YP_009944365_1_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb\",\n",
" \"spectrum_dir/YP_009944273_1_unrelaxed_rank_001_alphafold2_multimer_v3_model_1_seed_000.pdb\"\n",
" ]\n",
"def cfold_dir():\n",
"\n",
" all_paths = [fetch_test_file(f) for f in all_cfold_dir_fns()]\n",
" return all_paths[0].parent, all_paths\n",
"\n",
"sars_cf_results, _ = cfold_dir()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "fb9f84ff",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: RMSD for seed 000 is 1.9166725274687486 A\n",
"INFO: YP_009725295_1 seed with least RMSD is 000 with RMSD 1.9166725274687486 A\n",
"INFO: RMSD for seed 000 is 3.1352285647682567 A\n",
"INFO: YP_009944365_1 seed with least RMSD is 000 with RMSD 3.1352285647682567 A\n",
"INFO: RMSD for seed 000 is 2.5524218376956935 A\n",
"INFO: YP_009047217_1 seed with least RMSD is 000 with RMSD 2.5524218376956935 A\n",
"INFO: RMSD for seed 000 is 2.4124092362555603 A\n",
"INFO: YP_009944273_1 seed with least RMSD is 000 with RMSD 2.4124092362555603 A\n",
"INFO: RMSD for seed 000 is 2.0554156780668813 A\n",
"INFO: YP_009555250_1 seed with least RMSD is 000 with RMSD 2.0554156780668813 A\n",
"INFO: RMSD for seed 000 is 2.2915132888118075 A\n",
"INFO: NP_835346_1 seed with least RMSD is 000 with RMSD 2.2915132888118075 A\n",
"WARNING: No ColabFold entry for YP_010229075_1 and model alphafold2_multimer_v3 found.\n"
]
}
],
"source": [
"# grab a reference SARS-CoV-2\n",
"input_pdb = fetch_test_file(\"Mpro-P2660_0A_bound.pdb\") \n",
"\n",
"save_dir = output_dir / \"aligned_sars_tutorial/\"\n",
"save_dir.mkdir(parents=True, exist_ok=True)\n",
"\n",
"csv_file = output_dir / \"sars_tutorial_human_host.csv\"\n",
"cf_folder = sars_cf_results \n",
"seq_df = pd.read_csv(csv_file)\n",
"\n",
"gen_alignment_vis(seq_df, cf_folder, input_pdb, save_dir, \"aligned_proteins.pse\")"
]
},
{
"cell_type": "markdown",
"id": "19268bf1",
"metadata": {},
"source": [
"After running the above block, the aligned proteins and the PyMOL session will be saved in the folder `aligned_ColabFold/`. Let's look at the results, which will show the aligned first chains (chain A), colored by RMSD with respect to the reference protein. \n",
"\n",
""
]
},
{
"cell_type": "markdown",
"id": "f2552876",
"metadata": {},
"source": [
"We can clearly see a few differences in the protein structure! These regions are colored in blue. For example, **NP_073550_1** and **YP_010229075_1**, Human coronavirus 229E and NL63, repectively, have a longer aminoacid sequence, as we saw in the sequence alignment step. However, most of the structure is still conserved, especially on the binding site, which is mostly on red."
]
},
{
"cell_type": "markdown",
"id": "50083b3b",
"metadata": {},
"source": [
"## Scoring ligand affinity in protein-ligand complexes\n",
"This tutorial will now explore the protein-ligand scoring features of the spectrum package. As an example, we are going to score the score different properties of a SARS-CoV-2-Ensitrelvir complex, obtained by querying a SARS-Cov-2 sequence and folding (according to the previous steps), and docking and Ensitrelvir ligand. We will determine the strength of interaction based on 3 scores:\n",
"\n",
"1) A ligand binding affinity estimated with ChemGauss4 (empirical model)\n",
"2) The RMSD with respect to the SARS-CoV-2 reference\n",
"3) A ligand binding affinity estimated with AutoDock Vina (empirical model). [**This calculation requires installation of other packages, and it's only possible on a Ubuntu system**]\n",
"4) The protein-ligand interaction fingerprints (**COMING SOON**)"
]
},
{
"cell_type": "markdown",
"id": "10792284",
"metadata": {},
"source": [
"1) ChemGauss score"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "641b4bca",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"No BioAssembly transforms found, using input molecule as biounit: _chain1._chain2_UNK\n",
"Processing BU # 1 with title: _chain1._chain2_UNK, chains 123\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"The ligand in SARS-CoV-2_ensitrelvir_folded was docked with a confidence of 0.99. The ChemGauss score is -9.96858024597168 kcal/mol.\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | score_type | \n",
" ligand_id | \n",
" docking-structure-POSIT | \n",
" docking-confidence-POSIT | \n",
" docking-score-POSIT | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" test | \n",
" SARS-CoV-2-Mpro | \n",
" 0.99 | \n",
" -9.96858 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"score_type ligand_id docking-structure-POSIT docking-confidence-POSIT \\\n",
"0 test SARS-CoV-2-Mpro 0.99 \n",
"\n",
"score_type docking-score-POSIT \n",
"0 -9.96858 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from drugforge.spectrum.score import dock_and_score\n",
"from drugforge.docking.scorer import ChemGauss4Scorer\n",
"\n",
"logging.basicConfig(\n",
" level=logging.ERROR, \n",
" format='%(levelname)s: %(message)s', \n",
" force=True \n",
")\n",
"warnings.filterwarnings(\"ignore\", category=UserWarning)\n",
"\n",
"input_pdb = fetch_test_file(\"Mpro-P2660_0A_bound.pdb\") # Reference PDB for alignment\n",
"scorers = [ChemGauss4Scorer()]\n",
"comp_name = \"test\"\n",
"target = \"SARS-CoV-2-Mpro\"\n",
"mers_complex = fetch_test_file(\"SARS-CoV-2_ensitrelvir_folded.pdb\")\n",
"scores_df, prepped_cmp, ligand_pose, aligned = dock_and_score(\n",
" mers_complex,\n",
" comp_name=comp_name,\n",
" target_name=target,\n",
" scorers=scorers,\n",
" label=Path(mers_complex).stem,\n",
" allow_clashes=True,\n",
" pdb_ref=input_pdb,\n",
" aligned_folder=output_dir,\n",
" align_chain=\"1\",\n",
" align_chain_ref=\"A\",\n",
" )\n",
"\n",
"print(f\"The ligand in {mers_complex.stem} was docked with a confidence of {scores_df['docking-confidence-POSIT'].values[0]}. The ChemGauss score is {scores_df['docking-score-POSIT'].values[0]} kcal/mol.\")\n",
"\n",
"scores_df = scores_df[['ligand_id', 'docking-structure-POSIT', 'docking-confidence-POSIT', 'docking-score-POSIT']]\n",
"scores_df.head()"
]
},
{
"cell_type": "markdown",
"id": "1e49fc7a",
"metadata": {},
"source": [
"2. Ligand and binding site RMSD with respect to the reference PDB (as implemented in the OpenEye tools)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "083bf83d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: Ref(53) and target(54) ligands have different number of atoms\n",
"INFO: The attribute(s) types have already been read from the topology file. The guesser will only guess empty values for this attribute, if any exists. To overwrite it by completely guessed values, you can pass the attribute to the force_guess parameter instead of the to_guess one\n",
"INFO: There is no empty types values. Guesser did not guess any new values for types attribute\n",
"INFO: attribute masses has been guessed successfully.\n",
"INFO: The attribute(s) types have already been read from the topology file. The guesser will only guess empty values for this attribute, if any exists. To overwrite it by completely guessed values, you can pass the attribute to the force_guess parameter instead of the to_guess one\n",
"INFO: There is no empty types values. Guesser did not guess any new values for types attribute\n",
"INFO: attribute masses has been guessed successfully.\n",
"INFO: The attribute(s) types have already been read from the topology file. The guesser will only guess empty values for this attribute, if any exists. To overwrite it by completely guessed values, you can pass the attribute to the force_guess parameter instead of the to_guess one\n",
"INFO: There is no empty types values. Guesser did not guess any new values for types attribute\n",
"INFO: attribute masses has been guessed successfully.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Calculated RMSD of POSIT ligand pose = 6.794498029514078 A.\n",
"\n",
"Calculated RMSD of binding site = 3.3841235637664795 A, with ref Mpro-P2660_0A_bound\n"
]
}
],
"source": [
"from drugforge.spectrum.score import get_ligand_rmsd\n",
"from drugforge.spectrum.calculate_rmsd import get_binding_site_rmsd\n",
"\n",
"from Bio.PDB.PDBExceptions import PDBConstructionWarning\n",
"warnings.filterwarnings(\"ignore\", category=PDBConstructionWarning)\n",
"\n",
"lig_rmsd = get_ligand_rmsd(\n",
" str(aligned), str(input_pdb), True, rmsd_mode=\"oechem\",\n",
" )\n",
"df_print = scores_df.copy()\n",
"df_print.loc[0,\"Lig-RMSD\"] = lig_rmsd\n",
"pout = f\"Calculated RMSD of POSIT ligand pose = {lig_rmsd} A.\\n\"\n",
"\n",
"bsite_rmsd = get_binding_site_rmsd(\n",
" aligned,\n",
" input_pdb,\n",
" bsite_dist=5,\n",
" ligres=\"LIG\",\n",
" chain_mob=\"1\",\n",
" chain_ref=\"A\",\n",
" rmsd_mode=\"heavy\",\n",
" aligned_temp=aligned,\n",
" )\n",
"df_print.loc[0,\"Pocket-RMSD\"] = bsite_rmsd\n",
"pout += f\"\\nCalculated RMSD of binding site = {bsite_rmsd} A, with ref {input_pdb.stem}\"\n",
"print(pout)"
]
},
{
"cell_type": "markdown",
"id": "3bea6804",
"metadata": {},
"source": [
"3. AutoDock Vina affinity\n",
"\n",
"This part of the tutorial requires the installation of the ADFR suite package, which is only available on Ubutu and older versions of MacOS. This is not required by AutoDock Vina per-se, but to convert to the input types it requires for the protein and ligand (`.pdbqt`). If you have a better way of obtaining those files, you can provide your own path to the prepared files instead."
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "14066d82",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO: Prepped target provided\n",
"INFO: Prepped ligand provided\n",
"INFO: Score before minimization: -9.197 (kcal/mol)\n",
"INFO: Score after minimization: -9.892 (kcal/mol)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Computing Vina grid ... done.\n",
"Performing local search ... done.\n"
]
}
],
"source": [
"from drugforge.spectrum.score import score_autodock_vina\n",
"warnings.filterwarnings(\"ignore\", category=UserWarning)\n",
"\n",
"# Save prepped files for Vina \n",
"pdb_prep = output_dir / (aligned.stem + \"_target.pdb\")\n",
"sdf_ligand = output_dir / (aligned.stem + \"_ligand.sdf\")\n",
"prepped_cmp.target.to_pdb_file(pdb_prep)\n",
"ligand_pose.to_sdf(sdf_ligand)\n",
"\n",
"# If available, the protein and ligands can be given on pdbqt format, which should work with the current installation\n",
"pdb_prep = fetch_test_file(\"SARS_model_target_prepped.pdbqt\")\n",
"sdf_ligand = fetch_test_file(\"SARS_model_ligand_prepped.pdbqt\")\n",
"vina_box = [-22, 6, 25]\n",
"\n",
"df_vina, out_pose = score_autodock_vina(\n",
" pdb_prep,\n",
" sdf_ligand,\n",
" vina_box,\n",
" box_size=[20, 20, 20],\n",
" dock=False, # There is an option to dock with Vina, but we will not use it here,\n",
" )\n",
"\n",
"df_print = pd.concat([df_print, df_vina], axis=1, join=\"inner\")"
]
},
{
"cell_type": "markdown",
"id": "1d764522",
"metadata": {},
"source": [
"We stored all the scores in the `df_print` DataFrame. There is also an implementation for Gnina scoring (CLI-based) which facilitates the processing of the results, but it won't be done on this tutorial, as it requires a working Gnina installation."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "d6d81d90",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" ligand_id | \n",
" docking-structure-POSIT | \n",
" docking-confidence-POSIT | \n",
" docking-score-POSIT | \n",
" Lig-RMSD | \n",
" Pocket-RMSD | \n",
" Vina-score-premin | \n",
" Vina-score-min | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" test | \n",
" SARS-CoV-2-Mpro | \n",
" 0.99 | \n",
" -9.96858 | \n",
" 6.794498 | \n",
" 3.384124 | \n",
" -9.197 | \n",
" -9.892 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" ligand_id docking-structure-POSIT docking-confidence-POSIT \\\n",
"0 test SARS-CoV-2-Mpro 0.99 \n",
"\n",
" docking-score-POSIT Lig-RMSD Pocket-RMSD Vina-score-premin \\\n",
"0 -9.96858 6.794498 3.384124 -9.197 \n",
"\n",
" Vina-score-min \n",
"0 -9.892 "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_print.head()"
]
},
{
"cell_type": "markdown",
"id": "5ab7349c",
"metadata": {},
"source": [
"## How to run the sequence/structure alignment in the command-line"
]
},
{
"cell_type": "markdown",
"id": "26e55ff1",
"metadata": {},
"source": [
"The sequence and structure alignment pipelines we went through in the previous steps can also be run from the command line. The `drugforge-spectrum` CLI includes both pipelines: `seq-alignment` and `struct-alignment`, respectively. We will go through the usage and a few examples for each option, using the test case adopted in the tutorial above (SARS-CoV-2 Mpro Chain A)."
]
},
{
"cell_type": "markdown",
"id": "d7dd695f",
"metadata": {},
"source": [
"### Sequence alignment usage\n",
"\n",
"Starting from an input protein seqence file, _e.g._, `.fasta`, `.pdb` or pre-computed `.xml` file, the `seq-alignment` command will generate all of the alignment files for the target protein and related protein sequences. \n",
"\n",
"The `drugforge-spectrum seq-alignment` CLI takes as input the sequence file and the path to the folder where results will be stored, along with a series of options to control the BLAST search and alignment outputs. Below is a list of these options, which can also be accessed via the `drugforge-spectrum seq-alignment --help` option:\n",
"\n",
" -f, --seq-file FILE File containing reference sequences\n",
" -t, --seq_type [fasta|pdb|pre-calc]\n",
" Type of input from which the sequence will be read. [default: fasta]\n",
" --output-dir DIRECTORY The directory to output results to.\n",
" --nalign INTEGER Number of alignments that BLAST search will output.\n",
" --e-thr FLOAT Threshold to select BLAST results.\n",
" --save-blast TEXT Optional file name for saving result of BLAST search\n",
" --sel-key TEXT Selection key to filter BLAST output.\n",
" Provide either a keyword, or 'host: '\n",
" --plot-width INTEGER Width for the multi-alignment plot.\n",
" --blast-json FILE Path to a json file containing parameters for the blast search.\n",
" --email TEXT Email for Entrez search.\n",
" --multimer Store the output sequences for a multimer ColabFold run (from identical chains).\n",
" If not set, \"--n-chains\" will not be used.\n",
" --n-chains INTEGER Number of repeated chains that will be saved in csv file.\n",
" Requires calling the \"--multimer\" option first.\n",
" --gen-ref-pdb Fetch PDB of crystal structure for reference target, from the RCSB database.\n",
" --plot-width INTEGER Width for the multi-alignment plot.\n",
" --color-seq-match Color aminoacid matches in html alignment:\n",
" Red for exact match and yellow for same-\n",
" group match.\n",
" --align-start-idx INTEGER Start index for reference aminoacids in html\n",
" alignment (Useful when matching idxs to\n",
" PyMOL labels)\n",
" --max-mismatches INTEGER Maximum number of aminoacid group\n",
" missmatches to be allowed in color-seq-match\n",
" mode.\n",
" --custom-order TEXT Custom order of aligned sequences (not\n",
" including ref) can be provided as a string\n",
" with comma-sep indexes.\n",
" --loglevel [DEBUG|INFO|WARNING|ERROR|CRITICAL]\n",
" The log level to use. [default: INFO]\n",
" --help Show this message and exit.\n"
]
},
{
"cell_type": "markdown",
"id": "46b77987",
"metadata": {},
"source": [
"The following optional arguments control the BLAST search:\n",
"\n",
"- The number of alignments is controled through the `--nalign` option and is set to 1000 by default. Other BLAST API parameters such as the number of hits and the number of descriptions are calculated internally to match the number given for `--nalign`. \n",
"- The expect value cutoff is controlled via `--e-thr`, which is 10 by default. \n",
"- The blast output is saved to a CSV file, with the path specified via the `--save-blast` option. \n",
"\n",
"The selection string has the same format as the previous tutorial, allowing both keyword and Entrez selections, via the `--sel-key` argument. Make sure to include the email in`--email` when requesting a an Entrez selection. The CLI will throw an error if this is not provided. "
]
},
{
"cell_type": "markdown",
"id": "30f7bbe6",
"metadata": {},
"source": [
"For the test case of a single protein chain of SARS-CoV-2 MPro, which we previously saved as `sars-cov2-mpro.fasta`, we run the following command:\n",
"\n",
"```bash\n",
"# Starting from a .fasta file\n",
"drugforge-spectrum seq-alignment -f sars-cov2-mpro.fasta -t fasta \\\n",
"--output-dir tutorial_results \\\n",
"--sel-key \"host: Homo sapiens OR organism: human\" \\\n",
"--email your.email@email.com \\\n",
"--color-seq-match\n",
"```\n",
"\n",
"```bash\n",
"# Starting from a .pdb file\n",
"drugforge-spectrum seq-alignment -f 8ya5.pdb -t pdb \\\n",
"--output-dir tutorial_results \\\n",
"--sel-key \"host: Homo sapiens OR organism: human\" \\\n",
"--email your.email@email.com \\\n",
"--color-seq-match\n",
"```\n",
"\n",
"```bash\n",
"# Starting from a pre-computed BLAST search (.xml file)\n",
"drugforge-spectrum seq-alignment -f results.xml -t pre-calc \\\n",
"--output-dir tutorial_results \\\n",
"--sel-key \"host: Homo sapiens OR organism: human\" \\\n",
"--email your.email@email.com \\\n",
"--color-seq-match\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "c6b2f7cc",
"metadata": {},
"source": [
"All of the output files generated are named according to the title of the sequence given in the input fasta file, with an additional prefix that can be added through the `--aln-output` option if desired. The following files are saved in `--results-folder`:\n",
"\n",
"- A CSV file with the results from the BLAST search on the reference sequence (default path is `blast.csv`)\n",
"- A `.fasta` file with the ID, description and sequence of the sequences selected via the provided keyword.\n",
"- a CSV file with the ID and sequence of the sequences selected via the provided keyword, required for the structure alignment pipeline.\n",
"- An `.html` file with a colored representation of the multi-sequence alignment. The width of the plot can be adjusted via `--plot-width`.\n",
"- A `.fasta` file with the aligned sequences (with “-” representing gaps inserted)\n",
"- If the `--gen-ref-pdb` flag is used, a PDB file corresponding to the crystal structure of the reference sequence. The RCSB entry with the highest resolution is retrieved.\n"
]
},
{
"cell_type": "markdown",
"id": "240d1c71",
"metadata": {},
"source": [
"The default setting to generate a ColabFold run CSV input is for monomers, but this can be adjusted for folding multimers. For example, starting from the `fasta` input file:\n",
"\n",
"```bash\n",
"# Starting from a .fasta file, generate ColabFold input for a dimer fold\n",
"drugforge-spectrum seq-alignment -f sars-cov2-mpro.fasta -t fasta \\\n",
"--output-dir tutorial_results \\\n",
"--sel-key \"host: Homo sapiens OR organism: human\" \\\n",
"--email your.email@email.com \\\n",
"--multimer --nchains 2\n",
"```\n",
"The above will generate the necessary input for folding the related protein sequences as dimers."
]
},
{
"cell_type": "markdown",
"id": "3ef74775",
"metadata": {},
"source": [
"Finally, the user can fetch a PDB crystal structures of the reference by optionally providing the flag `--gen-ref-pdb`. Note that this functionality requires previous installation of the ProDy module in the `asapdiscovery` environment. "
]
},
{
"cell_type": "markdown",
"id": "4e5dfd83",
"metadata": {},
"source": [
"### Structure alignment usage\n",
"\n",
"Based on a ColabFold run, as described in the tutorial, the `struct-alignment` command aligns the PDBs of the folded sequences listed in a `.csv` file and generates a PyMOL session for visualization. \n",
"\n",
"The `drugforge-spectrum struct-alignment` CLI takes as input the csv file, generated from the `seq-alignment` run, the PDB file of the protein that we used as a reference for the two prior steps, the path to the folder where the results from the ColabFold run are stored (the `$out_dir` environment variable assigned to the run on the folding step), and the path to the directory where alignment results will be stored. Below is a list of these options, which can also be accessed via the `drugforge-spectrum struct-alignment --help` option:\n",
"\n",
" -f, --seq-file FILE File containing reference sequences\n",
" --pdb-file FILE Path to a pdb file containing a structure\n",
" --output-dir DIRECTORY The directory to output results to.\n",
" --cfold-results DIRECTORY Path to folder where all ColabFold results are stored.\n",
" --pdb-align TEXT Path to PDB to align. Not needed when\n",
" --cfold-results is given.\n",
" --struct-dir DIRECTORY Path to folder where structures to align are\n",
" stored. Not needed when --cfold-results or\n",
" --pdb-align is given.\n",
" --pymol-save TEXT Path to save pymol session with aligned\n",
" proteins.\n",
" --chain TEXT Chains to display on visualization ('A', 'B'\n",
" or 'both'). The default 'both' will align\n",
" wrt chain A but display both chains.\n",
" --color-by-rmsd Option to generate a PyMOL session were\n",
" targets are colored by RMSD with respect to\n",
" ref.\n",
" --cf-format TEXT Model used with ColabFold. Either\n",
" 'alphafold2_ptm' or 'alphafold2_multimer_v3'\n",
" --loglevel [DEBUG|INFO|WARNING|ERROR|CRITICAL]\n",
" The log level to use. [default: INFO]\n",
" --help Show this message and exit."
]
},
{
"cell_type": "markdown",
"id": "ba739315",
"metadata": {},
"source": [
"Optionally, a custom name for the PyMOL session can be provided, with the default being `aligned_proteins.pse`. The format of the PDB file of the folded protein that ColabFold generates can also be controlled by the argument `--cf-format`. The default format is _`\"alphafold2_ptm\"`_, which corresponds to a run using the _alphafold2_ptm_ model and no relaxation step. If a multimer model was used for ColabFold, _`\"alphafold2_multimer_v3\"`_ would be the correct format, although the string can be verified by looking at the contents of the `--cfold-results` directory. "
]
},
{
"cell_type": "markdown",
"id": "5fba1ebf",
"metadata": {},
"source": [
"The following command will create a PyMOL session based on the ColabFold results stored in the directory `sars_cf_results/`:\n",
"```bash\n",
"# From the csv file and ColabFold files generated in the previous tutorial steps\n",
"drugforge-spectrum struct-alignment -f sars_tutorial_human_host.csv \\\n",
"--pdb-file 8dz0.pdb \\\n",
"--output-dir aligned_sars_tutorial \\\n",
"--cfold-results sars_cf_results \\\n",
"--cf-format alphafold2_multimer_v3 \\\n",
"--chain \"A\" \\\n",
"--color-by-rmsd\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "f6b1c9b6",
"metadata": {},
"source": [
"It is also possible to align protein complexes already stored on a directory (as PDB files), or from a single PDB file:\n",
"```bash\n",
"# From a single PDB file\n",
"drugforge-spectrum struct-alignment -f sars_tutorial_human_host.csv \\\n",
"--pdb-file 8dz0.pdb \\\n",
"--output-dir \"aligned_sars_redo\" \\\n",
"--pdb-align \"aligned_sars_tutorial/YP_009725295_1_aligned.pdb\" \\\n",
"--chain A \\\n",
"--color-by-rmsd \\\n",
"--pymol-save aligned_single_pdb.pse\n",
"```\n",
"\n",
"```bash\n",
"# From the csv file and ColabFold files generated in the previous tutorial steps (dimer run)\n",
"drugforge-spectrum struct-alignment -f sars_tutorial_human_host.csv \\\n",
"--pdb-file 8dz0.pdb \\\n",
"--output-dir \"aligned_sars_redo\" \\\n",
"--struct-dir \"aligned_sars_tutorial\" \\\n",
"--chain A \\\n",
"--color-by-rmsd \\\n",
"--pymol-save aligned_structure_file.pse\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "a86168f6",
"metadata": {},
"source": [
"### Scoring usage\n",
"\n",
"The CLI can be used to score a protein-ligand complex, or a directory with multiple structures. Different options can be given, that will determine what scoring methods will be used. By default, only the ChemGauss score and ligand pose RMSD are calculated, while the Binding pocket RMSD, AutoDock Vina score (works only if ADFR software suite and Meeko have been installed, see [Requirements](https://autodock-vina.readthedocs.io/en/latest/docking_requirements.html)), and Gnina score (if installed), can be requested as well. Running this call generate a `csv` file containing all the requested scores.\n",
"`--docking-csv` provides the option to add the docking score from a previous `drugforge-docking` run, by providing the `docking_scores_filtered_sorted.csv` file generated from that run, but can take an empty string (`\" \"`, noting the space) as input if that feature is not desired.\n",
"\n",
"The `csv` file with the scores will be have a column with the protein and/or ligand ID identifiers, which are determined based on the file names of the structures in `--docking-dir`. The defaults are set to work with the identifiers returned by BLAST in the sequence alignment tutorial (e.g., \"YP_009725295\", which is SARS-CoV-2), but they can be specified with `--protein-regex`, which takes a regex pattern string. For example, `--protein-regex '^([A-Za-z0-9\\-]+)'` will extract the target name from `MERS-CoV_ensitrelvir_folded.pdb`, and `--ligand-regex '^[^_]+_([^_]+)'` extracts the ligand (ensitrelvir).\n",
"\n",
"Note that the `--ml-score` flag requests scoring through ML models implemented within `drugforge-ml`, but this functionality is **currenlty unavailable**. \n",
"\n",
" Options:\n",
" -d, --docking-dir PATH Path to directory where docked structures\n",
" are stored.\n",
" -f, --pdb_ref PATH Path to directory/file where crystal\n",
" structures are stored.\n",
" -o, --out-dir TEXT Path to directory where scoring results will\n",
" be stored.\n",
" --docking-csv PATH Path to csv files with docking results.\n",
" --target [EV-D68-Capsid|SARS-CoV-2-N-protein|EV-A71-3Cpro|EV-A71-2Apro|EV-A71-Capsid|DENV-NS2B-NS3pro|SARS-CoV-2-Mpro|EV-D68-3Cpro|ZIKV-RdRppro|SARS-CoV-2-Mac1|MERS-CoV-Mpro|ZIKV-NS2B-NS3pro]\n",
" The target for the workflow [required]\n",
" --vina-score Whether to run vina scoring.\n",
" --vina-box-x FLOAT coordinate x of vina box.\n",
" --vina-box-y FLOAT coordinate y of vina box.\n",
" --vina-box-z FLOAT coordinate z of vina box.\n",
" --path-to-grid-prep PATH Path to .py file that calculates grid for\n",
" Vina.\n",
" --docking-vina Whether to run docking on vina.\n",
" --ligand-regex TEXT Pattern for extracting ligand ID from file\n",
" name.\n",
" --protein-regex TEXT Pattern for extracting protein ID from file\n",
" name.\n",
" --minimize Whether to minimize the pdb structures\n",
" before running scoring.\n",
" --md-openmm-platform TEXT The OpenMM platform to use for MD\n",
" minimization.\n",
" [CPU|CUDA|OpenCL|Reference|Fastest]\n",
" --ml-score Whether to employ implemented ML models to\n",
" score poses.\n",
" --bsite-rmsd Whether to calculate binding site RMSD (only\n",
" relevant when the ref pdb is the same target\n",
" as the docked complex).\n",
" --dock-chain TEXT Chain ID of main chain in docked complex\n",
" pdb(s).\n",
" --ref-chain TEXT Chain ID of main chain in reference pdb(s).\n",
" --lig-resname TEXT Residue name of ligand in reference pdb(s).\n",
" --gnina-score Whether to run gnina scoring.\n",
" --gnina-script TEXT Path to bash script that runs Gnina CLI.\n",
" --gnina-out-dir PATH Directory for gnina output.\n",
" --log-level TEXT Logging level.\n",
" --input-json FILE Path to a json file containing the inputs to\n",
" the workflow, WARNING: overrides all other\n",
" inputs.\n",
" --help Show this message and exit."
]
},
{
"cell_type": "markdown",
"id": "e094754f",
"metadata": {},
"source": [
"The following command will run the scoring workflow without AutoDock Vina, assuming `MERS-CoV_Lig-ensitrelvir_folded.pdb` is saved on a folder `docked_folder`, with other complexes like `SARS-CoV-2_Lig-ensitrelvir_folded.pdb`, or `SARS-CoV-2_Lig-someligand_folded.pdb`:\n",
"\n",
"```bash\n",
"drugforge-spectrum score -d docked_folder/ \\\n",
"-f 8dz0.pdb \\\n",
"-o score_results/ \\\n",
"--docking-csv \" \" \\\n",
"--target SARS-CoV-2-Mpro \\\n",
"--dock-chain \"1\" \\\n",
"--ref-chain \"A\" \\\n",
"--bsite-rmsd \\\n",
"--protein-regex '^([A-Za-z0-9\\\\-]+)' \\\n",
"--ligand-regex 'Lig-[^_]+'\n",
"```\n",
"Note that `--dock-chain` and `--ref-chain` need to correspond to the respective pdb files. The run will throw an error if it can't find a chain with that identifier in the PDB file. To run Vina scoring, if all requirements are installed, with the center of the ligand box specified via the flag `--vina-box-[x,y,z]`:\n",
"\n",
"```bash\n",
"drugforge-spectrum score -d docked_folder/ \\\n",
"-f 8dz0.pdb \\\n",
"-o score_results/ \\\n",
"--docking-csv \" \" \\\n",
"--target SARS-CoV-2-Mpro \\\n",
"--dock-chain \"1\" \\\n",
"--ref-chain \"A\" \\\n",
"--bsite-rmsd \\\n",
"--protein-regex '^([A-Za-z0-9\\\\-]+)' \\\n",
"--ligand-regex 'Lig-[^_]+' \\\n",
"--vina-score \\\n",
"--vina-box-x -22.0 \\\n",
"--vina-box-y 6.0 \\\n",
"--vina-box-z 25.0\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "spectrum",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}