Docking and Scoring

Intro

This notebook will show you how to dock and score molecules using the drugforge-docking module. You should have already gone through the interfacing with databases and systems tutorial.

Our docking pipeline primarily focuses on the use-case for a structure-enabled drug discovery program, in which we have crystal structures of early molecules to use for reference-based docking. This approach has been demonstrated to be effective for prioritising designed compounds.

To this end, we have implemented an api that wraps the OpenEye POSIT docking algorithm, which, through its use of the HYBRID and SHAPEFIT algorithms, enables reference-based docking.

The scope of this guide

This guide will show you how to dock and score molecules. For the extremely necessary precursor step of data loading and prepping, please see protein_prep

Setting up example data

We will use files we use for testing, since these molecules have already been prepped for docking.

[ ]:
from drugforge.data.testing.test_resources import fetch_test_file
from drugforge.data.schema.complex import Complex
from drugforge.data.schema.ligand import Ligand
from drugforge.modeling.schema import PreppedComplex

# grab a pre-prepared complex
prepped_complex = PreppedComplex.from_oedu_file(
        fetch_test_file("Mpro-P2660_0A_bound-prepped_receptor.oedu"),
        ligand_kwargs={"compound_name": "test"},
        target_kwargs={"target_name": "test", "target_hash": "mock_hash"},
    )

# make a ligand from an SDF file
ligand = Ligand.from_sdf(
        fetch_test_file("Mpro-P0008_0A_ERI-UCB-ce40166b-17.sdf"), compound_name="test"
    )

Docking with the POSITDocker

There are a ton of choices we can make for docking, which will not be enumerated here. But in order to get a flavor for the options, we can examine the class attributes of the POSITDocker:

[ ]:
from drugforge.docking.openeye import POSITDocker
[ ]:
POSITDocker?

A quick overview of some important options

  • relax: Whether to allow relaxation when generating docked structures

  • posit_method: which POSIT method to use, see here for a complete treatment of POSIT theory. The default (ALL) selects the best method possible iteratively.

  • use_omega: Use OpenEye’s OMEGA conformer generation to enumerate conformers before docking. Should vastly improve the quality of predicted poses.

  • omega_dense: Use dense OMEGA sampling.

  • allow_retries: Try several configurations of docking parameters to attempt to obtain a result.

[ ]:
# lets go ahead and make a Docker object
docker = POSITDocker()
[ ]:
# and check out its defaults.
docker.dict()

POSITDocker.dock() requires:

  1. a list of DockingInputBase objects

  2. an output directory

  3. and some dask options

Currently, we have 2 kinds of DockingInputBase objects implemented:

  1. a complex-ligand pair (DockingInputPair)

  2. a one-to-many ligand:complexes object (DockingInputMultiStructure)

Running simple docking

[ ]:
from drugforge.docking.docking import DockingInputPair
input_pair = DockingInputPair(ligand=ligand, complex=prepped_complex)
[ ]:
results = docker.dock(inputs=[input_pair])

This returns a list of POSITDockingResults objects!

[ ]:
# lets grab one of the results objects
result = results[0]
[ ]:
# and dump it to disk
result.write_docking_files("docking_test")

Scoring

We decouple the pose prediction and scoring parts of docking, which enables us to score docked poses with different scoring functions.

This flexibility allows us to implement our own scoring functions and capture information important to our discovery process.

To this end, we have written a few “scorer” classes, including:

  1. A traditional phyiscs based docking scorer: ChemGauss4Scorer

  2. A score which tries to capture information about the potential for the binding site to evolve: FINTScorer (see working_with_fitness_data tutorial)

  3. A 2D Graph Attention based scorer trained on Covid Moonshot data that predicts pIC50s directly: GATScorer

  4. 3D equivariant scorers that use 3D pose information and are trained on Covid Moonshot data to predict pIC50s: E3NNScorer and SchnetScorer

  5. And finally, a MetaScorer which can run all the other scorers easily.

Currently these live in the docking module, but we are planning to move some of the scorers to other subpackages so that docking doesn’t have to depend on dataviz, spectrum, ml, etc.

[ ]:
from drugforge.docking.scorer import ChemGauss4Scorer
from drugforge.docking.fint_scorer import FINTScorer
# The ML scorers are not imported for now as the ml module is not working.
# from drugforge.docking.ml_scorer import GATScorer, E3NNScorer, SchnetScorer
from drugforge.docking.meta_scorer import MetaScorer

Targets

Several of our scorers require target-specific information. We can find out the targets that the repo “knows about” like so:

[ ]:
from drugforge.data.services.postera.manifold_data_validation import TargetTags
[ ]:
TargetTags.get_values()

Since we’re working with a known target, we can set that as a variable and use that throughout

[ ]:
target = TargetTags("SARS-CoV-2-Mpro")

ChemGauss4 Scorer

[ ]:
chemgauss_scorer = ChemGauss4Scorer()
[ ]:
scores = chemgauss_scorer.score(results)
scores

We can see this returns an array of score objects. If we want a dataframe, we can ask it to run this instead:

[ ]:
scores_df = chemgauss_scorer.score(results, return_df=True)
scores_df

FINTScore

In antiviral drug discovery, potential for mutation of target binding sites is very high. Thus it is important to avoid sidechain-ligand interactions with highly mutable residues when evaluating potential designs.

FINTScore attempts to compress information about the mutability of the ligand binding site into a single number between 0 and 1. It rewards interactions with non-mutable sidechains and backbone atoms, while penalising interactions with mutable sidechains. See the implementation for more details. You can also view this information in 3D. See the visualising ASAP targets tutorial for more information.

For the FINT score, we need fitness data (normally obtained by deep mutational scanning experiments or from phylogenetic data), which means we can only work on a target for which we have vendored fitness data. To check which targets those are, we can use:

[ ]:
from drugforge.spectrum.fitness import target_has_fitness_data

# does our target have fitness data?
target_has_fitness_data(target) # yes!
[ ]:
fint_scorer = FINTScorer(target=target)
[ ]:
scores = fint_scorer.score(results, return_df=True)
scores

ML Scorers - WARNING THIS SECTION CURRENTLY DOES NOT WORK. We hope to fix this in the next release.

Our ML scorers are trained to predict pIC50s from 2D graph or 3D equivariant representations of ligands and target-ligand complexes respectively. This is enabled by the MTENN framework that abstracts and modularize the task of structure-based machine learning.

Currently, we have ML scorers for some targets separately, but are exploring PDBBind based foundational models for multi-target prediction.

[ ]:
# import our ML model registry
from drugforge.ml.models import ASAPMLModelRegistry
[ ]:
ASAPMLModelRegistry.get_implemented_model_types()
[ ]:
from drugforge.docking.scorer import MLModelScorer
ml_scorers = [MLModelScorer.from_latest_by_target_and_type(target, model_type)
           for model_type in ASAPMLModelRegistry.get_implemented_model_types()]
[ ]:
gat_scores = ml_scorers[0].score(results, return_df=True)
[ ]:
gat_scores

MetaScorer

We can use the MetaScorer to run all the scoring for us and combine everything into a dataframe we can save easily

[ ]:
scorers = [chemgauss_scorer, fint_scorer, *ml_scorers]
[ ]:
metascorer = MetaScorer(scorers=scorers)
[ ]:
scores_df = metascorer.score(results, return_df=True)
[ ]:
scores_df

Under the hood, this uses this function: drugforge.docking.scorer.Score._combine_and_pivot_scores_df to return the scores in a dataframe. As of version 0.4, this uses drugforge.docking.scorer._SCORE_MANIFOLD_ALIAS to change the column names to conform to the standard names used within the ASAPDiscovery Consortium. You can examine which column names correspond to which score here:

[ ]:
from drugforge.docking.scorer import _SCORE_MANIFOLD_ALIAS
_SCORE_MANIFOLD_ALIAS

Advanced Topics: Selectors

The ASAPDiscovery Consortium regularly operates in a regime where we have many experimental structures to choose from as references for docking. To accelerate the process of choosing which structures to use, we have generated a series of Selector objects which take a set of ligands and complexes and choose which set of ligand-complex pairs to use for docking.

[ ]:
from drugforge.docking.selectors.selector_list import StructureSelector
[ ]:
StructureSelector.get_values()
[ ]:
# mock a fragalysis dump from diamond light source

from drugforge.data.services.fragalysis.fragalysis_reader import FragalysisFactory
all_mpro_fns = [
        "metadata.csv",
        "aligned/Mpro-x11041_0A/Mpro-x11041_0A_bound.pdb",
        "aligned/Mpro-x1425_0A/Mpro-x1425_0A_bound.pdb",
        "aligned/Mpro-x11894_0A/Mpro-x11894_0A_bound.pdb",
        "aligned/Mpro-x1002_0A/Mpro-x1002_0A_bound.pdb",
        "aligned/Mpro-x10155_0A/Mpro-x10155_0A_bound.pdb",
        "aligned/Mpro-x0354_0A/Mpro-x0354_0A_bound.pdb",
        "aligned/Mpro-x11271_0A/Mpro-x11271_0A_bound.pdb",
        "aligned/Mpro-x1101_1A/Mpro-x1101_1A_bound.pdb",
        "aligned/Mpro-x1187_0A/Mpro-x1187_0A_bound.pdb",
        "aligned/Mpro-x10338_0A/Mpro-x10338_0A_bound.pdb",
    ]
all_paths = [fetch_test_file(f"frag_factory_test/{fn}") for fn in all_mpro_fns]
parent_dir = all_paths[0].parent


[ ]:
ff = FragalysisFactory.from_dir(parent_dir)
complexes = ff.load()
[ ]:
ligands = [complex.ligand for complex in complexes]

To illustrate what the selectors do

Selectors take a list of Complexes and Ligands and return Pairs based on some criteria.

For example the SelfDockingSelector only selects ligands that match those already present in the complexes.

SelfDockingSelector

[ ]:
selector = StructureSelector('SelfDockingSelector').selector_cls()
[ ]:
pairs = selector.select(ligands, complexes)
[ ]:
len(pairs)
[ ]:
# are all the ligands the same as those in the complexes?
all(pair.complex.ligand == pair.ligand for pair in pairs)

PairwiseSelector

The PairwiseSelector produces the full outer product of complexes and ligands e.g in the example below we have 10 complexes and 10 ligands. The resulting outer product of pairs contains 100 elements.

[ ]:
selector = StructureSelector('PairwiseSelector').selector_cls()
[ ]:
pairs = selector.select(ligands, complexes)
[ ]:
len(pairs)

MCSSelector

The MCSSelector selects complexes that closely match the structures in the query ligand by Maximum Common Substructure (MCS). This is very useful when docking new designs and you need to determine a chemically similar reference.

[ ]:
selector = StructureSelector('MCSSelector').selector_cls()
[ ]:
# the n_select parameter controls how many matches per ligand
pairs = selector.select(ligands, complexes, n_select=5)
[ ]:
len(pairs)

LeaveSimilarOutSelector

The LeaveSimilarOutSelector filters out pairs where the ligand is a stereoisomer / tautomer / protonation state isomer / etc of the complex ligand. It can take a while though because it has to do an len(ligands) * len(complexes) pairwise comparison of all those chemical possibilities.

[ ]:
selector = StructureSelector('LeaveSimilarOutSelector').selector_cls()
[ ]:
pairs = selector.select(ligands, complexes)
[ ]:
len(pairs)

Advanced Topics: Multi-Structure Docking

Some docking protocols (i.e., POSIT) will accept multiple receptor structures and choose for themselves which to dock to. For these docking protocols, we pass a different kind of input:

[ ]:
from drugforge.docking.docking import DockingInputMultiStructure
from drugforge.docking.selectors.selector_list import StructureSelector
[ ]:
cached_dus = {
        "Mpro-x1002": "du_cache/Mpro-x1002_0A_bound.oedu",
        "Mpro-x0354": "du_cache/Mpro-x0354_0A_bound.oedu",
    }
prepped_complexes = [
        PreppedComplex.from_oedu_file(
            fetch_test_file(cached_du),
            ligand_kwargs={"compound_name": "test"},
            target_kwargs={"target_name": name, "target_hash": "mock_hash"},
        )
        for name, cached_du in cached_dus.items()
    ]
ligand = Ligand.from_sdf(
        fetch_test_file("Mpro-P0008_0A_ERI-UCB-ce40166b-17.sdf"), compound_name="test"
    )

Let’s assume we had gotten this subset of ligand-protein pairs from the selector logic from above. This would look something like this:

[ ]:
selector = StructureSelector('LeaveSimilarOutSelector').selector_cls()
pairs = selector.select([ligand], prepped_complexes)
len(pairs)

We then collapse these pairs into a single MultiStructure set:

[ ]:
inputs = DockingInputMultiStructure.from_pairs(pairs) # Returns a list since multiple sets could be generated

If we already knew exactly what we wanted to do, we could just create the set directly:

[ ]:
alternate_inputs = DockingInputMultiStructure(ligand=ligand, complexes=prepped_complexes)

We can see that two are equivalent in this case:

[ ]:
inputs[0] == alternate_inputs

Now we run docking as before:

[ ]:
docker = POSITDocker() # let's just use defaults for now
[ ]:
results = docker.dock(inputs) # we won't use dask or write an output, takes ~3 minutes on a Macbook Pro
[ ]:
result = results[0]
[ ]:
result.write_docking_files("multi_structure_docking_test")

Since we input multiple structures, we don’t know which one it actually used. We can find this out by examining the results:

[ ]:
result.input_pair.complex.target.target_name

Advanced Topics: Multi-pose Docking

Note: this is functionality that was most recently added. Please make an issue if you encounter problems :)

We’ll use the same docking scheme as above

[ ]:
from drugforge.docking.docking import DockingInputPair
prepped_complex = PreppedComplex.from_oedu_file(
        fetch_test_file("Mpro-P2660_0A_bound-prepped_receptor.oedu"),
        ligand_kwargs={"compound_name": "test"},
        target_kwargs={"target_name": "test", "target_hash": "mock_hash"},
    )
ligand = Ligand.from_sdf(
        fetch_test_file("Mpro-P0008_0A_ERI-UCB-ce40166b-17.sdf"), compound_name="test"
    )
input_pair = DockingInputPair(ligand=ligand, complex=prepped_complex)
[ ]:
docker = POSITDocker(num_poses=50) # we set the number of poses when we create the docker
[ ]:
results = docker.dock([input_pair]) # we won't use dask or write an output, takes ~1 min on a Macbook Pro
[ ]:
len(results)
[ ]:
print([result.probability for result in results])

A few side notes: Dask and Target Specific Workflows

Dask

We make heavy use of Dask throughout our code, which helps automate parallel processing and provides a nice dashboard for evaluating the progress of large scale docking efforts. Due to the way in which Dask automates error handling, this has occasionally led to situations where the behaviour of our code is different depending on whether you have enabled Dask. We have tried to stamp out any instances of this, but if you find another, please make an issue!

Target-specific workflows

We have implemented our library code within the asapdiscovery-workflows module, which puts everything together in a command-line interface (cli). Unfortunately, as of version 0.4, these workflows only work if you are using the targets specified for ASAP. We plan on changing this for version 0.5

To find out which targets can be passed to these workflows, you can use this:

[ ]:
from drugforge.data.services.postera.manifold_data_validation import TargetTags
[ ]:
TargetTags.get_values()