2.2.2 | Preparation of targets and model templates.
A FASTA file of the receptor sequence is prepared by the CASP
organizers. For known small molecules, SMILES are retrieved from the PDB
component dictionary. In the case of novel small molecules (not present
in the PDB component dictionary), SMILES are provided by the
experimentalists. In both cases, the SMILES are compared with those
derived from the PDB coordinates and modified where they disagree. If
necessary, stereochemistry is assigned using the
AssignStereochemistryFrom3D function from RDKit, and the protonation
state is adjusted by manually editing the SMILES based on visual
inspection of protein-ligand interactions.
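As an illustration of the stereochemistry step, the following minimal sketch recovers stereo labels from 3D coordinates with RDKit. The alanine SMILES, the embedded conformer standing in for PDB coordinates, and the random seed are assumptions made for this example only; they are not taken from the CASP pipeline.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Hypothetical input: L-alanine, used only to obtain a 3D conformer.
reference_smiles = "C[C@H](N)C(=O)O"

mol = Chem.AddHs(Chem.MolFromSmiles(reference_smiles))
# Stand-in for experimental coordinates (assumption): an embedded conformer.
if AllChem.EmbedMolecule(mol, randomSeed=7) != 0:
    raise RuntimeError("3D embedding failed")

# Discard all stereo labels, mimicking a SMILES that lacks stereochemistry.
Chem.RemoveStereochemistry(mol)
flat_smiles = Chem.MolToSmiles(Chem.RemoveHs(mol))

# Re-derive chiral tags and stereo labels from the conformer.
Chem.AssignStereochemistryFrom3D(mol)
recovered_smiles = Chem.MolToSmiles(Chem.RemoveHs(mol))
```

Comparing recovered_smiles with the canonical form of a dictionary or experimentalist SMILES is one way to spot stereocenters that disagree with the deposited coordinates.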
The relevance of each small molecule is decided case by case for each
target. Only biologically relevant small molecules are retained. Common
crystallographic reagents and ions are ignored unless they interact with
the retained small molecules or form part of a structural motif (e.g., a
zinc-binding motif).
A script to prepare prediction templates (MDL files) is provided by the
CASP organizers. It is implemented in Python 3 using the RDKit Python
bindings (http://www.rdkit.org/). The
script initially converts the input SMILES strings to RDKit Mol objects
using the rdkit.Chem.MolFromSmiles method. At this stage, the Mol
objects contain only the information related to small molecule
properties, such as atom types and bonds. A coordinate section is
added to the Mol objects using RDKit's ETKDG method38. Subsequently, the
Mol objects are written to an MDL-formatted file33, which can be used
as a ligand submission template.
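The template-generation steps described above can be sketched as follows. This is not the organizers' script: the function name smiles_to_template, the fixed random seed, and the choice to add hydrogens before embedding and remove them before writing are all assumptions made for illustration.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def smiles_to_template(smiles: str, name: str = "LIG", seed: int = 42) -> str:
    """Return an MDL mol block with ETKDG coordinates for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    # Assumption: embed with explicit hydrogens, which usually gives
    # better geometries, then write a heavy-atom-only template.
    mol_h = Chem.AddHs(mol)
    params = AllChem.ETKDG()
    params.randomSeed = seed  # fixed seed for reproducible coordinates
    if AllChem.EmbedMolecule(mol_h, params) != 0:
        raise RuntimeError(f"embedding failed for {smiles!r}")

    mol_3d = Chem.RemoveHs(mol_h)
    mol_3d.SetProp("_Name", name)  # becomes the title line of the mol block
    return Chem.MolToMolBlock(mol_3d)
```

Calling smiles_to_template("CCO") returns a mol block whose atom and bond tables follow the order of the input SMILES, so a predictor would only need to replace the coordinates.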
2.2.3 | Setting up the acceptance system.
Validation of ligand predictions is performed with scripts written in
Python 2.7
and RDKit. Initial checks verify the CASP header section (availability
and correctness of PFRMAT, TARGET, AUTHOR, and MODEL/END records). Once
submissions have passed this phase, ligand models are converted to RDKit
Mol objects and compared with the template for downstream evaluation.
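A header check of this kind might look like the sketch below. The function name check_casp_header and the specific rules (each record present with a value, MODEL and END counts matching) are assumptions; the actual acceptance scripts may apply stricter or different checks, and the sample values in the usage note are placeholders.

```python
from typing import List

# Records that must appear once with a value (assumed rule set).
REQUIRED_RECORDS = ("PFRMAT", "TARGET", "AUTHOR")


def check_casp_header(text: str) -> List[str]:
    """Return a list of problems; an empty list means the header passed."""
    problems = []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    keywords = [ln.split()[0] for ln in lines]

    for record in REQUIRED_RECORDS:
        hits = [ln for ln in lines if ln.split()[0] == record]
        if not hits:
            problems.append(f"missing {record} record")
        elif len(hits[0].split()) < 2:
            problems.append(f"{record} record has no value")

    # Every MODEL record should be closed by an END record.
    n_model = keywords.count("MODEL")
    n_end = keywords.count("END")
    if n_model == 0:
        problems.append("missing MODEL record")
    if n_model != n_end:
        problems.append(f"{n_model} MODEL record(s) but {n_end} END record(s)")
    return problems
```

For instance, check_casp_header("PFRMAT LG\nTARGET T0000\nAUTHOR 0000-0000-0000\nMODEL 1\nEND\n") finds no problems, whereas a file lacking its TARGET line is reported as such.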
Each molecule in the submitted file is validated by comparison with a
reference Mol object generated from the corresponding SMILES string as
described above. To validate the submissions, comparisons of the
following parameters are undertaken:
- number of atoms and their types,
- number of bonds,
- bond types and atom types in bond pairs (e.g., C-C Single, or C=O
Double).
Additionally, to account for atom connectivity and chirality in
submitted models, the maximum common substructures between the submitted
and reference ligands are calculated using the FindMCS function
in RDKit. To pass the validation, the maximum common substructure must
contain as many atoms as the reference molecule.
Finally, a validation report is created showing the results of the
validation process to aid in troubleshooting invalid submissions.
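The comparisons listed above, together with the maximum common substructure test, can be sketched with RDKit as follows. The helper names, the message strings, and the FindMCS options (matchChiralTag=True, a 10-second timeout) are assumptions for illustration and are not taken from the CASP scripts.

```python
from collections import Counter
from typing import List

from rdkit import Chem
from rdkit.Chem import rdFMCS


def bond_signature(mol: Chem.Mol) -> Counter:
    """Count bonds by (element pair, bond type), e.g. (('C', 'O'), 'DOUBLE')."""
    return Counter(
        (
            tuple(sorted((b.GetBeginAtom().GetSymbol(), b.GetEndAtom().GetSymbol()))),
            str(b.GetBondType()),
        )
        for b in mol.GetBonds()
    )


def validate_ligand(submitted: Chem.Mol, reference: Chem.Mol) -> List[str]:
    """Return a list of validation problems (empty if the ligand passes)."""
    problems = []

    sub_atoms = Counter(a.GetSymbol() for a in submitted.GetAtoms())
    ref_atoms = Counter(a.GetSymbol() for a in reference.GetAtoms())
    if sub_atoms != ref_atoms:
        problems.append("atom counts or types differ")

    if submitted.GetNumBonds() != reference.GetNumBonds():
        problems.append("number of bonds differs")

    if bond_signature(submitted) != bond_signature(reference):
        problems.append("bond types or bonded atom pairs differ")

    # Connectivity (and, by assumption, chirality) via the MCS.
    mcs = rdFMCS.FindMCS([submitted, reference], matchChiralTag=True, timeout=10)
    if mcs.numAtoms != reference.GetNumAtoms():
        problems.append(
            f"maximum common substructure has {mcs.numAtoms} atoms, "
            f"expected {reference.GetNumAtoms()}"
        )
    return problems
```

The returned list can serve as the body of a validation report: ethanol written as "OCC" passes against a "CCO" reference, while dimethyl ether ("COC") has the same composition but fails the bond-pair and substructure checks.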
2.2.4 | Macromolecule-ligand complex evaluation measures.
Previous ligand docking challenges such as Teach Discover Treat
(TDT)39, Continuous
Evaluation of Ligand Prediction Performance (CELPP)40, and Drug Discovery
Data Resource (D3R)34-37 have used two
main types of metrics to assess how well participants can model
receptor-ligand complexes. These evaluate how close a predicted ligand
is to the target within the binding site in absolute terms, using the
RMSD metric, and how well the native receptor-ligand interactions are
reproduced. The CASP experiment brings additional assessment challenges: (1)
because the receptor structure is not given but rather modeled, ligands
in the model and reference complexes can be bound to different
configurations of binding sites, and thus calculation of any
superposition-based score requires a preliminary alignment of the
ligand-containing binding pockets in the two complexes, which is not a
trivial task; (2)
chain mapping needs to be established; (3) incomplete ligands in some
targets require partial graph matching for the symmetry correction; and
(4) multiple copies of ligands in the targets and models have to be
mapped (assigned) uniquely, in order to avoid scoring target or
predicted ligands multiple times.
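To make challenge (4) concrete, the sketch below performs a greedy one-to-one assignment from a matrix of pairwise scores. It is only one possible strategy and is not necessarily the one used in the assessment; the function name assign_ligands and the convention that lower scores are better (as for an RMSD) are assumptions.

```python
from typing import Dict, List, Optional


def assign_ligands(scores: List[List[Optional[float]]]) -> Dict[int, int]:
    """Map target ligand index -> model ligand index, each used at most once.

    scores[i][j] is the score between target ligand i and model ligand j
    (lower is better); None marks pairs that cannot be compared.
    """
    # All comparable pairs, best (lowest) score first.
    pairs = sorted(
        (s, i, j)
        for i, row in enumerate(scores)
        for j, s in enumerate(row)
        if s is not None
    )
    used_targets, used_models = set(), set()
    mapping: Dict[int, int] = {}
    for _, i, j in pairs:
        if i in used_targets or j in used_models:
            continue  # one of the two ligands is already assigned
        mapping[i] = j
        used_targets.add(i)
        used_models.add(j)
    return mapping
```

With scores [[1.0, 0.5, None], [0.6, 0.4, 3.0]], the best pair (target 1, model 1) is fixed first, so target 0 falls back to model 0 and no ligand is scored twice.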
To address these challenges, we developed two scores, which are
described in more detail in the CASP15 Ligand Assessment paper7. The Binding-Site
Superposed, Symmetry-Corrected Pose Root Mean Square Deviation
(BiSyRMSD) score defines the binding sites and the superpositions used to
compute RMSDs between the target and model ligands. The Local Distance
Difference Test for Protein-Ligand Interactions (lDDT-PLI) measure
assesses how well native contacts between the receptor and the ligand
are reproduced in the model with an lDDT-based metric and symmetry
correction. Used in combination, these scores give a more complete
account of how accurately receptor-ligand complexes are modeled.
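The two ideas underlying these scores can be illustrated with a simplified sketch. It is not the BiSyRMSD or lDDT-PLI implementation: binding-site definition, superposition, chain mapping, and contact selection are omitted, the symmetry mappings are assumed to be supplied by the caller, and the function names are hypothetical. The distance thresholds of 0.5, 1, 2, and 4 Å are the standard lDDT thresholds.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Plain RMSD between two equally ordered, already superposed atom lists."""
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))


def symmetry_corrected_rmsd(
    target: Sequence[Coord],
    model: Sequence[Coord],
    symmetries: Sequence[Sequence[int]],
) -> float:
    """Minimum RMSD over atom mappings; perm[k] is the model atom matched
    to target atom k (mappings would come from graph matching)."""
    return min(rmsd(target, [model[p] for p in perm]) for perm in symmetries)


def lddt_like(
    ref_dist: List[float],
    model_dist: List[float],
    thresholds: Sequence[float] = (0.5, 1.0, 2.0, 4.0),
) -> float:
    """Fraction of reference distances preserved in the model, averaged
    over the thresholds."""
    fractions = [
        sum(abs(r - m) < t for r, m in zip(ref_dist, model_dist)) / len(ref_dist)
        for t in thresholds
    ]
    return sum(fractions) / len(fractions)
```

For a two-atom symmetric ligand whose atoms are swapped in the model, the identity mapping gives a nonzero RMSD, while including the swapped mapping in symmetries brings it to zero. This is the effect the symmetry correction is meant to capture.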