Alexandre Bonvin bio photo

Computational Structural Biology group focusing on dissecting, understanding and predicting biomolecular interactions at the molecular level.

Email Github Youtube Subscribe


Supported by:



HADDOCK2.2 manual

Chemical Shift-Based Protein DockingTutorial


In order to validate the HADDOCK program, we initially performed the docking on three known complexes (see the HADDOCK paper. We are providing in the haddock/examples directory the data for all three complexes. This tutorial deals however only with the E2A-HPr complex. The structures in the free form1,2 and NMR titration data3,4 are available. The complex structure had been solved experimentally by NMR5 (PDB entry 1GGR).
    Note that the example uses the NMR solution structure of HPr (entry 1HDN) to allow the docking from an ensemble of conformations. Because of that, the definition of active and passive residues might differ from the one in the HADDOCK paper since the filtering is based on the average accessibilities (see the AIR restraints section).
You should follow the following steps to run this tutorial:
  1. Setup
  2. Defining the AIR restraints
  3. Setting up the project for HADDOCK
  4. Defining the parameters for the docking and running HADDOCK
  5. Analyzing the docking results
  6. Rerunning the HADDOCK analysis on a cluster basis
  7. Comparing the clusters and the reference structure


1. Setup

To start this tutorial, first copy the protein-protein directory from the haddock2.2/examples directory:
  cp -r $HADDOCK/examples/protein-protein .
    Note: The $HADDOCK environment variable should be defined if haddock was properly installed.
Go into the protein-protein directory.

You will find in there the PDB files of the E2A and the HPr proteins:
 - e2a_1F3G.pdb
 - e2aP_1F3G.pdb  (with phosphorylated histidine)

For HPr, we will use the first 10 conformations from the PDB entry 1HDN. These can be found in the hpr> directory:
 - hpr_1.pdb
 - ...
 - hpr_10.pdb

The file containing the listing of those 10 structures for docking from an ensemble of structures is provided (Note: edit it to modify the directory path!):
 - hpr-files.list

The PDB file of the reference complex can be found in the ana_scripts directory and is called
 - e2a-hpr_1GGR.pdb. 

The (average) per residue solvent accessibilities calculated with NACCESS (see the AIR restraints section) are also provided for convenience
 - e2a_1F3G.rsa
 - hpr/hpr_rsa_ave.lis

You also find a number of additional files containing the AIR restraints that we used for calculations,
 - e2a-hpr_air.tbl

a number of Rasmol scripts for visualization (see the "Defining AIRs" section) of the active and passive residues for E2A and HPr,
 - e2a_rasmol_active.script
 - e2a_rasmol_active-passive.script
 - hpr_rasmol_active.script
 - hpr_rasmol_active-passive.script

two example files for setting up a new project
 - new.html-refe

and to define all the parameters for docking
 - run.cns-refe

(see the Defining the parameters for the docking and running HADDOCK section for usage and parameter definitions)

Various analysis scripts can be found in the ana_scripts directory. They will allow you to calculate the interface RMSD (i-RMSD) (RMSD calculated on backbone atoms of all residues within 10A from the partner molecule) and ligand RMSD (l-RMSD) (RMSD calculated on the smallest component (hpr) after fitting on the largest component (e2a)). These are standard CAPRI definitions. A script is also provided that allows you to calculate the fraction of native contacts. For usage and details refer to the README file in the ana_scripts directory.

Finally, a script to copy data from an example HADDOCK run is provided as:
 - copy_hadddock_files.csh


2. Defining the AIR restraints

Some general explanations about how to define the interface can be found in the HADDOCK AIR restraints section.

For this particular example, based on the available NMR titration data3,4, and the relative solvent accessibilities, we defined 11 active residues for E2A:
  • active: D38, V40, I45, V46, K69, F71, S78, E80, D94, V96 and S141
and 10 active residues for HPr:
  • active: H15, T16, R17, A20, F48, K49, Q51, T52, G54 and T56

    Note: The active residues and the filtered ones can be viewed in Rasmol using the provided Rasmol scripts, e2a_rasmol_active.script and hpr_rasmol_active.script. For example for E2A, start rasmol with
        rasmol e2a_1F3G.pdb
    
    and then, at the rasmol prompt type:
        RasMol> source e2a_rasmol_active.script
    
    The active residues are colored in red and the filtered-out active residues (if any) in yellow. For Hpr, three active residues were filtered out because they are not solvent accessible. These are residues 50,53 and 55. Control in the hpr_rsa_ave.lis file giving the relative solvent accessible surface area per residue for hpr that these three residues are indeed not solvent accessible (The criteria used is +SD > 50%).
You will now have to define the passive residues. For this, for each structure in turn (E2A and HPr), load a PDB file in Rasmol, highlight all active residues (in Rasmol source the corresponding e2a/hpr_rasmol_active.script) and pick all neighbors of active residues. Filter then the selected residues with their solvent accessibility (see the filtering residues or filtering from an ensemble of structures sections).
    Note: Rasmol scripts containing all the active and passive residues that we defined for E2A and HPr are provided for comparison (e2a_rasmol_active-passive.script and hpr_rasmol_active-passive.script. For both E2A and HPr we defined 9 passives residues:

    • E2A passive: P37, V39, E43 G68, E72, E97, E109 and K132

    • HPr passive: P11, N12, Q21, K24, K40, S41, L47, Q57 and E85
Having defined active and passive residues for both molecules, you can now generate the AIR restraint file for use in HADDOCK. For this go to the HADDOCK project setup section, click on "generate AIR restraint file" and follow the instructions.


3. Setup a new project for HADDOCK

To start a new project in HADDOCK you have to define a number of files such as input PDB files for both molecules, various restraint files and the path and name of the project. For this, go again to the project setup page, click on "start a new project" and follow the instructions.

After saving the new.html file to disk, type haddock2.2 in the same directory. This will generate a run directory containing all necessary information to run haddock. See the "a new project" section on the HADDOCK home page for more information and for a description of the content and organization of the generated run directory.
milou.abonvin.examples/protein-protein> haddock2.2

##############################################################################
#                                                                            #
# starting HADDOCK                                                           #
# Author: Alexandre Bonvin, Utrecht University                               #
#                                                                            #
#         N-components version of HADDOCK (current maximum is 6)             #
#                                                                            #
##############################################################################
    
Starting HADDOCK on: 2014-09-18 11:59:09
 
HADDOCK version:   2.2
Python version: 2.7.2 (default, Apr  4 2012, 13:50:40) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-51)]
PYTHONPATH system variable contains:
['/home/abonvin/haddock_git/haddock2.2/Haddock', '/home/software/software/python2.7/lib/python2.7/site-packages/setuptools-0.9.8-py2.7.egg', '/home/software/software/python2.7/lib/python2.7/site-packages/argparse-1.2.1-py2.7.egg', '/home/software/software/modeller_v9.11/lib/x86_64-intel8', '/home/software/software/modeller_v9.11/modlib', '/home/software/software/numpy_stable/lib/python', '/home/software/software/scipy/lib/python', '/home/software/software/biopython/build/lib.linux-x86_64-2.7', '/home/software/software/nose/lib/python', '/home/enmr/services/HADDOCK/spyder', '/home/software/software/cython/build/lib.linux-x86_64-2.7', '/home/software/software/matplotlib/lib/python2.7/site-packages', '/home/software/software/moderna_source_1.7.1/build/lib', '/home/abonvin/haddock_git/haddock2.2', '/home/software/software/python2.7/lib/python27.zip', '/home/software/software/python2.7/lib/python2.7', '/home/software/software/python2.7/lib/python2.7/plat-linux2', '/home/software/software/python2.7/lib/python2.7/lib-tk', '/home/software/software/python2.7/lib/python2.7/lib-old', '/home/software/software/python2.7/lib/python2.7/lib-dynload', '/home/software/software/python2.7/lib/python2.7/site-packages']
reading parameters from the file ./new.html
  setting some variables:
  submit_save set to: Save updated parameters
  N_COMP set to: 2
  PROJECT_DIR set to: .
  HADDOCK_DIR set to: ../..
  DANI1_FILE set to: ./dani.tbl
  RUN_NUMBER set to: 1
  PROT_SEGID_2 set to: B
  PROT_SEGID_1 set to: A
  PDB_FILE1 set to: ./e2aP_1F3G.pdb
  AMBIG_TBL set to: ./e2a-hpr_air.tbl
  PDB_LIST2 set to: ./hpr-files.list
  PDB_FILE2 set to: ./hpr/hpr_1.pdb
project /home/abonvin/haddock_git/haddock2.2/examples/protein-protein exists, run 1 does not exist
create the new run 1
  copying ./dani.tbl 
    to /home/abonvin/haddock_git/haddock2.2/examples/protein-protein/run1/data/dani/dani1.tbl
   ./e2aP_1F3G.pdb  copied to  /home/abonvin/haddock_git/haddock2.2/examples/protein-protein/run1/data/sequence after removing the chain and segIDs
   ./hpr/hpr_1.pdb  copied to  /home/abonvin/haddock_git/haddock2.2/examples/protein-protein/run1/data/sequence after removing the chain and segIDs
editing run.cns: setting the default values
copying new.html to /home/abonvin/haddock_git/haddock2.2/examples/protein-protein/run1/data/new.html
created new run1 for the project /home/abonvin/haddock_git/haddock2.2/examples/protein-protein
  copying AIR restraint data to ./run1/data/distances/ambig.tbl
  copying AIR restraint data to ./run1/structures/it0/ambig.tbl
  copying AIR restraint data to ./run1/structures/it1/ambig.tbl
  copying PDB file list of molecule  2  to  ./run1/data/sequence/file_B.list
  copying PDB files of the list of molecule  2  to  ./run1/data/ensemble-models/
  removing chain and segIDs and copying PDB file  ./hpr/hpr_1.pdb  of list to  ./run1/data/ensemble-models/
  removing chain and segIDs and copying PDB file  ./hpr/hpr_2.pdb  of list to  ./run1/data/ensemble-models/
  removing chain and segIDs and copying PDB file  ./hpr/hpr_3.pdb  of list to  ./run1/data/ensemble-models/
  removing chain and segIDs and copying PDB file  ./hpr/hpr_4.pdb  of list to  ./run1/data/ensemble-models/
  removing chain and segIDs and copying PDB file  ./hpr/hpr_5.pdb  of list to  ./run1/data/ensemble-models/
  removing chain and segIDs and copying PDB file  ./hpr/hpr_6.pdb  of list to  ./run1/data/ensemble-models/
  removing chain and segIDs and copying PDB file  ./hpr/hpr_7.pdb  of list to  ./run1/data/ensemble-models/
  removing chain and segIDs and copying PDB file  ./hpr/hpr_8.pdb  of list to  ./run1/data/ensemble-models/
  removing chain and segIDs and copying PDB file  ./hpr/hpr_9.pdb  of list to  ./run1/data/ensemble-models/
  removing chain and segIDs and copying PDB file  ./hpr/hpr_10.pdb  of list to  ./run1/data/ensemble-models/
  Enforcing PDB file absolute path for ./run1/data/ensemble-models/
new project . has been set up.
Now you have to edit run.cns in ./run1
##############################################################################
Finishing HADDOCK on:  2014-09-18 11:59:09
    Note:An example of a new.html file can be found in the haddock/examples/protein-protein directory as new.html-refe.


4. Defining the parameters for the docking and running HADDOCK

The next step consists in editing the various parameters that will govern the docking process in the run.cns file in the newly generated run directory. For this, again go to the project setup page, enter the path of the run.cns file and click on "edit. Refer to the run.cns section for a description of the various parameters.
    Note: Most of the parameters in run.cns are set by default for the e2a-hpr example and you will only need to modify a few of those.

  • Check the various paths to your project directory and the filenames to be used ("Filenames" section)

  • Check the definition of the semi-flexible interface ("Definition of the interface" section). The default in HADDOCK2.2 is -1, meaning that the semi-flexible interface will be automatically defined by HADDOCK from a contact analysis.

  • Check the number of structures to generate at the various steps and decrease those if needed (strongly recommended if you are going to run all the structure calculations on a single processor). ("Number of structures to dock" section)

    For the present tutorial, keep the default parameters (1000 structures for the rigid body docking stage and 200 afterwards). You will not perform a complete HADDOCK run but simply copy the data from the provided example run directory and move on with the analysis of the results

  • Check the queue commands for running the jobs, the location of the cns executable and the number of jobs to run simultaneously. ("Parallel jobs" section)

    For the present tutorial, use csh as queue command and 1 as number of jobs. The CNS executable path should be properly set.

Once you have properly set all parameters, save the modified run.cns file to disk.
    Note:An example of a run.cns file can be found in the haddock/examples/protein-protein directory as run.cns-refe.
You would now be ready to run HADDOCK by simply typing haddock2.2 in the same directory and the docking will start.

A description of what happens in the various HADDOCK steps can be found in the the docking section of the online HADDOCK manual.

Examine now the content of the various directories. The generated structures of the complex can be found in:
  • structures/it0: structures after rigid-body docking
  • structures/it1: structures after semi-flexible simulated annealing
  • structures/it1/water: structures after water refinement
In each of these directories you will also find three files containing the list of sorted structures in various format for use by CNS and HADDOCK:
  • file.nam: list of PDB filenames only
  • file.list: list of PDB filenames for use in CNS with the haddock score that has been used for sorting
  • file.cns: list of pdb filenames in yet another cns format.


5. analyzing the docking results

in structures/it1 and structures/it1/water you will find an analysis directory containing the results of the analysis automatically performed by haddock. refer to the analysis section of the online haddock manual for a description of the various analysis performed and inspect the content of some of the described files (mainly those with extensions .disp and .lis.

An important part come now: the manual analysis of the results (see also the related manual analysis section of the online haddock manual). since typically the restraints used to drive the docking are highly ambiguous, several solutions will be generated. clustering the solutions and performing a manual analysis is therefore required. a number of scripts are provided in the tools directory for this purpose. you can either copy them to the directory into which you will run the manual analysis or call them using:
  $HADDOCKTOOLS/script_name
a first typical step is to gather various energetics violation and rmsd from lowest energy structure statistics. the ana_structures.csh script can be used for this purpose. it makes use of profit for rmsd calculations if present and defined.

go into the structures/it1/water directory and run the analysis:
  cd structures/it1/water
  $HADDOCKTOOLS/ana_structures.csh 
the csh script will extract information from the headers of the pdb files and calculate the backbone rmsd from the lowest energy structure. it generates a number of files, file.nam_bsa, file.nam_ener, file.nam_rmsd and file.nam_viol containing respectively the buried surface area, various energy terms, the backbone rmsd from the lowest energy structure and the violations statistics.

the generated files are combined into one file called
  • structures_haddock-sorted.stat
that keeps the pdb file order defined in file.nam. a number of additional files are then generated with various sorting schemes:
  • structures_air-sorted.stat: sorting based on distance restraint energy
  • structures_airviol-sorted.stat: sorting based on number of distance violations
  • structures_bsa-sorted.stat: sorting based on buried surface area
  • structures_Edesolv-sorted.stat: sorting based on the empirical desolvation energy
  • structures_dh-sorted.stat: sorting based on total "binding" energy (sum of the total energies (both bonded and non-bonded terms) of the individual molecules - total energy of the complex)
  • structures_ene-sorted.stat: sorting based on total energy (sum of various restraints energies and intermolecular van der waals and electrostatic energies)
  • structures_nb-sorted.stat: sorting based on non-bonded intermolecular energy
  • structures_nbw-sorted.stat: sorting based on weighted non-bonded intermolecular energy ( 1*evdw + 0.1*eelec)
  • structures_rmsd-sorted.stat: sorting based on rmsd from lowest haddock score structure
These files are useful to generate various plots such as for example intermolecular energy versus rmsd from the lowest energy structure to visualize the clusters. The graphic program Xmgr/Grace can be used for this purpose. To facilitate things, we are providing a csh script that will generate a xmgrace file called ene_rmsd.xmgr by specifying an input file and the column numbers to be plotted:
   $HADDOCKTOOLS/make_ene-rmsd_graph.csh 3 2 structures_haddock-sorted.stat
Columns 2 and 3 correspond to the HADDOCK score and backbone rmsds from the lowest HADDOCK score structure, respectively. You can then visualize it with:
   xmgrace ene_rmsd.xmgr
    Question:Considering the plot of the HADDOCK scores versus rmsds from the lowest HADDOCK score structure, how many clusters do you expect?
You could in a similar manner make plots of other terms. The first line of the structures_....stat files gives a description of the various columns. You could also repeat this kind of analysis after rigid-body docking (it0) and after semi-flexible annealing (it1). Note that the rmsd calculations for the rigid-body solutions might take a while considering that 1000 structures where calculated at that stage.

The next step consists in clustering the solutions based on the backbone rmsd matrix calculated by HADDOCK in the analysis directory. The RMSDs are calculated by first superimposing the structures on the flexible interface backbone atoms of molecule A and calculating the RMSD on the flexible interface backbone atoms of the other molecules. This RMSD can be termed: "ligand interface RMSD". The matrix (half of it) is stored in the analysis directory in a file called fileroot_rmsd.disp where fileroot is the root name used for HADDOCK and defined in the run.cns file.

Go into the water/analysis, locate that file and perform the clustering with the cluster_struc program provided with HADDOCK.
   $HADDOCKTOOLS/cluster_struc e2a-hpr_rmsd.disp 7.5 4
e2a-hpr_rmsd.disp is the file containing the rmsd matrix, 7.5 the rmsd distance cutoff in Angstrom and 4 the minimum number of members to define a cluster. See also the pairwise RMSD matrix section of the analysis manual page.

The output looks like:
Cluster 1 -> 3 1 2 4 5 6 7 8 10 11 12 13 14 17 19 20 21 22 23 25 26 27 28 29 30 36 37 38 42 43 45 47 53 54 55 56 57 58 60 61 63 67 68 69 70 71 72 73 75 76 77 78 79 81 83 84 86 88 89 90 93 94 95 96 97 98 100 101 102 103 104 105 106 108 110 115 116 119 121 122 123 125 126 127 129 130 131 133 134 135 136 137 138 139 141 142 143 146 147 148 149 151 152 153 156 157 158 161 162 163 164 165 166 167 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 188 189 190 192 193 194 195 197 199 200
Cluster 2 -> 15 16 18 31 32 39 40 41 46 48 49 50 51 62 65 66 74 80 82 92 107 109 111 117 145 154 159 196
Cluster 3 -> 114 9 24 44 52 59 112 113 120 124 140 155
Cluster 4 -> 33 34 35 85 91 99 132 150 168
Cluster 5 -> 144 87 118 187 191 198
where the structure numbers in the clusters correspond to the position of the PDB file in the file.nam file (or directly to the number of the structures in the analysis directory.

Try different cutoffs and compare the results. With a too small cutoff no clusters or only very small clusters are found. When increasing the cutoff, the clusters will grow and some might merge. There is no ideal value. It depends on the degree of flexibility present in the structure, the restraints used for docking and, therefore should be optimized for each system. A common cutoff value we are using is 7.5A. For more details see the clustering section of the analysis manual page.
    Note: When searching for a proper cutoff, it is a good idea to check how many structures do fall into a cluster. This number can be obtained for example by executing the following command and substract 3 times the number of lines, e.g.:
       $HADDOCKTOOLS/cluster_struc e2a-hpr_rmsd.disp 7.5 4 | wc
          5     210     738
    
    The first number give the number of lines and second number corresponds to the numbers of words. This means in this example that 210-(3*5)=195 structures do cluster. Ideally, you would like a large number of structures to cluster (at least 50%). The structure numbers correspond to the rank of the HADDOCK-sorted structure. It is also advised to check if the best structures do appear in one cluster.

Once you have found a proper cutoff, repeat the clustering, saving this time the output to a file, e.g.:
   $HADDOCKTOOLS/cluster_struc e2a-hpr_rmsd.disp 7.5 4 >cluster7.5.out

This file can now be used to analyze the solutions on a cluster basis and obtain cluster average values for the various energy terms, violations, ... For this, go back into the water directory and use yet another provided analysis script called ana_clusters.csh:
   ana_clusters.csh: calculates average energies and RMSDs on a cluster basis
 
     usage: ana_clusters.csh [-best num] clustering_file_output
 
     The -best option allows to specify how many structures are being used to
     to calculate the average cluster values. If specified, additional files with
     extension .stat_best_num will be created.

Run the analysis, specifying how many of the best (lowest energy) structures are being used to calculate cluster averages and the path and name of the file containing the output of the cluster_struc program:
   $HADDOCKTOOLS/ana_clusters.csh -best 4 analysis/cluster7.5.out

This script generates a number of files containing for each cluster the average backbone rmsd from the lowest energy structure of the cluster, the average backbone rmsd from the overall lowest energy structure, the number of members of the cluster, the various average energies, violations, buried surface area with the respective standard deviations:
  • clusters_haddock-sorted.stat contains the various cluster averages, sorted accordingly to the HADDOCK score
  • clusters_ene-sorted.stat contains the various cluster averages, sorted accordingly to the intermolecular energy (restraints+vdw+elec)
  • clusters_dH-sorted.stat contains the various cluster averages, sorted accordingly to the binding energy (sum of the total energies (both bonded and non-bonded terms) of the individual molecules - total energy of the complex)
  • clusters_Edesolv-sorted.stat contains the various cluster averages, sorted accordingly to the empirical desolvation energy
  • clusters_nb-sorted.stat contains the various cluster averages, sorted accordingly to the intermolecular non-bonded energy (vdw+elec)
  • clusters_nbw-sorted.stat contains the various cluster averages, sorted accordingly to the weighted intermolecular non-bonded energy (vdw+0.1*elec)
  • clusters_air-sorted.stat contains the various cluster averages, sorted accordingly to the AIR energy
  • clusters_bsa-sorted.stat contains the various cluster averages, sorted accordingly to the buried surface area
  • clusters_sani-sorted.stat contains the various cluster averages, sorted accordingly RDC (direct, SANI) restraint energy
  • clusters_vean-sorted.stat contains the various cluster averages, sorted accordingly to the RDC (intervector projection angles, VEAN) restraint energy
and for each cluster, the three files used by CNS and HADDOCK containing the energy sorted cluster members:
  • file.nam_clust#
  • file.list_clust#
  • file.cns_clust#
If the -best option is used additional files are produced with _best#num as extension containing the average values and the PDB filenames for the #num best structures of each cluster.
    Question: Perform the cluster analysis and look at the HADDOCK-sorted results. What we would consider the best solution would be the cluster with the lowest HADDOCK score. Since clusters typically differ in size, it is best to compare the results of the cluster averages obtained from the XX best structures only (so that a similar number of structures are considered per cluster). Which cluster would be the best? How many members does it count?


6. Rerunning the HADDOCK analysis on a cluster basis

Having performed the cluster analysis, you can now rerun the HADDOCK analysis for the best structures of each cluster to obtain various statistics on a "cluster bests" basis. One needs for this the cluster-specific file.nam_clust#, file.list_clust# and file.cns_clust# files. A csh script called make_links.csh is provided that will move the original file.nam, file.list and file.cns files to file.nam_all, file.list_all, file.cns_all and the same with the analysis directory. It will then create links to the appropriate files (file.nam_clust#,...) and to a new analysis_clust# directory.

For example, to rerun the analysis for the best 10 structures of the first cluster type in the water directory:
   $HADDOCKTOOLS/make_links.csh clust1_best4
   cd ../../..
   haddock2.2
The cd command brings you back into the main run directory from where you start again HADDOCK. Only the analysis of the best 4 structures of the first cluster in the water will be run. Once finished, go to the respective analysis directory and inspect the various files. The rmsd from the average structures should now be low (check rmsave.disp).

Repeat this analysis for the second cluster.


7. Comparing the clusters and the reference structure

Having run the HADDOCK analysis on a cluster basis you should now have two new directories in the water directory:
  • analysis_clust1_best4
  • analysis_clust2_best4
Each analysis directory contains now cluster specific statistics. You can also visualize the clusters. For Rasmol, use first the joinpdb perl script to concatenate the various PDb files into one singe file:
   $HADDOCKTOOLS/joinpdb -o e2a-hpr_clust1.pdb e2a-hprfit_*.pdb
and then start Rasmol with:
   rasmol -nmrpdb e2a-hpr_clust1.pdb
Use Display Backbone to display only the backbone trace and Colours Group for a color coding from N- to C-terminus.

From the average cluster HADDOCK score, cluster 1 is clearly the best.

It can however happen that the difference in score is small. Since the AIR restraints defined from NMR chemical shift perturbation data (or other sources) are highly ambiguous, they are typically easily satisfied and might not discriminate well between various orientations, e.g. a 180 degree rotation of one protein with respect to the other around an axis perpendicular to the interface. Some degree of asymmetry either in the interface (electrostatic, shape) or in additional experimental data is thus required, or even better, additional experimental information could be used to discriminate between clusters (e.g. mutagenesis or conservation data).

    Inspect in Rasmol or any graphic program the best solution of each cluster (it is the first one in the respective analysis directories, e2a-hprfit_1.pdb) and try to figure out the differences between them. Check in particular the relative orientation of the two protein.

For this particular test case, the reference structure is known and has been solved by NMR5. In order to compare our docking solutions with the reference structure we are providing csh scripts called i-rmsd_to_target.csh for interface RMSD calculation and l-rmsd_to_target.csh for ligand RMSD calculations. These can be found in the e2a-hpr/ana_scripts directory. These script takes as arguments PDB filenames and calculates various rmsd values from the reference structure using ProFit.

Copy the script in the clusters respective analysis directories and run them with as argument *.pdb:
   ./i-rmsd_to_target.csh *.pdb
   ./l-rmsd_to_target.csh *.pdb
Four files will be created:
  • i-RMSD.disp: inteface RMSD for the PDB files in the same order as passed as arguments
  • i-RMSD-sorted.disp: interface RMSD sorted from smallest to largest
  • l-RMSD.disp: ligand RMSD for the PDB files in the same order as passed as arguments
  • l-RMSD-sorted.disp: ligand RMSD sorted from smallest to largest
    Question: Which one of the two clusters is the closest to the target?
According to CAPRI criteria, a docking model with i-RMSD or l-RMSD below 1A can be considered as a high accuracy predition. A medium quality prediction will have a i-RMSD below 2A and/or l-RMSD below 5A. Acceptable predictions should have i-RMSD below 4A or l-RMSD below 10A.
    Note that you can also run those scripts for it0, it1 or water structures. Copy the scripts in the respective directories and type:
    ./i-rmsd_to_target.csh `cat file.nam`
    
    The i-RMSD.disp file will now be sorted in according to the HADDOCK score.

Another measure of quality defined in CAPRI is the fraction of native contacts. Native contacts are defined as intermolecular contact between residues within 5A from each other. A script to calculate the fraction of native contacts is provided in the e2a-hpr/ana_scripts directory: fraction-native.csh. It takes as arguments a list of PDB files in the same manner as the RMSD scripts described above. For example, to calculate the fraction of native contacts of the water-refined structures, copy this script in the water directory and type:
./fraction-native.csh `cat file.nam`
This will generate a file called file.nam_fnat containing the fraction of native contacts.


Finally, there is one additional piece of information that we did not use for the docking and that could help in discriminating further the clusters:
    E2A stands for phosphotransferase enzyme II and HPr for histidine-containing phosphocarrier protein. There is thus transfer of a phosphate group between two histidines of those two proteins!

Question: Inspect again graphically the best solution of each cluster and find out if, in one of them, there are two histidines (one from each protein) at reasonable proximity from each other?

Question:Which cluster is consistent with this additional information?

Question:To improve the docking results, how would you incorporate this information in HADDOCK (the phosphate group is transfered between two imino groups of histidines)?



References
  1. van Nuland N.A. et al. (1994) J Mol Biol 237, 544.
  2. Worthylake D. et al. (1991) Proc Natl Acad Sci US 88, 10382.
  3. van Nuland N.A. et al. (1995) J Mol Biol 246, 180.
  4. Chen Y. et al. (1993) Biochemistry 32, 32.
  5. Wang G. et al (2000) Embo J 19, 5635.