Chemical Shift-Based Protein Docking Tutorial
In order to validate the HADDOCK program, we initially performed the docking on three known complexes (see the HADDOCK paper). The data for all three complexes are provided in the haddock/examples directory. This tutorial, however, deals only with the E2A-HPr complex. The structures in the free form [1,2] and NMR titration data [3,4] are available. The complex structure has been solved experimentally by NMR [5] (PDB entry 1GGR).
Note that the example uses the NMR solution structure
of HPr (entry 1HDN) to allow the docking from an ensemble of
conformations. Because of that, the definition of active and
passive residues might differ from the one in the HADDOCK
paper since the filtering is based on the average accessibilities
(see the AIR restraints section).
- Defining the AIR restraints
- Setting up the project for HADDOCK
- Defining the parameters for the docking and running HADDOCK
- Analyzing the docking results
- Rerunning the HADDOCK analysis on a cluster basis
- Comparing the clusters and the reference structure
To start this tutorial, first copy the protein-protein directory from the haddock2.2/examples directory:
cp -r $HADDOCK/examples/protein-protein .
Note: The $HADDOCK environment variable should be defined if HADDOCK was properly installed.
You will find in there the PDB files of the E2A and the HPr proteins:
- e2a_1F3G.pdb
- e2aP_1F3G.pdb (with phosphorylated histidine)
For HPr, we will use the first 10 conformations from the PDB entry 1HDN. These can be found in the hpr directory:
- hpr_1.pdb - ... - hpr_10.pdb
The file containing the listing of those 10 structures for docking from an ensemble of structures is provided (Note: edit it to modify the directory path!):
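As a sketch of what such a list file looks like (the paths below are placeholders and must be adapted to your local directory layout; HADDOCK expects one double-quoted path per line):

```
"/your/path/protein-protein/hpr/hpr_1.pdb"
"/your/path/protein-protein/hpr/hpr_2.pdb"
"/your/path/protein-protein/hpr/hpr_10.pdb"
```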
The PDB file of the reference complex can be found in the ana_scripts directory and is called
The (average) per-residue solvent accessibilities calculated with NACCESS (see the AIR restraints section) are also provided for convenience:
- e2a_1F3G.rsa
- hpr/hpr_rsa_ave.lis
You will also find a number of additional files containing the AIR restraints that we used for the calculations, and a number of Rasmol scripts (see the "Defining AIRs" section) for visualizing the active and passive residues of E2A and HPr:
- e2a_rasmol_active.script
- e2a_rasmol_active-passive.script
- hpr_rasmol_active.script
- hpr_rasmol_active-passive.script
two example files, one for setting up a new project and one for defining all the parameters for the docking (see the "Defining the parameters for the docking and running HADDOCK" section for usage and parameter definitions),
Various analysis scripts can be found in the ana_scripts directory. They will allow you to calculate the interface RMSD (i-RMSD) (RMSD calculated on backbone atoms of all residues within 10A from the partner molecule) and ligand RMSD (l-RMSD) (RMSD calculated on the smallest component (hpr) after fitting on the largest component (e2a)). These are standard CAPRI definitions. A script is also provided that allows you to calculate the fraction of native contacts. For usage and details refer to the README file in the ana_scripts directory.
Finally, a script to copy data from an example HADDOCK run is provided as:
2. Defining the AIR restraints
Some general explanations about how to define the interface can be found in the HADDOCK AIR restraints section.
For this particular example, based on the available NMR titration data [3,4] and the relative solvent accessibilities, we defined 11 active residues for E2A and 10 for HPr:
- E2A active: D38, V40, I45, V46, K69, F71, S78, E80, D94, V96 and S141
- HPr active: H15, T16, R17, A20, F48, K49, Q51, T52, G54 and T56
Note: The active residues and the filtered ones can be viewed in Rasmol using the provided Rasmol scripts, e2a_rasmol_active.script and hpr_rasmol_active.script. For example, for E2A, start Rasmol with

rasmol e2a_1F3G.pdb

and then, at the Rasmol prompt, type:

RasMol> source e2a_rasmol_active.script

The active residues are colored in red and the filtered-out active residues (if any) in yellow. For HPr, three active residues were filtered out because they are not solvent accessible: residues 50, 53 and 55. Check in the hpr_rsa_ave.lis file, which gives the average relative solvent accessible surface area per residue for HPr, that these three residues are indeed not solvent accessible (the criterion used is average relative accessibility + SD > 50%).
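The filtering step can be sketched in a few lines of Python (the residue numbers and accessibility values below are made up for illustration; the real values come from the NACCESS output files):

```python
# Keep as "active" only those residues whose average relative solvent
# accessibility plus one standard deviation exceeds 50%, mirroring the
# filtering criterion described above.
def filter_active(candidates, accessibility):
    """candidates: residue numbers; accessibility: {resid: (avg, sd)} in %."""
    kept, filtered_out = [], []
    for resid in candidates:
        avg, sd = accessibility[resid]
        (kept if avg + sd > 50.0 else filtered_out).append(resid)
    return kept, filtered_out

# Illustrative (made-up) accessibilities for a few HPr residues:
acc = {49: (62.1, 8.0), 50: (12.3, 5.1), 51: (55.0, 10.2), 53: (8.7, 3.9)}
kept, dropped = filter_active([49, 50, 51, 53], acc)
print(kept, dropped)  # → [49, 51] [50, 53]
```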
- Note: Rasmol scripts containing all the active and passive residues that we defined for E2A and HPr are provided for comparison (e2a_rasmol_active-passive.script and hpr_rasmol_active-passive.script).
For E2A and HPr we also defined the following passive residues:
- E2A passive: P37, V39, E43, G68, E72, E97, E109 and K132
- HPr passive: P11, N12, Q21, K24, K40, S41, L47, Q57 and E85
3. Setting up a new project for HADDOCK
To start a new project in HADDOCK you have to define a number of files such as input PDB files for both molecules, various restraint files and the path and name of the project. For this, go again to the project setup page, click on "start a new project" and follow the instructions.
After saving the new.html file to disk, type haddock2.2 in the same directory. This will generate a run directory containing all necessary information to run haddock. See the "a new project" section on the HADDOCK home page for more information and for a description of the content and organization of the generated run directory.
milou.abonvin.examples/protein-protein> haddock2.2

##############################################################################
#                                                                            #
#                            starting HADDOCK                                #
#              Author: Alexandre Bonvin, Utrecht University                  #
#                                                                            #
#         N-components version of HADDOCK (current maximum is 6)             #
#                                                                            #
##############################################################################

Starting HADDOCK on: 2014-09-18 11:59:09

HADDOCK version: 2.2
Python version: 2.7.2 (default, Apr 4 2012, 13:50:40) [GCC 4.1.2 20080704 (Red Hat 4.1.2-51)]
PYTHONPATH system variable contains: ['/home/abonvin/haddock_git/haddock2.2/Haddock', '/home/software/software/python2.7/lib/python2.7/site-packages/setuptools-0.9.8-py2.7.egg', '/home/software/software/python2.7/lib/python2.7/site-packages/argparse-1.2.1-py2.7.egg', '/home/software/software/modeller_v9.11/lib/x86_64-intel8', '/home/software/software/modeller_v9.11/modlib', '/home/software/software/numpy_stable/lib/python', '/home/software/software/scipy/lib/python', '/home/software/software/biopython/build/lib.linux-x86_64-2.7', '/home/software/software/nose/lib/python', '/home/enmr/services/HADDOCK/spyder', '/home/software/software/cython/build/lib.linux-x86_64-2.7', '/home/software/software/matplotlib/lib/python2.7/site-packages', '/home/software/software/moderna_source_1.7.1/build/lib', '/home/abonvin/haddock_git/haddock2.2', '/home/software/software/python2.7/lib/python27.zip', '/home/software/software/python2.7/lib/python2.7', '/home/software/software/python2.7/lib/python2.7/plat-linux2', '/home/software/software/python2.7/lib/python2.7/lib-tk', '/home/software/software/python2.7/lib/python2.7/lib-old', '/home/software/software/python2.7/lib/python2.7/lib-dynload', '/home/software/software/python2.7/lib/python2.7/site-packages']

reading parameters from the file ./new.html
setting some variables:
submit_save set to: Save updated parameters
N_COMP set to: 2
PROJECT_DIR set to: .
HADDOCK_DIR set to: ../..
DANI1_FILE set to: ./dani.tbl
RUN_NUMBER set to: 1
PROT_SEGID_2 set to: B
PROT_SEGID_1 set to: A
PDB_FILE1 set to: ./e2aP_1F3G.pdb
AMBIG_TBL set to: ./e2a-hpr_air.tbl
PDB_LIST2 set to: ./hpr-files.list
PDB_FILE2 set to: ./hpr/hpr_1.pdb
project /home/abonvin/haddock_git/haddock2.2/examples/protein-protein exists, run 1 does not exist
create the new run 1
copying ./dani.tbl to /home/abonvin/haddock_git/haddock2.2/examples/protein-protein/run1/data/dani/dani1.tbl
./e2aP_1F3G.pdb copied to /home/abonvin/haddock_git/haddock2.2/examples/protein-protein/run1/data/sequence after removing the chain and segIDs
./hpr/hpr_1.pdb copied to /home/abonvin/haddock_git/haddock2.2/examples/protein-protein/run1/data/sequence after removing the chain and segIDs
editing run.cns: setting the default values
copying new.html to /home/abonvin/haddock_git/haddock2.2/examples/protein-protein/run1/data/new.html
created new run1 for the project /home/abonvin/haddock_git/haddock2.2/examples/protein-protein
copying AIR restraint data to ./run1/data/distances/ambig.tbl
copying AIR restraint data to ./run1/structures/it0/ambig.tbl
copying AIR restraint data to ./run1/structures/it1/ambig.tbl
copying PDB file list of molecule 2 to ./run1/data/sequence/file_B.list
copying PDB files of the list of molecule 2 to ./run1/data/ensemble-models/
removing chain and segIDs and copying PDB file ./hpr/hpr_1.pdb of list to ./run1/data/ensemble-models/
removing chain and segIDs and copying PDB file ./hpr/hpr_2.pdb of list to ./run1/data/ensemble-models/
removing chain and segIDs and copying PDB file ./hpr/hpr_3.pdb of list to ./run1/data/ensemble-models/
removing chain and segIDs and copying PDB file ./hpr/hpr_4.pdb of list to ./run1/data/ensemble-models/
removing chain and segIDs and copying PDB file ./hpr/hpr_5.pdb of list to ./run1/data/ensemble-models/
removing chain and segIDs and copying PDB file ./hpr/hpr_6.pdb of list to ./run1/data/ensemble-models/
removing chain and segIDs and copying PDB file ./hpr/hpr_7.pdb of list to ./run1/data/ensemble-models/
removing chain and segIDs and copying PDB file ./hpr/hpr_8.pdb of list to ./run1/data/ensemble-models/
removing chain and segIDs and copying PDB file ./hpr/hpr_9.pdb of list to ./run1/data/ensemble-models/
removing chain and segIDs and copying PDB file ./hpr/hpr_10.pdb of list to ./run1/data/ensemble-models/
Enforcing PDB file absolute path for ./run1/data/ensemble-models/
new project . has been set up.
Now you have to edit run.cns in ./run1

##############################################################################
Finishing HADDOCK on: 2014-09-18 11:59:09
Note: An example of a new.html file can be found in the haddock/examples/protein-protein directory.
4. Defining the parameters for the docking and running HADDOCK
The next step consists of editing the various parameters that govern the docking process in the run.cns file in the newly generated run directory. For this, again go to the project setup page, enter the path of the run.cns file and click on "edit". Refer to the run.cns section for a description of the various parameters.
- Note: Most of the parameters in run.cns are set by default
for the e2a-hpr example and you will only need to modify a few of those.
- Check the various paths to your project directory and the filenames
to be used ("Filenames" section)
- Check the definition of the semi-flexible interface
("Definition of the interface"
section). The default in HADDOCK2.2 is -1, meaning that the semi-flexible
interface will be automatically defined by HADDOCK from a contact analysis.
- Check the number of structures to generate at the various steps
and decrease those if needed (strongly recommended if you are going
to run all the structure calculations on a single processor).
("Number of structures to dock" section)
For the present tutorial, keep the default parameters (1000 structures for the rigid body docking stage and 200 afterwards). You will not perform a complete HADDOCK run but simply copy the data from the provided example run directory and move on with the analysis of the results.
- Check the queue commands for running the jobs, the location of the
cns executable and the number of jobs to run simultaneously.
("Parallel jobs" section)
For the present tutorial, use csh as queue command and 1 as number of jobs. The CNS executable path should be properly set.
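For illustration, the corresponding entries in run.cns would then look roughly as follows (a sketch only; the CNS executable path is a placeholder for your local installation, and you should verify the exact parameter names in your own run.cns):

```
{===>} queue_1="csh";
{===>} cns_exe_1="/usr/local/cns_solve/cns";
{===>} cpunumber_1=1;
```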
Note: An example of a run.cns file can be found in the haddock/examples/protein-protein directory.
A description of what happens in the various HADDOCK steps can be found in the docking section of the online HADDOCK manual.
Examine now the content of the various directories. The generated structures of the complex can be found in:
- structures/it0: structures after rigid-body docking
- structures/it1: structures after semi-flexible simulated annealing
- structures/it1/water: structures after water refinement
In each of these directories you will also find:
- file.nam: list of PDB filenames only
- file.list: list of PDB filenames for use in CNS, together with the HADDOCK score used for sorting
- file.cns: list of PDB filenames in yet another CNS format
5. Analyzing the docking results
In structures/it1 and structures/it1/water you will find an analysis directory containing the results of the analysis automatically performed by HADDOCK. Refer to the analysis section of the online HADDOCK manual for a description of the various analyses performed, and inspect the content of some of the described files (mainly those with extensions .disp and .lis).
An important part comes now: the manual analysis of the results (see also the related manual analysis section of the online HADDOCK manual). Since the restraints used to drive the docking are typically highly ambiguous, several solutions will be generated. Clustering the solutions and performing a manual analysis is therefore required. A number of scripts are provided in the tools directory for this purpose. You can either copy them to the directory in which you will run the manual analysis or call them using:

$HADDOCKTOOLS/script_name

A first typical step is to gather various statistics: energies, violations and RMSDs from the lowest energy structure. The ana_structures.csh script can be used for this purpose. It makes use of ProFit for the RMSD calculations, if present and defined.
Go into the structures/it1/water directory and run the analysis:

cd structures/it1/water
$HADDOCKTOOLS/ana_structures.csh

The csh script will extract information from the headers of the PDB files and calculate the backbone RMSD from the lowest energy structure. It generates a number of files, file.nam_bsa, file.nam_ener, file.nam_rmsd and file.nam_viol, containing respectively the buried surface area, various energy terms, the backbone RMSD from the lowest energy structure and the violation statistics.
The generated files are combined into a single file, sorted according to various criteria:
- structures_air-sorted.stat: sorting based on distance restraint energy
- structures_airviol-sorted.stat: sorting based on number of distance violations
- structures_bsa-sorted.stat: sorting based on buried surface area
- structures_Edesolv-sorted.stat: sorting based on the empirical desolvation energy
- structures_dh-sorted.stat: sorting based on total "binding" energy (sum of the total energies (both bonded and non-bonded terms) of the individual molecules - total energy of the complex)
- structures_ene-sorted.stat: sorting based on total energy (sum of the various restraint energies and the intermolecular van der Waals and electrostatic energies)
- structures_nb-sorted.stat: sorting based on non-bonded intermolecular energy
- structures_nbw-sorted.stat: sorting based on weighted non-bonded intermolecular energy (1*Evdw + 0.1*Eelec)
- structures_rmsd-sorted.stat: sorting based on RMSD from the lowest HADDOCK score structure
To plot, for example, the HADDOCK score as a function of the RMSD, use the provided make_ene-rmsd_graph.csh script:

$HADDOCKTOOLS/make_ene-rmsd_graph.csh 3 2 structures_haddock-sorted.stat

Columns 2 and 3 correspond to the HADDOCK score and the backbone RMSD from the lowest HADDOCK score structure, respectively. You can then visualize the resulting file with a plotting program.
- Question: Considering the plot of the HADDOCK scores versus the RMSDs from the lowest HADDOCK score structure, how many clusters do you expect?
The next step consists of clustering the solutions based on the backbone RMSD matrix calculated by HADDOCK in the analysis directory. The RMSDs are calculated by first superimposing the structures on the flexible interface backbone atoms of molecule A and then calculating the RMSD on the flexible interface backbone atoms of the other molecule(s). This RMSD can be termed "ligand interface RMSD". The matrix (half of it) is stored in the analysis directory in a file called fileroot_rmsd.disp, where fileroot is the root name used for HADDOCK and defined in the run.cns file.
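This fit-then-measure RMSD can be illustrated with a short NumPy sketch (an illustration only, not the HADDOCK implementation; the coordinates below are synthetic): superimpose on one atom set with the Kabsch algorithm, then report the RMSD over another set.

```python
import numpy as np

def fit_rmsd(fit_model, fit_ref, rms_model, rms_ref):
    """Superimpose the model on the reference using the 'fit' atom sets
    (Kabsch algorithm), then report the RMSD over the 'rms' atom sets."""
    mc, rc = fit_model.mean(axis=0), fit_ref.mean(axis=0)
    H = (fit_model - mc).T @ (fit_ref - rc)     # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    moved = (rms_model - mc) @ R.T + rc         # apply the fit to the other set
    return float(np.sqrt(((moved - rms_ref) ** 2).sum(axis=1).mean()))

# Synthetic check: a rigidly rotated/translated complex should give RMSD ~ 0.
rng = np.random.default_rng(0)
ref_A, ref_B = rng.normal(size=(12, 3)), rng.normal(size=(8, 3)) + 5.0
c, s = np.cos(0.7), np.sin(0.7)
G = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t = np.array([1.0, -2.0, 3.0])
print(round(fit_rmsd(ref_A @ G.T + t, ref_A, ref_B @ G.T + t, ref_B), 6))  # → 0.0
```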
Go into the water/analysis directory, locate that file and perform the clustering with the cluster_struc program provided with HADDOCK:
$HADDOCKTOOLS/cluster_struc e2a-hpr_rmsd.disp 7.5 4

Here e2a-hpr_rmsd.disp is the file containing the RMSD matrix, 7.5 the RMSD distance cutoff in Angstrom and 4 the minimum number of members to define a cluster. See also the pairwise RMSD matrix section of the analysis manual page.
The output looks like:
Cluster 1 -> 3 1 2 4 5 6 7 8 10 11 12 13 14 17 19 20 21 22 23 25 26 27 28 29 30 36 37 38 42 43 45 47 53 54 55 56 57 58 60 61 63 67 68 69 70 71 72 73 75 76 77 78 79 81 83 84 86 88 89 90 93 94 95 96 97 98 100 101 102 103 104 105 106 108 110 115 116 119 121 122 123 125 126 127 129 130 131 133 134 135 136 137 138 139 141 142 143 146 147 148 149 151 152 153 156 157 158 161 162 163 164 165 166 167 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 188 189 190 192 193 194 195 197 199 200
Cluster 2 -> 15 16 18 31 32 39 40 41 46 48 49 50 51 62 65 66 74 80 82 92 107 109 111 117 145 154 159 196
Cluster 3 -> 114 9 24 44 52 59 112 113 120 124 140 155
Cluster 4 -> 33 34 35 85 91 99 132 150 168
Cluster 5 -> 144 87 118 187 191 198

where the structure numbers in the clusters correspond to the position of the PDB file in the file.nam file (or directly to the number of the structures in the analysis directory).
Try different cutoffs and compare the results. With too small a cutoff, no clusters or only very small clusters are found. When increasing the cutoff, the clusters will grow and some might merge. There is no ideal value: it depends on the degree of flexibility present in the structure and on the restraints used for docking, and should therefore be optimized for each system. A common cutoff value we are using is 7.5A. For more details see the clustering section of the analysis manual page.
Note: When searching for a proper cutoff, it is a good idea to check how many structures fall into a cluster. This number can be obtained, for example, by executing the following command and subtracting 3 times the number of lines:

$HADDOCKTOOLS/cluster_struc e2a-hpr_rmsd.disp 7.5 4 | wc
      5     210     738

The first number gives the number of lines and the second the number of words. In this example, 210-(3*5)=195 structures cluster. Ideally, you would like a large number of structures to cluster (at least 50%). The structure numbers correspond to the rank of the HADDOCK-sorted structures. It is also advisable to check whether the best structures appear in one cluster.
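The same count can be obtained by parsing the cluster_struc output directly; here is a small Python sketch (the input string mimics the output format shown above):

```python
def count_clustered(cluster_output):
    """Count structures assigned to clusters in cluster_struc-style output.
    Each line looks like 'Cluster N -> i j k ...'; the three leading tokens
    ('Cluster', the cluster number, '->') are not structure numbers."""
    total = 0
    for line in cluster_output.strip().splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "Cluster":
            total += len(tokens) - 3
    return total

example = """Cluster 1 -> 3 1 2 4 5
Cluster 2 -> 15 16 18 31"""
print(count_clustered(example))  # → 9
```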
Once you have found a proper cutoff, repeat the clustering, saving this time the output to a file, e.g.:
$HADDOCKTOOLS/cluster_struc e2a-hpr_rmsd.disp 7.5 4 >cluster7.5.out
This file can now be used to analyze the solutions on a cluster basis and obtain cluster average values for the various energy terms, violations, etc. For this, go back into the water directory and use yet another provided analysis script called ana_clusters.csh:
ana_clusters.csh: calculates average energies and RMSDs on a cluster basis

usage: ana_clusters.csh [-best num] clustering_file_output

The -best option allows you to specify how many structures are used to calculate the average cluster values. If specified, additional files with the extension .stat_best_num will be created.
Run the analysis, specifying how many of the best (lowest energy) structures are being used to calculate cluster averages and the path and name of the file containing the output of the cluster_struc program:
$HADDOCKTOOLS/ana_clusters.csh -best 4 analysis/cluster7.5.out
This script generates a number of files containing, for each cluster, the average backbone RMSD from the lowest energy structure of the cluster, the average backbone RMSD from the overall lowest energy structure, the number of members of the cluster, and the various average energies, violations and buried surface areas with their respective standard deviations:
- clusters_haddock-sorted.stat contains the various cluster averages, sorted according to the HADDOCK score
- clusters_ene-sorted.stat contains the various cluster averages, sorted according to the intermolecular energy (restraints+vdw+elec)
- clusters_dH-sorted.stat contains the various cluster averages, sorted according to the binding energy (sum of the total energies (both bonded and non-bonded terms) of the individual molecules - total energy of the complex)
- clusters_Edesolv-sorted.stat contains the various cluster averages, sorted according to the empirical desolvation energy
- clusters_nb-sorted.stat contains the various cluster averages, sorted according to the intermolecular non-bonded energy (vdw+elec)
- clusters_nbw-sorted.stat contains the various cluster averages, sorted according to the weighted intermolecular non-bonded energy (vdw+0.1*elec)
- clusters_air-sorted.stat contains the various cluster averages, sorted according to the AIR energy
- clusters_bsa-sorted.stat contains the various cluster averages, sorted according to the buried surface area
- clusters_sani-sorted.stat contains the various cluster averages, sorted according to the RDC (direct, SANI) restraint energy
- clusters_vean-sorted.stat contains the various cluster averages, sorted according to the RDC (intervector projection angle, VEAN) restraint energy
Perform the cluster analysis and look at the HADDOCK-sorted results. What we would consider
the best solution would be the cluster with the lowest HADDOCK score. Since clusters typically
differ in size, it is best to compare the results of the cluster averages obtained from the XX
best structures only (so that a similar number of structures are considered per cluster).
Which cluster would be the best? How many members does it count?
6. Rerunning the HADDOCK analysis on a cluster basis
Having performed the cluster analysis, you can now rerun the HADDOCK analysis for the best structures of each cluster to obtain various statistics on a "cluster bests" basis. For this one needs the cluster-specific file.nam_clust#, file.list_clust# and file.cns_clust# files. A csh script called make_links.csh is provided that will move the original file.nam, file.list and file.cns files to file.nam_all, file.list_all and file.cns_all (and similarly for the analysis directory). It will then create links to the appropriate files (file.nam_clust#, ...) and to a new analysis_clust# directory.
For example, to rerun the analysis for the best 4 structures of the first cluster, type in the water directory:

$HADDOCKTOOLS/make_links.csh clust1_best4
cd ../../..
haddock2.2

The cd command brings you back into the main run directory, from where you start HADDOCK again. Only the analysis of the best 4 structures of the first cluster in the water directory will be run. Once finished, go to the respective analysis directory and inspect the various files. The RMSD from the average structure should now be low (check rmsave.disp).
Repeat this analysis for the second cluster.
7. Comparing the clusters and the reference structure
Having run the HADDOCK analysis on a cluster basis, you should now have two new analysis directories in the water directory.
To combine the fitted structures of, for example, cluster 1 into a single PDB file, type:

$HADDOCKTOOLS/joinpdb -o e2a-hpr_clust1.pdb e2a-hprfit_*.pdb

and then start Rasmol with:

rasmol -nmrpdb e2a-hpr_clust1.pdb

Use Display -> Backbone to display only the backbone trace and Colours -> Group for a color coding from the N- to the C-terminus.
From the average cluster HADDOCK score, cluster 1 is clearly the best.
It can however happen that the difference in score is small. Since the AIR restraints defined from NMR chemical shift perturbation data (or other sources) are highly ambiguous, they are typically easily satisfied and might not discriminate well between various orientations, e.g. a 180 degree rotation of one protein with respect to the other around an axis perpendicular to the interface. Some degree of asymmetry in the interface (electrostatics, shape) is thus required, or, even better, additional experimental information (e.g. mutagenesis or conservation data) could be used to discriminate between clusters.
Inspect in Rasmol or any graphics program the best solution of each cluster (it is the first one in the respective analysis directories, e2a-hprfit_1.pdb) and try to figure out the differences between them. Check in particular the relative orientation of the two proteins.
For this particular test case, the reference structure is known and has been solved by NMR [5]. In order to compare our docking solutions with the reference structure, we provide csh scripts called i-rmsd_to_target.csh for the interface RMSD calculation and l-rmsd_to_target.csh for the ligand RMSD calculation. These can be found in the e2a-hpr/ana_scripts directory. These scripts take PDB filenames as arguments and calculate various RMSD values from the reference structure using ProFit.
Copy the scripts into the respective cluster analysis directories and run them with *.pdb as argument:

./i-rmsd_to_target.csh *.pdb
./l-rmsd_to_target.csh *.pdb

Four files will be created:
- i-RMSD.disp: interface RMSD for the PDB files in the same order as passed as arguments
- i-RMSD-sorted.disp: interface RMSD sorted from smallest to largest
- l-RMSD.disp: ligand RMSD for the PDB files in the same order as passed as arguments
- l-RMSD-sorted.disp: ligand RMSD sorted from smallest to largest
Which one of the two clusters is the closest to the target?
Note that you can also run those scripts for the it0, it1 or water structures. Copy the scripts into the respective directories and type:

./i-rmsd_to_target.csh `cat file.nam`

The i-RMSD.disp file will now be sorted according to the HADDOCK score.
Another measure of quality defined in CAPRI is the fraction of native contacts. Native contacts are defined as intermolecular contacts between residues within 5A of each other. A script to calculate the fraction of native contacts is provided in the e2a-hpr/ana_scripts directory: fraction-native.csh. It takes as arguments a list of PDB files, in the same manner as the RMSD scripts described above. For example, to calculate the fraction of native contacts of the water-refined structures, copy this script into the water directory and type:
./fraction-native.csh `cat file.nam`

This will generate a file called file.nam_fnat containing the fraction of native contacts.
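The underlying measure can be sketched in plain Python (a toy illustration with made-up residue coordinates, not the fraction-native.csh implementation): each residue is reduced to a single point, a residue pair (one per molecule) counts as a contact when the points lie within the cutoff, and fnat is the fraction of reference contacts also present in the model.

```python
from itertools import product

def contacts(mol_a, mol_b, cutoff=5.0):
    """Residue pairs (one residue per molecule) closer than the cutoff.
    Each molecule is {residue_id: (x, y, z)} -- one point per residue for
    simplicity; a real implementation would loop over all atoms."""
    def dist2(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return {(ra, rb) for (ra, pa), (rb, pb) in product(mol_a.items(), mol_b.items())
            if dist2(pa, pb) < cutoff ** 2}

def fnat(model_a, model_b, ref_a, ref_b, cutoff=5.0):
    """Fraction of reference (native) contacts reproduced by the model."""
    native = contacts(ref_a, ref_b, cutoff)
    model = contacts(model_a, model_b, cutoff)
    return len(native & model) / len(native) if native else 0.0

# Toy example: the model keeps one of the two native contacts.
ref_a = {1: (0.0, 0.0, 0.0), 2: (10.0, 0.0, 0.0)}
ref_b = {7: (3.0, 0.0, 0.0), 8: (12.0, 0.0, 0.0)}
model_b = {7: (3.0, 0.0, 0.0), 8: (30.0, 0.0, 0.0)}  # residue 8 moved away
print(fnat(ref_a, model_b, ref_a, ref_b))  # → 0.5
```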
Finally, there is one additional piece of information that we did not use for the docking and that could help to further discriminate between the clusters:
E2A stands for phosphotransferase enzyme II and HPr for histidine-containing
phosphocarrier protein. There is thus transfer of a phosphate group between two
histidines of those two proteins!
Question: Inspect again graphically the best solution of each cluster and find out whether, in one of them, two histidines (one from each protein) are in reasonable proximity to each other.
Question: Which cluster is consistent with this additional information?
Question: To improve the docking results, how would you incorporate this information in HADDOCK (the phosphate group is transferred between two imino groups of the histidines)?
1. van Nuland, N.A. et al. (1994) J Mol Biol 237, 544.
2. Worthylake, D. et al. (1991) Proc Natl Acad Sci USA 88, 10382.
3. van Nuland, N.A. et al. (1995) J Mol Biol 246, 180.
4. Chen, Y. et al. (1993) Biochemistry 32, 32.
5. Wang, G. et al. (2000) EMBO J 19, 5635.