Computational Structural Biology group focusing on dissecting, understanding and predicting biomolecular interactions at the molecular level.

Supported by:

# HADDOCK CASP-CAPRI T70 ab-initio docking tutorial

This tutorial consists of the following sections:

## Introduction

Our information-driven docking approach HADDOCK is a consistent top predictor and scorer since the start of its participation in the CAPRI community-wide experiment. This sustained performance is due, in part, to its ability to integrate experimental data and/or bioinformatics information into the modelling process, and also to the overall robustness of the scoring function used to assess and rank the predictions.

This tutorial will demonstrate the use of HADDOCK for predicting target70 of the CASP-CAPRI experiment. This target was given to the CAPRI community as a tetramer, but there has been discussions whether the biological unit is a dimer or a tetramer. We will use this target to illustrate the ab-initio docking mode of HADDOCK, using a combination of center-of-mass restraints to bring the subunits together and symmetry restraints to define the symmetry of the assembly.

For this tutorial we will make use of the HADDOCK2.2 webserver. A description of our web server can be found in the following publications:

Further, multi-body docking and the use of symmetry restraints is described in the following paper:

Throughout the tutorial, coloured text will be used to refer to questions or instructions, Linux and/or Pymol commands.

The case we will be investigating is target70 from the CASP-CAPRI experiment, which corresponds to PDB entry 4PWU.

## Setup

In order to run this tutorial you will need to have the following software installed: (Pymol and ProFit), and download the data to run this tutorial from our GitHub data repository here or clone it from the command line

Alternatively, if you don’t have git installed, simply go the above web address and download the zip archive. You will also need to compile a few provided programs for post-analysis. For this go into the ana_scripts directory of the cloned directory and type make (we are here assuming a tcsh or csh shell):

cd CASP-CAPRI-T70-tutorial/ana_scripts
make
source set-target.csh
cd ..

If you don’t want to wait with the docking runs to complete in order to proceed with the analysis (see below), you can already download pre-calculated runs using the script provided into the runs directory:

cd runs
cd ..

This will download two reduced docking runs, one for the dimeric and one for the tetrameric forms of T70 (about 38MB of data).

This will download two full docking runs, one for the dimeric and one for the tetrameric forms of T70 (about 800MB of data).

## Inspecting the content of the tutorial

Let us first inspect the various files provided with this tutorial. You will see three directories and one file:

• HADDOCK-runfiles: this directory contains the reference HADDOCK parameter files for dimer and tetramer docking. These can be used to reproduce the docking using the file upload interface of the HADDOCK server.

• ana_scripts: this directory contains various analysis scripts to analyse the results of the docking, including the calculation of the CAPRI i-RMSD and Fnat statistics

• protein1.pdb: this is the model we built for this target based on the sequence information that was provided to the CAPRI predictors.

• runs: this directory contains a script that allows you to download pre-calculated full docking runs.

## Ab-initio, multi-body docking with symmetry restraints

We will launch two docking runs, one for the dimeric and one for the tetrameric form of this target. For this we will make us of the multi-body interface of the HADDOCK web server, which does require guru level access (provided with course credential if given to you, otherwise register to the server and request this access level):

Note: The blue bars on the server can be folded/unfolded by clicking on the arrow on the right

We will first set up a symmetrical tetramer docking run:

• Step1: Define a name for your docking run, e.g. T70-tetramer.

• Step2: Input the proteins PDB files. For this unfold the Molecule definition menu.

(Our model might not be the real fragment used for crystallisation - better to have uncharged termini)

• Steps 3-4: Repeat step 2 to define the other three molecules, using the same input protein file and assigning segment IDs as B,C and D.

• Step 5: Turn on center-of-mass restraints. For this unfold the Distance restraints menu

Center of mass restraints -> Check the box

• Step 6: Since we are doing ab-initio docking we do need to increase the sampling. For this unfold the Sampling parameters menu:

Note: If you use course credentials, these numbers will be reduced to 500/50/50 to save computing time and get back results faster. You can also manually decrease those numbers and download instead a full pre-calculated run for analysis (see setup above).

• Step 7: Define noncrystallographic symmetry restraint to enforce the various chains will have exactly the same conformation. For this unfold the Noncrystallographic symmetry restraints menu:

Use this type of restraints -> Check the box

Unfold the segment pair menu and define three segment pairs: A-B, B-C and C-D. The protein sequence starts at residue 6 and ends at residue 78. Use those numbers to define the various segments.

We have four molecules, so do three pairs of NCS restraints make sense to ensure all four molecules remain similar?

• Step 8: Define the symmetry restraints to enforce the symmetry of our tetramer. For this unfold the symmetry restraints menu:

We are dealing with a symmetrical tetramer. Two kind of symmetries are possible: C4 or D2. What are the differences between those two symmetries?

Since we don’t know a priori which symmetry the system has, we can define 6 pairs of C2 restraints which are consistent with both C4 and D2. For this we will need to define the following symmetry pairs: A-B, A-C, A-D, B-C, B-D, C-D. In this way the system will be free to adopt one of those two symmetries.

Use this type of restraints -> Check the box

Use the C2 symmetry segment pair menu to define those six pairs of symmetry restraints. Use for defining the segments again residues 6 and 78 as start and end points.

• Step 9: You are ready to submit! Enter your username and password (or the course credentials provided to you). Remember that for this interface you do need guru access.

## First visual analysis of the results

Once your run has completed you will be presented with a result page showing the cluster statistics and some graphical representation of the data. Such an example output page can be found here

Note: You can also view a result page from a downloaded pre-calculated docking run by opening in your favourite browser the index.html file provided in the run directory.

The run with reduced number of models (course setting) should be returning only one cluster. Load a representative model and compare it to the crystal structure:

pymol cluster1_1.pdb WDIR/30_70.2.pdb This will load cluster1_1 and the reference crystal structure (the second interface evaluated by CAPRI) into pymol (make sure to have run the setup described above). In case WDIR is not defined or you are not running under a linux-like environment you can find the reference structure into the ana_scripts directory. In pymol you can type the following to superimpose the models and change the style: align cluster1_1, 30_70.2 zoom vis show cartoon util.cbc ## More quantitative analysis of the results Let’s have a more detailed look at the quality and ranking of the generated models. For this we will calculate CAPRI statistics, namely the interface RMSD (i-RMSD) and fraction of native contacts (Fnat) compared to the reference crystal structure. CAPRI has defined the following criteria to assess the quality of a model: • Unacceptable: i-RMSD >4Å or Fnat < 0.1 • Acceptable: 4Å <=i-RMSD < 2Å and Fnat > 0.1 • Medium: 2Å <=i-RMSD < 1Å and Fnat > 0.3 • High: i-RMSD < 1Å and Fnat > 0.5 In order to perform the more quantitative analysis, download the full run from the results page (provide in a link in the first line of the result page) and unpack it. Alternatively, download the pre-calculated data from the following links: • T70 tetramer docking - reduced settings: link • T70 tetramer docking - full sampling: link • T70 dimer docking - reduced settings: link • T70 dimer docking - full sampling: link Then start the quantitative analysis in the directory where you saved and unpacked the run with the following command:WDIR/run_all.csh my-run-directory

Note: If you use the pre-calculated runs, this analysis has already been performed and you can skip the above step.

Be patient since this might take some time depending on whether you are analysing a full or reduced run. The script will calculate CAPRI statistics for all generated models (rigid-body (it0) - semi-flexible refinement (it1) - water refinement (water)). Those can be found in the unpacked run directory under structures/it0, structures/it1 and structures/it1/water directories, respectively.

Once the analysis script has completed you can get a first glimpse of the number of acceptable models or better using the following command:

$WDIR/check_runs_i-rmsd.csh my-run-directory Here is an example of what this script will return for a full tetramer docking run: T70-tetramer-full #it0: structures with i-RMSD<4A: 194 #it0: structures within best200 with i-RMSD<4A: 94 #it0: structures within best200 with i-RMSD<2A: 94 #it0: structures within best200 with i-RMSD<1A: 0 #it1: structures with i-RMSD<4A: 108 #it1: structures with i-RMSD<2A: 108 #it1: structures with i-RMSD<1A: 0 #water: structures with i-RMSD<4A: 103 #water: structures with i-RMSD<2A: 103 #water: structures with i-RMSD<1A: 0 Note: To get the proper stats in case of a reduced sampling run, use instead check_runs_i-rmsd-reduced.csh. Are there any acceptable models after water refinement? Which fraction of the acceptable models at it0 is selected for further refinement? Considering that we are generating 10000 models for a full run at it0, how successful was our scoring function in selecting acceptable models for further flexible refinement? You can also check the cluster-based statistics with the following command:$WDIR/check_clusterranks.csh my-run-directory

Here is an example of what this script will return for a full tetramer docking run:

######################################################
______________________________________________________
=========== T70-tetramer-full =============
Rank of clusters with i-RMSD<4A:
1 clust3    1.31 +/-  0.07  Fnat=  0.49 +/-  0.02
2 clust8    1.26 +/-  0.04  Fnat=  0.51 +/-  0.06
3 clust6    1.28 +/-  0.06  Fnat=  0.49 +/-  0.05
16 clust12   4.27 +/-  4.08  Fnat=  0.31 +/-  0.01
17 clust11   6.65 +/-  4.73  Fnat=  0.33 +/-  0.02
######################################################
______________________________________________________
=========== T70-tetramer-full =============
Rank of clusters with i-RMSD<4A:
1 clust3    1.33 +/-  0.03  Fnat=  0.69 +/-  0.01
2 clust8    1.31 +/-  0.02  Fnat=  0.67 +/-  0.02
3 clust7    1.32 +/-  0.01  Fnat=  0.69 +/-  0.00
17 clust13   4.17 +/-  3.93  Fnat=  0.48 +/-  0.08
18 clust12   4.26 +/-  4.08  Fnat=  0.40 +/-  0.02

The above results show three clusters with medium quality (around 1.3Å i-RMSD). What could be the reason that these do not fall within one single cluster? Remember here that we are docking four molecules with different chainIDs.

Check also the ranking of the clusters. Did the HADDOCK score do a good job are ranking at the top acceptable clusters?

In case there is no single acceptable cluster, but the check_runs_i-rmasd.csh script did report some acceptable models at water refinement, do check the ranking of those single structure. For this go into the my-docking-run/structures/it1/water directory and inspect the i-RMSD.dat file. This file contains the i-RMSD values sorted according to the HADDOCK score.

cd my-docking-run/structures/it1/water
less i-RMSD.dat

Did HADDOCK rank any acceptable model at the top? If not, try to find out what is the rank of the first acceptable model.

You can also inspect the fraction of native contacts of the models by looking at the file.nam_fnat file.

Let’s inspect the best generated model in terms of i-RMSDs. This is the top model listed in i-RMSD-sorted.dat. Upload it in pymol (or your favourite graphical program) and compare it to the crystal structure:

pymol my-best-model \$WDIR/30_70.2.pdb

And in pymol:

align my-best-model, 30_70.2
zoom vis
show cartoon
util.cbc

The above align command might not show a convincing superimposition because different chains might be forming the correct interface. If this is the case, pymol provide another command to perform a better structural alignment:

cealign my-best-model, 30_70.2

## Comparing dimer and tetramer docking

You can follow the same steps described above to perform instead of a tetramer docking a dimer docking. Simply follow the provided instructions, but inputting to the HADDOCK server only two molecules.

Repeat the analysis of the results and compare the success with tetramer docking.

Is dimer docking as successful as tetramer docking?

## Congratulations!

You have completed this tutorial. If you have any questions or suggestions, feel free to contact us via email or by submitting an issue in the appropriate Github repository or asking a question through our support center.