## Introduction

### What is benchmark-tools?

benchmark-tools is an effort to reduce code duplication and to streamline the execution of HADDOCK benchmarks. It is a standalone program written in Go that can be used to run HADDOCK on a set of benchmark targets. It is designed to work with the production-ready HADDOCK2.4, the pre-release HADDOCK2.5 and the experimental (unpublished) HADDOCK3 versions.

When running a benchmark, users/developers may be interested in the following (in no specific order):

• The quality of the docking results when using different parameters
• Comparing the results of different versions
• The time it takes to run HADDOCK on a set of targets

However, benchmark-tools can also be used to run HADDOCK on any large set of targets, for example for virtual screening.

### How does benchmark-tools work?

The execution of a HADDOCK benchmark consists of a few steps:

1. Set up the benchmark
   • Copy the target structures to the location where the HADDOCK run will be executed
   • For HADDOCK2.4, write the run.param file and execute the haddock2.4 program once to set up the folder structure
   • For HADDOCK3, write the run.toml file
2. Distribute several HADDOCK runs in an HPC-friendly manner

benchmark-tools aims to automate all these steps, and additionally lets the user set up various scenarios. A scenario is a set of parameters that will be used to run HADDOCK. For example, a user may want to run HADDOCK against a set of targets with different sampling values, different restraints, different parameters, etc.

### Who is benchmark-tools for?

The tool is designed for users/students/developers who are familiar with HADDOCK and command-line scripting, and who have access to an HPC infrastructure. If this is the first time you are using HADDOCK, please first familiarize yourself with the software by running the basic HADDOCK2.4 or HADDOCK3 tutorials. benchmark-tools is not meant for end-users who want to run a single target or a small set of targets; for that purpose we recommend using the HADDOCK2.4 web server instead.

## Installation

benchmark-tools is open-source, licensed under Apache 2.0 and freely available from the following repository: github.com/haddocking/benchmark-tools.

You can also build the latest version from source. Make sure Go is installed, then run the following commands:
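
A source build could be sketched as follows. This assumes `git` and the Go toolchain are on your `PATH` and that the repository builds with a plain `go build` from its root; that layout is an assumption, so check the repository README for the authoritative steps. The function name `build_benchmark_tools` and the default clone directory are purely illustrative.

```shell
#!/bin/bash
# Sketch only: assumes git and Go are installed and that the main package
# sits at the root of the repository (not confirmed by this tutorial).
build_benchmark_tools() {
    local dest="${1:-benchmark-tools}"   # hypothetical clone directory
    git clone https://github.com/haddocking/benchmark-tools.git "$dest" || return 1
    (
        cd "$dest" || exit 1
        # Produces ./benchmark-tools inside the clone
        go build -o benchmark-tools .
    )
}

# Usage: build_benchmark_tools "$HOME/software/benchmark-tools"
```

Nothing is downloaded until the function is called, so the snippet can be pasted into a shell session first and run when convenient.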

## Setting up the benchmark

The setup consists of the following steps:

1. Writing the input file list of the targets input.list
2. Writing a run-haddock.sh script
3. Preparing the configuration file, benchmark.yaml
4. Running benchmark-tools

### 1. Creating the input.list file

The input list is a flat text file with the paths to the files of each target:

Note that this file must follow the pattern:

In the above example, complex1 and complex2 correspond to NAME, which identifies the complex being modelled. Each PDB file (indicated by the .pdb extension) has a suffix. This is extremely important, as the suffix is used to organize the data. For example, the file complex1_r_u.pdb is the receptor of the target complex1 and complex1_l_u.pdb is the ligand of the same target.

In this example the suffixes are: receptor_suffix: "_r_u" and ligand_suffix: "_l_u". The suffixes are defined in the benchmark.yaml file.

The same logic applies to the restraints files. In the example above, the pattern for the ambiguous restraints can be defined as ambig: "ti", so the file complex1_ti.tbl will be used as the ambiguous restraints for the target complex1, complex2_ti.tbl for the target complex2, etc. See section 3.2.2 for information specific to the definition of restraints when setting up a HADDOCK3.0 run.

HADDOCK supports many modified amino acids/bases/glycans/ions (check the full list). However, if your target molecule is not present in this library, you can provide its topology and parameter files following the same logic: topology: "_ligand.top" and param: "_ligand.param" will use the files protein2_ligand.top and protein2_ligand.param for the target protein2.

IMPORTANT: For ensembles, provide each model individually and append a number to the suffix, for example: complex1_l_u_1.pdb, complex1_l_u_2.pdb, etc.

See below a full example:
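
As an illustration, the sketch below builds a small input.list for two targets, complex1 and complex2, using the `_r_u`, `_l_u` and `_ti` naming discussed above. The data directory and the empty placeholder files are assumptions made only so that the snippet runs anywhere; point `DATA_DIR` at your own files instead.

```shell
#!/bin/bash
# Hypothetical data directory; replace with the location of your targets.
DATA_DIR="$(mktemp -d)"

# Create empty placeholder files so the example is self-contained.
for name in complex1 complex2; do
    touch "$DATA_DIR/${name}_r_u.pdb" \
          "$DATA_DIR/${name}_l_u.pdb" \
          "$DATA_DIR/${name}_ti.tbl"
done

# One path per line, sorted so that the files of a target sit together.
find "$DATA_DIR" -type f \( -name '*.pdb' -o -name '*.tbl' \) \
    | sort > "$DATA_DIR/input.list"

cat "$DATA_DIR/input.list"
```

The resulting file lists the ligand, receptor and restraints file of complex1, followed by the same three files for complex2.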

### 2. Writing the run-haddock.sh script

The run-haddock.sh script is a bash script that will be executed by benchmark-tools for each target. Its purpose is to act as an “adapter” that accounts for different HADDOCK versions, different Python versions, and even different operating systems and configurations on your cluster.

This script should contain all the commands necessary to run HADDOCK, and it must be customized for your installation. For example:

haddock24.sh

haddock3.sh
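
The exact scripts depend on your installation. As a minimal sketch for HADDOCK3, assuming it is installed in a Python virtual environment and that the run configuration file reaches the script as a command-line argument, an adapter could look like the one below. Both points are assumptions, and the `HADDOCK3_VENV` and `HADDOCK3_BIN` variable names are illustrative rather than HADDOCK or benchmark-tools settings.

```shell
#!/bin/bash
# haddock3.sh: hypothetical adapter script, adjust to your installation.

run_haddock3() {
    # Activate the virtual environment holding HADDOCK3, if one is configured.
    # HADDOCK3_VENV is an illustrative variable name.
    if [ -n "${HADDOCK3_VENV:-}" ] && [ -f "$HADDOCK3_VENV/bin/activate" ]; then
        . "$HADDOCK3_VENV/bin/activate"
    fi
    # HADDOCK3_BIN can point at a specific executable;
    # it defaults to the haddock3 command found on PATH.
    "${HADDOCK3_BIN:-haddock3}" "$@"
}

# Forward any arguments (assumed to be the run configuration file) to HADDOCK3.
if [ "$#" -gt 0 ]; then
    run_haddock3 "$@"
fi
```

Keeping the environment setup inside the adapter means the benchmark configuration itself does not need to change when the cluster environment does.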

### 3. Writing the benchmark.yaml file

The benchmark.yaml file is a configuration file in YAML format that will be used by benchmark-tools to run the benchmark. This file is divided into two main sections: general and scenarios.

#### 3.1. General section

Here you must define the following:

• executable: the path to the run-haddock.sh script (see above for more details)
• max_concurrent: the maximum number of runs that can be executed at a given time (a run is a target in a given scenario)
• haddock_dir: the path to the HADDOCK installation
• receptor_suffix: the suffix used to identify the receptor files
• ligand_suffix: the suffix used to identify the ligand files
• input_list: the path to the input list (see above for more details)
• work_dir: the path to the benchmark output
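
Put together, the general section might look like the sketch below, written through a shell heredoc so that it can be pasted into a terminal. All paths and values are placeholders, and the nesting of the keys under a top-level `general:` key is inferred from the description above. The file is written to a temporary location here; in practice you would write it next to your benchmark data.

```shell
#!/bin/bash
# Writes an illustrative general section; every path below is a placeholder.
CONFIG="$(mktemp -d)/benchmark.yaml"

cat > "$CONFIG" <<'EOF'
general:
  executable: /path/to/run-haddock.sh
  max_concurrent: 4
  haddock_dir: /path/to/haddock
  receptor_suffix: "_r_u"
  ligand_suffix: "_l_u"
  input_list: /path/to/input.list
  work_dir: /path/to/benchmark-output
EOF

cat "$CONFIG"
```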

#### 3.2. Scenario section

Here you must define the scenarios that you want to run. The definition is slightly different for HADDOCK2.4 and HADDOCK3.0.

For HADDOCK2.4 you must define the following:

• name: the name of the scenario
• parameters: the parameters to be used in the scenario
  • run_cns: parameters that will be used in the run.cns file
  • restraints: patterns used to identify the restraints files
    • ambig: pattern used to identify the ambiguous restraints file
    • unambig: pattern used to identify the unambiguous restraints file
    • hbonds: pattern used to identify the hydrogen bonds restraints file
  • custom_toppar: patterns used to identify the custom topology files
    • topology: pattern used to identify the topology file
    • param: pattern used to identify the parameter file
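
A HADDOCK2.4 scenario following this structure could be sketched as below. The scenario name, the parameter values and the choice to express scenarios as a YAML list are assumptions; `structures_0`, `structures_1` and `noecv` are run.cns parameter names, while the restraint and topology patterns reuse the examples given earlier in this tutorial. The fragment is written to a temporary file for illustration.

```shell
#!/bin/bash
# Writes an illustrative HADDOCK2.4 scenario; all values are placeholders.
SCENARIO_FILE="$(mktemp -d)/scenario-haddock24.yaml"

cat > "$SCENARIO_FILE" <<'EOF'
scenarios:
  - name: true-interface        # hypothetical scenario name
    parameters:
      run_cns:
        noecv: false
        structures_0: 1000
        structures_1: 200
      restraints:
        ambig: "ti"
      custom_toppar:
        topology: "_ligand.top"
        param: "_ligand.param"
EOF

cat "$SCENARIO_FILE"
```

Additional scenarios would be added as further entries under `scenarios`, each with its own name and parameters.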

For HADDOCK3.0 you must define the following:

• name: the name of the scenario
• parameters: the parameters to be used in the scenario
  • general: general parameters; those are the ones defined in the “top” section of the run.toml script
  • modules: this subsection is related to the parameters of each module in HADDOCK3.0
    • order: the order of the modules to be used in HADDOCK3.0
    • <module-name>: parameters for the module
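
For HADDOCK3.0, a scenario following this structure could be sketched as below. The scenario name, the values, the module selection and the use of an inline YAML list for `order` are assumptions made for illustration; `topoaa`, `rigidbody`, `flexref` and `emref` are HADDOCK3 module names, and `mode` and `ncores` are the general parameters mentioned in this tutorial. Restraint definitions are left out of this sketch.

```shell
#!/bin/bash
# Writes an illustrative HADDOCK3 scenario; all values are placeholders.
SCENARIO_FILE="$(mktemp -d)/scenario-haddock3.yaml"

cat > "$SCENARIO_FILE" <<'EOF'
scenarios:
  - name: true-interface        # hypothetical scenario name
    parameters:
      general:
        mode: local
        ncores: 4
      modules:
        order: [topoaa, rigidbody, flexref, emref]
        topoaa:
          autohis: true
        rigidbody:
          sampling: 1000
EOF

cat "$SCENARIO_FILE"
```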

#### 3.3. Full example

Here is a full example of the benchmark.yaml file:

HADDOCK2.4

HADDOCK3.0

### 4. Running the benchmark

Once the input list and the benchmark.yaml configuration file have been properly set up, you can run the benchmark by executing benchmark-tools with:

benchmark-tools will read the input file, create the working directory, copy the input files to a data/ directory and start the benchmark. Make sure you have enough disk space to store the input files and the results.

VERY IMPORTANT: In the current version, benchmark-tools does not submit jobs to the queue; instead, it relies on the internal scheduling routines of HADDOCK2.4/HADDOCK3.0. This means that the number of concurrent runs limits how many docking runs execute at a given time, not the total number of processors used by HADDOCK! The actual number of processors used depends on how HADDOCK was configured. For HADDOCK2.4, this depends on parameters defined in the run.cns file (queue_N/cpunumber_N); for HADDOCK3, the number of processors (or queue slots) to use and the running mode are defined in the config file under the general section (see examples above).

Example: max_concurrent: 10 with scenarios.parameters.mode: local and scenarios.parameters.ncores: 10 means that 10 × 10 = 100 processors will be required!
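
The arithmetic is worth checking before launching a large benchmark. A small sketch using the values from the example above (the variable names simply mirror the configuration keys):

```shell
#!/bin/bash
# Values taken from the example above.
max_concurrent=10   # HADDOCK runs executing at the same time
ncores=10           # cores used by each run in local mode

# Peak processor usage is the product of the two settings.
total=$((max_concurrent * ncores))
echo "Processors required at peak: $total"
```

If the result exceeds what your node or allocation provides, lower either max_concurrent or the number of cores per run.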

## Setting up a benchmark experiment with the Docking benchmark 5 (BM5)

The Protein-Protein Docking Benchmark v5 (Vreven, 2015), namely BM5, is a large set of non-redundant, high-quality structures; see the benchmark website for the full set.

The BonvinLab provides a HADDOCK-ready sub-version of the BM5 which can easily be used as input for benchmark-tools. This version is available from the following repository: github.com/haddocking/BM5-clean. Below we go over step-by-step instructions on how to use it as input.

### 1. Clone the repository

Clone the repository and check out a version. Note that it is always recommended to use a specific version for reproducibility, as the main branch might change.
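
This step could be sketched as follows, assuming `git` is installed. The function name `clone_bm5` is illustrative, and the version tag is deliberately left as an argument: pick one from the repository's list of tags rather than relying on a value guessed here.

```shell
#!/bin/bash
# clone_bm5 TAG [DEST]: clone BM5-clean at a given tag (sketch only).
clone_bm5() {
    local tag="$1"
    local dest="${2:-BM5-clean}"   # hypothetical destination directory
    if [ -z "$tag" ]; then
        echo "usage: clone_bm5 TAG [DEST]" >&2
        return 1
    fi
    # --branch also accepts tag names, so this checks out the pinned version.
    git clone --branch "$tag" https://github.com/haddocking/BM5-clean.git "$dest"
}

# Usage: clone_bm5 <tag> "$HOME/repos/BM5-clean"
```

Record the tag you used alongside your benchmark results so the run can be reproduced later.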

### 2. Create bm5-input.list

As previously mentioned, the BM5-clean repository is already an organized sub-version, so it is very simple to create the bm5-input.list file with a few bash commands:
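
One way to do this is sketched below. It assumes that each target lives in its own sub-directory and that files follow the `_r_u.pdb` / `_l_u.pdb` / `.tbl` naming used earlier in this tutorial; check the actual layout of your BM5-clean checkout before relying on it. The function name `make_bm5_input_list` and the mock directory are illustrative, and the mock files are empty placeholders.

```shell
#!/bin/bash
# make_bm5_input_list DIR OUT: list receptor, ligand and restraints files.
make_bm5_input_list() {
    local bm5_dir="$1" out="$2"
    find "$bm5_dir" -type f \
        \( -name '*_r_u.pdb' -o -name '*_l_u.pdb' -o -name '*.tbl' \) \
        | sort > "$out"
}

# Demonstration on a mock layout (assumed: one sub-directory per target).
MOCK="$(mktemp -d)"
for target in 1A2K 1ACB; do
    mkdir -p "$MOCK/$target"
    touch "$MOCK/$target/${target}_r_u.pdb" \
          "$MOCK/$target/${target}_l_u.pdb" \
          "$MOCK/$target/${target}_ti.tbl" \
          "$MOCK/$target/${target}_target.pdb"
done

make_bm5_input_list "$MOCK" "$MOCK/bm5-input.list"
cat "$MOCK/bm5-input.list"
```

Files that match none of the patterns, such as the `_target.pdb` placeholder in the mock layout, are left out of the list. With a real checkout, pass the directory holding the targets as the first argument.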

## Getting help

If you encounter any issues or have any questions, please open an issue on the GitHub repository, contact us at software.csb [at] gmail.com or join the BioExcel forum and post your question there.

## Final considerations

benchmark-tools is under active development, and we have a list of planned features, such as an option to resume/restart the benchmark and a full suite of analyses.

If you have any suggestions or feedback, please let us know! 🤓