HADDOCK3 Architecture

Disclaimer: generated by Claude Code, model Opus 4.8 xhigh

This document describes the code architecture of HADDOCK3 and maps the major concepts to the files and folders that implement them. It is intended for developers and contributors who want a mental model of how the pieces fit together. For user-facing documentation see docs/pages/intro.md, the examples, and the online user manual.

Scope. This is a code-architecture map, not an API reference. The auto-generated API docs are built from docstrings with Sphinx (see docs/README.md).


1. What HADDOCK3 is

HADDOCK3 is the modular rewrite of the HADDOCK integrative-modelling software. Where HADDOCK2.x exposed a fixed three-stage pipeline (rigid-body docking → semi-flexible refinement → final refinement), HADDOCK3 lets users assemble their own pipeline by chaining reusable modules.

The unit of work is a workflow: a user-authored configuration file (TOML-like .cfg) that lists the modules to run, in order, with their parameters. HADDOCK3 reads that file, validates it, and executes each module in sequence, with each module reading the previous module’s output.
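To make the "ordered list of modules" idea concrete, the sketch below pulls the step sequence out of a small config text. It is not the parser in gear/config.py: the regular expression, the function name list_steps, and the sample parameter values are assumptions made for illustration.

```python
import re

# Hypothetical example config; the parameter values are illustrative only.
EXAMPLE_CFG = """
run_dir = "run1"
molecules = ["a.pdb", "b.pdb"]

[topoaa]

[rigidbody]
sampling = 20

[caprieval]

[flexref]

[caprieval]
"""

# A section header alone on its line, e.g. "[rigidbody]" (assumed syntax).
HEADER = re.compile(r"^\[([A-Za-z0-9_]+)\]\s*$")


def list_steps(cfg_text):
    """Return (index, module_name) pairs in the order the headers appear."""
    names = []
    for line in cfg_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            names.append(match.group(1))
    return list(enumerate(names))


steps = list_steps(EXAMPLE_CFG)
```

Here caprieval appears twice, so only the position in the list tells the two steps apart.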

The physics (topology generation, docking, refinement, scoring) is largely done by CNS (Crystallography & NMR System), invoked as a subprocess per job. The Python codebase is mostly the orchestration, parameter handling, I/O, analysis, and plumbing around CNS.


2. The conceptual model

 config file (.cfg/.toml)            run directory/
 ┌───────────────────────┐          ┌──────────────────────────────┐
 │ run_dir = "run1"      │          │ 0_topoaa/   → io.json         │
 │ molecules = [...]     │  setup   │ 1_rigidbody/→ io.json         │
 │ [topoaa]              │ ───────► │ 2_seletop/  → io.json         │
 │ [rigidbody]           │          │ 3_flexref/  → io.json         │
 │ [seletop]             │          │ 4_caprieval/→ io.json         │
 │ [flexref]             │          │ data/        (inputs copied)  │
 │ [caprieval]           │          │ analysis/    (post-process)   │
 └───────────────────────┘          │ log, traceback/              │
                                     └──────────────────────────────┘

Key properties of the model (important for anyone reasoning about caching, re-runs, or replacing the engine):

  • Strictly linear DAG. There is no branching at the workflow level. A module type (e.g. caprieval) may appear multiple times; identity is by position (the numbered step folder), not by name.

  • Workflow is dynamic, defined per run. The list of steps is read from the user’s config at run start, not statically declared.

  • Module-to-module communication is via files. Each step writes an io.json describing its output models; the next step reads it.

  • The communication payload is rich. io.json holds serialized “ontology” objects (PDBFile, …) carrying far more than a path: score, cluster id/rank, topology references, restraint files, seed, unweighted energies, etc.
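The properties above can be sketched as a toy pipeline in which each step only ever sees the io.json of the step before it. Everything here (run_step, the toy step names, the dictionary layout of the JSON) is a simplified stand-in and not the HADDOCK3 implementation.

```python
import json
import tempfile
from pathlib import Path


def run_step(run_dir, index, name, transform):
    """Run one toy step: read the previous io.json, write this step's own."""
    step_dir = run_dir / f"{index}_{name}"
    step_dir.mkdir()
    models = []
    if index > 0:
        # The previous step is located by position, not by module name.
        prev_dir = next(run_dir.glob(f"{index - 1}_*"))
        models = json.loads((prev_dir / "io.json").read_text())["output"]
    output = transform(models)
    payload = {"input": models, "output": output}
    (step_dir / "io.json").write_text(json.dumps(payload))
    return output


# Two hypothetical steps: produce three scored models, then keep the best two.
workflow = [
    ("toysampling", lambda _: [{"file": f"m{i}.pdb", "score": s}
                               for i, s in enumerate([3.0, -1.0, 0.5])]),
    ("toyselect", lambda models: sorted(models, key=lambda m: m["score"])[:2]),
]

with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    for index, (name, transform) in enumerate(workflow):
        final_models = run_step(run_dir, index, name, transform)
    step_folders = sorted(p.name for p in run_dir.iterdir())
```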


3. Execution flow

The end-to-end control flow for haddock3 <config>:

Step                       Where                                                  What happens
1. Parse CLI args          src/haddock/clis/cli.py                                maincli → main(workflow, restart, extend_run, …)
2. Setup & validate        gear/prepare_run.py setup_run()                        Parse config, validate module names/params, create run_dir/, copy inputs to data/, resolve defaults
3. Build workflow          libs/libworkflow.py WorkflowManager → Workflow → Step  One Step per config block, in order
4. Run each step           Step.execute()                                         Import the module package, instantiate HaddockModule, update_params, save_config, run()
5. Module body             each module’s _run()                                   Build CNS input (or run Python analysis), fan out jobs via an engine, collect models, write io.json
6. Forward runtime params  WorkflowManager.run()                                  Propagate any _output_params a module produced to later steps
7. Post-process            WorkflowManager.postprocess()                          Run cli_analyse + cli_traceback over caprieval steps
8. Clean / archive         WorkflowManager.clean(), gear/postprocessing.py        Optionally compress step outputs and archive the run

Two variants of step 3–4 exist:

  • --restart N (positional): delete step folders from N onward and re-run from there. See gear/restart_run.py. There is no content-based identity or partial reuse — changing sampling=200→400 forces a full re-run of sampling.

  • --extend-run: append new steps to a finished run, via WorkflowManagerExtend in gear/extend_run.py (paired with the haddock3-copy CLI).
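The positional nature of a restart can be sketched as follows: every numbered step folder at or after the restart index is deleted, and nothing else is inspected. This is an assumption-laden illustration, not the code in gear/restart_run.py; the helper name remove_steps_from is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def remove_steps_from(run_dir, restart_index):
    """Delete step folders whose numeric prefix is >= restart_index."""
    removed = []
    for path in sorted(run_dir.iterdir()):
        prefix = path.name.split("_", 1)[0]
        # Folders such as data/ or analysis/ have no numeric prefix and are kept.
        if path.is_dir() and prefix.isdigit() and int(prefix) >= restart_index:
            shutil.rmtree(path)
            removed.append(path.name)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    for name in ["0_topoaa", "1_rigidbody", "2_seletop", "3_flexref", "data"]:
        (run_dir / name).mkdir()
    removed = remove_steps_from(run_dir, restart_index=2)
    remaining = sorted(p.name for p in run_dir.iterdir())
```

Because the decision depends only on the folder index, a parameter change in an early step cannot be detected this way; the user has to pick the restart point.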


4. Repository layout (top level)

Path                            Role
src/haddock/                    The Python package: all application code (see §5)
docs/                           Sphinx documentation sources (.md, .rst), built to HTML
examples/                       Ready-to-run example workflows and data, organised by system type
tests/                          Unit tests (no CNS required)
integration_tests/              Integration tests (require CNS)
end-to-end_tests/               Full-workflow tests
notebooks/                      Example/analysis Jupyter notebooks
varia/, devtools/               Auxiliary scripts and developer tooling
pyproject.toml, setup.py        Packaging, dependencies, console-script entry points
Dockerfile, entrypoint.sh       Containerised execution
CHANGELOG.md, CONTRIBUTING.md,  Project metadata
README.md, LICENSE, …


5. The source tree (src/haddock/)

The package is organised into a small number of layers. Top-to-bottom, the dependency direction is roughly: clis → libworkflow/modules → gear → libs → core.

src/haddock/
├── __init__.py     # package paths, version, logging setup, EmptyPath sentinel
├── core/           # constants, parameter schemas, exceptions, types
├── gear/           # run-lifecycle machinery (plugin-like "gears")
├── libs/           # reusable libraries (I/O, CNS, parallelism, ontology, math…)
├── modules/        # the simulation/analysis modules, grouped by category
├── clis/           # command-line entry points (haddock3 and friends)
├── cns/            # bundled CNS binaries (bin/) and force-field data (toppar/)
├── fcc/            # Fraction of Common Contacts clustering helpers
├── deps/           # C/C++ sources compiled at install (contact_fcc, fast-rmsdmatrix)
└── prodrg/         # bundled PRODRG ligand-topology binaries

5.1 core/ — definitions and contracts

The lowest layer: almost no logic, mostly the definitions that everything else depends on.

File                    Contents
defaults.py             Framework constants: RUNDIR, MODULE_IO_FILE (io.json), MODULE_DEFAULT_YAML (defaults.yaml), CNS_MODULES, CNS executable discovery, exec paths for compiled deps
mandatory.yaml          Global mandatory parameters: run_dir, molecules
optional.yaml           Global optional parameters: preprocess, postprocess, gen_archive
exceptions.py           Custom errors: HaddockError, StepError, ConfigurationError, HaddockTermination
typing.py               Shared type aliases (FilePath, ParamDict, ParamMap, …)
cns_paths.py            Locations of CNS topology/parameter files
supported_molecules.py  Recognised residues/molecule types

5.2 gear/ — run-lifecycle machinery

“Gears” are self-contained pieces of run-orchestration logic that sit between the CLI and the modules. Each handles one cross-cutting concern.

File                      Concern
prepare_run.py            The big one: setup_run() parses, validates (names, types, ranges, compatibility), creates the run dir, copies inputs, and expands parameters
config.py                 Read/write the HADDOCK3 config format (load/loads/save, get_module_name, path coercion)
yaml2cfg.py               Turn a module’s annotated defaults.yaml into a flat default config; detect incompatible params
parameters.py             Definitions of mandatory/general parameter sets
expandable_parameters.py  Per-molecule / repeatable parameter blocks (e.g. mol_*, seg_*)
validations.py            Domain-specific validation rules
restart_run.py            --restart flag logic
extend_run.py             --extend-run flag + haddock3-copy; WorkflowManagerExtend
clean_steps.py            Compress/clean a step’s output files
postprocessing.py         Archive the run, build analysis bundle
preprocessing.py          Input PDB sanitisation/preprocessing
zerofill.py               Compute the zero-padded numeric step-folder prefixes
haddockmodel.py           HaddockModel: parse CNS output PDBs and their energy headers
known_cns_errors.py       Pattern-match common CNS failures from logs
greetings.py              Banner / feedback messages

5.3 libs/ — reusable libraries

Stateless or near-stateless helpers used across modules and gears.

File                                Responsibility
libworkflow.py                      Workflow engine: WorkflowManager, Workflow, Step (see §3)
libontology.py                      Inter-module data model: PDBFile, TopologyFile, RMSDFile, ModuleIO (see §6)
libcns.py                           Build CNS input scripts from templates + parameters
libsubprocess.py                    CNSJob and Job wrappers around subprocess execution
libparallel.py                      Scheduler/Worker: local multiprocessing fan-out
libhpc.py                           HPCScheduler: batch/queue submission
libmpi.py                           MPIScheduler: MPI execution
libgrid.py                          GRIDScheduler: DIRAC grid execution
libpdb.py, libstructure.py          Parse and manipulate PDB structures
libalign.py, libmath.py             Alignment and RMSD/geometry maths
libclust.py, libfcc.py              Clustering helpers
librestraints.py                    Restraint (.tbl) handling
libaa2cg.py, libligand.py           Coarse-grain mapping, ligand topology
libplots.py, libnotebooks.py        Analysis plots and notebook generation
libprodigy.py                       PRODIGY binding-affinity scoring integration
libinteractive.py                   Backing for haddock3-re interactive re-scoring/clustering
libio.py, liblog.py, libtimer.py,   Cross-cutting utilities (I/O, logging, timing, CLI args, functional helpers)
libutil.py, libcli.py, libfunc.py
assets/                             Static assets used by libs (e.g. templates)

5.4 modules/ — the simulation & analysis modules

This is where the science lives. Modules are grouped into categories, which are just the immediate subfolders. The category registry and the BaseHaddockModule contract are defined in modules/__init__.py.

modules/
├── __init__.py          # module registry, BaseHaddockModule, get_engine(), step-folder helpers
├── base_cns_module.py   # BaseCNSModule: shared behaviour for CNS-backed modules
├── defaults.yaml        # global non-mandatory parameters (ncores, mode, clean, …)
├── topology/            # topoaa, topocg
├── sampling/            # rigidbody, lightdock
├── refinement/          # flexref, emref, mdref, cgtoaa, openmm
├── scoring/             # emscoring, mdscoring, prodigyprotein, prodigyligand, sasascore
├── analysis/            # caprieval, clustfcc, clustrmsd, rmsdmatrix, ilrmsdmatrix,
│                        #   seletop, seletopclusts, alascan, contactmap, filter
├── extras/              # exit
└── _template_cat/       # template for authoring a new category/module

The category hierarchy (declared order, used for ordering/validation) is: topology → sampling → refinement → scoring → analysis → extras. There is no constraint on mixing categories in a workflow; the hierarchy is organisational.

Anatomy of a module. Every module is a package (folder) containing:

  • __init__.py defining a class named HaddockModule (subclass of BaseHaddockModule, or BaseCNSModule for CNS-backed ones). The module-level docstring is the user documentation. _run() is the body.

  • defaults.yaml — every parameter annotated with default, type, range, title/short/long help text, group, and explevel (easy/expert/guru). This single file drives defaults, validation, the haddock3-cfg help output, and the web/GUI parameter forms.

  • cns/ (CNS modules only) — the .cns template scripts run by CNS.

See modules/_template_cat/ for the canonical skeleton when adding a module.

Step.execute() discovers a module dynamically: it looks up the category in the registry, imports haddock.modules.<category>.<name>, and instantiates that package’s HaddockModule. Adding a module is therefore a matter of dropping a correctly-shaped folder in the right category — no central registration.
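A minimal sketch of that lookup-then-import pattern, assuming a plain dictionary as the registry. The names MODULE_CATEGORY, module_import_path, and load_module_class are hypothetical; the real registry is built in modules/__init__.py and may look different.

```python
import importlib

# Hypothetical registry mapping module name -> category folder.
MODULE_CATEGORY = {
    "topoaa": "topology",
    "rigidbody": "sampling",
    "flexref": "refinement",
    "caprieval": "analysis",
}


def module_import_path(name, registry=MODULE_CATEGORY, root="haddock.modules"):
    """Build the dotted import path haddock.modules.<category>.<name>."""
    try:
        category = registry[name]
    except KeyError:
        raise ValueError(f"unknown module: {name!r}") from None
    return f"{root}.{category}.{name}"


def load_module_class(name, class_name="HaddockModule"):
    """Import the module package and return its class.

    Only works where haddock3 is installed; it is not called in this sketch.
    """
    package = importlib.import_module(module_import_path(name))
    return getattr(package, class_name)


path = module_import_path("rigidbody")
```

The design choice worth noting is that the class name is fixed (HaddockModule) and only the package path varies, which is what lets a new folder be picked up without further wiring.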

5.5 clis/ — command-line interfaces

Each cli_*.py exposes a maincli() wired to a console script in pyproject.toml. The full toolset:

Command              Module                           Purpose
haddock3             cli.py                           Run a workflow (the main entry point)
haddock3-cfg         cli_cfg.py                       Print a module’s parameters/defaults
haddock3-copy        cli_cp.py                        Copy/prepare a run for --extend-run
haddock3-clean       cli_clean.py                     Compress/clean a run’s outputs
haddock3-pp          cli_pp.py                        Preprocess input PDBs
haddock3-score       cli_score.py                     Score a complex standalone
haddock3-analyse     cli_analyse.py                   Generate analysis reports/plots
haddock3-traceback   cli_traceback.py                 Trace each final model back through the steps
haddock3-re          cli_re.py + re/                  Interactive re-scoring/re-clustering of a finished step
haddock3-restraints  cli_restraints.py + restraints/  Restraint generation utilities
haddock3-mpitask     cli_mpi.py                       Worker invoked under MPI execution
haddock3-dmn         cli_dmn.py                       Daemon for batch/grid coordination
haddock3-unpack      cli_unpack.py                    Unpack an archived/cleaned run

5.6 Bundled native/data assets

Path         Contents
cns/bin/     Per-platform CNS executables
cns/toppar/  CNS force-field topology/parameter files (TOPPAR env var points here)
deps/        contact_fcc.cpp, fast-rmsdmatrix.c; compiled to bin/ at install
fcc/         Python FCC matrix calculation and clustering
prodrg/      Per-platform PRODRG ligand-topology binaries


6. Inter-module communication: the ontology

Modules never call each other directly. They communicate through files described by the ontology in libs/libontology.py:

  • Persistent — base class for any framework-generated file (records name, type, path, optional md5, restraint file).

  • PDBFile — a model. Beyond the file path it carries score, clt_id, clt_rank, clt_model_rank, topology, aa_topology, ligand_top_fname, ligand_param_fname, restr_fname, seed, unw_energies, shape, etc. Comparison operators sort by score.

  • TopologyFile, RMSDFile — typed persistent files.

  • ModuleIO — the input/output container. Holds input and output lists of ontology objects; save()/load() serialise to/from io.json using jsonpickle. retrieve_models() prepares the previous step’s models for the current one (pairwise, cross-dock, or individualised); check_faulty() / remove_missing() enforce output completeness tolerance.

The mechanics in BaseHaddockModule (modules/__init__.py):

  • On construction, _load_previous_io() reads the previous step’s io.json into self.previous_io.

  • During _run() the module produces self.output_models.

  • export_io_models() builds a ModuleIO (input = previous output, output = new models), drops missing models, and writes this step’s io.json.
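The data model described in this section can be approximated with two small classes. This sketch deliberately uses plain json and dataclasses instead of jsonpickle, and the class names ToyModel and ToyIO, as well as the reduced field set, are assumptions made for the example rather than the real PDBFile and ModuleIO.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class ToyModel:
    """A model that carries metadata beyond its file name."""
    file_name: str
    score: float
    clt_id: Optional[int] = None
    seed: Optional[int] = None

    def __lt__(self, other):
        # Sorting a list of models orders them by score, lowest first.
        return self.score < other.score


@dataclass
class ToyIO:
    """Container holding the input and output model lists of one step."""
    input: list = field(default_factory=list)
    output: list = field(default_factory=list)

    def save(self, path):
        path.write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path):
        raw = json.loads(path.read_text())
        return cls(
            input=[ToyModel(**m) for m in raw["input"]],
            output=[ToyModel(**m) for m in raw["output"]],
        )


models = [ToyModel("m1.pdb", 2.5, seed=1), ToyModel("m2.pdb", -4.0, seed=2)]
with tempfile.TemporaryDirectory() as tmp:
    io_path = Path(tmp) / "io.json"
    ToyIO(output=models).save(io_path)
    loaded = ToyIO.load(io_path)

best = sorted(loaded.output)[0]
```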

Hidden side channels (handle with care)

Two mechanisms break the “clean declared input/output” abstraction and matter for anyone reworking the engine, caching, or determinism:

  1. _output_params — a module may publish key/value pairs that WorkflowManager.run() then injects into all later steps that expose the same key and haven’t set it (libworkflow.py). Example: topoaa propagates auto-generated ligand_param_fname/ligand_top_fname downstream.

  2. In-place mutation of PDBFile — attributes are mutated during execution (e.g. clustfcc writes cluster info onto the model objects; seletop sets rank). The same object travels through steps gathering state.
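The first side channel can be illustrated with plain dictionaries. How HADDOCK3 actually represents "exposes the key but has not set it" is not stated above, so this sketch assumes a None value means unset; the function name forward_output_params and the file names are hypothetical.

```python
def forward_output_params(step_params, producer_index, output_params):
    """Inject published values into later steps that expose but do not set them."""
    for params in step_params[producer_index + 1:]:
        for key, value in output_params.items():
            # Assumption: a key present with value None is "exposed but unset".
            if key in params and params[key] is None:
                params[key] = value
    return step_params


# One parameter dictionary per step, in workflow order.
step_params = [
    {},                                  # step 0: the producer
    {"ligand_top_fname": None},          # step 1: exposes the key, unset
    {"reference_fname": None},           # step 2: does not expose the key
    {"ligand_top_fname": "user.top"},    # step 3: already set by the user
]

forward_output_params(step_params, 0, {"ligand_top_fname": "auto.top"})
```

After the call only step 1 changes, which shows why the effective parameters of a step can depend on what an earlier step happened to produce.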


7. Parameters & configuration

  • Module defaults live in each module’s defaults.yaml as annotated entries (default, type, range, help text, group, expert level). gear/yaml2cfg.py flattens these into a usable config; the same metadata powers haddock3-cfg and external GUIs.

  • Global parameters come from two places:

    • mandatory/optional run-level params in core/mandatory.yaml / core/optional.yaml (run_dir, molecules, preprocess, postprocess, gen_archive);

    • non-mandatory general params in modules/defaults.yaml (ncores, mode, cns_exec, clean, self_contained, …) which can be set globally and overridden per module.

  • Precedence: module-local value > global value > module default. This is applied by recursive_dict_update in update_params (modules/__init__.py) and in Workflow.__init__.

  • Expandable parameters (gear/expandable_parameters.py) handle repeated/per-molecule blocks (e.g. mol1_*, seg_*), expanded against the actual number of input molecules.

  • Config format: a TOML-like .cfg parsed/written by gear/config.py; _fname-suffixed parameters are coerced to paths and existence-checked.
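The precedence rule listed above (module-local over global over module default) amounts to two successive dictionary merges. The helper below is a stand-in written for this example, not the project's recursive_dict_update, and the parameter values are illustrative.

```python
import copy


def merge_params(base, overrides):
    """Return a new dict where overrides win, merging nested dicts recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_params(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


module_defaults = {"sampling": 1000, "ncores": 4, "mode": "local"}
global_params = {"ncores": 8}          # set once at the top of the config
module_local = {"sampling": 200}       # set inside the module's own block

# Apply the lowest-priority source first and the highest-priority source last.
effective = merge_params(merge_params(module_defaults, global_params), module_local)
```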


8. Execution engines & parallelism

  • Across steps: none. Steps run strictly sequentially.

  • Within a step: pluggable engine. get_engine(mode, params) in modules/__init__.py is a small factory selecting an engine by the mode parameter:

    mode   Engine                                                     File
    local  Scheduler (multiprocessing)                                libs/libparallel.py
    batch  HPCScheduler (queue submit)                                libs/libhpc.py
    mpi    MPIScheduler                                               libs/libmpi.py
    grid   GRIDScheduler (DIRAC; falls back to local if unreachable)  libs/libgrid.py

  • Granularity. Sampling/refinement modules generate one CNS subprocess (CNSJob) per output model, and the engine fans these jobs across cores/nodes. This per-model job is the natural unit for any future caching scheme.
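The factory-plus-fan-out arrangement can be sketched in a few lines. This is an illustration under assumptions: only a "local" mode is implemented, threads are used instead of multiprocessing to keep the example portable, and the names ToyLocalEngine, TOY_ENGINES, and get_toy_engine are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


class ToyLocalEngine:
    """Run zero-argument jobs concurrently and return results in job order."""

    def __init__(self, tasks, ncores=2):
        self.tasks = list(tasks)
        self.ncores = ncores

    def run(self):
        with ThreadPoolExecutor(max_workers=self.ncores) as pool:
            futures = [pool.submit(task) for task in self.tasks]
            return [future.result() for future in futures]


# Hypothetical registry: mode name -> engine class.
TOY_ENGINES = {"local": ToyLocalEngine}


def get_toy_engine(mode, **engine_params):
    """Select an engine class by mode and pre-bind its parameters."""
    try:
        engine_cls = TOY_ENGINES[mode]
    except KeyError:
        raise ValueError(f"unsupported mode: {mode!r}") from None
    return partial(engine_cls, **engine_params)


# One job per "model"; each job here just computes a power of two.
jobs = [partial(pow, 2, n) for n in range(4)]
engine = get_toy_engine("local", ncores=2)(jobs)
results = engine.run()
```

A module written against this shape only needs a list of jobs and the returned engine's run() method, so swapping the execution backend does not touch the module body.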


9. CNS coupling

The physics modules wrap CNS rather than reimplementing it. The coupling lives in modules/base_cns_module.py and libs/libcns.py:

  • A CNS module loads its .cns template (recipe_str), fills it with the current parameters and per-model data via libcns, and writes a concrete .inp.

  • Each job runs as a CNSJob (libs/libsubprocess.py) — the CNS binary from cns/bin/, with env vars MODULE, MODDIR, TOPPAR (pointing at cns/toppar/).

  • self_contained mode copies the CNS scripts, toppar, and executable into the run directory so it can be re-run elsewhere.

Because the workflow plumbing is independent of CNS, replacing CNS for a given module is mostly a matter of replacing the subprocess invocation in that module’s _run(). The modern OpenMM refinement (modules/refinement/openmm/) is an example of a non-CNS engine living alongside the CNS ones.
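The template-then-subprocess pattern described in this section can be sketched without CNS itself. In the example below the Python interpreter stands in for the CNS binary, the template text is a toy placeholder rather than CNS syntax, and the directory paths are hypothetical; only the environment variable names MODULE, MODDIR, and TOPPAR come from the description above.

```python
import os
import subprocess
import sys
from string import Template

# Toy placeholder text, not real CNS syntax.
recipe = Template("sampling=$sampling\n")
input_text = recipe.substitute(sampling=20)


def run_job(command, stdin_text, extra_env):
    """Run one external job with extra environment variables; return its stdout."""
    env = {**os.environ, **extra_env}
    result = subprocess.run(
        command,
        input=stdin_text,
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    return result.stdout


# Stand-in executable: echoes TOPPAR and whatever it receives on stdin.
stand_in = [
    sys.executable,
    "-c",
    "import os, sys; print(os.environ['TOPPAR']); print(sys.stdin.read().strip())",
]

job_env = {"MODULE": "/opt/module/cns", "MODDIR": "/opt/run/1_step", "TOPPAR": "/opt/toppar"}
output_lines = run_job(stand_in, input_text, job_env).splitlines()
```

Swapping the engine for one module then comes down to changing what `command` points at and how its output is parsed, which is the point made in the paragraph above.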


10. Run directory layout

A completed run (run_dir) looks like:

run_dir/
├── 0_topoaa/          # one zero-padded, numbered folder per step
│   ├── io.json        # ModuleIO for this step (input + output models)
│   ├── params.cfg     # the exact parameters this step ran with
│   └── *.pdb, *.psf, *.inp, *.out, …
├── 1_rigidbody/
├── 2_caprieval/
├── data/              # copies of user inputs (molecules, restraints)
├── analysis/          # post-processing reports/plots (ANA_FOLDER)
├── traceback/         # model lineage across steps (TRACEBACK_FOLDER)
└── log                # run log

The step-folder naming (<index>_<modulename>) is the system of record for step identity and ordering. get_module_steps_folders() and is_step_folder() in modules/__init__.py parse it; the step_folder_regex there is the canonical matcher. Restart and extend operate on these folders by index.
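A small sketch of parsing and ordering such folder names follows. The pattern is an assumption written for this example and may differ from the project's step_folder_regex; parse_step_folder and ordered_steps are hypothetical helper names.

```python
import re

# Assumed shape: digits, an underscore, then the module name.
STEP_FOLDER = re.compile(r"^(\d+)_(\w+)$")


def parse_step_folder(name):
    """Return (index, module_name) for a step folder name, else None."""
    match = STEP_FOLDER.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def ordered_steps(names):
    """Keep only step folders and sort them by numeric index."""
    parsed = [step for step in map(parse_step_folder, names) if step is not None]
    return sorted(parsed)


names = ["10_caprieval", "data", "02_seletop", "00_topoaa", "analysis", "log", "01_rigidbody"]
steps_in_order = ordered_steps(names)
```

Sorting on the parsed integer rather than on the raw string keeps the order correct even if the zero-padding width were to vary between runs.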


11. Testing

Suite        Location            CNS needed  Scope
Unit         tests/              No          Functions/classes in isolation; fixtures in tests/golden_data, tests/data
Integration  integration_tests/  Yes         Individual modules end-to-end against CNS
End-to-end   end-to-end_tests/   Yes         Complete workflows

Run with pytest tests/, pytest integration_tests/, pytest end-to-end_tests/. See docs/pages/DEVELOPMENT.md and CONTRIBUTING.md.


12. Where to start when…