HADDOCK3 Architecture

Disclaimer: generated by Claude Code, model Opus 4.8 xhigh

This document describes the code architecture of HADDOCK3 and maps the major concepts to the files and folders that implement them. It is intended for developers and contributors who want a mental model of how the pieces fit together. For user-facing documentation see docs/pages/intro.md, the examples, and the online user manual.

Scope. This is a code-architecture map, not an API reference. The auto-generated API docs are built from docstrings with Sphinx (see docs/README.md).

1. What HADDOCK3 is

HADDOCK3 is the modular rewrite of the HADDOCK integrative-modelling software. Where HADDOCK2.x exposed a fixed three-stage pipeline (rigid-body docking → semi-flexible refinement → final refinement), HADDOCK3 lets users assemble their own pipeline by chaining reusable modules.

The unit of work is a workflow: a user-authored configuration file (TOML-like .cfg) that lists the modules to run, in order, with their parameters. HADDOCK3 reads that file, validates it, and executes each module in sequence, with each module reading the previous module’s output.
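
As a rough illustration of "modules listed in order", the sketch below pulls the section headers out of a small config text and keeps them in file order. It is not the real reader (that lives in gear/config.py); the config content, the HEADER pattern, and the ordered_modules helper are all assumptions made for this example.

```python
import re

# Hypothetical config text; parameter names and values are illustrative only.
CONFIG_TEXT = """
run_dir = "run1"
molecules = ["a.pdb", "b.pdb"]

[topoaa]

[rigidbody]
sampling = 200

[caprieval]

[flexref]

[caprieval]
"""

# Assumed header shape: a line consisting solely of "[name]".
HEADER = re.compile(r"^\[([A-Za-z_][A-Za-z0-9_]*)\]$")


def ordered_modules(config_text):
    """Return module names in the order their headers appear in the text."""
    names = []
    for line in config_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


steps = ordered_modules(CONFIG_TEXT)
print(steps)
```

The same module name can appear twice in the result, which is why a step is identified by its position rather than by its name.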

The physics (topology generation, docking, refinement, scoring) is largely done by CNS (Crystallography & NMR System), invoked as a subprocess per job. The Python codebase is mostly the orchestration, parameter handling, I/O, analysis, and plumbing around CNS.

2. The conceptual model

 config file (.cfg/.toml)            run directory/
 ┌───────────────────────┐          ┌──────────────────────────────┐
 │ run_dir = "run1"      │          │ 0_topoaa/   → io.json         │
 │ molecules = [...]     │  setup   │ 1_rigidbody/→ io.json         │
 │ [topoaa]              │ ───────► │ 2_seletop/  → io.json         │
 │ [rigidbody]           │          │ 3_flexref/  → io.json         │
 │ [seletop]             │          │ 4_caprieval/→ io.json         │
 │ [flexref]             │          │ data/        (inputs copied)  │
 │ [caprieval]           │          │ analysis/    (post-process)   │
 └───────────────────────┘          │ log, traceback/              │
                                     └──────────────────────────────┘

Key properties of the model (important for anyone reasoning about caching, re-runs, or replacing the engine):

Strictly linear DAG. There is no branching at the workflow level. A module type (e.g. caprieval) may appear multiple times; identity is by position (the numbered step folder), not by name.
Workflow is dynamic, defined per run. The list of steps is read from the user’s config at run start, not statically declared.
Module-to-module communication is via files. Each step writes an io.json describing its output models; the next step reads it.
The communication payload is rich. io.json holds serialized “ontology” objects (PDBFile, …) carrying far more than a path: score, cluster id/rank, topology references, restraint files, seed, unweighted energies, etc.

3. Execution flow

The end-to-end control flow for haddock3 <config>:

Step	Where	What happens
1. Parse CLI args	src/haddock/clis/cli.py	`maincli` → `main(workflow, restart, extend_run, …)`
2. Setup & validate	gear/prepare_run.py `setup_run()`	Parse config, validate module names/params, create `run_dir/`, copy inputs to `data/`, resolve defaults
3. Build workflow	libs/libworkflow.py `WorkflowManager` → `Workflow` → `Step`	One `Step` per config block, in order
4. Run each step	`Step.execute()`	Import the module package, instantiate `HaddockModule`, `update_params`, `save_config`, `run()`
5. Module body	each module’s `_run()`	Build CNS input (or run Python analysis), fan out jobs via an engine, collect models, write `io.json`
6. Forward runtime params	`WorkflowManager.run()`	Propagate any `_output_params` a module produced to later steps
7. Post-process	`WorkflowManager.postprocess()`	Run `cli_analyse` + `cli_traceback` over `caprieval` steps
8. Clean / archive	`WorkflowManager.clean()`, gear/postprocessing.py	Optionally compress step outputs and archive the run

Two variants of step 3–4 exist:

--restart N (N is the step's position index): delete step folders from N onward and re-run from there. See gear/restart_run.py. There is no content-based identity or partial reuse: changing sampling from 200 to 400 forces a full re-run of the sampling step and every step after it.
--extend-run: append new steps to a finished run, via WorkflowManagerExtend in gear/extend_run.py (paired with the haddock3-copy CLI).
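
The restart behaviour described above can be sketched as "delete every numbered folder at or beyond N". This is a minimal stand-in, not the code in gear/restart_run.py; remove_steps_from is a hypothetical helper and the folder names are made up for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path


def remove_steps_from(run_dir, restart_from):
    """Delete step folders whose numeric prefix is >= restart_from.

    Returns the names of the removed folders. Non-step entries such as
    'data' are left alone because their prefix is not numeric.
    """
    removed = []
    for folder in sorted(run_dir.iterdir()):
        prefix = folder.name.split("_", 1)[0]
        if folder.is_dir() and prefix.isdigit() and int(prefix) >= restart_from:
            shutil.rmtree(folder)
            removed.append(folder.name)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    for name in ("0_topoaa", "1_rigidbody", "2_seletop", "3_flexref", "data"):
        (run_dir / name).mkdir()
    removed = remove_steps_from(run_dir, restart_from=2)
    remaining = sorted(p.name for p in run_dir.iterdir())

print(removed)
print(remaining)
```

Nothing about the content of a step is inspected, which matches the absence of content-based identity: only the index decides what is discarded.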

4. Repository layout (top level)

Path	Role
src/haddock/	The Python package — all application code (see §5)
docs/	Sphinx documentation sources (`.md`, `.rst`), built to HTML
examples/	Ready-to-run example workflows and data, organised by system type
tests/	Unit tests (no CNS required)
integration_tests/	Integration tests (require CNS)
end-to-end_tests/	Full-workflow tests
notebooks/	Example/analysis Jupyter notebooks
varia/, devtools/	Auxiliary scripts and developer tooling
pyproject.toml, setup.py	Packaging, dependencies, console-script entry points
Dockerfile, entrypoint.sh	Containerised execution
`CHANGELOG.md`, `CONTRIBUTING.md`, `README.md`, `LICENSE`, …	Project metadata

5. The source tree (`src/haddock/`)

The package is organised into a small number of layers. Top-to-bottom, the dependency direction is roughly: clis → libworkflow/modules → gear → libs → core.

src/haddock/
├── __init__.py     # package paths, version, logging setup, EmptyPath sentinel
├── core/           # constants, parameter schemas, exceptions, types
├── gear/           # run-lifecycle machinery (plugin-like "gears")
├── libs/           # reusable libraries (I/O, CNS, parallelism, ontology, math…)
├── modules/        # the simulation/analysis modules, grouped by category
├── clis/           # command-line entry points (haddock3 and friends)
├── cns/            # bundled CNS binaries (bin/) and force-field data (toppar/)
├── fcc/            # Fraction of Common Contacts clustering helpers
├── deps/           # C/C++ sources compiled at install (contact_fcc, fast-rmsdmatrix)
└── prodrg/         # bundled PRODRG ligand-topology binaries

5.1 `core/` — definitions and contracts

The lowest layer: very little logic, mostly definitions that everything else depends on.

File	Contents
defaults.py	Framework constants: `RUNDIR`, `MODULE_IO_FILE` (`io.json`), `MODULE_DEFAULT_YAML` (`defaults.yaml`), `CNS_MODULES`, CNS executable discovery, exec paths for compiled deps
mandatory.yaml	Global mandatory parameters: `run_dir`, `molecules`
optional.yaml	Global optional parameters: `preprocess`, `postprocess`, `gen_archive`
exceptions.py	Custom errors: `HaddockError`, `StepError`, `ConfigurationError`, `HaddockTermination`
typing.py	Shared type aliases (`FilePath`, `ParamDict`, `ParamMap`, …)
cns_paths.py	Locations of CNS topology/parameter files
supported_molecules.py	Recognised residues/molecule types

5.2 `gear/` — run-lifecycle machinery

“Gears” are self-contained pieces of run-orchestration logic that sit between the CLI and the modules. Each handles one cross-cutting concern.

File	Concern
prepare_run.py	The big one: `setup_run()` — parse, validate (names, types, ranges, compatibility), create the run dir, copy inputs, expand parameters
config.py	Read/write the HADDOCK3 config format (`load`/`loads`/`save`, `get_module_name`, path coercion)
yaml2cfg.py	Turn a module’s annotated `defaults.yaml` into a flat default config; detect incompatible params
parameters.py	Definitions of mandatory/general parameter sets
expandable_parameters.py	Per-molecule / repeatable parameter blocks (e.g. `mol_`, `seg_`)
validations.py	Domain-specific validation rules
restart_run.py	`--restart` flag logic
extend_run.py	`--extend-run` flag + `haddock3-copy`; `WorkflowManagerExtend`
clean_steps.py	Compress/clean a step’s output files
postprocessing.py	Archive the run, build analysis bundle
preprocessing.py	Input PDB sanitisation/preprocessing
zerofill.py	Compute the zero-padded numeric step-folder prefixes
haddockmodel.py	`HaddockModel`: parse CNS output PDBs and their energy headers
known_cns_errors.py	Pattern-match common CNS failures from logs
greetings.py	Banner / feedback messages

5.3 `libs/` — reusable libraries

Stateless or near-stateless helpers used across modules and gears.

File	Responsibility
libworkflow.py	Workflow engine: `WorkflowManager`, `Workflow`, `Step` (see §3)
libontology.py	Inter-module data model: `PDBFile`, `TopologyFile`, `RMSDFile`, `ModuleIO` (see §6)
libcns.py	Build CNS input scripts from templates + parameters
libsubprocess.py	`CNSJob` and `Job` wrappers around subprocess execution
libparallel.py	`Scheduler`/`Worker` — local multiprocessing fan-out
libhpc.py	`HPCScheduler` — batch/queue submission
libmpi.py	`MPIScheduler` — MPI execution
libgrid.py	`GRIDScheduler` — DIRAC grid execution
libpdb.py, libstructure.py	Parse and manipulate PDB structures
libalign.py, libmath.py	Alignment and RMSD/geometry maths
libclust.py, libfcc.py	Clustering helpers
librestraints.py	Restraint (`.tbl`) handling
libaa2cg.py, libligand.py	Coarse-grain mapping, ligand topology
libplots.py, libnotebooks.py	Analysis plots and notebook generation
libprodigy.py	PRODIGY binding-affinity scoring integration
libinteractive.py	Backing for `haddock3-re` interactive re-scoring/clustering
libio.py, liblog.py, libtimer.py, libutil.py, libcli.py, libfunc.py	Cross-cutting utilities (I/O, logging, timing, CLI args, functional helpers)
assets/	Static assets used by libs (e.g. templates)

5.4 `modules/` — the simulation & analysis modules

This is where the science lives. Modules are grouped into categories, which are just the immediate subfolders. The category registry and the BaseHaddockModule contract are defined in modules/__init__.py.

modules/
├── __init__.py          # module registry, BaseHaddockModule, get_engine(), step-folder helpers
├── base_cns_module.py   # BaseCNSModule: shared behaviour for CNS-backed modules
├── defaults.yaml        # global non-mandatory parameters (ncores, mode, clean, …)
├── topology/            # topoaa, topocg
├── sampling/            # rigidbody, lightdock
├── refinement/          # flexref, emref, mdref, cgtoaa, openmm
├── scoring/             # emscoring, mdscoring, prodigyprotein, prodigyligand, sasascore
├── analysis/            # caprieval, clustfcc, clustrmsd, rmsdmatrix, ilrmsdmatrix,
│                        #   seletop, seletopclusts, alascan, contactmap, filter
├── extras/              # exit
└── _template_cat/       # template for authoring a new category/module

The category hierarchy (declared order, used for ordering/validation) is: topology → sampling → refinement → scoring → analysis → extras. There is no constraint on mixing categories in a workflow; the hierarchy is organisational.

Anatomy of a module. Every module is a package (folder) containing:

__init__.py defining a class named HaddockModule (subclass of BaseHaddockModule, or BaseCNSModule for CNS-backed ones). The module-level docstring is the user documentation. _run() is the body.
defaults.yaml — every parameter annotated with default, type, range, title/short/long help text, group, and explevel (easy/expert/guru). This single file drives defaults, validation, the haddock3-cfg help output, and the web/GUI parameter forms.
cns/ (CNS modules only) — the .cns template scripts run by CNS.

See modules/_template_cat/ for the canonical skeleton when adding a module.
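
To make the role of the annotated defaults.yaml entries concrete, here is a small sketch of checking a user value against one entry. The entry layout, the TYPE_MAP names, and the validate helper are assumptions for illustration; the real checks are performed during run setup (gear/prepare_run.py) and may differ in detail.

```python
# Hypothetical annotated entry. The keys follow the fields named above
# (default, type, range); a real defaults.yaml may spell them differently,
# and the numbers here are made up.
SAMPLING_SPEC = {"default": 1000, "type": "integer", "range": (1, 50000)}

# Assumed mapping from declared type names to Python types.
TYPE_MAP = {"integer": int, "float": float, "string": str, "boolean": bool}


def validate(name, value, spec):
    """Return value if it matches the declared type and range, else raise."""
    expected = TYPE_MAP[spec["type"]]
    # bool is a subclass of int, so reject it explicitly for non-boolean params.
    if isinstance(value, bool) and expected is not bool:
        raise TypeError(f"{name}: expected {spec['type']}, got boolean")
    if not isinstance(value, expected):
        raise TypeError(f"{name}: expected {spec['type']}, got {type(value).__name__}")
    if "range" in spec:
        low, high = spec["range"]
        if not low <= value <= high:
            raise ValueError(f"{name}: {value} is outside [{low}, {high}]")
    return value


print(validate("sampling", 200, SAMPLING_SPEC))
```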

Step.execute() discovers a module dynamically: it looks up the category in the registry, imports haddock.modules.<category>.<name>, and instantiates that package’s HaddockModule. Adding a module is therefore a matter of dropping a correctly-shaped folder in the right category — no central registration.
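
The discovery pattern (registry lookup, dynamic import, fetch a class by a fixed name) can be sketched with importlib. So that the example runs without HADDOCK3 installed, standard-library packages stand in for haddock.modules.<category>.<name>; REGISTRY and load_class are hypothetical names, not part of the codebase.

```python
import importlib

# Hypothetical registry: module name -> category. Standard-library packages
# play the role of categories here purely so the sketch is runnable.
REGISTRY = {"decoder": "json", "handlers": "logging"}


def load_class(name, class_name, root=""):
    """Import '<root>.<category>.<name>' and return the class called class_name."""
    category = REGISTRY[name]  # a KeyError here means "unknown module"
    dotted = ".".join(part for part in (root, category, name) if part)
    package = importlib.import_module(dotted)
    return getattr(package, class_name)


decoder_cls = load_class("decoder", "JSONDecoder")
print(decoder_cls.__module__, decoder_cls.__name__)
```

In HADDOCK3 the class name is always HaddockModule, so the only per-module input to such a lookup is the module name itself.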

5.5 `clis/` — command-line interfaces

Each CLI module (cli.py and the cli_*.py files) exposes a maincli() wired to a console script in pyproject.toml. The full toolset:

Command	Module	Purpose
`haddock3`	cli.py	Run a workflow (the main entry point)
`haddock3-cfg`	cli_cfg.py	Print a module’s parameters/defaults
`haddock3-copy`	cli_cp.py	Copy/prepare a run for `--extend-run`
`haddock3-clean`	cli_clean.py	Compress/clean a run’s outputs
`haddock3-pp`	cli_pp.py	Preprocess input PDBs
`haddock3-score`	cli_score.py	Score a complex standalone
`haddock3-analyse`	cli_analyse.py	Generate analysis reports/plots
`haddock3-traceback`	cli_traceback.py	Trace each final model back through the steps
`haddock3-re`	cli_re.py + re/	Interactive re-scoring/re-clustering of a finished step
`haddock3-restraints`	cli_restraints.py + restraints/	Restraint generation utilities
`haddock3-mpitask`	cli_mpi.py	Worker invoked under MPI execution
`haddock3-dmn`	cli_dmn.py	Daemon for batch/grid coordination
`haddock3-unpack`	cli_unpack.py	Unpack an archived/cleaned run

5.6 Bundled native/data assets

Path	Contents
cns/bin/	Per-platform CNS executables
cns/toppar/	CNS force-field topology/parameter files (`TOPPAR` env var points here)
deps/	`contact_fcc.cpp`, `fast-rmsdmatrix.c` — compiled to `bin/` at install
fcc/	Python FCC matrix calculation and clustering
prodrg/	Per-platform PRODRG ligand-topology binaries

6. Inter-module communication: the ontology

Modules never call each other directly. They communicate through files described by the ontology in libs/libontology.py:

Persistent — base class for any framework-generated file (records name, type, path, optional md5, restraint file).
PDBFile — a model. Beyond the file path it carries score, clt_id, clt_rank, clt_model_rank, topology, aa_topology, ligand_top_fname, ligand_param_fname, restr_fname, seed, unw_energies, shape, etc. Comparison operators sort by score.
TopologyFile, RMSDFile — typed persistent files.
ModuleIO — the input/output container. Holds input and output lists of ontology objects; save()/load() serialise to/from io.json using jsonpickle. retrieve_models() prepares the previous step’s models for the current one (pairwise, cross-dock, or individualised); check_faulty() / remove_missing() enforce output completeness tolerance.

The mechanics in BaseHaddockModule (modules/__init__.py):

On construction, _load_previous_io() reads the previous step’s io.json into self.previous_io.
During _run() the module produces self.output_models.
export_io_models() builds a ModuleIO (input = previous output, output = new models), drops missing models, and writes this step’s io.json.
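
The hand-off between two steps can be sketched as a write-then-read of io.json. This is a deliberately reduced stand-in: it uses the standard json module rather than jsonpickle, the Model class carries only a few of the PDBFile attributes, and save_io/load_io are hypothetical helpers.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Model:
    """Reduced stand-in for PDBFile: a path plus a little metadata."""

    file_name: str
    score: Optional[float] = None
    clt_id: Optional[int] = None
    seed: Optional[int] = None

    def __lt__(self, other):
        # Models sort by score, lowest (best) first.
        return self.score < other.score


def save_io(path, input_models, output_models):
    """Write both model lists to a JSON file."""
    payload = {
        "input": [asdict(m) for m in input_models],
        "output": [asdict(m) for m in output_models],
    }
    path.write_text(json.dumps(payload, indent=2))


def load_io(path):
    """Read the file back into (input_models, output_models)."""
    payload = json.loads(path.read_text())
    return (
        [Model(**d) for d in payload["input"]],
        [Model(**d) for d in payload["output"]],
    )


with tempfile.TemporaryDirectory() as tmp:
    io_path = Path(tmp) / "io.json"
    produced = [Model("m1.pdb", score=-12.5, seed=1), Model("m2.pdb", score=-30.1, seed=2)]
    save_io(io_path, input_models=[], output_models=produced)
    # The next step treats the previous step's output as its own input.
    _, previous_output = load_io(io_path)

ranked = sorted(previous_output)
print([m.file_name for m in ranked])
```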

Hidden side channels (handle with care)

Two mechanisms break the “clean declared input/output” abstraction and matter for anyone reworking the engine, caching, or determinism:

_output_params — a module may publish key/value pairs that WorkflowManager.run() then injects into all later steps that expose the same key and haven’t set it (libworkflow.py). Example: topoaa propagates auto-generated ligand_param_fname/ligand_top_fname downstream.
In-place mutation of PDBFile — attributes are mutated during execution (e.g. clustfcc writes cluster info onto the model objects; seletop sets rank). The same object travels through steps gathering state.
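
The first side channel can be sketched as follows. The step representation (a dict with 'accepts' and 'params') and the forward_output_params helper are assumptions for illustration, not the structures used in libworkflow.py; the file names are made up.

```python
def forward_output_params(published, later_steps):
    """Fill published values into later steps that accept the key but left it unset.

    Each step is a dict with 'accepts' (parameter names the module exposes)
    and 'params' (values explicitly set by the user). Steps are updated in place.
    """
    for step in later_steps:
        for key, value in published.items():
            if key in step["accepts"] and key not in step["params"]:
                step["params"][key] = value


published = {"ligand_top_fname": "ligand.top"}
later_steps = [
    {"name": "rigidbody", "accepts": {"ligand_top_fname", "sampling"}, "params": {}},
    {
        "name": "flexref",
        "accepts": {"ligand_top_fname"},
        "params": {"ligand_top_fname": "custom.top"},
    },
    {"name": "caprieval", "accepts": {"reference_fname"}, "params": {}},
]
forward_output_params(published, later_steps)

for step in later_steps:
    print(step["name"], step["params"])
```

The effective parameters of a later step therefore depend on what an earlier step happened to publish at run time, which is what makes this channel relevant to caching and determinism.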

7. Parameters & configuration

Module defaults live in each module’s defaults.yaml as annotated entries (default, type, range, help text, group, expert level). gear/yaml2cfg.py flattens these into a usable config; the same metadata powers haddock3-cfg and external GUIs.
Global parameters come from two places:
- mandatory/optional run-level params in core/mandatory.yaml / core/optional.yaml (run_dir, molecules, preprocess, postprocess, gen_archive);
- non-mandatory general params in modules/defaults.yaml (ncores, mode, cns_exec, clean, self_contained, …) which can be set globally and overridden per module.
Precedence: module-local value > global value > module default. This is applied by recursive_dict_update in update_params (modules/__init__.py) and in Workflow.__init__.
Expandable parameters (gear/expandable_parameters.py) handle repeated/per-molecule blocks (e.g. mol1_*, seg_*), expanded against the actual number of input molecules.
Config format: a TOML-like .cfg parsed/written by gear/config.py; _fname-suffixed parameters are coerced to paths and existence-checked.
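
The precedence rule above amounts to two successive merges. The sketch below uses a hypothetical recursive_update helper as a stand-in for recursive_dict_update (whose exact behaviour is not reproduced here), and the parameter names and values are illustrative.

```python
import copy


def recursive_update(base, overrides):
    """Return a copy of base with overrides merged in, descending into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = recursive_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Lowest priority first: module defaults, then global values, then module-local values.
module_defaults = {"ncores": 4, "sampling": 1000, "group": {"fix": False, "shape": False}}
global_params = {"ncores": 8}
module_local = {"sampling": 200, "group": {"fix": True}}

effective = recursive_update(recursive_update(module_defaults, global_params), module_local)
print(effective)
```

Because the merge recurses, overriding one key of a nested block leaves its sibling keys at their defaults instead of discarding the whole block.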

8. Execution engines & parallelism

Across steps: none. Steps run strictly sequentially.

Within a step: pluggable engine. get_engine(mode, params) in modules/__init__.py is a small factory selecting an engine by the mode parameter:

`mode`	Engine	File
`local`	`Scheduler` (multiprocessing)	libs/libparallel.py
`batch`	`HPCScheduler` (queue submit)	libs/libhpc.py
`mpi`	`MPIScheduler`	libs/libmpi.py
`grid`	`GRIDScheduler` (DIRAC; falls back to `local` if unreachable)	libs/libgrid.py

Granularity. Sampling/refinement modules generate one CNS subprocess (CNSJob) per output model, and the engine fans these jobs across cores/nodes. This per-model job is the natural unit for any future caching scheme.
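
A factory of this kind can be sketched as a dictionary lookup that returns an engine constructor. The classes below are stand-ins that simply run jobs in the current process; pick_engine, ENGINES, and the class names are hypothetical and say nothing about the real signature of get_engine() or of the scheduler classes.

```python
class SequentialEngine:
    """Stand-in engine: runs each job in turn in the current process."""

    def __init__(self, jobs, **params):
        self.jobs = list(jobs)
        self.params = params

    def run(self):
        return [job() for job in self.jobs]


class BatchEngine(SequentialEngine):
    """Placeholder for a queue-submitting engine; behaves like the stand-in here."""


ENGINES = {"local": SequentialEngine, "batch": BatchEngine}


def pick_engine(mode, **params):
    """Return a callable that builds the engine selected by mode."""
    try:
        engine_cls = ENGINES[mode]
    except KeyError:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(ENGINES)}") from None
    return lambda jobs: engine_cls(jobs, **params)


# One job per output model, mirroring the per-model granularity described above.
jobs = [lambda i=i: f"model_{i}.pdb" for i in range(1, 4)]
make_engine = pick_engine("local", ncores=2)
results = make_engine(jobs).run()
print(results)
```

Keeping the module code dependent only on "something with a run() method" is what lets the mode parameter switch execution back ends without touching the modules.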

9. CNS coupling

The physics modules wrap CNS rather than reimplementing it. The coupling lives in modules/base_cns_module.py and libs/libcns.py:

A CNS module loads its .cns template (recipe_str), fills it with the current parameters and per-model data via libcns, and writes a concrete .inp.
Each job runs as a CNSJob (libs/libsubprocess.py) — the CNS binary from cns/bin/, with env vars MODULE, MODDIR, TOPPAR (pointing at cns/toppar/).
self_contained mode copies the CNS scripts, toppar, and executable into the run directory so it can be re-run elsewhere.
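
The template-fill-then-subprocess pattern above can be sketched without CNS by letting a Python one-liner play the part of the engine. Everything specific here is an assumption: the template line, the path value, feeding the input through stdin, and the run_job helper. Only the TOPPAR variable name comes from the description above.

```python
import os
import subprocess
import sys


def run_job(command, input_text, env_overrides):
    """Run command with extra environment variables, feeding input_text on stdin."""
    env = {**os.environ, **env_overrides}
    completed = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    return completed.stdout


# Stand-in "engine": echoes the TOPPAR variable and the first line of its input.
stand_in = [
    sys.executable,
    "-c",
    "import os, sys; print(os.environ['TOPPAR']); print(sys.stdin.readline().strip())",
]

# Illustrative template; per-model data (here a seed) is filled in before the run.
template = "evaluate ($seed = {seed})"
job_input = template.format(seed=42)

output = run_job(stand_in, job_input, {"TOPPAR": "/path/to/toppar"})
print(output)
```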

Because the workflow plumbing is independent of CNS, replacing CNS for a given module is mostly a matter of replacing the subprocess invocation in that module’s _run(). The modern OpenMM refinement (modules/refinement/openmm/) is an example of a non-CNS engine living alongside the CNS ones.

10. Run directory layout

A completed run (run_dir) looks like:

run_dir/
├── 0_topoaa/          # one zero-padded, numbered folder per step
│   ├── io.json        # ModuleIO for this step (input + output models)
│   ├── params.cfg     # the exact parameters this step ran with
│   └── *.pdb, *.psf, *.inp, *.out, …
├── 1_rigidbody/
├── 2_caprieval/
├── data/              # copies of user inputs (molecules, restraints)
├── analysis/          # post-processing reports/plots (ANA_FOLDER)
├── traceback/         # model lineage across steps (TRACEBACK_FOLDER)
└── log                # run log

The step-folder naming (<index>_<modulename>) is the system of record for step identity and ordering. get_module_steps_folders() and is_step_folder() in modules/__init__.py parse it; the step_folder_regex there is the canonical matcher. Restart and extend operate on these folders by index.
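
A matcher for the <index>_<modulename> convention might look like the sketch below. The pattern is an assumption written for this example and is not copied from step_folder_regex; parse_step and ordered_steps are hypothetical helpers.

```python
import re

# Assumed pattern: one or more digits, an underscore, then a module name.
STEP_FOLDER = re.compile(r"^(\d+)_(\w+)$")


def parse_step(name):
    """Return (index, module_name) for a step folder name, or None otherwise."""
    match = STEP_FOLDER.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def ordered_steps(names):
    """Keep only step folders and sort them by numeric index."""
    parsed = [step for step in map(parse_step, names) if step is not None]
    return sorted(parsed)


entries = ["10_caprieval", "02_seletop", "data", "00_topoaa", "analysis", "log", "01_rigidbody"]
print(ordered_steps(entries))
```

Sorting on the parsed integer rather than on the raw string keeps the order correct even if the zero-padding width is inconsistent.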

11. Testing

Suite	Location	CNS needed	Scope
Unit	tests/	No	Functions/classes in isolation; fixtures in `tests/golden_data`, `tests/data`
Integration	integration_tests/	Yes	Individual modules end-to-end against CNS
End-to-end	end-to-end_tests/	Yes	Complete workflows

Run with pytest tests/, pytest integration_tests/, pytest end-to-end_tests/. See docs/pages/DEVELOPMENT.md and CONTRIBUTING.md.

12. Where to start when…

Adding a module → copy modules/_template_cat/; implement HaddockModule._run() and author defaults.yaml. No central registration is needed.
Changing how steps are scheduled/chained → libs/libworkflow.py.
Changing what flows between modules → libs/libontology.py.
Changing run setup/validation → gear/prepare_run.py.
Adding/adjusting parallel execution → get_engine() in modules/__init__.py and the scheduler libraries in libs/ (libparallel.py, libhpc.py, libmpi.py, libgrid.py).
A new CLI tool → add clis/cli_<name>.py with a maincli() and register it under [project.scripts] in pyproject.toml.