HADDOCK3 Architecture
Disclaimer: generated by Claude Code, model Opus 4.8 xhigh
This document describes the code architecture of HADDOCK3 and maps the major concepts to the files and folders that implement them. It is intended for developers and contributors who want a mental model of how the pieces fit together. For user-facing documentation see docs/pages/intro.md, the examples, and the online user manual.
Scope. This is a code-architecture map, not an API reference. The auto-generated API docs are built from docstrings with Sphinx (see docs/README.md).
1. What HADDOCK3 is
HADDOCK3 is the modular rewrite of the HADDOCK integrative-modelling software. Where HADDOCK2.x exposed a fixed three-stage pipeline (rigid-body docking → semi-flexible refinement → final refinement), HADDOCK3 lets users assemble their own pipeline by chaining reusable modules.
The unit of work is a workflow: a user-authored configuration file
(TOML-like .cfg) that lists the modules to run, in order, with their
parameters. HADDOCK3 reads that file, validates it, and executes each module in
sequence, with each module reading the previous module’s output.
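As a rough illustration of "lists the modules to run, in order", the sketch below extracts the ordered step list from a config-like text. It is not the HADDOCK3 parser: the `list_steps` helper, the header regex, and the input file names are assumptions made for this example.

```python
import re

# Hypothetical config text; the molecule file names are placeholders.
CONFIG = '''
run_dir = "run1"
molecules = ["a.pdb", "b.pdb"]

[topoaa]

[rigidbody]
sampling = 200

[caprieval]

[flexref]

[caprieval]
'''

# Assumed header syntax: a bare [modulename] on its own line.
HEADER = re.compile(r"^\[(?P<name>\w+)\]\s*$")


def list_steps(text):
    """Return module names in the order their headers appear."""
    steps = []
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            steps.append(match["name"])
    return steps


steps = list_steps(CONFIG)
```

The same module name can appear twice in the result, which is why a step is identified by its position rather than by its name.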
The physics (topology generation, docking, refinement, scoring) is largely done by CNS (Crystallography & NMR System), invoked as a subprocess per job. The Python codebase is mostly the orchestration, parameter handling, I/O, analysis, and plumbing around CNS.
2. The conceptual model
 config file (.cfg/.toml)                    run directory/
┌───────────────────────┐            ┌──────────────────────────────┐
│ run_dir = "run1"      │            │ 0_topoaa/    → io.json       │
│ molecules = [...]     │   setup    │ 1_rigidbody/ → io.json       │
│ [topoaa]              │  ───────►  │ 2_seletop/   → io.json       │
│ [rigidbody]           │            │ 3_flexref/   → io.json       │
│ [seletop]             │            │ 4_caprieval/ → io.json       │
│ [flexref]             │            │ data/       (inputs copied)  │
│ [caprieval]           │            │ analysis/   (post-process)   │
└───────────────────────┘            │ log, traceback/              │
                                     └──────────────────────────────┘
Key properties of the model (important for anyone reasoning about caching, re-runs, or replacing the engine):
- Strictly linear DAG. There is no branching at the workflow level. A module type (e.g. caprieval) may appear multiple times; identity is by position (the numbered step folder), not by name.
- Workflow is dynamic, defined per run. The list of steps is read from the user’s config at run start, not statically declared.
- Module-to-module communication is via files. Each step writes an io.json describing its output models; the next step reads it.
- The communication payload is rich. io.json holds serialized “ontology” objects (PDBFile, …) carrying far more than a path: score, cluster id/rank, topology references, restraint files, seed, unweighted energies, etc.
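These properties can be sketched in a few lines. The code below is not HADDOCK3 code: `run_step`, the payload keys, and the two toy step functions are assumptions for illustration; only the `<index>_<name>/io.json` layout comes from the description above.

```python
import json
import tempfile
from pathlib import Path


def run_step(run_dir, index, name, func):
    """Run one step: read the previous step's io.json, write this step's."""
    step_dir = Path(run_dir) / f"{index}_{name}"
    step_dir.mkdir(parents=True)

    # Identity is positional: the previous step is found by index, not by name.
    previous = sorted(Path(run_dir).glob(f"{index - 1}_*")) if index > 0 else []
    if previous:
        models = json.loads((previous[0] / "io.json").read_text())["output"]
    else:
        models = []

    output = func(models)
    payload = {"input": models, "output": output}
    (step_dir / "io.json").write_text(json.dumps(payload))
    return output


def make_models(_models):
    # Toy "sampling" step: invent three models with scores.
    scores = [-3.0, -7.5, -1.2]
    return [{"file_name": f"model_{i}.pdb", "score": s}
            for i, s in enumerate(scores, start=1)]


def select_top(models, n=2):
    # Toy "selection" step: keep the n lowest-scoring models.
    return sorted(models, key=lambda m: m["score"])[:n]


# The workflow is a plain ordered list, built at run time.
workflow = [("sample", make_models), ("seltop", select_top), ("seltop", select_top)]

with tempfile.TemporaryDirectory() as run_dir:
    for index, (name, func) in enumerate(workflow):
        final = run_step(run_dir, index, name, func)
    step_folders = sorted(p.name for p in Path(run_dir).iterdir())
```

The two `seltop` steps share a name but live in different numbered folders, and no step function ever calls another: everything passes through `io.json`.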
3. Execution flow
The end-to-end control flow for haddock3 <config>:
| Step | Where | What happens |
|---|---|---|
| 1. Parse CLI args | | |
| 2. Setup & validate | gear/prepare_run.py | Parse config, validate module names/params, create |
| 3. Build workflow | libs/libworkflow.py | One |
| 4. Run each step | | Import the module package, instantiate |
| 5. Module body | each module’s | Build CNS input (or run Python analysis), fan out jobs via an engine, collect models, write |
| 6. Forward runtime params | | Propagate any |
| 7. Post-process | | Run |
| 8. Clean / archive | | Optionally compress step outputs and archive the run |
Two variants of step 3–4 exist:
- --restart N (positional): delete step folders from N onward and re-run from there. See gear/restart_run.py. There is no content-based identity or partial reuse: changing sampling=200→400 forces a full re-run of sampling.
- --extend-run: append new steps to a finished run, via WorkflowManagerExtend in gear/extend_run.py (paired with the haddock3-copy CLI).
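The restart semantics described above (drop every step folder from N onward, keep everything else) can be sketched as follows. This is an illustration only, not the code in gear/restart_run.py; `remove_steps_from` and the `STEP_RE` pattern are hypothetical.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Hypothetical pattern for "<index>_<modulename>" step folders.
STEP_RE = re.compile(r"^(\d+)_(\w+)$")


def remove_steps_from(run_dir, restart_from):
    """Delete every step folder whose index is >= restart_from."""
    removed = []
    for path in sorted(Path(run_dir).iterdir()):
        match = STEP_RE.match(path.name)
        if path.is_dir() and match and int(match.group(1)) >= restart_from:
            shutil.rmtree(path)
            removed.append(path.name)
    return removed


with tempfile.TemporaryDirectory() as run_dir:
    for name in ["0_topoaa", "1_rigidbody", "2_seletop", "3_flexref", "data"]:
        (Path(run_dir) / name).mkdir()
    removed = remove_steps_from(run_dir, restart_from=2)
    remaining = sorted(p.name for p in Path(run_dir).iterdir())
```

Because the decision is made purely on the folder index, nothing from the deleted steps is reused, which matches the "no partial reuse" note above.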
4. Repository layout (top level)
| Path | Role |
|---|---|
| | The Python package: all application code (see §5) |
| | Sphinx documentation sources ( |
| | Ready-to-run example workflows and data, organised by system type |
| | Unit tests (no CNS required) |
| | Integration tests (require CNS) |
| | Full-workflow tests |
| | Example/analysis Jupyter notebooks |
| | Auxiliary scripts and developer tooling |
| | Packaging, dependencies, console-script entry points |
| | Containerised execution |
| | Project metadata |
5. The source tree (src/haddock/)
The package is organised into a small number of layers. Top-to-bottom, the
dependency direction is roughly: clis → libworkflow/modules → gear →
libs → core.
src/haddock/
├── __init__.py # package paths, version, logging setup, EmptyPath sentinel
├── core/ # constants, parameter schemas, exceptions, types
├── gear/ # run-lifecycle machinery (plugin-like "gears")
├── libs/ # reusable libraries (I/O, CNS, parallelism, ontology, math…)
├── modules/ # the simulation/analysis modules, grouped by category
├── clis/ # command-line entry points (haddock3 and friends)
├── cns/ # bundled CNS binaries (bin/) and force-field data (toppar/)
├── fcc/ # Fraction of Common Contacts clustering helpers
├── deps/ # C/C++ sources compiled at install (contact_fcc, fast-rmsdmatrix)
└── prodrg/ # bundled PRODRG ligand-topology binaries
5.1 core/ — definitions and contracts
The lowest layer: no logic, just definitions everything else depends on.
| File | Contents |
|---|---|
| | Framework constants: |
| | Global mandatory parameters: |
| | Global optional parameters: |
| | Custom errors: |
| | Shared type aliases ( |
| | Locations of CNS topology/parameter files |
| | Recognised residues/molecule types |
5.2 gear/ — run-lifecycle machinery
“Gears” are self-contained pieces of run-orchestration logic that sit between the CLI and the modules. Each handles one cross-cutting concern.
| File | Concern |
|---|---|
| | The big one: |
| | Read/write the HADDOCK3 config format ( |
| | Turn a module’s annotated |
| | Definitions of mandatory/general parameter sets |
| | Per-molecule / repeatable parameter blocks (e.g. |
| | Domain-specific validation rules |
| | |
| | |
| | Compress/clean a step’s output files |
| | Archive the run, build analysis bundle |
| | Input PDB sanitisation/preprocessing |
| | Compute the zero-padded numeric step-folder prefixes |
| | |
| | Pattern-match common CNS failures from logs |
| | Banner / feedback messages |
5.3 libs/ — reusable libraries
Stateless or near-stateless helpers used across modules and gears.
| File | Responsibility |
|---|---|
| | Workflow engine: |
| | Inter-module data model: |
| | Build CNS input scripts from templates + parameters |
| | |
| | |
| | |
| | |
| | |
| | Parse and manipulate PDB structures |
| | Alignment and RMSD/geometry maths |
| | Clustering helpers |
| | Restraint ( |
| | Coarse-grain mapping, ligand topology |
| | Analysis plots and notebook generation |
| | PRODIGY binding-affinity scoring integration |
| | Backing for |
| libio.py, liblog.py, libtimer.py, libutil.py, libcli.py, libfunc.py | Cross-cutting utilities (I/O, logging, timing, CLI args, functional helpers) |
| | Static assets used by libs (e.g. templates) |
5.4 modules/ — the simulation & analysis modules
This is where the science lives. Modules are grouped into categories, which
are just the immediate subfolders. The category registry and the
BaseHaddockModule contract are defined in
modules/__init__.py.
modules/
├── __init__.py # module registry, BaseHaddockModule, get_engine(), step-folder helpers
├── base_cns_module.py # BaseCNSModule: shared behaviour for CNS-backed modules
├── defaults.yaml # global non-mandatory parameters (ncores, mode, clean, …)
├── topology/ # topoaa, topocg
├── sampling/ # rigidbody, lightdock
├── refinement/ # flexref, emref, mdref, cgtoaa, openmm
├── scoring/ # emscoring, mdscoring, prodigyprotein, prodigyligand, sasascore
├── analysis/ # caprieval, clustfcc, clustrmsd, rmsdmatrix, ilrmsdmatrix,
│ # seletop, seletopclusts, alascan, contactmap, filter
├── extras/ # exit
└── _template_cat/ # template for authoring a new category/module
The category hierarchy (declared order, used for ordering/validation) is:
topology → sampling → refinement → scoring → analysis → extras. There is no
constraint on mixing categories in a workflow; the hierarchy is organisational.
Anatomy of a module. Every module is a package (folder) containing:
- __init__.py defining a class named HaddockModule (subclass of BaseHaddockModule, or BaseCNSModule for CNS-backed ones). The module-level docstring is the user documentation. _run() is the body.
- defaults.yaml: every parameter annotated with default, type, range, title/short/long help text, group, and explevel (easy/expert/guru). This single file drives defaults, validation, the haddock3-cfg help output, and the web/GUI parameter forms.
- cns/ (CNS modules only): the .cns template scripts run by CNS.
See modules/_template_cat/ for the canonical skeleton when adding a module.
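To make the role of the annotated entries concrete, here is a minimal sketch of how one entry could drive both the default and the validation. The entry is written as a Python dict rather than YAML, and the key names `min` and `max`, the values, and the helpers `validate` and `resolve` are all assumptions for this example, not the HADDOCK3 implementation.

```python
# Hypothetical annotated entry, written as a dict instead of YAML.
# "min" and "max" are assumed key names for the range; values are made up.
SAMPLING = {
    "default": 1000,
    "type": "integer",
    "min": 1,
    "max": 50000,
    "title": "Number of models to generate",
    "group": "sampling",
    "explevel": "easy",
}

PY_TYPES = {"integer": int, "float": float, "string": str, "boolean": bool}


def validate(name, value, entry):
    """Return value if it satisfies the annotated entry, else raise."""
    expected = PY_TYPES[entry["type"]]
    # bool is a subclass of int in Python, so reject it explicitly for numbers.
    if isinstance(value, bool) and expected is not bool:
        raise TypeError(f"{name}: expected {entry['type']}, got bool")
    if not isinstance(value, expected):
        got = type(value).__name__
        raise TypeError(f"{name}: expected {entry['type']}, got {got}")
    low, high = entry.get("min"), entry.get("max")
    if low is not None and value < low:
        raise ValueError(f"{name}: {value} is below the minimum {low}")
    if high is not None and value > high:
        raise ValueError(f"{name}: {value} is above the maximum {high}")
    return value


def resolve(name, user_params, entry):
    """Use the user's value when given, otherwise the annotated default."""
    return validate(name, user_params.get(name, entry["default"]), entry)
```

Keeping default, type, and range in one record is what lets a single file feed defaults, validation, and help output at once.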
Step.execute() discovers a module dynamically: it looks up the category in the
registry, imports haddock.modules.<category>.<name>, and instantiates that
package’s HaddockModule. Adding a module is therefore a matter of dropping a
correctly-shaped folder in the right category — no central registration.
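The discovery mechanism can be sketched with the standard library alone. The example below builds a throwaway package on disk so that it runs without HADDOCK3 installed; the package name `toy_modules` and the helpers `build_registry` and `load_module_class` are hypothetical, and the real lookup may differ in detail.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_registry(root):
    """Map module name -> category by scanning <root>/<category>/<module>/."""
    registry = {}
    for category in Path(root).iterdir():
        # Skip files and private folders (e.g. a template category).
        if not category.is_dir() or category.name.startswith("_"):
            continue
        for module in category.iterdir():
            if module.is_dir() and (module / "__init__.py").is_file():
                registry[module.name] = category.name
    return registry


def load_module_class(package, registry, name):
    """Import <package>.<category>.<name> and return its HaddockModule class."""
    module = importlib.import_module(f"{package}.{registry[name]}.{name}")
    return module.HaddockModule


# Build a minimal package tree: toy_modules/sampling/rigidbody/__init__.py
tmp = tempfile.TemporaryDirectory()
root = Path(tmp.name) / "toy_modules"
module_dir = root / "sampling" / "rigidbody"
module_dir.mkdir(parents=True)
(root / "__init__.py").write_text("")
(root / "sampling" / "__init__.py").write_text("")
(module_dir / "__init__.py").write_text(
    "class HaddockModule:\n"
    "    def _run(self):\n"
    "        return 'ran rigidbody'\n"
)

sys.path.insert(0, tmp.name)
importlib.invalidate_caches()
registry = build_registry(root)
module_class = load_module_class("toy_modules", registry, "rigidbody")
result = module_class()._run()

# Clean up the temporary package once the class has been loaded.
sys.path.remove(tmp.name)
tmp.cleanup()
```

Because the registry is derived from the folder layout, adding a module folder is enough for it to be found; no list has to be edited by hand.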
5.5 clis/ — command-line interfaces
Each cli_*.py exposes a maincli() wired to a console script in
pyproject.toml. The full toolset:
| Command | Module | Purpose |
|---|---|---|
| | | Run a workflow (the main entry point) |
| | | Print a module’s parameters/defaults |
| | | Copy/prepare a run for |
| | | Compress/clean a run’s outputs |
| | | Preprocess input PDBs |
| | | Score a complex standalone |
| | | Generate analysis reports/plots |
| | | Trace each final model back through the steps |
| | | Interactive re-scoring/re-clustering of a finished step |
| | | Restraint generation utilities |
| | | Worker invoked under MPI execution |
| | | Daemon for batch/grid coordination |
| | | Unpack an archived/cleaned run |
5.6 Bundled native/data assets
| Path | Contents |
|---|---|
| | Per-platform CNS executables |
| | CNS force-field topology/parameter files ( |
| | |
| | Python FCC matrix calculation and clustering |
| | Per-platform PRODRG ligand-topology binaries |
6. Inter-module communication: the ontology
Modules never call each other directly. They communicate through files described by the ontology in libs/libontology.py:
- Persistent: base class for any framework-generated file (records name, type, path, optional md5, restraint file).
- PDBFile: a model. Beyond the file path it carries score, clt_id, clt_rank, clt_model_rank, topology, aa_topology, ligand_top_fname, ligand_param_fname, restr_fname, seed, unw_energies, shape, etc. Comparison operators sort by score.
- TopologyFile, RMSDFile: typed persistent files.
- ModuleIO: the input/output container. Holds input and output lists of ontology objects; save()/load() serialise to/from io.json using jsonpickle. retrieve_models() prepares the previous step’s models for the current one (pairwise, cross-dock, or individualised); check_faulty()/remove_missing() enforce output completeness tolerance.
The mechanics in BaseHaddockModule (modules/__init__.py):
- On construction, _load_previous_io() reads the previous step’s io.json into self.previous_io.
- During _run() the module produces self.output_models.
- export_io_models() builds a ModuleIO (input = previous output, output = new models), drops missing models, and writes this step’s io.json.
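The data model and its round trip through io.json can be sketched as below. This is a simplified stand-in, not the code in libs/libontology.py: `Model` and `StepIO` are hypothetical names, only a few of the PDBFile fields are mirrored, and plain `json` replaces jsonpickle to keep the example dependency-free.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class Model:
    """Simplified stand-in for PDBFile: a path plus metadata."""
    file_name: str
    score: float
    clt_id: Optional[int] = None
    seed: Optional[int] = None

    # Ordering is by score, so sorted() puts the lowest score first.
    def __lt__(self, other):
        return self.score < other.score


class StepIO:
    """Simplified stand-in for ModuleIO."""

    def __init__(self):
        self.input: List[Model] = []
        self.output: List[Model] = []

    def save(self, folder):
        payload = {"input": [asdict(m) for m in self.input],
                   "output": [asdict(m) for m in self.output]}
        path = Path(folder) / "io.json"
        path.write_text(json.dumps(payload, indent=2))
        return path

    @classmethod
    def load(cls, path):
        payload = json.loads(Path(path).read_text())
        step_io = cls()
        step_io.input = [Model(**m) for m in payload["input"]]
        step_io.output = [Model(**m) for m in payload["output"]]
        return step_io

    def remove_missing(self, folder):
        """Drop output models whose file does not exist on disk."""
        self.output = [m for m in self.output
                       if (Path(folder) / m.file_name).exists()]


with tempfile.TemporaryDirectory() as step_dir:
    (Path(step_dir) / "a.pdb").write_text("MODEL\n")  # b.pdb is deliberately absent
    step_io = StepIO()
    step_io.output = [Model("b.pdb", -9.0), Model("a.pdb", -4.2, clt_id=1, seed=917)]
    step_io.remove_missing(step_dir)
    reloaded = StepIO.load(step_io.save(step_dir))
```

The metadata travels with the model through the file, so a later step can rank or filter models without recomputing anything.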
7. Parameters & configuration
- Module defaults live in each module’s defaults.yaml as annotated entries (default, type, range, help text, group, expert level). gear/yaml2cfg.py flattens these into a usable config; the same metadata powers haddock3-cfg and external GUIs.
- Global parameters come from two places:
  - mandatory/optional run-level params in core/mandatory.yaml / core/optional.yaml (run_dir, molecules, preprocess, postprocess, gen_archive);
  - non-mandatory general params in modules/defaults.yaml (ncores, mode, cns_exec, clean, self_contained, …) which can be set globally and overridden per module.
- Precedence: module-local value > global value > module default. This is applied by recursive_dict_update in update_params (modules/__init__.py) and in Workflow.__init__.
- Expandable parameters (gear/expandable_parameters.py) handle repeated/per-molecule blocks (e.g. mol1_*, seg_*), expanded against the actual number of input molecules.
- Config format: a TOML-like .cfg parsed/written by gear/config.py; _fname-suffixed parameters are coerced to paths and existence-checked.
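The precedence rule (module-local value > global value > module default) amounts to layering dictionaries from weakest to strongest. The sketch below illustrates that idea; `recursive_update` and `resolve_params` are hypothetical re-implementations, the default values are made up, and the choice to ignore globals the module does not declare is an assumption of this example.

```python
import copy


def recursive_update(base, overrides):
    """Return a copy of base with overrides merged in, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = recursive_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_params(module_defaults, global_params, module_params):
    """Apply the layers from weakest to strongest."""
    # Assumption: only globals that the module declares are applied to it.
    applicable = {k: v for k, v in global_params.items() if k in module_defaults}
    with_globals = recursive_update(module_defaults, applicable)
    return recursive_update(with_globals, module_params)


module_defaults = {"ncores": 4, "mode": "local", "sampling": 1000}
global_params = {"ncores": 16, "run_dir": "run1"}
module_params = {"sampling": 200, "ncores": 8}

params = resolve_params(module_defaults, global_params, module_params)
```

Here `ncores` ends up as the module-local 8 rather than the global 16, `sampling` takes the module-local 200, and `mode` falls back to the module default.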
8. Execution engines & parallelism
- Across steps: none. Steps run strictly sequentially.
- Within a step: pluggable engine. get_engine(mode, params) in modules/__init__.py is a small factory selecting an engine by the mode parameter:

  | mode | Engine | File |
  |---|---|---|
  | local | Scheduler (multiprocessing) | |
  | batch | HPCScheduler (queue submit) | |
  | mpi | MPIScheduler | |
  | grid | GRIDScheduler (DIRAC; falls back to local if unreachable) | |

- Granularity. Sampling/refinement modules generate one CNS subprocess (CNSJob) per output model, and the engine fans these jobs across cores/nodes. This per-model job is the natural unit for any future caching scheme.
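The factory-plus-fan-out pattern can be sketched as follows. None of these classes are the HADDOCK3 schedulers: `LocalEngine`, `SerialEngine`, and `select_engine` are hypothetical, and a thread pool stands in for the multiprocessing used by the real local engine.

```python
from concurrent.futures import ThreadPoolExecutor


class LocalEngine:
    """Stand-in for a local scheduler: run jobs on a pool of workers."""

    def __init__(self, jobs, ncores=2):
        self.jobs = jobs
        self.ncores = ncores

    def run(self):
        # Threads keep the sketch simple; map() returns results in job order.
        with ThreadPoolExecutor(max_workers=self.ncores) as pool:
            return list(pool.map(lambda job: job(), self.jobs))


class SerialEngine:
    """Hypothetical second engine, present only to give the factory a choice."""

    def __init__(self, jobs, **_ignored):
        self.jobs = jobs

    def run(self):
        return [job() for job in self.jobs]


ENGINES = {"local": LocalEngine, "serial": SerialEngine}


def select_engine(mode, params):
    """Return a callable that builds the engine selected by mode."""
    if mode not in ENGINES:
        raise ValueError(f"unknown mode {mode!r}; choose from {sorted(ENGINES)}")
    engine_class = ENGINES[mode]
    return lambda jobs: engine_class(jobs, ncores=params.get("ncores", 1))


def make_job(model_index):
    # One job per output model, mirroring the per-model granularity.
    return lambda: f"model_{model_index}.pdb"


jobs = [make_job(i) for i in range(1, 5)]
engine = select_engine("local", {"ncores": 2})(jobs)
produced = engine.run()
```

The module body only sees "an engine with a `run()` method", so swapping the execution backend is a matter of changing `mode`.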
9. CNS coupling
The physics modules wrap CNS rather than reimplementing it. The coupling lives in modules/base_cns_module.py and libs/libcns.py:
- A CNS module loads its .cns template (recipe_str), fills it with the current parameters and per-model data via libcns, and writes a concrete .inp.
- Each job runs as a CNSJob (libs/libsubprocess.py): the CNS binary from cns/bin/, with env vars MODULE, MODDIR, TOPPAR (pointing at cns/toppar/).
- self_contained mode copies the CNS scripts, toppar, and executable into the run directory so it can be re-run elsewhere.
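The template-then-subprocess pattern described above can be sketched without CNS itself. In this example the Python interpreter plays the role of the CNS binary, the "recipe" is a tiny Python script rather than CNS syntax, and the input is fed on standard input; `build_input`, `run_job`, and the placeholder TOPPAR path are assumptions, not the libcns or CNSJob API.

```python
import os
import subprocess
import sys
from string import Template

# Hypothetical recipe: it reads an environment variable and a filled-in seed.
RECIPE = Template("import os\nprint(os.environ['MODULE'], $seed)\n")


def build_input(recipe, **values):
    """Fill the template with parameters and per-model data."""
    return recipe.substitute(**values)


def run_job(executable, input_text, env_vars):
    """Feed the input on stdin to a subprocess and return its stdout."""
    env = dict(os.environ, **env_vars)
    completed = subprocess.run(
        [executable, "-"],  # "-" tells the Python interpreter to read stdin
        input=input_text,
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    return completed.stdout.strip()


inp = build_input(RECIPE, seed=917)
out = run_job(
    sys.executable,
    inp,
    {"MODULE": "rigidbody", "TOPPAR": "/path/to/toppar"},  # placeholder path
)
```

Since the job is just "an executable, an input text, and an environment", this is also the seam where a different engine could be substituted for a given module.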
Because the workflow plumbing is independent of CNS, replacing CNS for a given
module is mostly a matter of replacing the subprocess invocation in that
module’s _run(). The modern OpenMM refinement
(modules/refinement/openmm/) is an
example of a non-CNS engine living alongside the CNS ones.
10. Run directory layout
A completed run (run_dir) looks like:
run_dir/
├── 0_topoaa/ # one zero-padded, numbered folder per step
│ ├── io.json # ModuleIO for this step (input + output models)
│ ├── params.cfg # the exact parameters this step ran with
│ └── *.pdb, *.psf, *.inp, *.out, …
├── 1_rigidbody/
├── 2_caprieval/
├── data/ # copies of user inputs (molecules, restraints)
├── analysis/ # post-processing reports/plots (ANA_FOLDER)
├── traceback/ # model lineage across steps (TRACEBACK_FOLDER)
└── log # run log
The step-folder naming (<index>_<modulename>) is the system of record for step
identity and ordering. get_module_steps_folders() and is_step_folder() in
modules/__init__.py parse it; the
step_folder_regex there is the canonical matcher. Restart and extend operate on
these folders by index.
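A minimal sketch of parsing and generating step-folder names is shown below. The pattern and the three helpers are illustrative re-implementations under the `<index>_<modulename>` convention; they are not the actual step_folder_regex or helper functions, and the padding rule is an assumption.

```python
import re

# Hypothetical matcher; the real step_folder_regex is not reproduced here.
STEP_FOLDER = re.compile(r"^(?P<index>\d+)_(?P<module>[a-z]\w*)$")


def is_step_folder(name):
    """True when name looks like <index>_<modulename>."""
    return STEP_FOLDER.match(name) is not None


def ordered_steps(names):
    """Return (index, module) pairs sorted by numeric index."""
    matches = (STEP_FOLDER.match(n) for n in names)
    return sorted((int(m["index"]), m["module"]) for m in matches if m)


def step_folder_names(modules):
    """Name folders with indices zero-padded to a common width."""
    width = len(str(len(modules) - 1)) if modules else 1
    return [f"{i:0{width}d}_{name}" for i, name in enumerate(modules)]


listing = ["data", "2_caprieval", "0_topoaa", "log", "1_rigidbody", "analysis"]
steps = ordered_steps(listing)
padded = step_folder_names(["topoaa"] + ["rigidbody"] * 10)
```

Sorting on the integer index rather than on the folder name keeps the order correct even if the padding width differs between runs.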
11. Testing
| Suite | Location | CNS needed | Scope |
|---|---|---|---|
| Unit | | No | Functions/classes in isolation; fixtures in |
| Integration | | Yes | Individual modules end-to-end against CNS |
| End-to-end | | Yes | Complete workflows |
Run with pytest tests/, pytest integration_tests/, pytest end-to-end_tests/. See docs/pages/DEVELOPMENT.md and
CONTRIBUTING.md.
12. Where to start when…
- Adding a module → copy modules/_template_cat/; implement HaddockModule._run() and author defaults.yaml. No central registration is needed.
- Changing how steps are scheduled/chained → libs/libworkflow.py.
- Changing what flows between modules → libs/libontology.py.
- Changing run setup/validation → gear/prepare_run.py.
- Adding/adjusting parallel execution → get_engine() in modules/__init__.py and the lib*scheduler files in libs/.
- A new CLI tool → add clis/cli_<name>.py with a maincli() and register it under [project.scripts] in pyproject.toml.
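For the last item, a new CLI file could follow a shape like the one below. This is a hedged skeleton built on `argparse`: the tool name `haddock3-example`, its arguments, and the split into `load_args`, `main`, and `maincli` are assumptions for illustration; only the `maincli()` entry-point name comes from the text above.

```python
import argparse

# Hypothetical tool: the command name and its options are made up.
ap = argparse.ArgumentParser(
    prog="haddock3-example",
    description="Print the run directory that was given.",
)
ap.add_argument("run_dir", help="path to the run directory")
ap.add_argument("-v", "--verbose", action="store_true", help="print more detail")


def load_args(parser, argv=None):
    """Parse argv (argparse falls back to sys.argv[1:] when argv is None)."""
    return parser.parse_args(argv)


def main(run_dir, verbose=False):
    """Do the actual work; kept separate from parsing so it can be imported."""
    message = f"run directory: {run_dir}"
    return message + " (verbose)" if verbose else message


def maincli(argv=None):
    """Console-script entry point."""
    args = load_args(ap, argv)
    print(main(**vars(args)))
```

Under these assumptions, the matching `[project.scripts]` line would take the standard form `haddock3-example = "haddock.clis.cli_example:maincli"`, where both the command and the module path are hypothetical.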