
db

Module for managing the DuckDB database that stores session metadata.

ddl = read_text('protein_detective', 'ddl.sql') module-attribute

The DDL statements to create the database schema to hold session metadata.

File paths in the database are stored relative to the session directory, so you can move the session directory around without breaking them.

Immediately after connecting to the database, you need to set session_dir as a DuckDB variable with con.execute("SET VARIABLE session_dir = ?", (str(session_dir),)). The connect() function does this for you.

For example, with the current working directory ~/, session_dir "session1", and the file "~/session1/foo.pdb", the path is stored in the database as "foo.pdb" but returned to the consumer as "session1/foo.pdb".

Some tables are prefixed with raw_ and store file paths relative to the session directory. The corresponding views (the table name without raw_) prepend the session directory, taken from the DuckDB session_dir variable, to those paths so that they point to the correct files.
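The relative-path convention described above can be sketched with pathlib alone. This is a minimal illustration, not the package's code: the helper names to_stored and to_consumer are hypothetical, and in the real module the prepending is done by the database views rather than in Python.

```python
from pathlib import Path


def to_stored(path: Path, session_dir: Path) -> str:
    """Hypothetical helper: the form kept in a raw_ table (relative to the session)."""
    return path.relative_to(session_dir).as_posix()


def to_consumer(stored: str, session_dir: Path) -> Path:
    """Hypothetical helper: the form a view would return (session directory prepended)."""
    return session_dir / stored


session_dir = Path("session1")
stored = to_stored(Path("session1/foo.pdb"), session_dir)
returned = to_consumer(stored, session_dir)

# After moving the session directory, the stored value is unchanged;
# only the prefix supplied at connection time differs.
moved = to_consumer(stored, Path("archive/session1"))

print(stored)
print(returned.as_posix())
print(moved.as_posix())
```

Because only the relative part is persisted, nothing in the database has to be rewritten when the session directory is relocated.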

connect(session_dir, read_only=False)

Context manager to connect to the DuckDB database holding session metadata.

Examples:

To query in read-only mode:

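from pathlib import Path

from protein_detective.db import connect  # assumes this module is importable as protein_detective.db

session_dir = Path("path/to/session")
with connect(session_dir, read_only=True) as con:
    result = con.execute("SELECT * FROM proteins").fetchall()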

Parameters:

Name Type Description Default
session_dir Path

The directory where the session data is stored.

required
read_only bool

If True, the connection is read-only. A read-only database can be read by multiple processes at once. Otherwise, the database can be read and written to by a single process only.

False

Yields:

Name Type Description
DuckDBPyConnection DuckDBPyConnection

The connection to the DuckDB database.
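The general shape of such a context manager (open, yield, always close, with an optional read-only mode) can be sketched with the standard library. This is not the package's implementation. The standard library's sqlite3 stands in for DuckDB so the sketch runs without extra dependencies, and the name open_session_db and the file name session.db are hypothetical.

```python
import sqlite3
import tempfile
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def open_session_db(session_dir: Path, read_only: bool = False) -> Iterator[sqlite3.Connection]:
    """Hypothetical stand-in for connect(); sqlite3 replaces DuckDB here."""
    db_file = session_dir / "session.db"  # assumed file name, not taken from the package
    mode = "ro" if read_only else "rwc"  # rwc: read, write, create if missing
    con = sqlite3.connect(f"file:{db_file.as_posix()}?mode={mode}", uri=True)
    try:
        yield con
    finally:
        con.close()  # release the file even if the body raised


with tempfile.TemporaryDirectory() as tmp:
    session_dir = Path(tmp)

    # Writable connection: create a table and add one row.
    with open_session_db(session_dir) as con:
        con.execute("CREATE TABLE proteins (uniprot_acc TEXT)")
        con.execute("INSERT INTO proteins VALUES ('P12345')")
        con.commit()

    # Read-only connection: queries work, writes are rejected.
    with open_session_db(session_dir, read_only=True) as con:
        rows = con.execute("SELECT * FROM proteins").fetchall()
        try:
            con.execute("INSERT INTO proteins VALUES ('Q99999')")
            write_allowed = True
        except sqlite3.OperationalError:
            write_allowed = False

print(rows, write_allowed)
```

Wrapping the yield in try/finally is what guarantees the connection is closed when the with block exits, whether normally or through an exception.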

db_path(session_dir)

Return the path to the DuckDB database file in the given session directory.

Parameters:

Name Type Description Default
session_dir Path

The directory where the session data is stored.

required

Returns:

Type Description
Path

Path to the DuckDB database file.

initialize_db(session_dir, con)

Initialize the DuckDB database by creating the necessary tables and variables.

Parameters:

Name Type Description Default
session_dir Path

The directory where the session data is stored.

required
con DuckDBPyConnection

The DuckDB connection to use for executing the DDL statements.

required

list_lcc_files(con)

List local cross-correlation (LCC) map files (lcc.mrc).

Parameters:

Name Type Description Default
con DuckDBPyConnection

The DuckDB connection to use for fetching the data.

required

Returns:

Type Description
list[tuple[int, str, str]]

A list of tuples containing the PowerFit run ID, structure, and path to the lcc.mrc file.

load_alphafold_ids(con)

Load AlphaFold IDs from the database.

Parameters:

Name Type Description Default
con DuckDBPyConnection

The DuckDB connection to use for fetching the data.

required

Returns:

Type Description
set[str]

A set of AlphaFold IDs (UniProt accessions).

load_alphafolds(con)

Load AlphaFold entries from the database.

Parameters:

Name Type Description Default
con DuckDBPyConnection

The DuckDB connection to use for fetching the data.

required

Returns:

Type Description
list[AlphaFoldEntry]

A list of AlphaFold entries.

load_density_filtered_alphafolds_files(con)

Load density filtered AlphaFold PDB files from the database.

Parameters:

Name Type Description Default
con DuckDBPyConnection

The DuckDB connection to use for fetching the data.

required

Returns:

Type Description
list[Path]

A list of paths to the density filtered AlphaFold PDB files.

load_fitted_models(con)

Load fitted model PDB files from the database.

Parameters:

Name Type Description Default
con DuckDBPyConnection

The DuckDB connection to use for fetching the data.

required

Returns:

Type Description
DataFrame

A DataFrame containing the fitted model PDB files with columns:

  • unfitted_model_file: The path to the original model PDB file.
  • fitted_model_file: The path to the fitted model PDB file.

and all columns returned by powerfit_solutions, with the pdb_file column renamed to unfitted_model_file.

load_pdb_ids(con)

Load PDB IDs from the database.

Parameters:

Name Type Description Default
con DuckDBPyConnection

The DuckDB connection to use for fetching the data.

required

Returns:

Type Description
set[str]

A set of PDB IDs.

load_pdbs(con)

Load PDB entries from the database.

Parameters:

Name Type Description Default
con DuckDBPyConnection

The DuckDB connection to use for fetching the data.

required

Returns:

Type Description
list[ProteinPdbRow]

A list of protein PDB rows.

load_powerfit_run(powerfit_run_id, con)

Load a specific PowerFit run by its ID.

Parameters:

Name Type Description Default
powerfit_run_id int

The ID of the PowerFit run to load.

required
con DuckDBPyConnection

The DuckDB connection to use for fetching the data.

required

Returns:

Type Description
tuple[PowerfitOptions, Path]

A tuple containing the PowerFit options and the path to the density map file.

Raises:

Type Description
ValueError

If the PowerFit run with the specified ID does not exist.

load_powerfit_runs(con)

Load all PowerFit runs from the database.

Parameters:

Name Type Description Default
con DuckDBPyConnection

The DuckDB connection to use for fetching the data.

required

Returns:

Type Description
list[tuple[int, PowerfitOptions, Path]]

A list of tuples containing the PowerFit run ID, options, and density map path.

load_single_chain_pdb_files(con)

Load single chain PDB files from the database.

Parameters:

Name Type Description Default
con DuckDBPyConnection

The DuckDB connection to use for fetching the data.

required

Returns:

Type Description
list[Path]

A list of paths to the single chain PDB files.

powerfit_solutions(con, powerfit_run_id=None)

Retrieve PowerFit solutions from the solutions.out files.

Parameters:

Name Type Description Default
con DuckDBPyConnection

The DuckDB connection to use for fetching the data.

required
powerfit_run_id int | None

Optional ID of a specific PowerFit run to filter results. If None, all runs are included.

None

Returns:

Type Description
DataFrame

A DataFrame containing the PowerFit solutions with columns:

  • powerfit_run_id: The ID of the PowerFit run.
  • structure: The structure identifier.
  • rank: The rank of the solution.
  • cc: The correlation coefficient of the solution.
  • fishz: The Fisher z-score of the solution.
  • relz: The relative Z-score of the solution.
  • translation: The translation vector applied to the structure.
  • rotation: The rotation matrix applied to the structure.
  • density_filter_id: The ID of the density filter applied to the structure, if the structure came from AlphaFold.
  • af_id: The AlphaFold ID associated with the structure, if the structure came from AlphaFold.
  • pdb_id: The PDB ID of the structure, if the structure came from PDBe.
  • pdb_file: The path to the PDB file of the structure used as the input structure for the PowerFit run.
  • uniprot_acc: The UniProt accession number associated with the structure.
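As an illustration of working with these columns, the sketch below keeps the solution with the highest cc for each (powerfit_run_id, structure) pair. It makes several assumptions: the rows are modelled as plain dictionaries rather than a DataFrame so that no particular DataFrame library is implied, only a subset of the columns is shown, the values are invented for illustration, and the helper name best_per_structure is hypothetical.

```python
from typing import Any

# Illustrative values only; these are not results from a real PowerFit run.
solutions: list[dict[str, Any]] = [
    {"powerfit_run_id": 1, "structure": "model_a", "rank": 1, "cc": 0.42},
    {"powerfit_run_id": 1, "structure": "model_a", "rank": 2, "cc": 0.37},
    {"powerfit_run_id": 1, "structure": "model_b", "rank": 1, "cc": 0.29},
    {"powerfit_run_id": 1, "structure": "model_b", "rank": 2, "cc": 0.25},
]


def best_per_structure(rows: list[dict[str, Any]]) -> dict[tuple[int, str], dict[str, Any]]:
    """Hypothetical helper: keep the highest-cc row per (run, structure) pair."""
    best: dict[tuple[int, str], dict[str, Any]] = {}
    for row in rows:
        key = (row["powerfit_run_id"], row["structure"])
        if key not in best or row["cc"] > best[key]["cc"]:
            best[key] = row
    return best


best = best_per_structure(solutions)
for (run_id, structure), row in sorted(best.items()):
    print(run_id, structure, row["rank"], row["cc"])
```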

save_alphafolds(afs, con)

Save AlphaFold entries to the database.

Parameters:

Name Type Description Default
afs dict[str, set[str]]

A dictionary mapping UniProt accessions to sets of AlphaFold IDs.

required
con DuckDBPyConnection

The DuckDB connection to use for saving the data.

required

save_alphafolds_files(afs, con)

Save AlphaFold files to the database.

Parameters:

Name Type Description Default
afs list[AlphaFoldEntry]

A list of AlphaFold entries.

required
con DuckDBPyConnection

The DuckDB connection to use for saving the data.

required

save_density_filtered(query, files, uniprot_accessions, con)

Save density filtered AlphaFold results to the database.

Parameters:

Name Type Description Default
query DensityFilterQuery

The density filter query parameters.

required
files list[DensityFilterResult]

A list of the results.

required
uniprot_accessions list[str]

A list of UniProt accessions corresponding to the results.

required
con DuckDBPyConnection

The DuckDB connection to use for saving the data.

required

Raises:

Type Description
ValueError

If the density filter could not be inserted or retrieved.

save_fitted_models(df, con)

Save fitted model PDB files to the database.

Parameters:

Name Type Description Default
df DataFrame

A DataFrame containing the fitted model data with columns:

  • powerfit_run_id: The ID of the PowerFit run.
  • structure: The structure identifier.
  • rank: The rank of the solution.
  • unfitted_model_file: The path to the original model PDB file.
  • fitted_model_file: The path to the fitted model PDB file.

required
con DuckDBPyConnection

The DuckDB connection to use for saving the data.

required

save_pdb_files(mmcif_files, con)

Save PDB files to the database.

Parameters:

Name Type Description Default
mmcif_files Mapping[str, Path]

A mapping of PDB IDs to their file paths.

required
con DuckDBPyConnection

The DuckDB connection to use for saving the data.

required

save_pdbs(uniprot2pdbs, con)

Save PDB entries and their associations with UniProt accessions to the database.

Parameters:

Name Type Description Default
uniprot2pdbs Mapping[str, Iterable[PdbResult]]

A mapping of UniProt accessions to their associated PDB results.

required
con DuckDBPyConnection

The DuckDB connection to use for saving the data.

required

save_powerfit_options(options, con)

Save the PowerFit options of a PowerFit run to the database.

Parameters:

Name Type Description Default
options PowerfitOptions

The PowerFit options to save.

required
con DuckDBPyConnection

The DuckDB connection to use for saving the data.

required

Returns:

Type Description
int

The ID of the PowerFit run created or reused.

Raises:

Type Description
ValueError

If the options could not be saved or retrieved.
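The "created or reused" behaviour, together with the ValueError, suggests a get-or-create pattern: insert the options if they are new, then look up their ID. The sketch below shows one way to do that and is not the package's implementation. It rests on assumptions: sqlite3 stands in for DuckDB, the runs table and its columns are hypothetical, the option keys are invented, and the options are modelled as a plain dictionary instead of PowerfitOptions.

```python
import json
import sqlite3


def save_options(options: dict, con: sqlite3.Connection) -> int:
    """Hypothetical get-or-create: return the run ID, inserting only if needed."""
    # Serialize with sorted keys so equal options always compare equal.
    payload = json.dumps(options, sort_keys=True)
    # The UNIQUE constraint makes this a no-op when the options already exist.
    con.execute("INSERT OR IGNORE INTO runs (options) VALUES (?)", (payload,))
    row = con.execute("SELECT run_id FROM runs WHERE options = ?", (payload,)).fetchone()
    if row is None:
        raise ValueError("Options could not be saved or retrieved")
    return row[0]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE runs (run_id INTEGER PRIMARY KEY, options TEXT UNIQUE)")

# Invented option keys, for illustration only.
first = save_options({"angle": 10.0, "target": "map.mrc"}, con)
again = save_options({"target": "map.mrc", "angle": 10.0}, con)  # same options, reused ID
other = save_options({"angle": 5.0, "target": "map.mrc"}, con)   # new options, new ID
con.close()

print(first == again, first != other)
```

Reusing the ID for identical options means repeated calls with the same settings map to the same run instead of creating duplicates.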

save_query(query, con)

Save a UniProt search query to the database.

Parameters:

Name Type Description Default
query Query

The UniProt search query to save.

required
con DuckDBPyConnection

The DuckDB connection to use for saving the data.

required

save_single_chain_pdb_files(files, con)

Save single chain PDB files to the database.

Parameters:

Name Type Description Default
files list[SingleChainResult]

A list of objects containing the PDB file paths and metadata.

required
con DuckDBPyConnection

The DuckDB connection to use for saving the data.

required

save_uniprot_accessions(uniprot_accessions, con)

Save UniProt accessions to the database.

Parameters:

Name Type Description Default
uniprot_accessions Iterable[str]

An iterable of UniProt accessions to save.

required
con DuckDBPyConnection

The DuckDB connection to use for saving the data.

required