db
Module for managing the DuckDB database used for storing session metadata.
ddl = read_text('protein_detective', 'ddl.sql')
module-attribute
The DDL statements to create the database schema to hold session metadata.
Paths to files in the database are stored relative to the session directory, so you can move the session directory around without breaking the paths.
Immediately after connecting to the database, you need to set session_dir as a DuckDB variable with con.execute("SET VARIABLE session_dir = ?", (str(session_dir),)).
This is done for you if you use the connect() function.
For example, with the current working directory ~/, session_dir "session1", and the file "~/session1/foo.pdb", the path is stored in the database as "foo.pdb" but returned to the consumer as "session1/foo.pdb".
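The path handling in this example can be sketched with pathlib alone. This is only an illustration of the idea: the helper names to_stored and to_consumer are hypothetical and are not part of this module, which does the prepending inside the database.

```python
from pathlib import Path


def to_stored(session_dir: Path, file: Path) -> str:
    """Hypothetical helper: path as it would be stored, relative to the session directory."""
    return file.relative_to(session_dir).as_posix()


def to_consumer(session_dir: Path, stored: str) -> Path:
    """Hypothetical helper: path as handed back to the caller, session directory prepended."""
    return session_dir / stored


session_dir = Path("session1")
stored = to_stored(session_dir, Path("session1/foo.pdb"))

# The same stored value resolves correctly before and after the directory is moved.
returned = to_consumer(session_dir, stored)
moved = to_consumer(Path("archive/session1"), stored)
```

Because only the relative part is persisted, moving the session directory changes the prefix and leaves the stored value untouched.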
Some tables are prefixed with raw_ to store file paths relative to the session directory. The views (table name without the raw_ prefix) then prepend the session directory (using the DuckDB session_dir variable) to the paths, so the paths point to the correct files.
connect(session_dir, read_only=False)
Context manager to connect to the DuckDB database holding session metadata.
Examples:
To query in read-only mode:
from pathlib import Path
session_dir = Path("path/to/session")
with connect(session_dir, read_only=True) as con:
    result = con.execute("SELECT * FROM proteins").fetchall()
Parameters:
Name | Type | Description | Default |
---|---|---|---|
session_dir | Path | The directory where the session data is stored. | required |
read_only | bool | If True, the connection is read-only. A read-only database can be read by multiple processes. Otherwise the database can be read and written by a single process only. | False |
Yields:
Name | Type | Description |
---|---|---|
DuckDBPyConnection | DuckDBPyConnection | The connection to the DuckDB database. |
db_path(session_dir)
initialize_db(session_dir, con)
Initialize the DuckDB database by creating the necessary tables and variables.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
session_dir | Path | The directory where the session data is stored. | required |
con | DuckDBPyConnection | The DuckDB connection to use for executing the DDL statements. | required |
list_lcc_files(con)
List local cross-correlation (LCC) files (lcc.mrc).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
con | DuckDBPyConnection | The DuckDB connection to use for fetching the data. | required |
Returns:
Type | Description |
---|---|
list[tuple[int, str, str]] | A list of tuples containing the PowerFit run ID, structure, and path to the lcc.mrc file. |
load_alphafold_ids(con)
Load AlphaFold IDs from the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
con | DuckDBPyConnection | The DuckDB connection to use for fetching the data. | required |
Returns:
Type | Description |
---|---|
set[str] | A set of AlphaFold IDs (UniProt accessions). |
load_alphafolds(con)
Load AlphaFold entries from the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
con | DuckDBPyConnection | The DuckDB connection to use for fetching the data. | required |
Returns:
Type | Description |
---|---|
list[AlphaFoldEntry] | A list of AlphaFold entries. |
load_density_filtered_alphafolds_files(con)
Load density filtered AlphaFold PDB files from the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
con | DuckDBPyConnection | The DuckDB connection to use for fetching the data. | required |
Returns:
Type | Description |
---|---|
list[Path] | A list of paths to the density filtered AlphaFold PDB files. |
load_fitted_models(con)
Load fitted model PDB files from the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
con | DuckDBPyConnection | The DuckDB connection to use for fetching the data. | required |
Returns:
Type | Description |
---|---|
DataFrame | A DataFrame containing the fitted model PDB file with columns: and all columns returned by powerfit_solutions with |
load_pdb_ids(con)
Load PDB IDs from the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
con | DuckDBPyConnection | The DuckDB connection to use for fetching the data. | required |
Returns:
Type | Description |
---|---|
set[str] | A set of PDB IDs. |
load_pdbs(con)
Load PDB entries from the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
con | DuckDBPyConnection | The DuckDB connection to use for fetching the data. | required |
Returns:
Type | Description |
---|---|
list[ProteinPdbRow] | A list of protein PDB rows. |
load_powerfit_run(powerfit_run_id, con)
Load a specific PowerFit run by its ID.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
powerfit_run_id | int | The ID of the PowerFit run to load. | required |
con | DuckDBPyConnection | The DuckDB connection to use for fetching the data. | required |
Returns:
Type | Description |
---|---|
tuple[PowerfitOptions, Path] | A tuple containing the PowerFit options and the path to the density map file. |
Raises:
Type | Description |
---|---|
ValueError | If the PowerFit run with the specified ID does not exist. |
load_powerfit_runs(con)
Load all PowerFit runs from the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
con | DuckDBPyConnection | The DuckDB connection to use for fetching the data. | required |
Returns:
Type | Description |
---|---|
list[tuple[int, PowerfitOptions, Path]] | A list of tuples containing the PowerFit run ID, options, and density map path. |
load_single_chain_pdb_files(con)
Load single chain PDB files from the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
con | DuckDBPyConnection | The DuckDB connection to use for fetching the data. | required |
Returns:
Type | Description |
---|---|
list[Path] | A list of paths to the single chain PDB files. |
powerfit_solutions(con, powerfit_run_id=None)
Retrieve PowerFit solutions from the solutions.out files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
con | DuckDBPyConnection | The DuckDB connection to use for fetching the data. | required |
powerfit_run_id | int \| None | Optional ID of a specific PowerFit run to filter results. If None, all runs are included. | None |
Returns:
Type | Description |
---|---|
DataFrame | A DataFrame containing the PowerFit solutions with columns: |
save_alphafolds(afs, con)
Save AlphaFold entries to the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
afs | dict[str, set[str]] | A dictionary mapping UniProt accessions to sets of AlphaFold IDs. | required |
con | DuckDBPyConnection | The DuckDB connection to use for saving the data. | required |
save_alphafolds_files(afs, con)
Save AlphaFold files to the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
afs | list[AlphaFoldEntry] | A list of AlphaFold entries. | required |
con | DuckDBPyConnection | The DuckDB connection to use for saving the data. | required |
save_density_filtered(query, files, uniprot_accessions, con)
Save density filtered AlphaFold results to the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | DensityFilterQuery | The density filter query parameters. | required |
files | list[DensityFilterResult] | A list of density filter results. | required |
uniprot_accessions | list[str] | A list of UniProt accessions corresponding to the results. | required |
con | DuckDBPyConnection | The DuckDB connection to use for saving the data. | required |
Raises:
Type | Description |
---|---|
ValueError | If the density filter could not be inserted or retrieved. |
save_fitted_models(df, con)
Save fitted model PDB files to the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | A DataFrame containing the fitted model data with columns: powerfit_run_id (the ID of the PowerFit run), structure (the structure identifier), rank (the rank of the solution), unfitted_model_file (the path to the original model PDB file), fitted_model_file (the path to the fitted model PDB file). | required |
con | DuckDBPyConnection | The DuckDB connection to use for saving the data. | required |
save_pdb_files(mmcif_files, con)
Save PDB files to the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mmcif_files | Mapping[str, Path] | A mapping of PDB IDs to their file paths. | required |
con | DuckDBPyConnection | The DuckDB connection to use for saving the data. | required |
save_pdbs(uniprot2pdbs, con)
Save PDB entries and their associations with UniProt accessions to the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
uniprot2pdbs | Mapping[str, Iterable[PdbResult]] | A mapping of UniProt accessions to their associated PDB entries. | required |
con | DuckDBPyConnection | The DuckDB connection to use for saving the data. | required |
save_powerfit_options(options, con)
Save the options of a PowerFit run to the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
options | PowerfitOptions | The PowerFit options to save. | required |
con | DuckDBPyConnection | The DuckDB connection to use for saving the data. | required |
Returns:
Type | Description |
---|---|
int | The ID of the PowerFit run created or reused. |
Raises:
Type | Description |
---|---|
ValueError | If the options could not be saved or retrieved. |
save_query(query, con)
Save a UniProt search query to the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | Query | The UniProt search query to save. | required |
con | DuckDBPyConnection | The DuckDB connection to use for saving the data. | required |
save_single_chain_pdb_files(files, con)
Save single chain PDB files to the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
files | list[SingleChainResult] | A list of objects containing the PDB file paths and metadata. | required |
con | DuckDBPyConnection | The DuckDB connection to use for saving the data. | required |
save_uniprot_accessions(uniprot_accessions, con)
Save UniProt accessions to the database.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
uniprot_accessions | Iterable[str] | An iterable of UniProt accessions to save. | required |
con | DuckDBPyConnection | The DuckDB connection to use for saving the data. | required |