
utils

Module for functions that are used in multiple places.

CopyMethod = Literal['copy', 'symlink', 'hardlink'] module-attribute

Methods for copying files.

copy_methods = set(get_args(CopyMethod)) module-attribute

Set of valid copy methods.
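
The relationship between the two attributes can be reproduced with the standard library alone. The sketch below mirrors the definitions shown above. `is_valid_copy_method` is a hypothetical helper added only for illustration and is not part of the module.

```python
from typing import Literal, get_args

# Same shape as the module attributes documented above.
CopyMethod = Literal["copy", "symlink", "hardlink"]
copy_methods = set(get_args(CopyMethod))


def is_valid_copy_method(value: str) -> bool:
    """Hypothetical helper: check a user-supplied string against the allowed methods."""
    return value in copy_methods
```

Deriving the set from the `Literal` keeps the type hint and the runtime check in sync, so a new method only has to be added in one place.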

Cacher

Bases: Protocol

Protocol for a cacher.

__contains__(item)

Check if a file is in the cache.

Parameters:

Name Type Description Default
item str | Path

The filename or Path to check.

required

Returns:

Type Description
bool

True if the file is in the cache, False otherwise.

copy_from_cache(target) async

Copy a file from the cache to a target location if it exists in the cache.

Assumes:

  • target does not exist.
  • the parent directory of target exists.

Parameters:

Name Type Description Default
target Path

The path to copy the file to.

required

Returns:

Type Description
Path | None

The path to the cached file if it was copied, None otherwise.

write_bytes(target, content) async

Write bytes to a file and cache it.

Parameters:

Name Type Description Default
target Path

The path to write the content to.

required
content bytes

The bytes to write to the file.

required

Returns:

Type Description
Path

The path to the cached file.

Raises:

Type Description
FileExistsError

If the target file already exists.

write_iter(target, content) async

Write content to a file and cache it.

Parameters:

Name Type Description Default
target Path

The path to write the content to.

required
content AsyncStreamIterator[bytes]

An async iterator that yields bytes to write to the file.

required

Returns:

Type Description
Path

The path to the cached file.

Raises:

Type Description
FileExistsError

If the target file already exists.

DirectoryCacher dataclass

Bases: Cacher

Class to cache files in a directory.

Caching is based on the file name only: two paths with the same file name are treated as the same file, regardless of their directories.

Attributes:

Name Type Description
cache_dir Path | None

The directory to use for caching.

copy_method CopyMethod

The method to use for copying files.

__post_init__()

Normalize and validate dataclass fields after initialization.

populate_cache(source_dir)

Populate the cache from an existing directory.

This will copy all files from the source directory to the cache directory. If a file with the same name already exists in the cache, it will be skipped.

Parameters:

Name Type Description Default
source_dir Path

The directory to populate the cache from.

required

Returns:

Type Description
dict[Path, Path]

A dictionary mapping source file paths to their cached paths.

Raises:

Type Description
NotADirectoryError

If the source_dir is not a directory.
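
The documented behaviour (match by file name, skip names already cached, return a mapping) could look roughly like the sketch below. It is not the library's implementation. `populate_cache_sketch` is a hypothetical name, it always copies rather than honouring `copy_method`, and it assumes that only top-level regular files are considered and that skipped files are left out of the returned mapping.

```python
import shutil
import tempfile
from pathlib import Path


def populate_cache_sketch(source_dir: Path, cache_dir: Path) -> dict[Path, Path]:
    """Copy files from source_dir into cache_dir, keyed by file name only."""
    if not source_dir.is_dir():
        raise NotADirectoryError(source_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached: dict[Path, Path] = {}
    for source in sorted(source_dir.iterdir()):
        if not source.is_file():
            continue  # assumption: only top-level regular files are considered
        destination = cache_dir / source.name
        if destination.exists():
            continue  # a file with this name is already cached, so skip it
        shutil.copyfile(source, destination)
        cached[source] = destination
    return cached


# Usage: the second run finds the name already cached and copies nothing.
with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp) / "downloads"
    cache_dir = Path(tmp) / "cache"
    source_dir.mkdir()
    (source_dir / "a.cif").write_text("first")
    first_run = populate_cache_sketch(source_dir, cache_dir)
    second_run = populate_cache_sketch(source_dir, cache_dir)
    cached_names = sorted(path.name for path in first_run.values())
```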

InvalidContentEncodingError

Bases: ClientResponseError

Content encoding is invalid.

NestedAsyncIOLoopError

Bases: RuntimeError

Custom error for nested async I/O loops.

PassthroughCacher dataclass

Bases: Cacher

A cacher that caches nothing.

On writes it just writes to the target path.
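
To make the idea concrete, here is a minimal sketch of a class with the same four methods as the Cacher protocol that never caches anything. `PassthroughSketch` is a hypothetical stand-in, not the library class. It uses blocking file I/O inside the coroutines for brevity, and it assumes that an existing target should raise `FileExistsError`, as the protocol documents.

```python
import asyncio
import tempfile
from collections.abc import AsyncIterator
from pathlib import Path
from typing import Optional, Union


class PassthroughSketch:
    """Hypothetical stand-in that never stores anything in a cache."""

    def __contains__(self, item: Union[str, Path]) -> bool:
        return False  # nothing is ever cached

    async def copy_from_cache(self, target: Path) -> Optional[Path]:
        return None  # a cache hit is impossible

    async def write_bytes(self, target: Path, content: bytes) -> Path:
        # Mode "xb" creates the file exclusively; it raises FileExistsError otherwise.
        with target.open("xb") as handle:
            handle.write(content)
        return target

    async def write_iter(self, target: Path, content: AsyncIterator[bytes]) -> Path:
        with target.open("xb") as handle:
            async for chunk in content:
                handle.write(chunk)
        return target


async def _chunks() -> AsyncIterator[bytes]:
    yield b"ab"
    yield b"cd"


async def _demo(directory: Path) -> tuple[bytes, bytes]:
    cacher = PassthroughSketch()
    first = await cacher.write_bytes(directory / "one.bin", b"data")
    second = await cacher.write_iter(directory / "two.bin", _chunks())
    return first.read_bytes(), second.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    written = asyncio.run(_demo(Path(tmp)))
```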

async_copyfile(source, target, copy_method='copy') async

Asynchronously make the target path the same file as the source by copying, symlinking, or hardlinking.

Note that the hardlink copy method only works within the same filesystem and is harder to track. If you want to track cached files easily then use 'symlink'. On Windows you need developer mode or admin privileges to create symlinks.

Parameters:

Name Type Description Default
source Path

The source file to copy.

required
target Path

The target file to create.

required
copy_method CopyMethod

The method to use for copying.

'copy'

Raises:

Type Description
FileNotFoundError

If the source file or parent of target does not exist.

FileExistsError

If the target file already exists.

ValueError

If an unknown copy method is provided.

copyfile(source, target, copy_method='copy')

Make the target path the same file as the source by copying, symlinking, or hardlinking.

Note that the hardlink copy method only works within the same filesystem and is harder to track. If you want to track cached files easily then use 'symlink'. On Windows you need developer mode or admin privileges to create symlinks.

Parameters:

Name Type Description Default
source Path

The source file to copy or link.

required
target Path

The target file to create.

required
copy_method CopyMethod

The method to use for copying.

'copy'

Raises:

Type Description
FileNotFoundError

If the source file or parent of target does not exist.

FileExistsError

If the target file already exists.

ValueError

If an unknown copy method is provided.
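
The three copy methods map naturally onto standard library calls. The sketch below is one way to get the documented errors using only `shutil`, `os`, and `pathlib`. `copyfile_sketch` is a hypothetical name, and the order of the checks is an assumption rather than the library's actual logic.

```python
import os
import shutil
import tempfile
from pathlib import Path


def copyfile_sketch(source: Path, target: Path, copy_method: str = "copy") -> None:
    """Make target the same file as source by copying or linking."""
    if copy_method not in {"copy", "symlink", "hardlink"}:
        raise ValueError(f"Unknown copy method: {copy_method}")
    if not source.exists():
        raise FileNotFoundError(source)
    if target.exists():
        raise FileExistsError(target)
    # Each call below raises FileNotFoundError if the parent of target is missing.
    if copy_method == "copy":
        shutil.copyfile(source, target)
    elif copy_method == "symlink":
        target.symlink_to(source.resolve())
    else:
        os.link(source, target)  # hardlink: same filesystem only


# Usage: a plain copy and a hardlink of the same source file.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.txt"
    source.write_text("hello")
    copyfile_sketch(source, Path(tmp) / "copy.txt")
    copyfile_sketch(source, Path(tmp) / "link.txt", "hardlink")
    copied_text = (Path(tmp) / "copy.txt").read_text()
    is_same_file = os.path.samefile(source, Path(tmp) / "link.txt")
    try:
        copyfile_sketch(source, Path(tmp) / "copy.txt")
        second_copy_error = None
    except FileExistsError as error:
        second_copy_error = type(error).__name__
```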

friendly_session(retries=3, total_timeout=300) async

Create an aiohttp session with retry capabilities.

Examples:

Use as async context:

>>> async with friendly_session(retries=5, total_timeout=60) as session:
...     r = await session.get("https://example.com/api/data")
...     print(r)
<ClientResponse(https://example.com/api/data) [404 Not Found]>
<CIMultiDictProxy('Accept-Ranges': 'bytes', ...

Parameters:

Name Type Description Default
retries int

The number of retry attempts for failed requests.

3
total_timeout int

The total timeout for a request in seconds.

300

populate_cache_command(raw_args=None)

Command line interface to populate the cache from an existing directory.

Can be called from the command line as:

python3 -m protein_quest.utils populate-cache /path/to/source/dir

Parameters:

Name Type Description Default
raw_args Sequence[str] | None

The raw command line arguments to parse. If None, uses sys.argv.

None

read_ids_from_csv(file, *, id_column, model_provider, transform_model_identifier=None)

Read model IDs from a CSV file.

The CSV file can provide source-specific IDs in id_column (for example, pdb_id or af_id). It can also provide generic identifiers through the model_provider and model_identifier columns. If the CSV contains only one column, every value in that column is treated as an ID, including the first row.

Parameters:

Name Type Description Default
file Path

Path to file containing CSV data.

required
id_column str

Name of the direct ID column to read when present.

required
model_provider str

Expected value in the model_provider column. Rows with a different provider are skipped.

required
transform_model_identifier Callable[[str], str] | None

Optional function to transform model_identifier values before adding them.

None

Returns:

Type Description
set[str]

A set of IDs extracted from the CSV file.

Raises:

Type Description
ValueError

If required columns are missing.
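
The rules described above can be sketched with the standard `csv` module. This is an illustration under stated assumptions, not the library's code. `read_ids_sketch` is a hypothetical name, it takes CSV text instead of a Path so that it stays self-contained, and the sample identifiers are made up.

```python
import csv
import io
from typing import Callable, Optional


def read_ids_sketch(
    text: str,
    *,
    id_column: str,
    model_provider: str,
    transform: Optional[Callable[[str], str]] = None,
) -> set[str]:
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return set()
    # Single-column file: every value is an ID, including the first row.
    if all(len(row) == 1 for row in rows):
        return {row[0].strip() for row in rows if row[0].strip()}
    header, *body = rows
    has_direct = id_column in header
    has_generic = "model_provider" in header and "model_identifier" in header
    if not has_direct and not has_generic:
        raise ValueError("CSV lacks both the ID column and the generic columns")
    ids: set[str] = set()
    for row in body:
        record = dict(zip(header, row))
        if has_direct and record.get(id_column):
            ids.add(record[id_column])
        if has_generic and record.get("model_provider") == model_provider:
            identifier = record.get("model_identifier", "")
            if identifier:
                ids.add(transform(identifier) if transform else identifier)
    return ids


# Illustrative data only; the identifiers are invented for this example.
sample = (
    "pdb_id,model_provider,model_identifier\n"
    "1abc,,\n"
    ",provider_a,MODEL-1\n"
    ",provider_b,MODEL-2\n"
)
ids = read_ids_sketch(
    sample, id_column="pdb_id", model_provider="provider_a", transform=str.lower
)
```

Here the row for `provider_b` is skipped because its provider does not match, and the matching identifier is lowercased by the transform.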

retrieve_files(urls, save_dir, max_parallel_downloads=5, retries=3, total_timeout=300, desc='Downloading files', cacher=None, chunk_size=524288, gzip_files=False, raise_for_not_found=True) async

Retrieve files from a list of URLs and save them to a directory.

Parameters:

Name Type Description Default
urls Iterable[tuple[URL | str, str]] | Iterable[tuple[URL | str, str, bool]]

An iterable of tuples. Each tuple contains either a URL and a filename, or a URL, a filename, and a flag indicating whether to download gzipped content. When the flag is True, as in (...,...,True), the server for that URL must be able to send gzip-encoded content.

required
save_dir Path

The directory to save the downloaded files to.

required
max_parallel_downloads int

The maximum number of files to download in parallel.

5
retries int

The number of times to retry a failed download.

3
total_timeout int

The total timeout for a download in seconds.

300
desc str

Description for the progress bar.

'Downloading files'
cacher Cacher | None

An optional cacher to use for caching files.

None
chunk_size int

The size of each chunk to read from the response.

524288
gzip_files bool

Whether to gzip the downloaded files. This requires that the server can send gzip-encoded content.

False
raise_for_not_found bool

Whether to raise an error on an HTTP 404 response. If False, the function omits the Path for any URL that returned HTTP 404 from the result and logs a debug message instead.

True

Returns:

Type Description
list[Path]

A list of paths to the downloaded files.
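
A common way to cap parallel downloads, as `max_parallel_downloads` does, is an `asyncio.Semaphore`. The sketch below illustrates only that idea. It performs no network access, `fetch_all_sketch` is a hypothetical name, and whether the library uses a semaphore internally is an assumption.

```python
import asyncio


async def fetch_all_sketch(names: list[str], max_parallel: int = 5) -> tuple[list[str], int]:
    """Run fake downloads with at most max_parallel in flight; return results and peak."""
    semaphore = asyncio.Semaphore(max_parallel)
    active = 0
    peak = 0

    async def fetch(name: str) -> str:
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stands in for the network transfer
            active -= 1
            return f"saved:{name}"

    # gather returns results in the same order as the input.
    results = await asyncio.gather(*(fetch(name) for name in names))
    return list(results), peak


results, peak = asyncio.run(fetch_all_sketch([f"file{i}" for i in range(10)], max_parallel=3))
```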

run_async(coroutine)

Run an async coroutine, raising a clearer error when called from within an already running event loop.

Parameters:

Name Type Description Default
coroutine Coroutine[Any, Any, R]

The async coroutine to run.

required

Returns:

Type Description
R

The result of the coroutine.

Raises:

Type Description
NestedAsyncIOLoopError

If called from a nested async I/O loop like in a Jupyter notebook.
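
One way to get this behaviour is to check for a running event loop before calling `asyncio.run`. The sketch below shows that approach. `run_async_sketch` and `NestedLoopSketchError` are hypothetical names, and the library may detect the nested loop differently.

```python
import asyncio
from collections.abc import Coroutine
from typing import Any, TypeVar

R = TypeVar("R")


class NestedLoopSketchError(RuntimeError):
    """Hypothetical stand-in for NestedAsyncIOLoopError."""


def run_async_sketch(coroutine: Coroutine[Any, Any, R]) -> R:
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so starting one is safe.
        return asyncio.run(coroutine)
    coroutine.close()  # avoid a "coroutine was never awaited" warning
    raise NestedLoopSketchError(
        "An event loop is already running (for example in a Jupyter notebook); "
        "await the coroutine directly instead."
    )


async def add(a: int, b: int) -> int:
    return a + b


async def call_from_inside_loop() -> str:
    try:
        run_async_sketch(add(1, 2))
    except NestedLoopSketchError:
        return "nested"
    return "ok"


total = run_async_sketch(add(1, 2))
nested_outcome = asyncio.run(call_from_inside_loop())
```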

user_cache_root_dir()

Get the user's root directory for caching files.

Returns:

Type Description
Path

The path to the user's cache directory for protein-quest.