
utils

Module for functions that are used in multiple places.

CopyMethod = Literal['copy', 'symlink', 'hardlink'] module-attribute

Methods for copying files.

copy_methods = set(get_args(CopyMethod)) module-attribute

Set of valid copy methods.
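
The two attributes above can be used to validate a user-supplied string before it reaches any copying code. The following is a minimal sketch; `parse_copy_method` is a hypothetical helper for illustration and is not part of the documented module.

```python
from typing import Literal, get_args

# Mirrors the two module attributes documented above.
CopyMethod = Literal["copy", "symlink", "hardlink"]
copy_methods = set(get_args(CopyMethod))


def parse_copy_method(value: str) -> str:
    """Hypothetical helper: return value unchanged if it is a valid copy method."""
    if value not in copy_methods:
        msg = f"Unknown copy method {value!r}, expected one of {sorted(copy_methods)}"
        raise ValueError(msg)
    return value
```

Deriving the set from the `Literal` with `get_args` keeps the type annotation and the runtime check in sync.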

Cacher

Bases: Protocol

Protocol for a cacher.

__contains__(item)

Check if a file is in the cache.

Parameters:

Name Type Description Default
item str | Path

The filename or Path to check.

required

Returns:

Type Description
bool

True if the file is in the cache, False otherwise.

copy_from_cache(target) async

Copy a file from the cache to a target location if it exists in the cache.

Assumes:

  • target does not exist.
  • the parent directory of target exists.

Parameters:

Name Type Description Default
target Path

The path to copy the file to.

required

Returns:

Type Description
Path | None

The path to the cached file if it was copied, None otherwise.

write_bytes(target, content) async

Write bytes to a file and cache it.

Parameters:

Name Type Description Default
target Path

The path to write the content to.

required
content bytes

The bytes to write to the file.

required

Returns:

Type Description
Path

The path to the cached file.

Raises:

Type Description
FileExistsError

If the target file already exists.

write_iter(target, content) async

Write content to a file and cache it.

Parameters:

Name Type Description Default
target Path

The path to write the content to.

required
content AsyncStreamIterator[bytes]

An async iterator that yields bytes to write to the file.

required

Returns:

Type Description
Path

The path to the cached file.

Raises:

Type Description
FileExistsError

If the target file already exists.

DirectoryCacher

Bases: Cacher

Class to cache files in a directory.

Caching is based on the file name only: two paths with the same file name are treated as the same file.

Attributes:

Name Type Description
cache_dir Path

The directory to use for caching.

copy_method CopyMethod

The method to use for copying files.

__init__(cache_dir=None, copy_method='hardlink')

Initialize the cacher.

Paths with the same file name are considered the same file.

Parameters:

Name Type Description Default
cache_dir Path | None

The directory to use for caching. If None, a default cache directory (~/.cache/protein-quest) is used.

None
copy_method CopyMethod

The method to use for copying.

'hardlink'

populate_cache(source_dir)

Populate the cache from an existing directory.

This will copy all files from the source directory to the cache directory. If a file with the same name already exists in the cache, it will be skipped.

Parameters:

Name Type Description Default
source_dir Path

The directory to populate the cache from.

required

Returns:

Type Description
dict[Path, Path]

A dictionary mapping source file paths to their cached paths.

Raises:

Type Description
NotADirectoryError

If the source_dir is not a directory.
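
The populate behaviour described above can be sketched with the standard library alone. `populate_cache_sketch` is a hypothetical stand-in, not the actual method. It assumes that only top-level files are considered, that files are plainly copied, and that skipped files are left out of the returned mapping; the real method may differ on all three points.

```python
import shutil
import tempfile
from pathlib import Path


def populate_cache_sketch(source_dir: Path, cache_dir: Path) -> dict[Path, Path]:
    """Hypothetical stand-in for DirectoryCacher.populate_cache."""
    if not source_dir.is_dir():
        raise NotADirectoryError(source_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached: dict[Path, Path] = {}
    for source in sorted(source_dir.iterdir()):
        if not source.is_file():
            continue  # assumption: subdirectories are ignored
        destination = cache_dir / source.name
        if destination.exists():
            continue  # a file with this name is already cached: skip it
        shutil.copyfile(source, destination)
        cached[source] = destination
    return cached


def demo() -> tuple[list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        source_dir = Path(tmp) / "downloads"
        cache_dir = Path(tmp) / "cache"
        source_dir.mkdir()
        cache_dir.mkdir()
        (source_dir / "1abc.cif").write_text("new")
        (source_dir / "2xyz.cif").write_text("new")
        (cache_dir / "2xyz.cif").write_text("old")  # already cached
        cached = populate_cache_sketch(source_dir, cache_dir)
        copied = sorted(path.name for path in cached)
        in_cache = sorted(path.name for path in cache_dir.iterdir())
        return copied, in_cache
```

In `demo()`, only `1abc.cif` is copied because `2xyz.cif` is already present in the cache directory.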

InvalidContentEncodingError

Bases: ClientResponseError

Content encoding is invalid.

NestedAsyncIOLoopError

Bases: RuntimeError

Custom error for nested async I/O loops.

PassthroughCacher

Bases: Cacher

A cacher that caches nothing.

On writes it just writes to the target path.

async_copyfile(source, target, copy_method='copy') async

Asynchronously make the target path refer to the same content as the source by copying, symlinking, or hardlinking.

Note that the hardlink copy method only works within the same filesystem and is harder to track. If you want to track cached files easily then use 'symlink'. On Windows you need developer mode or admin privileges to create symlinks.

Parameters:

Name Type Description Default
source Path

The source file to copy.

required
target Path

The target file to create.

required
copy_method CopyMethod

The method to use for copying.

'copy'

Raises:

Type Description
FileNotFoundError

If the source file or parent of target does not exist.

FileExistsError

If the target file already exists.

ValueError

If an unknown copy method is provided.
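
One common way to keep a blocking file copy from stalling the event loop is to run it in a worker thread. The sketch below shows that technique for the 'copy' method only. `async_copy_sketch` is a hypothetical helper, and the real function may use a different mechanism.

```python
import asyncio
import shutil
import tempfile
from pathlib import Path


async def async_copy_sketch(source: Path, target: Path) -> None:
    """Hypothetical helper: copy source to target without blocking the loop."""
    if target.exists():
        raise FileExistsError(target)
    # shutil.copyfile is blocking, so run it in a worker thread.
    await asyncio.to_thread(shutil.copyfile, source, target)


async def demo() -> list[bytes]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        sources = [root / f"source{i}.txt" for i in range(3)]
        targets = [root / f"target{i}.txt" for i in range(3)]
        for i, source in enumerate(sources):
            source.write_bytes(f"file {i}".encode())
        # The three copies are scheduled concurrently.
        await asyncio.gather(
            *(async_copy_sketch(s, t) for s, t in zip(sources, targets))
        )
        return [target.read_bytes() for target in targets]
```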

copyfile(source, target, copy_method='copy')

Make the target path refer to the same content as the source by copying, symlinking, or hardlinking.

Note that the hardlink copy method only works within the same filesystem and is harder to track. If you want to track cached files easily then use 'symlink'. On Windows you need developer mode or admin privileges to create symlinks.

Parameters:

Name Type Description Default
source Path

The source file to copy or link.

required
target Path

The target file to create.

required
copy_method CopyMethod

The method to use for copying.

'copy'

Raises:

Type Description
FileNotFoundError

If the source file or parent of target does not exist.

FileExistsError

If the target file already exists.

ValueError

If an unknown copy method is provided.
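
The dispatch over the three copy methods and the documented errors can be approximated with the standard library. `copyfile_sketch` is a hypothetical function; the order of its checks and its use of an absolute symlink target are assumptions rather than the module's actual behaviour.

```python
import os
import shutil
import tempfile
from pathlib import Path


def copyfile_sketch(source: Path, target: Path, copy_method: str = "copy") -> None:
    """Hypothetical stand-in for copyfile, built on the standard library."""
    if copy_method not in {"copy", "symlink", "hardlink"}:
        raise ValueError(f"Unknown copy method: {copy_method}")
    if not source.exists():
        raise FileNotFoundError(source)
    if target.exists() or target.is_symlink():
        raise FileExistsError(target)
    if copy_method == "copy":
        shutil.copyfile(source, target)
    elif copy_method == "symlink":
        # Assumption: link to the absolute path so the link survives a cwd change.
        target.symlink_to(source.resolve())
    else:
        # Hard links only work when both paths are on the same filesystem.
        os.link(source, target)


def demo() -> tuple[bytes, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source.txt"
        target = Path(tmp) / "target.txt"
        source.write_bytes(b"payload")
        copyfile_sketch(source, target, "copy")
        try:
            copyfile_sketch(source, target, "copy")
        except FileExistsError:
            refused_overwrite = True
        else:
            refused_overwrite = False
        return target.read_bytes(), refused_overwrite
```

All three underlying calls raise `FileNotFoundError` when the parent directory of the target is missing, which matches the documented behaviour.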

friendly_session(retries=3, total_timeout=300) async

Create an aiohttp session with retry capabilities.

Examples:

Use as async context:

>>> async with friendly_session(retries=5, total_timeout=60) as session:
...     r = await session.get("https://example.com/api/data")
...     print(r)
<ClientResponse(https://example.com/api/data) [404 Not Found]>
<CIMultiDictProxy('Accept-Ranges': 'bytes', ...

Parameters:

Name Type Description Default
retries int

The number of retry attempts for failed requests.

3
total_timeout int

The total timeout for a request in seconds.

300

populate_cache_command(raw_args=None)

Command line interface to populate the cache from an existing directory.

Can be called from the command line as:

python3 -m protein_quest.utils populate-cache /path/to/source/dir

Parameters:

Name Type Description Default
raw_args Sequence[str] | None

The raw command line arguments to parse. If None, uses sys.argv.

None

retrieve_files(urls, save_dir, max_parallel_downloads=5, retries=3, total_timeout=300, desc='Downloading files', cacher=None, chunk_size=524288, gzip_files=False) async

Retrieve files from a list of URLs and save them to a directory.

Parameters:

Name Type Description Default
urls Iterable[tuple[URL | str, str]]

An iterable of tuples, where each tuple contains a URL and the filename to save it under.

required
save_dir Path

The directory to save the downloaded files to.

required
max_parallel_downloads int

The maximum number of files to download in parallel.

5
retries int

The number of times to retry a failed download.

3
total_timeout int

The total timeout for a download in seconds.

300
desc str

Description for the progress bar.

'Downloading files'
cacher Cacher | None

An optional cacher to use for caching files.

None
chunk_size int

The size of each chunk to read from the response.

524288
gzip_files bool

Whether to gzip the downloaded files.

False

Returns:

Type Description
list[Path]

A list of paths to the downloaded files.
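
The role of `max_parallel_downloads` can be illustrated with a semaphore that bounds how many downloads run at once. The sketch below uses no network: `retrieve_sketch`, the `fetch` parameter, and `fake_fetch` are hypothetical, and retries, timeouts, caching, chunking, and gzip handling are left out.

```python
import asyncio
import tempfile
from collections.abc import Awaitable, Callable, Iterable
from pathlib import Path

Fetcher = Callable[[str], Awaitable[bytes]]


async def retrieve_sketch(
    urls: Iterable[tuple[str, str]],
    save_dir: Path,
    fetch: Fetcher,
    max_parallel_downloads: int = 5,
) -> list[Path]:
    """Hypothetical bounded-parallel downloader; fetch stands in for HTTP."""
    semaphore = asyncio.Semaphore(max_parallel_downloads)

    async def download(url: str, filename: str) -> Path:
        # At most max_parallel_downloads fetches hold the semaphore at once.
        async with semaphore:
            content = await fetch(url)
        target = save_dir / filename
        target.write_bytes(content)
        return target

    # gather returns results in the same order as the input.
    return list(await asyncio.gather(*(download(u, f) for u, f in urls)))


async def fake_fetch(url: str) -> bytes:
    """Pretend download that returns the URL itself as the body."""
    await asyncio.sleep(0)
    return url.encode()


async def demo() -> list[tuple[str, bytes]]:
    urls = [(f"https://example.com/{i}", f"file{i}.txt") for i in range(4)]
    with tempfile.TemporaryDirectory() as tmp:
        paths = await retrieve_sketch(
            urls, Path(tmp), fake_fetch, max_parallel_downloads=2
        )
        return [(path.name, path.read_bytes()) for path in paths]
```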

run_async(coroutine)

Run an async coroutine, raising a clearer error if an event loop is already running.

Parameters:

Name Type Description Default
coroutine Coroutine[Any, Any, R]

The async coroutine to run.

required

Returns:

Type Description
R

The result of the coroutine.

Raises:

Type Description
NestedAsyncIOLoopError

If called from a nested async I/O loop like in a Jupyter notebook.
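
The nested-loop check can be sketched by asking asyncio whether a loop is already running in the current thread. `run_async_sketch` and `NestedLoopError` are hypothetical names. The actual function may instead catch the `RuntimeError` that `asyncio.run` raises when called from a running event loop.

```python
import asyncio
from collections.abc import Coroutine
from typing import Any, TypeVar

R = TypeVar("R")


class NestedLoopError(RuntimeError):
    """Hypothetical counterpart of NestedAsyncIOLoopError."""


def run_async_sketch(coroutine: Coroutine[Any, Any, R]) -> R:
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread, so starting one is safe.
        return asyncio.run(coroutine)
    # Close the coroutine to avoid a "never awaited" warning.
    coroutine.close()
    raise NestedLoopError(
        "An event loop is already running (for example in a Jupyter notebook); "
        "await the coroutine directly instead."
    )


async def answer() -> int:
    return 42


async def nested_call_fails() -> bool:
    """Call the sync wrapper from inside a running loop and report the outcome."""
    try:
        run_async_sketch(answer())
    except NestedLoopError:
        return True
    return False
```

From synchronous code, `run_async_sketch(answer())` returns the coroutine's result; from inside a running loop, as in `nested_call_fails`, it raises the custom error.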

user_cache_root_dir()

Get the user's root directory for caching files.

Returns:

Type Description
Path

The path to the user's cache directory for protein-quest.