utils

Module for functions that are used in multiple places.

`CopyMethod = Literal['copy', 'symlink', 'hardlink']` `module-attribute`

Methods for copying files.

`copy_methods = set(get_args(CopyMethod))` `module-attribute`

Set of valid copy methods.
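The two attributes above can be reproduced with the standard library alone. The sketch below shows how `get_args` turns the `Literal` into a set for runtime validation; the helper `is_valid_copy_method` is hypothetical and not part of this module.

```python
from typing import Literal, get_args

# Same definitions as documented above.
CopyMethod = Literal["copy", "symlink", "hardlink"]
copy_methods = set(get_args(CopyMethod))


def is_valid_copy_method(name: str) -> bool:
    """Hypothetical helper: check a user-supplied string at runtime."""
    return name in copy_methods
```

Keeping the `Literal` as the single source of truth means the type checker and the runtime check cannot drift apart.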

`Cacher`

Bases: Protocol

Protocol for a cacher.

`__contains__(item)`

Check if a file is in the cache.

Parameters:

Name	Type	Description	Default
`item`	`str \| Path`	The filename or Path to check.	required

Returns:

Type	Description
`bool`	True if the file is in the cache, False otherwise.

`copy_from_cache(target)` `async`

Copy a file from the cache to a target location if it exists in the cache.

Assumes:

- target does not exist.
- the parent directory of target exists.

Parameters:

Name	Type	Description	Default
`target`	`Path`	The path to copy the file to.	required

Returns:

Type	Description
`Path \| None`	The path to the cached file if it was copied, None otherwise.

`write_bytes(target, content)` `async`

Write bytes to a file and cache it.

Parameters:

Name	Type	Description	Default
`target`	`Path`	The path to write the content to.	required
`content`	`bytes`	The bytes to write to the file.	required

Returns:

Type	Description
`Path`	The path to the cached file.

Raises:

Type	Description
`FileExistsError`	If the target file already exists.

`write_iter(target, content)` `async`

Write content to a file and cache it.

Parameters:

Name	Type	Description	Default
`target`	`Path`	The path to write the content to.	required
`content`	`AsyncStreamIterator[bytes]`	An async iterator that yields bytes to write to the file.	required

Returns:

Type	Description
`Path`	The path to the cached file.

Raises:

Type	Description
`FileExistsError`	If the target file already exists.
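To make the shape of the protocol concrete, here is a minimal sketch of a class with the four documented methods. `InMemoryCacher` is a hypothetical illustration, not part of protein-quest: it keeps content in a dict keyed by file name, and because it has no on-disk cache it returns the target path where the real protocol returns the path to the cached file.

```python
from __future__ import annotations

from collections.abc import AsyncIterator
from pathlib import Path


class InMemoryCacher:
    """Hypothetical cacher that stores file content in memory, keyed by file name."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def __contains__(self, item: str | Path) -> bool:
        return Path(item).name in self._store

    async def copy_from_cache(self, target: Path) -> Path | None:
        data = self._store.get(target.name)
        if data is None:
            return None
        # Assumes target does not exist and its parent directory does.
        target.write_bytes(data)
        return target  # assumption: no cached file on disk, so return target

    async def write_bytes(self, target: Path, content: bytes) -> Path:
        if target.exists():
            raise FileExistsError(target)
        target.write_bytes(content)
        self._store[target.name] = content
        return target

    async def write_iter(self, target: Path, content: AsyncIterator[bytes]) -> Path:
        # Collect the streamed chunks, then reuse write_bytes.
        chunks = [chunk async for chunk in content]
        return await self.write_bytes(target, b"".join(chunks))
```

A real implementation would probably stream chunks to disk instead of joining them in memory; the join here only keeps the sketch short.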

`DirectoryCacher`

Bases: Cacher

Class to cache files in a directory.

Caching logic is based on the file name only. Two paths with the same file name are considered the same file, regardless of their parent directories.

Attributes:

Name	Type	Description
`cache_dir`	`Path`	The directory to use for caching.
`copy_method`	`CopyMethod`	The method to use for copying files.

`__init__(cache_dir=None, copy_method='hardlink')`

Initialize the cacher.

Two paths with the same file name are considered the same file.

Parameters:

Name	Type	Description	Default
`cache_dir`	`Path \| None`	The directory to use for caching. If None, a default cache directory (~/.cache/protein-quest) is used.	`None`
`copy_method`	`CopyMethod`	The method to use for copying.	`'hardlink'`
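The name-only matching described for `DirectoryCacher` can be pictured with a short sketch. The function `cache_path_for` is hypothetical and only illustrates the idea that the parent directory of a path plays no role in the lookup.

```python
from __future__ import annotations

from pathlib import Path


def cache_path_for(cache_dir: Path, item: str | Path) -> Path:
    """Hypothetical helper: map any path to its cache entry by file name only."""
    return cache_dir / Path(item).name
```

Under this scheme, two downloads saved under different directories but with the same file name resolve to one cache entry, so file names need to be unique per distinct file.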

`populate_cache(source_dir)`

Populate the cache from an existing directory.

This will copy all files from the source directory to the cache directory. If a file with the same name already exists in the cache, it will be skipped.

Parameters:

Name	Type	Description	Default
`source_dir`	`Path`	The directory to populate the cache from.	required

Returns:

Type	Description
`dict[Path, Path]`	A dictionary mapping source file paths to their cached paths.

Raises:

Type	Description
`NotADirectoryError`	If the source_dir is not a directory.
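The behaviour described for `populate_cache` can be sketched with the standard library. This is an illustration under assumptions, not the library's code: `populate_cache_sketch` is a hypothetical name, it always copies rather than honouring `copy_method`, it only looks at files directly inside the source directory, and it leaves skipped files out of the returned mapping.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def populate_cache_sketch(source_dir: Path, cache_dir: Path) -> dict[Path, Path]:
    """Copy files from source_dir into cache_dir, skipping names already cached."""
    if not source_dir.is_dir():
        raise NotADirectoryError(source_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached: dict[Path, Path] = {}
    for source in sorted(source_dir.iterdir()):
        if not source.is_file():
            continue  # assumption: subdirectories are ignored
        destination = cache_dir / source.name
        if destination.exists():
            continue  # same name already cached, so skip it
        shutil.copyfile(source, destination)
        cached[source] = destination
    return cached
```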

`InvalidContentEncodingError`

Bases: ClientResponseError

Content encoding is invalid.

`NestedAsyncIOLoopError`

Bases: RuntimeError

Custom error for nested async I/O loops.

`PassthroughCacher`

Bases: Cacher

A cacher that caches nothing.

On writes it just writes to the target path.

`async_copyfile(source, target, copy_method='copy')` `async`

Asynchronously make the target path refer to the same content as the source by copying, symlinking, or hardlinking.

Note that the hardlink copy method only works within the same filesystem and is harder to track. If you want to track cached files easily then use 'symlink'. On Windows you need developer mode or admin privileges to create symlinks.

Parameters:

Name	Type	Description	Default
`source`	`Path`	The source file to copy.	required
`target`	`Path`	The target file to create.	required
`copy_method`	`CopyMethod`	The method to use for copying.	`'copy'`

Raises:

Type	Description
`FileNotFoundError`	If the source file or parent of target does not exist.
`FileExistsError`	If the target file already exists.
`ValueError`	If an unknown copy method is provided.
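One common way to make a blocking file copy awaitable is to run it in a worker thread. The sketch below shows that pattern with `asyncio.to_thread`; whether `async_copyfile` works this way is an assumption, and `async_copy_sketch` is a hypothetical name that only covers the `'copy'` method.

```python
import asyncio
import shutil
from pathlib import Path


async def async_copy_sketch(source: Path, target: Path) -> None:
    """Hypothetical: copy a file without blocking the event loop."""
    if target.exists():
        raise FileExistsError(target)
    # shutil.copyfile is blocking, so hand it to a worker thread.
    await asyncio.to_thread(shutil.copyfile, source, target)
```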

`copyfile(source, target, copy_method='copy')`

Make the target path refer to the same content as the source by copying, symlinking, or hardlinking.

Note that the hardlink copy method only works within the same filesystem and is harder to track. If you want to track cached files easily then use 'symlink'. On Windows you need developer mode or admin privileges to create symlinks.

Parameters:

Name	Type	Description	Default
`source`	`Path`	The source file to copy or link.	required
`target`	`Path`	The target file to create.	required
`copy_method`	`CopyMethod`	The method to use for copying.	`'copy'`

Raises:

Type	Description
`FileNotFoundError`	If the source file or parent of target does not exist.
`FileExistsError`	If the target file already exists.
`ValueError`	If an unknown copy method is provided.
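The three copy methods and the documented errors can be sketched with standard-library calls. This is an illustration, not the library's implementation: `copyfile_sketch` is a hypothetical name, and the order of the checks is an assumption.

```python
import os
import shutil
from pathlib import Path


def copyfile_sketch(source: Path, target: Path, copy_method: str = "copy") -> None:
    """Hypothetical dispatch over 'copy', 'symlink' and 'hardlink'."""
    if copy_method not in {"copy", "symlink", "hardlink"}:
        raise ValueError(f"Unknown copy method: {copy_method}")
    if not source.exists():
        raise FileNotFoundError(source)
    if not target.parent.exists():
        raise FileNotFoundError(target.parent)
    if target.exists():
        raise FileExistsError(target)
    if copy_method == "copy":
        shutil.copyfile(source, target)  # independent copy of the content
    elif copy_method == "symlink":
        target.symlink_to(source.resolve())  # target points at the source path
    else:
        os.link(source, target)  # second name for the same file, same filesystem only
```

A symlink shows its origin when the target is listed, whereas a hardlink is indistinguishable from an ordinary file, which is why symlinks are easier to trace back to the cache.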

`friendly_session(retries=3, total_timeout=300)` `async`

Create an aiohttp session with retry capabilities.

Examples:

Use as async context:

>>> async with friendly_session(retries=5, total_timeout=60) as session:
...     r = await session.get("https://example.com/api/data")
...     print(r)
<ClientResponse(https://example.com/api/data) [404 Not Found]>
<CIMultiDictProxy('Accept-Ranges': 'bytes', ...

Parameters:

Name	Type	Description	Default
`retries`	`int`	The number of retry attempts for failed requests.	`3`
`total_timeout`	`int`	The total timeout for a request in seconds.	`300`

`populate_cache_command(raw_args=None)`

Command line interface to populate the cache from an existing directory.

Can be called from the command line as:

python3 -m protein_quest.utils populate-cache /path/to/source/dir

Parameters:

Name	Type	Description	Default
`raw_args`	`Sequence[str] \| None`	The raw command line arguments to parse. If None, uses sys.argv.	`None`
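The `raw_args` convention matches how `argparse` treats its argument list: passing `None` makes it read `sys.argv`. The sketch below builds a parser for just the subcommand and positional argument shown in the command line above. `parse_args_sketch` is a hypothetical name, and the real command may accept options that are not documented here.

```python
from __future__ import annotations

import argparse
from collections.abc import Sequence
from pathlib import Path


def parse_args_sketch(raw_args: Sequence[str] | None = None) -> argparse.Namespace:
    """Hypothetical parser for `populate-cache <source_dir>`."""
    parser = argparse.ArgumentParser(prog="protein_quest.utils")
    subparsers = parser.add_subparsers(dest="command", required=True)
    populate = subparsers.add_parser("populate-cache")
    populate.add_argument("source_dir", type=Path)
    # With raw_args=None, argparse falls back to sys.argv[1:].
    return parser.parse_args(raw_args)
```

Accepting an explicit argument list makes the command easy to call from tests without touching `sys.argv`.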

`retrieve_files(urls, save_dir, max_parallel_downloads=5, retries=3, total_timeout=300, desc='Downloading files', cacher=None, chunk_size=524288, gzip_files=False)` `async`

Retrieve files from a list of URLs and save them to a directory.

Parameters:

Name	Type	Description	Default
`urls`	`Iterable[tuple[URL \| str, str]]`	A list of tuples, where each tuple contains a URL and a filename.	required
`save_dir`	`Path`	The directory to save the downloaded files to.	required
`max_parallel_downloads`	`int`	The maximum number of files to download in parallel.	`5`
`retries`	`int`	The number of times to retry a failed download.	`3`
`total_timeout`	`int`	The total timeout for a download in seconds.	`300`
`desc`	`str`	Description for the progress bar.	`'Downloading files'`
`cacher`	`Cacher \| None`	An optional cacher to use for caching files.	`None`
`chunk_size`	`int`	The size of each chunk to read from the response.	`524288`
`gzip_files`	`bool`	Whether to gzip the downloaded files.	`False`

Returns:

Type	Description
`list[Path]`	A list of paths to the downloaded files.
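A limit such as `max_parallel_downloads` is commonly enforced with an `asyncio.Semaphore`. The sketch below shows that pattern with a fake download so it runs without network access. It is an assumption that `retrieve_files` works this way, and `run_limited`, `fake_download`, and `demo` are hypothetical names.

```python
import asyncio


async def run_limited(jobs, max_parallel: int = 5) -> list:
    """Run zero-argument coroutine functions with at most max_parallel active."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(job):
        async with semaphore:
            return await job()

    # gather returns results in the order the jobs were given.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo() -> tuple:
    active = 0
    peak = 0

    async def fake_download(index: int) -> str:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for network I/O
        active -= 1
        return f"file_{index}.txt"

    jobs = [lambda i=i: fake_download(i) for i in range(6)]
    names = await run_limited(jobs, max_parallel=2)
    return names, peak
```

Calling `asyncio.run(demo())` returns the file names in submission order together with the highest number of downloads that were active at once, which never exceeds the limit.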

`run_async(coroutine)`

Run a coroutine to completion, raising a clearer error if an event loop is already running.

Parameters:

Name	Type	Description	Default
`coroutine`	`Coroutine[Any, Any, R]`	The async coroutine to run.	required

Returns:

Type	Description
`R`	The result of the coroutine.

Raises:

Type	Description
`NestedAsyncIOLoopError`	If called from a nested async I/O loop like in a Jupyter notebook.
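The nested-loop situation can be detected with `asyncio.get_running_loop`, which raises `RuntimeError` when no loop is running. The sketch below uses that check; how `run_async` detects the condition is an assumption, and `run_async_sketch` and `NestedLoopError` are hypothetical stand-ins for the documented names.

```python
import asyncio


class NestedLoopError(RuntimeError):
    """Hypothetical stand-in for NestedAsyncIOLoopError."""


def run_async_sketch(coroutine):
    """Run a coroutine, or fail with a clear message inside a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coroutine)
    coroutine.close()  # avoid a "coroutine was never awaited" warning
    raise NestedLoopError(
        "An event loop is already running (for example in a Jupyter notebook); "
        "await the coroutine directly instead."
    )
```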

`user_cache_root_dir()`

Get the user's root directory for caching files.

Returns:

Type	Description
`Path`	The path to the user's cache directory for protein-quest.
