utils
Module for functions that are used in multiple places.
CopyMethod = Literal['copy', 'symlink', 'hardlink']
module-attribute
Methods for copying files.
copy_methods = set(get_args(CopyMethod))
module-attribute
Set of valid copy methods.
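
As a rough sketch of how these two attributes fit together, the snippet below rebuilds them locally and uses the set to validate a user-supplied string. The helper `check_copy_method` is hypothetical and not part of the module.

```python
from typing import Literal, get_args

# Local re-creation of the two module attributes documented above.
CopyMethod = Literal["copy", "symlink", "hardlink"]
copy_methods = set(get_args(CopyMethod))


def check_copy_method(value: str) -> str:
    """Hypothetical helper: reject strings that are not a valid copy method."""
    if value not in copy_methods:
        msg = f"Unknown copy method {value!r}, expected one of {sorted(copy_methods)}"
        raise ValueError(msg)
    return value
```

Deriving the set from the `Literal` with `get_args` keeps the type hint and the runtime check in sync.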
Cacher
Bases: Protocol
Protocol for a cacher.
__contains__(item)
copy_from_cache(target)
async
Copy a file from the cache to a target location if it exists in the cache.
Assumes:
- target does not exist.
- the parent directory of target exists.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| target | Path | The path to copy the file to. | required |
Returns:
| Type | Description |
|---|---|
| Path \| None | The path to the cached file if it was copied, None otherwise. |
write_bytes(target, content)
async
Write bytes to a file and cache it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| target | Path | The path to write the content to. | required |
| content | bytes | The bytes to write to the file. | required |
Returns:
| Type | Description |
|---|---|
| Path | The path to the cached file. |
Raises:
| Type | Description |
|---|---|
| FileExistsError | If the target file already exists. |
write_iter(target, content)
async
Write content to a file and cache it.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| target | Path | The path to write the content to. | required |
| content | AsyncStreamIterator[bytes] | An async iterator that yields bytes to write to the file. | required |
Returns:
| Type | Description |
|---|---|
| Path | The path to the cached file. |
Raises:
| Type | Description |
|---|---|
| FileExistsError | If the target file already exists. |
DirectoryCacher
Bases: Cacher
Class to cache files in a directory.
Caching logic is based on the file name only: two paths with the same file name are considered the same file.
Attributes:
| Name | Type | Description |
|---|---|---|
| cache_dir | Path | The directory to use for caching. |
| copy_method | CopyMethod | The method to use for copying files. |
__init__(cache_dir=None, copy_method='hardlink')
Initialize the cacher.
Two paths with the same file name are considered the same file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cache_dir | Path \| None | The directory to use for caching. If None, a default cache directory (~/.cache/protein-quest) is used. | None |
| copy_method | CopyMethod | The method to use for copying. | 'hardlink' |
populate_cache(source_dir)
Populate the cache from an existing directory.
This will copy all files from the source directory to the cache directory. If a file with the same name already exists in the cache, it will be skipped.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source_dir | Path | The directory to populate the cache from. | required |
Returns:
| Type | Description |
|---|---|
| dict[Path, Path] | A dictionary mapping source file paths to their cached paths. |
Raises:
| Type | Description |
|---|---|
| NotADirectoryError | If the source_dir is not a directory. |
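
The name-based skip logic described for `populate_cache` can be sketched as follows. `populate_cache_sketch` is a hypothetical stand-in, not the library's code. It assumes a non-recursive scan, always copies (no linking), and leaves skipped files out of the returned mapping; none of these details are confirmed by the documentation above.

```python
import shutil
import tempfile
from pathlib import Path


def populate_cache_sketch(source_dir: Path, cache_dir: Path) -> dict[Path, Path]:
    """Hypothetical stand-in: copy files into cache_dir, keyed by file name."""
    if not source_dir.is_dir():
        raise NotADirectoryError(f"{source_dir} is not a directory")
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached: dict[Path, Path] = {}
    for source in sorted(source_dir.iterdir()):
        if not source.is_file():
            continue
        destination = cache_dir / source.name
        if destination.exists():
            # A file with this name is already cached: skip it.
            continue
        shutil.copyfile(source, destination)
        cached[source] = destination
    return cached


def demo_populate() -> tuple[list[str], int]:
    with tempfile.TemporaryDirectory() as tmp:
        source_dir = Path(tmp) / "downloads"
        source_dir.mkdir()
        (source_dir / "a.cif").write_text("a")
        (source_dir / "b.cif").write_text("b")
        cache_dir = Path(tmp) / "cache"
        first = populate_cache_sketch(source_dir, cache_dir)
        second = populate_cache_sketch(source_dir, cache_dir)  # everything skipped
        return sorted(p.name for p in first.values()), len(second)
```

The second call in `demo_populate` copies nothing because both file names are already present in the cache directory.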
InvalidContentEncodingError
Bases: ClientResponseError
Content encoding is invalid.
NestedAsyncIOLoopError
PassthroughCacher
async_copyfile(source, target, copy_method='copy')
async
Asynchronously make the target path the same file as the source by copying, symlinking, or hardlinking.
Note that the hardlink copy method only works within the same filesystem and is harder to track. If you want to track cached files easily then use 'symlink'. On Windows you need developer mode or admin privileges to create symlinks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | Path | The source file to copy. | required |
| target | Path | The target file to create. | required |
| copy_method | CopyMethod | The method to use for copying. | 'copy' |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the source file or parent of target does not exist. |
| FileExistsError | If the target file already exists. |
| ValueError | If an unknown copy method is provided. |
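
One common way to make a blocking file copy awaitable is to offload it to a worker thread. The sketch below shows that pattern with `asyncio.to_thread`; it is an assumption about how such a function could be written, not a description of how `async_copyfile` is implemented, and `async_copy_sketch` is a hypothetical name that only covers the 'copy' method.

```python
import asyncio
import shutil
import tempfile
from pathlib import Path


async def async_copy_sketch(source: Path, target: Path) -> None:
    """Hypothetical: offload a blocking copy so the event loop stays responsive."""
    if target.exists():
        raise FileExistsError(target)
    # shutil.copyfile blocks, so run it in a worker thread.
    await asyncio.to_thread(shutil.copyfile, source, target)


async def demo_async_copy() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source.txt"
        source.write_text("hello")
        target = Path(tmp) / "target.txt"
        await async_copy_sketch(source, target)
        return target.read_text()
```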
copyfile(source, target, copy_method='copy')
Make the target path the same file as the source by copying, symlinking, or hardlinking.
Note that the hardlink copy method only works within the same filesystem and is harder to track. If you want to track cached files easily then use 'symlink'. On Windows you need developer mode or admin privileges to create symlinks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | Path | The source file to copy or link. | required |
| target | Path | The target file to create. | required |
| copy_method | CopyMethod | The method to use for copying. | 'copy' |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the source file or parent of target does not exist. |
| FileExistsError | If the target file already exists. |
| ValueError | If an unknown copy method is provided. |
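
The three copy methods and the documented exceptions can be illustrated with standard-library calls only. `copyfile_sketch` below is a hypothetical re-implementation written for this page. The order of the checks and the choice of an absolute symlink target are assumptions.

```python
import os
import shutil
import tempfile
from pathlib import Path


def copyfile_sketch(source: Path, target: Path, copy_method: str = "copy") -> None:
    """Hypothetical re-implementation of the documented behavior."""
    if copy_method not in {"copy", "symlink", "hardlink"}:
        raise ValueError(f"Unknown copy method: {copy_method}")
    if not source.exists():
        raise FileNotFoundError(source)
    if target.exists():
        raise FileExistsError(target)
    if copy_method == "copy":
        shutil.copyfile(source, target)
    elif copy_method == "symlink":
        # An absolute link target is a choice made for this sketch.
        target.symlink_to(source.resolve())
    else:
        # Hard links require source and target on the same filesystem.
        os.link(source, target)


def demo_copyfile() -> tuple[str, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source.txt"
        source.write_text("data")
        copied = Path(tmp) / "copied.txt"
        linked = Path(tmp) / "linked.txt"
        copyfile_sketch(source, copied, "copy")
        copyfile_sketch(source, linked, "hardlink")
        return copied.read_text(), os.path.samefile(source, linked)
```

A hard link shares the source's inode, which is why `os.path.samefile` reports the two paths as the same file and why the link is harder to trace back to the cache than a symlink.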
friendly_session(retries=3, total_timeout=300)
async
Create an aiohttp session with retry capabilities.
Examples:
Use as async context:
>>> async with friendly_session(retries=5, total_timeout=60) as session:
...     r = await session.get("https://example.com/api/data")
...     print(r)
<ClientResponse(https://example.com/api/data) [404 Not Found]>
<CIMultiDictProxy('Accept-Ranges': 'bytes', ...
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| retries | int | The number of retry attempts for failed requests. | 3 |
| total_timeout | int | The total timeout for a request in seconds. | 300 |
populate_cache_command(raw_args=None)
Command line interface to populate the cache from an existing directory.
Can be called from the command line as:
python3 -m protein_quest.utils populate-cache /path/to/source/dir
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_args | Sequence[str] \| None | The raw command line arguments to parse. If None, uses sys.argv. | None |
retrieve_files(urls, save_dir, max_parallel_downloads=5, retries=3, total_timeout=300, desc='Downloading files', cacher=None, chunk_size=524288, gzip_files=False)
async
Retrieve files from a list of URLs and save them to a directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| urls | Iterable[tuple[URL \| str, str]] | A list of tuples, where each tuple contains a URL and a filename. | required |
| save_dir | Path | The directory to save the downloaded files to. | required |
| max_parallel_downloads | int | The maximum number of files to download in parallel. | 5 |
| retries | int | The number of times to retry a failed download. | 3 |
| total_timeout | int | The total timeout for a download in seconds. | 300 |
| desc | str | Description for the progress bar. | 'Downloading files' |
| cacher | Cacher \| None | An optional cacher to use for caching files. | None |
| chunk_size | int | The size of each chunk to read from the response. | 524288 |
| gzip_files | bool | Whether to gzip the downloaded files. | False |
Returns:
| Type | Description |
|---|---|
| list[Path] | A list of paths to the downloaded files. |
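
A cap such as `max_parallel_downloads` is commonly enforced with an `asyncio.Semaphore`. The sketch below demonstrates that pattern with fake downloads (a short sleep instead of an HTTP request). `fetch_all_sketch` is a hypothetical name, and whether `retrieve_files` uses a semaphore internally is an assumption.

```python
import asyncio


async def fetch_all_sketch(names: list[str], max_parallel: int = 5) -> tuple[list[str], int]:
    """Hypothetical: run fake downloads with at most max_parallel in flight."""
    semaphore = asyncio.Semaphore(max_parallel)
    in_flight = 0
    peak = 0

    async def fake_download(name: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for the HTTP request
            in_flight -= 1
            return name

    # gather returns results in the same order as its inputs.
    results = await asyncio.gather(*(fake_download(n) for n in names))
    return list(results), peak
```

The function also returns the peak number of concurrent downloads so the limit can be checked.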
run_async(coroutine)
Run an async coroutine with a nicer error message.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| coroutine | Coroutine[Any, Any, R] | The async coroutine to run. | required |
Returns:
| Type | Description |
|---|---|
| R | The result of the coroutine. |
Raises:
| Type | Description |
|---|---|
| NestedAsyncIOLoopError | If called from a nested async I/O loop like in a Jupyter notebook. |
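
The nested-loop situation arises because `asyncio.run` cannot start while an event loop is already running in the same thread, as is the case inside a Jupyter cell. A minimal sketch of detecting this up front is shown below. `run_async_sketch` and `NestedLoopErrorSketch` are hypothetical stand-ins, and the error message is invented for illustration.

```python
import asyncio
from collections.abc import Coroutine
from typing import Any, TypeVar

R = TypeVar("R")


class NestedLoopErrorSketch(RuntimeError):
    """Hypothetical stand-in for NestedAsyncIOLoopError."""


def _loop_is_running() -> bool:
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop raises when no loop runs in this thread.
        return False
    return True


def run_async_sketch(coroutine: Coroutine[Any, Any, R]) -> R:
    if _loop_is_running():
        coroutine.close()  # avoid a "coroutine was never awaited" warning
        msg = "An event loop is already running; await the coroutine directly instead."
        raise NestedLoopErrorSketch(msg)
    return asyncio.run(coroutine)


async def answer() -> int:
    return 42


async def call_from_inside_loop() -> str:
    # Simulates a notebook cell: a loop is already running here.
    try:
        run_async_sketch(answer())
    except NestedLoopErrorSketch:
        return "nested"
    return "ok"
```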
user_cache_root_dir()
Get the user's root directory for caching files.
Returns:
| Type | Description |
|---|---|
| Path | The path to the user's cache directory for protein-quest. |