uniprot
Module for searching UniProtKB using SPARQL.
ComplexPortalEntry
dataclass
A ComplexPortal entry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query_protein | str | The UniProt accession used to find the entry. | required |
| complex_id | str | The ComplexPortal identifier (for example "CPX-1234"). | required |
| complex_url | str | The URL to the ComplexPortal entry. | required |
| complex_title | str | The title of the complex. | required |
| members | set[str] | UniProt accessions which are members of the complex. | required |
PdbChainLengthError
PdbResult
dataclass
Result of a PDB search in UniProtKB.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| id | str | PDB ID (e.g., "1H3O"). | required |
| method | str | Method used for the PDB entry (e.g., "X-ray diffraction"). | required |
| uniprot_chains | str | Chains in UniProt format (e.g., "A/B=1-42,A/B=50-99"). | required |
| resolution | str \| None | Resolution of the PDB entry (e.g., "2.0" for 2.0 Å). Optional. | None |
chain
cached
property
The first chain listed in the UniProt chains (self.uniprot_chains).
chain_length
cached
property
The length of the chain, derived from the UniProt chains (self.uniprot_chains).
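To make the UniProt chain format concrete, here is a minimal sketch of how a string such as "A/B=1-42,A/B=50-99" could be parsed. The helper names `first_chain` and `chain_length` are hypothetical, and summing the residue ranges is an assumed interpretation of "length"; the library's own properties may behave differently.

```python
import re

# Hypothetical pattern for one segment such as "A/B=1-42".
_SEGMENT = re.compile(r"^(?P<chains>[^=]+)=(?P<start>\d+)-(?P<end>\d+)$")


def _parse_segment(segment: str) -> re.Match:
    """Match a single "chains=start-end" segment or raise ValueError."""
    match = _SEGMENT.match(segment.strip())
    if match is None:
        raise ValueError(f"Cannot parse chain segment: {segment!r}")
    return match


def first_chain(uniprot_chains: str) -> str:
    """Return the first chain ID of the first segment."""
    match = _parse_segment(uniprot_chains.split(",")[0])
    return match.group("chains").split("/")[0]


def chain_length(uniprot_chains: str) -> int:
    """Sum the residues covered by every range (assumed interpretation)."""
    total = 0
    for segment in uniprot_chains.split(","):
        match = _parse_segment(segment)
        total += int(match.group("end")) - int(match.group("start")) + 1
    return total
```

Under these assumptions, "A/B=1-42,A/B=50-99" has first chain "A" and covers 42 + 50 residues.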
Query
dataclass
Search query for UniProtKB.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| taxon_id | str \| None | NCBI Taxon ID to filter results by organism (e.g., "9606" for human). | required |
| reviewed | bool \| None | Whether to filter results by reviewed status (True for reviewed, False for unreviewed). | None |
| subcellular_location_uniprot | str \| None | Subcellular location in UniProt format (e.g., "nucleus"). | None |
| subcellular_location_go | list[str] \| None | Subcellular location in GO format. Can be a single GO term (e.g., ["GO:0005634"]) or a collection of GO terms (e.g., ["GO:0005634", "GO:0005737"]). | None |
| molecular_function_go | list[str] \| None | Molecular function in GO format. Can be a single GO term (e.g., ["GO:0003674"]) or a collection of GO terms (e.g., ["GO:0003674", "GO:0008150"]). | None |
| min_sequence_length | int \| None | Minimum length of the canonical sequence. | None |
| max_sequence_length | int \| None | Maximum length of the canonical sequence. | None |
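As an illustration of how such query fields might translate into SPARQL, the sketch below builds WHERE-clause fragments for a subset of the parameters. `QuerySketch` and `where_fragments` are hypothetical stand-ins, not the library's classes or its actual query; the triple patterns reuse the `up:organism`, `up:reviewed`, `up:sequence` and `rdf:value` predicates that appear in the example query for map_uniprot_accessions2uniprot_details. The GO and subcellular-location filters are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuerySketch:
    """Hypothetical stand-in mirroring a subset of the Query fields."""

    taxon_id: Optional[str]
    reviewed: Optional[bool] = None
    min_sequence_length: Optional[int] = None
    max_sequence_length: Optional[int] = None


def where_fragments(query: QuerySketch) -> list:
    """Build SPARQL WHERE fragments; only set fields add a constraint."""
    fragments = ["?protein a up:Protein ."]
    if query.taxon_id is not None:
        taxon_iri = f"<http://purl.uniprot.org/taxonomy/{query.taxon_id}>"
        fragments.append(f"?protein up:organism {taxon_iri} .")
    if query.reviewed is not None:
        literal = "true" if query.reviewed else "false"
        fragments.append(f"?protein up:reviewed {literal} .")
    has_min = query.min_sequence_length is not None
    has_max = query.max_sequence_length is not None
    if has_min or has_max:
        # Bind the sequence once, then filter on its length.
        fragments.append("?protein up:sequence ?isoform .")
        fragments.append("?isoform rdf:value ?sequence .")
    if has_min:
        fragments.append(f"FILTER(STRLEN(?sequence) >= {query.min_sequence_length})")
    if has_max:
        fragments.append(f"FILTER(STRLEN(?sequence) <= {query.max_sequence_length})")
    return fragments
```

Fields left as None add no constraint, which matches the "filter only when set" reading of the parameter table.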
UniprotDetails
dataclass
Details of a UniProt entry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uniprot_accession | str | UniProt accession. | required |
| uniprot_id | str | UniProt ID (mnemonic). | required |
| sequence_length | int | Length of the canonical sequence. | required |
| reviewed | bool | Whether the entry is reviewed (Swiss-Prot) or unreviewed (TrEMBL). | required |
| protein_name | str | Recommended protein name. | required |
| taxon_id | int | NCBI Taxonomy ID of the organism. | required |
| taxon_name | str | Scientific name of the organism. | required |
filter_pdb_results_on_chain_length(pdb_results, min_residues, max_residues, keep_invalid=False)
Filter PDB results based on chain length.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pdb_results | PdbResults | Dictionary with protein IDs as keys and sets of PDB results as values. | required |
| min_residues | int \| None | Minimum number of residues required in the chain mapped to the UniProt accession. If None, no minimum is applied. | required |
| max_residues | int \| None | Maximum number of residues allowed in the chain mapped to the UniProt accession. If None, no maximum is applied. | required |
| keep_invalid | bool | If True, PDB results whose chain length could not be determined are kept. If False, they are filtered out. A warning is logged whenever the length cannot be determined. | False |
Returns:
| Type | Description |
|---|---|
| PdbResults | Filtered dictionary with protein IDs as keys and sets of PDB results as values. |
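The filtering rules described above can be sketched as follows. `FakePdbResult` and `filter_on_chain_length` are hypothetical stand-ins, not the library's code: an undeterminable length is modelled here as `None`, logging is omitted, and dropping accessions that end up with no results is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FakePdbResult:
    """Stand-in for PdbResult; chain_length is None when undeterminable."""

    id: str
    chain_length: Optional[int]


def filter_on_chain_length(pdb_results, min_residues, max_residues, keep_invalid=False):
    """Keep results whose chain length lies within the optional bounds."""
    filtered = {}
    for accession, results in pdb_results.items():
        kept = set()
        for result in results:
            length = result.chain_length
            if length is None:
                # Invalid length: keep only when explicitly requested.
                if keep_invalid:
                    kept.add(result)
                continue
            if min_residues is not None and length < min_residues:
                continue
            if max_residues is not None and length > max_residues:
                continue
            kept.add(result)
        if kept:  # assumption: accessions without results are dropped
            filtered[accession] = kept
    return filtered
```

Passing `None` for both bounds leaves every result with a known length untouched, so only `keep_invalid` decides what happens to the rest.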
map_uniprot_accessions2uniprot_details(uniprot_accessions, timeout=1800, batch_size=1000)
Map UniProt accessions to UniProt details by querying the UniProt SPARQL endpoint.
Example:
SPARQL query to get details for 7 UniProt entries, run on https://sparql.uniprot.org/sparql.
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT
  (?ac AS ?uniprot_accession)
  ?uniprot_id
  (STRAFTER(STR(?organism), "taxonomy/") AS ?taxon_id)
  ?taxon_name
  ?reviewed
  ?protein_name
  (STRLEN(?sequence) AS ?seq_length)
WHERE {
  # Input UniProt accessions
  VALUES (?ac) { ("P05067") ("A6NGD5") ("O14627") ("P00697") ("P42284") ("A0A0B5AC95") ("A0A0S2Z4R0")}
  BIND (IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?ac)) AS ?protein)
  ?protein a up:Protein .
  ?protein up:mnemonic ?uniprot_id .
  ?protein up:organism ?organism .
  ?organism up:scientificName ?taxon_name .
  ?protein up:reviewed ?reviewed .
  ?protein up:recommendedName/up:fullName ?protein_name .
  ?protein up:sequence ?isoform .
  ?isoform a up:Simple_Sequence .
  ?isoform rdf:value ?sequence .
  BIND (IRI(STRBEFORE(REPLACE(
    STR(?isoform), "http://purl.uniprot.org/isoforms/", "http://purl.uniprot.org/uniprot/"
  ), "-")) AS ?ac_of_isoform)
  FILTER(?ac_of_isoform = ?protein)
}
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uniprot_accessions | Collection[str] | Collection of UniProt accessions. | required |
| timeout | int | Timeout for the SPARQL query in seconds. | 1800 |
| batch_size | int | Size of batches to process the UniProt accessions. | 1000 |
Yields:
| Type | Description |
|---|---|
| Generator[UniprotDetails] | UniprotDetails objects in random order. |
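The `batch_size` parameter implies that accessions are sent to the endpoint in chunks. A minimal sketch of that idea, assuming each chunk is turned into a `VALUES` clause like the one in the example query; `batched` and `values_clause` are hypothetical helpers, not the library's internals.

```python
from collections.abc import Collection, Iterator


def batched(accessions: Collection[str], batch_size: int) -> Iterator[list]:
    """Yield consecutive lists of at most batch_size accessions."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = list(accessions)
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def values_clause(batch: list) -> str:
    """Format one batch as a SPARQL VALUES clause bound to ?ac."""
    rows = " ".join(f'("{accession}")' for accession in batch)
    return f"VALUES (?ac) {{ {rows} }}"
```

Each clause would replace the hard-coded `VALUES` line of the example query, so one SPARQL request is issued per batch.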
search4af(uniprot_accs, min_sequence_length=None, max_sequence_length=None, limit=10000, timeout=1800, batch_size=10000)
Search for AlphaFold entries referenced by the given UniProtKB accessions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uniprot_accs | Collection[str] | UniProt accessions. | required |
| min_sequence_length | int \| None | Minimum length of the canonical sequence. | None |
| max_sequence_length | int \| None | Maximum length of the canonical sequence. | None |
| limit | int | Maximum number of results to return. | 10000 |
| timeout | int | Timeout for the SPARQL query in seconds. | 1800 |
| batch_size | int | Size of batches to process the UniProt accessions. | 10000 |
Returns:
| Type | Description |
|---|---|
| dict[str, set[str]] | Dictionary with protein IDs as keys and sets of AlphaFold IDs as values. |
search4emdb(uniprot_accs, limit=10000, timeout=1800)
Search for EMDB entries referenced by the given UniProtKB accessions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uniprot_accs | Iterable[str] | UniProt accessions. | required |
| limit | int | Maximum number of results to return. | 10000 |
| timeout | int | Timeout for the SPARQL query in seconds. | 1800 |
Returns:
| Type | Description |
|---|---|
| dict[str, set[str]] | Dictionary with protein IDs as keys and sets of EMDB IDs as values. |
search4interaction_partners(uniprot_accession, excludes=None, limit=10000, timeout=1800)
Search for interaction partners of a given UniProt accession using ComplexPortal database references.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uniprot_accession | str | UniProt accession to search interaction partners for. | required |
| excludes | set[str] \| None | Set of UniProt accessions to exclude from the results, for example already known interaction partners. If None, no complex members are excluded. | None |
| limit | int | Maximum number of results to return. | 10000 |
| timeout | int | Timeout for the SPARQL query in seconds. | 1800 |
Returns:
| Type | Description |
|---|---|
| dict[str, set[str]] | Dictionary with UniProt accessions of interaction partners as keys and sets of ComplexPortal entry IDs in which the interaction occurs as values. |
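The shape of the returned dictionary can be illustrated by deriving it from complex membership. This is a sketch under assumptions, not the library's implementation: `EntrySketch` is a hypothetical stand-in for ComplexPortalEntry, `partners_from_entries` is a hypothetical helper, and leaving the query protein out of its own partner list is an assumption.

```python
from dataclasses import dataclass


@dataclass
class EntrySketch:
    """Hypothetical stand-in with the ComplexPortalEntry fields used here."""

    query_protein: str
    complex_id: str
    members: set


def partners_from_entries(entries, uniprot_accession, excludes=None):
    """Map each partner accession to the IDs of the complexes it shares."""
    # Assumption: the query protein is never reported as its own partner.
    excluded = set(excludes or ()) | {uniprot_accession}
    partners = {}
    for entry in entries:
        for member in entry.members - excluded:
            partners.setdefault(member, set()).add(entry.complex_id)
    return partners
```

A partner that shares several complexes with the query protein appears once as a key, with every shared complex ID collected in its value set.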
search4macromolecular_complexes(uniprot_accs, limit=10000, timeout=1800)
Search for macromolecular complexes by UniProtKB accessions.
Queries the UniProt SPARQL endpoint for references to and from the ComplexPortal database (https://www.ebi.ac.uk/complexportal/).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uniprot_accs | Iterable[str] | UniProt accessions. | required |
| limit | int | Maximum number of results to return. | 10000 |
| timeout | int | Timeout for the SPARQL query in seconds. | 1800 |
Returns:
| Type | Description |
|---|---|
| list[ComplexPortalEntry] | List of ComplexPortalEntry objects. |
search4pdb(uniprot_accs, limit=10000, timeout=1800, batch_size=10000)
Search for PDB entries referenced by the given UniProtKB accessions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| uniprot_accs | Collection[str] | UniProt accessions. | required |
| limit | int | Maximum number of results to return. | 10000 |
| timeout | int | Timeout for the SPARQL query in seconds. | 1800 |
| batch_size | int | Size of batches to process the UniProt accessions. | 10000 |
Returns:
| Type | Description |
|---|---|
| PdbResults | Dictionary with protein IDs as keys and sets of PDB results as values. |
search4uniprot(query, limit=10000, timeout=1800)
Search for UniProtKB entries based on the given query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | Query | Query object containing search parameters. | required |
| limit | int | Maximum number of results to return. | 10000 |
| timeout | int | Timeout for the SPARQL query in seconds. | 1800 |
Returns:
| Type | Description |
|---|---|
| set[str] | Set of UniProt accessions. |