uniprot
Module for searching UniProtKB using SPARQL.
ComplexPortalEntry
dataclass
A ComplexPortal entry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query_protein` | `str` | The UniProt accession used to find the entry. | required |
| `complex_id` | `str` | The ComplexPortal identifier (for example "CPX-1234"). | required |
| `complex_url` | `str` | The URL to the ComplexPortal entry. | required |
| `complex_title` | `str` | The title of the complex. | required |
| `members` | `set[str]` | UniProt accessions which are members of the complex. | required |
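As a rough illustration of the shape of this dataclass, the sketch below defines a local stand-in with the same field names. `ComplexPortalEntrySketch`, the URL, the title, and the member accessions are hypothetical placeholders chosen for the example; they are not taken from the module or from ComplexPortal.

```python
from dataclasses import dataclass


@dataclass
class ComplexPortalEntrySketch:
    """Hypothetical stand-in that mirrors the documented fields."""

    query_protein: str
    complex_id: str
    complex_url: str
    complex_title: str
    members: set[str]


# All values below are placeholders for illustration only.
entry = ComplexPortalEntrySketch(
    query_protein="P05067",
    complex_id="CPX-1234",
    complex_url="https://example.org/complex/CPX-1234",
    complex_title="Placeholder complex title",
    members={"P05067", "O14627"},
)

# Members other than the protein that was used to find the entry.
other_members = entry.members - {entry.query_protein}
```

Because `members` is a set, set arithmetic such as the difference above is a natural way to work with an entry.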
Query
dataclass
Search query for UniProtKB.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `taxon_id` | `int \| None` | NCBI Taxon ID to filter results by organism (for example 9606 for human). | `None` |
| `reviewed` | `bool \| None` | Whether to filter results by reviewed status (True for reviewed, False for unreviewed). | `None` |
| `subcellular_location_uniprot` | `str \| None` | Subcellular location in UniProt format (for example "nucleus"). | `None` |
| `subcellular_location_go` | `Annotated[set[str], Parameter(negative='')] \| None` | Subcellular location in GO format. Can be a single GO term (for example, ["GO:0005634"]) or a collection of GO terms (for example, ["GO:0005634", "GO:0005737"]), which are searched with OR logic. | `None` |
| `molecular_function_go` | `Annotated[set[str], Parameter(negative='')] \| None` | Molecular function in GO format. Can be a single GO term (for example, ["GO:0003674"]) or a collection of GO terms (for example, ["GO:0003674", "GO:0008150"]), which are searched with OR logic. | `None` |
| `min_sequence_length` | `int \| None` | Minimum length of the canonical sequence. | `None` |
| `max_sequence_length` | `int \| None` | Maximum length of the canonical sequence. | `None` |
UniprotDetails
Bases: TypedDict
Details of a UniProt entry.
Attributes:
| Name | Type | Description |
|---|---|---|
| `uniprot_accession` | `str` | UniProt accession. |
| `uniprot_id` | `str` | UniProt ID (mnemonic). |
| `sequence_length` | `int` | Length of the canonical sequence. |
| `reviewed` | `bool` | Whether the entry is reviewed (Swiss-Prot) or unreviewed (TrEMBL). |
| `protein_name` | `str` | Recommended protein name. |
| `taxon_id` | `int` | NCBI Taxonomy ID of the organism. |
| `taxon_name` | `str` | Scientific name of the organism. |
map_uniprot_accessions2uniprot_details(uniprot_accessions, timeout=1800, batch_size=1000)
Map UniProt accessions to UniProt details by querying the UniProt SPARQL endpoint.
Example:
SPARQL query to get details for 7 UniProt entries, run on https://sparql.uniprot.org/sparql.

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT
        (?ac AS ?uniprot_accession)
        ?uniprot_id
        (STRAFTER(STR(?organism), "taxonomy/") AS ?taxon_id)
        ?taxon_name
        ?reviewed
        ?protein_name
        (STRLEN(?sequence) AS ?seq_length)
    WHERE {
        # Input UniProt accessions
        VALUES (?ac) { ("P05067") ("A6NGD5") ("O14627") ("P00697") ("P42284") ("A0A0B5AC95") ("A0A0S2Z4R0") }
        BIND (IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?ac)) AS ?protein)
        ?protein a up:Protein .
        ?protein up:mnemonic ?uniprot_id .
        ?protein up:organism ?organism .
        ?organism up:scientificName ?taxon_name .
        ?protein up:reviewed ?reviewed .
        ?protein up:recommendedName/up:fullName ?protein_name .
        ?protein up:sequence ?isoform .
        ?isoform a up:Simple_Sequence .
        ?isoform rdf:value ?sequence .
        BIND (IRI(STRBEFORE(REPLACE(
            STR(?isoform), "http://purl.uniprot.org/isoforms/", "http://purl.uniprot.org/uniprot/"
        ), "-")) AS ?ac_of_isoform)
        FILTER(?ac_of_isoform = ?protein)
    }
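The `VALUES` clause in the query above lists the input accessions one per parenthesised row. A minimal sketch of how such a clause could be rendered from a list of accessions follows; `values_clause` is a hypothetical helper written for this example, not a function of the module, and the alphanumeric check is an added precaution rather than documented behaviour.

```python
def values_clause(accessions, variable="ac"):
    """Render a SPARQL VALUES clause with one row per accession."""
    rows = []
    for accession in accessions:
        # Precaution (assumption): only accept plain alphanumeric accessions
        # so that no quote characters end up inside the query text.
        if not accession.isalnum():
            raise ValueError(f"Unexpected characters in accession: {accession!r}")
        rows.append(f'("{accession}")')
    return f"VALUES (?{variable}) {{ {' '.join(rows)} }}"


clause = values_clause(["P05067", "A6NGD5"])
```

Here `clause` holds the text `VALUES (?ac) { ("P05067") ("A6NGD5") }`, which has the same form as the clause in the example query.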
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `uniprot_accessions` | `Collection[str]` | Collection of UniProt accessions. | required |
| `timeout` | `int` | Timeout for the SPARQL query in seconds. | `1800` |
| `batch_size` | `int` | Number of UniProt accessions to process per batch. | `1000` |
Yields:
| Type | Description |
|---|---|
| `Generator[UniprotDetails]` | UniprotDetails objects in arbitrary order. |
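To make the role of `batch_size` concrete, the sketch below splits a list of accessions into consecutive batches, each of which could then be sent as a separate query. `batched` is a hypothetical helper for illustration; how the module actually forms its batches is not shown in this reference.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def batched(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice returns an empty list once the iterator is exhausted.
    while batch := list(islice(iterator, batch_size)):
        yield batch


accessions = ["P05067", "A6NGD5", "O14627", "P00697", "P42284"]
batches = list(batched(accessions, batch_size=2))
```

With five accessions and a batch size of 2, this produces three batches, the last of which holds the single remaining accession.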
search4af(uniprot_accs, min_sequence_length=None, max_sequence_length=None, limit=10000, timeout=1800, batch_size=10000)
Search for AlphaFold entries linked to the given UniProtKB accessions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `uniprot_accs` | `Collection[str]` | UniProt accessions. | required |
| `min_sequence_length` | `int \| None` | Minimum length of the canonical sequence. | `None` |
| `max_sequence_length` | `int \| None` | Maximum length of the canonical sequence. | `None` |
| `limit` | `int` | Maximum number of results to return. | `10000` |
| `timeout` | `int` | Timeout for the SPARQL query in seconds. | `1800` |
| `batch_size` | `int` | Number of UniProt accessions to process per batch. | `10000` |
Returns:
| Type | Description |
|---|---|
| `dict[str, set[str]]` | Dictionary with protein IDs as keys and sets of AlphaFold IDs as values. |
search4emdb(uniprot_accs, limit=10000, timeout=1800)
Search for EMDB entries linked to the given UniProtKB accessions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `uniprot_accs` | `Iterable[str]` | UniProt accessions. | required |
| `limit` | `int` | Maximum number of results to return. | `10000` |
| `timeout` | `int` | Timeout for the SPARQL query in seconds. | `1800` |
Returns:
| Type | Description |
|---|---|
| `dict[str, set[str]]` | Dictionary with protein IDs as keys and sets of EMDB IDs as values. |
search4interaction_partners(uniprot_accession, excludes=None, limit=10000, timeout=1800)
Search for interaction partners of a given UniProt accession using ComplexPortal database references.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `uniprot_accession` | `str` | UniProt accession to search interaction partners for. | required |
| `excludes` | `set[str] \| None` | Set of UniProt accessions to exclude from the results, for example already known interaction partners. If None, no complex members are excluded. | `None` |
| `limit` | `int` | Maximum number of results to return. | `10000` |
| `timeout` | `int` | Timeout for the SPARQL query in seconds. | `1800` |
Returns:
| Type | Description |
|---|---|
| `dict[str, set[str]]` | Dictionary with UniProt accessions of interaction partners as keys and sets of ComplexPortal entry IDs in which the interaction occurs as values. |
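The shape of the returned dictionary can be illustrated with a small local sketch. The function below derives a partner mapping from a hypothetical mapping of complex IDs to member accessions. It is not the module's implementation: `partners_from_complexes`, the placeholder identifiers, and the choice to leave the query protein out of its own partner list are all assumptions made for this example.

```python
def partners_from_complexes(query_accession, complexes, excludes=None):
    """Map each partner accession to the complex IDs it shares with the query.

    complexes: hypothetical mapping of complex ID -> set of member accessions.
    """
    # Assumption: the query protein is never reported as its own partner.
    excluded = set(excludes or ()) | {query_accession}
    partners: dict[str, set[str]] = {}
    for complex_id, members in complexes.items():
        if query_accession not in members:
            continue  # the query protein is not part of this complex
        for member in members - excluded:
            partners.setdefault(member, set()).add(complex_id)
    return partners


# Placeholder identifiers, not real ComplexPortal or UniProt entries.
complexes = {
    "CPX-1": {"P1", "P2", "P3"},
    "CPX-2": {"P1", "P3"},
    "CPX-3": {"P4", "P5"},
}
result = partners_from_complexes("P1", complexes, excludes={"P2"})
```

In this example `P3` is the only remaining partner and is associated with both complexes that contain `P1`; `P2` is dropped because it is listed in `excludes`.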
search4macromolecular_complexes(uniprot_accs, limit=10000, timeout=1800)
Search for macromolecular complexes by UniProtKB accessions.
Queries the UniProt SPARQL endpoint for references to and from the ComplexPortal database (https://www.ebi.ac.uk/complexportal/).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `uniprot_accs` | `Iterable[str]` | UniProt accessions. | required |
| `limit` | `int` | Maximum number of results to return. | `10000` |
| `timeout` | `int` | Timeout for the SPARQL query in seconds. | `1800` |
Returns:
| Type | Description |
|---|---|
| `list[ComplexPortalEntry]` | List of ComplexPortalEntry objects. |
search4pdb(uniprot_accs, limit=10000, timeout=1800, batch_size=10000)
Search for PDB entries linked to the given UniProtKB accessions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `uniprot_accs` | `Collection[str]` | UniProt accessions. | required |
| `limit` | `int` | Maximum number of results to return. | `10000` |
| `timeout` | `int` | Timeout for the SPARQL query in seconds. | `1800` |
| `batch_size` | `int` | Number of UniProt accessions to process per batch. | `10000` |
Returns:
| Type | Description |
|---|---|
| `PdbResults` | Dictionary with protein IDs as keys and sets of PDB results as values. |
search4uniprot(query, limit=10000, timeout=1800)
Search for UniProtKB entries based on the given query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Query` | Query object containing search parameters. | required |
| `limit` | `int` | Maximum number of results to return. | `10000` |
| `timeout` | `int` | Timeout for the SPARQL query in seconds. | `1800` |
Returns:
| Type | Description |
|---|---|
| `set[str]` | Set of UniProt accessions. |