uniprot

Module for searching UniProtKB using SPARQL.

ComplexPortalEntry dataclass

A ComplexPortal entry.

Parameters:

Name Type Description Default
query_protein str

The UniProt accession used to find the entry.

required
complex_id str

The ComplexPortal identifier (for example "CPX-1234").

required
complex_url str

The URL to the ComplexPortal entry.

required
complex_title str

The title of the complex.

required
members set[str]

UniProt accessions which are members of the complex.

required

Query dataclass

Search query for UniProtKB.

Parameters:

Name Type Description Default
taxon_id int | None

NCBI Taxon ID to filter results by organism (for example 9606 for human).

None
reviewed bool | None

Whether to filter results by reviewed status (True for reviewed, False for unreviewed).

None
subcellular_location_uniprot str | None

Subcellular location in UniProt format (for example "nucleus").

None
subcellular_location_go Annotated[set[str], Parameter(negative='')] | None

Subcellular location in GO format. Can be a single GO term (for example, ["GO:0005634"]) or a collection of GO terms (for example, ["GO:0005634", "GO:0005737"]), which are searched with OR logic.

None
molecular_function_go Annotated[set[str], Parameter(negative='')] | None

Molecular function in GO format. Can be a single GO term (for example, ["GO:0003674"]) or a collection of GO terms (for example, ["GO:0003674", "GO:0008150"]), which are searched with OR logic.

None
min_sequence_length int | None

Minimum length of the canonical sequence.

None
max_sequence_length int | None

Maximum length of the canonical sequence.

None
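The OR logic described for the GO parameters can be pictured as a SPARQL VALUES block: a protein matches when it is annotated with any one of the listed terms. The sketch below is only an illustration. The helper name `go_values_clause` and the `?go_term` variable are hypothetical, and it assumes GO terms are referenced by their OBO IRIs (`http://purl.obolibrary.org/obo/GO_...`); the module may build its query differently.

```python
def go_values_clause(go_terms: set[str], variable: str = "?go_term") -> str:
    """Render GO terms as a SPARQL VALUES block (hypothetical helper).

    Any one of the bound terms may match, which gives OR semantics.
    """
    iris = []
    # Sort so the generated clause is deterministic.
    for term in sorted(go_terms):
        prefix, _, number = term.partition(":")
        if prefix != "GO" or not number.isdigit():
            raise ValueError(f"Not a GO term: {term!r}")
        # Assumption: GO terms are identified by their OBO IRIs.
        iris.append(f"<http://purl.obolibrary.org/obo/GO_{number}>")
    return f"VALUES {variable} {{ {' '.join(iris)} }}"


clause = go_values_clause({"GO:0005634", "GO:0005737"})
print(clause)
```

Rejecting malformed terms before they reach the query text keeps arbitrary strings out of the generated SPARQL.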

UniprotDetails

Bases: TypedDict

Details of a UniProt entry.

Attributes:

Name Type Description
uniprot_accession str

UniProt accession.

uniprot_id str

UniProt ID (mnemonic).

sequence_length int

Length of the canonical sequence.

reviewed bool

Whether the entry is reviewed (Swiss-Prot) or unreviewed (TrEMBL).

protein_name str

Recommended protein name.

taxon_id int

NCBI Taxonomy ID of the organism.

taxon_name str

Scientific name of the organism.

map_uniprot_accessions2uniprot_details(uniprot_accessions, timeout=1800, batch_size=1000)

Map UniProt accessions to UniProt details by querying the UniProt SPARQL endpoint.

Example:

SPARQL query to get details for 7 UniProt entries, run on https://sparql.uniprot.org/sparql.

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX up:   <http://purl.uniprot.org/core/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT
    (?ac AS ?uniprot_accession)
    ?uniprot_id
    (STRAFTER(STR(?organism), "taxonomy/") AS ?taxon_id)
    ?taxon_name
    ?reviewed
    ?protein_name
    (STRLEN(?sequence) AS ?seq_length)
WHERE {
    # Input UniProt accessions
    VALUES (?ac) { ("P05067") ("A6NGD5") ("O14627") ("P00697") ("P42284") ("A0A0B5AC95") ("A0A0S2Z4R0")}
    BIND (IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?ac)) AS ?protein)
    ?protein a up:Protein .
    ?protein up:mnemonic ?uniprot_id .
    ?protein up:organism ?organism .
    ?organism up:scientificName ?taxon_name .
    ?protein up:reviewed ?reviewed .
    ?protein up:recommendedName/up:fullName ?protein_name .
    ?protein up:sequence ?isoform .
    ?isoform a up:Simple_Sequence .
    ?isoform rdf:value ?sequence .
    BIND (IRI(STRBEFORE(REPLACE(
        STR(?isoform), "http://purl.uniprot.org/isoforms/", "http://purl.uniprot.org/uniprot/"
    ), "-")) AS ?ac_of_isoform)
    FILTER(?ac_of_isoform = ?protein)
}

Parameters:

Name Type Description Default
uniprot_accessions Collection[str]

Collection of UniProt accessions.

required
timeout int

Timeout for the SPARQL query in seconds.

1800
batch_size int

Size of batches to process the UniProt accessions.

1000

Yields:

Type Description
Generator[UniprotDetails]

UniprotDetails objects in random order.
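The batch_size parameter suggests that the accessions are split into chunks, each sent as its own query. A minimal sketch of that idea follows, producing one VALUES clause per batch in the same shape as the example query. The helper names `batches` and `accessions_values_clause` are hypothetical and are not taken from the module.

```python
from collections.abc import Collection, Iterator
from itertools import islice


def batches(items: Collection[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    # islice returns an empty list once the iterator is exhausted, ending the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


def accessions_values_clause(accessions: list[str]) -> str:
    """Render accessions as a VALUES clause binding ?ac, one row per accession."""
    rows = " ".join(f'("{ac}")' for ac in accessions)
    return f"VALUES (?ac) {{ {rows} }}"


accessions = ["P05067", "A6NGD5", "O14627", "P00697", "P42284"]
clauses = [accessions_values_clause(b) for b in batches(accessions, batch_size=2)]
for clause in clauses:
    print(clause)
```

Smaller batches keep each query short at the cost of more round trips to the endpoint.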

search4af(uniprot_accs, min_sequence_length=None, max_sequence_length=None, limit=10000, timeout=1800, batch_size=10000)

Search for AlphaFold entries linked to the given UniProtKB accessions.

Parameters:

Name Type Description Default
uniprot_accs Collection[str]

UniProt accessions.

required
min_sequence_length int | None

Minimum length of the canonical sequence.

None
max_sequence_length int | None

Maximum length of the canonical sequence.

None
limit int

Maximum number of results to return.

10000
timeout int

Timeout for the SPARQL query in seconds.

1800
batch_size int

Size of batches to process the UniProt accessions.

10000

Returns:

Type Description
dict[str, set[str]]

Dictionary with protein IDs as keys and sets of AlphaFold IDs as values.
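When accessions are processed in batches, the per-batch dictionaries need to be combined into one result of the documented shape, `dict[str, set[str]]`. The sketch below shows one way that could be done. The function name `merge_results` and the identifier values are placeholders for illustration, not real search output.

```python
from collections import defaultdict
from collections.abc import Iterable


def merge_results(batch_results: Iterable[dict[str, set[str]]]) -> dict[str, set[str]]:
    """Union the ID sets of each accession across batches (hypothetical helper)."""
    merged: defaultdict[str, set[str]] = defaultdict(set)
    for result in batch_results:
        for accession, ids in result.items():
            # update() keeps IDs unique when an accession appears in several batches.
            merged[accession].update(ids)
    return dict(merged)


# Placeholder IDs; real values would come from the SPARQL endpoint.
merged = merge_results(
    [
        {"P05067": {"ID-A"}},
        {"P05067": {"ID-B"}, "O14627": {"ID-C"}},
    ]
)
print(merged)
```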

search4emdb(uniprot_accs, limit=10000, timeout=1800)

Search for EMDB entries linked to the given UniProtKB accessions.

Parameters:

Name Type Description Default
uniprot_accs Iterable[str]

UniProt accessions.

required
limit int

Maximum number of results to return.

10000
timeout int

Timeout for the SPARQL query in seconds.

1800

Returns:

Type Description
dict[str, set[str]]

Dictionary with protein IDs as keys and sets of EMDB IDs as values.

search4interaction_partners(uniprot_accession, excludes=None, limit=10000, timeout=1800)

Search for interaction partners of a given UniProt accession using ComplexPortal database references.

Parameters:

Name Type Description Default
uniprot_accession str

UniProt accession to search interaction partners for.

required
excludes set[str] | None

Set of UniProt accessions to exclude from the results, for example already known interaction partners. If None, no complex members are excluded.

None
limit int

Maximum number of results to return.

10000
timeout int

Timeout for the SPARQL query in seconds.

1800

Returns:

Type Description
dict[str, set[str]]

Dictionary with UniProt accessions of interaction partners as keys and sets of ComplexPortal entry IDs in which the interaction occurs as values.
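To make the return shape concrete, the sketch below derives interaction partners from complex membership held in memory. It is an illustration only: the function name `interaction_partners`, the complex IDs, and the protein labels are hypothetical, and dropping the query accession itself from the result is an assumption about the intended behaviour.

```python
from __future__ import annotations


def interaction_partners(
    query: str,
    complexes: dict[str, set[str]],
    excludes: set[str] | None = None,
) -> dict[str, set[str]]:
    """Map each partner of `query` to the complexes it shares with `query`."""
    # Assumption: the query protein is never reported as its own partner.
    excluded = set(excludes or ()) | {query}
    partners: dict[str, set[str]] = {}
    for complex_id, members in complexes.items():
        if query not in members:
            continue  # complex does not contain the query protein
        for member in members - excluded:
            partners.setdefault(member, set()).add(complex_id)
    return partners


# Placeholder complexes and members, not real ComplexPortal data.
complexes = {
    "CPX-0001": {"P1", "P2", "P3"},
    "CPX-0002": {"P1", "P3"},
    "CPX-0003": {"P4", "P5"},
}
all_partners = interaction_partners("P1", complexes)
new_partners = interaction_partners("P1", complexes, excludes={"P2"})
print(all_partners)
print(new_partners)
```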

search4macromolecular_complexes(uniprot_accs, limit=10000, timeout=1800)

Search for macromolecular complexes by UniProtKB accessions.

Queries the UniProt SPARQL endpoint for references to/from the https://www.ebi.ac.uk/complexportal/ database.

Parameters:

Name Type Description Default
uniprot_accs Iterable[str]

UniProt accessions.

required
limit int

Maximum number of results to return.

10000
timeout int

Timeout for the SPARQL query in seconds.

1800

Returns:

Type Description
list[ComplexPortalEntry]

List of ComplexPortalEntry objects.
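A SPARQL result is typically a flat list of rows, one per complex member, so building entries with a members set requires a grouping step. The sketch below shows one possible approach. `EntrySketch` is a stand-in that mirrors the fields documented for ComplexPortalEntry, and the row keys, URLs, and titles are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class EntrySketch:
    """Stand-in mirroring the documented ComplexPortalEntry fields."""

    query_protein: str
    complex_id: str
    complex_url: str
    complex_title: str
    members: set[str] = field(default_factory=set)


def group_rows(rows: list[dict[str, str]]) -> list[EntrySketch]:
    """Collapse one-row-per-member results into one entry per complex."""
    entries: dict[tuple[str, str], EntrySketch] = {}
    for row in rows:
        # One entry per (query protein, complex) pair.
        key = (row["query_protein"], row["complex_id"])
        if key not in entries:
            entries[key] = EntrySketch(
                row["query_protein"],
                row["complex_id"],
                row["complex_url"],
                row["complex_title"],
            )
        entries[key].members.add(row["member"])
    return list(entries.values())


# Placeholder rows; the key names are an assumed result layout.
rows = [
    {"query_protein": "P1", "complex_id": "CPX-0001", "complex_url": "https://example.org/CPX-0001",
     "complex_title": "Placeholder complex", "member": "P1"},
    {"query_protein": "P1", "complex_id": "CPX-0001", "complex_url": "https://example.org/CPX-0001",
     "complex_title": "Placeholder complex", "member": "P2"},
    {"query_protein": "P1", "complex_id": "CPX-0002", "complex_url": "https://example.org/CPX-0002",
     "complex_title": "Another placeholder", "member": "P3"},
]
entries = group_rows(rows)
print(entries)
```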

search4pdb(uniprot_accs, limit=10000, timeout=1800, batch_size=10000)

Search for PDB entries linked to the given UniProtKB accessions.

Parameters:

Name Type Description Default
uniprot_accs Collection[str]

UniProt accessions.

required
limit int

Maximum number of results to return.

10000
timeout int

Timeout for the SPARQL query in seconds.

1800
batch_size int

Size of batches to process the UniProt accessions.

10000

Returns:

Type Description
PdbResults

Dictionary with protein IDs as keys and sets of PDB results as values.

search4uniprot(query, limit=10000, timeout=1800)

Search for UniProtKB entries based on the given query.

Parameters:

Name Type Description Default
query Query

Query object containing search parameters.

required
limit int

Maximum number of results to return.

10000
timeout int

Timeout for the SPARQL query in seconds.

1800

Returns:

Type Description
set[str]

Set of UniProt accessions.
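As a rough picture of how query parameters might translate into SPARQL, the sketch below turns a few of the documented Query fields into graph patterns and filters. `QuerySketch`, `where_lines`, and `render` are hypothetical names, and the generated query is an assumption rather than the query the module sends. For brevity it also does not restrict the length filter to the canonical sequence, which the example query for UniProt details does by requiring `up:Simple_Sequence` and matching the isoform to the entry.

```python
from dataclasses import dataclass
from typing import Optional

PREFIXES = (
    "PREFIX up: <http://purl.uniprot.org/core/>\n"
    "PREFIX taxon: <http://purl.uniprot.org/taxonomy/>\n"
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
)


@dataclass
class QuerySketch:
    """Stand-in with a subset of the documented Query fields."""

    taxon_id: Optional[int] = None
    reviewed: Optional[bool] = None
    min_sequence_length: Optional[int] = None
    max_sequence_length: Optional[int] = None


def where_lines(query: QuerySketch) -> list[str]:
    """Emit one pattern or filter per field that is set; None means no filter."""
    lines = ["?protein a up:Protein ."]
    if query.taxon_id is not None:
        lines.append(f"?protein up:organism taxon:{query.taxon_id} .")
    if query.reviewed is not None:
        # SPARQL booleans are lowercase: true / false.
        lines.append(f"?protein up:reviewed {str(query.reviewed).lower()} .")
    if query.min_sequence_length is not None or query.max_sequence_length is not None:
        lines.append("?protein up:sequence ?isoform .")
        lines.append("?isoform rdf:value ?sequence .")
        if query.min_sequence_length is not None:
            lines.append(f"FILTER (STRLEN(?sequence) >= {query.min_sequence_length})")
        if query.max_sequence_length is not None:
            lines.append(f"FILTER (STRLEN(?sequence) <= {query.max_sequence_length})")
    return lines


def render(query: QuerySketch, limit: int = 10000) -> str:
    """Assemble the full query text with a LIMIT clause."""
    body = "\n".join("    " + line for line in where_lines(query))
    return f"{PREFIXES}\nSELECT DISTINCT ?protein\nWHERE {{\n{body}\n}}\nLIMIT {limit}"


human_reviewed = QuerySketch(taxon_id=9606, reviewed=True, max_sequence_length=500)
print(render(human_reviewed, limit=10))
```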