uniprot

Module for searching UniProtKB using SPARQL.

ComplexPortalEntry dataclass

A ComplexPortal entry.

Parameters:

Name Type Description Default
query_protein str

The UniProt accession used to find the entry.

required
complex_id str

The ComplexPortal identifier (for example "CPX-1234").

required
complex_url str

The URL to the ComplexPortal entry.

required
complex_title str

The title of the complex.

required
members set[str]

UniProt accessions which are members of the complex.

required

PdbChainLengthError

Bases: ValueError

Raised when a UniProt chain description does not yield a chain length.

PdbResult dataclass

Result of a PDB search in UniProtKB.

Parameters:

Name Type Description Default
id str

PDB ID (e.g., "1H3O").

required
method str

Method used for the PDB entry (e.g., "X-ray diffraction").

required
uniprot_chains str

Chains in UniProt format (e.g., "A/B=1-42,A/B=50-99").

required
resolution str | None

Resolution of the PDB entry (e.g., "2.0" for 2.0 Å). Optional.

None

chain cached property

The first chain listed in the UniProt chains (self.uniprot_chains).

chain_length cached property

The length of the chain, derived from the UniProt chains (self.uniprot_chains).
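As a rough illustration of how a chain ID and a chain length could be derived from a value such as "A/B=1-42,A/B=50-99", the sketch below parses the string by hand. The function names `first_chain` and `chain_length`, the choice to sum every listed residue range, and the use of a plain ValueError (standing in for PdbChainLengthError) are assumptions made for illustration; the documented properties may behave differently.

```python
def first_chain(uniprot_chains: str) -> str:
    """Return the first chain ID, e.g. 'A' for 'A/B=1-42,A/B=50-99'."""
    first_segment = uniprot_chains.split(",")[0]
    chain_ids, _, _ = first_segment.partition("=")
    return chain_ids.split("/")[0].strip()


def chain_length(uniprot_chains: str) -> int:
    """Sum the residue ranges (assumed behaviour); raise if none can be parsed."""
    total = 0
    for segment in uniprot_chains.split(","):
        _, sep, residue_range = segment.partition("=")
        start, dash, end = residue_range.partition("-")
        start, end = start.strip(), end.strip()
        if not sep or not dash or not start.isdigit() or not end.isdigit():
            # Stand-in for the PdbChainLengthError documented above.
            raise ValueError(f"No chain length in {uniprot_chains!r}")
        total += int(end) - int(start) + 1
    return total
```

With the example string from the `uniprot_chains` description, `first_chain` gives "A" and `chain_length` adds the 42 residues of the first range to the 50 residues of the second.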

Query dataclass

Search query for UniProtKB.

Parameters:

Name Type Description Default
taxon_id str | None

NCBI Taxon ID to filter results by organism (e.g., "9606" for human).

required
reviewed bool | None

Whether to filter results by reviewed status (True for reviewed, False for unreviewed).

None
subcellular_location_uniprot str | None

Subcellular location in UniProt format (e.g., "nucleus").

None
subcellular_location_go list[str] | None

Subcellular location in GO format. Can be a single GO term (e.g., ["GO:0005634"]) or a collection of GO terms (e.g., ["GO:0005634", "GO:0005737"]).

None
molecular_function_go list[str] | None

Molecular function in GO format. Can be a single GO term (e.g., ["GO:0003674"]) or a collection of GO terms (e.g., ["GO:0003674", "GO:0003677"]).

None
min_sequence_length int | None

Minimum length of the canonical sequence.

None
max_sequence_length int | None

Maximum length of the canonical sequence.

None

UniprotDetails dataclass

Details of a UniProt entry.

Parameters:

Name Type Description Default
uniprot_accession str

UniProt accession.

required
uniprot_id str

UniProt ID (mnemonic).

required
sequence_length int

Length of the canonical sequence.

required
reviewed bool

Whether the entry is reviewed (Swiss-Prot) or unreviewed (TrEMBL).

required
protein_name str

Recommended protein name.

required
taxon_id int

NCBI Taxonomy ID of the organism.

required
taxon_name str

Scientific name of the organism.

required

filter_pdb_results_on_chain_length(pdb_results, min_residues, max_residues, keep_invalid=False)

Filter PDB results based on chain length.

Parameters:

Name Type Description Default
pdb_results PdbResults

Dictionary with protein IDs as keys and sets of PDB results as values.

required
min_residues int | None

Minimum number of residues required in the chain mapped to the UniProt accession. If None, no minimum is applied.

required
max_residues int | None

Maximum number of residues allowed in the chain mapped to the UniProt accession. If None, no maximum is applied.

required
keep_invalid bool

If True, PDB results whose chain length could not be determined are kept. If False, they are filtered out. A warning is logged whenever the length cannot be determined.

False

Returns:

Type Description
PdbResults

Filtered dictionary with protein IDs as keys and sets of PDB results as values.
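The filtering rules described above can be sketched as follows. This is not the library's code: `filter_on_chain_length` is a hypothetical name, each PDB result is reduced to a `(pdb_id, chain_length)` tuple with `None` marking an undeterminable length, the bounds are assumed to be inclusive, and proteins left without any result are assumed to be dropped from the output.

```python
import logging

logger = logging.getLogger(__name__)


def filter_on_chain_length(results, min_residues, max_residues, keep_invalid=False):
    """Keep entries whose chain length lies within the given bounds.

    `results` maps a protein ID to a set of (pdb_id, chain_length) tuples.
    """
    filtered = {}
    for protein_id, entries in results.items():
        kept = set()
        for pdb_id, length in entries:
            if length is None:
                # Length could not be determined: warn, keep only on request.
                logger.warning("No chain length for %s of %s", pdb_id, protein_id)
                if keep_invalid:
                    kept.add((pdb_id, length))
                continue
            if min_residues is not None and length < min_residues:
                continue
            if max_residues is not None and length > max_residues:
                continue
            kept.add((pdb_id, length))
        if kept:  # assumption: proteins with no remaining results are omitted
            filtered[protein_id] = kept
    return filtered


# Made-up identifiers, for illustration only.
example = {"prot1": {("pdbA", 58), ("pdbB", None)}, "prot2": {("pdbC", 10)}}
```

Calling `filter_on_chain_length(example, 50, 100)` keeps only the 58-residue entry of "prot1", while passing `keep_invalid=True` also retains the entry whose length is unknown.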

map_uniprot_accessions2uniprot_details(uniprot_accessions, timeout=1800, batch_size=1000)

Map UniProt accessions to UniProt details by querying the UniProt SPARQL endpoint.

Example:

SPARQL query to get details for 7 UniProt entries, run on https://sparql.uniprot.org/sparql.

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX up:   <http://purl.uniprot.org/core/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT
    (?ac AS ?uniprot_accession)
    ?uniprot_id
    (STRAFTER(STR(?organism), "taxonomy/") AS ?taxon_id)
    ?taxon_name
    ?reviewed
    ?protein_name
    (STRLEN(?sequence) AS ?seq_length)
WHERE {
    # Input UniProt accessions
    VALUES (?ac) { ("P05067") ("A6NGD5") ("O14627") ("P00697") ("P42284") ("A0A0B5AC95") ("A0A0S2Z4R0")}
    BIND (IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?ac)) AS ?protein)
    ?protein a up:Protein .
    ?protein up:mnemonic ?uniprot_id .
    ?protein up:organism ?organism .
    ?organism up:scientificName ?taxon_name .
    ?protein up:reviewed ?reviewed .
    ?protein up:recommendedName/up:fullName ?protein_name .
    ?protein up:sequence ?isoform .
    ?isoform a up:Simple_Sequence .
    ?isoform rdf:value ?sequence .
    BIND (IRI(STRBEFORE(REPLACE(
        STR(?isoform), "http://purl.uniprot.org/isoforms/", "http://purl.uniprot.org/uniprot/"
    ), "-")) AS ?ac_of_isoform)
    FILTER(?ac_of_isoform = ?protein)
}

Parameters:

Name Type Description Default
uniprot_accessions Collection[str]

Collection of UniProt accessions.

required
timeout int

Timeout for the SPARQL query in seconds.

1800
batch_size int

Size of batches to process the UniProt accessions.

1000

Yields:

Type Description
Generator[UniprotDetails]

UniprotDetails objects, in no guaranteed order.
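The `batch_size` parameter suggests that the accessions are split into chunks, each of which fills the `VALUES (?ac) { ... }` clause of one query. A minimal sketch of that idea is given below; the helper names `batched` and `values_clause` are hypothetical, and how the function actually batches and sends its queries is not specified on this page.

```python
from collections.abc import Collection, Iterator
from itertools import islice


def batched(accessions: Collection[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `batch_size` accessions."""
    iterator = iter(accessions)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def values_clause(accessions: list[str]) -> str:
    """Render accessions as a SPARQL VALUES clause bound to ?ac."""
    rows = " ".join(f'("{ac}")' for ac in accessions)
    return f"VALUES (?ac) {{ {rows} }}"


# One clause per batch; each would be placed in its own query.
clauses = [values_clause(b) for b in batched(["P05067", "A6NGD5", "O14627"], 2)]
```

Because each batch is queried separately, results from different batches can arrive interleaved, which is one reason callers should not rely on the order of the yielded objects.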

search4af(uniprot_accs, min_sequence_length=None, max_sequence_length=None, limit=10000, timeout=1800, batch_size=10000)

Search for AlphaFold entries for the given UniProtKB accessions.

Parameters:

Name Type Description Default
uniprot_accs Collection[str]

UniProt accessions.

required
min_sequence_length int | None

Minimum length of the canonical sequence.

None
max_sequence_length int | None

Maximum length of the canonical sequence.

None
limit int

Maximum number of results to return.

10000
timeout int

Timeout for the SPARQL query in seconds.

1800
batch_size int

Size of batches to process the UniProt accessions.

10000

Returns:

Type Description
dict[str, set[str]]

Dictionary with protein IDs as keys and sets of AlphaFold IDs as values.

search4emdb(uniprot_accs, limit=10000, timeout=1800)

Search for EMDB entries for the given UniProtKB accessions.

Parameters:

Name Type Description Default
uniprot_accs Iterable[str]

UniProt accessions.

required
limit int

Maximum number of results to return.

10000
timeout int

Timeout for the SPARQL query in seconds.

1800

Returns:

Type Description
dict[str, set[str]]

Dictionary with protein IDs as keys and sets of EMDB IDs as values.

search4interaction_partners(uniprot_accession, excludes=None, limit=10000, timeout=1800)

Search for interaction partners of a given UniProt accession using ComplexPortal database references.

Parameters:

Name Type Description Default
uniprot_accession str

UniProt accession to search interaction partners for.

required
excludes set[str] | None

Set of UniProt accessions to exclude from the results, for example already known interaction partners. If None, no complex members are excluded.

None
limit int

Maximum number of results to return.

10000
timeout int

Timeout for the SPARQL query in seconds.

1800

Returns:

Type Description
dict[str, set[str]]

Dictionary with UniProt accessions of interaction partners as keys and sets of ComplexPortal entry IDs in which the interaction occurs as values.

search4macromolecular_complexes(uniprot_accs, limit=10000, timeout=1800)

Search for macromolecular complexes by UniProtKB accessions.

Queries the UniProt SPARQL endpoint for references to/from the https://www.ebi.ac.uk/complexportal/ database.

Parameters:

Name Type Description Default
uniprot_accs Iterable[str]

UniProt accessions.

required
limit int

Maximum number of results to return.

10000
timeout int

Timeout for the SPARQL query in seconds.

1800

Returns:

Type Description
list[ComplexPortalEntry]

List of ComplexPortalEntry objects.

search4pdb(uniprot_accs, limit=10000, timeout=1800, batch_size=10000)

Search for PDB entries for the given UniProtKB accessions.

Parameters:

Name Type Description Default
uniprot_accs Collection[str]

UniProt accessions.

required
limit int

Maximum number of results to return.

10000
timeout int

Timeout for the SPARQL query in seconds.

1800
batch_size int

Size of batches to process the UniProt accessions.

10000

Returns:

Type Description
PdbResults

Dictionary with protein IDs as keys and sets of PDB results as values.

search4uniprot(query, limit=10000, timeout=1800)

Search for UniProtKB entries based on the given query.

Parameters:

Name Type Description Default
query Query

Query object containing search parameters.

required
limit int

Maximum number of results to return.

10000
timeout int

Timeout for the SPARQL query in seconds.

1800

Returns:

Type Description
set[str]

Set of UniProt accessions.