protein-quest
Python package to search/retrieve/filter proteins and protein structures.
It uses
- Uniprot Sparql endpoint to search for proteins and their measured or predicted 3D structures.
- Uniprot taxonomy to search for taxonomy.
- QuickGO to search for Gene Ontology terms.
- gemmi to work with macromolecular models.
- dask-distributed to compute in parallel.
The package is used by
An example workflow:
graph TB;
taxonomy[/Search taxon/] -. taxon_ids .-> searchuniprot[/Search UniprotKB/]
goterm[/Search GO term/] -. go_ids .-> searchuniprot[/Search UniprotKB/]
searchuniprot --> |uniprot_accessions|searchpdbe[/Search PDBe/]
searchuniprot --> |uniprot_accessions|searchaf[/Search Alphafold/]
searchuniprot -. uniprot_accessions .-> searchemdb[/Search EMDB/]
searchintactionpartners[/Search interaction partners/] -.-x |uniprot_accessions|searchuniprot
searchcomplexes[/Search complexes/]
searchpdbe -->|pdb_ids|fetchpdbe[Retrieve PDBe]
searchaf --> |uniprot_accessions|fetchad(Retrieve AlphaFold)
searchemdb -. emdb_ids .->fetchemdb[Retrieve EMDB]
fetchpdbe -->|mmcif_files| chainfilter{{Filter on chain of uniprot}}
chainfilter --> |mmcif_files| residuefilter{{Filter on chain length}}
fetchad -->|mmcif_files| confidencefilter{{Filter out low confidence}}
confidencefilter --> |mmcif_files| ssfilter{{Filter on secondary structure}}
residuefilter --> |mmcif_files| ssfilter
ssfilter -. mmcif_files .-> convert2cif([Convert to cif])
classDef dashedBorder stroke-dasharray: 5 5;
goterm:::dashedBorder
taxonomy:::dashedBorder
searchemdb:::dashedBorder
fetchemdb:::dashedBorder
searchintactionpartners:::dashedBorder
searchcomplexes:::dashedBorder
convert2cif:::dashedBorder
(Dotted nodes and edges are side-quests.)
Install
pip install protein-quest
Or to use the latest development version:
pip install git+https://github.com/haddocking/protein-quest.git
Usage
The main entry point is the protein-quest
command line tool which has multiple subcommands to perform actions.
To use programmaticly, see the Jupyter notebooks and API documentation.
While downloading or copying files it uses a global cache (located at ~/.cache/protein-quest
) and hardlinks to save disk space and improve speed.
This behavior can be customized with the --no-cache
, --cache-dir
, and --copy-method
command line arguments.
Search Uniprot accessions
protein-quest search uniprot \
--taxon-id 9606 \
--reviewed \
--subcellular-location-uniprot nucleus \
--subcellular-location-go GO:0005634 \
--molecular-function-go GO:0003677 \
--limit 100 \
uniprot_accs.txt
Search for PDBe structures of uniprot accessions
protein-quest search pdbe uniprot_accs.txt pdbe.csv
pdbe.csv
file is written containing the the PDB id and chain of each uniprot accession.
Search for Alphafold structures of uniprot accessions
protein-quest search alphafold uniprot_accs.txt alphafold.csv
Search for EMDB structures of uniprot accessions
protein-quest search emdb uniprot_accs.txt emdbs.csv
To retrieve PDB structure files
protein-quest retrieve pdbe pdbe.csv downloads-pdbe/
To retrieve AlphaFold structure files
protein-quest retrieve alphafold alphafold.csv downloads-af/
For each entry downloads the summary.json and cif file.
To retrieve EMDB volume files
protein-quest retrieve emdb emdbs.csv downloads-emdb/
To filter AlphaFold structures on confidence
Filter AlphaFoldDB structures based on confidence (pLDDT). Keeps entries with requested number of residues which have a confidence score above the threshold. Also writes pdb files with only those residues.
protein-quest filter confidence \
--confidence-threshold 50 \
--min-residues 100 \
--max-residues 1000 \
./downloads-af ./filtered
To filter PDBe files on chain of uniprot accession
Make PDBe files smaller by only keeping first chain of found uniprot entry and renaming to chain A.
protein-quest filter chain \
pdbe.csv \
./downloads-pdbe ./filtered-chains
To filter PDBe files on nr of residues
protein-quest filter residue \
--min-residues 100 \
--max-residues 1000 \
./filtered-chains ./filtered
To filter on secondary structure
To filter on structure being mostly alpha helices and have no beta sheets.
protein-quest filter secondary-structure \
--ratio-min-helix-residues 0.5 \
--ratio-max-sheet-residues 0.0 \
--write-stats filtered-ss/stats.csv \
./filtered-chains ./filtered-ss
Search Taxonomy
protein-quest search taxonomy "Homo sapiens" -
Search Gene Ontology (GO)
You might not know what the identifier of a Gene Ontology term is at protein-quest search uniprot
.
You can use following command to search for a Gene Ontology (GO) term.
protein-quest search go --limit 5 --aspect cellular_component apoptosome -
Search for interaction partners
Use https://www.ebi.ac.uk/complexportal to find interaction partners of given UniProt accession.
protein-quest search interaction-partners Q05471 interaction-partners-of-Q05471.txt
The interaction-partners-of-Q05471.txt
file contains uniprot accessions (one per line).
Search for complexes
Given Uniprot accessions search for macromolecular complexes at https://www.ebi.ac.uk/complexportal and return the complex entries and their members.
echo Q05471 | protein-quest search complexes - complexes.csv
The complexes.csv
looks like
query_protein,complex_id,complex_url,complex_title,members
Q05471,CPX-2122,https://www.ebi.ac.uk/complexportal/complex/CPX-2122,Swr1 chromatin remodelling complex,P31376;P35817;P38326;P53201;P53930;P60010;P80428;Q03388;Q03433;Q03940;Q05471;Q06707;Q12464;Q12509
Convert structure files to .cif format
Some tools (for example powerfit) only work with .cif
files and not *.cif.gz
or *.bcif
files.
protein-quest convert --output-dir ./filtered-cif ./filtered-ss
Model Context Protocol (MCP) server
Protein quest can also help LLMs like Claude Sonnet 4 by providing a set of tools for protein structures.
To run mcp server you have to install the mcp
extra with:
pip install protein-quest[mcp]
The server can be started with:
protein-quest mcp
The mcp server contains an prompt template to search/retrieve/filter candidate structures.
Contributing
For development information and contribution guidelines, please see CONTRIBUTING.md.