Retrieve information from UniProt with UniProt.ws
UniProt database
UniProt is a database that provides high-quality, hand-curated, reliable, and freely accessible information about protein sequences, and it is integrated with other databases. If you want to find, for example, all the genes reported in an organism and their functions, the UniProt.ws package is a great option for getting that information from the UniProt database.
Load required libraries
As always, you have to load the libraries you are going to use, in this case just UniProt.ws and tidyverse.
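# Install BiocManager if it is missing
if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
# Install UniProt.ws separately, so it is installed even when BiocManager already exists
if (!requireNamespace("UniProt.ws", quietly = TRUE)) {
  BiocManager::install("UniProt.ws")
}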
library(UniProt.ws)
library(tidyverse)
Using UniProt.ws
First, you have to specify the organism you are interested in, for example Homo sapiens neanderthalensis. You can list all the available species with the availableUniprotSpecies() function, and you can filter the search with its pattern argument.
availableUniprotSpecies(pattern = "Homo")
taxon ID | Species name |
---|---|
742918 | Human associated cyclovirus 1 (isolate Homo sapiens/Pakistan/PK5510/2007) |
1301368 | Homostichanthus duerdeni |
111945 | Homoroselaps lacteus |
410822 | Homona magnanima |
63221 | Homo sapiens neanderthalensis |
2774376 | Homoeomma sp. |
9606 | Homo sapiens |
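The output above includes species such as Homona magnanima because the pattern matches anywhere in the species name. The following is a minimal base R sketch of that behaviour on a small local copy of the table, assuming the pattern is matched as a regular expression (as with grepl()). The `species` data frame and its column names are hypothetical and are not part of UniProt.ws.

```r
# Local copy of the table printed above (column names are illustrative)
species <- data.frame(
  taxon_id = c(742918, 1301368, 111945, 410822, 63221, 2774376, 9606),
  species_name = c(
    "Human associated cyclovirus 1 (isolate Homo sapiens/Pakistan/PK5510/2007)",
    "Homostichanthus duerdeni",
    "Homoroselaps lacteus",
    "Homona magnanima",
    "Homo sapiens neanderthalensis",
    "Homoeomma sp.",
    "Homo sapiens"
  ),
  stringsAsFactors = FALSE
)

# "Homo" matches every name that contains the substring
loose <- species[grepl("Homo", species$species_name), ]

# Anchoring the pattern at the start narrows the result to Homo sapiens entries
strict <- species[grepl("^Homo sapiens", species$species_name), ]

# Look up the taxon ID for the organism of interest
neanderthal_id <- species$taxon_id[
  species$species_name == "Homo sapiens neanderthalensis"
]
```

A more specific pattern keeps the species list short, which makes it easier to pick the right taxon ID.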
Then use the taxon ID to create a UniProt.ws object for Homo sapiens neanderthalensis.
my_org <- UniProt.ws(taxId = c(63221))
What are you looking for?
The keytypes() function lists the identifier types you can use as keys, columns() lists the kinds of information you can retrieve, and keys() returns the identifiers of a given key type, in this case the Entrez gene IDs.
# List the identifier types that can be used as keys
head(keytypes(my_org))
## [1] "AARHUS/GHENT-2DPAGE" "AGD" "ALLERGOME"
## [4] "ARACHNOSERVER" "BIOCYC" "CGD"
UniProt.ws::columns(my_org)
## [1] "3D" "AARHUS/GHENT-2DPAGE"
## [3] "AGD" "ALLERGOME"
## [5] "ARACHNOSERVER" "BIOCYC"
## [7] "CGD" "CITATION"
## [9] "CLEANEX" "CLUSTERS"
## [11] "COMMENTS" "CONOSERVER"
## [13] "CYGD" "DATABASE(PDB)"
## [15] "DATABASE(PFAM)" "DICTYBASE"
## [17] "DIP" "DISPROT"
## [19] "DMDM" "DNASU"
## [21] "DOMAIN" "DOMAINS"
## [23] "DRUGBANK" "EC"
## [25] "ECHOBASE" "ECO2DBASE"
## [27] "ECOGENE" "EGGNOG"
## [29] "EMBL/GENBANK/DDBJ" "EMBL/GENBANK/DDBJ_CDS"
## [31] "ENSEMBL" "ENSEMBL_GENOMES"
## [33] "ENSEMBL_GENOMES PROTEIN" "ENSEMBL_GENOMES TRANSCRIPT"
## [35] "ENSEMBL_PROTEIN" "ENSEMBL_TRANSCRIPT"
## [37] "ENTREZ_GENE" "ENTRY-NAME"
## [39] "EUHCVDB" "EUPATHDB"
## [41] "EXISTENCE" "FAMILIES"
## [43] "FEATURES" "FLYBASE"
## [45] "FUNCTION" "GENECARDS"
## [47] "GENEFARM" "GENEID"
## [49] "GENENAME" "GENES"
## [51] "GENETREE" "GENOLIST"
## [53] "GENOMEREVIEWS" "GENOMERNAI"
## [55] "GERMONLINE" "GI_NUMBER*"
## [57] "GO" "GO-ID"
## [59] "H-INVDB" "HGNC"
## [61] "HOGENOM" "HPA"
## [63] "HSSP" "ID"
## [65] "INTERACTOR" "INTERPRO"
## [67] "KEGG" "KEYWORD-ID"
## [69] "KEYWORDS" "KO"
## [71] "LAST-MODIFIED" "LEGIOLIST"
## [73] "LENGTH" "LEPROMA"
## [75] "MAIZEGDB" "MEROPS"
## [77] "MGI" "MIM"
## [79] "MINT" "NEXTBIO"
## [81] "NEXTPROT" "OMA"
## [83] "ORGANISM" "ORGANISM-ID"
## [85] "ORPHANET" "ORTHODB"
## [87] "PATHWAY" "PATRIC"
## [89] "PDB" "PEROXIBASE"
## [91] "PHARMGKB" "PHOSSITE"
## [93] "PIR" "POMBASE"
## [95] "PPTASEDB" "PROTCLUSTDB"
## [97] "PROTEIN-NAMES" "PSEUDOCAP"
## [99] "REACTOME" "REBASE"
## [101] "REFSEQ_NUCLEOTIDE" "REFSEQ_PROTEIN"
## [103] "REVIEWED" "RGD"
## [105] "SCORE" "SEQUENCE"
## [107] "SGD" "SUBCELLULAR-LOCATIONS"
## [109] "TAIR" "TAXONOMIC-LINEAGE"
## [111] "TCDB" "TIGR"
## [113] "TOOLS" "TUBERCULIST"
## [115] "UCSC" "UNIPARC"
## [117] "UNIPATHWAY" "UNIPROTKB"
## [119] "UNIPROTKB_ID" "UNIREF100"
## [121] "UNIREF50" "UNIREF90"
## [123] "VECTORBASE" "VERSION"
## [125] "VIRUS-HOSTS" "WORLD-2DPAGE"
## [127] "WORMBASE" "WORMBASE_PROTEIN"
## [129] "WORMBASE_TRANSCRIPT" "XENBASE"
## [131] "ZFIN"
columns <- c("ORGANISM","GENENAME","FUNCTION","SEQUENCE")
my_keys <- keys(my_org, "ENTREZ_GENE")
length(my_keys)
## [1] 13
The length() function shows how many results you will get, in this case 13 genes.
Retrieving the results
Finally, use the select() function to retrieve your results.
res <- UniProt.ws::select(my_org,
                          keys = my_keys,
                          columns = columns,
                          keytype = "ENTREZ_GENE")
Sometimes duplicates appear; you can remove them with distinct() from the dplyr package.
res %>%
  distinct(ENTREZ_GENE, .keep_all = TRUE) %>%
  head(2)
ENTREZ_GENE | ORGANISM | GENENAME | FUNCTION | SEQUENCE |
---|---|---|---|---|
6775065 | Homo sapiens neanderthalensis (Neanderthal) | CYTB | FUNCTION: Component of the ubiquinol-cytochrome c reductase complex (complex III or cytochrome b-c1 complex) that is part of the mitochondrial respiratory chain. The b-c1 complex mediates electron transfer from ubiquinol to cytochrome c. Contributes to the generation of a proton gradient across the mitochondrial membrane that is then used for ATP synthesis. {ECO:0000256|ARBA:ARBA00002566, ECO:0000256|RuleBase:RU362117}. | MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDASTAFSSIAHITRDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSKTWNIGIILLLATMATAFMGYVLPWGQMSFWGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFTFHFILPFIIAALAALHLLFLHETGSNNPLGITSHSDKITFHPYYTIKDALGLFLFLLSLMTLTLLSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTILRSVPNKLGGVLALLLSILILAMIPILHVSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFIIIGQVASVLYFTTILILMPTISLIENKMLKWA |
6775076 | Homo sapiens neanderthalensis (Neanderthal) | ND3 | FUNCTION: Core subunit of the mitochondrial membrane respiratory chain NADH dehydrogenase (Complex I) which catalyzes electron transfer from NADH through the respiratory chain, using ubiquinone as an electron acceptor. Essential for the catalytic activity of complex I. {ECO:0000256|RuleBase:RU003640}. | MNFALILMINTLLALLLMIITFWLPQLNGYMEKSTPYECGFDPMSPARVPFSMKFFLVAITFLLFDLEIALLLPLPWALQTTNLPLMVTSSLLLIIILALSLAYEWLQKGLDWAE |
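To make the effect of distinct() easier to see, here is a small sketch on a toy data frame. The values are made up for illustration and are not real UniProt output; only the column names mirror the result above.

```r
library(dplyr)

# Toy data standing in for a select() result (illustrative values only)
toy <- data.frame(
  ENTREZ_GENE = c("1", "1", "2"),
  GENENAME    = c("geneA", "geneA", "geneB"),
  SEQUENCE    = c("MTPM", "MTPMR", "MNFA"),
  stringsAsFactors = FALSE
)

# Keep the first row for each ENTREZ_GENE and retain all other columns
deduplicated <- toy %>%
  distinct(ENTREZ_GENE, .keep_all = TRUE)

# Without .keep_all = TRUE, only the ENTREZ_GENE column is returned
ids_only <- toy %>%
  distinct(ENTREZ_GENE)
```

Note that distinct() keeps the first row it finds for each key, so if the duplicated rows differ in other columns, you may want to inspect them before discarding any.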