Retrieve information from UniProt with UniProt.ws

UniProt database

UniProt is an amazing database that provides high-quality, hand-curated, reliable and freely accessible information about protein sequences, and it is integrated with other databases. If you want to find, for example, all the genes reported in an organism and their functions, the UniProt.ws package is a great option for getting that information from the UniProt database.

Load required libraries

As always, you have to load the libraries you are going to use; in this case, just UniProt.ws and tidyverse. UniProt.ws is a Bioconductor package, so it is installed through BiocManager.

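# Install BiocManager from CRAN if it is missing
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
# Install UniProt.ws from Bioconductor if it is missing
if (!requireNamespace("UniProt.ws", quietly = TRUE)) {
    BiocManager::install("UniProt.ws")
}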

library(UniProt.ws)
library(tidyverse)

Using UniProt.ws

First, you have to specify the organism you are interested in, for example Homo sapiens neanderthalensis. You can see all the available species with the availableUniprotSpecies() function, and you can filter the search with its pattern argument.

availableUniprotSpecies(pattern = "Homo")
taxon ID   Species name
742918     Human associated cyclovirus 1 (isolate Homo sapiens/Pakistan/PK5510/2007)
1301368    Homostichanthus duerdeni
111945     Homoroselaps lacteus
410822     Homona magnanima
63221      Homo sapiens neanderthalensis
2774376    Homoeomma sp.
9606       Homo sapiens

Then you pass the taxon ID to UniProt.ws() to indicate that you want information about Homo sapiens neanderthalensis.

my_org <- UniProt.ws(taxId = c(63221))

What are you looking for?

With the keytypes() function you can see the identifier types (source databases) that can be used as keys, and with columns() you can see the different kinds of information you can retrieve. With keys() you can get the IDs available for a given key type, here the Entrez gene IDs.

# Look at the key types and the kinds of information you can retrieve
head(keytypes(my_org))
## [1] "AARHUS/GHENT-2DPAGE" "AGD"                 "ALLERGOME"          
## [4] "ARACHNOSERVER"       "BIOCYC"              "CGD"
UniProt.ws::columns(my_org)
##   [1] "3D"                         "AARHUS/GHENT-2DPAGE"       
##   [3] "AGD"                        "ALLERGOME"                 
##   [5] "ARACHNOSERVER"              "BIOCYC"                    
##   [7] "CGD"                        "CITATION"                  
##   [9] "CLEANEX"                    "CLUSTERS"                  
##  [11] "COMMENTS"                   "CONOSERVER"                
##  [13] "CYGD"                       "DATABASE(PDB)"             
##  [15] "DATABASE(PFAM)"             "DICTYBASE"                 
##  [17] "DIP"                        "DISPROT"                   
##  [19] "DMDM"                       "DNASU"                     
##  [21] "DOMAIN"                     "DOMAINS"                   
##  [23] "DRUGBANK"                   "EC"                        
##  [25] "ECHOBASE"                   "ECO2DBASE"                 
##  [27] "ECOGENE"                    "EGGNOG"                    
##  [29] "EMBL/GENBANK/DDBJ"          "EMBL/GENBANK/DDBJ_CDS"     
##  [31] "ENSEMBL"                    "ENSEMBL_GENOMES"           
##  [33] "ENSEMBL_GENOMES PROTEIN"    "ENSEMBL_GENOMES TRANSCRIPT"
##  [35] "ENSEMBL_PROTEIN"            "ENSEMBL_TRANSCRIPT"        
##  [37] "ENTREZ_GENE"                "ENTRY-NAME"                
##  [39] "EUHCVDB"                    "EUPATHDB"                  
##  [41] "EXISTENCE"                  "FAMILIES"                  
##  [43] "FEATURES"                   "FLYBASE"                   
##  [45] "FUNCTION"                   "GENECARDS"                 
##  [47] "GENEFARM"                   "GENEID"                    
##  [49] "GENENAME"                   "GENES"                     
##  [51] "GENETREE"                   "GENOLIST"                  
##  [53] "GENOMEREVIEWS"              "GENOMERNAI"                
##  [55] "GERMONLINE"                 "GI_NUMBER*"                
##  [57] "GO"                         "GO-ID"                     
##  [59] "H-INVDB"                    "HGNC"                      
##  [61] "HOGENOM"                    "HPA"                       
##  [63] "HSSP"                       "ID"                        
##  [65] "INTERACTOR"                 "INTERPRO"                  
##  [67] "KEGG"                       "KEYWORD-ID"                
##  [69] "KEYWORDS"                   "KO"                        
##  [71] "LAST-MODIFIED"              "LEGIOLIST"                 
##  [73] "LENGTH"                     "LEPROMA"                   
##  [75] "MAIZEGDB"                   "MEROPS"                    
##  [77] "MGI"                        "MIM"                       
##  [79] "MINT"                       "NEXTBIO"                   
##  [81] "NEXTPROT"                   "OMA"                       
##  [83] "ORGANISM"                   "ORGANISM-ID"               
##  [85] "ORPHANET"                   "ORTHODB"                   
##  [87] "PATHWAY"                    "PATRIC"                    
##  [89] "PDB"                        "PEROXIBASE"                
##  [91] "PHARMGKB"                   "PHOSSITE"                  
##  [93] "PIR"                        "POMBASE"                   
##  [95] "PPTASEDB"                   "PROTCLUSTDB"               
##  [97] "PROTEIN-NAMES"              "PSEUDOCAP"                 
##  [99] "REACTOME"                   "REBASE"                    
## [101] "REFSEQ_NUCLEOTIDE"          "REFSEQ_PROTEIN"            
## [103] "REVIEWED"                   "RGD"                       
## [105] "SCORE"                      "SEQUENCE"                  
## [107] "SGD"                        "SUBCELLULAR-LOCATIONS"     
## [109] "TAIR"                       "TAXONOMIC-LINEAGE"         
## [111] "TCDB"                       "TIGR"                      
## [113] "TOOLS"                      "TUBERCULIST"               
## [115] "UCSC"                       "UNIPARC"                   
## [117] "UNIPATHWAY"                 "UNIPROTKB"                 
## [119] "UNIPROTKB_ID"               "UNIREF100"                 
## [121] "UNIREF50"                   "UNIREF90"                  
## [123] "VECTORBASE"                 "VERSION"                   
## [125] "VIRUS-HOSTS"                "WORLD-2DPAGE"              
## [127] "WORMBASE"                   "WORMBASE_PROTEIN"          
## [129] "WORMBASE_TRANSCRIPT"        "XENBASE"                   
## [131] "ZFIN"
columns <- c("ORGANISM","GENENAME","FUNCTION","SEQUENCE")
my_keys <- keys(my_org, "ENTREZ_GENE")

length(my_keys)
## [1] 13

With the length() function you can see how many keys were found, and therefore how many results to expect; in this case, 13 genes.
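Before sending the query, it can help to confirm that every column you plan to request is really offered by the object. The following is a minimal sketch in base R, not part of the UniProt.ws API: `available_columns` is a hand-typed stand-in for the vector returned by columns(my_org), and "GENE_SYMBOL" is a made-up name added only to show a failed check.

```r
# Stand-in for columns(my_org): a few names copied from the listing above.
# In a live session you would use the vector returned by columns(my_org).
available_columns <- c("ENTREZ_GENE", "FUNCTION", "GENENAME",
                       "ORGANISM", "SEQUENCE")

# Columns we would like to retrieve; "GENE_SYMBOL" is a hypothetical name
# included only to show what a failed check looks like.
wanted_columns <- c("ORGANISM", "GENENAME", "FUNCTION", "SEQUENCE",
                    "GENE_SYMBOL")

# setdiff() returns the requested names that are not available
missing_columns <- setdiff(wanted_columns, available_columns)

# intersect() keeps only the names that can safely be requested
valid_columns <- intersect(wanted_columns, available_columns)

print(missing_columns)
print(valid_columns)
```

If `missing_columns` is not empty, you can correct the names before running the query instead of discovering the problem from an error message.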

Retrieving the results

Finally, with the select() function you can get your results. You pass it the keys, the columns you want, and the key type those keys belong to.

res <- UniProt.ws::select(my_org, 
              keys = my_keys, 
              columns = columns,
              keytype = "ENTREZ_GENE")

Sometimes duplicates appear; you can remove them with distinct() from the dplyr package.

res %>% 
  distinct(ENTREZ_GENE, .keep_all = TRUE) %>% 
  head(2)
ENTREZ_GENE  6775065
ORGANISM     Homo sapiens neanderthalensis (Neanderthal)
GENENAME     CYTB
FUNCTION     FUNCTION: Component of the ubiquinol-cytochrome c reductase complex (complex III or cytochrome b-c1 complex) that is part of the mitochondrial respiratory chain. The b-c1 complex mediates electron transfer from ubiquinol to cytochrome c. Contributes to the generation of a proton gradient across the mitochondrial membrane that is then used for ATP synthesis. {ECO:0000256|ARBA:ARBA00002566, ECO:0000256|RuleBase:RU362117}.
SEQUENCE     MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDASTAFSSIAHITRDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSKTWNIGIILLLATMATAFMGYVLPWGQMSFWGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFTFHFILPFIIAALAALHLLFLHETGSNNPLGITSHSDKITFHPYYTIKDALGLFLFLLSLMTLTLLSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTILRSVPNKLGGVLALLLSILILAMIPILHVSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFIIIGQVASVLYFTTILILMPTISLIENKMLKWA

ENTREZ_GENE  6775076
ORGANISM     Homo sapiens neanderthalensis (Neanderthal)
GENENAME     ND3
FUNCTION     FUNCTION: Core subunit of the mitochondrial membrane respiratory chain NADH dehydrogenase (Complex I) which catalyzes electron transfer from NADH through the respiratory chain, using ubiquinone as an electron acceptor. Essential for the catalytic activity of complex I. {ECO:0000256|RuleBase:RU003640}.
SEQUENCE     MNFALILMINTLLALLLMIITFWLPQLNGYMEKSTPYECGFDPMSPARVPFSMKFFLVAITFLLFDLEIALLLPLPWALQTTNLPLMVTSSLLLIIILALSLAYEWLQKGLDWAE
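To see what the `.keep_all` argument of distinct() changes, here is a small sketch on a toy table. The values are made up for illustration; only the ENTREZ_GENE and GENENAME column names mirror the real result, and SOURCE is a hypothetical column added to make the kept row visible.

```r
library(dplyr)

# Toy table with made-up values; it is not real UniProt output
toy_res <- data.frame(
  ENTREZ_GENE = c("1001", "1001", "1002"),
  GENENAME    = c("geneA", "geneA", "geneB"),
  SOURCE      = c("first hit", "second hit", "only hit")
)

# Without .keep_all, only the ENTREZ_GENE column is returned
ids_only <- distinct(toy_res, ENTREZ_GENE)

# With .keep_all = TRUE, the first row for each ENTREZ_GENE is kept in full
deduplicated <- distinct(toy_res, ENTREZ_GENE, .keep_all = TRUE)

print(ids_only)
print(deduplicated)
```

Because distinct() keeps the first row it meets for each ID, the order of the rows decides which duplicate survives; if that matters, sort the table first.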
Diego Sierra Ramírez
Msc. in Biological Science / Data analyst
