Retrieve information from UniProt with UniProt.ws

Last updated on Mar 31, 2022

UniProt database

UniProt is an excellent database that provides high-quality, hand-curated, reliable and freely accessible information about protein sequences, and it is integrated with other databases. If you want to find, for example, all the genes reported in an organism and their functions, the UniProt.ws package is a great option for getting that information from the UniProt database.

Load required libraries

As always, you have to load the libraries you are going to use; in this case, just UniProt.ws and tidyverse.

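# Install BiocManager if needed, then UniProt.ws from Bioconductor
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
if (!requireNamespace("UniProt.ws", quietly = TRUE)) {
    BiocManager::install("UniProt.ws")
}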

library(UniProt.ws)
library(tidyverse)

Using `UniProt.ws`

First, you have to specify the organism you are interested in, for example Homo sapiens neanderthalensis. You can see all the available species with the availableUniprotSpecies() function, and its pattern argument lets you filter the search.

availableUniprotSpecies(pattern = "Homo")

taxon ID	Species name
742918	Human associated cyclovirus 1 (isolate Homo sapiens/Pakistan/PK5510/2007)
1301368	Homostichanthus duerdeni
111945	Homoroselaps lacteus
410822	Homona magnanima
63221	Homo sapiens neanderthalensis
2774376	Homoeomma sp.
9606	Homo sapiens
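Because the pattern matches any species name that contains "Homo", the result also lists unrelated organisms. The sketch below shows one way to pull out a single taxon ID by exact name. It is only an illustration: the toy table is typed by hand from the output above so the snippet runs offline, and the column names taxon_id and species_name are assumptions, so check the names of the data frame you actually get back.

```r
# Toy copy of part of the table above. In a real session this would be
# the object returned by availableUniprotSpecies(pattern = "Homo"),
# whose column names may differ from the ones assumed here.
species <- data.frame(
  taxon_id = c(1301368, 63221, 9606),
  species_name = c("Homostichanthus duerdeni",
                   "Homo sapiens neanderthalensis",
                   "Homo sapiens"),
  stringsAsFactors = FALSE
)

# Keep only the row whose name matches exactly, then pull out its ID.
target <- "Homo sapiens neanderthalensis"
neanderthal_id <- species$taxon_id[species$species_name == target]
neanderthal_id
```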

Then you use the taxon ID to indicate that you want information about Homo sapiens neanderthalensis.

my_org <- UniProt.ws(taxId = c(63221))

What are you looking for?

The keytypes() function lists the identifier types (source databases) you can use as keys, and columns() lists the different kinds of information you can retrieve. With keys() you can get the IDs available for a given key type, in this case the Entrez gene IDs.

# look at the identifier types you can use as keys
head(keytypes(my_org))

## [1] "AARHUS/GHENT-2DPAGE" "AGD"                 "ALLERGOME"          
## [4] "ARACHNOSERVER"       "BIOCYC"              "CGD"

UniProt.ws::columns(my_org)

##   [1] "3D"                         "AARHUS/GHENT-2DPAGE"       
##   [3] "AGD"                        "ALLERGOME"                 
##   [5] "ARACHNOSERVER"              "BIOCYC"                    
##   [7] "CGD"                        "CITATION"                  
##   [9] "CLEANEX"                    "CLUSTERS"                  
##  [11] "COMMENTS"                   "CONOSERVER"                
##  [13] "CYGD"                       "DATABASE(PDB)"             
##  [15] "DATABASE(PFAM)"             "DICTYBASE"                 
##  [17] "DIP"                        "DISPROT"                   
##  [19] "DMDM"                       "DNASU"                     
##  [21] "DOMAIN"                     "DOMAINS"                   
##  [23] "DRUGBANK"                   "EC"                        
##  [25] "ECHOBASE"                   "ECO2DBASE"                 
##  [27] "ECOGENE"                    "EGGNOG"                    
##  [29] "EMBL/GENBANK/DDBJ"          "EMBL/GENBANK/DDBJ_CDS"     
##  [31] "ENSEMBL"                    "ENSEMBL_GENOMES"           
##  [33] "ENSEMBL_GENOMES PROTEIN"    "ENSEMBL_GENOMES TRANSCRIPT"
##  [35] "ENSEMBL_PROTEIN"            "ENSEMBL_TRANSCRIPT"        
##  [37] "ENTREZ_GENE"                "ENTRY-NAME"                
##  [39] "EUHCVDB"                    "EUPATHDB"                  
##  [41] "EXISTENCE"                  "FAMILIES"                  
##  [43] "FEATURES"                   "FLYBASE"                   
##  [45] "FUNCTION"                   "GENECARDS"                 
##  [47] "GENEFARM"                   "GENEID"                    
##  [49] "GENENAME"                   "GENES"                     
##  [51] "GENETREE"                   "GENOLIST"                  
##  [53] "GENOMEREVIEWS"              "GENOMERNAI"                
##  [55] "GERMONLINE"                 "GI_NUMBER*"                
##  [57] "GO"                         "GO-ID"                     
##  [59] "H-INVDB"                    "HGNC"                      
##  [61] "HOGENOM"                    "HPA"                       
##  [63] "HSSP"                       "ID"                        
##  [65] "INTERACTOR"                 "INTERPRO"                  
##  [67] "KEGG"                       "KEYWORD-ID"                
##  [69] "KEYWORDS"                   "KO"                        
##  [71] "LAST-MODIFIED"              "LEGIOLIST"                 
##  [73] "LENGTH"                     "LEPROMA"                   
##  [75] "MAIZEGDB"                   "MEROPS"                    
##  [77] "MGI"                        "MIM"                       
##  [79] "MINT"                       "NEXTBIO"                   
##  [81] "NEXTPROT"                   "OMA"                       
##  [83] "ORGANISM"                   "ORGANISM-ID"               
##  [85] "ORPHANET"                   "ORTHODB"                   
##  [87] "PATHWAY"                    "PATRIC"                    
##  [89] "PDB"                        "PEROXIBASE"                
##  [91] "PHARMGKB"                   "PHOSSITE"                  
##  [93] "PIR"                        "POMBASE"                   
##  [95] "PPTASEDB"                   "PROTCLUSTDB"               
##  [97] "PROTEIN-NAMES"              "PSEUDOCAP"                 
##  [99] "REACTOME"                   "REBASE"                    
## [101] "REFSEQ_NUCLEOTIDE"          "REFSEQ_PROTEIN"            
## [103] "REVIEWED"                   "RGD"                       
## [105] "SCORE"                      "SEQUENCE"                  
## [107] "SGD"                        "SUBCELLULAR-LOCATIONS"     
## [109] "TAIR"                       "TAXONOMIC-LINEAGE"         
## [111] "TCDB"                       "TIGR"                      
## [113] "TOOLS"                      "TUBERCULIST"               
## [115] "UCSC"                       "UNIPARC"                   
## [117] "UNIPATHWAY"                 "UNIPROTKB"                 
## [119] "UNIPROTKB_ID"               "UNIREF100"                 
## [121] "UNIREF50"                   "UNIREF90"                  
## [123] "VECTORBASE"                 "VERSION"                   
## [125] "VIRUS-HOSTS"                "WORLD-2DPAGE"              
## [127] "WORMBASE"                   "WORMBASE_PROTEIN"          
## [129] "WORMBASE_TRANSCRIPT"        "XENBASE"                   
## [131] "ZFIN"

columns <- c("ORGANISM","GENENAME","FUNCTION","SEQUENCE")
my_keys <- keys(my_org, "ENTREZ_GENE")

length(my_keys)

## [1] 13

With the length() function you can check how many keys you have, and so how many genes you will query: in this case, 13.
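Before sending the query, it can help to confirm that every column name you request is actually offered, since a typo is easy to make in a list this long. The sketch below uses a hypothetical helper, missing_columns(), and a short hand-copied subset of the columns() listing so that it runs without a connection.

```r
# Hypothetical helper: report requested names that are not available.
missing_columns <- function(requested, available) {
  setdiff(requested, available)
}

# A few names copied from the columns() listing above. In a real session
# use available <- UniProt.ws::columns(my_org) instead.
available <- c("FUNCTION", "GENENAME", "GENES", "ORGANISM", "SEQUENCE")

wanted <- c("ORGANISM", "GENENAME", "FUNCTION", "SEQUENCE")
missing_columns(wanted, available)            # nothing missing
missing_columns(c(wanted, "SEQ"), available)  # "SEQ" is a deliberate typo
```

An empty result means all requested columns exist; anything returned is a name to correct before calling select().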

Retrieving the results

Finally, with the select() function you can get your results.

res <- UniProt.ws::select(my_org, 
              keys = my_keys, 
              columns = columns,
              keytype = "ENTREZ_GENE")
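Since select() sends the request to the UniProt web service, the call can fail when the service or your connection is unavailable. One possible way to keep a longer script from stopping is to wrap the call in tryCatch(). The wrapper below, safe_query(), is a hypothetical helper written for this post, and the demonstration uses stand-in expressions so it runs offline.

```r
# Hypothetical wrapper: evaluate a query and return NULL instead of
# stopping the script if it raises an error.
safe_query <- function(expr) {
  tryCatch(
    expr,
    error = function(e) {
      message("Query failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Offline demonstration with stand-in expressions.
ok <- safe_query(nchar("MNFALILM"))                # evaluates normally
failed <- safe_query(stop("service unreachable"))  # caught, gives NULL
```

In a real session you would pass the select() call itself as the argument and check whether the result is NULL before continuing.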

Sometimes duplicates appear; you can remove them with distinct() from the dplyr package.

res %>%
  distinct(ENTREZ_GENE, .keep_all = TRUE) %>%
  head(2)

ENTREZ_GENE	ORGANISM	GENENAME	FUNCTION	SEQUENCE
6775065	Homo sapiens neanderthalensis (Neanderthal)	CYTB	FUNCTION: Component of the ubiquinol-cytochrome c reductase complex (complex III or cytochrome b-c1 complex) that is part of the mitochondrial respiratory chain. The b-c1 complex mediates electron transfer from ubiquinol to cytochrome c. Contributes to the generation of a proton gradient across the mitochondrial membrane that is then used for ATP synthesis. {ECO:0000256|ARBA:ARBA00002566, ECO:0000256|RuleBase:RU362117}.	MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDASTAFSSIAHITRDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSKTWNIGIILLLATMATAFMGYVLPWGQMSFWGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFTFHFILPFIIAALAALHLLFLHETGSNNPLGITSHSDKITFHPYYTIKDALGLFLFLLSLMTLTLLSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTILRSVPNKLGGVLALLLSILILAMIPILHVSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFIIIGQVASVLYFTTILILMPTISLIENKMLKWA
6775076	Homo sapiens neanderthalensis (Neanderthal)	ND3	FUNCTION: Core subunit of the mitochondrial membrane respiratory chain NADH dehydrogenase (Complex I) which catalyzes electron transfer from NADH through the respiratory chain, using ubiquinone as an electron acceptor. Essential for the catalytic activity of complex I. {ECO:0000256|RuleBase:RU003640}.	MNFALILMINTLLALLLMIITFWLPQLNGYMEKSTPYECGFDPMSPARVPFSMKFFLVAITFLLFDLEIALLLPLPWALQTTNLPLMVTSSLLLIIILALSLAYEWLQKGLDWAE
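The FUNCTION column comes back with a leading "FUNCTION: " label and a trailing evidence tag in braces, which you may want to strip before using the text. The sketch below shows one way to tidy a result like this with base R only. The data frame res_toy is a hand-made stand-in with shortened text, truncated sequences and one deliberately duplicated gene ID, and LENGTH_AA is a column name invented for this example.

```r
# Toy stand-in for `res`, with shortened text, truncated sequences and
# one deliberately duplicated gene ID so the snippet runs offline.
res_toy <- data.frame(
  ENTREZ_GENE = c("6775065", "6775065", "6775076"),
  GENENAME = c("CYTB", "CYTB", "ND3"),
  FUNCTION = c(
    "FUNCTION: Component of complex III. {ECO:0000256|RuleBase:RU362117}.",
    "FUNCTION: Component of complex III. {ECO:0000256|RuleBase:RU362117}.",
    "FUNCTION: Core subunit of Complex I. {ECO:0000256|RuleBase:RU003640}."
  ),
  SEQUENCE = c("MTPMRKINPL", "MTPMRKINPL", "MNFALILM"),
  stringsAsFactors = FALSE
)

# Base R counterpart of distinct(ENTREZ_GENE, .keep_all = TRUE):
# keep the first row seen for each gene ID.
res_clean <- res_toy[!duplicated(res_toy$ENTREZ_GENE), ]

# Drop the leading "FUNCTION: " label and the trailing {ECO:...}. tag.
res_clean$FUNCTION <- sub("^FUNCTION: ", "", res_clean$FUNCTION)
res_clean$FUNCTION <- sub(" [{]ECO:[^}]*[}][.]$", "", res_clean$FUNCTION)

# Number of amino acids in each (here truncated) sequence.
res_clean$LENGTH_AA <- nchar(res_clean$SEQUENCE)

res_clean[, c("GENENAME", "FUNCTION", "LENGTH_AA")]
```

The same three steps should carry over to the real res object, although entries with several evidence tags or several FUNCTION paragraphs may need a more careful pattern.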

R Bioinformatics

Diego Sierra Ramírez

Msc. in Biological Science / Data analyst
