Accessing and downloading from the Entrez database system

Last updated on Mar 24, 2022

Entrez is a biological database system that integrates protein and nucleotide sequence data and related information in a single place. It links more than 40 databases that are interrelated with GenBank, EMBL and DDBJ, and it is the first place to visit when you want relevant and reliable information or datasets.

You can visit the official Entrez website, where a user-friendly interface lets you search for and explore the information you want. The purpose of this post, however, is to use the rentrez package to download sequences efficiently from different databases, which is especially important when you need to download many sequences.

The first thing to do is install the rentrez package and then load it.

install.packages("rentrez")

library(rentrez)

Entrez has many databases, each holding specialized data. To list all the available databases, use the entrez_dbs() function.

entrez_dbs()

##  [1] "pubmed"          "protein"         "nuccore"         "ipg"            
##  [5] "nucleotide"      "structure"       "genome"          "annotinfo"      
##  [9] "assembly"        "bioproject"      "biosample"       "blastdbinfo"    
## [13] "books"           "cdd"             "clinvar"         "gap"            
## [17] "gapplus"         "grasp"           "dbvar"           "gene"           
## [21] "gds"             "geoprofiles"     "homologene"      "medgen"         
## [25] "mesh"            "ncbisearch"      "nlmcatalog"      "omim"           
## [29] "orgtrack"        "pmc"             "popset"          "proteinclusters"
## [33] "pcassay"         "protfam"         "biosystems"      "pccompound"     
## [37] "pcsubstance"     "seqannot"        "snp"             "sra"            
## [41] "taxonomy"        "biocollections"  "gtr"

If you want to know the purpose of a certain database, go to Entrez databases. Another way to find additional information is with the entrez_db_summary() and entrez_db_searchable() functions.

entrez_db_summary("nucleotide")

##  DbName: nuccore
##  MenuName: Nucleotide
##  Description: Core Nucleotide db
##  DbBuild: Build220322-1505m.1
##  Count: 491207219
##  LastUpdate: 2022/03/24 06:24

entrez_db_searchable("nucleotide")

## Searchable fields for database 'nuccore'
##   ALL 	 All terms from all searchable fields 
##   UID 	 Unique number assigned to each sequence 
##   FILT 	 Limits the records 
##   WORD 	 Free text associated with record 
##   TITL 	 Words in definition line 
##   KYWD 	 Nonstandardized terms provided by submitter 
##   AUTH 	 Author(s) of publication 
##   JOUR 	 Journal abbreviation of publication 
##   VOL 	 Volume number of publication 
##   ISS 	 Issue number of publication 
##   PAGE 	 Page number(s) of publication 
##   ORGN 	 Scientific and common names of organism, and all higher levels of taxonomy 
##   ACCN 	 Accession number of sequence 
##   PACC 	 Does not include retired secondary accessions 
##   GENE 	 Name of gene associated with sequence 
##   PROT 	 Name of protein associated with sequence 
##   ECNO 	 EC number for enzyme or CAS registry number 
##   PDAT 	 Date sequence added to GenBank 
##   MDAT 	 Date of last update 
##   SUBS 	 CAS chemical name or MEDLINE Substance Name 
##   PROP 	 Classification by source qualifiers and molecule type 
##   SQID 	 String identifier for sequence 
##   GPRJ 	 BioProject 
##   SLEN 	 Length of sequence 
##   FKEY 	 Feature annotated on sequence 
##   PORG 	 Scientific and common names of primary organism, and all higher levels of taxonomy 
##   COMP 	 Component accessions for an assembly 
##   ASSM 	 Assembly 
##   DIV 	 Division 
##   STRN 	 Strain 
##   ISOL 	 Isolate 
##   CULT 	 Cultivar 
##   BRD 	 Breed 
##   BIOS 	 BioSample

Searching inside PubMed

Personally, I prefer the RISmed package for retrieving information from PubMed (Tutorial here) when I want a general search that returns “large” items such as paper abstracts. You can also use rentrez to find the IDs, look them up, and build more precise search terms.

To search the PubMed database, we first need to know which fields we can query, using entrez_db_searchable().

entrez_db_searchable("pubmed")

## Searchable fields for database 'pubmed'
##   ALL 	 All terms from all searchable fields 
##   UID 	 Unique number assigned to publication 
##   FILT 	 Limits the records 
##   TITL 	 Words in title of publication 
##   WORD 	 Free text associated with publication 
##   MESH 	 Medical Subject Headings assigned to publication 
##   MAJR 	 MeSH terms of major importance to publication 
##   AUTH 	 Author(s) of publication 
##   JOUR 	 Journal abbreviation of publication 
##   AFFL 	 Author's institutional affiliation and address 
##   ECNO 	 EC number for enzyme or CAS registry number 
##   SUBS 	 CAS chemical name or MEDLINE Substance Name 
##   PDAT 	 Date of publication 
##   EDAT 	 Date publication first accessible through Entrez 
##   VOL 	 Volume number of publication 
##   PAGE 	 Page number(s) of publication 
##   PTYP 	 Type of publication (e.g., review) 
##   LANG 	 Language of publication 
##   ISS 	 Issue number of publication 
##   SUBH 	 Additional specificity for MeSH term 
##   SI 	 Cross-reference from publication to other databases 
##   MHDA 	 Date publication was indexed with MeSH terms 
##   TIAB 	 Free text associated with Abstract/Title 
##   OTRM 	 Other terms associated with publication 
##   INVR 	 Investigator 
##   COLN 	 Corporate Author of publication 
##   CNTY 	 Country of publication 
##   PAPX 	 MeSH pharmacological action pre-explosions 
##   GRNT 	 NIH Grant Numbers 
##   MDAT 	 Date of last modification 
##   CDAT 	 Date of completion 
##   PID 	 Publisher ID 
##   FAUT 	 First Author of publication 
##   FULL 	 Full Author Name(s) of publication 
##   FINV 	 Full name of investigator 
##   TT 	 Words in transliterated title of publication 
##   LAUT 	 Last Author of publication 
##   PPDT 	 Date of print publication 
##   EPDT 	 Date of Electronic publication 
##   LID 	 ELocation ID 
##   CRDT 	 Date publication first accessible through Entrez 
##   BOOK 	 ID of the book that contains the document 
##   ED 	 Section's Editor 
##   ISBN 	 ISBN 
##   PUBN 	 Publisher's name 
##   AUCL 	 Author Cluster ID 
##   EID 	 Extended PMID 
##   DSO 	 Additional text from the summary 
##   AUID 	 Author Identifier 
##   PS 	 Personal Name as Subject 
##   COIS 	 Conflict of Interest Statements

These are all the fields you can query. For example, suppose you need papers about an organism such as “Phoneutria boliviensis”, that is, papers that include “Phoneutria boliviensis” in their title. To find them, we use the [TITL] tag that we obtained previously with entrez_db_searchable().

papers_phoneutria <- entrez_search(db="pubmed", term="Phoneutria boliviensis[TITL]")

papers_phoneutria

## Entrez search result with 3 hits (object contains 3 IDs and no web_history object)
##  Search term (as translated):  Phoneutria boliviensis[TITL]

There are 3 papers that include “Phoneutria boliviensis” in their title.

You can also create a more detailed query using AND, OR and multiple tags. In the next example we search for all the papers in PubMed that have “Phoneutria boliviensis” in their title AND “diet” in the free text associated with the publication (the [WORD] field).

papers_phoneutria_diet <- entrez_search(db="pubmed", term="Phoneutria boliviensis[TITL] AND diet[WORD]")
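
When a query combines several tagged phrases, it can be easier to assemble the term string programmatically. The following is a minimal sketch in base R; build_term is a hypothetical helper introduced here for illustration, not a rentrez function, and it only builds the string without contacting NCBI.

```r
# Hypothetical helper (not part of rentrez): attach a field tag to each
# phrase and join the tagged pieces with a boolean operator.
build_term <- function(phrases, fields, operator = "AND") {
  stopifnot(length(phrases) == length(fields))
  tagged <- paste0(phrases, "[", fields, "]")
  paste(tagged, collapse = paste0(" ", operator, " "))
}

# Rebuild the query used above from its parts
term <- build_term(c("Phoneutria boliviensis", "diet"), c("TITL", "WORD"))
print(term)
```

The resulting string can then be passed as the term argument of entrez_search().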

Downloading sequences with `rentrez`

When you download sequences from any database, you need to decide how many sequences you want and which ones. To answer the first question, use the retmax argument of the entrez_search() function, which limits the number of IDs returned and therefore the number of sequences you will work with.

neurotoxins <- entrez_search(db = "protein", term = "(Bilateria AND neurotoxin)", retmax = 50)

neurotoxins$ids

##  [1] "662033932"  "297747292"  "295842492"  "223468574"  "56682964"  
##  [6] "18859383"   "18765729"   "5032137"    "4759184"    "2203845921"
## [11] "2055117517" "1939884204" "1915463589" "1059045644" "980958836" 
## [16] "952543491"  "952543490"  "902763185"  "387942508"  "387942507" 
## [21] "387942506"  "387942502"  "387942501"  "387942500"  "387912837" 
## [26] "317411723"  "317411690"  "306755727"  "298286916"  "298286829" 
## [31] "296439765"  "292494995"  "254790169"  "254790168"  "254790165" 
## [36] "254790163"  "254790157"  "166220141"  "166220138"  "160358672" 
## [41] "122129826"  "118572281"  "115311697"  "115311688"  "115311686" 
## [46] "114152898"  "114152794"  "114149260"  "74848311"   "62900065"

Now that you have the IDs, you can review some information with entrez_summary(), or you can download the sequences directly using entrez_fetch().

First50_neurotoxins <- entrez_fetch(db = "protein", id = neurotoxins$ids, rettype = "fasta") #Downloading

cat(strwrap(First50_neurotoxins), sep="\n") # Print the result

## >NP_001284367.1 vesicle-associated membrane protein 1 isoform 4 [Homo
## sapiens]
## MSAPAQPPAEGTEGTAPGGGPPGPPPNMTSNRRLQQTQAQVEEVVDIIRVNVDKVLERDQKLSELDDRAD
## ALQAGASQFESSAAKLKRKYWWKNCKMMIMLGAICAIIVVVIVRRG
## 
## >NP_001172112.1 vesicle-associated membrane protein 7 isoform 3 [Homo
## sapiens]
## MAILFAVVARGTTILAKHAWCGGNFLEVTEQILAKIPSENNKLTYSHGNYLFHYICQDRIVYLCITDDDF
## ERSRAFNFLNEIKKRFQTTYGSRAQTALPYAMNSEFSSVLAAQLKHHSENKGLDKVMETQAQVDELKGIM
## VRNIVCHLQNYQQKSCSSHVYEEPQAHYYHHHRINCVHLYHCFTSLWWIYMAKLCEEIGKKKLPLTKDMR
## EQGVKSNPCDSSLSHTDRWYLPVSSTLFSLFKILFHASRFIFVLSTSLFL
## 
## >NP_001171511.1 syntaxin-3 isoform 2 [Homo sapiens]
## MKDRLEQLKAKQLTQDDDTDAVEIAIDNTAFMDEFFSEIEETRLNIDKISEHVEEAKKLYSIILSAPIPE
## PKTKDDLEQLTTEIKKRANNVRNKLKSMEKHIEEDEVRSSADLRIRKSQHSVLSRKFVEVMTKYNEAQVD
## FRERSKGRIQRQLEITGKKTTDEELEEMLESGNPAIFTSGIIDSQISKQALSEIEGRHKDIVRLESSIKE
## LHDMFMDIAMLVENQGEMLDNIELNVMHTVDHVEKARDETKKAVKYQSQARKKLISLQTGVATLVFR
## 
## >NP_001138621.1 vesicle-associated membrane protein 7 isoform 2 [Homo
## sapiens]
## MAILFAVVARGTTILAKHAWCGGNFLEDFERSRAFNFLNEIKKRFQTTYGSRAQTALPYAMNSEFSSVLA
## AQLKHHSENKGLDKVMETQAQVDELKGIMVRNIDLVAQRGERLELLIDKTENLVDSSVTFKTTSRNLARA
## MCMKNLKLTIIIIIVSIVFIYIIVSPLCGGFTWPSCVKK
## 
## >NP_001008530.1 legumain isoform 1 preproprotein [Homo sapiens]
## MVWKVAVFLSVALGIGAVPIDDPEDGGKHWVVIVAGSNGWYNYRHQADACHAYQIIHRNGIPDEQIVVMM
## YDDIAYSEDNPTPGIVINRPNGTDVYQGVPKDYTGEDVTPQNFLAVLRGDAEAVKGIGSGKVLKSGPQDH
## VFIYFTDHGSTGILVFPNEDLHVKDLNETIHYMYKHKMYRKMVFYIEACESGSMMNHLPDNINVYATTAA
## NPRESSYACYYDEKRSTYLGDWYSVNWMEDSDVEDLTKETLHKQYHLVKSHTNTSHVMQYGNKTISTMKV
## MQFQGMKRKASSPVPLPPVTHLDLTPSPDVPLTIMKRKLMNTNDLEESRQLTEEIQRHLDARHLIEKSVR
## KIVSLLAASEAEVEQLLSERAPLTGHSCYPEALLHFRTHCFNWHSPTYEYALRHLYVLVNLCEKPYPLHR
## IKLSMDHVCLGHY
## 
## >NP_571830.1 sodium-dependent dopamine transporter [Danio rerio]
## MPMLRGRPAVTHTRTRTHTHMSSVSGSSSAAGPREVELVLVKEQNGVQFTSSSLRNPGAHSHTHTHTHTH
## PSGQQRETWGKKIDFLLSVIGFAVDLANVWRFPYLCYKNGGGAFLVPYLLFMVIAGMPLFYMELALGQYN
## REGAAGVWKICPIFKGVGFTVILISLYVGSYYNVIIAWALFYLFSSFSGELPWIHCNNTWNSPNCSDPNA
## TLLNDTYKTTPALEYFERGVLHVHESSGIDDLGAPRWQLTACLAVVIVVLYFSLWKGVKTSGKVVWITAT
## MPYVVLTVLLLRGVTLPGAIDGIKAYLSVDFLRLYDAQVWIEAATQICFSLGVGFGVLIAFSSYNKFSNN
## CYRDAIITSSINSLTSFFSGFVIFSFLGYMSQKHNVALDKVATDGPGLVFIIYPEAIATLPGSSVWAVIF
## FIMLLTLGIDSAMGGMESVITGLIDEFKFLHKHRELFTLFIVVSTFLISLICVTNGGIYVFTLLDHFAAG
## TSILFGVLIEAIGIAWFYGVDRFSDDIEEMIGQRPGLYWRLCWKFVSPCFLLFMVVVSFATFNPPKYGSY
## YFPTWATMVGWCLSISSMIMVPLYAFYKFCSLPGSFCDKLAYAITPETDHHLVERGEVRQFTLHHWLVV
## 
## >NP_003816.2 synaptosomal-associated protein 23 isoform SNAP23A [Homo
## sapiens]
## MDNLSSEEIQQRAHQITDESLESTRRILGLAIESQDAGIKTITMLDEQKEQLNRIEEGLDQINKDMRETE
## KTLTELNKCCGLCVCPCNRTKNFESGKAYKTTWGDGGENSPCNVVSKQPGPVTNGQLQQPTTGAASGGYI
## KRITNDAREDEMEENLTQVGSILGNLKDMALNIGNEIDAQNPQIKRITDKADTNRDRIDIANARAKKLID
## S
## 
## >NP_005629.1 vesicle-associated membrane protein 7 isoform 1 [Homo
## sapiens]
## MAILFAVVARGTTILAKHAWCGGNFLEVTEQILAKIPSENNKLTYSHGNYLFHYICQDRIVYLCITDDDF
## ERSRAFNFLNEIKKRFQTTYGSRAQTALPYAMNSEFSSVLAAQLKHHSENKGLDKVMETQAQVDELKGIM
## VRNIDLVAQRGERLELLIDKTENLVDSSVTFKTTSRNLARAMCMKNLKLTIIIIIVSIVFIYIIVSPLCG
## GFTWPSCVKK
## 
## >NP_004168.1 syntaxin-3 isoform 1 [Homo sapiens]
## MKDRLEQLKAKQLTQDDDTDAVEIAIDNTAFMDEFFSEIEETRLNIDKISEHVEEAKKLYSIILSAPIPE
## PKTKDDLEQLTTEIKKRANNVRNKLKSMEKHIEEDEVRSSADLRIRKSQHSVLSRKFVEVMTKYNEAQVD
## FRERSKGRIQRQLEITGKKTTDEELEEMLESGNPAIFTSGIIDSQISKQALSEIEGRHKDIVRLESSIKE
## LHDMFMDIAMLVENQGEMLDNIELNVMHTVDHVEKARDETKKAVKYQSQARKKLIIIIVLVVVLLGILAL
## IIGLSVGLN
## 
## >sp|P0DV30.1|SCXT2_MESMA RecName: Full=Sodium channel neurotoxin BmK
## NT2; AltName: Full=Alpha-scorpion toxin
## VRDAYIAKPENCVYHCAGNEGCNNLCTCNGAT
## 
## >sp|P0DUK8.1|KAX3V_MESMA RecName: Full=Toxin BmK NSPK; AltName:
## Full=Buthus martensii Karsch neurite-stimulating peptide targeting Kv
## channels VGKNVICIHSGQCLIPCIDAGMRFGICKNGICDCTPKG
## 
## >sp|P0DQN8.1|DEL1A_HOTJU RecName: Full=Delta-buthitoxin-Hj1a;
## Short=Delta-BUTX-Hj1a
## EEVRDAYIAQPHNCVYHCFRDSYCNDLCIKHGAESGECKWFTSSGNACWCVKLPKSEPIKVPGKCH
## 
## >sp|P0DQM8.1|CM3J_CONMA RecName: Full=Conotoxin MIIIJ; AltName:
## Full=AlphaM-MIIIJ QKCCSGGSCPLYFRDRLICPCC
## 
## >NP_001317220.1 dimethylaniline monooxygenase [N-oxide-forming] 1
## isoform 1 [Mus musculus]
## MVKRVAIVGAGVSGLASIKCCLEEGLEPTCFERSSDLGGLWRFTEHVEEGRASLYKSVVSNSSREMSCYP
## DFPFPEDYPNFVPNSLFLEYLKLYSTQFNLQRCIYFNTKVCSITKRPDFAVSGQWEVVTVTNGKQNSAIF
## DAVMVCTGFLTNPHLPLDSFPGILTFKGEYFHSRQYKHPDIFKDKRVLVVGMGNSGTDIAVEASHLAKKV
## FLSTTGGAWVISRVFDSGYPWDMIFMTRFQNMLRNLLPTPIVSWLISKKMNSWFNHVNYGVAPEDRTQLR
## EPVLNDELPGRIITGKVFIKPSIKEVKENSVVFNNTPKEEPIDIIVFATGYTFAFPFLDESVVKVEDGQA
## SLYKYIFPAHLPKPTLAVIGLIKPLGSMVPTGETQARWVVQVLKGATTLPPPSVMMEEVNERKKNKHSGF
## GLCYCKALQTDYITYIDDLLTSINAKPDLRAMLLTDPRLALSIFFGPCTPYHFRLTGPGKWEGARKAILT
## QWDRTVKVTKTRTIQESPSSFETLLKLFSFLALLIAVFLIFL
## 
## >sp|B3FIQ7.1|TX16C_CYRSC RecName: Full=U10-theraphotoxin-Hs2a;
## Short=U10-TRTX-Hs2a; AltName: Full=HWTX-XVIc; Flags: Precursor
## MNTVRVTFLLVFVLAVSLGQADEDGNRMEKRQKKTEAENLLLPKLEELDAKLWEEDSVESRNSRQKRCNG
## KDVPCDPDPAKNRRCCSGLECLKPYLHGIWYQDYYCYVEKSGR
## 
## >sp|B3FIP2.1|TZ722_CYRSC RecName: Full=U8-theraphotoxin-Hs1b;
## Short=U8-TRTX-Hs1b; AltName: Full=HWTX-XVa2; Flags: Precursor
## MKAILLLAIFSVLTVAICGVSQNYGNVRYNYTELPNGEYCYIPRRRCVTTEQCCKPYDTVNNFAACGMAW
## PEDKKRKVNKCYICDNELTLCTR
## 
## >sp|B3FIP1.1|TZ721_CYRSC RecName: Full=U8-theraphotoxin-Hs1a;
## Short=U8-TRTX-Hs1a; AltName: Full=HWTX-XVa1; Flags: Precursor
## MKAILLLAIFSVLTVAICGVSQNYGNVRYNYTELPNGEYCYIPRRRCVTTEQCCKPYDTVNNFAACGMAW
## PEDKKRKVNECYICDNELTLCTR
## 
## >NP_001298019.1 ly6/PLAUR domain-containing protein 1 isoform b [Mus
## musculus]
## MCQKEVMEQSAGIMYRKSCASSAACLIASAGYQSFCSPGKLNSVCISCCNTPLCNGPRPKKRGSSASAIR
## PGLLTTLLFFHLALCLAHC
## 
## >sp|H2ER22.1|KAX1X_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx BmKcug1a; Short=Kcug1a; Flags: Precursor
## MKISFLLLLAIVICSIGWTEAQFTNVSCSASSQCWPVCEKLFGTYRGKCMNSKCRCYS
## 
## >sp|H2ER23.1|KAX1F_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx 1.15; AltName: Full=BmKcug2; Short=Kcug1; Flags: Precursor
## MKISFLLLALVICSIGWSEAQFTDVKCTASKQCWPVCNQMFGKPNGKCMNGKCRCYS
## 
## >sp|H2ETQ6.1|KAX1E_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx 1.14; AltName: Full=BmKcug1; Short=Kcug1; Flags: Precursor
## MKKISFLLLLAIVICSIGWTDGQFTDVRCSASSKCWPVCKKLFGTYKGKCKNSKCRCYS
## 
## >sp|F1CJ80.1|KA23J_HOTJU RecName: Full=U10-hottentoxin-Hj3a; Flags:
## Precursor
## MQKLLIILILFCILKFNVDVEGRTATMCDLPECQERCKRQNKKGKCVIEPEMNIVYHLCKCY
## 
## >sp|F1CJ67.1|KA23I_HOTJU RecName: Full=U10-hottentoxin-Hj2a; Flags:
## Precursor
## MQKLLIILILFCILKFNVDVEGRTAFPCNQSKCQERCKKEIKKGKCILQFISVSASQSCRCY
## 
## >sp|F1CIY9.1|KA23H_HOTJU RecName: Full=U10-buthitoxin-Hj1a;
## Short=U10-BUTX-Hj1a; Flags: Precursor
## MQKIFIILVLFCILKFNVDVEGRIASQCDLSACKERCEKQNKNGKCVIETEMDLVYRLCKCY
## 
## >sp|Q8MUB1.2|KA221_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx 22.1; AltName: Full=Neurotoxin BmK38; AltName: Full=Toxin
## Kcugx; Flags: Precursor
## MQKLFIVFVLFCILRLDAEVDGRTATFCTQSICEESCKRQNKNGRCVIEAEGSLIYHLCKCY
## 
## >sp|B5KF99.1|KA11M_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx J123; Flags: Precursor
## MNKVYLVAVLVLFLALTINESNEAVPTGGCPFSDFFCAKRCKDMKFGNTGRCTGPNKTVCKCSI
## 
## >sp|A7KJJ7.1|KA261_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx 26.1; AltName: Full=Neurotoxin BmK86; Flags: Precursor
## MSRLFVFILIALFLSAIIDVMSNFKVEGACSKPCRKYCIDKGARNGKCINGRCHCYY
## 
## >sp|P0CH43.1|DKTX_CYRSC RecName: Full=Tau-theraphotoxin-Hs1a;
## Short=Tau-TRTX-Hs1a; AltName: Full=Double-knot toxin; Short=DkTx
## DCAKEGEVCSWGKKCCDLDNFYCPMEFIPHCKKYKPYVPVTTNCAKEGEVCGWGSKCCHGLDCPLAFIPY
## CEKYRGRND
## 
## >sp|Q6WJF5.3|LV1A_MESMA RecName: Full=Lipolysis-activating peptide
## 1-alpha chain; Short=BmLVP1-alpha; Short=LVP1-alpha; Contains: RecName:
## Full=Neurotoxin BmKBTx; Short=BmKBT; Flags: Precursor
## MMKFVLFGMIVILFSLMGSIRGDDDPGNYPTNAYGNKYYCTILGENEYCRKICKLHGVTYGYCYNSRCWC
## EKLEDKDVTIWNAVKNHCTNTILYPNGK
## 
## >sp|Q95P90.2|LV1B_MESMA RecName: Full=HMG-CoA reductase inhibitor
## bumarsin; AltName: Full=JCH2; AltName: Full=Lipolysis-activating
## peptide 1-beta chain; Short=BmLVP1-beta; Short=LVP1-beta; AltName:
## Full=Neurotoxin KITx; Short=BmKITx; AltName: Full=Putative toxin
## BmKTXLP2; Flags: Precursor
## MVKMQVIFIAFIAVIACSMVYGDSLSPWNEGDTYYGCQRQTDEFCNKICKLHLASGGSCQQPAPFVKLCT
## CQGIDYDNSFFFGALEKQCPKLRG
## 
## >sp|P0CF76.1|SCX11_MESMA RecName: Full=Toxin BmKNJX11 GRDAYIADSENCTYT
## 
## >sp|P69755.2|O16C_CONMA RecName: Full=Delta-conotoxin-like MVIC;
## Short=Delta-MVIC; Flags: Precursor
## MKLTCVMIVAVLFLTTWTFVTADDSRYGLKNLFPKARHEMKNPEASKLNKRDECYPPGTFCGIKPGLCCS
## AICLSFVCISFDF
## 
## >sp|B3FIS7.1|TXLB2_CYRSC RecName: Full=U5-theraphotoxin-Hs1b 2;
## Short=U5-TRTX-Hs1b; AltName: Full=Lectin SHL-Ib2; Flags: Precursor
## MQTSMFLTLTGLVLLFVVCYASESEEKEFPKELLSSIFAADSDFKEEERGCFGYKCDYYKGCCSGYVCSP
## TWKWCVRPGPGRR
## 
## >sp|B3FIS6.1|TXLB1_CYRSC RecName: Full=U5-theraphotoxin-Hs1b 1;
## Short=U5-TRTX-Hs1b; AltName: Full=Lectin SHL-Ib1; Flags: Precursor
## MKTSMFLTLTGLVLLFVVCYASESEEKEFPKELLSSIFAADSDFKEEERGCFGYKCDYYKGCCSGYVCSP
## TWKWCVRPGPGRR
## 
## >sp|B3FIS3.1|TXLA4_CYRSC RecName: Full=U5-theraphotoxin-Hs1a 4;
## Short=U5-TRTX-Hs1a; AltName: Full=Lectin SHL-Ia4; Flags: Precursor
## MKTSMFLTLTGLVLLFVDCYASESEEKEFPKELLSSIFAADSDFKVEERGCLGDKCDYNNGCCSGYVCSR
## TWKWCVLAGPWRR
## 
## >sp|B3FIS1.1|TXLD_CYRSC RecName: Full=U5-theraphotoxin-Hs1d;
## Short=U5-TRTX-Hs1d; AltName: Full=Lectin SHL-1a2; AltName: Full=Lectin
## SHL-Ia2; Flags: Precursor
## MKTSMFLTLTGLVLLFVVCYASESEEKEFPKELLSSIFAADSDFKVEERGCLGDKCDYNNGCCSGYVCPR
## TWKWCVLAGPWRR
## 
## >sp|B3FIU2.1|TX10A_CYRSC RecName: Full=U12-theraphotoxin-Hs1a;
## Short=U12-TRTX-Hs1a; AltName: Full=Huwentoxin-10a; AltName:
## Full=Huwentoxin-Xa; Short=HwTx-Xa; Flags: Precursor
## MNVKILLLLVGLNLVMHSNATGDSETNPAETLFIEEIFRRGCFKEGKWCPKSAPCCAPLKCKGPSIKQQK
## CVRE
## 
## >sp|P0C614.1|I1B5_CONMA RecName: Full=Iota-conotoxin-like M11.5
## GHVPCGKDGRKCGYHADCCNCCLSGICKPSTSWTGCSTSTFD
## 
## >sp|P0C613.1|I1B2_CONMA RecName: Full=Conotoxin M11.2
## TCSNKGQQCGDDSDCCWHLCCVNNKCAHLILLCNL
## 
## >sp|P0C5S7.1|SIXP1_MESMA RecName: Full=Insect toxin BmK AngP1
## KKNGYAVDSSGKVAE
## 
## >sp|A0ASK0.1|KA14_MESMA RecName: Full=Potassium channel toxin alpha-KTx
## 14.x; AltName: Full=BmKK14; Flags: Precursor
## MKIFFAILLILAVCSMAIWTVNGTPFEVRCATDADCARKCPGNPPCRNGFCACT
## 
## >sp|P0C257.1|I1B1_CONMA RecName: Full=Iota-conotoxin-like M11.1; Flags:
## Precursor GAVPCGKDGRQCRNHADCCNCCPIGTCAPSTNWILPGCSTGQFMTR
## 
## >sp|P0C1X1.1|CA4A_CONMA RecName: Full=Kappa-conotoxin-like MIVA; Flags:
## Precursor
## MGMRMMFTVFLLVVLATTVVSIPSDRASDGRNAVVHERAPELVVTATTNCCGYNPMTICPPCMCTYSCPP
## KRKPGRRND
## 
## >sp|P0C1W2.1|CA1B_CONMA RecName: Full=Alpha-conotoxin-like MIB
## NGRCCHPACARKYNC
## 
## >sp|P0C1W1.1|CA1A_CONMA RecName: Full=Alpha-conotoxin-like MIA
## DGRCCHPACAKHFNC
## 
## >sp|P68424.2|TXH10_CYRSC RecName: Full=Omega-theraphotoxin-Hs1a;
## Short=Omega-TRTX-Hs1a; AltName: Full=Huwentoxin-10; AltName:
## Full=Huwentoxin-X; Short=HwTx-X; Flags: Precursor
## MNMKILVLVAVLCLVVSTHAERHSKTDMEDMEDSPMIQERKCLPPGKPCYGATQKIPCCGVCSHNKCT
## 
## >sp|P56636.3|CA12_CONMA RecName: Full=Alpha-conotoxin MII;
## Short=Alpha-Ctx MII; Short=Alpha-MII; Flags: Precursor
## MGMRMMFTVFLLVVLATTVVSFPSDRASDGRNAAANDKASDVITLALKGCCSNPVCHLEHSNLCGRRR
## 
## >sp|P0C1U2.1|CM3A_CONMA RecName: Full=Mu-conotoxin MIIIA
## QGCCNVPNGCSGRWCRDHAQCC
## 
## >sp|Q9BKB4.1|KA144_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx 14.4; AltName: Full=BmSKTx1; AltName: Full=Neurotoxin SKTx1;
## Flags: Precursor MKIFFAILLILAVCSMAIWTVNGTPFAIKCATNADCSRKCPGNPPCRNGFCACT
## 
## >sp|P69756.1|O16D_CONMA RecName: Full=Delta-conotoxin-like MVID;
## Short=Delta-MVID EACYNAGTFCGIKPGLCCSAICLSFVCISFDF

write(First50_neurotoxins,file = "First50_neurotoxins.fasta", sep = "")
#Store the sequences
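
With rettype = "fasta", entrez_fetch() returns the records as one long character string, which is why the example above prints it with cat(). If you want to work with the sequences inside R without any extra package, one option is to split that string yourself. The sketch below is only an illustration: parse_fasta_text is a hypothetical helper, and the input is two records copied from the output above rather than a fresh download. It assumes every header line is followed by at least one sequence line.

```r
# Two records copied from the output above, laid out as a single string
fasta_text <- paste0(
  ">sp|P0C1W2.1|CA1B_CONMA RecName: Full=Alpha-conotoxin-like MIB\n",
  "NGRCCHPACARKYNC\n\n",
  ">sp|P0C1W1.1|CA1A_CONMA RecName: Full=Alpha-conotoxin-like MIA\n",
  "DGRCCHPACAKHFNC\n\n"
)

# Hypothetical helper: split FASTA text into a named character vector
# (names are the header lines, values are the sequences).
parse_fasta_text <- function(txt) {
  lines <- unlist(strsplit(txt, "\n", fixed = TRUE))
  lines <- lines[nzchar(lines)]            # drop blank lines
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)           # record index of every line
  # join the sequence lines that belong to the same record
  seqs <- tapply(lines[!is_header], record_id[!is_header],
                 paste, collapse = "")
  stats::setNames(as.vector(seqs), sub("^>", "", lines[is_header]))
}

seqs <- parse_fasta_text(fasta_text)
print(nchar(seqs))  # sequence lengths, named by FASTA header
```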

One important thing is that you should not request more than about 300 sequences in a single entrez_fetch() call by passing IDs. To get around this, use the use_history argument so that the search result is stored on the NCBI server as a web_history object that can be referenced multiple times; then use a for loop to download the sequences in ‘chunks’.

neurotoxins <- entrez_search(db = "protein", term = "(Bilateria AND neurotoxin)", retmax = 10000, use_history = TRUE)

There are 7700 sequences to download

total <- length(neurotoxins$ids) # number of IDs returned by the search
for (seq_start in seq(0, total - 1, 10)) {
        # create chunks of 10 sequences across all the IDs;
        # retstart is zero-based, so the first chunk starts at 0
        reqs <- entrez_fetch(db = "protein",
                             web_history = neurotoxins$web_history, # use the web_history object
                             rettype = "fasta", # specify the output format
                             retmax = 10, # chunk length
                             retstart = seq_start) # start of the current chunk
        
        cat(reqs, file = "neurotoxins.fasta", append = TRUE) # append to our .fasta file
        cat(min(seq_start + 10, total), "sequences downloaded\r") # progress counter
}
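
The chunk boundaries for a loop like this can be worked out before any request is sent, which is a cheap way to check the arithmetic. The sketch below is a minimal illustration, assuming the 7700 hits mentioned above and a zero-based retstart (the E-utilities convention, where 0 is the first record). chunk_plan is a hypothetical helper, not a rentrez function, and it makes no network calls.

```r
# Hypothetical helper: list the retstart/retmax pair for every chunk.
chunk_plan <- function(total, chunk_size) {
  starts <- seq(0, total - 1, by = chunk_size)  # zero-based offsets
  data.frame(
    retstart = starts,
    retmax   = pmin(chunk_size, total - starts) # last chunk may be shorter
  )
}

plan <- chunk_plan(total = 7700, chunk_size = 10)
print(head(plan, 3))  # first three chunks
print(nrow(plan))     # number of requests the loop will make
```

Summing the retmax column should give back the total number of hits; if it does not, some records would be skipped or fetched twice.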

path <- "neurotoxins.fasta" # the file written by the loop above
neurotoxins.fasta <- ape::read.FASTA(path, type = "AA") # load the protein .fasta file back into R

file.remove(path) # if you no longer need this file, you can delete it

And that’s all for today.

R Bioinformatics

Diego Sierra Ramírez

Msc. in Biological Science / Data analyst
