Accessing and downloading from the Entrez database system
Entrez is a biological database system that integrates protein and nucleotide sequence data and related information in a single place. It links more than 40 databases, which are cross-referenced with GenBank, EMBL and DDBJ, and it is the first place you should visit when you want to obtain relevant and reliable information or datasets.
You can visit the official Entrez website, where a user-friendly interface lets you search, explore the site and get the information you want. The purpose of this lecture, however, is to use the rentrez package to download sequences efficiently from different databases directly from R, which is especially important if you need to download many sequences.
The first thing you have to do is install the rentrez package and then load it.
install.packages("rentrez")
library(rentrez)
Entrez has many databases, each holding specialized data. To list all the available databases, use the entrez_dbs() function.
entrez_dbs()
## [1] "pubmed" "protein" "nuccore" "ipg"
## [5] "nucleotide" "structure" "genome" "annotinfo"
## [9] "assembly" "bioproject" "biosample" "blastdbinfo"
## [13] "books" "cdd" "clinvar" "gap"
## [17] "gapplus" "grasp" "dbvar" "gene"
## [21] "gds" "geoprofiles" "homologene" "medgen"
## [25] "mesh" "ncbisearch" "nlmcatalog" "omim"
## [29] "orgtrack" "pmc" "popset" "proteinclusters"
## [33] "pcassay" "protfam" "biosystems" "pccompound"
## [37] "pcsubstance" "seqannot" "snp" "sra"
## [41] "taxonomy" "biocollections" "gtr"
If you want to know the purpose of a particular database, go to the Entrez databases page. Another way to find additional information is to use the entrez_db_summary() and entrez_db_searchable() functions.
entrez_db_summary("nucleotide")
## DbName: nuccore
## MenuName: Nucleotide
## Description: Core Nucleotide db
## DbBuild: Build220322-1505m.1
## Count: 491207219
## LastUpdate: 2022/03/24 06:24
entrez_db_searchable("nucleotide")
## Searchable fields for database 'nuccore'
## ALL All terms from all searchable fields
## UID Unique number assigned to each sequence
## FILT Limits the records
## WORD Free text associated with record
## TITL Words in definition line
## KYWD Nonstandardized terms provided by submitter
## AUTH Author(s) of publication
## JOUR Journal abbreviation of publication
## VOL Volume number of publication
## ISS Issue number of publication
## PAGE Page number(s) of publication
## ORGN Scientific and common names of organism, and all higher levels of taxonomy
## ACCN Accession number of sequence
## PACC Does not include retired secondary accessions
## GENE Name of gene associated with sequence
## PROT Name of protein associated with sequence
## ECNO EC number for enzyme or CAS registry number
## PDAT Date sequence added to GenBank
## MDAT Date of last update
## SUBS CAS chemical name or MEDLINE Substance Name
## PROP Classification by source qualifiers and molecule type
## SQID String identifier for sequence
## GPRJ BioProject
## SLEN Length of sequence
## FKEY Feature annotated on sequence
## PORG Scientific and common names of primary organism, and all higher levels of taxonomy
## COMP Component accessions for an assembly
## ASSM Assembly
## DIV Division
## STRN Strain
## ISOL Isolate
## CULT Cultivar
## BRD Breed
## BIOS BioSample
Searching inside PubMed
Personally, I prefer the RISmed package for retrieving information from PubMed (tutorial here) when you want a general search that returns “large” pieces of information such as paper abstracts, but you can also use rentrez to find record IDs and to build more precise search terms.
To search the PubMed database, we first have to know which fields can be searched, using entrez_db_searchable().
entrez_db_searchable("pubmed")
## Searchable fields for database 'pubmed'
## ALL All terms from all searchable fields
## UID Unique number assigned to publication
## FILT Limits the records
## TITL Words in title of publication
## WORD Free text associated with publication
## MESH Medical Subject Headings assigned to publication
## MAJR MeSH terms of major importance to publication
## AUTH Author(s) of publication
## JOUR Journal abbreviation of publication
## AFFL Author's institutional affiliation and address
## ECNO EC number for enzyme or CAS registry number
## SUBS CAS chemical name or MEDLINE Substance Name
## PDAT Date of publication
## EDAT Date publication first accessible through Entrez
## VOL Volume number of publication
## PAGE Page number(s) of publication
## PTYP Type of publication (e.g., review)
## LANG Language of publication
## ISS Issue number of publication
## SUBH Additional specificity for MeSH term
## SI Cross-reference from publication to other databases
## MHDA Date publication was indexed with MeSH terms
## TIAB Free text associated with Abstract/Title
## OTRM Other terms associated with publication
## INVR Investigator
## COLN Corporate Author of publication
## CNTY Country of publication
## PAPX MeSH pharmacological action pre-explosions
## GRNT NIH Grant Numbers
## MDAT Date of last modification
## CDAT Date of completion
## PID Publisher ID
## FAUT First Author of publication
## FULL Full Author Name(s) of publication
## FINV Full name of investigator
## TT Words in transliterated title of publication
## LAUT Last Author of publication
## PPDT Date of print publication
## EPDT Date of Electronic publication
## LID ELocation ID
## CRDT Date publication first accessible through Entrez
## BOOK ID of the book that contains the document
## ED Section's Editor
## ISBN ISBN
## PUBN Publisher's name
## AUCL Author Cluster ID
## EID Extended PMID
## DSO Additional text from the summary
## AUID Author Identifier
## PS Personal Name as Subject
## COIS Conflict of Interest Statements
These are all the fields you can query. For example, suppose you have to find papers about an organism such as “Phoneutria boliviensis”, that is, papers that include “Phoneutria boliviensis” in their title. To do this, we use the [TITL] tag that we obtained previously with entrez_db_searchable().
papers_phoneutria <- entrez_search(db="pubmed", term="Phoneutria boliviensis[TITL]")
papers_phoneutria
## Entrez search result with 3 hits (object contains 3 IDs and no web_history object)
## Search term (as translated): Phoneutria boliviensis[TITL]
There are 3 papers that include “Phoneutria boliviensis” in their title.
You can also build a more detailed query using AND, OR and multiple tags. In the next example we search for all the papers in PubMed that have “Phoneutria boliviensis” in their title AND “diet” in the free text associated with the publication (the [WORD] field).
papers_phoneutria_diet <- entrez_search(db="pubmed", term="Phoneutria boliviensis[TITL] AND diet[WORD]")
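When queries combine several tags, it can help to assemble the term string programmatically instead of typing it by hand. The following is only a minimal sketch: tag_term() and combine_terms() are hypothetical helper names introduced here for illustration, they are not part of rentrez, and they do nothing more than paste strings together.

```r
# Hypothetical helper (not part of rentrez): attach a search field to a term,
# e.g. tag_term("diet", "WORD") gives "diet[WORD]".
tag_term <- function(term, field) {
  paste0(term, "[", field, "]")
}

# Hypothetical helper (not part of rentrez): join already-tagged terms
# with a boolean operator.
combine_terms <- function(..., operator = c("AND", "OR")) {
  operator <- match.arg(operator)
  paste(c(...), collapse = paste0(" ", operator, " "))
}

query <- combine_terms(
  tag_term("Phoneutria boliviensis", "TITL"),
  tag_term("diet", "WORD"),
  operator = "AND"
)
print(query)
## [1] "Phoneutria boliviensis[TITL] AND diet[WORD]"
```

The resulting string is the same one used above and can be passed as the term argument of entrez_search().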
Downloading sequences with rentrez
When you want to download sequences from any database, you have to decide how many sequences you want and which ones. To answer the first question, use the retmax argument of the entrez_search() function, which limits the number of IDs returned and therefore the number of sequences you will work with.
neurotoxins <- entrez_search(db = "protein", term = "(Bilateria AND neurotoxin)", retmax = 50)
neurotoxins$ids
## [1] "662033932" "297747292" "295842492" "223468574" "56682964"
## [6] "18859383" "18765729" "5032137" "4759184" "2203845921"
## [11] "2055117517" "1939884204" "1915463589" "1059045644" "980958836"
## [16] "952543491" "952543490" "902763185" "387942508" "387942507"
## [21] "387942506" "387942502" "387942501" "387942500" "387912837"
## [26] "317411723" "317411690" "306755727" "298286916" "298286829"
## [31] "296439765" "292494995" "254790169" "254790168" "254790165"
## [36] "254790163" "254790157" "166220141" "166220138" "160358672"
## [41] "122129826" "118572281" "115311697" "115311688" "115311686"
## [46] "114152898" "114152794" "114149260" "74848311" "62900065"
Now that you have the IDs, you can review some information about the records with entrez_summary(), or you can download the sequences directly using entrez_fetch().
First50_neurotoxins <- entrez_fetch(db = "protein", id = neurotoxins$ids, rettype = "fasta") #Downloading
cat(strwrap(First50_neurotoxins), sep="\n") # Print the result
## >NP_001284367.1 vesicle-associated membrane protein 1 isoform 4 [Homo
## sapiens]
## MSAPAQPPAEGTEGTAPGGGPPGPPPNMTSNRRLQQTQAQVEEVVDIIRVNVDKVLERDQKLSELDDRAD
## ALQAGASQFESSAAKLKRKYWWKNCKMMIMLGAICAIIVVVIVRRG
##
## >NP_001172112.1 vesicle-associated membrane protein 7 isoform 3 [Homo
## sapiens]
## MAILFAVVARGTTILAKHAWCGGNFLEVTEQILAKIPSENNKLTYSHGNYLFHYICQDRIVYLCITDDDF
## ERSRAFNFLNEIKKRFQTTYGSRAQTALPYAMNSEFSSVLAAQLKHHSENKGLDKVMETQAQVDELKGIM
## VRNIVCHLQNYQQKSCSSHVYEEPQAHYYHHHRINCVHLYHCFTSLWWIYMAKLCEEIGKKKLPLTKDMR
## EQGVKSNPCDSSLSHTDRWYLPVSSTLFSLFKILFHASRFIFVLSTSLFL
##
## >NP_001171511.1 syntaxin-3 isoform 2 [Homo sapiens]
## MKDRLEQLKAKQLTQDDDTDAVEIAIDNTAFMDEFFSEIEETRLNIDKISEHVEEAKKLYSIILSAPIPE
## PKTKDDLEQLTTEIKKRANNVRNKLKSMEKHIEEDEVRSSADLRIRKSQHSVLSRKFVEVMTKYNEAQVD
## FRERSKGRIQRQLEITGKKTTDEELEEMLESGNPAIFTSGIIDSQISKQALSEIEGRHKDIVRLESSIKE
## LHDMFMDIAMLVENQGEMLDNIELNVMHTVDHVEKARDETKKAVKYQSQARKKLISLQTGVATLVFR
##
## >NP_001138621.1 vesicle-associated membrane protein 7 isoform 2 [Homo
## sapiens]
## MAILFAVVARGTTILAKHAWCGGNFLEDFERSRAFNFLNEIKKRFQTTYGSRAQTALPYAMNSEFSSVLA
## AQLKHHSENKGLDKVMETQAQVDELKGIMVRNIDLVAQRGERLELLIDKTENLVDSSVTFKTTSRNLARA
## MCMKNLKLTIIIIIVSIVFIYIIVSPLCGGFTWPSCVKK
##
## >NP_001008530.1 legumain isoform 1 preproprotein [Homo sapiens]
## MVWKVAVFLSVALGIGAVPIDDPEDGGKHWVVIVAGSNGWYNYRHQADACHAYQIIHRNGIPDEQIVVMM
## YDDIAYSEDNPTPGIVINRPNGTDVYQGVPKDYTGEDVTPQNFLAVLRGDAEAVKGIGSGKVLKSGPQDH
## VFIYFTDHGSTGILVFPNEDLHVKDLNETIHYMYKHKMYRKMVFYIEACESGSMMNHLPDNINVYATTAA
## NPRESSYACYYDEKRSTYLGDWYSVNWMEDSDVEDLTKETLHKQYHLVKSHTNTSHVMQYGNKTISTMKV
## MQFQGMKRKASSPVPLPPVTHLDLTPSPDVPLTIMKRKLMNTNDLEESRQLTEEIQRHLDARHLIEKSVR
## KIVSLLAASEAEVEQLLSERAPLTGHSCYPEALLHFRTHCFNWHSPTYEYALRHLYVLVNLCEKPYPLHR
## IKLSMDHVCLGHY
##
## >NP_571830.1 sodium-dependent dopamine transporter [Danio rerio]
## MPMLRGRPAVTHTRTRTHTHMSSVSGSSSAAGPREVELVLVKEQNGVQFTSSSLRNPGAHSHTHTHTHTH
## PSGQQRETWGKKIDFLLSVIGFAVDLANVWRFPYLCYKNGGGAFLVPYLLFMVIAGMPLFYMELALGQYN
## REGAAGVWKICPIFKGVGFTVILISLYVGSYYNVIIAWALFYLFSSFSGELPWIHCNNTWNSPNCSDPNA
## TLLNDTYKTTPALEYFERGVLHVHESSGIDDLGAPRWQLTACLAVVIVVLYFSLWKGVKTSGKVVWITAT
## MPYVVLTVLLLRGVTLPGAIDGIKAYLSVDFLRLYDAQVWIEAATQICFSLGVGFGVLIAFSSYNKFSNN
## CYRDAIITSSINSLTSFFSGFVIFSFLGYMSQKHNVALDKVATDGPGLVFIIYPEAIATLPGSSVWAVIF
## FIMLLTLGIDSAMGGMESVITGLIDEFKFLHKHRELFTLFIVVSTFLISLICVTNGGIYVFTLLDHFAAG
## TSILFGVLIEAIGIAWFYGVDRFSDDIEEMIGQRPGLYWRLCWKFVSPCFLLFMVVVSFATFNPPKYGSY
## YFPTWATMVGWCLSISSMIMVPLYAFYKFCSLPGSFCDKLAYAITPETDHHLVERGEVRQFTLHHWLVV
##
## >NP_003816.2 synaptosomal-associated protein 23 isoform SNAP23A [Homo
## sapiens]
## MDNLSSEEIQQRAHQITDESLESTRRILGLAIESQDAGIKTITMLDEQKEQLNRIEEGLDQINKDMRETE
## KTLTELNKCCGLCVCPCNRTKNFESGKAYKTTWGDGGENSPCNVVSKQPGPVTNGQLQQPTTGAASGGYI
## KRITNDAREDEMEENLTQVGSILGNLKDMALNIGNEIDAQNPQIKRITDKADTNRDRIDIANARAKKLID
## S
##
## >NP_005629.1 vesicle-associated membrane protein 7 isoform 1 [Homo
## sapiens]
## MAILFAVVARGTTILAKHAWCGGNFLEVTEQILAKIPSENNKLTYSHGNYLFHYICQDRIVYLCITDDDF
## ERSRAFNFLNEIKKRFQTTYGSRAQTALPYAMNSEFSSVLAAQLKHHSENKGLDKVMETQAQVDELKGIM
## VRNIDLVAQRGERLELLIDKTENLVDSSVTFKTTSRNLARAMCMKNLKLTIIIIIVSIVFIYIIVSPLCG
## GFTWPSCVKK
##
## >NP_004168.1 syntaxin-3 isoform 1 [Homo sapiens]
## MKDRLEQLKAKQLTQDDDTDAVEIAIDNTAFMDEFFSEIEETRLNIDKISEHVEEAKKLYSIILSAPIPE
## PKTKDDLEQLTTEIKKRANNVRNKLKSMEKHIEEDEVRSSADLRIRKSQHSVLSRKFVEVMTKYNEAQVD
## FRERSKGRIQRQLEITGKKTTDEELEEMLESGNPAIFTSGIIDSQISKQALSEIEGRHKDIVRLESSIKE
## LHDMFMDIAMLVENQGEMLDNIELNVMHTVDHVEKARDETKKAVKYQSQARKKLIIIIVLVVVLLGILAL
## IIGLSVGLN
##
## >sp|P0DV30.1|SCXT2_MESMA RecName: Full=Sodium channel neurotoxin BmK
## NT2; AltName: Full=Alpha-scorpion toxin
## VRDAYIAKPENCVYHCAGNEGCNNLCTCNGAT
##
## >sp|P0DUK8.1|KAX3V_MESMA RecName: Full=Toxin BmK NSPK; AltName:
## Full=Buthus martensii Karsch neurite-stimulating peptide targeting Kv
## channels VGKNVICIHSGQCLIPCIDAGMRFGICKNGICDCTPKG
##
## >sp|P0DQN8.1|DEL1A_HOTJU RecName: Full=Delta-buthitoxin-Hj1a;
## Short=Delta-BUTX-Hj1a
## EEVRDAYIAQPHNCVYHCFRDSYCNDLCIKHGAESGECKWFTSSGNACWCVKLPKSEPIKVPGKCH
##
## >sp|P0DQM8.1|CM3J_CONMA RecName: Full=Conotoxin MIIIJ; AltName:
## Full=AlphaM-MIIIJ QKCCSGGSCPLYFRDRLICPCC
##
## >NP_001317220.1 dimethylaniline monooxygenase [N-oxide-forming] 1
## isoform 1 [Mus musculus]
## MVKRVAIVGAGVSGLASIKCCLEEGLEPTCFERSSDLGGLWRFTEHVEEGRASLYKSVVSNSSREMSCYP
## DFPFPEDYPNFVPNSLFLEYLKLYSTQFNLQRCIYFNTKVCSITKRPDFAVSGQWEVVTVTNGKQNSAIF
## DAVMVCTGFLTNPHLPLDSFPGILTFKGEYFHSRQYKHPDIFKDKRVLVVGMGNSGTDIAVEASHLAKKV
## FLSTTGGAWVISRVFDSGYPWDMIFMTRFQNMLRNLLPTPIVSWLISKKMNSWFNHVNYGVAPEDRTQLR
## EPVLNDELPGRIITGKVFIKPSIKEVKENSVVFNNTPKEEPIDIIVFATGYTFAFPFLDESVVKVEDGQA
## SLYKYIFPAHLPKPTLAVIGLIKPLGSMVPTGETQARWVVQVLKGATTLPPPSVMMEEVNERKKNKHSGF
## GLCYCKALQTDYITYIDDLLTSINAKPDLRAMLLTDPRLALSIFFGPCTPYHFRLTGPGKWEGARKAILT
## QWDRTVKVTKTRTIQESPSSFETLLKLFSFLALLIAVFLIFL
##
## >sp|B3FIQ7.1|TX16C_CYRSC RecName: Full=U10-theraphotoxin-Hs2a;
## Short=U10-TRTX-Hs2a; AltName: Full=HWTX-XVIc; Flags: Precursor
## MNTVRVTFLLVFVLAVSLGQADEDGNRMEKRQKKTEAENLLLPKLEELDAKLWEEDSVESRNSRQKRCNG
## KDVPCDPDPAKNRRCCSGLECLKPYLHGIWYQDYYCYVEKSGR
##
## >sp|B3FIP2.1|TZ722_CYRSC RecName: Full=U8-theraphotoxin-Hs1b;
## Short=U8-TRTX-Hs1b; AltName: Full=HWTX-XVa2; Flags: Precursor
## MKAILLLAIFSVLTVAICGVSQNYGNVRYNYTELPNGEYCYIPRRRCVTTEQCCKPYDTVNNFAACGMAW
## PEDKKRKVNKCYICDNELTLCTR
##
## >sp|B3FIP1.1|TZ721_CYRSC RecName: Full=U8-theraphotoxin-Hs1a;
## Short=U8-TRTX-Hs1a; AltName: Full=HWTX-XVa1; Flags: Precursor
## MKAILLLAIFSVLTVAICGVSQNYGNVRYNYTELPNGEYCYIPRRRCVTTEQCCKPYDTVNNFAACGMAW
## PEDKKRKVNECYICDNELTLCTR
##
## >NP_001298019.1 ly6/PLAUR domain-containing protein 1 isoform b [Mus
## musculus]
## MCQKEVMEQSAGIMYRKSCASSAACLIASAGYQSFCSPGKLNSVCISCCNTPLCNGPRPKKRGSSASAIR
## PGLLTTLLFFHLALCLAHC
##
## >sp|H2ER22.1|KAX1X_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx BmKcug1a; Short=Kcug1a; Flags: Precursor
## MKISFLLLLAIVICSIGWTEAQFTNVSCSASSQCWPVCEKLFGTYRGKCMNSKCRCYS
##
## >sp|H2ER23.1|KAX1F_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx 1.15; AltName: Full=BmKcug2; Short=Kcug1; Flags: Precursor
## MKISFLLLALVICSIGWSEAQFTDVKCTASKQCWPVCNQMFGKPNGKCMNGKCRCYS
##
## >sp|H2ETQ6.1|KAX1E_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx 1.14; AltName: Full=BmKcug1; Short=Kcug1; Flags: Precursor
## MKKISFLLLLAIVICSIGWTDGQFTDVRCSASSKCWPVCKKLFGTYKGKCKNSKCRCYS
##
## >sp|F1CJ80.1|KA23J_HOTJU RecName: Full=U10-hottentoxin-Hj3a; Flags:
## Precursor
## MQKLLIILILFCILKFNVDVEGRTATMCDLPECQERCKRQNKKGKCVIEPEMNIVYHLCKCY
##
## >sp|F1CJ67.1|KA23I_HOTJU RecName: Full=U10-hottentoxin-Hj2a; Flags:
## Precursor
## MQKLLIILILFCILKFNVDVEGRTAFPCNQSKCQERCKKEIKKGKCILQFISVSASQSCRCY
##
## >sp|F1CIY9.1|KA23H_HOTJU RecName: Full=U10-buthitoxin-Hj1a;
## Short=U10-BUTX-Hj1a; Flags: Precursor
## MQKIFIILVLFCILKFNVDVEGRIASQCDLSACKERCEKQNKNGKCVIETEMDLVYRLCKCY
##
## >sp|Q8MUB1.2|KA221_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx 22.1; AltName: Full=Neurotoxin BmK38; AltName: Full=Toxin
## Kcugx; Flags: Precursor
## MQKLFIVFVLFCILRLDAEVDGRTATFCTQSICEESCKRQNKNGRCVIEAEGSLIYHLCKCY
##
## >sp|B5KF99.1|KA11M_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx J123; Flags: Precursor
## MNKVYLVAVLVLFLALTINESNEAVPTGGCPFSDFFCAKRCKDMKFGNTGRCTGPNKTVCKCSI
##
## >sp|A7KJJ7.1|KA261_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx 26.1; AltName: Full=Neurotoxin BmK86; Flags: Precursor
## MSRLFVFILIALFLSAIIDVMSNFKVEGACSKPCRKYCIDKGARNGKCINGRCHCYY
##
## >sp|P0CH43.1|DKTX_CYRSC RecName: Full=Tau-theraphotoxin-Hs1a;
## Short=Tau-TRTX-Hs1a; AltName: Full=Double-knot toxin; Short=DkTx
## DCAKEGEVCSWGKKCCDLDNFYCPMEFIPHCKKYKPYVPVTTNCAKEGEVCGWGSKCCHGLDCPLAFIPY
## CEKYRGRND
##
## >sp|Q6WJF5.3|LV1A_MESMA RecName: Full=Lipolysis-activating peptide
## 1-alpha chain; Short=BmLVP1-alpha; Short=LVP1-alpha; Contains: RecName:
## Full=Neurotoxin BmKBTx; Short=BmKBT; Flags: Precursor
## MMKFVLFGMIVILFSLMGSIRGDDDPGNYPTNAYGNKYYCTILGENEYCRKICKLHGVTYGYCYNSRCWC
## EKLEDKDVTIWNAVKNHCTNTILYPNGK
##
## >sp|Q95P90.2|LV1B_MESMA RecName: Full=HMG-CoA reductase inhibitor
## bumarsin; AltName: Full=JCH2; AltName: Full=Lipolysis-activating
## peptide 1-beta chain; Short=BmLVP1-beta; Short=LVP1-beta; AltName:
## Full=Neurotoxin KITx; Short=BmKITx; AltName: Full=Putative toxin
## BmKTXLP2; Flags: Precursor
## MVKMQVIFIAFIAVIACSMVYGDSLSPWNEGDTYYGCQRQTDEFCNKICKLHLASGGSCQQPAPFVKLCT
## CQGIDYDNSFFFGALEKQCPKLRG
##
## >sp|P0CF76.1|SCX11_MESMA RecName: Full=Toxin BmKNJX11 GRDAYIADSENCTYT
##
## >sp|P69755.2|O16C_CONMA RecName: Full=Delta-conotoxin-like MVIC;
## Short=Delta-MVIC; Flags: Precursor
## MKLTCVMIVAVLFLTTWTFVTADDSRYGLKNLFPKARHEMKNPEASKLNKRDECYPPGTFCGIKPGLCCS
## AICLSFVCISFDF
##
## >sp|B3FIS7.1|TXLB2_CYRSC RecName: Full=U5-theraphotoxin-Hs1b 2;
## Short=U5-TRTX-Hs1b; AltName: Full=Lectin SHL-Ib2; Flags: Precursor
## MQTSMFLTLTGLVLLFVVCYASESEEKEFPKELLSSIFAADSDFKEEERGCFGYKCDYYKGCCSGYVCSP
## TWKWCVRPGPGRR
##
## >sp|B3FIS6.1|TXLB1_CYRSC RecName: Full=U5-theraphotoxin-Hs1b 1;
## Short=U5-TRTX-Hs1b; AltName: Full=Lectin SHL-Ib1; Flags: Precursor
## MKTSMFLTLTGLVLLFVVCYASESEEKEFPKELLSSIFAADSDFKEEERGCFGYKCDYYKGCCSGYVCSP
## TWKWCVRPGPGRR
##
## >sp|B3FIS3.1|TXLA4_CYRSC RecName: Full=U5-theraphotoxin-Hs1a 4;
## Short=U5-TRTX-Hs1a; AltName: Full=Lectin SHL-Ia4; Flags: Precursor
## MKTSMFLTLTGLVLLFVDCYASESEEKEFPKELLSSIFAADSDFKVEERGCLGDKCDYNNGCCSGYVCSR
## TWKWCVLAGPWRR
##
## >sp|B3FIS1.1|TXLD_CYRSC RecName: Full=U5-theraphotoxin-Hs1d;
## Short=U5-TRTX-Hs1d; AltName: Full=Lectin SHL-1a2; AltName: Full=Lectin
## SHL-Ia2; Flags: Precursor
## MKTSMFLTLTGLVLLFVVCYASESEEKEFPKELLSSIFAADSDFKVEERGCLGDKCDYNNGCCSGYVCPR
## TWKWCVLAGPWRR
##
## >sp|B3FIU2.1|TX10A_CYRSC RecName: Full=U12-theraphotoxin-Hs1a;
## Short=U12-TRTX-Hs1a; AltName: Full=Huwentoxin-10a; AltName:
## Full=Huwentoxin-Xa; Short=HwTx-Xa; Flags: Precursor
## MNVKILLLLVGLNLVMHSNATGDSETNPAETLFIEEIFRRGCFKEGKWCPKSAPCCAPLKCKGPSIKQQK
## CVRE
##
## >sp|P0C614.1|I1B5_CONMA RecName: Full=Iota-conotoxin-like M11.5
## GHVPCGKDGRKCGYHADCCNCCLSGICKPSTSWTGCSTSTFD
##
## >sp|P0C613.1|I1B2_CONMA RecName: Full=Conotoxin M11.2
## TCSNKGQQCGDDSDCCWHLCCVNNKCAHLILLCNL
##
## >sp|P0C5S7.1|SIXP1_MESMA RecName: Full=Insect toxin BmK AngP1
## KKNGYAVDSSGKVAE
##
## >sp|A0ASK0.1|KA14_MESMA RecName: Full=Potassium channel toxin alpha-KTx
## 14.x; AltName: Full=BmKK14; Flags: Precursor
## MKIFFAILLILAVCSMAIWTVNGTPFEVRCATDADCARKCPGNPPCRNGFCACT
##
## >sp|P0C257.1|I1B1_CONMA RecName: Full=Iota-conotoxin-like M11.1; Flags:
## Precursor GAVPCGKDGRQCRNHADCCNCCPIGTCAPSTNWILPGCSTGQFMTR
##
## >sp|P0C1X1.1|CA4A_CONMA RecName: Full=Kappa-conotoxin-like MIVA; Flags:
## Precursor
## MGMRMMFTVFLLVVLATTVVSIPSDRASDGRNAVVHERAPELVVTATTNCCGYNPMTICPPCMCTYSCPP
## KRKPGRRND
##
## >sp|P0C1W2.1|CA1B_CONMA RecName: Full=Alpha-conotoxin-like MIB
## NGRCCHPACARKYNC
##
## >sp|P0C1W1.1|CA1A_CONMA RecName: Full=Alpha-conotoxin-like MIA
## DGRCCHPACAKHFNC
##
## >sp|P68424.2|TXH10_CYRSC RecName: Full=Omega-theraphotoxin-Hs1a;
## Short=Omega-TRTX-Hs1a; AltName: Full=Huwentoxin-10; AltName:
## Full=Huwentoxin-X; Short=HwTx-X; Flags: Precursor
## MNMKILVLVAVLCLVVSTHAERHSKTDMEDMEDSPMIQERKCLPPGKPCYGATQKIPCCGVCSHNKCT
##
## >sp|P56636.3|CA12_CONMA RecName: Full=Alpha-conotoxin MII;
## Short=Alpha-Ctx MII; Short=Alpha-MII; Flags: Precursor
## MGMRMMFTVFLLVVLATTVVSFPSDRASDGRNAAANDKASDVITLALKGCCSNPVCHLEHSNLCGRRR
##
## >sp|P0C1U2.1|CM3A_CONMA RecName: Full=Mu-conotoxin MIIIA
## QGCCNVPNGCSGRWCRDHAQCC
##
## >sp|Q9BKB4.1|KA144_MESMA RecName: Full=Potassium channel toxin
## alpha-KTx 14.4; AltName: Full=BmSKTx1; AltName: Full=Neurotoxin SKTx1;
## Flags: Precursor MKIFFAILLILAVCSMAIWTVNGTPFAIKCATNADCSRKCPGNPPCRNGFCACT
##
## >sp|P69756.1|O16D_CONMA RecName: Full=Delta-conotoxin-like MVID;
## Short=Delta-MVID EACYNAGTFCGIKPGLCCSAICLSFVCISFDF
write(First50_neurotoxins, file = "First50_neurotoxins.fasta", sep = "") # Store the sequences
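A quick sanity check after a download is to count the FASTA records in the returned text and compare that number with the number of IDs you asked for. The sketch below runs offline on a made-up two-record string that stands in for the value returned by entrez_fetch(); the variable names and the toy sequences are assumptions for illustration only.

```r
# Toy FASTA text standing in for the string returned by entrez_fetch();
# the two records below are made up for illustration.
fasta_text <- paste0(
  ">seq1 example protein one\nMKISFLLL\nAIVICS\n\n",
  ">seq2 example protein two\nMQKLLIIL\n\n"
)

# Split the text into lines and keep the header lines (those starting with ">").
fasta_lines <- strsplit(fasta_text, "\n", fixed = TRUE)[[1]]
headers <- fasta_lines[startsWith(fasta_lines, ">")]

# One header per record, so the number of headers is the number of sequences.
n_records <- length(headers)

# The first word of each header (without ">") is the record identifier.
record_ids <- sub("^>([^ ]+).*$", "\\1", headers)

print(n_records)
print(record_ids)
```

With the real download, the same steps applied to First50_neurotoxins should give a count equal to length(neurotoxins$ids).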
One important limitation is that you cannot download more than 300 sequences at once with entrez_fetch(). To work around this, set the use_history argument of entrez_search() to TRUE: this stores the search results on the NCBI server and returns a web_history object that can be referenced multiple times. We then use a for loop to download our sequences in ‘chunks’.
neurotoxins <- entrez_search(db = "protein", term = "(Bilateria AND neurotoxin)", retmax = 10000, use_history = TRUE)
There are 7700 sequences to download
for (seq_start in seq(0, length(neurotoxins$ids) - 1, 10)) {
  # step through the IDs in chunks of 10; Entrez counts retstart from 0,
  # so starting at 1 would skip the first record
  reqs <- entrez_fetch(db = "protein",
                       web_history = neurotoxins$web_history, # reuse the search stored on the NCBI server
                       rettype = "fasta", # specify the output format
                       retmax = 10, # chunk length
                       retstart = seq_start) # offset of the first record in this chunk
  cat(reqs, file = "neurotoxins.fasta", append = TRUE) # append the chunk to our .fasta file
  cat(seq_start + 10, "sequences downloaded\r") # progress counter
}
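The chunking arithmetic can be checked offline before sending any request. This is a minimal sketch that assumes, following the NCBI E-utilities documentation, that retstart is a 0-based offset; chunk_starts() is a hypothetical helper name introduced here and is not part of rentrez.

```r
# Hypothetical helper (not part of rentrez): 0-based start offsets for
# downloading `n_records` records in chunks of `chunk_size`.
chunk_starts <- function(n_records, chunk_size) {
  if (n_records <= 0) return(integer(0))
  seq(0, n_records - 1, by = chunk_size)
}

# Example with 25 records in chunks of 10: the offsets are 0, 10 and 20.
starts <- chunk_starts(25, 10)
print(starts)

# Number of records each request would return; the last chunk may be shorter
# (here 10, 10 and 5).
chunk_sizes <- pmin(10, 25 - starts)
print(chunk_sizes)
```

Each value in starts would be passed as retstart, with retmax set to the chunk size; the chunk sizes should add up to the total number of records.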
path <- "../Rmd_sorted/neurotoxins.fasta"
neurotoxins.fasta <- ape::read.FASTA(path, type = "AA") # load the protein .fasta file back into R (read.dna() expects nucleotides)
file.remove(path) # if you no longer need this file you can delete it
And that’s all for today.