R tidyverse: stringr Part two (regular expressions)

On the previous lecture “R tydiverse: stringr part one (Basics)” we learnt how to know the length of a string, sort, convert case and match some basic patterns, this time we are going to learn how to match more complex patterns, working with biological sequences and with multiple objects(in this case sequences) at the same time.

First of all, we are going to load tidyverse packages which has embedded the stringr package.

library(tidyverse)

Now, we are going to import a .fasta file with the function read.fasta() from seqinr package, this package need to be installed if you have not already installed.

if (!require("seqinr")){
        install.packages("seqinir")
        library(seqinr)
}else {
        library(seqinr)
}

#And then we read the .fasta file

my_seqs <- read.fasta("~/Documentos/practical computing for biologist/pcfb_examples/pcfb/examples/FPexamples.fta") #

The first thing we are going to do is extract the sequence names using attributes() function.

seq_names <- attributes(my_seqs)

And the then inspect the structure of “my_seqs” object where there are all our sequences with str() function, this fiction will shows that our object is a list of five elements (sequences) and the length of each one.

str(my_seqs)
## List of 5
##  $ CAA58790.1=: 'SeqFastadna' chr [1:238] "m" "s" "k" "g" ...
##   ..- attr(*, "name")= chr "CAA58790.1="
##   ..- attr(*, "Annot")= chr ">CAA58790.1= green fluorescent protein [Aequorea victoria]"
##  $ AAZ67342.1=: 'SeqFastadna' chr [1:221] "m" "s" "l" "s" ...
##   ..- attr(*, "name")= chr "AAZ67342.1="
##   ..- attr(*, "Annot")= chr ">AAZ67342.1= GFP-like red fluorescent protein [Corynactis californica]"
##  $ ACX47247.1=: 'SeqFastadna' chr [1:262] "m" "e" "f" "e" ...
##   ..- attr(*, "name")= chr "ACX47247.1="
##   ..- attr(*, "Annot")= chr ">ACX47247.1= green fluorescent protein [Haeckelia beehleri]"
##  $ ABC68474.1=: 'SeqFastadna' chr [1:236] "m" "r" "s" "s" ...
##   ..- attr(*, "name")= chr "ABC68474.1="
##   ..- attr(*, "Annot")= chr ">ABC68474.1= red fluorescent protein [Discosoma sp. RC-2004]"
##  $ AAQ01183.1=: 'SeqFastadna' chr [1:222] "m" "p" "a" "m" ...
##   ..- attr(*, "name")= chr "AAQ01183.1="
##   ..- attr(*, "Annot")= chr ">AAQ01183.1= green fluorescent protein 1 [Pontellina plumata]"

Also we can use $ symbol to inspect inside a list and use head() function to see the first six elements, where each letter represent an aminoacid.

head(my_seqs$`CAA58790.1=`, n =  6)
## [1] "m" "s" "k" "g" "e" "e"

As you saw previously, each aminoacid is interpreted as a sigle string surrounded by " “, if we want to use stringrfunctions to match patterns in our sequences then we need to collapse each of them. To do it, remember that you sequences are stored in a list, and we want to do the same operation with all of them. For this reason we are goning to map() function from purrr package, this is a really good way to make the same thing iteratively with different elements.

So, we are going to use str_flatten() inside map() function to collapse all the sequences at same time, indicating that we want to collapse the quotes "".

my_seqs <- my_seqs %>% map(.,~str_flatten(.,collapse = "")) 
my_seqs$`CAA58790.1=`
## [1] "mskgeelftgvvpilveldgdvngqkfsvrgegegdatygkltlkficttgklpvpwptlvttfsygvqcfsrypdhmkqhdflksampegyvqertifykddgnyktraevkfegdtlvnrielkgidfkedgnilghkmeynynshnvyimgdkpkngikvnfkirhnikdgsvqladhyqqntpigdgpvllpdnhylstqsalsqdphgkrdhmvllefvtsagithgmdelyk"

Usually, aminoacids are represented with letter in uppercase.

my_seqs <- my_seqs %>% map(.,str_to_upper)
my_seqs$`CAA58790.1=`
## [1] "MSKGEELFTGVVPILVELDGDVNGQKFSVRGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFLKSAMPEGYVQERTIFYKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKMEYNYNSHNVYIMGDKPKNGIKVNFKIRHNIKDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSQDPHGKRDHMVLLEFVTSAGITHGMDELYK"

Now we are ready to find patterns in our sequences.

Search a term in a vector

You can find any possible pattern using regular expressions, they add flexibility and power to your searches, the easiest way to detect a pattern is surround it using quotes "" and str_detect() function, the output of this function is a Boolean (TRUE or FALSE), for example, imagine that you need to know if your sequence has this dipeptide “LL”:

# Store the firts sequece in an independent object to experiment
# and find patterns
first_seq <- my_seqs$`CAA58790.1=`
str_detect(first_seq, "LL") #mach the tripeptide

At this moment you know that our firts sequence has at least one “LL”, to know exactly whether there is just one ore more you can use str_count() function.

str_count(my_seqs$`CAA58790.1=`,"LL") #There are two "LL"
## [1] 2

But… Where are they?, which are their position in the sequences?, you can str_locale_all() to find them and str_sub() to confirm it.

startend
194195
220221
str_locate_all(my_seqs$`CAA58790.1=`,"LL")
str_sub(my_seqs$`CAA58790.1=`,194,195)
## [1] "LL"

Working with multiple sequences at the same time.

On the next exercises we are going evaluate if there is some particular pattern in our sequences using regular expression, and map() function from purrr package to work on multiple sequences at same time.

When you want to detect certain pattern regular expressions are your best option, for example, you want to know how many “SAM” or “SAL” are in your sequences,so you can use | to precise you want that pattern, take on mind they both start with the same two character and only the last one is not shared.

my_counts_SA <- my_seqs %>% map_chr(., ~str_count(.,pattern = "SA(M|L)"))
# map_chr() make the same work than map but its output is a string
# with the results.

my_counts_SA
## CAA58790.1= AAZ67342.1= ACX47247.1= ABC68474.1= AAQ01183.1= 
##         "2"         "0"         "0"         "0"         "0"
# There are only 2 SAM or SAL in the first sequence and the others
# don't even have one.

The second regular expression we are going to work is the [] and its functions is any character inside, now think you want to count all “G”, “L”, “Q” and “E” aminoacids without matter their order or position.

my_counts_GLQE <- my_seqs %>% map_chr(., ~str_count(., pattern = "[GLQE]"))

my_counts_GLQE
## CAA58790.1= AAZ67342.1= ACX47247.1= ABC68474.1= AAQ01183.1= 
##        "66"        "58"        "70"        "66"        "57"

Now imagine that you want to now what is behind some pattern such as “PV” followed by any other character. To achieve it we are going to use the regular expression +

coordinate <- my_seqs$`CAA58790.1=` %>% str_locate_all(.,"PV+") %>% as.data.frame()

coordinate <- mutate(coordinate, end = str_length(my_seqs$`CAA58790.1=`)) %>% slice(.,1) %>% as.character()
#see dplyr guide if you don't understand mutate() function

The purpose of slide() function is subset the first coordinate in case there are more than on match, then we can use str_sub() as we did before.

str_sub(my_seqs$`CAA58790.1=`,start = coordinate[1], end = coordinate[2])
## [1] "PVPWPTLVTTFSYGVQCFSRYPDHMKQHDFLKSAMPEGYVQERTIFYKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKMEYNYNSHNVYIMGDKPKNGIKVNFKIRHNIKDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSQDPHGKRDHMVLLEFVTSAGITHGMDELYK"

There are other regular expressions such as ^ and $ to find patterns at the beginning or at the end respectively, or \s to find spaces, \d to match digits, \t to match tabs and more, but these are stuffs you should try by yourself. on the other hand other useful functions are str_slipt() to slit strings and str_delete() to delete certain pattern.

You can find more information here about the use of stringr package and regular expressions.

Diego Sierra Ramírez
Diego Sierra Ramírez
Msc. in Biological Science / Data analyst

Related