Pairwise alignments.
Why align two sequences?
When you are working with biological sequences no matter if they are short sequences with a few hundred of bases, aminoacid or big genomic dataset, usually you are interested in finding a pattern, a relationship or a biological function.
If two sequences are similar, the protein translated or the protein structure might be similar too, and if their structure is similar probably their function as well, for this reason when you are comparing whether two o more sequences the first step is align them and then comparing.
Pairwise sequence aligment
As the name indicate, this kind of alignment can align only two sequences and there are two main types, Global alignments and local alignments.
Global alignaments
They are useful when the sequences have a high similarity and they have the same or almost the same length, this algorithm was created by Needleman-Wunsch and its purpose is compare two sequences along their full length and find the best alignment possible.
Note: Previously you must have had it installed Biostring
using Bioconductor
.
install.packages("BiocManager")
BiocManager::install("Biostrings")
library(Biostrings)
Needleman-Wunsch algorithm
On the next block of code we have three hypothetical sequences, two pretty similar whereas the last one is very different.
sequence1 <- "CAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGATA"
sequence2 <- "GAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGATA"
sequence3 <- "GATTAATTAATTACAACACAACACACACACACCTTACTTCTTACACACACACACA"
Now we are going to compare two pairwise alignment. Note how the alignment score is greater if two sequences are similar, this score depends of the values of gaps and mismatch penalties.
Biostrings::pairwiseAlignment(sequence1,sequence2, type = "global")
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: CAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGATA
## subject: GAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGATA
## score: 101.1156
Biostrings::pairwiseAlignment(sequence2,sequence3, type = "global")
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: GAAT--TCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGA-ATTCGGATA
## subject: GATTAATTAATTACAACACAACACACACACACCTT-ACTTCT--TACACACACACACA
## score: -173.7369
You can change the value of this penalties to make more strict or more flexible you alignment. To do it you can use nucleotideSubstitutionMatrix() function.
First, give a value to the penalties.
#penalties
gapOpen <- 2 #penaltie for open gap
gapExtend <- 1 #penaltie for extended gap
match = 1 # match value
mismatch = -1 #penaltie for mismatches
Then create a substitution matrix
myScoringMat <- nucleotideSubstitutionMatrix(match = match,
mismatch = mismatch,
type = "DNA",
baseOnly = T)
Now, in this part of the example we are going to work with real sequences, so before we have to import some sequences, select two of them and make the pairwise alignment without the substitution matrix.
my_seqs <- readDNAStringSet("neurotoxins.fasta",format = "fasta")
my_alignment <- Biostrings::pairwiseAlignment(my_seqs[1],
my_seqs[2],
type = "global")
my_alignment
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: CTGGAATGCCACGACCAGCAATCATCGCAAGCTC...TGTTGCACAACCGACCGCTGCAACAACTGATAAA
## subject: CTGGAATGCCACGACCAGCAATCATCGGAAACTC...TGTTGCACAACCGACCGCTGCAACAACTGATAAA
## score: 366.7162
And now including our substitution matrix and gap penalties your alignment will be more strict.
my_alignment_SubMx <- Biostrings::pairwiseAlignment(my_seqs[1],
my_seqs[2],
type = "global",
substitutionMatrix = myScoringMat,
gapOpening = gapOpen,
gapExtension = gapExtend)
my_alignment_SubMx
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: CTGGAATGCCACGACCAGCAATCATCGCAAGCTC...TGTTGCACAACCGACCGCTGCAACAACTGATAAA
## subject: CTGGAATGCCACGACCAGCAATCATCGGAAACTC...TGTTGCACAACCGACCGCTGCAACAACTGATAAA
## score: 189
Local alignments
Smith-Waterman algotithm
This kind of alignment is useful when two sequences have strong dissimilarity but we suspect that this sequences have common regions.
sequence1 <- "ATATATCACGGCGTACGTATAGGATGCAGATGC"
sequence2 <- "ATATATGCGACTAGGATATAGCGAAGGCTCGAT"
my_local_alignment_SubMx <- Biostrings::pairwiseAlignment(sequence1,
sequence2,
type = "local",
substitutionMatrix = myScoringMat,
gapOpening = gapOpen,
gapExtension = gapExtend)
my_local_alignment_SubMx
## Local PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: [1] ATATAT
## subject: [1] ATATAT
## score: 6
Sometimes more than one alignment are possible, in those case you can use summary() function to compare them.
summary(my_alignment)
## Global Single Subject Pairwise Alignments
## Number of Alignments: 1
##
## Scores:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 366.7 366.7 366.7 366.7 366.7 366.7
##
## Number of matches:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 191 191 191 191 191 191
##
## Top 2 Mismatch Counts:
## SubjectPosition Subject Pattern Count Probability
## 1 28 G C 1 1
## 2 31 A G 1 1
#In this case there is only one alignment
Extracting more information about your alingments
Apart of summary() function you can obtain more information with e score(), nedit(), nmatch()) and nmismatch().
my_global_alignment_SubMx <- Biostrings::pairwiseAlignment(sequence1,
sequence2,
type = "global",
substitutionMatrix = myScoringMat,
gapOpening = gapOpen,
gapExtension = gapExtend)
my_global_alignment_SubMx
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: ATATATCACGGCGTACG-TATAGG--ATGCA-GATGC
## subject: ATATATG-CGAC-TAGGATATAGCGAAGGCTCGAT--
## score: -3
Extracting just the score values.
score(my_global_alignment_SubMx)
## [1] -3
Computing Levenshtein distance, this is other way to evaluate how similar are two sequences, and it is useful to compare between alignments.
nedit(my_global_alignment_SubMx)
## [1] 12
nedit(my_alignment)
## [1] 2
Extracting the number of matchs
nmatch(my_global_alignment_SubMx)
## [1] 23
Extracting the number of mismatchs
nmismatch(my_global_alignment_SubMx)
## [1] 6
This function Computes the length of “gapped” sequences.
nchar(my_global_alignment_SubMx)
## [1] 35