Pairwise alignments.

Last updated on Mar 30, 2022

Why align two sequences?

When you are working with biological sequences no matter if they are short sequences with a few hundred of bases, aminoacid or big genomic dataset, usually you are interested in finding a pattern, a relationship or a biological function.

If two sequences are similar, the protein translated or the protein structure might be similar too, and if their structure is similar probably their function as well, for this reason when you are comparing whether two o more sequences the first step is align them and then comparing.

Pairwise sequence aligment

As the name indicate, this kind of alignment can align only two sequences and there are two main types, Global alignments and local alignments.

Global alignaments

They are useful when the sequences have a high similarity and they have the same or almost the same length, this algorithm was created by Needleman-Wunsch and its purpose is compare two sequences along their full length and find the best alignment possible.

Note: Previously you must have had it installed Biostring using Bioconductor.

install.packages("BiocManager")
BiocManager::install("Biostrings")

library(Biostrings)

Needleman-Wunsch algorithm

On the next block of code we have three hypothetical sequences, two pretty similar whereas the last one is very different.

sequence1 <- "CAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGATA"
sequence2 <- "GAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGATA"
sequence3 <- "GATTAATTAATTACAACACAACACACACACACCTTACTTCTTACACACACACACA"

Now we are going to compare two pairwise alignment. Note how the alignment score is greater if two sequences are similar, this score depends of the values of gaps and mismatch penalties.

Biostrings::pairwiseAlignment(sequence1,sequence2, type = "global")

## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: CAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGATA
## subject: GAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGATA
## score: 101.1156

Biostrings::pairwiseAlignment(sequence2,sequence3, type = "global")

## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: GAAT--TCGGCTAGAATTCGGCTAGAATTCGGCTAGAATTCGGCTAGA-ATTCGGATA
## subject: GATTAATTAATTACAACACAACACACACACACCTT-ACTTCT--TACACACACACACA
## score: -173.7369

You can change the value of this penalties to make more strict or more flexible you alignment. To do it you can use nucleotideSubstitutionMatrix() function.

First, give a value to the penalties.

#penalties
gapOpen <- 2 #penaltie for open gap
gapExtend <- 1 #penaltie for extended gap
match = 1 # match value
mismatch = -1 #penaltie for mismatches

Then create a substitution matrix

myScoringMat <- nucleotideSubstitutionMatrix(match = match, 
                                             mismatch = mismatch,
                                             type = "DNA",
                                             baseOnly = T)

Now, in this part of the example we are going to work with real sequences, so before we have to import some sequences, select two of them and make the pairwise alignment without the substitution matrix.

my_seqs <- readDNAStringSet("neurotoxins.fasta",format = "fasta")

my_alignment <- Biostrings::pairwiseAlignment(my_seqs[1],
                                              my_seqs[2],
                                              type = "global") 
my_alignment

## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: CTGGAATGCCACGACCAGCAATCATCGCAAGCTC...TGTTGCACAACCGACCGCTGCAACAACTGATAAA
## subject: CTGGAATGCCACGACCAGCAATCATCGGAAACTC...TGTTGCACAACCGACCGCTGCAACAACTGATAAA
## score: 366.7162

And now including our substitution matrix and gap penalties your alignment will be more strict.

my_alignment_SubMx <- Biostrings::pairwiseAlignment(my_seqs[1],
                                              my_seqs[2], 
                                              type = "global",
                                              substitutionMatrix = myScoringMat, 
                                              gapOpening = gapOpen,
                                              gapExtension = gapExtend) 
my_alignment_SubMx

## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: CTGGAATGCCACGACCAGCAATCATCGCAAGCTC...TGTTGCACAACCGACCGCTGCAACAACTGATAAA
## subject: CTGGAATGCCACGACCAGCAATCATCGGAAACTC...TGTTGCACAACCGACCGCTGCAACAACTGATAAA
## score: 189

Local alignments

Smith-Waterman algotithm

This kind of alignment is useful when two sequences have strong dissimilarity but we suspect that this sequences have common regions.

sequence1 <- "ATATATCACGGCGTACGTATAGGATGCAGATGC"
sequence2 <- "ATATATGCGACTAGGATATAGCGAAGGCTCGAT"

my_local_alignment_SubMx <- Biostrings::pairwiseAlignment(sequence1,
                                                    sequence2, 
                                              type = "local",
                                              substitutionMatrix = myScoringMat, 
                                              gapOpening = gapOpen,
                                              gapExtension = gapExtend) 
my_local_alignment_SubMx

## Local PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: [1] ATATAT
## subject: [1] ATATAT
## score: 6

Sometimes more than one alignment are possible, in those case you can use summary() function to compare them.

summary(my_alignment)

## Global Single Subject Pairwise Alignments
## Number of Alignments:  1
## 
## Scores:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   366.7   366.7   366.7   366.7   366.7   366.7 
## 
## Number of matches:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     191     191     191     191     191     191 
## 
## Top 2 Mismatch Counts:
##   SubjectPosition Subject Pattern Count Probability
## 1              28       G       C     1           1
## 2              31       A       G     1           1

#In this case there is only one alignment

Extracting more information about your alingments

Apart of summary() function you can obtain more information with e score(), nedit(), nmatch()) and nmismatch().

my_global_alignment_SubMx <- Biostrings::pairwiseAlignment(sequence1,
                                                    sequence2, 
                                              type = "global",
                                              substitutionMatrix = myScoringMat, 
                                              gapOpening = gapOpen,
                                              gapExtension = gapExtend) 
my_global_alignment_SubMx

## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: ATATATCACGGCGTACG-TATAGG--ATGCA-GATGC
## subject: ATATATG-CGAC-TAGGATATAGCGAAGGCTCGAT--
## score: -3

Extracting just the score values.

score(my_global_alignment_SubMx)

## [1] -3

Computing Levenshtein distance, this is other way to evaluate how similar are two sequences, and it is useful to compare between alignments.

nedit(my_global_alignment_SubMx)

## [1] 12

nedit(my_alignment)

## [1] 2

Extracting the number of matchs

nmatch(my_global_alignment_SubMx)

## [1] 23

Extracting the number of mismatchs

nmismatch(my_global_alignment_SubMx)

## [1] 6

This function Computes the length of “gapped” sequences.

nchar(my_global_alignment_SubMx)

## [1] 35

R Bioinformatics