Multiple sequence alignments

Multiple sequence alingments

Multiple sequence alignments are really useful when you have multiple sequences and you want to find conserved regions that could indicate evolutionary relationship or probable similar structure-function, other application is take an unknown sequence (protein) and group it with others to find conserved motifs within the proteins that could suggest some function.

With msa package you can use some popular algorithms such as ClustalW, ClustalOmega and Muscle, this package also allows to plot and export alignments.

Importing sequences

First of all you have to install msa package from Bioconductor with BioManager package.

installed.packages("BiocManager")
BiocManager::install("msa")

Then load required libraries…

library(msa)
library(Biostrings)
library(tidyverse)
library(seqinr)
library(ape)

And now you can import you sequences with readAAStringSet() or readDNAStringSet() from Biostrings package.

#import fastafile
path <-"~/Documentos/R/R_personal_website_2.0/content/project/Bioinformatics:rentrez/First50_neurotoxins.fasta"

my_sequences <- readAAStringSet(path)

You can work directly with your sequences or subset them if necessary.

my_sequences <- my_sequences[11:18]
my_sequences
## AAStringSet object of length 8:
##     width seq                                               names               
## [1]   206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309838.1 sy...
## [2]   206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309839.1 sy...
## [3]   206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309835.1 sy...
## [4]   206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309836.1 sy...
## [5]   206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309837.1 sy...
## [6]   206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309834.1 sy...
## [7]   206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309832.1 sy...
## [8]   206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309833.1 sy...

Aligning with msa

To make an alignment you have to use msa() function and specify the algorithm you want to use. Also you can add gap extension and gap opening penalties as we learned in Pairwise alignments

#clustalW
my_MSA_clustalW <- msa(my_sequences, method = "ClustalW", type = "protein") 
## use default substitution matrix
#muscle
my_MSA_muscle <- msa(my_sequences, method = "Muscle", type = "protein") 

#clustal_omega
my_MSA_clustal_omega <- msa(my_sequences, method = "ClustalOmega", type = "protein") 
## using Gonnet

Computing conservation scores

As you might notice, the last sequence within our alinments is the consensus, you can extract them with msaConsensusSequence() this sequence is useful to evaluate how well aligned are our sequences.

msaConsensusSequence(my_MSA_muscle) 
## [1] "MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQGEQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNNQDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRHMALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG"

Other more sensible way to analyze this is using substitution matrices like PAM or BLOSUM, the firsts ones are more used for reconstruct phylogenetic trees whereas the second ones are popular to compare regions between alignments.

To know which of them are better for you, you have you try some of them and take which of them give you the best alignment score, for example with BLOSUM matrices the BLOSUM100 is suitable to compare sequences with almost 100% of similarity, while BLOSUM65 and BLOSUM45 are used to detect weak protein similarities.

data("BLOSUM100") # load substitution matrices from Biostrings package
data("BLOSUM62")
# Compute conservation scores
msaConservationScore(my_MSA_muscle, BLOSUM62)
##   M   A   E   D   A   D   M   R   N   E   L   E   E   M   Q   R   R   A   D   Q 
## 320 256 320 384 256 384 320 320 384 320 256 320 320 320 320 320 320 256 384 320 
##   L   A   D   E   S   L   E   S   T   R   R   M   L   Q   L   V   E   E   S   K 
## 256 256 384 320 256 256 320 256 320 320 320 320 256 320 256 256 320 320 256 320 
##   D   A   G   I   R   T   L   V   M   L   D   E   Q   G   E   Q   L   E   R   I 
## 384 256 384 256 320 320 256 256 320 256 384 320 320 384 320 320 256 320 320 256 
##   E   E   G   M   D   Q   I   N   K   D   M   K   E   A   E   K   N   L   T   D 
## 320 320 384 320 384 320 256 384 320 384 320 320 320 256 320 320 384 256 320 384 
##   L   G   K   F   C   G   L   C   V   C   P   C   N   K   L   K   S   S   D   A 
## 256 384 320 384 576 384 256 576 256 576 448 576 384 320 256 320 256 256 384 256 
##   Y   K   K   A   W   G   N   N   Q   D   G   V   V   A   S   Q   P   A   R   V 
## 448 320 320 256 704 384 384 384 320 384 384 256 256 256 256 320 448 256 320 256 
##   V   D   E   R   E   Q   M   A   I   S   G   G   F   I   R   R   V   T   N   D 
## 256 384 320 320 320 320 320 256 256 256 384 384 384 256 320 320 256 320 384 384 
##   A   R   E   N   E   M   D   E   N   L   E   Q   V   S   G   I   I   G   N   L 
## 256 320 320 384 320 320 384 320 384 256 320 320 256 256 384 256 256 384 384 256 
##   R   H   M   A   L   D   M   G   N   E   I   D   T   Q   N   R   Q   I   D   R 
## 320 512 320 256 256 384 320 384 384 320 256 384 320 320 384 320 320 256 384 320 
##   I   M   E   K   A   D   S   N   K   T   R   I   D   E   A   N   Q   R   A   T 
## 256 320 320 320 256 384 256 384 320 320 320 256 384 320 256 384 320 320 256 320 
##   K   M   L   G   S   G 
## 320 320 256 384 256 384
# Compute conservation scores
msaConservationScore(my_MSA_muscle, BLOSUM100)
##    M    A    E    D    A    D    M    R    N    E    L    E    E    M    Q    R 
##  768  512  640  640  512  640  768  640  704  640  512  640  640  768  704  640 
##    R    A    D    Q    L    A    D    E    S    L    E    S    T    R    R    M 
##  640  512  640  704  512  512  640  640  576  512  640  576  576  640  640  768 
##    L    Q    L    V    E    E    S    K    D    A    G    I    R    T    L    V 
##  512  704  512  512  640  640  576  640  640  512  576  512  640  576  512  512 
##    M    L    D    E    Q    G    E    Q    L    E    R    I    E    E    G    M 
##  768  512  640  640  704  576  640  704  512  640  640  512  640  640  576  768 
##    D    Q    I    N    K    D    M    K    E    A    E    K    N    L    T    D 
##  640  704  512  704  640  640  768  640  640  512  640  640  704  512  576  640 
##    L    G    K    F    C    G    L    C    V    C    P    C    N    K    L    K 
##  512  576  640  704  896  576  512  896  512  896  768  896  704  640  512  640 
##    S    S    D    A    Y    K    K    A    W    G    N    N    Q    D    G    V 
##  576  576  640  512  768  640  640  512 1088  576  704  704  704  640  576  512 
##    V    A    S    Q    P    A    R    V    V    D    E    R    E    Q    M    A 
##  512  512  576  704  768  512  640  512  512  640  640  640  640  704  768  512 
##    I    S    G    G    F    I    R    R    V    T    N    D    A    R    E    N 
##  512  576  576  576  704  512  640  640  512  576  704  640  512  640  640  704 
##    E    M    D    E    N    L    E    Q    V    S    G    I    I    G    N    L 
##  640  768  640  640  704  512  640  704  512  576  576  512  512  576  704  512 
##    R    H    M    A    L    D    M    G    N    E    I    D    T    Q    N    R 
##  640  832  768  512  512  640  768  576  704  640  512  640  576  704  704  640 
##    Q    I    D    R    I    M    E    K    A    D    S    N    K    T    R    I 
##  704  512  640  640  512  768  640  640  512  640  576  704  640  576  640  512 
##    D    E    A    N    Q    R    A    T    K    M    L    G    S    G 
##  640  640  512  704  704  640  512  576  640  768  512  576  576  576

Making a tree

You can calculate the distances to make a tree or to use other methods to visualize the differences, this time we are going to use ape package to make easily a tree, in other lesson we are going to learn how to create fancy trees with ggtree.

#Convert the alignment
my_MSA_muscle_seqnir <- msaConvert(my_MSA_muscle, type = "seqinr::alignment")

#Obtain the distances
d <- dist.alignment(my_MSA_muscle_seqnir, "similarity") 

#Create a neighbor joining tree


nj_tree_MSA <- nj(d)
plot(nj_tree_MSA)

Take on mind that in this case all our sequences are the same, so they will occupy the same level on the tree.

Explorting the alignment

You can inspect the whole alignment using show argument from print() function.

#see alignment
print(my_MSA_muscle, show="complete") #to show a complete ugly alingment
## 
## MsaAAMultipleAlignment with 8 rows and 206 columns
##     aln (1..54)                                            names
## [1] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309838.1 sy...
## [2] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309839.1 sy...
## [3] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309835.1 sy...
## [4] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309836.1 sy...
## [5] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309837.1 sy...
## [6] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309834.1 sy...
## [7] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309832.1 sy...
## [8] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309833.1 sy...
## Con MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG Consensus 
## 
##     aln (55..108)                                          names
## [1] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309838.1 sy...
## [2] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309839.1 sy...
## [3] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309835.1 sy...
## [4] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309836.1 sy...
## [5] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309837.1 sy...
## [6] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309834.1 sy...
## [7] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309832.1 sy...
## [8] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309833.1 sy...
## Con EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN Consensus 
## 
##     aln (109..162)                                         names
## [1] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309838.1 sy...
## [2] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309839.1 sy...
## [3] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309835.1 sy...
## [4] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309836.1 sy...
## [5] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309837.1 sy...
## [6] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309834.1 sy...
## [7] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309832.1 sy...
## [8] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309833.1 sy...
## Con QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH Consensus 
## 
##     aln (163..206)                               names
## [1] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309838.1 sy...
## [2] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309839.1 sy...
## [3] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309835.1 sy...
## [4] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309836.1 sy...
## [5] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309837.1 sy...
## [6] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309834.1 sy...
## [7] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309832.1 sy...
## [8] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309833.1 sy...
## Con MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG Consensus

Or you can export it on .fasta format with writeXStringSet() function, If do not want to include the consensus use unmasked() function too.

#export alingment (without consensus)
writeXStringSet(unmasked(my_MSA_muscle), file="my_MSA_alingment.fasta")

Also you can export it on PDF like this.

#export aligment in PDF
msaPrettyPrint(my_MSA_clustalW, output="pdf", showLogo="none", askForOverwrite=FALSE, verbose=FALSE) 
## Image 
##   colorMode    : Color 
##   storage.mode : double 
##   dim          : 1497 148 4 
##   frames.total : 4 
##   frames.render: 1 
## 
## imageData(object)[1:5,1:6,1]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    1    1    1    1
## [2,]    1    1    1    1    1    1
## [3,]    1    1    1    1    1    1
## [4,]    1    1    1    1    1    1
## [5,]    1    1    1    1    1    1

Finally, if you are only interested in using a muscle alignment you should try muscle package.

BiocManager::install("muscle")
library(muscle)
my_muscle <- muscle(my_sequences,quiet = T)
print(my_muscle)
## AAMultipleAlignment with 8 rows and 206 columns
##      aln                                                    names               
## [1] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309838.1 sy...
## [2] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309839.1 sy...
## [3] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309835.1 sy...
## [4] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309836.1 sy...
## [5] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309837.1 sy...
## [6] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309834.1 sy...
## [7] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309832.1 sy...
## [8] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309833.1 sy...
Diego Sierra Ramírez
Diego Sierra Ramírez
Msc. in Biological Science / Data analyst

Related