Multiple sequence alignments
Multiple sequence alingments
Multiple sequence alignments are really useful when you have multiple sequences and you want to find conserved regions that could indicate evolutionary relationship or probable similar structure-function, other application is take an unknown sequence (protein) and group it with others to find conserved motifs within the proteins that could suggest some function.
With msa
package you can use some popular algorithms such as ClustalW, ClustalOmega and Muscle, this package also allows to plot and export alignments.
Importing sequences
First of all you have to install msa
package from Bioconductor with BioManager
package.
installed.packages("BiocManager")
BiocManager::install("msa")
Then load required libraries…
library(msa)
library(Biostrings)
library(tidyverse)
library(seqinr)
library(ape)
And now you can import you sequences with readAAStringSet() or readDNAStringSet() from Biostrings
package.
#import fastafile
path <-"~/Documentos/R/R_personal_website_2.0/content/project/Bioinformatics:rentrez/First50_neurotoxins.fasta"
my_sequences <- readAAStringSet(path)
You can work directly with your sequences or subset them if necessary.
my_sequences <- my_sequences[11:18]
my_sequences
## AAStringSet object of length 8:
## width seq names
## [1] 206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309838.1 sy...
## [2] 206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309839.1 sy...
## [3] 206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309835.1 sy...
## [4] 206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309836.1 sy...
## [5] 206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309837.1 sy...
## [6] 206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309834.1 sy...
## [7] 206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309832.1 sy...
## [8] 206 MAEDADMRNELEEMQRRADQLAD...KADSNKTRIDEANQRATKMLGSG NP_001309833.1 sy...
Aligning with msa
To make an alignment you have to use msa() function and specify the algorithm you want to use. Also you can add gap extension and gap opening penalties as we learned in Pairwise alignments
#clustalW
my_MSA_clustalW <- msa(my_sequences, method = "ClustalW", type = "protein")
## use default substitution matrix
#muscle
my_MSA_muscle <- msa(my_sequences, method = "Muscle", type = "protein")
#clustal_omega
my_MSA_clustal_omega <- msa(my_sequences, method = "ClustalOmega", type = "protein")
## using Gonnet
Computing conservation scores
As you might notice, the last sequence within our alinments is the consensus, you can extract them with msaConsensusSequence() this sequence is useful to evaluate how well aligned are our sequences.
msaConsensusSequence(my_MSA_muscle)
## [1] "MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQGEQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNNQDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRHMALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG"
Other more sensible way to analyze this is using substitution matrices like PAM or BLOSUM, the firsts ones are more used for reconstruct phylogenetic trees whereas the second ones are popular to compare regions between alignments.
To know which of them are better for you, you have you try some of them and take which of them give you the best alignment score, for example with BLOSUM matrices the BLOSUM100 is suitable to compare sequences with almost 100% of similarity, while BLOSUM65 and BLOSUM45 are used to detect weak protein similarities.
data("BLOSUM100") # load substitution matrices from Biostrings package
data("BLOSUM62")
# Compute conservation scores
msaConservationScore(my_MSA_muscle, BLOSUM62)
## M A E D A D M R N E L E E M Q R R A D Q
## 320 256 320 384 256 384 320 320 384 320 256 320 320 320 320 320 320 256 384 320
## L A D E S L E S T R R M L Q L V E E S K
## 256 256 384 320 256 256 320 256 320 320 320 320 256 320 256 256 320 320 256 320
## D A G I R T L V M L D E Q G E Q L E R I
## 384 256 384 256 320 320 256 256 320 256 384 320 320 384 320 320 256 320 320 256
## E E G M D Q I N K D M K E A E K N L T D
## 320 320 384 320 384 320 256 384 320 384 320 320 320 256 320 320 384 256 320 384
## L G K F C G L C V C P C N K L K S S D A
## 256 384 320 384 576 384 256 576 256 576 448 576 384 320 256 320 256 256 384 256
## Y K K A W G N N Q D G V V A S Q P A R V
## 448 320 320 256 704 384 384 384 320 384 384 256 256 256 256 320 448 256 320 256
## V D E R E Q M A I S G G F I R R V T N D
## 256 384 320 320 320 320 320 256 256 256 384 384 384 256 320 320 256 320 384 384
## A R E N E M D E N L E Q V S G I I G N L
## 256 320 320 384 320 320 384 320 384 256 320 320 256 256 384 256 256 384 384 256
## R H M A L D M G N E I D T Q N R Q I D R
## 320 512 320 256 256 384 320 384 384 320 256 384 320 320 384 320 320 256 384 320
## I M E K A D S N K T R I D E A N Q R A T
## 256 320 320 320 256 384 256 384 320 320 320 256 384 320 256 384 320 320 256 320
## K M L G S G
## 320 320 256 384 256 384
# Compute conservation scores
msaConservationScore(my_MSA_muscle, BLOSUM100)
## M A E D A D M R N E L E E M Q R
## 768 512 640 640 512 640 768 640 704 640 512 640 640 768 704 640
## R A D Q L A D E S L E S T R R M
## 640 512 640 704 512 512 640 640 576 512 640 576 576 640 640 768
## L Q L V E E S K D A G I R T L V
## 512 704 512 512 640 640 576 640 640 512 576 512 640 576 512 512
## M L D E Q G E Q L E R I E E G M
## 768 512 640 640 704 576 640 704 512 640 640 512 640 640 576 768
## D Q I N K D M K E A E K N L T D
## 640 704 512 704 640 640 768 640 640 512 640 640 704 512 576 640
## L G K F C G L C V C P C N K L K
## 512 576 640 704 896 576 512 896 512 896 768 896 704 640 512 640
## S S D A Y K K A W G N N Q D G V
## 576 576 640 512 768 640 640 512 1088 576 704 704 704 640 576 512
## V A S Q P A R V V D E R E Q M A
## 512 512 576 704 768 512 640 512 512 640 640 640 640 704 768 512
## I S G G F I R R V T N D A R E N
## 512 576 576 576 704 512 640 640 512 576 704 640 512 640 640 704
## E M D E N L E Q V S G I I G N L
## 640 768 640 640 704 512 640 704 512 576 576 512 512 576 704 512
## R H M A L D M G N E I D T Q N R
## 640 832 768 512 512 640 768 576 704 640 512 640 576 704 704 640
## Q I D R I M E K A D S N K T R I
## 704 512 640 640 512 768 640 640 512 640 576 704 640 576 640 512
## D E A N Q R A T K M L G S G
## 640 640 512 704 704 640 512 576 640 768 512 576 576 576
Making a tree
You can calculate the distances to make a tree or to use other methods to visualize the differences, this time we are going to use ape
package to make easily a tree, in other lesson we are going to learn how to create fancy trees with ggtree
.
#Convert the alignment
my_MSA_muscle_seqnir <- msaConvert(my_MSA_muscle, type = "seqinr::alignment")
#Obtain the distances
d <- dist.alignment(my_MSA_muscle_seqnir, "similarity")
#Create a neighbor joining tree
nj_tree_MSA <- nj(d)
plot(nj_tree_MSA)
Take on mind that in this case all our sequences are the same, so they will occupy the same level on the tree.
Explorting the alignment
You can inspect the whole alignment using show
argument from print() function.
#see alignment
print(my_MSA_muscle, show="complete") #to show a complete ugly alingment
##
## MsaAAMultipleAlignment with 8 rows and 206 columns
## aln (1..54) names
## [1] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309838.1 sy...
## [2] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309839.1 sy...
## [3] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309835.1 sy...
## [4] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309836.1 sy...
## [5] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309837.1 sy...
## [6] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309834.1 sy...
## [7] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309832.1 sy...
## [8] MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG NP_001309833.1 sy...
## Con MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQG Consensus
##
## aln (55..108) names
## [1] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309838.1 sy...
## [2] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309839.1 sy...
## [3] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309835.1 sy...
## [4] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309836.1 sy...
## [5] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309837.1 sy...
## [6] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309834.1 sy...
## [7] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309832.1 sy...
## [8] EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN NP_001309833.1 sy...
## Con EQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNN Consensus
##
## aln (109..162) names
## [1] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309838.1 sy...
## [2] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309839.1 sy...
## [3] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309835.1 sy...
## [4] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309836.1 sy...
## [5] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309837.1 sy...
## [6] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309834.1 sy...
## [7] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309832.1 sy...
## [8] QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH NP_001309833.1 sy...
## Con QDGVVASQPARVVDEREQMAISGGFIRRVTNDARENEMDENLEQVSGIIGNLRH Consensus
##
## aln (163..206) names
## [1] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309838.1 sy...
## [2] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309839.1 sy...
## [3] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309835.1 sy...
## [4] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309836.1 sy...
## [5] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309837.1 sy...
## [6] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309834.1 sy...
## [7] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309832.1 sy...
## [8] MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG NP_001309833.1 sy...
## Con MALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG Consensus
Or you can export it on .fasta format with writeXStringSet() function, If do not want to include the consensus use unmasked() function too.
#export alingment (without consensus)
writeXStringSet(unmasked(my_MSA_muscle), file="my_MSA_alingment.fasta")
Also you can export it on PDF like this.
#export aligment in PDF
msaPrettyPrint(my_MSA_clustalW, output="pdf", showLogo="none", askForOverwrite=FALSE, verbose=FALSE)
## Image
## colorMode : Color
## storage.mode : double
## dim : 1497 148 4
## frames.total : 4
## frames.render: 1
##
## imageData(object)[1:5,1:6,1]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 1 1 1 1 1
## [2,] 1 1 1 1 1 1
## [3,] 1 1 1 1 1 1
## [4,] 1 1 1 1 1 1
## [5,] 1 1 1 1 1 1
Finally, if you are only interested in using a muscle alignment you should try muscle
package.
BiocManager::install("muscle")
library(muscle)
my_muscle <- muscle(my_sequences,quiet = T)
print(my_muscle)
## AAMultipleAlignment with 8 rows and 206 columns
## aln names
## [1] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309838.1 sy...
## [2] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309839.1 sy...
## [3] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309835.1 sy...
## [4] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309836.1 sy...
## [5] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309837.1 sy...
## [6] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309834.1 sy...
## [7] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309832.1 sy...
## [8] MAEDADMRNELEEMQRRADQLADESL...MEKADSNKTRIDEANQRATKMLGSG NP_001309833.1 sy...