Alignments

Sequence Alignment
Interactive
Structure based
Sequences Alignment Program
Strap
Bioinformatics - Sequence Alignment
HS14 | UniBas | JCW
Match
Mismatch
Indel
Gap
Seq1 : TTGCACGGCTTGGTCCA-GTGCGGTTTAC
       ||||||||||x|||||| ||||    |||
Seq2 : TTGCACGGCTCGGTCCACGTGC----TAC
Seq1 : .................-...........
Seq2 : ..........C...........----...
Cons : TTGCACGGCTtGGTCCAcGTGCggttTAC
* Indels: insertions and deletions
Alignment
The process of lining up two or more sequences to achieve maximal levels
of identity (and conservation, in the case of amino acid sequences) for the
purpose of assessing the degree of similarity and the possibility of
homology.
Identity
The extent to which two (nucleotide or amino acid) sequences are invariant.
Similarity
The extent to which nucleotide or protein sequences are related. The extent
of similarity between two sequences can be based on percent sequence identity
and/or conservation. In BLAST similarity refers to a positive matrix score.
Homology
Similarity attributed to descent from a common ancestor.
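The terms match, mismatch, gap and identity can be made concrete with a small sketch. It classifies the columns of the Seq1/Seq2 alignment shown earlier and derives a percent identity. The function names are illustrative, and dividing by the full alignment length (gap columns included) is an assumption; other conventions, such as dividing by the shorter sequence length, are also in use.

```python
def classify_columns(seq1: str, seq2: str, gap: str = "-") -> dict:
    """Count match, mismatch and gap columns of a pairwise alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            counts["gap"] += 1        # indel column
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


def percent_identity(seq1: str, seq2: str) -> float:
    """Identical columns as a percentage of the alignment length (assumed convention)."""
    return 100.0 * classify_columns(seq1, seq2)["match"] / len(seq1)


seq1 = "TTGCACGGCTTGGTCCA-GTGCGGTTTAC"
seq2 = "TTGCACGGCTCGGTCCACGTGC----TAC"
counts = classify_columns(seq1, seq2)
identity = percent_identity(seq1, seq2)
print(counts, round(identity, 1))
```

For this alignment the sketch counts 23 matches, 1 mismatch and 5 gap columns over 29 columns.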
Homolog
• A gene related to a second gene by descent from a common ancestral DNA sequence. The term
homolog may apply to the relationship between genes separated by the event of speciation (see
ortholog) or to the relationship between genes separated by the event of genetic duplication (see
paralog).
Ortholog
• Orthologs are genes in different species that evolved from a common ancestral gene by
speciation. Normally, orthologs retain the same function in the course of evolution. Identification
of orthologs is critical for reliable prediction of gene function in newly sequenced genomes. (See
also Paralogs.)
Speciation
• Speciation is the origin of a new species capable of making a living in a new way from the
species from which it arose. As part of this process it has also acquired some barrier to genetic
exchange with the parent species.
Paralog
• Paralogs are genes related by duplication within a genome. Orthologs retain the same function in
the course of evolution, whereas paralogs evolve new functions, even if these are related to the
original one.
[Figure: dot plot of seq1 against seq2, next to the corresponding alignment of seq1 and seq2.]
A C A T G C A G T A
?
A C A T G C A G T A
[Figures: dot plot of ACATGCAGTA against itself, shown unfiltered and filtered with thresholds > 1 and > 2.]
A C A T G C A G T A
| | | | | | | | | |
A C A T G C A G T A
A C A T G C A G T A
?
C A G T A A C A T G
[Figure: dot plot of ACATGCAGTA against CAGTAACATG.]
A C A T G C A G T A
| | | | |
C A G T A A C A T G
Sequence alignment - Dotplot
A C A T G C A G T A
?
A C A T G C A G T A
[Figures: dot plot of ACACGCGCGA against itself.]
Dot plots are two-dimensional plots in which the x-axis
and y-axis each represent a sequence, and the plot itself
compares the two sequences by a score calculated for each
pair of positions. If a window of fixed size on one sequence
(one axis) matches the other sequence, a dot is drawn in the
plot. Dot plots are one of the oldest methods for comparing
two sequences [Maizel and Lenk, 1981].
Unlike simple sequence alignments, dot plots can be
a very useful tool for spotting various evolutionary
events which may have happened to the sequences of
interest.
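The window-and-threshold idea can be sketched in a few lines. This is a minimal illustration, not the method of any particular dot plot program: the names `dotplot` and `render` are made up here, and a dot is drawn when at least `threshold` positions of the window are identical, which is one assumed reading of the threshold shown on the slides.

```python
def dotplot(seq_x: str, seq_y: str, window: int = 1, threshold: int = 1):
    """Return rows of booleans; True marks a window match.

    Cell [i][j] is True when the windows starting at seq_y[i] and seq_x[j]
    have at least `threshold` identical positions.
    """
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            score = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            row.append(score >= threshold)
        plot.append(row)
    return plot


def render(plot) -> str:
    """Draw the matrix as text, '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in plot)


seq = "ACATGCAGTA"
raw = dotplot(seq, seq, window=1, threshold=1)        # every identical letter pair
filtered = dotplot(seq, seq, window=3, threshold=3)   # only exact 3-letter matches
print(render(raw))
print()
print(render(filtered))
```

With a window of 1 the plot is noisy because every shared letter produces a dot; requiring three consecutive identities leaves only the main diagonal for this sequence.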
The simplest example of a dot plot is obtained by plotting two homologous
sequences of interest. If very similar or identical sequences are plotted
against each other, a diagonal line appears.
If the dot plot shows more than one diagonal in the same region of a sequence,
the corresponding region of the other sequence is repeated.
Frame shifts in a nucleotide sequence can occur due to insertions, deletions
or mutations. In this figure, three frame shifts for the sequence on the y-axis
are found.
In a dot plot, an inversion of sequence appears as a diagonal running
perpendicular to the diagonal that shows similarity.
Low-complexity regions in sequences can be found as regions around the
diagonal that all obtain a high score. Low-complexity regions are calculated from
the redundancy of amino acids within a limited region [Wootton and Federhen,
1993]. These are most often seen as short regions of only a few different
amino acids. The square in the middle of the figure shows the low-complexity
region of this sequence.
http://arbl.cvmbs.colostate.edu/molkit/
Question | Exercises
     A  B  C  D
A    1  0  0  0
B    0  1  0  0
C    0  0  1  0
D    0  0  0  1

ABCD
ABCD

     A  B  C  D
A    4  0  0  0
B    0  3  0  0
C    0  0  2  0
D    0  0  0  1

Score: 4

     A  C  D  E
A    1  0  0  0
B    0  0  0  0
C    0  1  0  0
D    0  0  1  0

A-CDE
ABCD-

     A  C  D  E
A    3  0  0  0
B    0  2  0  0
C    0  2  0  0
D    0  0  1  0

Score: 3
RefSeq1: A B B C E A
RefSeq2: A B C C F A
Substitution Matrices
AS                          A     B     C     E     F
# changes                   0     1     1     1     1
# occurrence                4     3     3     1     1
# changes / # occurrence    0     0.33  0.33  1     1
➥ Amino acid A is well conserved compared to B and C,
and E and F.
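The counts for RefSeq1 and RefSeq2 can be reproduced with a short sketch. The function name `relative_changes` is illustrative, and the ratio of changes to occurrences is only a toy stand-in for the mutability estimates behind real substitution matrices, which are derived from many aligned sequences.

```python
from collections import Counter


def relative_changes(seq1: str, seq2: str) -> dict:
    """Per residue: (number of changes, number of occurrences, ratio)."""
    occurrences = Counter(seq1) + Counter(seq2)
    changes = Counter()
    for a, b in zip(seq1, seq2):
        if a != b:
            # a substitution column counts as one change for each residue involved
            changes[a] += 1
            changes[b] += 1
    return {
        residue: (changes[residue], occurrences[residue],
                  changes[residue] / occurrences[residue])
        for residue in sorted(occurrences)
    }


table = relative_changes("ABBCEA", "ABCCFA")
for residue, (n_changes, n_occurrences, ratio) in table.items():
    print(residue, n_changes, n_occurrences, round(ratio, 2))
```

A ratio of 0 for A reflects that it never changes in this pair, while E and F change every time they occur.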
Percent Accepted Mutation (PAM) - A unit introduced by Margaret
Dayhoff et al. (1978) to quantify the amount of evolutionary change in
a protein sequence. One PAM unit is the amount of evolution that will
change, on average, 1% of the amino acids in a protein sequence. A PAM(x)
substitution matrix is a look-up table in which scores for each amino
acid substitution have been calculated based on the frequency of that
substitution in closely related proteins that have experienced a certain
amount (x) of evolutionary divergence.
The PAM matrices imply a Markov chain model of protein mutation.
The PAM matrices are normalized so that, for instance, the PAM1 matrix
gives substitution probabilities for sequences that have experienced one
point mutation for every hundred amino acids. The mutations may
overlap so that the sequences reflected in the PAM250 matrix have
experienced 250 mutation events for every 100 amino acids, yet only 80
out of every 100 amino acids have been affected.
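The Markov chain view means that a matrix for a larger PAM distance is obtained by multiplying the PAM1 transition matrix with itself. The sketch below illustrates this with a deliberately simplified PAM1: it assumes 20 states and a uniform 1% chance of changing into any other residue, which is not how the real Dayhoff matrices were estimated. All names are illustrative.

```python
def uniform_pam1(n_states: int = 20, change: float = 0.01):
    """Toy PAM1: each residue changes with probability `change`, uniformly."""
    off_diagonal = change / (n_states - 1)
    return [[1.0 - change if i == j else off_diagonal for j in range(n_states)]
            for i in range(n_states)]


def matmul(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, exponent: int):
    """Multiply the matrix with itself `exponent` times (one step per PAM unit)."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = matmul(result, matrix)
    return result


pam1 = uniform_pam1()
pam250 = mat_power(pam1, 250)
unchanged = pam250[0][0]   # probability that a residue equals its ancestor
print(f"toy model, fraction unchanged after 250 PAM: {unchanged:.3f}")
```

Because mutations can hit the same position repeatedly or revert, the probability of observing the original residue after 250 steps stays clearly above zero even though 250 events per 100 positions have occurred.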
[Figure: Markov chain with the two states A and G.]
A Markov chain, named for Andrey Markov, is a mathematical system
that undergoes transitions from one state to another in a chainlike
manner. It is a random process characterized as memoryless: the
next state depends only on the current state and not on the entire past.
This specific kind of "memorylessness" is called the Markov property.
Markov chains have many applications as statistical models of real-world processes.
Blocks Substitution Matrix (BLOSUM). A substitution
matrix in which scores for each position are derived from
observations of the frequencies of substitutions in blocks of
local alignments in related proteins. Each matrix is tailored
to a particular evolutionary distance. In the BLOSUM62
matrix, for example, the alignment from which scores were
derived was created using sequences sharing no more than
62% identity. Sequences more identical than 62% are
represented by a single sequence in the alignment so as to
avoid over-weighting closely related family members.
(Henikoff and Henikoff 1992)
The BLOSUM62 matrix
$$S_{ij} = \frac{1}{\lambda} \log\left(\frac{p_{ij}}{q_i \, q_j}\right)$$
pij is the probability of two amino acids i and j replacing
each other in a homologous sequence, and qi and qj are
the background probabilities of finding the amino acids i
and j in any protein sequence at random. The factor λ is
a scaling factor, set such that the matrix contains easily
computable integer values.
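The log-odds formula can be tried out numerically. In this sketch the probabilities are hypothetical values chosen for easy arithmetic, not frequencies from the BLOCKS database, and the default scaling factor ln(2)/2 (scores in half-bit units) is an assumption made for the illustration.

```python
import math


def log_odds_score(p_ij: float, q_i: float, q_j: float,
                   lam: float = math.log(2) / 2) -> int:
    """S_ij = (1 / lambda) * log(p_ij / (q_i * q_j)), rounded to an integer."""
    return round(math.log(p_ij / (q_i * q_j)) / lam)


# Hypothetical background frequencies of 5% for both residues.
q_i = q_j = 0.05

favoured = log_odds_score(0.01, q_i, q_j)      # observed 4x more often than by chance
neutral = log_odds_score(0.0025, q_i, q_j)     # observed exactly as often as by chance
avoided = log_odds_score(0.00125, q_i, q_j)    # observed half as often as by chance
print(favoured, neutral, avoided)
```

Pairs seen more often than chance get positive scores, pairs seen less often get negative scores, and a pair seen exactly at the chance level scores zero.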
When choosing a matrix, it is important to consider the alternatives. Do not simply
choose the default setting without some initial consideration.
Suggested uses for common substitution matrices. The matrices highlighted in bold are
available through NCBI’s BLAST web interface. BLOSUM62 has been shown to provide
the best results in BLAST searches overall due to its ability to detect large ranges of
similarity. Nevertheless, the other matrices have their strengths. For example, if your
goal is to only detect sequences of high similarity to infer homology within a species,
the PAM30, BLOSUM90, and PAM70 matrices would provide the best results.
      A   C   D   E   M   L   S   S
A     4   0  -2  -1  -1  -1   1   1
S     1  -1   0   0  -1  -2   4   4
C     0   9  -3  -4  -1  -1  -1  -1
E    -1  -4   2   5  -2  -3   0   0
E    -1  -4   2   5  -2  -3   0   0
L    -1  -1  -4  -3   2   4  -2  -2
S     1  -1   0   0  -1  -2   4   4
S     1  -1   0   0  -1  -2   4   4
1) 4-1-3+5-2+4+4+4=15
2) 4+1+9+2+5+2+4+4+4=35
[Figure: the same score matrix, with the path of the gapped alignment marked.]
Gap penalty (-1)
A-CDEMLSS
ASCEE-LSS
Score: 35 - 2 = 33
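Scoring an alignment with a substitution matrix and a gap penalty can be sketched column by column. The `SCORES` dictionary holds only the matrix entries needed for this example, and `alignment_score` is an illustrative name. The sketch assumes that a gap column costs exactly the gap penalty and contributes no matrix score. That is a common convention, but it is not the one behind the total of 35 - 2 = 33 above, which also adds the two matrix cells crossed by the gaps.

```python
# Entries of the score matrix needed for the example (the matrix is symmetric).
SCORES = {
    ("A", "A"): 4, ("C", "C"): 9, ("E", "E"): 5, ("L", "L"): 4, ("S", "S"): 4,
    ("C", "S"): -1, ("C", "D"): -3, ("E", "M"): -2, ("D", "E"): 2,
}


def pair_score(a: str, b: str) -> int:
    """Look up a substitution score regardless of the order of the pair."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def alignment_score(seq1: str, seq2: str, gap_penalty: int = -1, gap: str = "-") -> int:
    """Sum substitution scores over aligned columns; gap columns cost gap_penalty."""
    total = 0
    for a, b in zip(seq1, seq2):
        total += gap_penalty if gap in (a, b) else pair_score(a, b)
    return total


ungapped = alignment_score("ACDEMLSS", "ASCEELSS")
gapped = alignment_score("A-CDEMLSS", "ASCEE-LSS")
print(ungapped, gapped)
```

The ungapped alignment reproduces the total of 15 from the first calculation. Under the column-wise convention assumed here the gapped alignment scores 30, still twice the ungapped score, so both conventions favour introducing the two gaps.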
query word
(w=3)
AALNTPQGVNWGS
PQG 18
PEG 15 score threshold
PRG 14
PKG 13
PQA 12
...
Q: ...AALNTPQGVNWGS...
...A+L TP G+NWGS...
S: ...ASLRTPEGLNWGS...
The BLAST algorithm. The BLAST algorithm is a heuristic search method that
seeks words of length W (default = 3 in blastp) that score at least T when aligned
with the query and scored with a substitution matrix. Words in the database that
score T or greater are extended in both directions in an attempt to find a locally
optimal ungapped alignment or HSP (high scoring pair) with a score of at least S
or an E value lower than the specified threshold. HSPs that meet these criteria
will be reported by BLAST, provided they do not exceed the cutoff value
specified for number of descriptions and/or alignments to report.
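The seeding step can be illustrated with the query word PQG from the figure. This sketch is far from the real BLAST implementation: the candidate list is hand-picked instead of enumerating all possible words, `SCORES` contains only the BLOSUM62 entries needed here, and the threshold T = 13 is an assumed value for the illustration.

```python
# Partial substitution scores (BLOSUM62 values), just enough for this example.
SCORES = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("G", "A"): 0,
}


def pair_score(a: str, b: str) -> int:
    """Look up a substitution score regardless of the order of the pair."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def query_words(query: str, w: int = 3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def word_score(query_word: str, candidate: str) -> int:
    """Score of a candidate word aligned position by position with the query word."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def neighborhood(query_word: str, candidates, threshold: int):
    """Candidates scoring at least `threshold`, best first."""
    scored = [(word, word_score(query_word, word)) for word in candidates]
    scored.sort(key=lambda item: -item[1])
    return [(word, score) for word, score in scored if score >= threshold]


words = query_words("AALNTPQGVNWGS")
hits = neighborhood("PQG", ["PQG", "PEG", "PRG", "PQA"], threshold=13)
print(words)
print(hits)
```

Only words at or above the threshold are kept as seeds; each database occurrence of such a word would then be extended in both directions to look for an HSP.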
Local Alignment
Global Alignment
[Figure: Seq 1 to Seq 4 shown as a global MSA (with gaps) and as a local MSA (without gaps).]
The most widely used approach to multiple sequence alignment (MSA) is a heuristic search known as the
progressive technique (also known as the hierarchical or tree method), which builds up a final MSA by
combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly
related. All progressive alignment methods require two stages: a first stage in which the relationships
between the sequences are represented as a tree, called a guide tree, and a second stage in which the MSA is
built by adding the sequences sequentially to the growing MSA according to the guide tree. The initial
guide tree is determined by an efficient clustering method such as neighbor-joining or UPGMA.
➪ clustalw, MAFFT
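The first stage, building a guide tree, can be sketched with UPGMA on a small distance table. The distances below are made-up numbers for four hypothetical sequences A to D, the function name `upgma` is illustrative, and real programs compute the distances from pairwise alignments and use more elaborate data structures.

```python
def upgma(distances: dict):
    """Return a nested-tuple guide tree from a dict {(a, b): distance}."""
    # Symmetric lookup keyed by the unordered pair of cluster labels.
    d = {frozenset(pair): value for pair, value in distances.items()}
    sizes = {label: 1 for pair in distances for label in pair}
    while len(sizes) > 1:
        closest = min(d, key=d.get)            # most similar pair of clusters
        a, b = sorted(closest, key=str)        # fixed order for reproducible output
        merged = (a, b)
        merged_size = sizes[a] + sizes[b]
        for other in sizes:
            if other in (a, b):
                continue
            # UPGMA: size-weighted average distance to the merged cluster
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / merged_size
        # Drop all distances that still refer to the two merged clusters.
        d = {pair: value for pair, value in d.items()
             if a not in pair and b not in pair}
        del sizes[a], sizes[b]
        sizes[merged] = merged_size
    return next(iter(sizes))


# Hypothetical pairwise distances between four sequences.
distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 8,
}
tree = upgma(distances)
print(tree)
```

Reading the nested tuple from the inside out gives the order of the second stage: A and B are aligned first, C is added to that alignment, and D is added last.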
Bioinformatics - Sequence Alignment - Guide Trees Example
Bioinformatics - Sequence Alignment - Iterative Refinement
The progressive method has a drawback in that once a gap is incorrectly introduced,
especially at an early step (near a leaf of the guide tree), the gap is never removed in
later steps. To overcome this drawback, there are two types of solutions: the iterative
refinement method and the consistency-based method. These two procedures are
quite different; the former tries to correct mistakes in the initial alignment, whereas
the latter tries to avoid mistakes in advance. However, both work well to improve
alignment accuracy.
➪ MAFFT or COFFEE
Bioinformatics - Sequence Alignment - Guide Trees Example
ClustalW ftp://ftp.ebi.ac.uk/pub/software/clustalw2/
PRRN http://www.genome.ist.i.kyoto-u.ac.jp/
DIALIGN http://bibiserv.techfak.uni-bielefeld.de/dialign/
TCoffee http://www.tcoffee.org/
MAFFT http://align.bmr.kyushu-u.ac.jp/mafft/software/
MUSCLE http://www.drive5.com/muscle/
ProbConsRNA http://probcons.stanford.edu/
Kalign http://msa.cgb.ki.se/
ProbAlign http://www.cs.njit.edu/usman/probalign/
PRIME http://prime.cbrc.jp/
MLAGAN http://lagan.stanford.edu/lagan_web/index.shtml
MAVID http://baboon.math.berkeley.edu/mavid/
MUMmer http://mummer.sourceforge.net/
CCGB (TBA and BLASTZ) http://www.bx.psu.edu/miller_lab
MAUVE http://gel.ahabs.wisc.edu/mauve/
Bioinformatics - Sequence Alignment - SeaView
        0        10        20        30        40        50        60
        |--------|---------|---------|---------|---------|---------|
SEQ_A:  TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_B:  TTACATAAGCTCCCGTGGAAAAGGGTTGCCAACCCGGCCGATCAACAGCCCTGGTCGGC
SEQ_C:  TTACGAAATCGAGGCCGGAAAAGAATTCCCAACCCGGCCGATCCTGTTGCATTCGTACC
SEQ_D:  GGACCGCCTCCGTACACGAAAAGATCGGCCAACCCGGCCGATCACGCCTCGTGAGATCA
SEQ_E:  GGACATCCGCGTGATTACGAGTCGTGGCCCAACCCGGCCGATCAAATTTGGTCTGGCTG
SEQ_F:  GGAAGACTAGCCGCTGGTAAACACACCACCAACCCGGCCGATCTGACCCCGGCTCTCCA
SEQ_G:  GGAAACTAGCATGCGATAAGTCCCTAACCCAACCCGGCCGATCTGACTATGGCCTTCTG
SEQ_H:  GGGATCAAGTCTGGTTAACCATCAAACAGCCAACCCGGCCGATCGTCTTGAGTCTAAAA
Bioinformatics - Sequence Alignment - SeqLOGO
[Figure: sequence logo of the eight sequences SEQ_A to SEQ_H shown above.]
Bioinformatics - Sequence Alignment - SeaView
        0        10        20        30        40        50        60
        |--------|---------|---------|---------|---------|---------|
SEQ_A:  TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_B:  TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_C:  TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_D:  TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_E:  TTACCCCGCGAGGATTCGAAAAGGTGAGCCTACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_F:  TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_G:  TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_H:  TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC

        0        10        20        30        40        50        60
        |--------|---------|---------|---------|---------|---------|
SEQ_A:  ...........................................................
SEQ_B:  ...........................................................
SEQ_C:  ...........................................................
SEQ_D:  ...........................................................
SEQ_E:  ..............................T............................
SEQ_F:  ...........................................................
SEQ_G:  ...........................................................
SEQ_H:  ...........................................................
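The dotted view, in which only deviating residues are printed, can be produced from a majority consensus. This is a sketch under stated assumptions: the four short sequences are hypothetical, the function names are illustrative, and masking against the per-column majority is only one way to obtain such a display (alignment viewers may instead mask against the first sequence).

```python
from collections import Counter


def consensus(seqs) -> str:
    """Most frequent residue in each column of equal-length aligned sequences."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


def mask_identical(seqs, reference: str):
    """Replace residues equal to the reference with '.', keep the differences."""
    return ["".join("." if residue == ref else residue
                    for residue, ref in zip(seq, reference))
            for seq in seqs]


# Hypothetical aligned sequences with a single deviating position.
seqs = ["TTACCA", "TTACCA", "TTTCCA", "TTACCA"]
cons = consensus(seqs)
masked = mask_identical(seqs, cons)
print(cons)
print("\n".join(masked))
```

In the masked output the single substitution stands out immediately, which is the purpose of the dotted display for nearly identical sequences.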
Sequence similarity
Amino acids in the same column are those that yield an alignment with
maximum similarity. Most programs use sequence similarity because it is the
easiest criterion. When the sequences are closely related, their structural,
evolutionary, and functional similarities are equivalent to sequence similarity.
Structural similarity
Amino acids that play the same role in each structure are in the same column.
Structure-superposition programs are the only ones that use this criterion.
Evolutionary similarity
Amino acids or nucleotides related to the same amino acid (or nucleotide) in the
common ancestor of all the sequences are put in the same column. No
automatic program explicitly uses this criterion, but they all try to deliver an
alignment that respects it.
Functional similarity
Amino acids or nucleotides with the same function are in the same column. No
automatic program explicitly uses this criterion, but if the information is
available, you can force some programs to respect it — or you can edit your
alignment manually.
Proteins or DNA
Use proteins whenever possible. You can turn them back into DNA
after doing the multiple alignment.
Many sequences
Start with 10–15 sequences; avoid aligning more than 50 sequences.
Very different sequences
Sequences that are less than 30 percent identical to more than half
the other sequences in the set often cause troubles.
Identical sequences
They never help. Unless you have a very good reason to do so, avoid
incorporating into your multiple alignment any sequence that’s more
than 95 percent identical to another sequence in the set.
Partial sequences
Multiple-sequence-alignment programs prefer sequences that are
roughly the same length. Programs often have difficulties comparing
items in a mixture of complete sequences and shorter fragments.
Repeated domains
Sequences with repeated domains cause trouble for most multiple-alignment
programs, especially if the number of domains is
different. When this happens, you may be better off extracting the
domains yourself with, for example, Dotlet and making a multiple
alignment of those segments.
Part of knowing when to use multiple
sequence alignments involves knowing
when not to use them!
Question | Exercises
Sequence alignment
(i) Dynamic Programming
- Needleman and Wunsch (1970) => global sequence alignment
- Smith and Waterman (1981) => local sequence alignment
(ii) Weighting/Models
- Dayhoff (1978) => PAM Matrices
- Henikoff and Henikoff (1992) => BLOSUM Matrices
(iii) BLAST Programs (detecting sequence similarity)
- Nucleotide blast: compares a nucleotide query against a nucleotide sequence
database
- Protein blast: compares a protein query against a protein sequence database
- blastx: compares a nucleotide query translated in all six reading frames
against a protein database
- tblastn: compares a protein query against a nucleotide sequence database
dynamically translated in all six reading frames
- tblastx: compares a nucleotide query in all six reading frames against a
nucleotide sequence database in all six reading frames
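The two dynamic programming methods listed under (i) differ mainly in how the score matrix is initialised and in whether scores may drop below zero. The sketch below computes only the optimal scores, without traceback, and uses a simple match/mismatch/linear-gap scheme (+1, -1, -1) that is an assumption for the illustration; the original publications and current tools use substitution matrices and other gap models.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1) -> int:
    """Optimal global alignment score (both sequences aligned end to end)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * gap          # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


def smith_waterman(a: str, b: str, match: int = 1, mismatch: int = -1,
                   gap: int = -1) -> int:
    """Optimal local alignment score (best-scoring pair of subsequences)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, so an alignment can restart anywhere.
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best


global_score = needleman_wunsch("ACATGCAGTA", "CAGTAACATG")
local_score = smith_waterman("ACATGCAGTA", "CAGTAACATG")
print(global_score, local_score)
```

For the two rearranged sequences from the dot plot example, the local method can report a shared segment on its own, whereas the global method has to pay for everything that does not line up.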
Sequence alignment
http://www.ch.embnet.org/software/LALIGN_form.html