Sequence Alignment

[Figure: Strap, Interactive Structure based Sequences Alignment Program]

Bioinformatics - Sequence Alignment
HS14 | UniBas | JCW

Match, Mismatch, Indel, Gap

Seq1 : TTGCACGGCTTGGTCCA-GTGCGGTTTAC
       ||||||||||x|||||| ||||    |||
Seq2 : TTGCACGGCTCGGTCCACGTGC----TAC

Seq1 : .................-...........
Seq2 : ..........C...........----...
Cons : TTGCACGGCTtGGTCCAcGTGCggttTAC

* indels: insertions & deletions

Alignment
The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.

Identity
The extent to which two (nucleotide or amino acid) sequences are invariant.

Similarity
The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST, similarity refers to a positive matrix score.

Homology
Similarity attributed to descent from a common ancestor.

Homolog
• A gene related to a second gene by descent from a common ancestral DNA sequence. The term homolog may apply to the relationship between genes separated by the event of speciation (see ortholog) or to the relationship between genes separated by the event of genetic duplication (see paralog).

Ortholog
• Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologs retain the same function in the course of evolution. Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes. (See also paralog.)

Speciation
• Speciation is the origin of a new species capable of making a living in a new way from the species from which it arose.
As part of this process, the new species has also acquired some barrier to genetic exchange with the parent species.

Paralog
• Paralogs are genes related by duplication within a genome. Orthologs retain the same function in the course of evolution, whereas paralogs evolve new functions, even if these are related to the original one.

Dotplot and Alignment
[Figure: dot plot of seq1 against seq2, shown next to the corresponding pairwise alignment of seq1 and seq2]

Sequence alignment - Dotplot
[Figure series: dot plot of ACATGCAGTA against itself, built up step by step and then filtered with threshold >1 and threshold >2]

A C A T G C A G T A
| | | | | | | | | |
A C A T G C A G T A

[Figure series: dot plot of ACATGCAGTA against CAGTAACATG]

A C A T G C A G T A
          | | | | |
          C A G T A A C A T G
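The dot plots illustrated on these slides can be reproduced with a few lines of code. The following is a minimal sketch, not the lecturer's implementation: the function names `dot_plot` and `render` are hypothetical, and the slides do not spell out how their thresholds are applied, so the sketch assumes that a dot is drawn when a window of `window` positions contains at least `threshold` identical residues.

```python
def dot_plot(seq_x, seq_y, window=1, threshold=1):
    """Return a list of rows; cell [i][j] is True where a dot is drawn.

    Assumption: a dot is drawn when the windows starting at seq_y[i] and
    seq_x[j] share at least `threshold` identical positions.
    """
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            row.append(matches >= threshold)
        plot.append(row)
    return plot


def render(plot, seq_x, seq_y):
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + " ".join(seq_x[: len(plot[0])])]
    for i, row in enumerate(plot):
        lines.append(seq_y[i] + " " + " ".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


if __name__ == "__main__":
    seq = "ACATGCAGTA"
    # Single-residue comparison: main diagonal plus scattered background dots.
    print(render(dot_plot(seq, seq), seq, seq))
    print()
    # Requiring 3 matches in a window of 3 removes the background.
    print(render(dot_plot(seq, seq, window=3, threshold=3), seq, seq))
    print()
    # Swapped halves: two short diagonals instead of one long one.
    other = "CAGTAACATG"
    print(render(dot_plot(seq, other, window=3, threshold=3), seq, other))
```

Raising the window size and threshold trades sensitivity for a cleaner plot; for the rearranged pair, the two short diagonals correspond to the two swapped blocks.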
[Figure series: dot plot of the repetitive sequence ACACGCGCGA against itself]

Dot plots are two-dimensional plots in which the x-axis and the y-axis each represent a sequence, and the plot shows a comparison of the two sequences by a score calculated for each pair of positions. If a window of fixed size on one sequence (one axis) matches the other sequence, a dot is drawn in the plot. Dot plots are one of the oldest methods for comparing two sequences [Maizel and Lenk, 1981]. In contrast to simple sequence alignments, dot plots can be a very useful tool for spotting various evolutionary events which may have happened to the sequences of interest.

The simplest example of a dot plot is obtained by plotting two homologous sequences of interest against each other. If very similar or identical sequences are plotted against each other, a diagonal line appears.

If the dot plot shows more than one diagonal in the same region of a sequence, the corresponding regions of the other sequence are repeated.

Frame shifts in a nucleotide sequence can occur due to insertions, deletions or mutations. In this figure, three frame shifts for the sequence on the y-axis are found.

In dot plots, an inversion of a sequence appears as a diagonal running perpendicular to the diagonal that shows similarity.

Low-complexity regions in sequences can be found as regions around the diagonal all obtaining a high score.
Low-complexity regions are calculated from the redundancy of amino acids within a limited region [Wootton and Federhen, 1993]. These are most often seen as short regions of only a few different amino acids. The square in the middle of the figure shows the low-complexity region of this sequence.

http://arbl.cvmbs.colostate.edu/molkit/

Question | Exercises

ABCD against ABCD:

    A B C D
A   1 0 0 0
B   0 1 0 0
C   0 0 1 0
D   0 0 0 1

ABCD
ABCD

Summed matrix:

    A B C D
A   4 0 0 0
B   0 3 0 0
C   0 0 2 0
D   0 0 0 1

Score: 4

ACDE (columns) against ABCD (rows):

    A C D E
A   1 0 0 0
B   0 0 0 0
C   0 1 0 0
D   0 0 1 0

A-CDE
ABCD

Summed matrix:

    A C D E
A   3 0 0 0
B   0 2 0 0
C   0 2 0 0
D   0 0 1 0

Score: 3

Substitution Matrices

RefSeq1: A B B C E A
RefSeq2: A B C C F A

Amino acid                  A     B     C     E     F
# changes                   0     1     1     1     1
# occurrence                4     3     3     1     1
# changes / # occurrence    0     0.33  0.33  1     1

➥ Amino acid A is well conserved compared to B and C, and to E and F.

Percent Accepted Mutation (PAM) - A unit introduced by Margaret Dayhoff et al. (1978) to quantify the amount of evolutionary change in a protein sequence. 1.0 PAM unit is the amount of evolution which will change, on average, 1% of the amino acids in a protein sequence. A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence.

The PAM matrices imply a Markov chain model of protein mutation. The PAM matrices are normalized so that, for instance, the PAM1 matrix gives substitution probabilities for sequences that have experienced one point mutation for every hundred amino acids.
The mutations may overlap, so that the sequences reflected in the PAM250 matrix have experienced 250 mutation events for every 100 amino acids, yet only 80 out of every 100 amino acids have been affected.

[Figure: Markov chain diagram with states labelled A and G]

A Markov chain, named for Andrey Markov, is a mathematical system that undergoes transitions from one state to another in a chainlike manner. It is a random process characterized as memoryless: the next state depends only on the current state and not on the entire past. This specific kind of "memorylessness" is called the Markov property. Markov chains have many applications as statistical models of real-world processes.

Blocks Substitution Matrix (BLOSUM) - A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. (Henikoff and Henikoff 1992)

The BLOSUM62 matrix

S_ij = (1/λ) · log( p_ij / (q_i · q_j) )

p_ij is the probability of two amino acids i and j replacing each other in a homologous sequence, and q_i and q_j are the background probabilities of finding the amino acids i and j in any protein sequence at random. The factor λ is a scaling factor, set such that the matrix contains easily computable integer values.

When choosing a matrix, it is important to consider the alternatives. Do not simply choose the default setting without some initial consideration.
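The log-odds formula for S_ij given above can be tried out numerically. The sketch below is illustrative only: the function name `log_odds_score` is hypothetical, the frequencies in the usage lines are rough illustrative values rather than figures from the slides, and the default λ = ln(2)/2 (scores in half-bit units) is an assumption about the scaling convention.

```python
import math


def log_odds_score(p_ij, q_i, q_j, lam=math.log(2) / 2):
    """Integer log-odds score S_ij = (1/lam) * ln(p_ij / (q_i * q_j)).

    p_ij : observed probability of the pair (i, j) in aligned homologs
    q_i, q_j : background probabilities of residues i and j
    lam : scaling factor; ln(2)/2 (half-bit units) is an assumed default
    """
    return round(math.log(p_ij / (q_i * q_j)) / lam)


if __name__ == "__main__":
    # Pair seen more often than expected by chance: positive score.
    print(log_odds_score(0.0215, 0.074, 0.074))
    # Pair seen exactly as often as expected by chance: score 0.
    print(log_odds_score(0.0025, 0.05, 0.05))
    # Pair seen less often than expected by chance: negative score.
    print(log_odds_score(0.001, 0.05, 0.05))
```

A positive score therefore marks a substitution that is observed more often in related proteins than chance would predict, which is what BLAST reports as a "positive" in its alignments.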
Suggested uses for common substitution matrices. The matrices highlighted in bold are available through NCBI’s BLAST web interface. BLOSUM62 has been shown to provide the best results in BLAST searches overall due to its ability to detect large ranges of similarity. Nevertheless, the other matrices have their strengths. For example, if your goal is to detect only sequences of high similarity to infer homology within a species, the PAM30, BLOSUM90, and PAM70 matrices would provide the best results.

Scoring ACDEMLSS (columns) against ASCEELSS (rows):

     A   C   D   E   M   L   S   S
A    4   0  -2  -1  -1  -1   1   1
S    1  -1   0   0  -1  -2   4   4
C    0   9  -3  -4  -1  -1  -1  -1
E   -1  -4   2   5  -2  -3   0   0
E   -1  -4   2   5  -2  -3   0   0
L   -1  -1  -4  -3   2   4  -2  -2
S    1  -1   0   0  -1  -2   4   4
S    1  -1   0   0  -1  -2   4   4

Sums of the matrix cells along two paths:
1) 4-1-3+5-2+4+4+4=15
2) 4+1+9+2+5+2+4+4+4=35

Gap penalty (-1)

A-CDEMLSS
ASCEE-LSS
Score: 35 - 2 = 33

query word (w=3): PQG, taken from the query AALNTPQGVNWGS

Neighborhood words and their scores (the figure marks a score threshold within this list):
PQG 18
PEG 15
PRG 14
PKG 13
PQA 12
...

Q: ...AALNTPQGVNWGS...
      A+L TP G+NWGS
S: ...ASLRTPEGLNWGS...

The BLAST algorithm. The BLAST algorithm is a heuristic search method that seeks words of length W (default = 3 in blastp) that score at least T when aligned with the query and scored with a substitution matrix. Words in the database that score T or greater are extended in both directions in an attempt to find a locally optimal ungapped alignment or HSP (high scoring pair) with a score of at least S or an E value lower than the specified threshold. HSPs that meet these criteria will be reported by BLAST, provided they do not exceed the cutoff value specified for the number of descriptions and/or alignments to report.
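The word-seeding step described above can be sketched in a few lines. This is a simplified illustration, not the BLAST implementation: the function names are hypothetical, and to keep the score table short the sketch assumes a reduced alphabet of six residues (P, Q, G, E, R, A) with their BLOSUM62 pair scores, so words containing any other residue are not enumerated.

```python
from itertools import product

# Assumption: reduced alphabet so that the score table stays small.
ALPHABET = "PQGERA"

# BLOSUM62 scores for all pairs within the reduced alphabet (one order each).
PAIR_SCORES = {
    "PP": 7, "PQ": -1, "PG": -2, "PE": -1, "PR": -2, "PA": -1,
    "QQ": 5, "QG": -2, "QE": 2, "QR": 1, "QA": -1,
    "GG": 6, "GE": -2, "GR": -2, "GA": 0,
    "EE": 5, "ER": 0, "EA": -1,
    "RR": 5, "RA": -1,
    "AA": 4,
}


def pair_score(a, b):
    """Symmetric lookup of a substitution score."""
    key = a + b if a + b in PAIR_SCORES else b + a
    return PAIR_SCORES[key]


def query_words(query, w=3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def word_score(word, query_word):
    """Ungapped score of a candidate word aligned with the query word."""
    return sum(pair_score(a, b) for a, b in zip(word, query_word))


def neighborhood(query_word, threshold):
    """Words over ALPHABET scoring at least `threshold` against query_word."""
    hits = []
    for letters in product(ALPHABET, repeat=len(query_word)):
        word = "".join(letters)
        score = word_score(word, query_word)
        if score >= threshold:
            hits.append((word, score))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    print(query_words("AALNTPQGVNWGS")[5])
    for word, score in neighborhood("PQG", threshold=13):
        print(word, score)
```

In a real search, each neighborhood word is looked up in an index of the database, and every hit becomes a seed for the extension step that produces an HSP.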
[Figure: local alignment compared with global alignment]

[Figure: Seq 1 to Seq 4 shown unaligned, as a global MSA (with gaps), and as a local MSA (without gaps)]

The most widely used approach to multiple sequence alignment (MSA) uses a heuristic search known as the progressive technique (also known as the hierarchical or tree method), which builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related. All progressive alignment methods require two stages: a first stage in which the relationships between the sequences are represented as a tree, called a guide tree, and a second stage in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree. The initial guide tree is determined by an efficient clustering method such as neighbor-joining or UPGMA.
➪ clustalw, MAFFT

Guide Trees Example
[Figures: guide tree examples]

Iterative Refinement
The progressive method has a drawback: once a gap is incorrectly introduced, especially at an early step (near a leaf of the guide tree), the gap is never removed in later steps. To overcome this drawback, there are two types of solutions: the iterative refinement method and the consistency-based method. These two procedures are quite different; the former tries to correct mistakes in the initial alignment, whereas the latter tries to avoid mistakes in advance. However, both work well to improve alignment accuracy.
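The guide-tree stage of progressive alignment mentioned earlier can be illustrated with UPGMA clustering. The following is a minimal sketch under stated assumptions: the function name `upgma` and the example distances are hypothetical, and the distance matrix is assumed to come from pairwise comparisons (for example, one minus the fractional identity). It returns only the tree topology as nested tuples, without branch lengths.

```python
def upgma(names, dist):
    """Build a guide tree (nested tuples) from a symmetric distance matrix.

    names : list of sequence labels
    dist  : square list of lists, dist[i][j] = distance between i and j
    """
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances between active clusters, keyed by (low id, high id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), _ = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance to the merged cluster: size-weighted average (UPGMA).
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop distances that involve the two clusters just merged.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


if __name__ == "__main__":
    labels = ["Seq1", "Seq2", "Seq3", "Seq4"]
    distances = [
        [0, 2, 10, 10],
        [2, 0, 10, 10],
        [10, 10, 0, 4],
        [10, 10, 4, 0],
    ]
    print(upgma(labels, distances))
```

A progressive aligner would walk this tree from the leaves upward, aligning the most similar pair first and then merging the resulting alignments.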
➪ MAFFT or COFFEE (iterative refinement and consistency-based methods)

Multiple alignment programs

ClustalW               ftp://ftp.ebi.ac.uk/pub/software/clustalw2/
PRRN                   http://www.genome.ist.i.kyoto-u.ac.jp/
DIALIGN                http://bibiserv.techfak.uni-bielefeld.de/dialign/
TCoffee                http://www.tcoffee.org/
MAFFT                  http://align.bmr.kyushu-u.ac.jp/mafft/software/
MUSCLE                 http://www.drive5.com/muscle/
ProbConsRNA            http://probcons.stanford.edu/
Kalign                 http://msa.cgb.ki.se/
ProbAlign              http://www.cs.njit.edu/usman/probalign/
PRIME                  http://prime.cbrc.jp/
MLAGAN                 http://lagan.stanford.edu/lagan_web/index.shtml
MAVID                  http://baboon.math.berkeley.edu/mavid/
MUMmer                 http://mummer.sourceforge.net/
CCGB (TBA and BLASTZ)  http://www.bx.psu.edu/miller_lab
MAUVE                  http://gel.ahabs.wisc.edu/mauve/

SeaView

       0        10        20        30        40        50        60
       |--------|---------|---------|---------|---------|---------|
SEQ_A: TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_B: TTACATAAGCTCCCGTGGAAAAGGGTTGCCAACCCGGCCGATCAACAGCCCTGGTCGGC
SEQ_C: TTACGAAATCGAGGCCGGAAAAGAATTCCCAACCCGGCCGATCCTGTTGCATTCGTACC
SEQ_D: GGACCGCCTCCGTACACGAAAAGATCGGCCAACCCGGCCGATCACGCCTCGTGAGATCA
SEQ_E: GGACATCCGCGTGATTACGAGTCGTGGCCCAACCCGGCCGATCAAATTTGGTCTGGCTG
SEQ_F: GGAAGACTAGCCGCTGGTAAACACACCACCAACCCGGCCGATCTGACCCCGGCTCTCCA
SEQ_G: GGAAACTAGCATGCGATAAGTCCCTAACCCAACCCGGCCGATCTGACTATGGCCTTCTG
SEQ_H: GGGATCAAGTCTGGTTAACCATCAAACAGCCAACCCGGCCGATCGTCTTGAGTCTAAAA

SeqLOGO
[Figure: sequence logo of the eight sequences SEQ_A to SEQ_H shown above]

SeaView

       0        10        20        30        40        50        60
       |--------|---------|---------|---------|---------|---------|
SEQ_A: TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_B: TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_C: TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_D: TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_E: TTACCCCGCGAGGATTCGAAAAGGTGAGCCTACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_F: TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_G: TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC
SEQ_H: TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC

       0        10        20        30        40        50        60
       |--------|---------|---------|---------|---------|---------|
SEQ_A: ...........................................................
SEQ_B: ...........................................................
SEQ_C: ...........................................................
SEQ_D: ...........................................................
SEQ_E: ..............................T............................
SEQ_F: ...........................................................
SEQ_G: ...........................................................
SEQ_H: ...........................................................

Sequence similarity
Amino acids in the same column are those that yield an alignment with maximum similarity. Most programs use sequence similarity because it is the easiest criterion. When the sequences are closely related, their structural, evolutionary, and functional similarities are equivalent to sequence similarity.
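The dot representation used in the SeaView display above, where residues identical to the reference are shown as dots, is simple to compute. The sketch below is an illustration only; the function names `dot_view` and `variable_columns` are hypothetical, and it assumes the sequences are already aligned to the same length.

```python
def dot_view(reference, sequences):
    """Replace every residue identical to the reference with '.'."""
    view = []
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the same length")
        view.append("".join("." if s == r else s for s, r in zip(seq, reference)))
    return view


def variable_columns(sequences):
    """0-based indices of alignment columns holding more than one residue."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


if __name__ == "__main__":
    seq_a = "TTACCCCGCGAGGATTCGAAAAGGTGAGCCAACCCGGCCGATCCGGAGAGACGGGCCTC"
    seq_e = "TTACCCCGCGAGGATTCGAAAAGGTGAGCCTACCCGGCCGATCCGGAGAGACGGGCCTC"
    for label, row in zip(("SEQ_A", "SEQ_E"), dot_view(seq_a, [seq_a, seq_e])):
        print(f"{label}: {row}")
    print("variable columns:", variable_columns([seq_a, seq_e]))
```

This view makes a single substitution in an otherwise identical set of sequences stand out immediately, which is hard to see in the raw alignment.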
Structural similarity
Amino acids that play the same role in each structure are in the same column. Structure-superposition programs are the only ones that use this criterion.

Evolutionary similarity
Amino acids or nucleotides related to the same amino acid (or nucleotide) in the common ancestor of all the sequences are put in the same column. No automatic program explicitly uses this criterion, but they all try to deliver an alignment that respects it.

Functional similarity
Amino acids or nucleotides with the same function are in the same column. No automatic program explicitly uses this criterion, but if the information is available, you can force some programs to respect it, or you can edit your alignment manually.

Choosing sequences for a multiple alignment

Proteins or DNA
Use proteins whenever possible. You can turn them back into DNA after doing the multiple alignment.

Many sequences
Start with 10–15 sequences; avoid aligning more than 50 sequences.

Very different sequences
Sequences that are less than 30 percent identical to more than half the other sequences in the set often cause trouble.

Identical sequences
They never help. Unless you have a very good reason to do so, avoid incorporating into your multiple alignment any sequence that’s more than 95 percent identical to another sequence in the set.

Partial sequences
Multiple-sequence-alignment programs prefer sequences that are roughly the same length. Programs often have difficulties comparing items in a mixture of complete sequences and shorter fragments.

Repeated domains
Sequences with repeated domains cause trouble for most multiple-alignment programs, especially if the number of domains is different. When this happens, you may be better off extracting the domains yourself (for example with Dotlet) and making a multiple alignment of those segments.

Part of knowing when to use multiple sequence alignments involves knowing when not to use them!
Question | Exercises

Sequence alignment

(i) Dynamic Programming
- Needleman and Wunsch (1970) => global sequence alignment
- Smith and Waterman (1981) => local sequence alignment

(ii) Weighting/Models
- Dayhoff (1978) => PAM Matrices
- Henikoff and Henikoff (1992) => BLOSUM Matrices

(iii) BLAST Programs (detecting sequence similarity)
- Nucleotide blast: compares a nucleotide query against a nucleotide sequence database
- Protein blast: compares a protein query against a protein sequence database
- blastx: compares a nucleotide query translated in all six reading frames against a protein database
- tblastn: compares a protein query against a nucleotide sequence database dynamically translated in all six reading frames
- tblastx: compares a nucleotide query translated in all six reading frames against a nucleotide sequence database translated in all six reading frames

Sequence alignment: http://www.ch.embnet.org/software/LALIGN_form.html
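The two dynamic-programming methods listed under (i) can be sketched compactly. This is a minimal illustration, not the LALIGN implementation: the function names are hypothetical, and the scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumed simplification; real tools use substitution matrices such as PAM or BLOSUM and usually affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, m + 1):
        score[0][j] = j * gap          # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score: cells never drop below 0; the best cell wins."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
    print(smith_waterman_score("ACATGCAGTA", "CAGTA"))
```

The only differences between the two are the initialization of the first row and column, the floor at zero, and where the traceback starts, which is why global and local alignment are usually taught together.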