
Tutorial “Molecular Systematics”, 29 April – 13 May 2015
Christian Printzen, Tetiana Lutsak, Heike Kappes, Sabine Matuszak, Frank Lappe
1. Introduction
“... any systematic attempt to construct ... natural relationships becomes the purest speculation, completely unsupported by any sort of evidence. ... the ultimate goal of biological classification can not be achieved in the case of bacteria.”
Stanier, Doudoroff & Adelberg, The microbial world, 2nd ed. (1964)
“A revolution is occurring in bacterial taxonomy. ... molecular sequencing techniques permit a direct measurement of genealogical relationships.”
Fox et al., Science 209: 457–463 (1980)
Systematics, and particularly phylogenetics, deals with the evolutionary relationships among
organisms. Starting with Charles Darwin’s main work “On the Origin of Species by Means of
Natural Selection” (Darwin 1859), systematists have attempted to discover genealogical relations among species. The icon for these relationships is the phylogenetic tree, resembling the well-known family trees with which we outline genealogical relationships among our family members.
The first “tree of life” was published in 1866 by Ernst Haeckel (Haeckel 1866). Until the middle
of the 20th century phylogenetic trees were almost exclusively based on morphological characters. This worked relatively well for highly evolved organisms with many observable characters,
but failed regularly in systematic groups with few visible and measurable features.
Bacteriologists were particularly frustrated. In 1964 (see first citation above) Stanier and his co-authors threw in the towel and buried a whole discipline. In their eyes bacterial systematics was
“the purest speculation”. Only 16 years later the situation had changed completely. Leading
bacteriologists declared a scientific revolution in Science. With the help of molecular data it had
become possible to reconstruct the phylogenetic history of bacteria. Today, the analysis of molecular data can help us answer questions that were thought to be unanswerable 40 years ago.
The history of an organism – from its evolutionary roots to its familial past – leaves traces in its
genome. Of course, Darwin and his contemporaries knew nothing about the mechanisms and
molecules of heredity. It was a long way from the discovery of the origin of species and the laws of heredity to the calculation of molecular phylogenetic trees. Gregor Mendel’s work on the heredity of traits was published in 1866 (Mendel 1866) but only recognized in professional circles more than 30 years later. In 1869, the Swiss physiologist Friedrich Miescher discovered a
molecule which he called “nuclein” in nuclei of pus cells washed from bandages gathered from a
nearby hospital. He suspected this substance to be the molecule of heredity (Miescher 1874).
But it took 75 more years until Oswald Avery and his co-workers, through their groundbreaking
experiments on pneumococci, were able to show that deoxyribonucleic acid was indeed the inherited material (Avery et al. 1944). Soon afterwards, its chemical and physical structure was revealed by Erwin Chargaff et al. (1951) and by James Watson and Francis Crick (1953). The first
piece of a DNA sequence (only 20 base pairs!) was published another 15 years later by Ray Wu
and Dale Kaiser (1968).
At that time, deciphering a DNA or protein sequence was technically challenging and extremely expensive. Molecular data were therefore of little use to taxonomists and systematists, who needed large numbers of sequences for their datasets. This changed with the publication of two articles that altered the whole of biology and, with it, systematics and evolutionary science. In the first paper the English biochemist Frederick Sanger and two of his colleagues described how the nucleotide sequence of a DNA molecule could be determined with a relatively simple procedure (Sanger et al. 1977). The other manuscript was rejected by Science and Nature (these journals also make mistakes!) and was therefore first published in a volume of symposium proceedings. In it, the American biochemist Kary Mullis and co-workers described a
method to amplify DNA in a test tube, the polymerase chain reaction (PCR, Mullis et al. 1986,
Mullis & Faloona 1987). Both methods together enabled systematists to determine the nucleotide sequence of particular gene loci in different species in a relatively easy and cheap way.
Already in the early 1960s, Emil Zuckerkandl and Linus Pauling (1962) had postulated that phylogenetic relationships could be elucidated with the help of macromolecules. Three years later
they published what is probably the first molecular phylogenetic tree based on protein sequences (Zuckerkandl & Pauling 1965). At about the same time, the development of statistical
methods for phylogenetic reconstruction began, at first without direct reference to molecular
data. Charles Michener and Robert Sokal (1957) had made a start with the first statistical
method for the calculation of genealogical relationships based on “traditional” morphological
characters. Events came thick and fast in the 1960s, mainly through the works of Anthony Edwards and Luca Cavalli-Sforza. In 1963 they introduced the principle of “minimum evolution”,
better known under the name “parsimony method” popularized by Joseph Camin and Robert
Sokal (1965). The works of Thomas Jukes and Charles Cantor (1969) as well as Walter Fitch
(1971) made
this principle applicable to molecular data. In 1967 Anthony Edwards and Luca Cavalli-Sforza
described a method for the calculation of phylogenetic trees based on genetic distances. Using a similar method, Walter Fitch and Emanuel Margoliash (1967) published a phylogeny of
vertebrates in the same year. And already in 1964 Edwards and Cavalli-Sforza had theorized on a technique to reconstruct phylogenies by maximum likelihood. This method is a well-established statistical tool for comparing models and hypotheses. Its application failed because, at that time, no computers could handle the massive number of calculations required. It took
only eight years to develop and publish the basic methods of tree reconstruction (some of them
more than once in parallel), but another 17 years until Joseph Felsenstein (1981) introduced a
practical maximum likelihood method for phylogenetic reconstructions. Further milestones on
the way to the calculation of evolutionary trees were the neighbour joining method by Naruya
Saitou and Masatoshi Nei (1987) based on genetic distances and Bayesian tree reconstruction introduced by Yang and Rannala in 1997.
This compact course is designed to familiarize you with some of the methods of DNA sequence
analysis used in systematics today. We concentrate on the most widely used methods: parsimony, maximum likelihood and Bayesian tree reconstruction. Distance methods are nowadays
rarely used in phylogenetic reconstruction. The course can be regarded as a “cooking class” for
molecular systematics and this tutorial is the cookbook. It briefly explains the background of the
methods (normal text and blue boxes) and gives step-by-step instructions for lab procedures
and use of computer programs.
We use different computer programs for data analysis. Some of them have graphical user interfaces, others are command-line operated. This tutorial uses the following conventions:
Bold face denotes commands from drop-down menus or dialog boxes. >File>Open, for example, means “choose Open from the menu File”.
Courier denotes command line entries. hsearch addseq=random nreps=10000; <Enter> means “type these exact commands into the command line and then press Enter”.
Generating DNA sequences has become rather simple routine work that is not hard to understand (although routine success in this field only comes with experience). The analytical methods of phylogenetic tree reconstruction are considerably more difficult to understand. This
course will teach you how to compute phylogenetic trees from DNA sequence data, how to find
the tree that is best supported by the data and how to quantify the statistical support for trees.
We try to explain the statistical background with as little mathematics as possible. But we cannot do entirely without mathematics and statistics. In order to follow the course (and to find out
where you need more help and explanations!) you will have to read this tutorial carefully. We
also offer the e-learning program “E-volution” (Schulte et al. 2015) in parallel to this course so that you can familiarize yourself more deeply with the most important aspects of phylogenetic reconstruction.
2. Course Objectives
Some of the methods for calculating phylogenetic trees are not easy to understand. Although
the computer does all of the computations, you need at least an intuitive understanding of these
methods and the calculation steps to interpret your data. This tutorial tries to explain the background of these methods with as little mathematics as possible using examples and analogies.
But it cannot do entirely without some simple algebra and statistics. In order to make the most
of this course you should work through this tutorial thoroughly beforehand, even if you do not
understand everything. This will enable you to ask specific questions during the course. Participants with a basic knowledge of German can also use the e-learning platform E-volution
(Schulte et al. 2013, accessible under “Biowissenschaften” at: http://lernbar.uni-frankfurt.de/) to
complement the information and explanations given in this course.
After this course you should be able to:
1) describe and understand phylogenetic trees,
2) compute phylogenetic trees using maximum parsimony, maximum likelihood and Bayesian methods,
3) describe weaknesses and advantages of these methods,
4) understand and calculate the statistical support of branches in a phylogenetic tree,
5) date evolutionary events using molecular clock trees, and
6) reconstruct character evolution using phylogenetic trees.
3. Data Generation
To “sequence” DNA means to determine the order of the nucleotide bases – the base sequence – of a DNA segment. The information an organism needs for the synthesis of proteins is stored on the DNA in the form of base triplets (groups of three nucleotide bases). In addition, eukaryotes carry a large amount of non-coding DNA that apparently contains no information at all. DNA sequences are passed on from one generation to the next. Through occasional errors during DNA replication (mutations), ever larger sequence differences accumulate between organisms over time. The information that matters to systematists consists of exactly these heritably fixed changes (substitutions) that sequences undergo in the course of evolutionary or population history. Put simply: the more closely two organisms are related, the more similar their DNA sequences. Molecular systematics is thus about tracing the evolutionary history of organisms from differences between their DNA (or protein) sequences. This is a process that comprises a whole series of working steps. The following summary is meant to give you an overview.
1) The first step of DNA sequencing is the extraction of the DNA from tissue samples. The DNA must be released from the cells, and all other cell components must be removed.
2) One or more previously chosen DNA segments must be present at a concentration high enough for their base sequence to be determined. The polymerase chain reaction (PCR) serves to amplify the selected DNA segments.
3) The end product of the PCR serves as the starting material for the actual sequencing reaction. For optimal results a precisely defined amount of PCR product is used, whose concentration must be determined for this purpose.
The following steps are not carried out in practice during this course.
4) Purification of the PCR products removes all substances that would interfere with the sequencing reaction.
5) The purified PCR product is subjected to a further PCR, the sequencing reaction. This time only one primer is used, so the DNA is no longer doubled. Usually both DNA strands are sequenced for verification and cross-checking, so two or more reactions are set up, each with a different primer.
6) In the last step the product of the sequencing reaction is separated electrophoretically in an automated DNA sequencer, and the DNA sequence is determined with the help of fluorescent markers.
DNA isolation
Processing the plant material and DNA extraction
Freshly collected plant material gives the best yield of DNA. Its quality (expressed as the molecular weight of the DNA) is also highest when fresh material is used. In everyday practice, however, one often falls back on preserved material (who carries a DNA extraction lab along on collecting trips?). Plant DNA is best preserved by rapid drying with silica gel beads, storage in a special buffer mixture, or deep-freezing (although one rarely has a freezer at hand in the field either). DNA can also be isolated from herbarium specimens, i.e. material dried by pressing, but it is often strongly degraded. The most diverse DNA extraction methods have been developed for different groups of organisms. In this course we use two methods as examples: an extraction with CTAB and chloroform, and one with a commercially available kit (the “DNeasy Plant Mini Kit” from Qiagen).
With both methods the first step is the easiest: grind the plant material beyond recognition! This is necessary because in the next step the DNA is to be released from the cells. The larger the surface area, the faster the enzymes required for this can act. Even so, proceed with caution: wear gloves!
DNA extraction with the CTAB method
Countless variants of this protocol exist for animal and plant material, and almost every researcher swears by his or her own recipe. In this course we use a protocol optimized for the family Bromeliaceae, based on the basic recipe of Doyle & Doyle (1987):
1. Each participant receives 4 plant samples: 3 from herbarium specimens of different ages and 1 sample dried in silica gel.
2. For each sample, label the lid and the side wall of a 2 mL Eppendorf tube with the sample number. Protect the label on the side with a small piece of adhesive tape. Weigh in 20-30 mg of each plant sample.
3. Pipette 650 µL of CTAB extraction buffer (CTAB = cetyltrimethylammonium bromide) into each tube. Add 1.3 µL of mercaptoethanol (= MET). Careful, MET is toxic – work under the fume hood!
4. Place the tubes in the heating block at 60 °C.
5. Set out a mortar and pestle for each plant sample and add a knife tip of sand to each mortar.
6. Put on safety goggles and cotton gloves, with latex gloves over them!
7. Half-fill the mortar with liquid nitrogen. Careful! Even with cotton gloves, touch the cooled mortar only briefly. Add the plant sample and wait until the nitrogen has almost completely evaporated.
8. Start grinding immediately.
9. Transfer the plant powder into the corresponding Eppendorf tube with a spatula and funnel before it thaws! Work under the fume hood!
10. Mix buffer and plant material by briefly shaking the closed tube.
11. When all samples have been processed and are standing in the heating block, wait 30-60 minutes. During this time, shake the tubes vigorously every 5 minutes. After consulting the course instructors you may be able to take a break during this period.
12. Then let the samples cool to room temperature (RT).
13. Add 650 µL of thoroughly mixed “chloroform-isoamyl alcohol 24 + 1”. Careful: chloroform is harmful to your health – work under the fume hood! Agitate the samples by hand for 10 minutes.
14. Centrifuge for 15 minutes at 9000 rpm and room temperature. Make sure the rotor is balanced!
15. Label new 1.5 mL Eppendorf tubes with the sample numbers.
16. Carefully draw off the upper, DNA-containing phase with the 1000 µL pipette and transfer it to the prepared 1.5 mL Eppendorf tube. Work under the fume hood!
17. Estimate the volume of the DNA solution and add 0.6 volumes of isopropanol (RT), close the lid and mix gently. No streaks should remain visible.
18. Leave for 1 hour at –20 °C. After consulting the course instructors you may be able to take a break during this period.
19. Centrifuge for 15 minutes at 13,000 rpm and 10 °C. The DNA now forms a pellet at the bottom of the tube.
20. Carefully pour off the supernatant into a waste vessel. Dab the rim of the 1.5 mL Eppendorf tube thoroughly on tissue paper.
21. Add 500 µL of 70% undenatured ethanol. Swirl the tube briefly.
22. Centrifuge for 15 minutes at 13,000 rpm and 10 °C. The DNA again forms a pellet at the bottom of the tube.
23. Carefully pour off the supernatant into a waste vessel. Dab the rim of the 1.5 mL Eppendorf tube thoroughly on tissue paper.
24. Dry the pellet in the open 1.5 mL Eppendorf tube at room temperature for 20-30 minutes. Take care that the DNA does not dry out completely.
25. Add 100 µL of TE buffer (pH 8) and 1 µL of RNase solution (10 mg/mL). Incubate for 1-2 hours at 37 °C until the pellet has dissolved.
26. Store the DNA at 4 °C in the refrigerator for further use.
In this procedure the cell walls are broken open first. Several substances then act in the extraction buffer that is added: CTAB is a detergent and therefore destroys cell membranes, removes polysaccharides (usually abundant, especially in plants) and denatures proteins (e.g. nucleases). EDTA blocks DNA-modifying enzymes, and MET protects against oxidative damage. The incubation at 60 °C also helps denature proteins. Chloroform is used to separate the nucleic acids from unwanted substances; the added isoamyl alcohol prevents foaming. The nucleic acids collect in the upper, aqueous phase; the denatured proteins form a whitish interface with the lower, organic phase, which contains lipids among other things. Sand and fibres settle at the very bottom. The nucleic acids are precipitated with isopropanol and then washed with 70% ethanol to remove salts and other water-soluble substances. The added RNase degrades the RNA (after all, we want to work with DNA).
The DNA solution keeps for several years in the refrigerator. Only freeze DNA solutions if you are sure they will rarely be needed again. Repeated thawing and freezing destroys the DNA.
DNA extraction with the “DNeasy Plant Mini Kit” (Qiagen)
1. As above, 4 plant samples are provided for each participant. Label one screw-cap tube for each sample.
2. Weigh out 20-30 mg of each sample and place it in the tube. Carefully add one or two ceramic beads to each sample.
3. Preheat a heating block to 65 °C.
4. Warm the bottle of buffer AP1 in the water bath.
5. Grind the material to a fine powder in the ball mill. Check the degree of grinding of your sample regularly.
6. Pipette 400 µL of buffer AP1 (lysis buffer) and 4 µL of RNase to each sample. Mix buffer and plant material by briefly shaking the closed tube.
7. Place the samples in the preheated heating block immediately.
8. When all samples have been processed and are standing in the heating block, wait 30-60 minutes. During this time, shake the tubes vigorously every 5 minutes and prepare the following steps:
9. For each sample, label 1 violet column and 1 white column in its collecting tube, as well as two 1.5 mL tubes, with the respective sample number. Label the collecting tubes too.
10. Warm buffer AE on a heating block (65 °C).
11. After the incubation, pipette 130 µL of buffer P3 (precipitation buffer) to each sample. Change pipette tips!
12. Mix by inverting and place the samples on ice for 5 minutes.
13. Centrifuge the mixture for 5 minutes at 14,000 rpm. Make sure the rotor is balanced!
14. Pipette the supernatant onto the violet column.
15. Centrifuge for 2 minutes at 14,000 rpm. Discard the column.
16. Pipette the flow-through (without the pellet!) into one of the 1.5 mL Eppendorf tubes.
17. Add 1.5 volumes of buffer AW1 (binding buffer) to the flow-through and mix immediately by pipetting up and down thoroughly. Change pipette tips! How do you determine the volume of the flow-through?
18. Transfer 650 µL of this mixture onto the white column. Change pipette tips!
19. Centrifuge for 1 minute at 8000 rpm.
20. Discard the flow-through. Repeat steps 18 and 19 until the entire sample has passed through the column. Then discard the 1.5 mL tubes.
21. Place the white column in a new 2 mL collecting tube and pipette 500 µL of buffer AW2 (wash buffer) onto the column.
22. Centrifuge for 1 minute at 8000 rpm.
23. Discard the flow-through.
24. Pipette 500 µL of buffer AW2 onto the column.
25. Centrifuge for 2 minutes at 14,000 rpm.
26. Transfer the white column into the second 1.5 mL Eppendorf tube and pipette 100 µL of warm buffer AE (elution buffer) onto the column. Let the columns stand for 5 minutes at room temperature.
27. Centrifuge for 1 minute at 8000 rpm.
28. Repeat the elution with a further 50 µL of buffer AE. Discard the white column and close the Eppendorf tubes. Label them on the side with species name and date.
After the last step you are now holding the extracted DNA, in a buffer solution, in your hands. In detail, in steps 1-4 you disrupted the material mechanically, and in steps 5-7 you lysed the cell membranes and released the DNA. In steps 9-15 you removed solid components and most of the polysaccharides. In steps 16-19 the DNA was bound to a carrier membrane. In steps 20-24 you washed the bound material free of protein residues, further polysaccharides, nucleotides and inorganic cell components. In steps 25-27 the DNA was released from the carrier membrane again.
If problems occur in the PCR (see below), one can try eluting with water instead of elution buffer and adding 10% TE buffer afterwards. The EDTA contained at a higher concentration in many elution buffers binds Mg ions, which often leads to poor product yield in the PCR.
Fluorometric quantification of the DNA with the Qubit fluorometer
Many PCR reactions work better if the amount of input DNA can be set precisely. The amount of DNA in each eluate must therefore be determined in the next step. The required dilution can then be prepared with double-distilled water.
1. Label one 0.5 mL tube for each sample (CTAB extraction and column extraction) and for standards 1 and 2.
2. Prepare 200 µL of working solution for each assay (samples and standards): 199 µL buffer + 1 µL dye (do not use glass vessels); mix.
3. Place 190 µL of the working solution in the 0.5 mL tube for each standard, but 198 µL for each sample.
4. Then add 10 µL per standard and 2 µL per sample. Vortex thoroughly and let stand for 2 min at RT.
5. Switch on the Qubit instrument: >Choose your assay >DNA >dsDNA broad range >Read new standards >Yes.
6. Following the instructions, measure standard 1 first, then standard 2 (>Read).
7. Start the measurement of the first sample with >Insert Assaytube >Read.
8. Then choose >Calculate stock concentration and use the wheel to set >Volume of sample used to 2 µL. Note the value in µg/mL (equivalent to ng/µL).
9. Measure the next sample with >Read next sample.
10. Calculate how you have to dilute the samples to obtain 10 µL each at a concentration of 5 ng/µL (see the sketch below). Record your results in a table.
11. Label a new 1.5 mL tube for each dilution and prepare the dilutions according to your calculations.
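Step 10 is the usual C1·V1 = C2·V2 dilution. The following is a minimal Python sketch of the arithmetic; the sample numbers and Qubit readings are hypothetical.

    # Dilution to 10 µL at 5 ng/µL: C1*V1 = C2*V2 (step 10).
    TARGET_CONC = 5.0   # ng/µL, desired concentration
    TARGET_VOL = 10.0   # µL, desired final volume

    def dilution(stock_conc):
        """Return (µL of extract, µL of water) for one dilution."""
        extract = TARGET_CONC * TARGET_VOL / stock_conc
        return extract, TARGET_VOL - extract

    # Hypothetical Qubit readings in ng/µL:
    for sample, conc in {"127": 42.5, "128": 13.8}.items():
        extract, water = dilution(conc)
        print(f"Sample {sample}: {extract:.2f} µL extract + {water:.2f} µL water")

If a reading is below 5 ng/µL, the extract cannot be brought to the target concentration and is used undiluted.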
PCR reaction
The polymerase chain reaction (PCR) is a series of biochemical reactions that run at different temperatures in a single reaction vessel. In essence, PCR simulates in a test tube the replication of DNA that precedes cell division in nature. The isolated genomic DNA is combined in a buffer solution with a DNA polymerase, Mg2+ ions and deoxyribonucleotides (dNTPs). For the polymerase to use the DNA as a template and synthesize a complementary strand from the nucleotides present in the solution, the two strands of the DNA must first be separated (denatured). This happens at 90-95 °C. In addition, the enzyme needs a starting strand to which it can attach further nucleotides. This short strand is supplied in the form of two “primers” (see Fig. 1). These short oligonucleotides bind at 40-60 °C to matching, complementary DNA segments (annealing). This is what allows you to determine which parts of the DNA you want to amplify. The most commonly used polymerase comes from the bacterium Thermus aquaticus, which lives in hot springs, and is therefore heat-stable (Taq polymerase). Its optimal reaction temperature is 72 °C. The cycle of DNA denaturation, primer annealing and DNA synthesis is run 30 to 40 times in an automated thermocycler; in the (never attained) ideal case, the amount of the selected DNA segment doubles with every cycle.
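Since the product doubles in every ideal cycle, the theoretical amplification after n cycles is 2^n-fold. A one-line check of what that amounts to (a minimal sketch):

    # Ideal PCR amplification: the target doubles in every cycle.
    for cycles in (30, 35, 40):
        print(f"{cycles} cycles: at most a {2**cycles:.1e}-fold increase")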
In this course a segment of the chloroplast DNA will be amplified, namely the barcoding region of rbcL (the gene for the large subunit of the enzyme RuBisCO).
Fig. 1: Primer positions for sequencing the 5’ end of rbcL.
PCR reactions can be very temperamental; small changes in the protocol often lead to complete failure. Nevertheless, quite different protocols are often used in different labs. The most common procedure for optimizing PCR reactions is trial and error: once the PCR finally runs as desired, the protocol is not changed any more (or only to test whether the result can be improved further).
Multiple bands, smeary bands or missing PCR product are the most frequent problems. Missing PCR product, for example, can have all kinds of causes:
- The extraction protocol failed; the eluate contains no or too little DNA (this should be ruled out after quantification of the DNA).
- Too much DNA was added (this also causes problems, but can be avoided by quantification).
- The template DNA is degraded (e.g. by repeated freezing and thawing).
- The DNA extract contains substances (e.g. polysaccharides, EDTA) that interfere with the PCR.
- The polymerase is past its prime or was not stored frozen.
- One or both primers are past their prime or were not stored frozen.
- Some idiot mislabelled the primers.
- The primers do not match the binding sites (this can happen now and then even with supposedly universal primers).
- An ingredient was left out of the reaction mix.
- The annealing temperature is too high.
- The thermocycler is defective.
The only way to track down the cause is to rule out the various possibilities one by one. When in doubt, it is advisable to ask experienced colleagues or the internet for advice.
The PCR takes place in PCR tubes (0.2 mL). Although its optimal reaction temperature is 72 °C, the polymerase is already active at room temperature and starts adding nucleotides to primers here and there. This can interfere with the PCR, which is why the reactions are set up on ice. More expensive so-called “hot start” enzymes contain antibodies that deactivate the polymerase. Preheating to 96 °C for several minutes in the thermocycler denatures the antibodies and activates the enzyme. With such enzymes you can also work at room temperature.
1. Put on gloves and thaw the following ingredients:
- distilled, autoclaved water
- Peqlab 10x reaction buffer S (blue lid) = PCR buffer
- MgCl2
- dNTP mix
- primers rbcLa_F and rbcLa_R
- DMSO
- BSA
- trehalose
After thawing, mix each reagent by swirling it briefly, spin it down briefly and keep it on ice (except the DMSO).
2. Label one PCR tube (0.2 mL) for each PCR reaction and for the negative control, and place it on ice.
3. The Taq polymerase remains liquid despite storage in the freezer and is transferred directly onto ice.
4. Transfer your DNA extract directly from the refrigerator onto ice.
Apart from the template DNA, your PCR reactions contain the same chemicals at the same concentrations. To save yourself unnecessary pipetting, you therefore first prepare a so-called “master mix” that contains everything except the DNA to be amplified. Each PCR reaction has a volume of 25 µL, of which 5 µL is DNA extract. Reactions of 50 µL are also common; 100 µL reactions are, because of the high cost, only used when large amounts of DNA are needed.
Task: The master mix contains the following components.

Component           Stock conc.    Conc./reaction    Vol./reaction    Vol./x reactions
DNA template        5 ng/µL        1 ng/µL           5 µL             –
PCR buffer (blue)   10 x           1 x
dNTP mix            2 mM           0.32 mM
MgCl2               25 mM          2 mM
Primer rbcLa_F      10 µM          0.4 µM
Primer rbcLa_R      10 µM          0.4 µM
DMSO                –              –                 2 µL
BSA                 100 x          4 x
Trehalose           3 M            0.25 M
Taq polymerase      5 U/µL         2.5 U
Water               –              –                 ad 25 µL
Calculate the volumes of the individual components for one 25 µL reaction and a master mix for x 25 µL reactions. Keep in mind that you have to add 5 µL of DNA extract. A sketch of this calculation follows below.
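As a starting point for this task, here is a minimal Python sketch of the calculation. It assumes the table as reconstructed above (including a fixed 2 µL of DMSO per reaction, an assumption); each volume follows from volume = final concentration × reaction volume ÷ stock concentration.

    # Master mix volumes for one 25 µL reaction; 5 µL of DNA extract is
    # pipetted separately, so the mix itself amounts to 20 µL per reaction.
    REACTION_VOL = 25.0   # µL per reaction
    DNA_VOL = 5.0         # µL DNA extract, added separately
    DMSO_VOL = 2.0        # µL per reaction, fixed volume (assumed from the table)

    components = {         # name: (stock conc., final conc.) in matching units
        "PCR buffer": (10, 1),                      # x
        "dNTP mix": (2, 0.32),                      # mM
        "MgCl2": (25, 2),                           # mM
        "Primer rbcLa_F": (10, 0.4),                # µM
        "Primer rbcLa_R": (10, 0.4),                # µM
        "BSA": (100, 4),                            # x
        "Trehalose": (3, 0.25),                     # M
        "Taq polymerase": (5, 2.5 / REACTION_VOL),  # 5 U/µL stock, 2.5 U per reaction
    }

    n = 5  # e.g. 4 samples + 1 negative control
    total = DMSO_VOL
    for name, (stock, final) in components.items():
        vol = final * REACTION_VOL / stock
        total += vol
        print(f"{name:15s} {vol:5.2f} µL/reaction  {vol * n:6.2f} µL for {n} reactions")
    print(f"{'DMSO':15s} {DMSO_VOL:5.2f} µL/reaction  {DMSO_VOL * n:6.2f} µL for {n} reactions")
    water = REACTION_VOL - DNA_VOL - total
    print(f"{'Water':15s} {water:5.2f} µL/reaction  {water * n:6.2f} µL for {n} reactions")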
1. Calculate the amounts for the master mix. For this, count one portion for each PCR reaction plus one additional reaction (= the negative control).
2. Place an Eppendorf tube (1.5 mL) on ice and pipette the ingredients together, water first, then the others in the order given. Use a fresh pipette tip for each reagent. Before pipetting, mix each reagent (except the polymerase, see below!) by pipetting up and down while stirring.
3. Pipette the polymerase last. Stir the polymerase solution only very gently, but thoroughly, with the pipette tip. The solution is very viscous; make sure that nothing sticks to the outside of the pipette tip. This enzyme is extremely expensive (0.5 mL costs about 1000 euros).
4. Mix the master mix until you no longer see any streaks. Pipette 20 µL into each reaction tube.
5. Finally, pipette 5 µL of DNA extract (or the equivalent amount of H2O for the negative control) into the respective tube and close it.
6. Spin the PCR tubes down briefly (a few revolutions) and place them back on ice immediately.
7. Switch on the thermocycler (model Arktik Thermal Cycler) and start the program rbcl_pwg:
8. >Run >Carmen Jung (select with the arrow keys) >Open >rbcl_pwg (select with the arrow keys) >Start >Lidtemp 099 °C >ok >TempMode >Sample >Vesseltype >Common >Sample Volume >25 µl.
9. Open the lid, insert the samples and close the lid >Accept >Yes.
PCR program rbcl_pwg:

Initial denaturation:   95 °C    4 min
Denaturation:           94 °C   30 sec  \
Annealing:              55 °C   60 sec   } 35 cycles
Elongation:             72 °C   60 sec  /
Final elongation:       72 °C   10 min
Hold:                    8 °C   indefinite
Three important notes:
1) Keep a record of the PCR by noting, for each tube, the DNA extract, the PCR number and the amounts of all components of the reaction, preferably in a table like this:

Extr. no.   PCR no.   DNA    H2O    Buffer   dNTPs   Primer1   Primer2   Enzyme
127         1345      5 µL   x µL   x µL     x µL    x µL      x µL      x µL
128         1346      …      …      …        …       …         …         …
This record is very important. At the beginning of a new series of experiments (new organisms, new gene regions, etc.) PCR reactions almost always have to be optimized. That is only possible if the experimental conditions of every reaction are recorded.
2) DNA extracts from different organisms cannot be told apart by eye. You must therefore guard meticulously against mixing up reaction tubes or numbers. So label all Eppendorf tubes cleanly and permanently, and be sure to note the date of each reaction as well! Only then do you stand a chance of tracking down mix-ups after the fact.
3) Work with fastidious accuracy in the “post-PCR phase”, or you will ruin your relationship with your colleagues. PCR products contain DNA at a million-fold increased concentration, which competes in the lab with genomic DNA for polymerase, primers and dNTPs. If you are unlucky, you will then obtain beautiful PCR products derived from organisms you never wanted to study. In that case the entire lab, including furniture, equipment, glass bottles etc., has to be cleaned, and contaminated consumables have to be thrown away.
Detection and densitometric quantification of PCR products
The exciting question now is: did the PCR work or not? To answer it, you run an agarose gel electrophoresis. The gel is stained with SYBRgreen, which forms a fluorescent complex with DNA. Viewing the gel on a UV or blue-light transilluminator shows you (1) whether any PCR product was formed at all, (2) whether a single specific product or several products of different lengths were formed, and (3) approximately how long the amplified DNA fragment is, i.e. whether the right fragment was amplified. Always wear blue nitrile gloves when working with ethidium bromide or SYBRgreen!
The quality of the sequences depends sensitively on the amount of DNA in the sequencing reaction. Below a certain threshold the sequencing reaction fails; above it the sequences often become messy. Such poor sequence chromatograms have to be edited in nerve-racking manual work. Even then, many positions of the sequence often remain uncertain, which complicates the data analysis. To spare yourself this trouble, it is better to quantify the DNA. We have already quantified the genomic DNA fluorometrically. The following method gives considerably more accurate values but is also more laborious. Normally only the concentration of the purified PCR products is determined. In this course, however, the aim is to assess the quality of the genomic DNA in plant samples preserved in different ways. The most important question is how well the material is suited for PCR. Since DNA is lost during purification to an extent that is difficult to control, we quantify the unpurified PCR product here.
1. Prepare 1% agarose gels according to the total number of PCR products. Choose the appropriate chamber sizes and tray inserts. In each row, one well must remain free for a size marker. Please use one glass bottle per gel.
2. EXAMPLE for a medium-sized gel: Weigh 1 g of agarose into a glass bottle. Careful: agarose and cocaine hardly differ in appearance or price (though presumably in effect). Add 100 mL of 1 × TAE and a magnetic stir bar and rest a lid loosely on top.
3. Heat the mixture for 1-2 min at the highest setting in the microwave.
4. Stir once on the magnetic stirrer.
5. Now continue heating in the microwave in small steps (20-30 sec with stirring in between) until the solution is completely clear. Agarose boils over within seconds (an enormous mess!). Watch the bottle closely and stop the microwave as soon as the first bubbles appear. Hot agarose solution also starts foaming very easily on the magnetic stirrer.
6. When the agarose solution is clear, let it cool on the magnetic stirrer to about 60 °C (with gloves, just below the pain threshold).
7. For each gel, insert the gel tray into the electrophoresis chamber rotated by 90°, so that the rubber seals rest against the walls. Hang in one or two combs.
8. Wear nitrile gloves for the following steps.
9. Pipette in 8 µL of SYBRgreen per 100 mL of agarose solution. Avoid any direct contact with the dye and dispose of the pipette tip only in the container provided for it.
10. Pour the mixture carefully into the respective gel tray without creating bubbles. Manoeuvre any bubbles to the edge of the gel with a pipette tip. Wait about 20 minutes before starting the next steps.
11. When the gel looks milky, first pull the combs out of the gel very carefully. Then lift the gel tray out of the tank and rotate it by 90°. Fill 1 × TAE buffer into the electrophoresis tank until the gel is covered. Check the orientation of the gel: the DNA migrates towards the anode (red connector).
12. Cut off a strip of Parafilm and lay it directly on the bench.
13. On the Parafilm, pipette one drop of 7 µL of water for each PCR product, side by side.
14. Pipette 2 µL of 6x loading buffer into each drop.
15. Finally add 3 µL of PCR product to each drop. Mix the PCR products thoroughly beforehand and change pipette tips! Note the names and order of the PCR products on a piece of paper.
16. Pipette each drop into a well of the gel.
17. Pipette 5 µL of size marker (“Easy Ladder”) into the last well of each row.
18. Put on the lid, check the orientation of the gel once more and start the electrophoresis at 85 V.
19. When the blue marker has travelled three quarters of the available distance, the gel can be viewed on the transilluminator.
20. Put blue nitrile gloves on again. Carefully lift the gel tray out of the electrophoresis tank and dry it off with paper.
21. Transfer the gel onto the transilluminator in the gel documentation chamber without dripping. Carefully pour the used buffer into the appropriate collecting bottle.
22. Close the door of the chamber and switch on the instrument and the UV illumination (“Transillumination”). Work quickly now: UV light damages the DNA and can prevent successful sequencing of the PCR product.
23. Start the program “Herolab EasyWin32” (on the desktop), accept “as last used” and click >Datei >Kamera aktivieren in the menu bar.
24. Zoom and focus the camera so that the whole gel and all bands are sharply visible.
25. Take a picture by right-clicking on the live image, save it, print a photo of the gel and glue it under your PCR record for documentation.
26. Once you have saved the image of the gel, choose >Auswertung >Modul C Spot Einzelvolumenanalyse from the main menu.
27. A cross appears in place of the mouse pointer. Place the cross on the upper left corner of the reference band of the size marker (500 bp), drag open the frame that appears towards the lower right and fix it with a click of the left mouse button. The frame should be slightly larger than the band.
28. Mark all other bands in the same way. A click of the right mouse button ends the placing of frames.
29. A double click of the left mouse button inside any frame opens the table with the measurement results.
30. To quantify the individual DNA bands we use the 500 bp band of the size marker, whose concentration is known (50 ng of DNA in 5 µL of marker). In the table, select the row of the reference band and open the context menu with the right mouse button: >Volumen zuweisen. Tick >Verwenden, enter 50 under >Neu, then >Bestätigen.
31. The column “Vol Kalib” of the table now contains the DNA concentration per 2 µL of PCR product.
32. Calculate the concentration per 1 µL and how much PCR product you would have to use in the sequencing reaction to reach 20 ng of DNA (see the sketch after this list).
33. Switch off the UV light and dispose of the gel in the appropriate container. Clean the surface of the transilluminator with distilled water and paper towels.
34. Dispose of your blue nitrile gloves in the hazardous waste.
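A minimal sketch of the calculation in step 32, using a hypothetical “Vol Kalib” reading:

    # "Vol Kalib" gives ng of DNA per 2 µL of PCR product (step 31).
    vol_kalib = 36.0               # ng per 2 µL; hypothetical reading
    conc = vol_kalib / 2.0         # ng/µL
    volume_for_20ng = 20.0 / conc  # µL needed in the sequencing reaction
    print(f"{conc:.1f} ng/µL -> use {volume_for_20ng:.2f} µL for 20 ng")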
The amplified DNA of a successful reaction appears as a more or less broad band on the gel. These bands should be clearly visible and sharply delimited, and the fragments should match the expected product in length. The gel lane of the negative control should be black. If you find a band in the negative control that matches one of the bands in your PCR reactions, your reactions may be contaminated. Contaminated reagents are usually the cause of such contamination; the impurities arise, for example, when pipette tips were not changed and genomic DNA was carried over. In this case the PCR must be repeated. In the worst case you are dealing with lab-wide contamination (see above).
Purification of the PCR products
We do not carry out the purification of the PCR products or the sequencing in this course. Residues of the PCR reaction, especially polymerase and unused primers, must be removed before the actual sequencing reaction can be set up. In the most thorough procedure, the entire PCR product is separated on an agarose gel and the bands are cut out. The electrophoresis separates the PCR product from the other components of the PCR reaction. The excised band with the desired DNA is then released from the agarose in an enzymatic reaction. The further purification steps work on the same principle as DNA isolation kits: the DNA (PCR product) is bound to a membrane, washed and eluted with water or buffer.
DNA sequencing is nowadays routinely done by central facilities at universities or by commercial laboratories. Nevertheless, try to solve task 5.
Sequencing reaction
The purpose of the first PCR was to amplify a specific segment of the genomic DNA exponentially so that it could be sequenced later. In the sequencing reaction this highly concentrated template DNA is no longer amplified exponentially. Only one primer is therefore added, so that the polymerase can synthesize only one of the two DNA strands. In addition to ordinary dNTPs, the sequencing reaction contains a small proportion of dideoxyribonucleotides (ddNTPs). Usually one uses ready-made mixtures (so-called terminator kits, some with flowery names) offered by the manufacturers of automated sequencers. A different fluorescent dye is attached to each of the four ddNTPs. Whenever such a dd-nucleotide happens to be incorporated, the DNA strand cannot be extended any further and the reaction stops. At the end of many reaction cycles one thus obtains a mixture of fluorescently labelled DNA strands of all different lengths, in which the last nucleotide gives itself away by its specific fluorescence. In the final step these many hundreds of different DNA molecules are separated by size on a polyacrylamide gel. In the automated sequencers used for this, two diode lasers at the end of the gel excite the DNA molecules migrating past to fluoresce. This fluorescence is recorded by a detector, whose signals are stored as a computer file in the form of a sequence chromatogram.
Fig. 2: Sequence chromatogram.
Such chromatograms are readable only over a limited length, at best about 1000 base pairs (bp). Often, though, the sequenced gene regions are longer, so for this reason alone both strands must be sequenced. To be on the safe side, one can additionally use internal sequencing primers.
4. Data Analysis: General Considerations
Using DNA sequence data we want to answer the question of how closely or distantly different species are related to each other. Generating data is an important first step in the empirical sciences. However, analysing these data is equally important. It is only after data analysis that a
scientist knows whether a particular hypothesis is supported or rejected or which of the many
possible hypotheses fits the data best. Molecular systematists and evolutionary biologists use
analytical methods that are widespread in the sciences. However, the application of these
methods in systematics is often not easy to understand. The reason for this is not only the
widespread dread of mathematics among biologists. The questions put forward by systematists
are often more complicated to answer than those in physics or chemistry. Consider the following
example to illustrate this: In 1919 the astrophysicist Arthur Stanley Eddington travelled to the
island of Príncipe in the Gulf of Guinea. During a total solar eclipse he gathered data to test certain conclusions from the theory of relativity. Ultimately, he wanted to know whether these data
supported Newtonian or Einsteinian physics. Eddington only had to deal with two theories and
three possible outcomes. The data could support either the theory of relativity or Newtonian
physics or it could contradict both theories.
Which hypotheses does a systematist test with molecular data? The surprising answer is: the phylogenetic trees themselves. But are these trees not simply “calculated” from the data? With the three methods of tree reconstruction treated in this course, the trees do not follow inevitably from the data. Instead, we have to calculate how well the data fit a certain tree. Every possible phylogenetic tree is a hypothesis of phylogenetic relationships, and we have to identify the tree that is best supported by our data. Unfortunately, the number of possible phylogenetic trees gets unmanageably large even for small datasets with few species (see Box 1). Unlike Eddington, we cannot just make a simple yes-no decision; we have to decide among a myriad of possible hypotheses.
A further difficulty lies in the way in which scientists test hypotheses and theories. Usually, certain “expected” observations are deduced from a theory. The next step is then to test whether the experimental observations are in line with the “theoretical” ones. For example, according to the theory of relativity light does not follow an exactly straight path but is bent by gravitational fields. This is exactly what Eddington observed during the eclipse: the light of stars that were optically close to the sun was deflected by the sun’s gravity field. But which theoretical and
testable observations can be deduced from a phylogenetic tree? Even entirely absurd trees, in
which for example all flying animals are united in a single group, cannot be rejected entirely by
our data, even if they fit the data much worse than trees in which birds, bats and insects form
different groups.
5. Phylogenetic Trees and Methods
The question most commonly asked by systematists is how organisms are related to each
other. According to the theory of evolution all extant organisms have descended from a single
common ancestor in the distant past and are therefore related to each other. Because the DNA
is replicated and passed on from one generation to the next, all homologous DNA sequences are similarly related to each other. So are protein sequences, because their amino acid sequence follows directly from the DNA sequence of the coding gene. The mechanism of DNA replication is not error-free. Every now and then copying errors occur and one nucleotide is replaced by another, a process called mutation. If these mutations are not lethal they can be passed on to the next generation. By the actions of genetic drift and selection, some mutations
will eventually occur in all individuals of a species and are then called substitutions. Mutation is
a random process. The longer two sequences have existed independently of each other, the
more substitutions will have occurred in both of them. The property of DNA and protein sequences to accumulate genetic differences over time can be used to reconstruct the evolutionary history and relationships of organisms. As a rule of thumb one can say that two highly similar individuals or species are more closely related to each other than two very different ones.
In systematics and taxonomy, relationships are commonly illustrated by genealogical or phylogenetic trees. As stated above, such a tree is a model or a hypothesis of the relationships
among the studied species. Before we turn to the methods with which DNA sequence data can be converted into evolutionary trees, we will briefly discuss some properties of phylogenetic
trees.
Box 1: Phylogenetic trees
A phylogenetic tree illustrates the evolutionary relationships among organisms. It is a graphical representation of the hypothesis: “The species in this data set could be related to each other in this particular
way.”
A phylogenetic tree consists of branches (or “edges” in the language of mathematicians) which are
joined at nodes. In the following example two internal nodes are denoted as X and Y. The organisms
(or better: their DNA sequences) are found at the tips of the branches and are also called terminal nodes
(here called A-L). The particular way in which branches and nodes are tied together is called the topology of the tree. In this example K and L are more closely related to each other than H and L. The length of the branches is proportional to the number of substitutions between one node and the next. The scale bar below the tree shows the branch length for 0.1 substitutions per nucleotide position.
[Figure: example phylogenetic tree with terminal nodes A-L and internal nodes X and Y; scale bar = 0.1 substitutions per nucleotide position.]
Most phylogenetic trees are dichotomously branched or bifurcating. This means that exactly two branches arise from each ancestral node. If more than two branches arise from a single node this is called a
polytomy. The branches of a phylogenetic tree can be rotated around their respective nodes without
changing the information content of the tree, pretty much like the parts of a mobile. In the tree above, D
is not more closely related to E-F than C is. You are free to rotate them around node Y, for example, if this
makes it easier for you to describe the relationships. Phylogenetic trees may have a root or not. A root
confers additional information to a tree. It tells you from where the evolution of a group took its origin. In
principle, you can fix a root to any branch of an unrooted tree. In practice you usually include species in
your dataset that are closely related to the group of interest but not part of it. This “outgroup” will automatically provide a root for the “ingroup”.
The number of possible trees increases with the number of taxa included in the dataset and can become
incredibly large. For n taxa it can be calculated as:
Un = (2n-5)(2n-7) … (3)(1) for unrooted trees and
Rn = (2n-3)(2n-5) … (3)(1) for rooted trees.
For 20 species, R20 = 8 200 794 532 637 891 559 375; for 100 species there are more possible trees
than atoms in the universe.
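These products are the double factorials (2n−5)!! and (2n−3)!!, which are easy to evaluate; a minimal Python sketch:

    # Number of possible binary trees for n taxa (Box 1).
    def odd_double_factorial(k):
        """Product k * (k-2) * ... * 3 * 1 for odd k."""
        result = 1
        while k > 1:
            result *= k
            k -= 2
        return result

    def unrooted_trees(n):
        return odd_double_factorial(2 * n - 5)

    def rooted_trees(n):
        return odd_double_factorial(2 * n - 3)

    print(rooted_trees(20))              # 8200794532637891559375
    print(len(str(rooted_trees(100))))   # number of digits for 100 taxa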
A convenient way to summarize trees for computer programs is the Newick format. In this format relationships are expressed as nested parentheses around taxon names. In Newick format the tree above looks like this: (((((I,(J,(K,L))),(G,H)),((E,F),(C,D))),B),A). Additional information, such as branch lengths, can be included after a colon behind taxon names. For example: (((((I:0.25,(J:0.11,(K:0.03,L:0.06))),(...
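As an illustration, the Newick string above can be read by any Newick-aware library; here is a minimal sketch using Biopython (the choice of Biopython is an assumption, not part of the course software):

    # Reading a Newick string and printing an ASCII rendering (Biopython).
    from io import StringIO
    from Bio import Phylo

    newick = "(((((I,(J,(K,L))),(G,H)),((E,F),(C,D))),B),A);"
    tree = Phylo.read(StringIO(newick), "newick")
    Phylo.draw_ascii(tree)   # quick text sketch of the topology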
[Figure: rooted tree with a red (monophyletic), a yellow (paraphyletic) and a black (polyphyletic) group; scale bar = 0.1 substitutions per nucleotide position.]
A group is called monophyletic when all its members can be traced back to the same common ancestor and this ancestor has not produced any descendants that are not members of this group (in this example the red group). All members of the yellow group also descend from a single ancestor, but in contrast to the red group, this ancestor has a descendant that is not a member of the yellow group. Such groups are called paraphyletic. A typical example of paraphyly is the reptiles, because birds and mammals also evolved from reptilian ancestors. The members of the black group have different ancestors. This group is called polyphyletic. Of course, if we go further back in time we will also find a common ancestor for all members of the black group (in this case the root of the tree). The distinction between para- and polyphyly is therefore somewhat blurred.
Which criteria shall we use to find the phylogenetic tree that best represents the evolution of our group? In order to find the “best” among all possible trees we first need a quantity that tells us how “good” a tree is in relation to others. We call this an optimization criterion; in one way or another it tells us how well our data fit a certain evolutionary hypothesis. In this course we will learn about three such criteria: (1) the principle of maximum parsimony (MP), (2) the concept of maximum likelihood (ML) and (3) the concept of posterior probability that forms the core of Bayesian statistics.
(1) The principle of maximum parsimony (MP) is the simplest of these criteria. Its philosophical
foundation dates back to the medieval friar William of Ockham. Ockham argued that the theory
that explains a phenomenon with the smallest number of prior assumptions should be preferred
over all other explanations. In the case of phylogenies this means that the tree that requires the smallest number of substitutions to explain the genetic differences between the taxa is the preferred evolutionary hypothesis. MP is not so much a statistical method as one founded on philosophical principles.
In order to determine the minimal number of substitutions for a particular tree one would have to insert all possible combinations of nucleotides at the internal nodes, count how many substitutions each requires, and do this for all nucleotide positions of the dataset and for all possible trees. Fortunately, Walter Fitch (1971) invented a method that greatly facilitates this process, particularly for computers. Starting from the terminal nodes one moves to the next internal node and forms the intersection of the nucleotide sets at the two nodes above it. If this set is empty (because the two nodes carry different nucleotides or sets of nucleotides, as at the two nodes labelled “1” below) one forms the union instead. Every time this happens one counts a substitution event. All one has to do now is add these numbers over all nodes and nucleotide positions of the dataset, do this for all trees, and choose the tree with the smallest grand total of substitutions.
[Figure: Fitch algorithm applied to one nucleotide position on a tree; internal nodes show the intersections of the sets above them or, where these intersections are empty (marked “1”), their unions.]
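The counting pass of Fitch’s method is easy to express in code. The following is a minimal Python sketch for a single alignment position on a fixed binary tree given as nested tuples; as an example it uses position 8 of the alignment shown in Box 2 below.

    # Fitch (1971) counting pass for one position on one tree.
    def fitch(node):
        """Return (nucleotide set, substitution count) for a subtree."""
        if isinstance(node, str):                # terminal node
            return {node}, 0
        left, right = node
        lset, lcost = fitch(left)
        rset, rcost = fitch(right)
        common = lset & rset                     # intersection of the two sets
        if common:
            return common, lcost + rcost         # no substitution counted here
        return lset | rset, lcost + rcost + 1    # union: count one substitution

    # Position 8 of the Box 2 alignment: A=G, B=G, C=C, D=C.
    print(fitch((("G", "C"), ("G", "C")))[1])    # tree ((A,D),(B,C)): 2 substitutions
    print(fitch((("G", "G"), ("C", "C")))[1])    # tree ((A,B),(C,D)): 1 substitution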
MP does not use all nucleotide positions of an alignment. Positions that are identical in all sequences are generally uninformative under MP. If a single sequence deviates from all others at a certain position, the most parsimonious solution is always that the substitution occurred on the terminal branch leading to this sequence. Such a position therefore does not allow any conclusions about the phylogenetic relationships of a species either. Only positions in which more than one taxon shows a certain substitution are parsimony-informative. MP can have serious drawbacks because substitutions are merely counted without modelling the substitution process. Datasets with many homoplasies (see Box 2) can be difficult or impossible to analyse under MP.
Box 2: Homology and Homoplasy
Two or more organisms may share a certain trait or character state because they inherited it from a
common ancestor. A good example is the wings of birds. In this case the trait is called homologous.
Identical character states can also evolve independently as, for example, the wings of birds and bats.
Such a trait is called a homoplasy. In the case of complex organs such as wings it is fairly easy to distinguish homologous from homoplasious character states. In the case of nucleotides this is often impossible at first sight. In most cases we can assume that identical nucleotides at a certain alignment position
are homologous. There are, however, two events that can lead to homoplasies at the molecular level:
parallel substitutions (also called parallelisms) and reversals, in which a nucleotide mutates
back to a previous character state. With only four possible character states, both events are relatively
likely to be observed in big datasets, particularly at rapidly evolving gene loci.
position      123456789
sequence A    AAACCTTGG
sequence B    AATCGTTGG
sequence C    AATCGTTCG
sequence D    AAACCTTCG
The example above illustrates homoplasy at the molecular level. If we assume the tree ((A,D),(B,C)),
either A and B or C and D have independently acquired their identical nucleotide at position 8. If the true
topology is ((A,B),(C,D)), the same holds for the nucleotides at positions 3 and 5.
Many authors make a logical link between the principle of parsimony and the terms homoplasy and
homology. Against the background of common descent, identical character states of two or more species
should be regarded as homologous as long as nothing argues against this assumption. The assumption
that identical traits are the result of two or more independent evolutionary events is an ad hoc hypothesis
to explain the data. The smaller the number of such ad hoc hypotheses, the more parsimonious the
tree. Unfortunately, the story is not as simple as it sounds (see Felsenstein 2004: 136-146).
Homoplasious positions in a dataset lead to conflicts in the choice of the optimal tree because they
favour different tree topologies. A high number of homoplasies in a dataset makes it difficult to decide
between competing phylogenetic hypotheses. Homoplasy indices are therefore often calculated as a
measure of the reliability of an MP tree.
The consistency index CI is calculated by forming the ratio

c_i = m_i / s_i

for every alignment position, where m_i is the minimal number of substitutions at position i in all possible
trees and s_i the observed number of substitutions for the tree under consideration.
CI is then calculated over all positions as the ratio of the sums:

CI = Σ_i m_i / Σ_i s_i
This index is not identical for all tree topologies and its lower limit is not 0. Its complement, the
homoplasy index HI = 1 − CI, is a measure of the frequency of positions with parallel or reverse
mutations. The retention index

RI = (Σ_i g_i − Σ_i s_i) / (Σ_i g_i − Σ_i m_i),

where g_i is the maximal number of substitutions at position i in all possible trees, and the rescaled
consistency index

RC = CI × RI

are identical for all topologies and range from 0 to 1.
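As a quick numerical illustration of these formulas, here is a Python sketch with invented per-site
values for m_i, s_i and g_i (real programs extract them from the dataset and the tree under study):

m = [1, 1, 1, 2]   # m_i: minimal number of substitutions at position i on any tree
s = [1, 2, 1, 3]   # s_i: substitutions required on the tree under consideration
g = [3, 3, 2, 4]   # g_i: maximal number of substitutions at position i on any tree

CI = sum(m) / sum(s)                          # consistency index
HI = 1 - CI                                   # homoplasy index
RI = (sum(g) - sum(s)) / (sum(g) - sum(m))    # retention index
RC = CI * RI                                  # rescaled consistency index
print(round(CI, 2), round(HI, 2), round(RI, 2), round(RC, 2))   # 0.71 0.29 0.71 0.51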
Homoplasies may positively mislead phylogenetic analyses. A notorious example is the so-called “long
branch attraction” under MP. Imagine a tree with the following topology and branch lengths:
[Figure: an unrooted four-taxon tree in which A and C are sister taxa, as are B and D; the terminal
branches leading to A and B are much longer than all other branches.]
The taxa A and C are most closely related to each other, as are B and D. Simulation studies have shown
that in most cases MP would recover a topology in which A and B are most closely related. The reason
is the high number of mutations on the long branches that lead to A and B. Some of these substitutions
will be reverse and parallel mutations, creating the impression that A and B share many homologous
character states. Adding more data does not resolve the problem; instead, it makes it ever more certain
that we end up with the wrong topology in such a case. This is one of the reasons why model-based
methods such as ML and Bayesian methods have more or less replaced MP analyses.
Maximum likelihood and Bayesian posterior probability are statistical methods that try to answer
two different questions about data and hypotheses. A proponent of likelihood would ask:
“What does the experimental evidence (or the data) tell me about two competing hypotheses?”
A Bayesian statistician asks: “How does the available evidence change my confidence in a
certain hypothesis?”
(2) The law of likelihood states: “The data D favour hypothesis H1 over hypothesis H2 if and only
if the probability to observe the data is higher under the assumption of hypothesis H1 than under
H2.” A mathematical shorthand for the last part of this statement could look like this:

P(D | H1) > P(D | H2).
The probability of the data under the assumption of a certain hypothesis or model, P(D | H), is
called the likelihood of the model or hypothesis. This likelihood tells us nothing about the
probability of the hypothesis, as Sober (2008) illustrates with a nice example. Imagine somebody
sitting at home and hearing a noise from above. A superstitious person might believe that
gremlins are bowling in his attic. The likelihood of this hypothesis is very high, because if it were
true the probability of hearing some kind of noise from above would be very high, although the
hypothesis is of course complete nonsense. This is the reason why single likelihood values are
of no use when trying to evaluate different hypotheses. What is very useful is the so-called
likelihood ratio of two hypotheses, defined as:

P(D | H1) / P(D | H2).
This value is larger than 1 if hypothesis H1 has a higher likelihood than H2 and smaller than 1 in
the opposite case. By calculating this ratio one can find out which of two hypotheses is more
strongly supported by the data and how strong this support is. As we already know, systematists
have to decide between a very high number of competing hypotheses. Theoretically, this is no
problem: the topology with the highest likelihood is supported best and is to be preferred. The
computation of likelihood values for phylogenetic trees is only problematic in that it involves
many more arithmetic operations than an MP analysis.
To explain how the likelihood of a phylogenetic hypothesis is calculated, look at the following
figure.
[Figure: a rooted tree with four terminal nodes 1-4, two internal nodes 5 and 6 and the root node 0;
node 5 joins taxa 1 and 2, node 6 joins taxa 3 and 4, and the branches carry lengths l1 ... l6.]
It consists of four taxa 1, 2, 3 and 4 and two internal nodes 5 and 6. The taxa and nodes are
connected by branches of different lengths l1, l2 ... l6. A branch length corresponds to the product
of the substitution rate and the time between the origin of the branch and its next split, or simply:
the longer a branch, the higher the expected number of substitutions on it.
In order to compute the likelihood for a certain position k of the alignment we put random
nucleotides on nodes 0, 5 and 6. The nucleotides at the terminal nodes 1-4 are given by our
dataset. Remember that the likelihood is the probability of observing these nucleotides (the data)
given the tree (the hypothesis). We start by putting a random nucleotide (say a G) at node 0 and
call it x0. We must now calculate P_x0, the probability of observing this nucleotide there. At this
point, don't worry about how this is done; we are currently only assembling the formula for
calculating the likelihood. Next, we need the probabilities that this nucleotide mutates to x5 along
branch l5, written P_x0→x5(l5), and to x6 along branch l6, written P_x0→x6(l6). Writing these
probabilities as functions of the branch lengths reminds us that they depend on them: the longer
a branch, the more opportunities to mutate. If we extend this to the final branches leading to the
terminal nodes, we get the following long but not very complicated formula:

l_k = P_x0 · P_x0→x5(l5) · P_x5→x1(l1) · ... · P_x6→x4(l4)
Because we do not know the nucleotides at nodes 0, 5 and 6, we must calculate this quantity
for all possible combinations of nucleotides at these nodes and then add up the resulting values:

L_k = l_k1 + l_k2 + ...,

where the subscripts 1, 2 etc. denote the different combinations of nucleotides. This value L_k
(mind the capital L!) is called the likelihood function for position k of the alignment. Finally, this
calculation has to be done for every single position of the alignment. If we assume that all
positions evolve independently, the likelihood for the total dataset is simply the product of the
values for the single positions:

L = Π_k L_k.
Unfortunately, probabilities always take values smaller than 1, and the product of many factors
smaller than 1 quickly approaches zero. Therefore, statisticians and computer programs prefer
the logarithm of the likelihood: because the logarithm of a product is the sum of the logarithms
of its factors, the above product turns into

ln L = Σ_k ln L_k.
If we have the topology and branch lengths of a tree and a dataset we only need one more
thing: a formula which allows us to calculate Px0 and all the probabilities with which nucleotides
mutate along the branches. This formula is called a substitution model (see Box 5).
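The whole recipe fits into a few lines of code. The following Python sketch computes L_k for one
alignment position on the four-taxon tree above, assuming that node 5 joins taxa 1 and 2 and node
6 joins taxa 3 and 4; the branch lengths, the data and the use of the simple Jukes-Cantor
probabilities from Box 5 (with equal base frequencies, so P_x0 = ¼) are illustrative choices:

from itertools import product
from math import exp

NUCLEOTIDES = "ACGT"

def p(x, y, t):
    # Jukes-Cantor probability that x has turned into y after time t (see Box 5)
    if x == y:
        return exp(-t) + 0.25 * (1 - exp(-t))
    return 0.25 * (1 - exp(-t))

l = {1: 0.1, 2: 0.2, 3: 0.1, 4: 0.3, 5: 0.15, 6: 0.05}   # invented branch lengths l1..l6
tips = {1: "A", 2: "A", 3: "G", 4: "G"}                   # invented nucleotides at position k

L_k = 0.0
for x0, x5, x6 in product(NUCLEOTIDES, repeat=3):   # all combinations at nodes 0, 5 and 6
    term = 0.25                                     # P_x0 under equal base frequencies
    term *= p(x0, x5, l[5]) * p(x0, x6, l[6])       # root to the two internal nodes
    term *= p(x5, tips[1], l[1]) * p(x5, tips[2], l[2])   # node 5 to taxa 1 and 2
    term *= p(x6, tips[3], l[3]) * p(x6, tips[4], l[4])   # node 6 to taxa 3 and 4
    L_k += term
print(L_k)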
(3) Bayesian posterior probability
Bayesian statistics also tries to evaluate hypotheses with the help of data, but asks an entirely
different question. Bayesian statisticians are not primarily interested in comparing different
hypotheses; they want to know whether the data increase or decrease our confidence in a
certain hypothesis. Likelihood enables us to compare different hypotheses but allows no
statements about a single hypothesis or model. With Bayesian methods we can directly infer the
credibility of a single hypothesis.
Bayesian statistics relies on Bayes’ rule which is easily derived from the definition of conditional
probabilities. Let us look at a simple example to explain conditional probability. A pack of 52
cards consists of four types of cards: 13 spades, 13 clubs (both black), 13 diamonds and 13
hearts (both red). What is the probability of randomly drawing a card of hearts (H) under the
condition that it is a red card (R)? Obviously ½, because half of the red cards are hearts. A
more general way of calculating this probability (which is also applicable to more complicated
questions that are not so easy to answer) uses a slightly different line of argumentation. The
probability P(R & H) that the card is red and hearts at the same time is ¼; the probability that it
is red, P(R), our precondition, is ½. Dividing P(R & H) by the probability that our condition is true,
P(R), also gives ½. Hence the probability of an event (H) under the condition that event (R) has
happened equals the probability that both events happen divided by the probability of the
condition:
P(H | R) = P(H & R) / P(R)
This formula is more general because it allows us to exchange the two events:
P(R | H) = P(R & H) / P(H)
Because P(H & R) is the same as P(R & H) – the probability that both events happen – we can
now express P(H & R) in two different ways:
P(H & R) = P(H | R) · P(R) = P(R | H) · P(H).
If we only look at the last two parts of this equation and solve for P(H | R), we get
P(H | R) = P(R | H) · P(H) / P(R).
This is Bayes' rule, found by the reverend Thomas Bayes sometime in the 1740s, published
posthumously by Richard Price in 1763 and more or less unnoticed until Pierre-Simon Laplace
found it independently and published it once more in 1774. This theorem tells us how the
probability of a hypothesis P(H) changes when we gather new data (R). P(H) is the probability of
the hypothesis before we have seen the new data, the so-called prior probability. P(H | R) is
the probability of the same hypothesis after we have evaluated these data, the posterior
probability. In contrast to the likelihood this is a true probability and behaves like one: P(H) +
P(not H) = 1. The likelihood of the hypothesis given the new data, P(R | H), is also part of the
formula. Besides these rather harmless participants there is an unpleasant stranger, P(R), the
total probability of the data under all possible hypotheses.
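A quick numerical check of Bayes' rule with the card example above (H = "the card is hearts",
R = "the card is red") can be written in a few lines of Python:

P_H = 13 / 52        # prior probability of drawing hearts
P_R = 26 / 52        # probability of drawing a red card
P_R_given_H = 1.0    # every hearts card is red: the likelihood of H

P_H_given_R = P_R_given_H * P_H / P_R
print(P_H_given_R)   # 0.5, the posterior probability argued for above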
This total probability P(R), and to a lesser degree the prior probability P(H), can cause
tremendous difficulties. The probability of most hypotheses is impossible to guess without having
seen any data. And the total probability of the data under all hypotheses can only be calculated
for a relatively small number of possible hypotheses. There is no doubt that Bayes' rule is
mathematically sound, but because of these difficulties many statisticians remain unconvinced of
its practical applicability.
For phylogenetic reconstructions with zillions of trees the total probability of the data is
impossible to compute. One can, however, arrive at an approximate solution by studying a
sample of trees. Such a procedure resembles estimating the average income of citizens by
evaluating a representative sample of them. Considering how many different trees there are, the
vast majority will be entirely absurd, some will be rather unlikely and only a few will represent
the evolution of the group in a relatively reasonable way. A representative sample would
therefore contain trees not picked at random but strictly in proportion to their probability or
likelihood. One method to sample trees in this way was published by Ziheng Yang and Bruce
Rannala in 1997.
It circumvents the problem of the total probability of the data in an elegant way. When the
posterior probabilities of two trees are compared, this total probability simply cancels out, as is
evident from the following formula, in which H1 and H2 represent different trees (hypotheses)
and D the data:

P(H1 | D) / P(H2 | D) = [P(D | H1) · P(H1) / P(D)] / [P(D | H2) · P(H2) / P(D)]
                      = P(D | H1) · P(H1) / (P(D | H2) · P(H2))
The only minor problem now is the prior probability of the trees. Here a simplification can help
us: if we have not seen any data at all, we can assign all trees the same prior probability, a
so-called non-informative prior. This assumption is certainly not true, but as long as we have not
seen any data, we have no reason to believe that any tree is more probable than the others.
Under this somewhat idealized assumption the whole procedure reduces to a calculation of
likelihood ratios. Trees are then sampled in the following way.
A computer program calculates the likelihood of a randomly picked tree. Next, the program
changes a parameter of the phylogeny (a branch length, the position of a species on the tree
etc.) and calculates the likelihood of this new tree. If this likelihood is higher the program
"accepts" the tree, i.e. stores it as the new starting point for further calculations. If it is lower the
tree is only accepted with a probability that decreases with the likelihood difference between the
trees. If the difference is large, i.e. if the second tree is much worse than the first, the probability
of accepting it is very small; if the difference is small the program jumps to the new tree with a
higher probability. If a tree is not accepted, the program remains on its current tree. This
sequence (change of a parameter, likelihood calculation, comparison of likelihoods, jump to a
new tree) is repeated several million times. This way, the program jumps like a monkey from
tree to tree, preferably in an upward direction but every now and then also taking a step downhill.
The method is called a Markov chain Monte Carlo process (often abbreviated MCMC) and
ensures that trees are collected in proportion to their likelihood: the Markov chain gathers a
representative sample of trees. One of the great advantages of this method is that the probability
of every parameter of the phylogeny (a certain branch of the topology, a parameter of the
substitution model etc.) is equal to its frequency in the sample and can be extracted with little
effort. Branch lengths and many parameters of the substitution model are then not just point
estimates but can be given with standard deviations.
A few adjustments have to be made to ensure that the sampling process is efficient and really
representative. Because the chain starts with a random tree, the first trees are probably very
unlikely ones. In order not to bias the sample towards trees with low likelihood, the first 10,000 to
100,000 steps of the Markov chain are discarded. This is called the "burn-in" phase. It is also
important to ensure that the trees in the sample were gathered independently from each other.
Because only one parameter is changed at each step of the Markov chain, this is only
approximately the case if we sample not every tree but, for example, every 100th tree. Finally, if
the likelihood "landscape" is very rugged, with several "peaks" of high likelihood divided by deep
"valleys" of low likelihood, a single Markov chain might get stuck on one of the peaks. Therefore,
one usually runs several chains in parallel, starting from different initial trees. This increases the
chances of sampling the whole likelihood landscape evenly. A toy version of this sampling
scheme is sketched below.
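To make the procedure concrete, here is a toy Metropolis sampler in Python. Instead of trees it
walks over a one-dimensional, invented "likelihood landscape" with two peaks, but the
accept/reject logic, the burn-in and the thinning are the ones described above:

import math, random

def likelihood(x):
    # an invented two-peaked likelihood landscape
    return math.exp(-(x - 1) ** 2) + 0.5 * math.exp(-(x + 2) ** 2)

random.seed(1)
x = 0.0                                   # random starting point
sample = []
for step in range(100000):
    x_new = x + random.gauss(0, 0.5)      # change one "parameter"
    ratio = likelihood(x_new) / likelihood(x)
    if ratio >= 1 or random.random() < ratio:
        x = x_new                         # accept: jump to the new state
    # discard the first 10,000 steps as burn-in and keep only every 100th state
    if step >= 10000 and step % 100 == 0:
        sample.append(x)

# the states end up in the sample in proportion to their likelihood
print(len(sample))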
6. How to Calculate Phylogenetic Trees
After all these theoretical explanations and considerations, let us finally do it. The number of
different computer programs for phylogenetic analyses is so large that we will restrict this
course to a few basic ones that are commonly used or cited in the literature. Programs such as
PAUP, PHYLIP (URL: http://evolution.genetics.washington.edu/phylip.html) and MEGA (URL:
http://www.megasoftware.net/) can perform various analyses. But you will find loads of software
for specialized purposes, for example Bayesian analyses (MrBayes, URL:
http://mrbayes.sourceforge.net/) or ML computations (PHYML, URL:
https://github.com/stephaneguindon/phyml-downloads/releases; RAxMLGUI, URL:
http://sourceforge.net/projects/raxmlgui/). The programs above with a URL are freely available.
Most phylogenetic programs will run on Mac OS and Windows. The most commonly used input
formats for these programs (FASTA, PHYLIP and NEXUS format) will be explained during the
course.
Before we start our first phylogenetic analysis, we have to consider one more aspect: the difficult
task of finding the optimal tree among a number of possible trees that exceeds the number of
atoms in the universe. It is obvious that even the fastest computers cannot solve this problem in
reasonable time. Instead of picking a limited number of trees randomly and comparing them,
smarter search options use simple strategies to reach a preliminary result and then try to
optimize this solution. Such heuristic search strategies are explained in the following box.
Box 4: Heuristic search
A heuristic search for an optimal tree is a two-step procedure. In the first step an initial tree is
constructed using a simple and fast method. A commonly used algorithm is stepwise addition: three taxa
are chosen randomly from the data set and combined into an unrooted tree (how many unrooted trees
are there for 3 taxa?). Step by step, more taxa are added to this tree by joining them to one of the
branches of the existing tree, in each case so that the resulting tree is the most parsimonious one or the
one with the highest likelihood. Why is this tree not the overall best one? Simply because the order in
which the taxa were added has an impact on the final result. And there are just about as many different
orders in which you can arrange the taxa as there are different phylogenetic trees for them (for 50 taxa
there are 3 × 10^64 ways). It wouldn't help to try them all.
The second step starts when all taxa have been added to the initial tree. Now the program starts to
exchange branches on the tree and checks whether this improves the tree. There are different strategies
for this so-called branch swapping.
Nearest Neighbour Interchange (NNI) removes an internal branch of the tree and exchanges the four
resulting groups into all three possible arrangements before the branch is tied again. This is done for all
internal branches of the tree.
Subtree Pruning and Regrafting (SPR) cuts off a branch (like a gardener pruning a tree) and joins it to
all possible branches of the remaining tree. Again, this is done for all branches of the tree.
Tree Bisection and Reconnection (TBR) cuts the tree into two and reconnects the two subtrees at all
possible positions. This is repeated for all possible cutting points. The number of investigated trees is
TBR > SPR > NNI.
In order to further increase the number of trees examined, the search is usually repeated several times,
each time adding the taxa to the initial tree in a different order. Although only a small fraction of all trees
is compared, this search algorithm works very well in most cases. The sketch below gives an impression
of the numbers involved.
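The numbers mentioned in this box are easy to verify. The number of unrooted binary trees for n
taxa is the double factorial (2n−5)!! = 3 × 5 × 7 × ... × (2n−5), and the number of addition orders
grows like n!:

import math

def unrooted_trees(n):
    result = 1
    for k in range(3, 2 * n - 4, 2):   # the odd numbers 3, 5, ..., 2n-5
        result *= k
    return result

print(unrooted_trees(3))    # 1 - the answer to the question above
print(unrooted_trees(10))   # 2027025, already over two million
print(unrooted_trees(50))   # about 2.8e74
print(math.factorial(50))   # about 3e64 addition orders, as stated above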
Maximum Parsimony and the NEXUS File Format
We will calculate the first phylogenetic trees using the Windows version of PAUP and maximum
parsimony, the default option in PAUP. The program can be operated either through a command
line or through command blocks that are appended to the data. PAUP is a versatile program with
lots of commands; we will therefore not learn more than a few basic functions.
1. Open PAUP. A dialogue box will open automatically. Load the provided alignment in
NEXUS format by marking it and pressing “Open/Execute”. If the file does not contain
typos or other mistakes (e. g. blank spaces, full stops or hyphens in taxon names), this
should prompt no error messages.
2. In the window above the command line you will see some information on your data set.
>Window>[name of the file] opens a second window with the actual data file. You can
toggle between these windows.
3. We will first try out a simple heuristic search for an optimal tree using the command line:
hsearch addseq=asis; <Enter>.
4. hsearch initiates the heuristic search, addseq=asis tells the program to add the taxa
to the initial tree (stepwise addition, see Box 4) in the order in which they appear in the
dataset.
5. The output tells you how many characters the dataset contains, how many of them are
parsimony-informative, and which branch swapping algorithm was used. The default in
PAUP is TBR. Under “score” in the table you can see how many evolutionary steps (substitutions) the best tree requires.
6. In order to view the tree, type:
describetrees 1/ brlens=yes; <Enter>
Now the program displays tree no. 1 and prints a table with branch lengths (brlens). In
addition different homoplasy indices are displayed.
7. In order to increase the number of studied trees and the probability of finding the optimal
tree, we will now increase the number of cycles in the heuristic search. This only makes
sense if we tell the program to randomly add taxa during the first step of each analysis.
hsearch addseq=random swap=NNI nreps=100; <Enter>
The heuristic search will now be repeated 100 times adding the sequences randomly.
This time branches will be swapped using nearest neighbor interchange.
8. A very convenient option of PAUP is that it remembers recently used commands. In order
to display the first tree click on the arrow to the right of the command line. You will see a
list of the previously used commands. Choose the correct one and press <Enter>.
9. PAUP produces unrooted trees. Let us now root the tree with an outgroup. First we have
to define the outgroup. Assuming you use the provided dataset the command for this is:
outgroup Podocarpus Ginkgo Metasequoia Cycas; <Enter>
If you use your own dataset you will probably know which taxon or taxa to choose for an
outgroup.
10.To root the tree with this outgroup, type:
describetrees 1/ root=outgroup outroot=monophyl; <Enter>
Now the outgroup should appear as a monophyletic group outside the ingroup taxa.
11. In order to save the tree or trees with branch lengths, type:
savetrees file=[file name] brlens=yes; <Enter>
You have now saved the first tree as a result of the analysis. As you may have noticed,
displaying trees is not PAUP's strongest point. In order to edit the tree for publication or before
printing it, we will now open it in FigTree.
1. Load the tree file by >File>Open.
2. If the branches are labelled (for example by statistical support values), you will be asked
to provide a name for these labels. This will not be the case now.
3. The tree appears in the window. The menu on the left side allows you to alter the
appearance of the tree. Play around a bit with the layout.
4. Above the window you see a graphical menu, with which you can change the position
and order of clades. To re-root the tree with a different group mark the branch and
choose >Reroot.
5. You can also rotate clades around nodes (remember that this does not alter the topology
of the tree). Choose a branch and >Rotate. >Highlight marks a whole clade, >Colour
changes the colour of branches. Play around with the different options. Also change from
“Node” to “Clade” or “Taxa” above “Selection Mode” to find out about these options.
6. If you want to display statistical support values for branches (see below) mark the box
“Node Labels” or “Branch Labels” on the left and open the menu by clicking on the arrow.
7. Choose the appropriate label from the drop-down menu “Display”. At this point we have
not yet investigated the statistical support for branches. But remember this function for
later use.
8. Once the tree suits your taste you can export it into a graphical format by >File>Export
Graphic... and choosing the appropriate format and the folder in which you want to store
the file.
Choosing a Substitution Model for Likelihood Analyses
Likelihood analyses are not only more difficult to understand than parsimony; they are also a bit
more cumbersome to perform because, in order to calculate the likelihood of a tree, we need to
choose a substitution model. This is a set of formulas that allows us to calculate how probable it
is that a certain nucleotide is substituted by another along a branch of defined length. Before we
go into the practical details, Box 5 explains substitution models in general.
Box 5: Substitution models
Two sequences that derive from a common ancestor will gradually accumulate substitutions. Because
substitutions are rare events that happen randomly we would expect that the number of observed differences between two sequences grows slowly and more or less linearly with time (black line in the illustration below).
[Figure: genetic divergence plotted against time; the black line shows the actual number of
substitutions growing linearly, the red line the smaller number of observed differences.]
However, the observed differences between sequences do not accurately reflect the substitutions that
have happened. A certain number of substitutions will affect nucleotides that have already mutated
before. Instead of two substitution events we will then only observe one or, if the mutation goes back to
the original nucleotide at this position, none at all. Usually, the number of observed differences between
two sequences (red line) will therefore be smaller than the actual number of substitution events. This
effect is more pronounced at rapidly evolving than at slowly evolving gene loci and is particularly
problematic for MP, where differences among sequences are simply counted, which misrepresents the
actual number of substitutions. Substitution models try to model the whole process, including parallel
and reverse mutations.
The simplest substitution model (the Jukes-Cantor model) assumes that all possible substitutions occur
with equal probability. This model is relatively easy to describe. Substitutions are random events. We do
not know exactly when the next substitution will hit a gene, but we know that eventually a nucleotide will
be substituted by another. It is like waiting for a notoriously unpunctual bus: as you wait, the probability
that the bus has still not arrived grows smaller and smaller until the damn vehicle finally shows up. If we
call the waiting time T and the time that has passed since you arrived at the bus stop t, we can say: the
probability that the waiting time for the bus (or the next substitution) is longer than the time that has
already passed gets smaller and smaller as time passes. This probability is exponentially distributed:

P(T > t) = e^(−t).
If this is the probability that nothing has happened (no bus, no substitution), then the probability that
“something” has happened is 1 − e^(−t). There are four different nucleotides (A, C, G, T). If all
substitutions are equally probable, the probability of any single substitution is ¼ (1 − e^(−t)). This is
correct if, for example, a G is replaced by an A, C or T. But how big is the probability that after a certain
time the G is still a G? There are two possibilities for this case: either nothing has happened, with
probability e^(−t), or the G was replaced by the same nucleotide, with probability ¼ (1 − e^(−t)). The total
probability for this event is therefore e^(−t) + ¼ (1 − e^(−t)). If we generalize this for all nucleotides, we
can describe the probability that, after time t, nucleotide x has been replaced by nucleotide y, in the
following way:
P_x→y(t) = { ¼ (1 − e^(−t))             if x ≠ y
           { e^(−t) + ¼ (1 − e^(−t))    if x = y
The curly bracket here means “either – or”. This is the mathematical abbreviation of the Jukes-Cantor
model. Actually, most gene loci do not follow this simple substitution model. For example, due to their
different size and number of hydrogen bonds, purines (G and A) and pyrimidines (C and T) are more
often substituted by another nucleotide of the same group (transition) than by one of the other group
(transversion). Moreover, nucleotides usually occur at unequal frequencies, which favours certain
substitutions over others. All these facts are considered in more complicated models. The following four
matrices show commonly used substitution models. The left column shows the nucleotide before, the
upper row the nucleotide after substitution. The Greek letters in the center denote the different
substitution rates, the symbols g_A, g_T etc. the frequencies of the respective nucleotides.
Jukes-Cantor:
      A    T    C    G
  A   -    α    α    α
  T   α    -    α    α
  C   α    α    -    α
  G   α    α    α    -

Kimura:
      A    T    C    G
  A   -    β    β    α
  T   β    -    α    β
  C   β    α    -    β
  G   α    β    β    -

HKY:
      A       T       C       G
  A   -       βg_T    βg_C    αg_G
  T   βg_A    -       αg_C    βg_G
  C   βg_A    αg_T    -       βg_G
  G   αg_A    βg_T    βg_C    -

GTR:
      A       T       C       G
  A   -       αg_T    βg_C    γg_G
  T   αg_A    -       δg_C    εg_G
  C   βg_A    δg_T    -       ζg_G
  G   γg_A    εg_T    ζg_C    -
These models have different numbers of unknown parameters. The Jukes-Cantor model has only one
parameter (α); the general time reversible model has nine, because it assigns a separate rate to each of
the six kinds of substitutions and, in addition, estimates three free nucleotide frequencies. The more
parameters a model has, the more difficult it becomes to compute the likelihood.
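The Jukes-Cantor formula above also shows why observed differences saturate (the red line in
the figure at the beginning of this box): the probability that a position looks different after time t
approaches 75 % and never reaches 1. A short Python sketch:

from math import exp

def p_jc(x, y, t):
    # Jukes-Cantor probability that nucleotide x has turned into y after time t
    if x == y:
        return exp(-t) + 0.25 * (1 - exp(-t))
    return 0.25 * (1 - exp(-t))

for t in (0.1, 0.5, 1.0, 5.0, 50.0):
    p_diff = 3 * p_jc("G", "A", t)   # three ways for the position to look different
    print(t, round(p_diff, 3))       # approaches 0.75 for large t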
Position specific substitution rates
All these models assume that the substitution rates are identical for all positions of a DNA sequence.
This is often not true. Changes in the amino acid sequence of the active centre of an enzyme are much
rarer than substitutions in the side chains of a protein because they usually affect the function of the
protein. First, second and third positions of a codon also have different substitution rates, because many
mutations at third positions do not affect the amino acid sequence. Hence, the substitution rates may
vary along a gene sequence. This variation can be accommodated in models by assigning different
overall rates to different positions.
In principle one could assign a separate rate to each position. Unfortunately, the number of parameters
in a model cannot be higher than the number of data points in the dataset. You can easily understand
this if you think of a regression line through a scatter plot of data points. The regression line is a model
that simplifies the information contained in the data and is defined by two parameters, the gradient and
the intercept with one of the axes. If you only have one data point there is no unique regression line
representing the data. We must therefore use another method to include position-specific rates in a
substitution model. Instead of assigning a separate rate to each position we define a set of "substitution
rate classes", each with its own rate. The default is often 4, but any number of classes can be defined.
How much these rates differ from each other is defined by a certain probability distribution, the gamma
distribution. Look at the following graph.

[Figure: density curve of a gamma distribution with α = 2, divided into four classes of equal area; the
medians of the classes define the four rates.]
On the x-axis we have the substitution rates, low rates on the left close to the y-axis, higher rates on the
right. The area below the curve is divided into 4 (or whatever number you prefer) parts so that each part
covers the same area below the curve. The four different rates are then determined as the median of
each part, in this case 0.32, 0.68, 1.11 and 1.88. The nice thing about this procedure is that the gamma
distribution is defined by a single parameter α and can take very different shapes. In the figure above
α = 2. If α ≥ 1 all substitution rates are relatively close to 1; for α < 1 they differ more strongly. The rates
for α = 0.5 are, for example, 0.03, 0.28, 0.92 and 2.77. Once the different rates are defined, the likelihood
for each position (given a tree with branch lengths) is calculated as the mean likelihood value over all
four rates. Such a model can represent the actual substitution process much better, because the high
rates contribute a larger proportion of the overall likelihood for rapidly evolving positions and the low
rates for slowly evolving positions.
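The four rates quoted above can be reproduced with a few lines of Python. The sketch assumes
the common recipe of taking the median of each of the k equal-probability slices of a gamma
distribution with mean 1 (shape α, scale 1/α) and rescaling the medians so that their mean is 1;
with SciPy installed it reproduces the numbers in the text:

from scipy.stats import gamma

def category_rates(alpha, k=4):
    # the medians of the k slices lie at the quantiles 1/(2k), 3/(2k), ...
    quantiles = [(2 * i + 1) / (2.0 * k) for i in range(k)]
    medians = [gamma.ppf(q, a=alpha, scale=1.0 / alpha) for q in quantiles]
    mean = sum(medians) / k
    return [m / mean for m in medians]   # rescale so that the mean rate is 1

print([round(r, 2) for r in category_rates(2.0)])   # close to 0.32, 0.68, 1.11, 1.88
print([round(r, 2) for r in category_rates(0.5)])   # 0.03, 0.28, 0.92, 2.77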
Now that we know what a substitution model is and how it reflects the substitution process, how
can we select a model among the many different ones? Mathematical models do not perfectly
represent but only approximate natural processes. An ideal model would allow us to extract the
essential information from complex datasets. As outlined above, a scatter plot in a coordinate
system may be reduced to a regression line a + bx that tells us how pressure depends upon
temperature in an autoclave. In such a model, a large dataset is reduced to two parameters a
and b.
The law of likelihood offers one way to choose among models. If you calculate the likelihood of
a tree based on the dataset under different substitution models, you can pick the preferred model
by comparing the likelihood values for the different models, just as you choose the best tree by
comparing likelihood values for different trees:

P(D | M1) / P(D | M2) (see page 10)
The problem is that one can always add further parameters to a model that increase the
agreement between the model and the data. The more complex a model gets, the more closely it
reflects the actual dataset and the higher its likelihood. By just comparing likelihood ratios we
would therefore always end up with the most complex model. Because all datasets contain noise
in addition to information, complex models also tend to capture a higher proportion of that noise.
This increases the chances that a model misrepresents the natural relationships and will not fit
future datasets very well. In the regression example above, instead of a simple line we could
describe a curve that touches every point of the scatter plot. This model would fit the data
perfectly but probably does not correctly describe the relationship between temperature and
pressure in the autoclave. Therefore we need to avoid overly complicated models.
Introducing additional parameters into a very simple model will greatly increase its likelihood,
but an additional parameter in an already very complex model will make only a small difference.
In 1974, Hirotugu Akaike introduced a method that makes use of this observation by offsetting
the number of parameters against the likelihood value of a model. The formula looks like this:

AIC = 2k − 2 log L(M)

where k is the number of parameters and log L(M) the logarithm of the likelihood of a certain
model. The more parameters you include, the larger the values of both 2k and log L(M). As long
as the increase in likelihood exceeds the increase in 2k, the AIC value gets smaller. Once the
model has reached a certain degree of complexity, the increase in likelihood will no longer
compensate for the increase in 2k, because additional parameters add only little to the likelihood
value. From then on the AIC value increases again. Put simply, the model with the lowest AIC is
the one that explains the data best with a minimum number of parameters. There are a number
of freely available programs that calculate the AIC or similar quantities (e. g. jModeltest, MtGUI).
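As a sketch of how this plays out, the following Python snippet compares four models with
invented parameter counts and log-likelihoods (programs such as jModeltest compute these from
the data); the parameter counts follow the counting used above, with +G and +I adding one
parameter each:

models = {
    # name: (number of parameters k, log-likelihood log L(M))
    "JC":      (1, -5230.4),
    "HKY":     (5, -5105.9),
    "GTR":     (9, -5101.2),
    "GTR+G+I": (11, -5060.7),
}

for name, (k, lnL) in models.items():
    aic = 2 * k - 2 * lnL
    print(name, round(aic, 1))

best = min(models, key=lambda name: 2 * models[name][0] - 2 * models[name][1])
print("preferred model:", best)   # here: GTR+G+I, the lowest AIC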
As we have already pointed out, different genes and even parts of a gene often follow different
substitution models. In many protein-coding genes a quick look at the alignment already tells
you that first, second and third codon positions have different rates. For multigene datasets, this
suggests using a separate model for each individual gene and each codon position. For datasets
with many genes, however, this would lead to an unnecessarily complex model if some genes
within the dataset follow the same model. Therefore one should always test whether different
genes follow the same model and whether they are better combined into a single data partition
with a single model. The following figure from the user manual of PartitionFinder (Lanfear et al.
2012) shows different options to partition a dataset of three coding genes.
[Figure from the PartitionFinder manual: alternative partitioning schemes for a dataset of three
protein-coding genes.]
In order to find the optimal model, one must consider all possible combinations of models and
partitioning schemes. The program PartitionFinder does this automatically. It needs a dataset in
PHYLIP format and a configuration file, in which the most complicated partitioning scheme to be
considered is defined by the user. In the case of the provided dataset this would include three
codon positions for every gene in the dataset (atp1, matK, rbcL). The configuration file is called
“partition_finder.cfg” and can be found in the folder “PartitionFinder”.
1. Copy the alignment in PHYLIP format into the folder “analyses/PartitionFinder”.
2. Open the file “partition_finder.cfg” in a text editor.
3. Add the name of your PHYLIP file behind alignment = .
4. Below ## DATA BLOCK ## you already find the nine possible partitions. You need to
define which parts of your alignment belong to which partition. The notation 1-xxxx\3
means that every 3rd position between 1 and xxxx belongs to one partition (see the
sketch of a configuration file after this list). Why does the next partition begin with a 2
(2-xxxx\3)?
5. Save the file without changing its name after you have entered all the information.
6. PartitionFinder is written in the programming language Python. In order to start it, open a
command prompt and type the command python followed by a space.
7. Open the folder “Programm/PartitionFinder”, drag and drop the file “PartitionFinder.py” into
the command prompt and type another space.
8. Now drag and drop the folder “Analyses/PartitionFinder” into the command prompt and
press <Enter>. If there are no errors in the cfg-file the program will start computing the
optimal combination of models and partitions. The progress is output on the screen.
9. After the analysis has finished you will find a new folder “analysis” and a file “log.txt”
in the folder “Analyses/PartitionFinder”.
10. The logfile documents the screen output.
11. The folder “analysis” contains the results of the analysis. The most important file for us is
“best_schemes.txt” which describes the optimal partitioning scheme. Open the file in
Word (output is difficult to read in some text editors).
12. Following some rather irrelevant lines with numbers you find the description of the
scheme. The partitions are numbered. For each you can see the optimal model, the
name of the gene loci (e. g. matK_1) and their position in the alignment. Study the table
carefully! Is there anything peculiar?
13. Below the table a short explanation tells you which of the original partitions (the ones
that you suggested) follow the same model.
14. At the end of the file there is a text block which we can later use in the ML analysis. Here,
the partitioning scheme is described so that we can just copy-paste it into the program
RAxML.
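To give an impression of what the edited configuration file might look like, here is a sketch for a
three-gene dataset. The file name and all end positions (1200, 2700, 4000) are invented
placeholders, and the exact headings may differ slightly between PartitionFinder versions:

## ALIGNMENT FILE ##
alignment = threegenes.phy;

## DATA BLOCK ##
[data_blocks]
atp1_1 = 1-1200\3;
atp1_2 = 2-1200\3;
atp1_3 = 3-1200\3;
matK_1 = 1201-2700\3;
matK_2 = 1202-2700\3;
matK_3 = 1203-2700\3;
rbcL_1 = 2701-4000\3;
rbcL_2 = 2702-4000\3;
rbcL_3 = 2703-4000\3;

Each gene contributes three partitions, one per codon position, and the second partition of each
gene starts one position after the first: that is the answer to the question in step 4.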
Writing a PAUP Block
In our first attempt to calculate a phylogenetic tree we used the command line to tell PAUP what
to do. A much more convenient way is to write all necessary commands into a text block that is
appended to the data block in the NEXUS file. The NEXUS format allows us to include a number
of different blocks, for example an ASSUMPTIONS block, a TREES block with previously
computed trees or a PAUP block with details on the analysis. This is very convenient when we
wish to run different analyses successively or just want to save waiting time: instead of waiting
for results for hours and then typing in the next commands, we can feed PAUP with all necessary
commands on Friday and pick up the results of the completed analysis on Monday. We start with
a simple example, an ML analysis for which we have to define a substitution model.
1. The data block in a NEXUS file ends with end; Insert two returns. The command block
always starts with the following line:
begin PAUP;
2. Because the default optimality criterion in PAUP is MP we have to tell the program that
we now want to perform an ML analysis. We also want to tell the program to finish the
analysis independently so that we do not have to press the “close” button. Write the following into a new line below the previous command:
set autoclose=yes criterion=likelihood;
3. Next, we have to define the substitution model. PAUP does not accept multiple models
for a single dataset. The following parameters of the substitution model are taken from an
analysis of the dataset with the program MrModeltest 2.2.
lset base=(0.3445 0.1718 0.1697) nst=6 rmat=(0.8683 2.6043 0.2708
0.9152 2.5678) rates=gamma shape=0.9213 pinvar=0.3175;
The command lset initiates the definition of model parameters for an ML analysis.
base=(x y z) defines the frequency of nucleotides A, C and G. Why is the frequency
of T not defined?
The x in nst=x refers to the number of different substitution types in the model (1, 2 or 6
depending on the choice of model, see Box 5).
rmat=(x y z …) means the relative rate of the different substitution types (A→C,
A→G, etc.). The rate for G→T is set to 1.
rates=[equal, gamma]: Substitution rates are either identical for every position of the
alignment or variable among sites.
shape=x defines the shape of the Gamma function that is used to model site specific
substitution rates (see Box 5).
pinvar=x: most positions of an alignment are usually identical across species and hence
appear to be invariable. The value x defines this proportion of invariable sites in the
dataset.
4. To start the heuristic search for the best tree we include the following line after the line
starting with lset:
hsearch addseq=random swap=SPR nreps=10;
5. The next line contains the command, already familiar to you, to show the best tree with
branch lengths. As an exercise, write it yourself.
6. Finally, let PAUP save the tree with branch lengths under a file name of your own choice.
7. Save and close the NEXUS file.
8. Load the file once more. If everything is ok, the analysis will start without any further
prompts.
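Putting steps 1 to 6 together, the complete command block at the end of the NEXUS file might
look like the following sketch (the line for step 5 is left as a NEXUS comment for you to fill in;
replace [file name] with a name of your choice):

begin PAUP;
set autoclose=yes criterion=likelihood;
lset base=(0.3445 0.1718 0.1697) nst=6 rmat=(0.8683 2.6043 0.2708 0.9152 2.5678)
rates=gamma shape=0.9213 pinvar=0.3175;
hsearch addseq=random swap=SPR nreps=10;
[insert the describetrees command from step 5 here]
savetrees file=[file name] brlens=yes;
end;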
Phylogenetic Uncertainty, the Non-Parametric Bootstrap
The results of different phylogenetic analyses of the same dataset (e. g. MP and ML analyses)
often differ from each other, and MP analyses commonly yield several equally parsimonious
trees. These differences may concern important details, for example the question whether a
taxon is monophyletic or not. How much confidence can we have in our results? In other words,
how far is our hypothesis (the tree) from the truth (the evolution as it actually happened)? This is
an important question in all scientific studies. Physicists will usually repeat measurements and
report mean values and standard deviations to estimate the deviation of a single measurement
from the true value. But evolution cannot be repeated experimentally. Sequencing many genes
and comparing the resulting trees might help, but the information contained in a single gene is
often too small to allow meaningful interpretations. That is the reason why most studies
nowadays rely on data from multiple genes.
When the deviation of a measurement from the true value cannot be inferred because the
measurement cannot be repeated, one can instead treat the measurement as the "truth" and
repeat the analysis by re-sampling the dataset. The deviation of the re-sampled results from the
original result then serves as an approximation of the true deviation. Referring to the proverbial
expression "to pull yourself up by your own bootstraps" (a basically impossible task), statisticians
call this method "bootstrapping", which at least proves that they have a sense of humour.
In phylogenetic analysis you re-sample the dataset several hundred to 1,000 times with
replacement: by randomly picking single alignment positions one produces datasets that are
identical in size to the original one (see the sketch below). Random sampling with replacement
ensures that the new datasets all differ from each other, because some positions will be picked
more than once while others will be left out. In a second step one calculates a phylogenetic tree
for each subsample and a single consensus tree, which summarizes the information contained in
the bootstrapped trees. Every branch that was found in more than 50 % of the bootstrapped trees
can be unambiguously displayed in the consensus tree. Finally, one determines in how many of
the bootstrapped trees each branch was found. This "bootstrap support" (in percent) or similar
measures of support are a necessary part of all decent phylogenetic trees. Simulation studies (in
which you actually know the "truth") show that bootstrap values tend to underestimate the
confidence that we can have in a branch. Although 95 % confidence is a frequently used
threshold value in other fields of statistics, it has become customary to discuss branches with
support values above 70 %. Less strongly supported branches are considered too unreliable to
merit further discussion.
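The re-sampling step itself is simple. A minimal Python sketch, using the small alignment from
Box 2 (tree building for each pseudo-dataset is left out):

import random

alignment = {"A": "AAACCTTGG", "B": "AATCGTTGG",
             "C": "AATCGTTCG", "D": "AAACCTTCG"}

def bootstrap(alignment):
    n = len(next(iter(alignment.values())))
    columns = [random.randrange(n) for _ in range(n)]   # sample positions with replacement
    return {taxon: "".join(seq[i] for i in columns)
            for taxon, seq in alignment.items()}

random.seed(42)
for replicate in range(3):   # real analyses use several hundred to 1,000 replicates
    pseudo = bootstrap(alignment)
    # ... here one would build a tree from pseudo and record its branches
    print(pseudo["A"], pseudo["B"], pseudo["C"], pseudo["D"])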
Bootstrap values are a measure of precision rather than accuracy. High precision means that a
measurement renders very similar results every time it is repeated. High accuracy, on the other
hand, means that these measurements are close to the true value. When I step on my bathroom
scales ten times and they always show the same weight, they are very precise. Nevertheless the
scales can be poorly calibrated and under- or overestimate my true weight considerably.
Precision alone is not really sufficient to convey confidence in a measurement. Bootstrap values
are very similar in this respect. A node with high bootstrap support can be reliably reconstructed
from the data, but a bootstrap support of 85 % does not mean that there is an 85 % chance for
this branch to occur in the true evolutionary tree. We will carry out the first bootstrap analysis
using MP.
1. Open PAUP and load the NEXUS file.
2. Should the program start to compute trees, stop it and change into the window that
displays the NEXUS file. You can deactivate the PAUP block at the end by putting it in
square brackets. Save the file and execute it once more.
3. Type bootstrap nreps=500 search=heuristic; into the command line and
press <Enter>.
This will start a bootstrap analysis with 500 pseudoreplications (the smallest number
considered to give reliable results) and a heuristic search for the best tree in every replicate.
4. After the analysis is finished the display shows the consensus tree and a table with all
groups that were found in at least one bootstrap tree. The taxa are abbreviated as numbers above the columns. The respective groupings are marked by stars. The rightmost
column shows the frequency (in percent) with which the group occurred among the bootstrap trees. Groups appearing in less than 5 % of the trees are not listed. Compare the
list with the tree.
5. In order to save the consensus tree type the following commands:
savetrees from=1 to=1 file=[File name] savebootp=nodelabels; <Enter>
PAUP saves the bootstrap support as a label of each node. The command from=1
to=1 is necessary to circumvent a bug in PAUP. Without this command the file is empty.
The tree file can be opened and edited in FigTree.
Bootstrap Analyses Using Maximum Likelihood
As you have probably seen by now, ML analyses and bootstrap analyses under MP can take a
very long time with PAUP. Running an ML bootstrap analysis on a dataset like ours would soon
exhaust your patience. Different methods have been developed to speed up the computation of
ML trees. One of these programs is RAxML (Randomized Axelerated Maximum Likelihood). In
contrast to PAUP it is able to consider separate models for different dataset partitions, although
it only offers the same model form for all partitions: the difference from a single model for the
whole dataset is that the parameter values of the model are estimated independently for each
partition. RAxML calculates bootstrap support values and an ML tree separately and displays the
bootstrap values on the ML tree, not on a consensus tree. RAxML needs a dataset in PHYLIP
format.
1. Copy the alignment (PHYLIP format) into a new folder, in which you also want to save
the results of your RAxML analysis.
2. Start the program RAxMLGUI by double-clicking on the file “raxmlGUI.py”.
3. Load the alignment by >Alignment file.
4. A window with the heading “Open alignment file” will open. Navigate to the folder with
your alignment, highlight it and >Open.
5. If the alignment contains identical sequences or positions that only consist of gaps a
warning will appear. You may delete these “double” sequences and gap positions by
choosing >yes. In our case we want to use all sequences and press >no. RAxMLGUI
recognizes whether the dataset consists of DNA or protein sequence data.
6. Your PHYLIP file will appear in the display window.
7. >Options>Set/Edit Partitions... opens a new window “Set partitions”. Here you can describe the different partitions of your data set.
8. RAxMLGUI can process many kinds of data. You can specify your data by pressing
>DNA in the upper left corner. “BIN” means binary data (0,1), “MULTI” multistate characters (e. g. blue, red, green), and “AA” amino acid sequences. Multistate characters may
have up to 32 possible character states (0–9, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P,
Q, R, S, T, U, V). Do not change any of the settings.
9. You can either define the limits of the data partitions “manually” or copy-paste the definition from PartitionFinder. Choose Options>Edit partitions.
10. Copy the last few lines from the file “best_schemes.txt” into the display of RAxMLGUI
and correct line breaks if necessary.
11. >Set will save the partitions.
12. [If you want to enter partitions manually follow these instructions: The start of a partition
cannot be changed. It is either 1 or follows from the end of the previous partition. Enter
the last position of the first partition into the upper right field.
13. >no codon means that all positions with these limits follow the same model. Alternatively, for protein coding sequences, you may choose >codon specific if you want to use
separate models for 1st, 2nd and 3rd positions or >3rd codon if 1st and 2nd positions
together shall follow a model that differs from that for 3rd positions.
14. Press >Add. The larger window will now show a line describing the first partition as you
defined it. Repeat steps 12-14 until all positions of the dataset are assigned to a gene
partition.
15. A message appears: “All characters have been assigned to partitions. Set partitions for
the analysis?” If you do not want to change anything press >OK, otherwise >Abbrechen
(cancel).]
16. After all partitions have been defined, a new option called “per-partition brL” will appear
below “Run RAxML”. Selecting this option will prompt the program to calculate branch
lengths for each partition. The topologies of these trees will be identical, but the branch
lengths will be optimized for each partition separately.
17. In the lower menu bar you see the command >ML + rapid bootstrap. Here you can
choose between a simple ML analysis or two different types of bootstrap analyses, “thorough” or “rapid”. Do not change the settings.
18. The number of bootstrap replicates may also be adjusted. If it is shortly before lunchtime
choose 1000, otherwise leave it at 100.
19. Finally you may choose among different substitution models. For the sake of speed,
RAxMLGUI offers a limited choice of alternatives. The GTRGAMMA model includes site
specific rates, GTRGAMMAI a proportion of invariable sites in addition. The models
GTRCAT and GTRCATI further speed up calculations but are relatively imprecise. This
should particularly affect branch lengths.
20. >Run RAxML starts the analyses and opens a command window in which the progress
of the analysis is reported.
21. RAxML produces a number of output files. The most important file
“RAxML_bipartitions..tre” shows the ML tree with branch lengths and bootstrap support
values. If you have chosen >per-partition brL before you will also get a separate tree
with branch lengths for each partition. These partitions have the endings “PARTITION.0”,
“PARTITION.1”, etc.
22. To finish your analysis, edit the tree in FigTree and save it as graphic.
Bayesian Analysis
For a number of reasons, the Bayesian method has become very popular among systematists.
- Although millions of trees are calculated, the method is very fast. This is mainly due to the fact
  that you can skip bootstrap analyses: the Bayesian analysis computes the probability of the tree
  and all parameters of the analysis simultaneously.
- From the tree sample you can also compute confidence limits and standard deviations for all
  parameters of the model (i.e. the tree topology with branch lengths and the substitution model).
- All methods that only search for a single optimal tree may fail under certain conditions. We
  have already seen the example of long branch attraction under MP. Using an inappropriate
  substitution model may also cause serious problems in ML. Bayesian methods are not free
  from these error sources, but instead of focussing on a single tree topology, we may rather
  consider the sample of trees as a whole and take phylogenetic uncertainty into account
  when interpreting the trees.
- For example, the consensus tree may show a certain taxon to be polyphyletic, but how high
  are the chances, and can we completely dismiss the idea that it is actually monophyletic?
  Bayesian tree samples offer good possibilities to test competing phylogenetic hypotheses
  against each other.
The main point of criticism against Bayesian analyses is that we need to define prior
expectations for the model parameters. For example, a grossly wrong mean value for the prior
distribution of branch lengths may lead to absurdly long consensus trees when compared with
ML trees. The solution in this case is to choose the mean branch length from an ML analysis as
the mean of the prior distribution in the Bayesian analysis. In the following, we will perform a
simple Bayesian analysis, which also allows us to apply our skills in writing command blocks.
1. Copy the NEXUS file into the folder in which MrBayes is stored.
2. Open the NEXUS file in a text editor or in PAUP and change to the file view window
(after stopping automatic calculations if necessary). Put the PAUP block from the
previous analysis into square brackets.
3. If anything is written between “#NEXUS” and “begin data;” delete this text. The first lines
of the file shall now look like this:
#NEXUS
begin data;
4. Make sure that the next line looks like this:
format datatype=dna interleave missing=N gap=-;
5. Move to the end of the file and add another line: begin mrbayes;
6. Just like the PAUP block that we have written before, this MrBayes block defines the settings and parameters for the analysis. Insert a line below this command that tells
MrBayes to close the program window after the analysis is finished, and add another line
specifying the name of a log file. This ensures that the screen output is also saved as a
file for future reference. The log file will be replaced if you re-run the analysis; if you do
not want this, write append instead of replace.
set autoclose=yes;
log start filename=[file name]_log replace;
7. MrBayes allows the use of different substitution models for different gene loci or codon
positions. We will use the partitioning scheme and the models from PartitionFinder. For
each data partition add a line that starts with charset. Add a name for the data partition
and describe its position in the alignment as follows.
charset A = 1-xxxx\3; [matK 1st position]
charset B = yyyz-zzzz\3 2-zzzz\3; [rbcL and atp1 3rd positions]
...
To avoid typing errors keep the names as simple as possible and avoid blank spaces,
numbers and full stops. Different parts of the alignment may be combined into one partition. Use a blank space, not a comma, to separate them. Finish each line with a semicolon. Include more descriptive names of partitions in square brackets to aid your memory.
8. MrBayes may successively analyse several datasets, for example calculate trees first for
every partition separately and then in a combined analysis. Defining the partitions is
therefore not sufficient; we also have to tell MrBayes which partitions make up the
dataset for the analysis. The command for this is partition, followed by a name, the
number of partitions and the names of the partitions to be used. Several different
datasets may be defined.
partition threegenes=5:matK_1, rbcL_2_atp1_2 ...;
9. Now tell MrBayes which dataset to analyse:
set partition=threegenes;
10. Like in PAUP, the definition of substitution models is initiated by the command lset. You
have to define as many models as there are partitions in the dataset, and then tell
MrBayes for which partition the respective model is used by applyto=(number). Next
you define the number of different substitution rates (1, 2 or 6, as for the ML analysis).
The rates command allows you to distinguish between models with equal rates for all
positions (equal), gamma-distributed site-specific rates (gamma), a proportion of
invariable sites (propinv) and both (invgamma). The command lines look like this:
lset applyto=(1) nst=[value] rates=[value];
lset applyto=(2) nst=[value] rates=[value];
...
11. Models also differ in the way they treat base frequencies. Some simple models assume that all nucleotides are equally frequent in the dataset. By default, MrBayes assumes unequal base frequencies and tries to optimize the relative frequencies using a Dirichlet prior distribution. If some of your models require equal base frequencies, you have to tell MrBayes to use a different “state frequency prior”. Make sure that the prior applies to the correct data partition:
prset applyto=(1) statefreqpr=fixed(equal);
prset applyto=(4) statefreqpr=fixed(equal);
...
At this stage you may change the prior distributions of all possible parameters, including
the branch length prior (see page 26).
12. If different models contain the same type of parameter, e. g. two models that both use gamma-distributed site-specific rates, MrBayes will try to find a single value for this parameter, which is then applied to all partitions. If you want MrBayes to infer the values of these parameters independently, you have to unlink them across partitions. The command is:
unlink [Parameter]=(all);
In our case we want to optimize the shape parameter alpha, the proportion of invariable positions, the nucleotide frequencies and the substitution rates independently. This is done by the following series of commands:
unlink shape=(all);
unlink pinvar=(all);
unlink statefreq=(all);
unlink revmat=(all);
Alternatively, you may decouple a parameter only between some of the partitions. This command would look like: unlink shape=(number,number,...);
13. You have probably noticed that, in spite of these additional options, the definition of substitution models is less specific in MrBayes than in PAUP or Garli. We only describe how many and which different parameters there are (e. g. the number of different substitution rates, the shape parameter of the gamma distribution etc.). The basic model is specified, but we do not fix the parameter values as we did for PAUP or Garli. These values are optimized by the Markov chain while it is running.
14. Next you have to define the settings for the Markov chain. The most important settings concern the number of generations to be run (ngen) and the frequency at which trees are sampled (samplefreq). You may also determine how often the progress of the analysis is printed to the screen (printfreq) and whether trees are saved with or without branch lengths (savebrlens). Another setting determines how many parallel Markov chains are run (nchains). The following command line tells MrBayes to run 1 million generations using four parallel Markov chains, sample and print every 100th generation and save trees with branch lengths. Many more options may be specified.
mcmc ngen=1000000 printfreq=100 samplefreq=100 nchains=4
savebrlens=yes filename=[file name]_out;
15. Finally, you want MrBayes not just to sample 10,000 trees but also to summarize them in a consensus tree (sumt) and to summarize the parameters of the substitution models in a separate file (sump). The respective command lines look like this:
sump filename=[file name]_out relburnin=yes burninfrac=0.5;
sumt filename=[file name]_out relburnin=yes burninfrac=0.5 contype=allcompat;
The options relburnin=yes and burninfrac=0.5 mean that MrBayes discards the first 50 % of the sampled trees as burnin; with ngen=1000000 and samplefreq=100, these are the first 5,000 of roughly 10,000 sampled trees. contype=allcompat tells the program to include clades in the consensus tree even if they have less than 50 % support, as long as they are compatible with the more highly supported groups.
16. The end of the MrBayes block is defined by:
END;
17. Save the dataset in your MrBayes folder.
18. Open this folder and double click on the MrBayes executable. A command prompt will open, in which you have to type execute [file name] to start the analysis. The program will report every single step of the analysis on the screen. If no error messages appear, every 100th generation of the Markov chain will be shown, with the likelihood values of the sampled trees in brackets. In the right column the program tells you how long it will still need to finish the analysis. After it has finished, the program summarizes the results on the screen and saves them in a number of different files.
19. Among other things, we see a rather crude plot on the screen in which MrBayes illustrates the development of the likelihood values during the run. After a steep increase, these values reach a plateau. All trees sampled before the plateau have to be discarded. We have anticipated this decision by telling MrBayes to discard the first half of the trees. Depending on the number of different models and parameters, the screen shows a long row of values: tree length (TL), transition/transversion rates (kappa), substitution rates (r…), nucleotide frequencies (pi…), gamma shape parameters (alpha) and proportions of invariable sites (pinvar). These data are also stored in a file with the extension *.pstat.
20. After this, we get a list of taxa and clades (called “bipartitions”), another list with the posterior probabilities of the clades (bipartitions), a list of branch lengths with standard deviations, a tree with posterior clade probabilities (called “credibility values”) and another tree with branch lengths. These data are stored in the log file and in a file with the extension *.parts.
21. A list of the different trees sampled by the Markov chain is stored under the extension *.trprobs. This file also contains the probabilities of the single trees and their cumulative probabilities. This allows us not only to concentrate on the consensus tree of the analysis, but also to evaluate sets of trees, for example those trees that together have a posterior probability of 95 or 99 %.
22. The most important file is the consensus tree with branch lengths and posterior probabilities (*.con.tre), which we open and edit in FigTree as usual.
23. If you want to check the probability of a clade that is not part of the consensus tree, you can do so by opening the *.parts file in a text editor such as Word.
24. The sampled trees are stored under *.t, and the sampled model parameters under *.p,
separately for each run.
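For orientation, here is how the pieces from steps 5 to 16 fit together. This is only a skeleton with invented placeholders: the charset boundaries, the partition name twogenes and the two example models (GTR + gamma and HKY + I) must be replaced by your own partitions and by the models suggested by PartitionFinder (Appendix 2 shows a fully worked block).

begin mrbayes;
set autoclose=yes;
log start filename=[file name]_log replace;
charset A = 1-xxxx\3; [first partition]
charset B = yyyy-zzzz; [second partition]
partition twogenes=2: A, B;
set partition=twogenes;
lset applyto=(1) nst=6 rates=gamma; [e. g. GTR + gamma]
lset applyto=(2) nst=2 rates=propinv; [e. g. HKY + I]
unlink statefreq=(all); [unlink further parameters as in step 12 if several partitions share them]
mcmc ngen=1000000 printfreq=100 samplefreq=100 nchains=4 savebrlens=yes filename=[file name]_out;
sump filename=[file name]_out relburnin=yes burninfrac=0.5;
sumt filename=[file name]_out relburnin=yes burninfrac=0.5 contype=allcompat;
END;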
The Molecular Clock
Systematists are often interested in more than just the phylogenetic relationships of their pet
organisms. For example, one might also want to date evolutionary events. Temporal information
might help to answer questions regarding biogeography and evolutionary mechanisms. Zuckerkandl and Pauling (1965) supposed that the genome accumulates substitutions at a constant
rate. This idea of a “molecular clock” was also part of the “neutral” theory of evolution by Kimura
(1968, 1983). According to this theory molecular evolution is a random process that depends
mainly on time.
If the hypothesis of a molecular clock is true, we would expect the genetic distance between taxa to be proportional to the time that has passed since the taxa descended from their most recent common ancestor. In the tree below, the divergence time between taxa A and O is identical to that between taxa B and O. Hence, we would expect the genetic distances dAO and dBO to be equal as well.
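In symbols: if substitutions accumulate at a constant rate r, two lineages that diverged T time units ago have each accumulated about r · T substitutions per site, so their expected distance is

dAB = 2 · r · T

(the factor 2 arises because both lineages evolve independently after the split). A rate of, say, r = 0.01 substitutions per site per million years and a divergence time of T = 5 million years (invented values for illustration) would thus predict a distance of dAB = 0.1 substitutions per site.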
The following table shows the data that induced Zuckerkandl and Pauling (1965) to postulate the molecular clock. It lists the differences between the amino acid sequences of α-hemoglobin in shark, carp, newt, chicken, echidna, kangaroo, dog and humans. The paleontological evidence suggests that the shark is the outgroup to all other taxa. Accordingly, the genetic distances between the shark and the other vertebrates should be equal. The same logic suggests that the genetic distances between humans and the other taxa increase with the distance of the relationship. Indeed, we see similar distances in each row and more or less decreasing distances in the columns.
Strict molecular clocks
It is possible to compute phylogenetic trees in which the distance from the root to the tips is identical for all taxa. In such an ultrametric tree the branch lengths are proportional to time. If we know the age of a single node in such a tree, we can calibrate all other nodes and make inferences about the evolution of groups for which we have no fossil evidence. This sounds like a great idea, but one has to be cautious. The substitution models that we have discussed above do not assume that substitution rates are constant across time and taxonomic groups. In fact, we know that this is not the case. Rodents, for example, have much higher substitution rates than primates, partly because of their shorter generation times. Without the clock assumption, the branch lengths of a tree are adjusted freely to optimize the likelihood of the tree. They represent the product of time and (variable) substitution rate and can have very unequal lengths. Forcing the program to assume equal rates for all taxa and branches changes the substitution model substantially. We can compare the likelihood of a given tree with and without an enforced clock. The clock tree always has a lower likelihood than a tree in which branch lengths can vary freely. If the difference is not too big, a strict molecular clock reconstruction may be justified. Unfortunately, in most cases the difference in likelihood is so big that there is simply no justification for using one.
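This comparison can be formalized as a likelihood-ratio test (described, e. g., in Felsenstein 2004; the numbers below are invented for illustration). Twice the difference in log-likelihood,

2 · (ln L(no clock) - ln L(clock)),

is compared with a χ² distribution with s - 2 degrees of freedom, where s is the number of taxa: an unconstrained tree has 2s - 3 free branch lengths, a clock tree only s - 1 node ages. If an unconstrained tree of 20 taxa has ln L = -5000 and the clock tree has ln L = -5012, the test statistic is 24 with 18 degrees of freedom, below the 5 % critical value of about 28.9, so the clock hypothesis would not be rejected.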
Relaxed molecular clocks
Instead of simply assuming equal rates, it is possible to treat substitution rate and time as different parameters that are optimized during the computation of the tree. Such models are called “relaxed molecular clocks” to distinguish them from strict clocks. A relaxed clock tree enables us to date nodes while at the same time allowing substitution rates to vary across branches. The most complex models assign a different rate to every branch.
One of the most frequently used models is the so-called “uncorrelated lognormal (ULN) relaxed molecular clock” implemented in the computer program BEAST. The way in which BEAST assigns substitution rates to branches resembles the way in which site-specific substitution rates are defined. Rates are drawn randomly from a lognormal distribution (see the figure below) and assigned to the branches of the tree. This is done by a Markov chain which also varies the topology and branch lengths (and the parameters of the substitution model).
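In symbols (a sketch of the model, not part of the exercises; μ and σ denote the mean and standard deviation of the log rate): every branch i receives its own rate

ri ~ Lognormal(μ, σ)

and the expected number of substitutions along that branch is the product ri · ti of this rate and the branch duration. “Uncorrelated” means that the rate of a branch is drawn independently of the rate of its parent branch; μ and σ are estimated from the data together with everything else.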
Calibrating the molecular clock
A clock tree represents the relative ages of the nodes in comparison to each other. Such a phylogeny is of little use if you cannot assign absolute ages (mostly in millions of years) to some of the nodes. Very different sources of information from geology and paleontology can be used to date phylogenies. A dated fossil may be used if it shows at least one synapomorphy that is typical of a clade of the tree. This is taken as evidence that the respective clade already existed at the time of the fossil, whose age is then taken as the minimal age of the clade. Geological events such as the origin of volcanic islands, the uplift of mountains or the splitting of distributional ranges by continental drift are also used to calibrate clock trees. Some of these calibrations have to be treated with caution. Taking the age of an island as the age of an endemic taxon may underestimate the real age, because the taxon may have evolved much earlier and colonized the island at a later stage.
The vast majority of molecular clocks are computed with Bayesian methods. One of the reasons is the possibility of treating the ages of certain nodes as prior information, which should increase the accuracy of the reconstruction. This is one of the few cases where we can replace a non-informative prior with an informative one, by restricting the age of certain nodes with prior distributions of a certain shape. The illustration below shows a few possible distributions.
A uniform distribution with an upper and a lower limit can be selected when the minimum and maximum age of a fossil taxon are known. Lognormal and exponential distributions are suitable for fossils that provide only a minimum age; for example, a fossil documenting a minimum clade age of 40 million years (an arbitrary figure) could be modelled with a lognormal prior offset by 40 million years, which excludes younger ages but allows older ones. The normal distribution is appropriate for tectonic or biogeographical data. Nodes for which no fossil (or other) data are available receive a non-informative prior age, i. e. their age is not restricted in any direction.
The following molecular clock analysis requires a whole armada of programs: BEAST, BEAUti, TreeAnnotator (from the BEAST package), Tracer and FigTree. BEAUti is used to produce an input file for BEAST.
1. Open BEAUti by double-clicking on the executable. On top of the central window you see a row of tabs. We will go through these tabs from left (“Partitions”) to right (“MCMC”) to produce the input file. First load the NEXUS file by pressing >+ on the lower left of the screen and navigating to the NEXUS file used in the Bayesian analysis.
2. BEAUti recognizes the character sets defined in the MrBayes block. For every data partition you should see a separate white row in the grey central window. If you only see a single line with the file name in the first column, you may have put the MrBayes block into square brackets or chosen the wrong file. In this case, close the file by pressing >-, choose the correct file or remove the square brackets and load the file once more.
3. For each partition the number of taxa, sites (nucleotide positions) and a few descriptors
of the substitution model are listed. In order to use the substitution and partitioning model
calculated with PartitionFinder, mark all Partition Names and choose >Unlink Subst.
Models.
4. The next tab >Taxa allows us to define nodes for calibrating the molecular clock. Press
>+ and a taxon set “untitled” appears in the left field. You can see all taxa in the data set
in the central field “Excluded Taxa”. Mark those that you want to combine in a node (e. g.
all gymnosperms) and transfer them to the field “Included Taxa” by pressing the green
arrow in the center.
5. Double-clicking “untitled” allows you to change the name of the taxon set.
6. Choose >Mono? if you want the group to appear monophyletic in your tree.
7. In the last column you can insert a phylogenetic age for the group. This age may refer to
the base of the branch on which the group is sitting (the “stem age”) or to the node from
which the group diversifies (the “crown age”). >Stem? allows you to define the age as
the stem age of the group.
8. You may define as many taxa sets as you like by clicking >+.
9. Because we have no ages for the tips of our tree and no traits to map, we skip the tabs “Tips” and “Traits” and go straight to >Sites.
10. Here we can define the substitution models for each of our data partitions. Check the output from PartitionFinder and choose the model parameters that come closest to the models suggested there.
11. Under >Clocks you may choose a >Model for the calculation of molecular clocks, e. g. a >Strict Clock or a >Lognormal Relaxed Clock (the recommended option).
12. BEAST is also used for population genetic purposes. Genes are usually exchanged between members of a population but not between different taxa. Under >Trees we can therefore choose among different mathematical models. Select >Tree Prior >Speciation: Yule Process. This option considers only speciation, unlike >Speciation: Birth-Death Process, which also models extinction.
13. We skip the >States tab, because we do not want to reconstruct ancestral character states on this tree.
14. The prior distributions of all model parameters have to be selected under >Priors. Luckily, BEAUti suggests default options for most of them. The default options are marked by a * symbol, and BEAUti alerts you to check whether they are appropriate. You may verify this by clicking on the priors and selecting >ok in the pop-up window. For this course, let us just assume that BEAUti suggested meaningful priors. However, we must check the priors that BEAUti regards as inappropriate and marks in red.
15. We skip >Operators and move on to >MCMC.
16. We define the length of the Markov chain under >Length of chain: (at least 10 million generations for a good analysis). >Log parameters every: defines how often the chain saves trees and parameters while it is running. This depends on the length of the chain. For 10 million generations one should save every 1000th tree, which yields 10,000 sampled trees.
17. With >File name stem we can select a name for the BEAST input and all output files. Do not change the suggested name; in the past, changing it has caused program crashes.
18. >Generate BEAST File finally generates the input file (*.xml). A window opens and
warns you about model priors that you have not yet checked. Press >Continue and
choose a folder in which the file is saved.
19. Running BEAST is now easy. Close BEAUTi and double click on “BEAST v1.8.2.exe”.
20. >Choose File... opens another window. Load the *.xml input file generated by BEAUTi.
21. Because we have not installed the BEAGLE library we must deactivate >Use BEAGLE
library if available. Start the analysis by pressing >Run. The front window will close.
Similar to MrBayes the progress of the analysis will be printed to another window.
22. As soon as the analysis is finished a number of statistics appear in this window. BEAST
also suggests which parameters should be changed to improve the analysis.
23. The program produces two output files (*.trees and *.log). The first contains all sampled
phylogenies, the second all parameters that were sampled during the analysis.
24. The results of the analysis have to be summarized into a consensus tree. Open “TreeAnnotator” by double-clicking on the icon.
25. After >Burnin (as states): insert the number of sampled trees to be discarded as burnin. Remember that this is not the number of generations but the number of sampled trees. Because our Markov chain was rather short, we discard 50 % of the trees.
26. Behind >Input Tree File: select >Choose File... to open the tree file produced by BEAST (*.trees), and select a name and location behind >Output File:. In both cases a pop-up window appears: select the input file, then insert a name for the output file. TreeAnnotator will produce a file with this name.
27. >Run starts the calculations in a different window. After the output file has been written,
open it in FigTree. The tree should contain node ages with 95% confidence intervals and
posterior probabilities for all nodes. Try to find out how to depict this information in
FigTree.
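A hint for reading the annotated tree (the numbers here are invented for illustration): a node labelled with an age of 12.3 and an interval of [9.8, 15.1] means that the point estimate of the node age (mean or median of the sampled ages, depending on the TreeAnnotator settings) is 12.3 million years and that 95 % of the sampled ages fall between 9.8 and 15.1 million years; the posterior probability of the node tells you how often the clade appeared among the sampled trees.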
Reconstruction of Character Evolution
Phylogenetic trees may also be used to reconstruct the evolution of certain characters within a systematic group. This may help us understand evolutionary mechanisms, for example tell us in which order characters evolved and whether the development of a certain “key innovation” was a prerequisite for the diversification of a systematic group. In order to study the temporal sequence of a character’s evolution, one must reconstruct the character states at the internal nodes or along the branches of the tree. The simplest method makes use of the parsimony criterion: the character states are assigned to the internal nodes so that the number of evolutionary changes along the tree is minimized. This method only considers the topology of the tree, not the branch lengths. But branch lengths are proportional to the time that has passed between the origin of a branch and its subsequent split. Not only nucleotide substitutions are more frequent along long branches; the same holds true for phenotypic characters. When a tree contains branches of very unequal length, it is thus better to use an evolutionary model instead of just counting state changes along the tree.
Such a model is another form of likelihood function. Like all likelihood functions, it computes the probability of observing the data (the character states of the taxa) given a hypothesis (the phylogenetic tree). But it is much simpler, because we only have to consider a single node with its two descendant branches, not a whole tree with all its internal nodes. For a character with the two states 0 and 1 it looks like this:
[Figure: a single node A with two descendant branches of lengths L1 and L2, ending in tips with the character states 0 and 1.]
L(C) = P(A=0) · (P0→0(L1) · P0→1(L2)) + P(A=1) · (P1→0(L1) · P1→1(L2))
where P(A=0) and P(A=1) denote the probabilities of the two character states at node A. P0→0(L1) is the probability that character state 0 does not change along the branch of length L1, and P0→1(L2) the probability that it changes to 1 along the branch of length L2. This product must be calculated for both possible character states at node A and the two probabilities summed, because they represent mutually exclusive events. An evolutionary model is necessary to compute the probabilities of change within a certain time (or branch length) unit. For a character with two states, the model only needs two parameters: q01, the rate of change from 0 to 1, and q10, the rate from 1 to 0. The rate matrix of this model looks like this:
        to 0      to 1
from 0  1 - q01   q01
from 1  q10       1 - q10
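For completeness (this goes beyond what the exercise requires, and the numbers are invented): under this two-state model, the probability of changing from 0 to 1 along a branch of length t is

P0→1(t) = q01/(q01 + q10) · (1 - e^-(q01 + q10)·t)

with P0→0(t) = 1 - P0→1(t); swapping q01 and q10 gives the probabilities for a start in state 1. With q01 = q10 = 1, P(A=0) = P(A=1) = 0.5 and branch lengths L1 = L2 = 0.1, we get P0→0 ≈ 0.91 and P0→1 ≈ 0.09, so that L(C) ≈ 0.5 · (0.91 · 0.09) + 0.5 · (0.09 · 0.91) ≈ 0.08.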
We want to reconstruct the character evolution in our dataset under parsimony and likelihood using the program Mesquite. Mesquite is a very complex program for phylogenetic data analysis with a fundamentally different structure from the programs we have used so far. Most other programs take one or two input files and write a variable number of output files (substitution models, log files, trees etc.). Mesquite combines several input files into so-called projects and analyzes them together. In order to reconstruct character evolution, it needs phylogenetic trees and a text file with the character states of all taxa in the dataset.
1. Mesquite needs a character matrix in PHYLIP format that contains the character states for all taxa included in the analysis. The PHYLIP format is much simpler than the NEXUS format. The first line consists of two numbers: the number of taxa and, separated by one or several blanks, the number of characters. Each following line contains a taxon name and a blank, followed by the character states. The file must be in plain text format and can be written in any text editor or exported from a database or spreadsheet. In the latter case you need to check the file carefully and correct any errors before you feed it into Mesquite.
100 6
Taxon1 1 0 2 1 1 3
Taxon2 1 3 2 1 1 1
Taxon3 2 3 2 2 0 1
...
2. Copy the ML tree without bootstrap values calculated by RAxML into the same folder as
the PHYLIP file.
3. Open Mesquite with a double click. Several windows will open of which only a single one
will remain. Above a display with several lines of text (the log output) you can see four
tabs “Log”, “Projects and Files”, “Search” and “Simplification”.
4. Open the character matrix by >File>Open and selecting the respective file.
5. In a separate window, Mesquite will prompt you to “Please choose an interpreter for this
file”. Select “PHYLIP (categorical data)” and >ok.
6. A further window will open with the request to “Save imported file as NEXUS file”. Select a name and a folder in which the data are to be stored and >Save. (No project without a NEXUS file.) A new display appears, marked by a tab “Project of [thenamechosenbyyou]”.
7. To connect a phylogenetic tree with this project load the ML tree without bootstrap values
(RAxML_bestTree...tre) by >File>Link File...
8. Once more, you have to select an interpreter. Select “PHYLIP (tree)” and a name for the
file.
9. Below the line “Project” you can now see the names of the two NEXUS files and below
that “Taxa”, “Character Matrix” and “Imported Trees”.
10. >View Trees opens a new window “Tree Window 1 showing stored trees” which partly
conceals the previous window. You can toggle between windows by clicking on the tabs.
11. >Drawing>Tree Form offers a range of possibilities to display the tree. “Balls and Sticks” is particularly well suited for character state reconstructions. In this view the nodes are represented by circles that later turn into pie charts in which the reconstructed character states are shown. This is only meaningful when more than one possible state is reconstructed at a single node. For example, under likelihood, the relative contribution of each state to the total likelihood at a node is shown.
12. On the left side of the window there are several tools with which you can manipulate the
tree. Position the cursor on the symbols and read the description appearing in the lower
part of the window.
13. The layout of the tree can be changed with several different options under >Drawing.
14. >Show Node Numbers numbers the nodes.
15. >Orientation>Right tips the tree to the right.
16. Try also the commands >Line Style>Square and >Balls on Internal Nodes.
17. >Branches Proportional to Lengths displays the tree with branch lengths.
18. >Centered Branches shifts the internal branches into the center of the ramification.
19. Character state reconstruction is started by >Analysis>Trace Character History.
20. In a new window you will be asked for the data source for the character states. Select
“Stored Characters” and >OK.
21. To begin, select “Parsimony Ancestral States”. The colour of some nodes will change.
22. “Trace Character” in the upper left allows you to select the character that is displayed
along the tree. Below this you can see the number of character state changes for the
most parsimonious solution. Place your cursor on a node or a branch to see which state
was reconstructed as the most parsimonious.
23. Reconstruct the state changes under ML by selecting >Analysis>Trace Character History once more, this time choosing the likelihood reconstruction method instead of “Parsimony Ancestral States”.
24. A new tree appears, together with a new window “Trace Character” on top of the previous one. You can drag it with your mouse to see both windows.
25. To toggle between the two reconstructions choose >Tree>Bring to Front and one of the
two “Trace Character Histories”.
26. You can close reconstructions by >Tree>Close/Remove and selecting one of the two
entries.
27. Save all reconstructions as graphics (>File>Save Tree as pdf...). Mesquite only supports pdf as a graphics format. The tree without pie charts may be exported in other formats that can be read e. g. by FigTree (>File>Export...).
References
Avery, O. T., MacLeod, C. M. & McCarty, M. 1944. Studies on the chemical nature of the substance inducing transformation of pneumococcal types: Induction of transformation by a desoxyribonucleic acid fraction isolated
from Pneumococcus type III. Journal of Experimental Medicine 79: 137-158.
Camin, J.H. & Sokal, R.R. 1965. A method for deducing branching sequences in phylogeny. Evolution 19: 311-326.
Cavalli-Sforza, L.L. & Edwards, A.W.F. 1967. Phylogenetic analysis: Models and estimation procedures. Evolution
21: 550-570.
Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. 1994. The History and Geography of Human Genes. Princeton:
University Press.
Chargaff, E., Lipshitz, R., Green, C. & Hodes, M. E. 1951. The composition of the desoxyribonucleic acid of salmon
sperm. Journal of Biological Chemistry 192: 223-230.
Darwin, C. 1859. On the origin of species by means of natural selection, or the preservation of favoured races in
the struggle for life. London: John Murray.
Edwards, A.W.F. & Cavalli-Sforza, L.L. 1963. The reconstruction of evolution. Heredity 18: 553.
Edwards, A.W.F. & Cavalli-Sforza, L.L. 1964. Reconstruction of evolutionary trees. In: Heywood, V.H. & McNeill, J. (eds.) Phenetic and Phylogenetic Classification. London: Systematics Association Publ. 6, pp. 67-76.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17: 368-376.
Felsenstein, J. 2004. Inferring phylogenies. Sunderland, MA, Sinauer.
Fitch, W.M. 1971. Toward defining the course of evolution: Minimum change for a specified tree topology. Systematic Zoology 20: 406-416.
Fitch, W.M. & Margoliash, E. 1967. Construction of phylogenetic trees. Science 155: 279-284.
Fox, G. E., Stackebrandt, E., Hespell, R. B., Gibson, J., Maniloff, J., Dyer, T. A., Wolfe, R. S., Balch, W. E., Tanner,
R. S., Magrum, L. J., Zablen, L. B., Blakemore, R., Gupta, R., Bonen, L., Lewis, B. J., Stahl, D. A., Luehrsen,
K. R., Chen, K. N. & Woese, C. R. 1980. The phylogeny of prokaryotes. Science 209: 457-463.
Haeckel, E. 1866. Generelle Morphologie der Organismen. Allgemeine Grundzüge der organischen Formen-Wissenschaft. Berlin: Georg Reimer.
Jukes, T.H. & Cantor, C.R. 1969. Evolution of protein molecules. In: Munro, M.N. (ed.) Mammalian Protein Metabolism. Vol. 3. New York: Academic Press, pp. 21-132.
Kimura, M. 1968. Evolutionary rate at the molecular level. Nature 217: 624-626.
Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge, University Press.
Knoop, V., Müller, K. 2006. Gene und Stammbäume. Ein Handbuch zur molekularen Phylogenetik. Heidelberg,
Elsevier, Spektrum Akademischer Verlag.
Lanfear, R., Calcott, B., Ho, S. Y. W. & Guindon. S. 2012. PartitionFinder: combined selection of partitioning
schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution 29: 1695–
1701.
Mendel, J.G. 1866. Versuche über Pflanzen-Hybriden. Verhandlungen des naturforschenden Vereines in Brünn 4:
3–47.
Michener, C.D. & Sokal, R.R. 1957. A quantitative approach to a problem in classification. Evolution 11: 130-162.
Miescher, F. 1874. Die Spermatozoen einiger Wirbelthiere. Ein Beitrag zur Histochemie. Verhandlungen der Naturforschenden Gesellschaft Basel 6: 138-208.
Mullis, K., Faloona, F., Scharf, S., Saiki, R., Horn, G. & Erlich, H. 1986. Specific enzymatic amplification of DNA in
vitro: the polymerase chain reaction. Cold Spring Harbour Symposia on Quantitative Biology 51: 263-273.
Mullis, K. & Faloona, F. 1987. Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods in Enzymology 155: 335-350.
Saitou, N. & Nei, M. 1987. The neighbour-joining method: A new method for reconstructing phylogenetic trees.
Molecular Biology and Evolution 4: 406-425.
Sanger, F., Nicklen, S. & Coulson, A.R. 1977. DNA sequencing with chain-terminating inhibitors. Proceedings of
the National Academy of Sciences of the USA 74: 5463-5467.
Schulte, K., Naduvilezhath, L., Nierbauer, K.U., Silvestro, D., Michalak, I., Metzler, D., Zizka, G. & Printzen, C.
2013. E-volution: Der E-Learning-Kurs zur phylogenetischen Analyse. Beta-Version. URL: http://lernbar.uni-frankfurt.de/.
Sober, E. 2008. Evidence and Evolution. The logic behind the science. Cambridge University Press.
Sokal, R. R. & Rohlf F. J. 1995. Biometry. The principles and practice of statistics in biological research. 3rd ed.
New York, Freeman & Co.
Stanier, R. Y., Doudoroff, M. & Adelberg, E. A. 1964. The Microbial world (2nd ed.). Englewood Cliffs N.J.: Prentice-Hall.
Watson, J. D. & Crick F. H. 1953. Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid.
Nature 171: 737–738.
Wu, R. & Kaiser, A. D. 1968. Structure and base sequence in the cohesive ends of bacteriophage lambda DNA.
Journal of Molecular Biology 35: 523-537.
Yang, Z. & Rannala, B. 1997. Bayesian phylogenetic inference using DNA sequences: A Markov Chain Monte Carlo method. Molecular Biology and Evolution 14: 717-724.
Zar, J. H. 1999. Biostatistical analysis. 4th ed. Upper Saddle River, NJ, Prentice Hall.
Zhu, L. & Gigerenzer G. 2006. Children can solve Bayesian problems: the role of representation in mental computation. Cognition 98: 287-308.
Zuckerkandl, E. & Pauling, L. 1962. Molecular disease, evolution, and genetic heterogeneity. In: Kasha, M. & Pullman, B. (eds.). Horizons in Biochemistry. New York: Academic Press, pp. 189–225.
Zuckerkandl, E. & Pauling, L. 1965. Evolutionary divergence and convergence in proteins. In: Bryson, V. & Vogel,
H.J. (eds). Evolving genes and proteins. New York: Academic Press, pp 97–166.
Appendix 1: Commands to Define Substitution Models in MrBayes
GTR
lset applyto=() nst=6                  # GTR
lset applyto=() nst=6 rates=propinv    # GTR + I
lset applyto=() nst=6 rates=gamma      # GTR + gamma
lset applyto=() nst=6 rates=invgamma   # GTR + I + gamma
SYM
lset applyto=() nst=6
prset applyto=() statefreqpr=fixed(equal)
# SYM
lset applyto=() nst=6 rates=propinv
prset applyto=() statefreqpr=fixed(equal)
# SYM + I
lset applyto=() nst=6 rates=gamma
prset applyto=() statefreqpr=fixed(equal)
# SYM + gamma
lset applyto=() nst=6 rates=invgamma
prset applyto=() statefreqpr=fixed(equal)
# SYM + I + gamma
HKY
lset applyto=() nst=2                  # HKY
lset applyto=() nst=2 rates=propinv    # HKY + I
lset applyto=() nst=2 rates=gamma      # HKY + gamma
lset applyto=() nst=2 rates=invgamma   # HKY + I + gamma
K2P (=K80)
lset applyto=() nst=2
prset applyto=() statefreqpr=fixed(equal)
# K2P
lset applyto=() nst=2 rates=propinv
prset applyto=() statefreqpr=fixed(equal)
# K2P + I
lset applyto=() nst=2 rates=gamma
prset applyto=() statefreqpr=fixed(equal)
# K2P + gamma
lset applyto=() nst=2 rates=invgamma
prset applyto=() statefreqpr=fixed(equal)
# K2P + I + gamma
F81
lset applyto=() nst=1                  # F81
lset applyto=() nst=1 rates=propinv    # F81 + I
lset applyto=() nst=1 rates=gamma      # F81 + gamma
lset applyto=() nst=1 rates=invgamma   # F81 + I + gamma
Jukes Cantor
lset applyto=() nst=1
prset applyto=() statefreqpr=fixed(equal)
# JC
lset applyto=() nst=1 rates=propinv
prset applyto=() statefreqpr=fixed(equal)
# JC + I
lset applyto=() nst=1 rates=gamma
prset applyto=() statefreqpr=fixed(equal)
# JC + gamma
lset applyto=() nst=1 rates=invgamma
prset applyto=() statefreqpr=fixed(equal)
# JC + I + gamma
Appendix 2: Example MrBayes-block
BEGIN MRBAYES;
log start filename=example_log append;
[MrBayes writes a “checkpoint” file, in which the parameters of the analysis are stored every 100,000 generations. If the analysis is unexpectedly stopped, append allows you to restart it from the last “checkpoint”.]
set autoclose=yes;
delete [taxon name];
[deletes single taxa from the dataset]
outgroup [taxon name];
[specifies an outgroup]
charset A = 1-118 270-371;
charset B = 119-269 980-1881 1883-2581\3;
charset C = 372-979;
charset D = 1882-2581\3;
charset E = 1884-2581\3 2582-3661\3;
charset F = 2583-3661\3;
charset G = 2584-3661\3;
partition PartFind = 7: A, B, C, D, E, F, G;
set partition = PartFind;
lset applyto=(all) nucmodel=4by4;
[other options for the nucleotide model are doublets
for stem regions of ribosomal RNA (16 character states) and codon for protein
sequences (64 character states). These models are computationally intensive!]
lset applyto=(1) nst=6 rates=gamma ngammacat=6;
[GTR+G]
lset applyto=(2) nst=6 rates=invgamma ngammacat=6;
[SYM+I+G, equal nucleotide frequencies, see prior settings below]
lset applyto=(3) nst=2 rates=invgamma ngammacat=6;
[HKY+I+G]
lset applyto=(4) nst=2 rates=gamma ngammacat=6;
[K2P+G, equal nucleotide frequencies, see prior settings below]
lset applyto=(5) nst=2 rates=gamma ngammacat=6;
[K2P+G, equal nucleotide frequencies, see prior settings below]
lset applyto=(6) nst=6 rates=propinv;
[GTR+I]
lset applyto=(7) nst=2 rates=propinv;
[HKY+I]
prset applyto=(all) ratepr=variable brlenspr=unconstrained:exp(18) topologypr=uniform;
[brlenspr sets the prior distribution from which branch lengths are drawn. The options are unconstrained and clock (for a molecular clock analysis). exp(18) means an exponential distribution with mean 1/18. The default for the mean is (10), but in some cases this may lead to grossly wrong branch lengths. A way to find a more suitable mean is to calculate an ML tree first and then take the mean branch length of that tree as the mean for the branch length prior. Remember that the number in brackets is the inverse of this mean!]
prset applyto=(1) shapepr=exponential(5);
[shapepr sets the prior distribution for the shape parameter alpha of the gamma distribution.]
prset applyto=(2) statefreqpr=fixed(equal) shapepr=exponential(5);
prset applyto=(3) shapepr=exponential(5);
prset applyto=(4) statefreqpr=fixed(equal) shapepr=exponential(5);
prset applyto=(5) statefreqpr=fixed(equal) shapepr=exponential(5);
unlink statefreq=(1,3,6,7) revmat=(1,2,6) tratio=(3,4,5,7) shape=(1,2,3,4,5) pinvar=(2,3,6,7);
[State frequencies are fixed for partitions 2, 4 and 5. They only have to be unlinked between the other partitions. A full rate matrix (revmat) is
only present for the GTR and SYM models (partitions 1, 2, 6). The HKY and K2P model (3, 4, 5, 7) only have a transition/transversion ratio.]
mcmc append=yes ngen=40000000 nruns=3 nchains=6 temp=0.15 printfreq=1000 samplefreq=200 mcmcdiagn=yes diagnfreq=100000 diagnstat=avgstddev minpartfreq=0.10 relburnin=yes burninfrac=0.5 stoprule=yes stopval=0.01 starttree=random filename=example_out;
[append=yes allows you to restart the analysis if it is unexpectedly stopped (power failure etc.). diagnfreq specifies how often the average standard deviation of branch frequencies (avgstddev) is compared between runs. Poorly supported branches will vary widely between trees. minpartfreq=0.10 specifies that only branches with a support of 0.1 or higher are compared. relburnin sets the burnin proportion for the comparison. stoprule=yes tells MrBayes to stop the analysis when the difference among runs drops below a specified value (stopval).]
plot filename=example_out;
sumt filename= example_out relburnin=yes burninfrac=0.5 nruns=3 contype=allcompat;
sump filename= example_out relburnin=yes burninfrac=0.5 nruns=3;
log stop;
END;