Tutorial “Molecular Systematics”, 29 April – 13 May 2015 Christian Printzen, Tetiana Lutsak, Heike Kappes, Sabine Matuszak, Frank Lappe 1. Introduction „... any systematic attempt to construct ... natural relationships becomes the purest speculation, completely unsupported by any sort of evidence. ... the ultimate goal of biological classification can not be achieved in the case of bacteria.“ Stanier, Doudoroff & Adelberg, The microbial world, 2nd ed. (1964) „A revolution is occurring in bacterial taxonomy. ... molecular sequencing techniques permit a direct measurement of genealogical relationships. Fox et al., Science 209: 457–463 (1980) Systematics, and particularly phylogenetics, deals with the evolutionary relationships among organisms. Starting with Charles Darwin’s main work “On the Origin of Species by Means of Natural Selection” (Darwin 1859), systematists have attempted to discover genealogical relations among species. The icon for these relationships is a phylogenetic tree resembling the wellknown family trees with which we outline genealogical realtionships among our family members. The first “tree of life” was published in 1866 by Ernst Haeckel (Haeckel 1866). Until the middle of the 20th century phylogenetic trees were almost exclusively based on morphological characters. This worked relatively well for highly evolved organisms with many observable characters, but failed regularly in systematic groups with few visible and measurable features. Bacteriologists were particularly frustrated. In 1964 (see first citation above) Stanier and his coauthors threw in the towel and buried a whole discipline. In their eyes bacterial systematics was “the purest speculation”. Only 16 years later the situation had changed completely. Leading 1 bacteriologists declared a scientific revolution in Science. With the help of molecular data it had become possible to reconstruct the phylogenetic history of bacteria. Today, the analysis of molecular data can help us answer questions that were thought to be unanswerable 40 years ago. The history of an organism – from its evolutionary roots to its familial past – leaves traces in its genome. Of course, Darwin and his contemporaries knew nothing about the mechanisms and molecules of heredity. The way from the discovery of the origin of species and the laws of heredity to the calculation of molecular phylogenetic trees was long. Gregor Mendel’s work on the heredity of traits was published in 1866 (Mendel 1866), but first recognized in professional circles more than 30 years later. In 1869, the Swiss physiologist Friedrich Miescher discovered a molecule which he called “nuclein” in nuclei of pus cells washed from bandages gathered from a nearby hospital. He suspected this substance to be the molecule of heredity (Miescher 1874). But it took 75 more years until Oswald Avery and his co-workers, through their groundbreaking experiments on pneumococci, were able to show that desoxyribonucleic acid was indeed the inherited material (Avery et al. 1944). Soon afterwards, its chemical and physical structure was revealed by Erwin Chargaff et al. (1951), James Watson and Francis Crick (1953). The first piece of a DNA sequence (only 20 base pairs!) was published another 15 years later by Ray Wu and Dale Kaiser (1968). At that time, de-ciphering a DNA or protein sequence was technically challenging and extremely expensive. Molecular data was therefore not usable for taxonomists and systematists, who needed large numbers of sequences for their datasets. This changed with the publication of two articles that altered the whole of biology and, together with it, systematics and evolutionary science. In the first paper the English biochemist Frederic Sanger and two of his colleagues described how the nucleotide sequence of a DNA molecule could be identified with a relatively simple procedure (Sanger et al. 1977). The other manuscript was rejected by Science and Nature (these journals also make mistakes!) and was therefore first published in a volume of symposium procedures. In it, the American biochemist Kary Mullis and co-workers described a method to amplify DNA in a test tube, the polymerase chain reaction (PCR, Mullis et al. 1986, Mullis & Faloona 1987). Both methods together enabled systematists to determine the nucleotide sequence of particular gene loci in different species in a relatively easy and cheap way. Already in the early 1960s, Emil Zuckerkandl and Linus Pauling (1962) had postulated that phylogenetic relationships could be elucidated with the help of macromolecules. Three years later they published what is probably the first molecular phylogenetic tree based on protein sequences (Zuckerkandl & Pauling 1965). At about the same time, the development of statistical methods for phylogenetic reconstruction set on, at first without direct reference to molecular data. Charles Michener and Robert Sokal (1957) had made a start with the first statistical method for the calculation of genealogical relationships based on “traditional” morphological characters. Events came thick and fast in the 1960s, mainly through the works of Anthony Edwards and Luca Cavalli-Sforza. In 1963 they introduced the principle of “minimum evolution”, better known under the name “parsimony method” popularized by Joseph Camin and Robert Sokal (1965). The works of Thomas Jukes and Charles Cantor (1969) as well as Walter Fitch (1971) made 2 this principle applicable to molecular data. In 1967 Anthony Edwards and Luigi Cavalli-Sforza described a method for the calculation of phylogenetic trees based on genetic distances. Using a similar method, Walter Fitch and Emanuel Margoliash (1967) published a phylogeny of vertebrates in the same year. And already in 1964 Edwards and Cavalli-Sforza had theorized on a technique to reconstruct phylogenis by maximum likelihood. This method is a well established statistical tool to compare models and hypotheses. Its application failed because, at that time, there were no computers that could tackle massive amount of computer operations. It took only eight years to develop and publish the basic methods of tree reconstruction (some of them more than once in parallel), but another 17 years until Joseph Felsenstein (1981) introduced a practical maximum likelihood method for phylogenetic reconstructions. Further milestones on the way to the calculation of evolutionary trees were the neighbour joining method by Naruya Saitou and Masatoshi Nei (1987) based on genetic distances and Bayesian tree reconstruction introduced by Yang and Rannala in 1997. This compact course is designed to familiarize you with some of the methods of DNA sequence analysis used in systematics today. We concentrate on the most widely used methods: parsimony, maximum likelihood and Bayesian tree reconstruction. Distance methods are nowadays rarely used in phylogenetic reconstruction. The course can be regarded as a “cooking class” for molecular systematics and this tutorial is the cookbook. It briefly explains the background of the methods (normal text and blue boxes) and gives step by step instructions for lab procedures and use of computer programs. We use different computer programs for data analysis. Some of them have graphical user interfaces, others are command-line operated. This tutorial uses the following conventions: Bold face denotes commands from drop-down menus or dialog boxes. >File>Open, for example, means “choose Open from the menu File”. Courier denotes command line entries. hsearch addseq=random nreps=10000; <Enter> means “type these exact commands into the command line and then press Enter”. Generating DNA sequences has become rather simple routine work that is not hard to understand (although routine success in this field only comes with experience). The analytical methods of phylogenetic tree reconstruction are considerably more difficult to understand. This course will teach you how to compute phylogenetic trees from DNA sequence data, how to find the tree that is best supported by the data and how to quantify the statistical support for trees. We try to explain the statistical background with as little mathematics as possible. But we cannot do entirely without mathematics and statistics. In order to follow the course (and to find out where you need more help and explanations!) you will have to read this tutorial carefully. We also offer the e-learning program “E-volution” (Schulte et al. 2015) in parallel to this course to familiarize yourself more deeply with the most important aspects of phylogenetic reconstruction. 3 2. Course Objectives Some of the methods for calculating phylogenetic trees are not easy to understand. Although the computer does all of the computations, you need at least an intuitive understanding of these methods and the calculation steps to interpret your data. This tutorial tries to explain the background of these methods with as little mathematics as possible using examples and analogies. But it cannot do entirely without some simple algebra and statistics. In order to make the most of this course you should work through this tutorial thoroughly beforehand, even if you do not understand everything. This will enable you to ask specific questions during the course. Participants with a bascic knowledge of German can also use the e-learning platform E-volution (Schulte et al. 2013, accessible under “Biowissenschaften” at: http://lernbar.uni-frankfurt.de/) to complement the information and explanations given in this course. After this course you should be able to: 1) describe and understand phylogenetic trees, 2) compute phylogenetic trees using maximum parsimony, maximum likelihood and Bayesian methods, 3) describe weaknesses and advantages of these methods, and 4) understand and calculate the statistical support of branches in a phylogenetic tree. 5) date evolutionary events using molecular clock trees, and 6) reconstruct character evolution using phylogenetic trees. 4 3. Datengewinnung DNA zu „sequenzieren“ bedeutet, die Reihenfolge der Nukleotid-Basen – die Basensequenz – eines DNA-Abschnitts zu ermitteln. Die für den Organismus wichtigen Informationen zur Synthese von Proteinen sind in Form von Basentripletts (Dreiergruppen von Nukleotidbasen) auf der DNA gespeichert. Daneben gibt es bei Eukaryoten eine große Menge nicht-codierender DNA, die scheinbar keinerlei Informationen trägt. DNA-Sequenzen werden von einer Generation an die nächste weiter gegeben. Durch gelegentliche Fehler bei der DNA-Replikation (Mutationen) entstehen im Laufe der Zeit immer größere Sequenz-Unterschiede zwischen den Organismen. Die für Systematiker wichtigen Informationen sind genau diese erblich fixierten Veränderungen (Substitutionen), die Sequenzen im Laufe der Evolutionsoder Populationsgeschichte durchmachen. Vereinfacht kann man sagen: Je näher zwei Organismen miteinander verwandt sind, desto ähnlicher sind sich ihre DNA-Sequenzen. Es geht in der molekularen Systematik also darum, anhand von DNA- (oder Protein-) Sequenzunterschieden die Evolutionsgeschichte der Organismen nachzuvollziehen. Dies ist ein Prozess, der eine ganze Reihe von Arbeitsschritten umfasst. Die folgende Zusammenfassung soll Ihnen die Übersicht erleichtern. 1) Der erste Schritt der DNA-Sequenzierung ist die Extraktion der DNA aus Gewebeproben. Die DNA muss aus den Zellen freigesetzt und alle anderen Zellbestandteile beseitigt werden. 2) Ein oder mehrere vorher bestimmte DNA-Abschnitte müssen in so großer Konzentration vorliegen, dass man ihre Basensequenz bestimmen kann. Die Polymerase-Kettenreaktion (PCR) dient dazu, die ausgewählten DNA-Abschnitte zu vermehren. 3) Das Endprodukt der PCR-Reaktion dient als Ausgangsprodukt für die eigentliche Sequenzierreaktion. Um optimale Ergebnisse zu erzielen, setzt man eine genau definierte Menge PCRProdukt ein, dessen Konzentration man hierfür bestimmen muss. Die folgenden Arbeitsschritte werden im Kurs nicht praktisch behandelt. 4) Durch Reinigung der PCR-Produkte entfernt man alle Substanzen, die die Sequenzierreaktion stören. 5) Das gereinigt PCR-Produkt wird in der Sequenzierreaktion einer weiteren PCR unterzogen. Diesmal verwendet man nur einen Primer, so dass es zu keiner Verdoppelung der DNA mehr kommt. Meist möchte man zur Sicherheit und zum Datenabgleich beide DNA-Stränge sequenzieren und setzt deshalb zwei oder mehr Reaktionen mit jeweils unterschiedlichen Primern an. 6) Im letzten Schritt wird das Produkt der Sequenzierreaktion in einem automatischen DNASequenzierer elektrophoretisch aufgetrennt, wobei mit Hilfe von Fluoreszenzmarkern die DNASequenz bestimmt wird. DNA-Isolierung Aufarbeitung des Pflanzenmaterials und DNA-Extraktion Frisch gesammeltes Pflanzenmaterial liefert die beste Ausbeute an DNA. Auch die Qualität (ausgedrückt im Molekulargewicht der DNA) ist bei Einsatz von Frischmaterial am höchsten. Im Alltag greift man aber oft auf konserviertes Material zurück (wer hat schon ein DNA-Extraktionslabor auf Sammelreisen dabei). Am besten konserviert man die DNA von Pflanzen durch rasches Trocknen mit Silicagel-Perlen, Lagerung in einem speziellen Puffergemisch oder Tieffrieren (allerdings hat man im Gelände meist auch keinen Gefrierschrank bei sich). Auch aus Herbarbelegen, durch Pressen getrocknetes Material, kann man DNA isolieren, die allerdings häufig stark degradiert ist. Für verschiedene Organismengruppen sind die unterschiedlichsten DNA-Extraktionsmethoden entwickelt worden. Im Kurs wenden wir exemplarisch 5 zwei Methoden an: Eine Extraktion mit CTAB und Chloroform sowie eine mit einem kommerziell erhältlichen Kit („DNeasy Plant Mini Kit“ der Firma Qiagen). Bei beiden Methoden ist der erste Arbeitsgang der einfachste: Zerstören Sie das Pflanzenmaterial bis zur Unkenntlichkeit! Dies ist notwendig, weil im nächsten Schritt die DNA aus den Zellen freigesetzt werden soll. Je größer die Oberfläche, desto schneller wirken die dazu notwendigen Enzyme. Trotzdem heißt es vorsichtig vorgehen: Handschuhe tragen! DNA-Extraktion mit der CTAB-Methode Von diesem Protokoll gibt es unzählige Varianten für tierisches und pflanzliches Material. Fast jeder Forscher schwört dabei auf sein eigenes Rezept. Im Kurs wird ein für die Familie der Bromeliaceae optimiertes Protokoll, basierend auf dem Grundrezept von Doyle & Doyle (1987), verwendet: 1. Jeder Teilnehmer erhält 4 Pflanzenproben, 3 von unterschiedlich alten Herbarproben sowie 1 in Silicagel getrocknete Probe. 2. Beschriften Sie für jede Probe Deckel und Seitenwand eines 2 mL Eppendorf-Gefäßes mit der Probennummer. Schützen Sie die seitliche Beschriftung mit einem kleinen Stück Tesafilm. Wiegen Sie jeweils 20-30 mg einer Pflanzenprobe ein. 3. Pipettieren Sie 650 µL CTAB-Extraktionspuffer (CTAB = Cetyltrimethylammoniumbromid) in jedes Gefäß. Geben Sie 1,3 µL Mercaptoethanol (= MET) hinzu. Vorsicht, MET ist gif- tig – unter dem Abzug arbeiten! 4. Stellen Sie die Gefäße bei 60 °C in den Heizblock. 5. Stellen Sie für jede Pflanzenprobe einen Mörser samt Pistill bereit, geben Sie eine Messerspitze Sand und das Pistill hinein. 6. Ziehen Sie Schutzbrille und Baumwollhandschuhe, darüber Latexhandschuhe an! 7. Füllen Sie den Mörser zur Hälfte mit flüssigem Stickstoff. Vorsicht! Auch mit Baumwoll- handschuhen den gekühlten Mörser nur kurzzeitig berühren. Geben Sie die Pflanzenprobe hinzu und warten Sie, bis der Stickstoff beinahe vollständig verdunstet ist. 8. Beginnen Sie unverzüglich mit dem Mörsern. 9. Überführen Sie das Pflanzenpulver mit einem Spatel und Trichter in das entsprechende Eppendorf-Gefäß bevor es aufgetaut ist! Unter dem Abzug arbeiten! 10. Durchmischen Sie Puffer und Pflanzenmaterial durch kurzes Umschütteln des geschlos- senen Gefäßes. 11. Wenn alle Proben versorgt sind und im Heizblock stehen, warten Sie 30-60 Minuten. In dieser Zeit die Gefäße alle 5 Minuten kräftig aufschütteln. Nach Absprache mit der Kursleitung können Sie während dieser Zeit eventuell eine Pause einlegen. 12. Anschließend Proben auf Raumtemperatur (RT) abkühlen lassen. 13. Zugabe von 650 µL „Chloroform-Isoamylalkohol 24 + 1“ (gründlich gemischt). Vorsicht: Chloroform ist gesundheitsschädlich – unter dem Abzug arbeiten! Proben 10 Minuten mit der Hand schwenken. 14. Zentrifugieren Sie 15 Minuten bei 9000 rpm und Raumtemperatur. Achten Sie darauf, dass der Rotor austariert ist! 15. Beschriften Sie neue 1,5 mL Eppendorf-Gefäße mit der Probennummer. 16. Obere, DNA-haltige Phase vorsichtig mit der 1000er Pipette abziehen und in das vorbe- reitete 1,5 mL Eppendorf-Gefäß geben. Unter dem Abzug arbeiten! 6 17. Volumen der DNA-Lösung abschätzen und 0,6 Volumen Isopropanol (RT) zugeben, De- ckel schließen und vorsichtig mischen. Es dürfen keine Schlieren mehr zu sehen sein. 18. Für 1 Stunde bei –20 °C stehen lassen. Nach Absprache mit der Kursleitung können Sie während dieser Zeit eventuell eine Pause einlegen. 19. Zentrifugieren Sie 15 Minuten bei 13.000 rpm und 10 °C. Nun liegt die DNA als Pellet am Gefäßboden vor. 20. Vorsichtiges Abgießen des Überstandes in ein Abfallgefäß. Rand des 1,5 mL Eppendorf-Gefäßes gründlich auf Zellstoff abtupfen. 21. Zugabe von 500 µL 70% unvergälltem Ethanol. Gefäß kurz ausschwenken. 22. Zentrifugieren Sie 15 Minuten bei 13.000 rpm und 10 °C. DNA liegt als Pellet am Gefäßboden vor. 23. Vorsichtiges Abgießen des Überstandes in ein Abfallgefäß. Rand des 1,5 mL Eppendorf-Gefäßes gründlich auf Zellstoff abtupfen. 24. Pellet im offenen 1,5 mL Eppendorf-Gefäß bei Raumtemperatur für 20-30 Minuten trocknen. Darauf achten, dass die DNA nicht komplett trocken fällt. 25. 100 µL TE-Puffer (pH 8) und 1 μL RNAse-Lösung (10 mg/mL) zugeben. 1-2 Stunden bei 37 °C inkubieren, bis sich das Pellet gelöst hat. 26. DNA für weitere Verwendung bei 4 °C im Kühlschrank aufbewahren. Zunächst werden bei diesem Verfahren die Zellwände aufgebrochen. Im anschließend zugegebenen Extraktionspuffer wirken verschiedene Stoffe: CTAB ist ein Detergens und zerstört deshalb Zellmembranen, entfernt Polysaccharide (gerade bei Pflanzen meist reichlich vorhanden) und denaturiert Proteine (z.B. Nukleasen). EDTA blockiert DNA-modifizierende Enzyme und MET schützt vor oxidativen Schädigungen. Auch die Inkubation bei 60 °C sorgt für die Denaturierung von Proteinen. Zur Abtrennung der Nukleinsäuren von unerwünschten Stoffen wird Chloroform verwendet. Der zugesetzte Isoamylalkohol verhindert Schaumbildung. In der oberen, wässrigen Phase sammeln sich die Nukleinsäuren, die denaturierten Proteine bilden eine weißliche Grenzschicht zur unteren organischen Phase, welche u.a. Lipide enthält. Ganz unten haben sich Sand und Fasern abgesetzt. Mit Isopropanol werden Nukleinsäuren gefällt und durch Waschen mit 70%igem Ethanol von Salzen und anderen wasserlöslichen Stoffen gereinigt. Die Zugabe von RNAse sorgt für den Abbau der RNA (es soll ja mit DNA gearbeitet werden). Die DNA-Lösung ist im Kühlschrank mehrere Jahre lang haltbar. Frieren Sie DNA-Lösungen nur ein, wenn Sie sicher sind, dass sie nur noch selten gebraucht werden. Häufiges Auftauen und Einfrieren zerstört die DNA. DNA-Extraktion mit dem „DNeasy Plant Mini Kit“ (Qiagen) 1. Für jeden Teilnehmer werden wie oben 4 Pflanzenproben bereitgestellt. Beschriften Sie für jede Probe jeweils ein Schraubdeckelröhrchen. 2. Wiegen Sie 20-30 mg von jeder Probe ab und geben Sie sie in das Röhrchen. Geben Sie vorsichtig zu jeder Probe eine oder zwei Keramikkugeln hinzu. 3. Heizen Sie einen Heizblock auf 65 °C vor. 4. Erwärmen Sie die AP1-Pufferflasche im Wasserbad. 5. Mörsern Sie das Material in der Kugelmühle zu feinem Pulver. Überprüfen Sie dabei regelmäßig den Mahlgrad Ihrer Probe. 7 6. Pipettieren Sie 400 µL Puffer AP1 (Lysispuffer) und 4 µL RNAse zu jeder Probe. Durchmischen Sie Puffer und Pflanzenmaterial durch kurzes Umschütteln des geschlossenen Gefäßes. 7. Stellen Sie die Proben sofort in den vorgewärmten Heizblock. 8. Wenn alle Proben versorgt sind und im Heizblock stehen, warten Sie 30-60 Minuten. In dieser Zeit die Gefäße alle 5 Minuten kräftig aufschütteln und die folgenden Arbeitsschritte vorbereiten: 9. Beschriften Sie für jede Probe 1 violette Säule und 1 weiße Säule im Collecting tube sowie zwei 1,5 mL Gefäße mit der jeweiligen Probennummer. Auch die Collecting tubes beschriften. 10. Erwärmen Sie Puffer AE auf einem Heizblock (65 °C). 11. Nach Abschluss der Inkubation pipettieren Sie zu jeder Probe 130 µL Puffer P3 (Fällungspuffer). Pipettenspitze wechseln! 12. Mischen Sie durch Umschütteln und stellen Sie die Proben 5 Minuten auf Eis. 13. Zentrifugieren Sie das Gemisch 5 Minuten bei 14.000 rpm. Achten Sie darauf, dass der Rotor austariert ist! 14. Pipettieren Sie den Überstand auf die violette Säule. 15. Zentrifugieren Sie 2 Minuten bei 14.000 rpm. Verwerfen Sie die Säule. 16. Pipettieren Sie den Durchlauf (ohne Pellet!) in eines der 1,5 mL Eppendorf-Gefäße. 17. Fügen Sie das 1,5-fache Volumen Puffer AW1 (Bindepuffer) zum Durchlauf hinzu und mischen Sie sofort durch gründliches Auf- und Abpipettieren. Pipettenspitze wechseln! Wie stellen Sie das Volumen des Durchlaufs fest? 18. Überführen Sie 650 µL dieses Gemischs auf die weiße Säule. Pipettenspitze wechseln! 19. Zentrifugieren Sie 1 Minute bei 8000 rpm. 20. Verwerfen Sie den Durchfluss. Wiederholen Sie die Schritte 18 und 19, bis die gesamte Probe durch die Säule gelaufen ist. Verwerfen Sie danach die 1,5 mL Gefäße. 21. Setzen Sie die weiße Säule in ein neues 2 mL Collecting tube und pipettieren Sie 500 µL Puffer AW2 (Waschpuffer) auf die Säule. 22. Zentrifugieren Sie 1 Minute bei 8000 rpm. 23. Verwerfen Sie den Durchfluss. 24. Pipettieren Sie 500 µL Puffer AW2 auf die Säule. 25. Zentrifugieren Sie 2 Minuten bei 14000 rpm. 26. Übertragen Sie die weiße Säule in das zweite 1,5 mL Eppendorf-Gefäß und pipettieren Sie 100 µL warmen Puffer AE (Elutionspuffer) auf die Säule. Lassen Sie die Säulen 5 Minuten bei Raumtemperatur stehen. 27. Zentrifugieren Sie 1 Minute bei 8000 rpm. 28. Wiederholen Sie die die Elution mit weiteren 50 µL Puffer AE. Verwerfen Sie die weiße Säule und schließen Sie die Eppendorf-Gefäße. Beschriften Sie diese seitlich mit Artname und Datum. Nach dem letzten Schritt halten Sie nun die extrahierte DNA in einer Pufferlösung in Ihren Händen. Im Einzelnen haben Sie in Schritt 1-4 das Material mechanisch aufgeschlossen und in Schritt 5-7 die Zellmembranen lysiert und die DNA freigesetzt. In Schritt 9-15 haben Sie feste Bestandteile und die meisten Polysaccharide entfernt. In Schritt 16-19 wurde die DNA an eine Trägermembran gebunden. Das so gebundene Material haben Sie in Schritt 20-24 von Proteinresten, weiteren Polysacchariden, Nukleotiden und anorganischen Zellbestandteilen gereinigt. In Schritt 25-27 wurde die DNA von der Trägermembran wieder gelöst. 8 Sollte es Probleme bei der PCR geben (s.u.), kann man versuchen, statt mit Elutionspuffer mit Wasser zu eluieren, dem danach 10% TE-Puffer beigefügt wird. Das in vielen Elutionspuffern in höherer Konzentration vorhandene EDTA bindet Mg-Ionen, was in der PCR oft zu schlechter Produktausbeute führt. Photometrische Quantifizierung der DNA mit dem Qubit Fluorometer Viele PCR-Reaktionen verlaufen besser, wenn man die Menge an eingesetzter DNA genau einstellen kann. Deshalb muss im nächsten Schritt die DNA-Menge in jedem Eluat bestimmt werden. Mit Aqua bidest kann dann die benötigte Verdünnung eingestellt werden. 1. Beschriften Sie für jede Probe (CTAB-Extraktion und Säulchen-Extraktion) und für Standard 1 und 2 je ein 0,5 mL Gefäß. 2. Für jeden Ansatz (Proben und Standards) werden 200 µL Workingsolution hergestellt: 199 µL Puffer + 1 µL Farbstoff (kein Glasgefäß verwenden), mischen. 3. Legen Sie für jeden Standard 190 µl, für jede Probe dagegen 198 µL der Workingsolution im 0,5 mL Gefäß vor. 4. Geben Sie anschließend 10 µL pro Standard und 2 µL pro Probe hinzu.Gründlich vortexen und 2 min bei RT stehenlassen. 5. Qubit-Gerät anschalten, >Choose you assay >DNA >dsDNA broad range >Read new standards >Yes. 6. Den Anweisungen gemäß erst Standard 1, dann Standard 2 messen (>Read). 7. Mit >Insert Assaytube >Read die Messung der ersten Probe starten. 8. Anschließend >Calculate stock concentration wählen, und mit dem Rad das >Volume of sample used auf 2 µL einstellen. Wert in µg/mL (entspricht ng/µl) notieren. 9. Mit >Read next sample die nächste Probe messen. 10. Berechnen Sie, wie Sie die Proben verdünnen müssen, um jeweils 10 µL mit einer Kon- zentration von 5 ng/µL zu erhalten. Halten Sie Ihre Ergebnisse in einer Tabelle fest. 11. Beschriften Sie ein neues 1,5 mL Gefäß für jede Verdünnung und stellen sie diese anhand Ihrer Berechnungen her. PCR-Reaktion Die Polymerase-Kettenreaktion (polymerase chain reaction = PCR) ist eine Abfolge biochemischer Reaktionen, die bei verschiedenen Temperaturen in einem einzigen Reaktionsgefäß ablaufen. Im Grunde simuliert die PCR-Reaktion die in der Natur vor der Zellteilung erfolgende Replikation der DNA im Reagenzglas. Die isolierte genomische DNA wird in einer Pufferlösung mit einer DNA-Polymerase, Mg2+Ionen und Desoxyribonukleotiden (dNTPs) zusammengebracht. Damit die Polymerase die DNA als Vorlage verwenden kann und mit den in der Lösung vorhandenen Nukleotiden einen Komplementärstrang synthetisieren kann, müssen die beiden Stränge der DNA zuerst getrennt (denaturiert) werden. Dies geschieht bei 90-95 °C. Zusätzlich benötigt das Enzym einen Anfangsstrang, an den es weitere Nukleotide anfügen kann. Diesen kurzen Strang fügt man in Form von zwei „Primern“ hinzu (s. Abb. 1). Diese kurzen Oligonukleotide binden bei 40-60 °C an passende, komplementäre DNA-Abschnitte (annealing). Damit hat man die Möglichkeit zu bestimmen, welche Teile der DNA man amplifizieren will. Die am häufigsten verwendete Polymerase stammt aus dem in heißen Quellen lebenden Bakterium Thermophilus aquaticus und ist deshalb hitzestabil (taq-Polymerase). Die optimale Reaktionstemperatur liegt bei 72 °C. Der Zyklus von DNA-Denaturierung, Primer-annealing und DNA-Synthese wird in einem automati- 9 schen Thermocycler 30 bis 40 mal durchlaufen, wobei sich die Menge des ausgewählten DNAAbschnitts im (nie erreichten) Idealfall jedesmal verdoppelt. Im Kurs soll ein Abschnitt der Chloroplasten-DNA amplifiziert werden, nämlich die Barcoding-Region von rbcL (Gen für die große Untereinheit des Enzyms RuBisCO). Abb. 1: Primerpositionen für die Sequenzierung des 5’-Endes der rbcL. PCR-Reaktionen können sehr launisch sein; geringe Veränderungen im Protokoll führen oft zu kompletten Fehlschlägen. Trotzdem werden in verschiedenen Laboren oft ganz unterschiedliche Protokolle verwendet. Das am häufigsten verwendete Verfahren zur Optimierung von PCR Reaktionen ist „Versuch und Irrtum“: Verläuft die PCR schließlich wie erwünscht, verändert man das Protokoll nicht mehr (oder nur noch, wenn man testen will, ob sich das Resultat noch verbessern lässt). Multiple Banden, unsaubere Banden oder fehlendes PCR-Produkt sind die häufigsten Probleme. Fehlendes PCR-Produkt z. B. kann die unterschiedlichsten Ursachen haben: - Das Extraktionsprotokoll hat versagt; das Eluat enthält keine oder zu wenig DNA (Sollte nach Quantifizierung der DNA ausgeschlossen sein). - Man hat zuviel DNA hinzugefügt (auch das verursacht Probleme, lässt sich aber durch Quantifizierung vermeiden). - Die Ausgangs-DNA ist degeneriert (z. B. durch häufiges Auftauen und wieder Einfrieren). - Der DNA-Extrakt enthält Stoffe (z. B. Polysaccharide, EDTA), die die PCR stören. - Die Polymerase ist überaltert/ nicht tiefgefroren gelagert. - Einer oder beide Primer sind überaltert/ nicht tiefgefroren gelagert. - Irgendein Idiot hat die Primer falsch beschriftet. - Die Primer passen nicht auf die Bindungsstellen (kann sogar bei angeblich universellen Primern hin und wieder geschehen). - Man hat eine Zutat im Ansatz vergessen. - Die Annealing-Temperatur ist zu hoch. - Der Thermocycler ist defekt. Der Lösung kann man nur auf die Spur kommen, wenn man der Reihe nach die verschiedenen Möglichkeiten ausschließt. Im Zweifelsfall empfiehlt es sich, erfahrene Kollegen oder das Internet um Rat zu fragen. Die Polymerase-Kettenreaktion erfolgt in PCR-Gefäßen (0,2 ml). Obwohl die optimale Reaktionstemperatur bei 72° C liegt, ist die Polymerase auch bei Zimmertemperatur schon aktiv und beginnt, hier und da Nukleotide an Primer anzubauen. Das kann die PCR-Reaktion stören, weshalb das Ansetzen der PCR auf Eis erfolgt. Teurere sog. „hot start“ Enzyme enthalten Antikörper, die die Polymerase deaktivieren. Durch mehrminütiges Vorerhitzen auf 96 °C im Thermocycler werden die Antikörper denaturiert und das Enzym aktiviert. Mit solchen Enzymen kann man auch bei Zimmertemperatur arbeiten. 1. Ziehen Sie Handschuhe an und tauen Sie die folgenden Zutaten auf: - destilliertes und autoklaviertes Wasser - Peqlab 10x Reaktionspuffer S (blauer Deckel) = PCR-Puffer - MgCl2 - dNTP-Mix 10 - Primer rbcLa_F und rbcLa_R - DMSO - BSA - Trehalose Nach dem Auftauen mischen Sie jedes Reagenz durch kurzes Schwenken, zentrifugieren es kurz an und bewahren es in Eis auf (ausgenommen DMSO). 2. Beschriften Sie für jeden PCR-Ansatz und die Negativkontrolle je ein PCR-Gefäß (0,2 ml) und stellen Sie es auf Eis. 3. Die Taq-Polymerase ist trotz Aufbewahrung im Gefrierschrank flüssig und wird direkt auf Eis überführt. 4. Ihr DNA-Extrakt überführen sie direkt vom Kühlschrank auf Eis. Abgesehen von der Ausgangs-DNA enthalten Ihre PCR-Ansätze die gleichen Chemikalien in gleicher Konzentration. Um sich unnötiges Pipettieren zu ersparen, stellt man daher zuerst einen sog. „Mastermix“ her, der alles enthält außer der zu amplifizierenden DNA. Jeder PCR-Ansatz enthält 25 μL, davon entfallen 5 μL auf das DNA-Extrakt. Reaktionen mit 50 μL sind auch üblich, 100 μL Ansätze wegen der hohen Kosten nur, wenn große Mengen DNA benötigt werden. Aufgabe: Der Mastermix enthält die folgenden Bestandteile. DNA-Template PCR-Puffer (blau) dNTP-Mix MgCl2 Primer rbcLa_F Primer rbcLa_R DMSO BSA Trehalose Taq-Polymerase Wasser Ausgangskonzentration Konzentration/ Reaktion Volumen/ Reaktion 5 ng/µL 10 x 2 mM 25 mM 10 µM 10 µM 1 ng/µL 1x 0,32 mM 2 mM 0,4 μM 0,4 μM 5 µl 100 x 3M 5 U/ μL - 4x 0,25 M 2,5 U ad 25 µL Volumen/ x Reaktionen - 2 μl Berechnen Sie die Volumina der einzelnen Bestandteile für eine 25 μL-Reaktion und einen Mastermix für die x 25 μL-Reaktionen. Bedenken Sie dabei, dass Sie 5 μL DNA-Extrakt hinzufügen müssen. 1. Berechnen Sie die Mengen für den Mastermix. Berechnen Sie hierfür 1 Ansatz für jede PCRReaktion, sowie 1 Reaktion zusätzlich (= Negativkontrolle). 2. Stellen Sie ein Eppendorf-Gefäß (1,5 ml) auf Eis und pipettieren Sie erst Wasser, dann die anderen Zutaten in der angegebenen Reihenfolge zusammen. Benutzen Sie für jedes Reagens eine frische Pipettenspitze. Mischen Sie jedes Reagens (außer der Polymerase, s. u.!) vor dem Pipettieren durch Ein- und Auspipettieren bei gleichzeitigem Rühren! 3. Pipettieren Sie als letztes die Polymerase. Rühren Sie die Polymerase-Lösung nur sehr vorsichtig, aber sorgfältig mit der Pipettenspitze um. Die Lösung ist sehr zähflüssig; sorgen Sie dafür, dass nichts an der Außenseite der Pipettenspitze hängenbleibt. Dieses Enzym ist extrem teuer (0,5 mL kosten ungefähr 1000 Euro). 11 4. Mischen Sie den Mastermix bis Sie keine Schlieren mehr sehen. Pipettieren Sie 20 μL in jeden Ansatz. 5. Pipettieren Sie zuletzt 5 µL DNA-Extrakt (entsprechende Menge H2O für die Negativkontrolle) in das jeweilige Gefäß und verschließen Sie es. 6. Zentrifugieren Sie die PCR-Ansätze kurz ab (wenige Umdrehungen) und stellen Sie sie sofort wieder auf Eis. 7. Schalten Sie den Thermocycler (Modell Arktik Thermal Cycler) ein und starten Sie das Programm rbcl_pwg: 8. >Run >Carmen Jung (über Pfeiltasten auswählen) >Open >rbcl_pwg (über Pfeiltasten auswählen) >Start >Lidtemp 099 °C >ok > TempMode >Sample >Vesseltype >Common >Sample Volume >25 µl. 9. Deckel öffnen, Proben einstellen und Deckel schließen >Accept >Yes. PCR-Programm rbcl_pwg: initiale Denaturierung: Denaturierung: Annealing: Elongation: finale Elongation: 95 °C 94 °C 55 °C 72 °C 72 °C 8 °C 4 min 30 sec 60 sec 35 x 60 sec 10 min unendlich Drei wichtige Hinweise: 1) Protokollieren Sie die PCR-Reaktion, indem Sie für jedes Gefäß das DNA-Extrakt, die PCR-Nummer und die Mengen aller Bestandteile des Ansatzes notieren (am besten in Tabellenform wie folgt). Extr.-Nr. 127 128 … PCR-Nr. 1345 1346 … DNA 5 μL H2O x μL Puffer x μL dNTPs x μL Primer1 x μL Primer2 x μL Enzym x μL Dieses Protokoll ist sehr wichtig. Zu Beginn einer neuen Versuchsreihe (neue Organismen, neue Genabschnitte usw.) müssen PCR-Reaktionen fast immer optimiert werden. Das ist nur möglich, wenn man die Versuchsbedingungen bei jedem Ansatz protokolliert. 2) DNA-Extrakten von verschiedenen Organismen lassen sie sich äußerlich nicht mehr unterscheiden. Deshalb muss man sich peinlich genau vor Verwechslungen der Reaktionsgefäße oder -nummern hüten. Deshalb alle Eppendorf-Gefäße sauber und permanent beschriften, und unbedingt auch das Datum jedes Ansatzes notieren! Nur so hat man eine Chance, Verwechslungen auch im Nachhinein noch auf die Spur zu kommen. 3) Arbeiten Sie in der „Post-PCR-Phase“ mit pingeliger Genauigkeit, sonst zerrütten Sie das Verhältnis zu Ihren Kollegen. PCR-Produkte enthalten DNA in millionenfach erhöhter Konzentration, die im Labor mit genomischer DNA um Polymerase, Primer und dNTPs konkurriert. Wenn Sie Pech haben, erhalten Sie dann schöne PCR-Produkte, die von Organismen stammen, von denen Sie gar nichts wissen wollen. Dann muss das gesamte Labor einschließlich Mobiliar, Geräten, Glasflaschen usw. gereinigt und kontaminiertes Verbrauchsmaterial weggeworfen werden. 12 Nachweis und densitometrische Quantifizierung von PCR-Produkten Die spannende Frage lautet nun: Hat die PCR funktioniert oder nicht? Um diese Frage zu beantworten führen Sie eine Agarose-Gel-Elektrophorese durch. Das Gel wird mit SYBRgreen gefärbt, das sich mit DNA zu einem fluoreszierenden Komplex verbindet. Bei Betrachten des Gels auf einem UV-bzw. Blaulicht-Transilluminator finden Sie so heraus, ob (1) überhaupt PCR-Produkt entstanden ist, (2) nur ein spezifisches oder mehrere Produkte unterschiedlicher Länge entstanden sind, und (3) wie lang das amplifizierte DNA-Stück ungefähr ist, d. h. ob das richtige Stück amplifiziert wurde. Bei Arbeiten mit Ethidiumbromid oder SYBRgreen müssen Sie immer blaue Nitrilhandschuhe tragen! Die Qualität der Sequenzen hängt empfindlich von der DNA-Menge im Sequenzier-Ansatz ab. Unterhalb eines gewissen Schwellenwertes versagt die Sequenzierreaktion, oberhalb werden die Sequenzen oft unsauber. Solche schlechten Sequenzchromatogramme muss man in nervenaufreibender Arbeit editieren. Selbst dann bleiben oft viele Positionen der Sequenz unsicher, was die Datenanalyse erschwert. Um sich diesen Ärger zu ersparen, ist es besser, die DNA zu quantifizieren. Wir haben die genomische DNA bereits fluorometrisch quantifiziert. Die folgende Methode liefert wesentlich genauere Werte, ist allerdings auch umständlicher. Normalerweise wird nur die Konzentration der aufgereinigten PCRProdukte bestimmt. In diesem Kurs geht es aber darum, die Qualität der genomischen DNA in verschieden aufbewahrten Pflanzenproben abzuschätzen. Die wichtigste Frage ist dabei, wie gut sich das Material für die PCR eignet. Da bei der Aufreinigung DNA in schwer zu kontrollierendem Umfang verloren geht, quantifizieren wir hier das ungereinigte PCR-Produkt. 1. Stellen Sie einprozentige Agarosegele entsprechend der Gesamtzahl an PCR-Produkten her. Wählen Sie die passenden Kammergrößen und Kammeinsätze aus. In jeder Reihe muss eine Tasche für einen Größenmarker frei bleiben. Für jedes Gel verwenden Sie bitte eine Glasflasche. 2. BEISPIEL für ein mittelgroßes Gel: Wiegen Sie 1 g Agarose in eine Glasflasche ein. Vorsicht, Agarose und Kokain unterscheiden sich in Aussehen und Preis nicht sonderlich (vermutlich aber in der Wirkung). Fügen Sie 100 mL 1 × TAE und einen Magnetfisch hinzu und setzen einen Deckel lose auf. 3. Erhitzen Sie die Mischung 1-2 min auf höchster Stufe in der Mikrowelle. 4. Rühren Sie einmal mit dem Magnetrührer durch. 5. Erhitzen Sie jetzt in kleinen Schritten (20-30 sec mit zwischenzeitlichem Rühren) weiter in der Mikrowelle bis die Lösung völlig klar ist. Agarose kocht in Sekundenschnelle über (eine Riesenschweinerei!). Beaufsichtigen Sie die Flasche gut und stoppen Sie die Mikrowelle, sobald die ersten Blasen erscheinen. Auch auf dem Magnetrührer beginnt heiße Agarose-Lösung sehr leicht zu schäumen. 6. Wenn die Agarose-Lösung klar ist, lassen Sie sie auf dem Magnetrührer bis etwa 60° C abkühlen (mit Handschuhen knapp unterhalb der Schmerzgrenze). 7. Setzen Sie für jedes Gel den Gelträger um 90° gedreht in die Elektrophoresekammer ein, so dass die Gummidichtungen an den Wänden liegen. Hängen Sie einen oder zwei Kämme ein. 8. Tragen Sie bei den folgenden Schritten Nitrilhandschuhe. 9. Pipettieren Sie 8 μL SYBRgreen pro 100 mL Agaroselösung hinzu. Vermeiden Sie jeden direkten Kontakt mit dem Farbstoff und entsorgen Sie die Pipettenspitze nur im dafür vorgesehenen Behälter. 10. Gießen Sie das Gemisch vorsichtig in die jeweilige Gelwanne ohne Blasen zu erzeugen. Etwaige Blasen mit einer Pipettenspitze an den Rand des Gels manövrieren. Warten Sie ca. 20 Minuten, bevor Sie mit den folgenden Arbeitsschritten beginnen. 13 11. Wenn das Gel milchig aussieht, ziehen Sie zunächst ganz vorsichtig die Kämme aus dem Gel. Anschließend heben Sie die Gelwanne aus dem Tank und drehen sie um 90°. Füllen Sie Puffer 1 × TAE in den Elektrophoresetank, bis das Gel bedeckt ist. Achten Sie auf die richtige Ausrichtung des Gels, die DNA wandert zur Anode (roter Anschluss). 12. Schneiden Sie einen Streifen Parafilm ab und legen Sie ihn direkt auf den Arbeitstisch. 13. Pipettieren Sie nebeneinander für jedes PCR-Produkt einen Tropfen mit 7 µL Wasser. 14. Pipettieren Sie in jeden Tropfen 2 µl 6xLadepuffer. 15. Fügen Sie zuletzt zu jedem Tropfen 3 µL PCR-Produkt hinzu. Mischen Sie die PCR-Produkte vorher gründlich und wechseln Sie die Pipettenspitzen! Notieren Sie auf einem Papier Namen und Reihenfolge der PCR-Produkte. 16. Pipettieren Sie jeden Tropfen in eine Tasche des Gels. 17. Pipettieren Sie in die letzte Tasche jeder Reihe 5 µL Größenmarker („Easy Ladder“). 18. Setzen Sie den Deckel auf, kontrollieren Sie noch einmal die richtige Ausrichtung des Gels und starten Sie die Elektrophorese bei 85 V. 19. Wenn der blaue Marker drei Viertel der verfügbaren Strecke zurückgelegt hat, kann das Gel auf dem Transilluminator betrachtet werden. 20. Ziehen Sie wieder blaue Nitrilhandschuhe an. Heben Sie die Gelwanne vorsichtig aus dem Elektrophoresetank und trocknen sie mit Papier ab. 21. Überführen Sie das Gel ohne zu tropfen auf den Transilluminator in der Geldokumentationskammer. Gießen Sie den gebrauchten Puffer vorsichtig in die entsprechende Sammelflasche. 22. Schließen Sie die Tür der Kammer und schalten Sie das Gerät und die UV-Beleuchtung („Transillumination“) ein. Arbeiten Sie nun zügig, da UV-Licht die DNA schädigt und dieerfolgreiche Sequenzierung des PCR-Produktes verhindern kann. 23. Starten Sie das Programm „Herolab EasyWin32“ (liegt auf dem Desktop), übernehmen Sie „wie letzte Benutzung“ und klicken in der Menüzeile >Datei >Kamera aktivieren an. 24. Zoomen und fokussieren Sie die Kamera, so dass das ganze Gel und alle Banden scharf zu sehen sind. 25. Machen Sie ein Bild, indem Sie mit der rechten Maustaste auf das Livebild klicken, speichern Sie es, drucken Sie ein Foto des Gels aus und kleben es zur Dokumentation unter ihr PCR-Protokoll. 26. Wenn Sie das Bild des Gels auf dem Bildschirm gespeichert haben, wählen Sie im Hauptmenü >Auswertung >Modul C Spot Einzelvolumenanalyse an. 27. Es erscheint ein Kreuz anstelle des Mauspfeils. Setzen Sie das Kreuz auf die linke obere Ecke der Referenzbande des Größenmarkers (500 bp), ziehen Sie den erscheinenden Rahmen nach rechts unten auf und fixieren ihn mit einem Klick der linken Maustaste. Der Rahmen sollte etwas größer als die Bande sein. 28. Markieren Sie entsprechend alle anderen Banden. Mit einem Klick auf die rechte Maustaste beenden Sie das Setzen der Rahmen. 29. Mit einem Doppelklick der linken Maustaste in einen beliebigen Rahmen öffnet sich die Tabelle mit den Messergebnissen. 30. Zum Quantifizieren der einzelnen DNA-Banden verwenden wir die 500 bp Bande des Größenmarkers mit bekannter Konzentration (50 ng DNA in 5 μL Marker). Markieren Sie in der Tabelle die Zeile der Referenzbande und öffnen mit der rechten Maustaste das Kontextmenü >Volumen zuweisen. Setzen Sie einen Haken bei >Verwenden und geben Sie unter >Neu 50 ein, >Bestätigen. 31. Die Spalte „Vol Kalib“ der Tabelle enthält nun die DNA-Konzentration je 2 μL PCR- Produkt. 14 32. Berechnen Sie die Konzentration für 1 µL und wieviel PCR-Produkt Sie in der Sequenz- reaktion einsetzen müssten, um auf 20 ng DNA zu kommen. 33. Schalten Sie das UV-Licht aus und entsorgen Sie das Gel im entsprechenden Behälter. Reinigen Sie die Oberfläche des Transilluminators mit destilliertem Wasser und Papiertüchern. 34. Entsorgen Sie Ihre blauen Nitrilhandschuhe im Sondermüll. Die amplifizierte DNA einer erfolgreichen Reaktion liegt als mehr oder weniger breite Bande auf dem Gel vor. Diese Banden sollten deutlich sichtbar und sauber begrenzt sein, und die Fragmente sollten in der Länge dem erwarteten Produkt entsprechen. Die Gelspur der Nullprobe sollte schwarz sein. Sollten Sie in der Nullprobe eine Bande finden, die einer der Banden in Ihren PCR-Ansätzen entspricht, sind ihre Reaktionen möglicherweise kontaminiert. Meist sind verunreinigte Reagenzien die Ursache solcher Kontaminationen. Diese Verunreinigungen entstehen z. B., wenn Pipettenspitzen nicht gewechselt wurden und genomische DNA übertragen wurde. In diesem Fall muss die PCR wiederholt werden. Im schlimmsten Fall hat man es mit Laborkontaminationen zu tun (s.o.). Reinigung der PCR-Produkte Die Aufreinigung der PCR-Produkte und die Sequenzierung führen wir im Kurs nicht durch. Rückstände der PCR-Reaktion, besonders Polymerase und nicht verbrauchte Primer müssen beseitigt werden, bevor die eigentliche Sequenzierreaktion angesetzt werden kann. Beim gründlichsten Verfahren trennt man das gesamte PCR-Produkt auf einem Agarosegel auf und schneidet die Banden aus. Durch die Elektrophorese wird das PCR-Produkt von anderen Bestandteilen der PCR-Reaktion abgetrennt. Die ausgeschnittene Bande mit der gewünschten DNA wird in einer enzymatischen Reaktion aus der Agarose gelöst. Die weiteren Reinigungsschritte funktionieren nach dem gleichen Prinzip wie DNAIsolierungskits. Die DNA (PCR-Produkt) wird an eine Membran gebunden, gereinigt und mit Wasser oder Puffer eluiert. DNA-Sequenzierungen werden heute routinemäßig von zentralen Einrichtungen an Universitäten oder von kommerziellen Laborzentren durchgeführt. Versuchen Sie trotzdem, Aufgabe 5 zu lösen. Sequenzierreaktion In der ersten PCR-Reaktion ging es darum, einen spezifischen Abschnitt der genomischen DNA exponentiell zu vermehren, um ihn später sequenzieren zu können. In der Sequenzierreaktion wird diese hoch konzentrierte Template-DNA nicht mehr exponentiell vermehrt. Man fügt deshalb nur einen Primer hinzu, so dass die Polymerase nur einen der beiden DNA-Stränge synthetisieren kann. Die Sequenzierreaktion enthält neben gewöhnlichen dNTPs einen kleinen Anteil Didesoxribonukleotide (ddNTPs). Gewöhnlich verwendet man fertige Mischungen (sog. Terminator-Kits mit z. T. blumigen Namen), die von den Herstellern automatischer Sequenzierer angeboten werden. An jedes der vier unterschiedlichen ddNTPs ist ein anderer Fluoreszenzfarbstoff gebunden. Wird ein solches dd-Nukleotid zufällig eingebaut, kann der DNA-Strang nicht weiter verlängert werden und die Reaktion bricht ab. Am Ende vieler Reaktionszyklen erhält man so ein Gemisch von fluoreszenzmarkierten DNA-Strängen aller unterschiedlichen Längen, bei denen das letzte Nukleotid sich durch seine spezifische Fluoreszenz verrät. Diese vielen hundert verschiedenen DNA-Moleküle werden im letzten Schritt auf einem PolyacrylamidGel elektrophoretisch nach ihrer Größe aufgetrennt. In automatischen Sequenzierern, die hierfür verwendet werden, befinden sich am Ende des Gels zwei Dioden-Laser, die die vorbeiwandernden DNAMoleküle zur Fluoreszenz anregen. Diese Fluoreszenz wird von einem Detektor aufgezeichnet, dessen Signale in Form eines Sequenz-Chromatogramms als Computerdatei gespeichert werden. 15 Abb. 2: Sequenzchromatogramm. Solche Chromatogramme sind nur über eine begrenzte Länge, bestenfalls ca. 1000 Basenpaare (bp), hinweg lesbar. Oft sind die sequenzierten Genabschnitte aber länger, so dass man schon aus diesem Grund beide Stränge sequenzieren muss. Um sicher zu gehen, kann man zusätzlich interne Sequenzierprimer einsetzen. 16 4. Data Analysis: General Considerations Using DNA sequence data we want to answer the question how closely or distantly related different species are to each other. Generating data is an important first step in the empirical sciences. However, analysing these data is equally important. It is only after data analysis that a scientist knows whether a particular hypothesis is supported or rejected or which of the many possible hypotheses fits the data best. Molecular systematists and evolutionary biologists use analytical methods that are widespread in the sciences. However, the application of these methods in systematics is often not easy to understand. The reason for this is not only the widespread dread of mathematics among biologists. The questions put forward by systematists are often more complicated to answer than those in physics or chemistry. Consider the following example to illustrate this: In 1919 the astrophysicist Arthur Stanley Eddington travelled to the island of Príncipe in the Gulf of Guinea. During a total solar eclipse he gathered data to test certain conclusions from the theory of relativity. Ultimately, he wanted to know whether these data supported Newtonian or Einsteinian physics. Eddington only had to deal with two theories and three possible outcomes. The data could support either the theory of relativity or Newtonian physics or it could contradict both theories. Which hypotheses does a systematist test with molecular data? The surprising answer is: the phylogenetic trees. But are these trees not simply “calculated” from the data? Using the three methods of tree reconstruction treated in this course the trees do not inevidently follow from the data. Instead, we have to calculate, how well the data fit a certain tree. Every possible phylogenetic tree is a hypothesis of phylogenetic relationships and we have to identify the tree that is best supported by our data. Unfortunately, the number of possible phylogenetic trees gets unmanageably large even for small datasets with few species (see Box 1 on pages 6-7). Unlike Eddington, we cannot just make a simple yes-no decision, we have to decide among a myriad of possible hypotheses. A further difficulty lies in the way in which scientists test hypotheses and theories. Usually, certain “expected” observations are deduced from a theory. The next step is then to test whether the experimental observations are in line with the “theoretical” ones. For example, according to the theory of relativity the light does not follow an exactly straight path but is bent by gravity fields. This is exactly what Eddington observed during the eclipse: the light of stars that were optically close to the sun was deflected by the sun’s gravitiy field. But which theoretical and testable observations can be deduced from a phylogenetic tree? Even entirely absurd trees, in which for example all flying animals are united in a single group, cannot be rejected entirely by our data, even if they fit the data much worse than trees in which birds, bats and insects form different groups. 5. Phylogenetic Trees and Methods The question most commonly asked by systematists is how organisms are related to each other. According to the theory of evolution all extant organisms have descended from a single common ancestor in the distant past and are therefore related to each other. Because the DNA is replicated and passed on from one generation to the next all homologous DNA sequences are similarly related to each other. As are protein sequences because their amino acid sequence follows directly from the DNA sequence of their coding gene. The mechanism of DNA replication is not error-free. Every now and then read errors occur and one nucleotide is replaced by another, a process called mutation. If these mutations are not lethal they can be passed on to the next generation. By the actions of genetic drift and selection, some mutations 17 will eventually occur in all individuals of a species and are then called substitutions. Mutation is a random process. The longer two sequences have existed independently from each other the more substitutions will have occurred in both of them. The property of DNA and protein sequences to accumulate genetic differences over time can be used to reconstruct the evolutionary history and relationships of organisms. As a rule of thumb one can say that two highly similar individuals or species are more closely related to each other than two very different ones. In systematics and taxonomy, relationships are commonly illustrated by genealogical or phylogenetic trees. As stated above, such a tree is a model or a hypothesis of the relationships among the studied species. Before we care about the methods with which we can turn DNA sequence data into evolutionary trees we will briefly discuss some properties of phylogenetic trees. Box 1: Phylogenetic trees A phylogenetic tree illustrates the evolutionary relationships among organisms. It is a graphical representation of the hypothesis: “The species in this data set could be related to each other in this particular way.” A phylogenetic tree consists of branches (or “edges” in the language of mathematicians) which are joined at nodes. In the following example two internal nodes are denoted as X and Y. The organisms (or better their DNA sequences) are found at the tip of branches and are also called terminal nodes (here called A-L). The particular way in which branches and nodes are tied together is called the topology of the tree. In this example K and L are more closely related to each other than H and L. The length of the branches is proportional to the number of substitutions between one nodes and the next. The scale bar below the tree shows the branch length for 0.1 substitutions per nucleotide position. A B C Y X D E F G H I J K L 0.1 Most phylogenetic trees are dichotomously branched or bifurcating. This means that exactly two branches arise from each ancestral node. If more than two branches arise from a single node this is called a polytomy. The branches of a phylogenetic tree can be rotated around their respective nodes without changing the information content of the tree, pretty much like the parts of a mobile. In the tree above, D is not more closely related to E-F than C. You are free to rotate them around node Y, for example, if this makes it easier for you to describe the relationships. Phylogenetic trees may have a root or not. A root confers additional information to a tree. It tells you from where the evolution of a group took its origin. In principle, you can fix a root to any branch of an unrooted tree. In practice you usually include species in your dataset that are closely related to the group of interest but not part of it. This “outgroup” will automatically provide a root for the “ingroup”. 18 The number of possible trees increases with the number of taxa included in the dataset and can become incredibly large. For n taxa it can be calculated as: Un = (2n-5)(2n-7) … (3)(1) for unrooted trees and Rn = (2n-3)(2n-5) … (3)(1) for rooted trees. For 20 species, R20 = 8 200 794 532 637 891 559 000, for 100 species there are more possible trees than atoms in the universe. A convenient way to summarize trees for computer programs is the Newick format. In this format relationships are expressed as brackets around taxon names. In Newick format the tree above looks like this: (((((I(J(K,L)))(G,H))((E,F)(C,D)))B)A). Additional information, such as branch lengths, can be included after a colon behind taxon names. For example: (((((I:0.25(J:0.11(K:0.03,L:0.06)))(... 0.1 A group is called monophyletic when all its members can be traced back to the same common ancestor and this ancestor has not produced any descendants that are not a member of this group (in this example the red group). All members of the yellow group also descend from a single ancestor, but in contrast to the red group, this ancestor has a descendant that is not a member of the yellow group. Such groups are called paraphlyetic. A typical example for a paraphyly are reptiles, because birds and mammals also evolved from reptilian ancestors. The members of the black group have different ancestors. This group is called polyphyletic. Of course, if we go further back in time we will also find a common ancestor for all members of the black group (in this case the root of the tree). The distinction between paraand polyphyly is therefore somewhat blurred. Which criteria shall we use to find the phylogenetic tree that best represents the evolution of our group? In order to find the „best“ among all possible trees we first need a quantity that tells us how „good“ a tree is in relation to others. We call this an optimization criterion that in one way or the other tells us how well our data fits to a certain evolutionary hypothesis. In this course we will learn about three such criteria: (1) the principle of maximum parsimony (MP), (2) the concept of maximum likelihood (ML) and (3) the concept of posterior probability that forms the core of Bayesian statistics. (1) The principle of maximum parsimony (MP) is the simplest of these criteria. Its philosophical foundation dates back to the medieval friar William of Ockham. Ockham argued that the theory that explains a phenomenon with the smallest number of prior assumptions should be preferred over all other explanations. In the case of phlyogenies this means that the tree that requires the smallest number of substitutions to explain the genetic differences between the taxa is the pre- 19 ferred evolutionary hypothesis. MP is not so much a statistical method but rather founded on philosophical principles. In order to determine the minimal number of substitutions for a particular tree one has to insert all possible combinations of nucleotides at the internal nodes, then count how many substitutions this would require and do this for all nucleotide positions of the dataset and for all possible trees. Fortunately, Walter Fitch (1971) invented a method that greatly facilitates this process, particularly for computers. Starting from the terminal nodes one moves to the next internal node and forms the intersecting set of the nucleotides at the superior nodes. If this set is empty (because the two superior nodes have different nucleotides or sets of nucleotides like in the two nodes “1” below) one forms the set union instead. Every time this happens one counts a substitution event. All one has to do now is to add these numbers over all nodes and nucleotide positions of the dataset, do this for all trees and choose the one with the smallest grand total of substitutions. set union intersecting set MP does not use all nucleotide positions of an alignment. Positions that are identical in all sequences are generally uninformative under MP. If a single sequence deviates from all others at a certain position, the most parsimonious solution is always that the substitution occurred on the terminal branch to this sequence. Such a position does also not allow any conclusions on phylogenetic relationships of a species. Only positions in which more than one taxon shows a certain substitution are parsimony-informative. MP may have serious drawbacks because substitutions are merely counted without modelling the substitution process. Datasets with many homoplasies (see Box 2) can be difficult or impossible to analyse under MP. Box 2: Homology and Homoplasy Two or more organisms may share a certain trait or character state because they inherited it from a common ancestor. A good example are the wings of birds. In this case the trait is called homologous. Identical character states can also evolve independently as, for example, the wings of birds and bats. Such a trait is called a homoplasy. In the case of complex organs such as wings it is fairly easy to distinguish homologous from homoplasious character states. In the case of nucleotides this is often impossible at first view. In most cases we can assume that identical nucleotides at a certain alignment position are homologous. There are, however, two events that can lead to homoplasies at the molecular level: parallel substitutions (also called parallelisms) and reversals, in which a nucleotide mutates back to a previous character state. With only four possible character states, both events are relatively likely to be observed in big datasets, particularly at rapidly evolving gene loci. 20 position sequence A sequence B sequence C sequence D 123456789 AAACCTTGG AATCGTTGG AATCGTTCG AAACCTTCG The example above illustrates homoplasy at the molecular level. If we assume the tree ((A,D)(B,C)) either A and B or C and D have independently acquired their identical nucleotide at position 8. If the true topology is ((A,B)(C,D)) the same holds for the nucleotides at positions 3 and 5. Many authors make a logical link between the principle of parsimony and the terms homoplasy and homology. On the background of common descent, identical character states of two or more species should be regarded as homologous as long as nothing argues against this assumption. The assumption that identical traits are the result of two or more independent evolutionary events is an ad hoc-hypothesis to explain the data. The smaller the number of such ad hoc-hypotheses, the more parsimonious is the tree. Unfortunately, the story is not as simple as it sounds (see Felsenstein 2004: 136-146). Homoplasious positions in a dataset lead to conflicts in the choice of the optimal tree because they favour different tree topologies. A high number of homoplasies in a dataset make it difficult to decide between competing phylogenetic hypotheses. As a measure of the reliability of an MP tree homoplasy indices are often calculated. The consistency index CI is calculated by forming the ratio ci = mi / si for every alignment position, where mi is the minimal number of substitutions in all possible trees and si the observed number of substitutions for the tree under consideration. CI is then calculated as the mean of the values for all positions: CI = Σimi / Σisi This index is not identical for all tree topologies and its lower limit is not 0. Its reverse is called the homoplasy index (HI = 1 – CI) and is a measure of the frequency of positions with parallel or reverse mutations.The retention index RI = (Σigi - Σisi) / (Σigi - Σimi), where gi is the maximal number of substitutions in all possible trees, and the rescaled consistency index RC = CI × RI are identical for all topologies and range from 0 to 1. Homoplasies may positively mislead phylogenetic analyses. A notorious example is the so-called “long branch attraction” under MP. Imagine a tree with the following topology and branch lengths: A B C D The taxa A and C are most closely related to each other, as are B and D. Simulation studies have shown that in most cases MP would recover a topology in which A and B are most closely related. The reason 21 is the high number of mutations on the long branches that lead to A and B. Some of these substitutions will be reverse and parallel mutations creating the impression that A and B share a lot of homologies. Instead of resolving the problem adding more data will make it ever more certain to end up with the wrong topology in such a case. This is one of the reasons why model based methods such as ML and Bayesian methods have more or less replaced MP analyses. Maximum likelihood and Bayesian posterior probability, are statistical methods that try to answer two different questions about data and hypotheses. A proponent of likelihood would ask: “What does the experimental evidence (or the data) tell me about two competing hypotheses?” A Bayesian statistician asks: “How does the available evidence change my confidence in a certain hypothesis?” (2) The law of likelihood states: “The data D favour hypothesis H1 over hypothesis H2 if and only if the probability to observe the data is higher under the assumption of hypothesis H1 than under H2. A mathematical shorthand for the last part of this statement could look like this: P(D| H1) > P(D| H2). The probability of the data under the assumption of a certain hypothesis or model P(D| H) is called the likelihood of the model or hypothesis. This likelihood tells us nothing about the probability of the hypothesis as Sober (2008) illustrates by a nice example. Imagine somebody sitting at home and hearing a noise from above. A superstitious person might believe that some gremlins are bowling in his attic. The likelihood of this hypothesis is very high because the probability is very high that in this case you would hear some kind of noise from above, although this is of course complete nonsense. This is the reason why single likelihood values are of no use when trying to evaulate different hypotheses. What is very useful is the so-called likelihood ratio of two hypotheses defined as: P(D| H1) / P(D| H2). This value is larger than 1 if hypothesis H1 has a higher likelihood than H2 and smaller than 1 in the opposite case. By calculating this ratio one can find out, which of two hypotheses is more strongly supported by the data and how strong this support is. As we already know, systematists have to decide between a very high number of competing hypotheses. Theoretically, this is no problem. The topology with the highest likelihood is supported best and is to be preferred. The computation of likelihood values for phylogenetic trees is only problematic because it involves many more arithmetic operations than an MP analysis. To explain how the likelihood of a phylogenetic hypothesis is calculated, look at the following figure. 2 1 v1 3 v3 v2 5 6 v6 v5 0 22 4 v4 It consists of four taxa 1, 2, 3 and 4 and two internal nodes 5 and 6. The taxa and nodes are connected by branches of different length l1, l2 … l6. The branch lengths correspond to the product of substitution rate and the time between the origin of the branch and its split-up, or simply: The longer a branch the higher the number of substitutions on it. In order to compute the likelihood for a certain position k of the alignment we put random nucleotides on nodes 0, 5 and 6. The nucleotides at the terminal nodes 1-4 are given by our dataset. Remember that the likelihood is the probability of observing these nucleotides (the data) given the tree (the hypothesis). We start by putting a random nucleotide (say a G) at node 0 and call it x0. We must now calculate Px0, the probability of observing this nucleotide there. At this point, don’t worry about how this is done, we are currently only assembling the formula for calculating the likelihood. Next, we need the probabilities that this nucleotide mutates to x5 along branch l5 (Px0->x5 · l5) and to x6 along branch l6 (Px0->x6 · l6). The product tells us that these probabilities depend on the branch lengths. The longer a branch the more opportunities to mutate. If we extend to the final branches leading to the terminal nodes, we get the following long but not very complicated formula: lk = Px0 · (Px0->x5 · l5) · (Px5->x1 · l1) · … · (Px6->x4 · l4) Because we do not know the nucleotides at nodes 0, 5 and 6, we must calculate this quantity for all possible combinations of nucleotides at these nodes and then add up these values. Lk = lk1 + lk2 + ..., where the subscripts 1, 2 etc. denote the different combinations of nucleotides. This value Lk (mind the capital L!) is called the likelihood function for position k of the alignment. Finally, this calculation has to be done for every single position of the alignment. If we assume that all positions evolve independently the likelihood for the total dataset is simply the product of the values for the single positions L = Π Lk. Unfortunately, probabilities always take values smaller than 1 and the product of many factors smaller than 1 quickly approaches zero. Therefore, statisticians and computer programs prefer the logarithm of the likelihood, because the logarithm of a product is the sum of the logarithms of its factors, which turns the above product into lnL = Σ lnLk. If we have the topology and branch lengths of a tree and a dataset we only need one more thing: a formula which allows us to calculate Px0 and all the probabilities with which nucleotides mutate along the branches. This formula is called a substitution model (see Box 5). (3) Bayesian posterior probability Bayesian statistics also tries to evaluate hypotheses with the help of data, but asks an entirely different question. Bayesian statisticians are not essentially interested in comparing different hypotheses, they want to know whether the data increase or decrease our confidence in a certain hypothesis. Likelihood enables us to compare different hypotheses but allows no statements about a single hypothesis or model. With Bayesian methods we can directly infer the credibility of a single hypothesis. Bayesian statistics relies on Bayes’ rule which is easily derived from the definition of conditional probabilities. Let us look at a simple example to explain conditional probability. A pack of 52 cards consists of four types of cards: 13 spades, 13 clubs (both black), 13 diamonds and 13 hearts (both red). What is the probability of randomly drawing a card of hearts (H) under the condition that it is a red card (R)? Obviously ½, because half of the red cards are hearts. A more general way of calculating this probability (which is also applicable to more complicated 23 questions that are not so easy to answer) uses a slightly different line of argumentation. The probability P(R & H) that the card is red and hearts at the same time is ¼, the probability that is is red, P(R), our precondition, is ½. Dividing P(R & H) by the probability that our condition is true P(R) also gives ½. Hence the probability of an event (H) under the condition that event (R) has happened equals the probability that both events happen divided by the probability of the condition: P(H | R) = P(H & R) / P(R) This formula is more general because it allows us to exchange the two events: P(R | H) = P(R & H) / P(H) Because P(H & R) is the same as P(R & H) – the probability that both events happen – we can now express P(H & R) in two different ways: P(H & R) = P(H | R) · P(R) = P(R | H) · P(H). If we only look at the last two parts of this equation and solve for P(H | R), we get P(H | R) = P(R | H) · P(H) / P(R). This is Bayes rule found by the reverend Thomas Bayes sometime in the 1740s, published posthumously by Richard Price in 1763 and more or less unnoticed until Pierre-Simon Laplace found it independently and published it once more in 1774. This theorem tells us how the probability of a hypothesis P(H) is changed when we gather new data (R). P(H) is the probability of the hypothesis before we have seen the new data, the so-called prior probability. P(H | R) is the probability of the same hypothesis after we have evaluated these data, the posterior probability. In contrast to the likelihood this is a true probability and behaves like one: P(H) + P(not H) = 1. The likelihood of the hypothesis given the new data P(R | H) is also part of the formula. Besides these rather harmless participants there is an unpleasant stranger P(R), which is the total probability of the data under all possible hypotheses. This total probability P(R), and to a lesser degree the prior probability P(H), can cause tremendous difficulties. The probability of most hypotheses is impossible to guess without having seen any data. And the total probability of the data under all hypotheses can only be calculated for a relatively small number of possible hypotheses. There is no doubt that Bayes’ rule is mathematically sound but, because of these difficulties, a lot of statisticians are not convinced of its application. For phylogenetic reconstructions with zillions of trees the total probability of the data is impossible to compute. One can, however, come to an approximate solution by studying a random sample of trees. Such a procedure would resemble the calculation of the average income of citizens by evaluating a representative sample of them. Considering that there are so many different trees, the vast majority will be entirely absurd, some will be rather unlikely and only a few will represent the evolution of the group in a relatively reasonable way. A representative sample of trees would sample trees not randomly but strictly in proportion to their probability or likelihood. One method to sample trees in this way was published by Ziheng Yang and Bruce Rannala in 1997. It circumvents the problem of the total probability of the data in an elegant way. When the posterior probabilities of two trees are compared this total probability simply cancels out as is evident from the following formula, in which H1 and H2 represent different trees (hypotheses) and D the data: P(H1 | D) = P(D | H1) · P(H1) / P(D) = P(D | H1) · P(H1) P(H2 | D) P(D | H2) · P(H2) / P(D) P(D | H2) · P(H2) 24 The only minor problem now is the prior probability of the trees. Here a simplification can help us. If we have not seen any data at all, all trees will actually have the same prior probability, which is called a non-informative prior. This assumption is certainly not true, but as long as we have not seen any data, we have no reason to believe that any tree is more probable than the others. Under this somewhat idealized assumption the whole procedure reduces to a calculation of likelihood ratios. Trees are then sampled in the following way. A computer program calculates the likelihood of a randomly picked tree. Next, the program changes a parameter of the phylogeny (a branch length, the position of a species on the tree etc.) and calculates the likelihood for this new tree. If this is higher the program “accepts” or stores the tree as a new starting point for calculations. If it is lower the tree is only accepted with a probability that is inversely proportional to the likelihood difference between the trees. If the difference is high, i. e. if the second tree is much worse than the first, the probability of accepting it is very small, if the difference between the two is low the program jumps to the new tree with a higher probability. If a tree is not accepted, the program remains on its current tree. This sequence (change of a parameter, likelihood calculation, comparison of likelihoods, jump to a new tree) is repeated several million times. This way, the program jumps like a monkey from tree to tree preferrably in an upward direction but every now and then also taking a step downhill. The method is called a Markov Chain Monte Carlo process (often abbreviated MCMC) and ensures that trees are collected in proportion to their likelihood. The Markov Chain gathers a representative sample of trees. One of the great advantages of this method is that the probability of every parameter of the phylogeny (a certain branch of the topology, parameter of the substitution model etc.) is equal to its frequency in the sample and can be extracted with small effort. Branch lengths and many parameters of the substitution model are not just point estimates but can be given with standard deviations. A few andjustments have to made to ensure that the sampling process is efficient and really representative. Because the chain starts with a random tree the first trees are probably very unlikely. In order not to bias the sample towards trees with low likelihood, the first 10.000 to 100.000 steps of the Markov chain are discarded. This is called the “Burn-in” phase. It is also important to ensure that the trees in the sample were gathered independently from each other. Because only one parameter is changed at each step of the Markov chain this is only given when we sample not every tree but, for example every 100th tree. Finally, If the likelihood “landscape” is very rugged with several “peaks” of high likelihood divided by deep “valleys” of low likelihood a single Markov chain might get stuck on one of the peaks. Therefore, one usually runs several chains in parallel starting from different initial trees. This increases the chances to sample the whole likelihood landscape evenly. 6. How to Calculate Phylogenetic Trees After all these theoretical explanations and considerations, let us finally do it. The number of different computer programs for phylogenetic analyses is so large that we will restrict this course to a few basic ones that are commonly used or cited in the literature. Programs such as PAUP, PHYLIP (URL: http://evolution.genetics.washington.edu/phylip.html) and MEGA (URL: http://www.megasoftware.net/) can perform various analyses. But you will find loads of software for specialized purposes, for example Bayesian analyses (MrBayes, URL: http://mrbayes.sourceforge.net/) or ML computations (PHYML, URL: https://github.com/stephaneguindon/phyml-downloads/releases; RAxMLGUI, URL: http://sourceforge.net/projects/raxmlgui/). The programs above with a URL are freely available. 25 Most phylogenetic programs will run on Mac OS and Windows. The most commonly use input formats for these programs (FASTA, PHYLIP and NEXUS format) will be explained during the course. Before we start our first phylogenetic analysis, we have to consider one more aspect, the difficult task of finding the optimal tree out of a number of possible trees that exceeds the number of atoms in the universe. It is obvious that even the fastest computers cannot solve this problem in reasonable time. Instead of picking a limited number of trees randomly and comparing them, there are smarter search options that use simple strategies to reach a preliminary result and try to optimize this solution. Such heuristic search strategies are explained in the following box. Box 4: Heuristic search A heuristic search for an optimal tree is a two-step procedure. In the first step an initial tree is constructed using a simple and fast method. A commonly used algorithm is stepwise addition. Three taxa are chosen randomly from the data set and combined into an unrooted tree (how many unrooted trees are there for 3 taxa?). Step by step more taxa are added to this tree by joining them to one of the branches of the existing tree, so that the resulting tree is the most parsimonious one or the one with the highest likelihood. Why is this tree not the overall best one? Simply because the order, in which taxa were added, has an impact on the final result. And there are just about as many different orders in which you can arrange the taxa as there are different phylogenetic trees for them (for 50 taxa there are 3 x 1064 ways). It wouldn’t help to try them all. The second step starts when all taxa are added to the initial tree. Now the program starts to exchange branches on the tree and calculates whether this improves the tree. There are different strategies for this so-called branch swapping. Nearest Neighbour Interchange (NNI) removes an internal branch of the tree and exchanges the four groups into all three possible arrangments before the branch is tied again. This is done for all internal branches of the tree. Subtree Pruning and Regrafting (SPR) cuts off a branch (like a gardener pruning a tree) and joins it to all possible branches of the remaining tree. Again, this is done for all branches of the tree. Tree Bisection and Reconnection (TBR) cuts the tree into two and reconnects the two subtrees at all possible positions. This is repeated for all possible cutting points. The number of investigated trees is TBR > SPR > NNI. In order to further increase the number of trees the search is usually repeated several times, each time adding the taxa in a different order to the initial tree. Although only a small fraction of trees is compared this search algorithm works very well in most cases. Maximum Parsimony and the NEXUS File Format We will calculate the first phylogenetic trees using the Windows version of PAUP and maximum parsimony, the default option in PAUP. This program can either be operated through a command line or specific command blocks that are attached to the data. PAUP is a versatile program with lots of commands. We will therefore not learn more than a few basic functions of this program. 1. Open PAUP. A dialogue box will open automatically. Load the provided alignment in NEXUS format by marking it and pressing “Open/Execute”. If the file does not contain typos and other mistakes (e. g. blank spaces, full stops or hyphens in taxon names) this should prompt no error messages. 26 2. In the window above the command line you will see some information on your data set. >Window>[name of the file] opens a second window with the actual data file. You can toggle between these windows. 3. We will first try out a simple heuristic search for an optimal tree using the command line: hsearch addseq=asis; <Enter>. 4. hsearch initiates the heuristic search, addseq=asis tells the program to add the taxa to the initial tree (stepwise addition, see Box 4) in the order in which they appear in the dataset. 5. The output tells you how many characters the dataset contains, how many of them are parsimony-informative, and which branch swapping algorithm was used. The default in PAUP is TBR. Under “score” in the table you can see how many evolutionary steps (substitutions) the best tree requires. 6. In order to view the tree, type: describetrees 1/ brlens=yes; <Enter> Now the program displays tree no. 1 and prints a table with branch lengths (brlens). In addition different homoplasy indices are displayed. 7. In order to increase the number of studied trees and the probability of finding the optimal tree, we will now increase the number of cycles in the heuristic search. This only makes sense if we tell the program to randomly add taxa during the first step of each analysis. hsearch addseq=random swap=NNI nreps=100; <Enter> The heuristic search will now be repeated 100 times adding the sequences randomly. This time branches will be swapped using nearest neighbor interchange. 8. A very convenient option of PAUP is that it remembers recently used commands. in order to display the first tree click on the arrow to the right of the command line. You will see a list of the previously used commands. Choose the correct one and press <Enter>. 9. PAUP produces unrooted trees. Let us now root the tree with an outgroup. First we have to define the outgroup. Assuming you use the provided dataset the command for this is: outgroup Podocarpus Ginkgo Metasequoia Cycas; <Enter> If you use your own dataset you will probably know which taxon or taxa to choose for an outgroup. 10.To root the tree with this outgroup, type: describetrees 1/ root=outgroup outroot=monophyl; <Enter> Now the outgroup should appear as a monophyletic group outside the ingroup taxa. 11. In order to save the tree or trees with branch lengths, type: savetrees file=[file name] brlens=yes; <Enter> You have now saved the first tree as a result of the analysis. As you may have noticed displaying trees is not the strongest side of PAUP. In order to edit the tree for publications or before printing it we will now open it in FigTree. 27 1. Load the tree file by >File>Open. 2. If the branches are labelled (for example by statistical support values), you will be asked to provide a name for these labels. This will not be the case now. 3. The tree appears in the window. The menu on the left side allows you to alter the appearance of tree. Play a bit around with the layout. 4. Above the window you see a graphical menu, with which you can change the position and order of clades. To re-root the tree with a different group mark the branch and choose >Reroot. 5. You can also rotate clades around nodes (remember that this does not alter the topology of the tree). Choose a branch and >Rotate. >Highlight marks a whole clade, >Colour changes the colour of branches. Play around with the different options. Also change from “Node” to “Clade” or “Taxa” above “Selection Mode” to find out about these options. 6. If you want to display statistical support values for branches (see below) mark the box “Node Labels” or “Branch Labels” on the left and open the menu by clicking on the arrow. 7. Choose the appropriate label from the drop-down menu “Display”. At this point we have not yet investigated the statistical support for branches. But remember this function for later use. 8. Once the tree suits your taste you can export it into a graphical format by >File>Export Graphic... and choosing the appropriate format and the folder in which you want to store the file. Choosing a Substitution Model for Likelilhood Analyses Likelihood analyses are not only more difficult to understand than parsimony. They are also a bit more clumsy to perform, because, in order to calculate the likelihood of a tree, we need to choose a substitution model. This is a set of formulas that allows us to calculate how probable it is that a certein nucleotide is substituted by another along a branch of defined length. Before we go into the practical details, Box 5 will explain substitution models in general. Box 5: Substitution models Two sequences that derive from a common ancestor will gradually accumulate substitutions. Because substitutions are rare events that happen randomly we would expect that the number of observed differences between two sequences grows slowly and more or less linearly with time (black line in the illustration below). genetic divergence time 28 However, the observed differences between sequences do not accurately reflect the substitutions that have happened. A small number of substitutions will affect nucleotides that have already mutated before. Instead of two substitution events we will only observe one or, if the mutation is back to the original nucleotide at this position, none at all. Usually, the number of observed differences between two sequences (red line) will be smaller than the actual number of substitution events. This effect is more pronounced in rapidly evolving genes at slowly evolving gene loci and is particularly problematic for MP where differences among sequences are simply counted, which misrepresents the actual number of substitutions. Substitution models try to model the whole process including parallel and reverse mutations. The simplest substitution model (the Jukes-Cantor model) assumes that all possible substitutions occur with equal probability. This model is relatively easy to describe. Substitutions are random events. We do not know exactly when the next substitution will hit a gene, but we know that eventually a nucleotide will be substituted by another. It is like waiting for a notoriously unpunctual bus. As you wait the probability that the bus has not yet arrived grows smaller and smaller until the damn vehicle finally arrives. If we call the waiting time T and the time that has passed since you arrived at the bus stop t we can say: The probability that the waiting time for the bus (or the next substitution) is longer than the time that has passed is getting smaller and smaller as time passes. This probability is exponentially distributed. P(T > t) = e-t. If this is the probability that nothing has happened (no bus, no substitution), then the probability that “something” has happened is 1-e-t. There are four different nucleotides (A, C, G, T). If all substitutions are equally probable the probability of any single substitution is ¼ (1-e-t). This is correct if, for example a G is replaced by an A, C or T. But how big is probability that after a certain time the G is still a G? There are two possibilities for this case: either nothing has happened: e-t. Or the G was replaced by the same nucleotide: ¼ (1-e-t). The total probability for this event is therefore e-t + ¼ (1-e-t). If we generalize this for all nucleotides we can describe the probability that, at time t, nucleotide x is replaced by nucleotide y, in the following way: Px→y(t) = { ¼ (1-e-t) if x ≠ y e-t + ¼ (1-e-t) if x = y The curly bracket here means “either – or”. This is the mathematical abbreviation of the Jukes-Cantor model. Actually, most gene loci do not follow this simple substitution model. For example, due to their different size and number of hydrogen bonds, purines (G and A) and pyrimidines (C and T) are more often substituted by another nucleotide of the same group (transition) than by one of the other group (transversion). Moreover, nucleotides do usually occur at unequal frequencies which favours certain substitutions over others. All these facts are considered in more complicated models. The following four matrices show commonly used substitution models. The uppor row shows the nucleotides before, the left column the nucleotides after substitution. The greek letters in the center denote the different substitution rates, the symbols gA, gT etc. frequencies of the respective nucleotide. A T C G - α α α A T α - α C α α G α α A A A A T C G - β β α A α T β - α β T βgA - αgC βgG T αgA - δgC εgG - α C β α - β C βgA αgT - βgG C βgA δgT - ζgG α - G α β β - G αgA βgT βgC - G γgA εgT ζgC - Jukes-Cantor Kimura T C G - βgT βgC αgG HKY A T C G - αgT βgC γgG GTR These models have different number of unknown parameters. The Jukes-Cantor model only has one parameter α, the general time reversible model 9 because it assigns a separate rate to all six kinds of substitutions. The more parameters a model has, the more difficult it becomes to compute the likelihood. 29 Position specific substitution rates All these models assume that the substitution rates are identical for all positions of a DNA sequence. This is often not true. Changes in the amino acid sequence of the active centers of enzymes are much rarer than substitutions in side chains of a protein because they usually affect the function of the protein. First, second and third positions of a codon also have different substitution rates because many mutations at third positions do not affect the amino acid sequence. Hence, the substitution rates may vary along a gene sequence. This variation can be considered in models by assigning different overall rates to different positions. In principle one could assign a separate rate to each position. Unfortunately, the number of parameters in a model cannot be higher than the number of datapoints in the dataset. You can easily understand this if you think of a regression line through a scatter-plot of data points. The regression line is a model that simplifies the information contained in the data and is defined by two parameters, the gradient and the intersection with one of the axes. If you only have one data point there is no unique regression line representing the data. We must therefore use another method to include position specific rates into a substitution model. Instead of assigning a separate rate to each position we define a set of “substitution rate classes”, each with its own rate. The default is often 4 but any number of classes can be defined. How much these rates differ from each other is defined by a certain probability distribution, the gamma distribution. Look at the following graph. On the x-axis we have the substitution rates, low rates on the left close to the y-axis, higher rates on the right. The area below the curve is then divided into 4 (or whatever number you prefer) parts so that each part covers the same area below the curve. The four different rates are then determined as the median of each part, in this case 0.32, 0.68, 1.11 and 1.88. The nice thing about this procedure is that the gamma distribution is defined by a single parameter α and can take very different shapes. In the figure above α = 2. If α ≥ 1 all substitution rates are relatively close to 1. For α < 1 they differ more strongly. The rates for α = 0.5 are, for example, 0.03, 0.28, 0.92 and 2.77. Once the different rates are defined, the likelihood for each position (given a tree with branch lengths) is calculated using the mean likelihood value for all four rates. Such a model can represent the actual substitution process much better, because the higher rates contribute a higher proportion to the overall likelihood for rapidly evolving and the low rates for slowly evolving positions. Now that we know what a substitution model is and how it reflects the substitution process, how can we select a model among the many different ones? Mathematical models do not perfectly represent but only approximate natural processes. An ideal model would allow us to extract the essential information from complex datasets. As outlined above a scatter plot in a coordinate system may be reduced to a regression line a + bx that tells us how pressure depends upon temperature in an autoclave. In such a model, a large dataset is reduced to two parameters a and b. 30 The law of likelihood offers one way to choose among models. If you calculate the likelihood of a tree based on the dataset and different substitution models you could pick the preferred model by comparing the likelihood values for different models, just as you choose the best tree by comparing likelihood values for different trees. P (D | M1) / P (D | M2) (see page 10) The problem is that one can always add further parameters to a model that increase the agreement between the model and the data. The more complex a model gets the more closely it reflects the actual dataset and the higher its likelihood. By just comparing the likelihood ratio we would always end up with the most complex model. Because all datasets contain noise in addition to information complex models also tend to capture a higher proportion of that noise. This increases the chances that a model misrepresents the natural relationships and will not fit very well to future datasets. In the example above, instead of a simple line we could describe a curve that touches every point of the scatter plot. This model would fit the data perfectly but probably does not correctly describe the relationship between temperature and pressure in the autoclave. Therefore we need to avoid overly complicated models. Introducing additional parameters into a very simple model will greatly increase its likelihood. But an additional parameter in an already very complex model will make only a small difference. In 1974, Hirotugu Akaike introduced a method that makes use of this observation by offsetting the number of parameters with the likelihood value of a model. The formula looks like this: AIC = 2k – 2log L(M) where k is the number of parameters and log L(M) the logarithm of the likelihood of a certain model. The more parameters you include the larger the values for 2k and log L. As long as the increase in likelihood exceeds the increase in 2k the AIC value gets smaller. Once the model has reached a certain degree of complexity, the increase in likelihood will no longer compensate the increase in 2k, because additional parameters add only little to the likelihood value. From then on the AIC value will increase again. Put simply, the model with the lowest AIC is the one that explains the data best with a minimum number of parameters. There is a number of freely available programs which calculate the AIC or similar quantities (e. g. jModeltest, MtGUI). As we have already pointed out, different genes and even parts of a gene often follow different substitution models. In many protein coding genes a quick look at the alignment already tells you that first, second and third codon positions have different rates. For multigene datasets, this would suggest to use a separate model for each individual gene and each codon position. For datasets with many genes this would lead lead to an unnecessarily complex model, if some genes within the dataset follow the same model. Therefore one should always test whether different genes follow the same model and whether they are not better combined in a single data partition with a single model. The following figure from the user manual of PartitionFinder (Lanfear et al. 2012) shows different options to partition a dataset of three coding genes. 31 In order to find the optimal model, one must consider all possible combinations of models and partitioning schemes. The program PartitionFinder does this automatically. It needs a dataset in PHYLIP format and a configuration file, in which the most complicated partitioning scheme to be considered is defined by the user. In the case of the provided dataset this would include three codon positions for every gene in the dataset (atp1, matK, rbcL). The configuration file is called “partition_finder.cfg” and can be found in the folder “PartitionFinder”. 1. Copy the alignment in PHYLIP format into the folder “analyses/PartitionFinder”. 2. Open the file “partition_finder.cfg” in a text editor. 3. Add the name of your PHYLIP file behind alignment = . 4. Below ## DATA BLOCK ## you already find the nine possible partitions. You need to define which parts of your alignment belong to which partition. The notation 1-xxxx\3 means that every 3rd position between 1 and xxxx belongs to one partition. Why does the next partition begin with a 2 (2-xxxx\3)? 5. Save the file without changing its name after you have entered all the information. 6. PartitionFinder is written in the programming language Python. In order to start it open a command prompt and type the command python and a space. 7. Open the file “Programm/PartitionFinder”, drag and drop the file “PartitionFinder.py” into the command prompt and type another space. 8. Now drag and drop the folder “Analyses/PartitionFinder” into the command prompt and press <Enter>. If there are no errors in the cfg-file the program will start computing the optimal combination of models and partitions. The progress is output on the screen. 9. After the analysis has been finished you will find a new folder “analysis” and a file “log.txt” in the folder “Analyses/PartitionFinder” 10. The logfile documents the screen output. 11. The folder “analysis” contains the results of the analysis. The most important file for us is “best_schemes.txt” which describes the optimal partitioning scheme. Open the file in Word (output is difficult to read in some text editors). 12. Following some rather irrelevant lines with numbers you find the description of the scheme. The partitions are numbered. For each you can see the optimal model, the name of the gene loci (e. g. matK_1) and their position in the alignment. Study the table carefully! Is there anything peculiar? 13. Below the table a short explanation tells you which of the opriginal partitions (the ones that you suggested) follow the same model. 14. At the end of the file there is a text block which we can later use in the ML analysis. Here, the partitioning scheme is described so that we can just copy-paste it into the program RAxML. 32 Writing a PAUP Block In our first attempt to calculate a phylogenetic tree we used the command line to tell PAUP what to do. A much more convenient way is to write all necessary commands into a text block that is appended to the data block in PAUP. The NEXUS format allows us to include a number of different blocks, for example an ASSUMPTIONS block, a TREES block with previously computed trees or a PAUP block with details on the analysis. This is very convenient when we wish to run different analyses successively or just want to save waiting time. Instead of waiting on results for hours and type in the next commands, we can feed PAUP with all necessary commands on Friday and pick up the results of the completed analysis on Monday. We start with a simple example, a ML analysis for which we have to define a substitution model. 1. The data block in a NEXUS file ends with end; Insert two returns. The command block always starts with the following line: begin PAUP; 2. Because the default optimality criterion in PAUP is MP we have to tell the program that we now want to perform an ML analysis. We also want to tell the program to finish the analysis independently so that we do not have to press the “close” button. Write the following into a new line below the previous command: set autoclose=yes criterion=likelihood; 3. Next, we have to define the substitution model. PAUP does not accept multiple models for a single dataset. The following parameters of the substitution model are taken from an analysis of the dataset with the program MrModeltest 2.2. lset base=(0.3445 0.1718 0.1697) nst=6 rmat=(0.8683 2.6043 0.2708 0.9152 2.5678) rates=gamma shape=0.9213 pinvar=0.3175; The command lset initiates the definition of model parameters for an ML analysis. base=(x y z) defines the frequency of nucleotides A, C and G. Why is the frequency of T not defined? The x in nst=x refers to the number of different substitution types in the model (1, 2 or 6 depending on the choice of model, see Box 5). rmat=(x y z …) means the relative rate of the different substitution types (A→C, A→G, etc.). The rate for G→T is set to 1. rates=[equal, gamma]: Substitution rates are either identical for every position of the alignment or variable among sites. shape=x defines the shape of the Gamma function that is used to model site specific substitution rates (see Box 5). pinvar=x most positions of an alignment are usually identical across species and hence appear to be invariable. The value for x defines this proportion of invariable sites in the dataset. 4. To start the heuristic search for the best tree we include the following line after the line starting with lset: hsearch addseq=random swap=SPR nreps=10; 33 5. The next line contains the command to show the best tree with branch length that you already know. As an exercise write it yourself. 6. Finally, let PAUP save the tree with branch lengths under a file name of your own choice. 7. Save and close the NEXUS file. 8. Load the file once more. If everything is ok, the analysis will start without any further prompts. Phylogenetic Uncertainty, the Non-Parametric Bootstrap The results of different phylogenetic analyses on the same dataset (e. g. MP and ML analyses) often differ from each other. MP analyses commonly come up with several equally parsimonious trees. These differences may concern important details, for example the question whether a taxon is monophyletic or not. How much confidence can we have in our results or in other words, how far is our hypothesis (the tree) from the truth (the evolution as it actually happened)? This is an important question in all scientific studies. Physicists will usually repeat measurements and report mean values and standard deviations to estimate the deviation of a single measurement from the true value. But evolution cannot be repeated experimentally. Sequencing many genes and comparing the resulting trees might help, but we often find that the information contained in a single gene is too small to allow meaningful interpretations. That is the reason why most studies nowadays rely on data from multiple genes. When the deviation of a measurement from the true value cannot be inferred, because the measurement cannot be repeated, one can instead treat the measurement as the “truth” and repeat the analysis by re-sampling the dataset. The deviation of the re-sampled results from the original result then serves as an approximation of the true deviation. Referring to the American proverbial expression “to pull yourself up by your own boots” (a basically impossible task) statisticians call this method “bootstrapping”. Which at least proves that they have a sense of humour. In phylogenetic analysis you re-sample the dataset several 100 to 1000 times with replacement. By randomly picking single alignment positions one produces datasets that are identical in size to the original one. Random sampling with replacement ensures that the new datasets all differ from each other, because some positions will be picked more than once while others will be left out randomly. In a second step one calculates phylogenetic trees for each subsample and a single consensus tree, which summarizes the information contained in the bootstrapped trees. Every branch that was found in more than 50 % of the bootstrapped trees can be unambiguously displayed in the consensus tree. Finally, one determines, in how many of the bootstrapped trees the branches were found. This “bootstrap support” (in percent) or similar measures of support are a necessary part of all decent phylogenetic trees. Simulation studies (in which you actually know the “truth”) show that bootstrap values tend to underestimate the confidence that we can have in a branch. Although 95 % confidence is a frequently used threshold value in other fields of statistics, it has become customary to discuss branches with support values above 70 %. Less strongly supported branches are considered to be too unreliable to merit further discussion. Bootstrap values are a measure of precision rather than accuracy. High precision means that a measurement renders very similar results every time it is repeated. High accuracy, on the other hand, means that these measurements are close to the true value. When I step on my bath- 34 room scales ten times and it always measures the same weight it is very precise. Nevertheless it can be poorly calibrated and under- or overestimate my true weight considerably. Precision alone is not really sufficient to convey confidence in a measurement. Bootstrap values are very similar in this respect. A node with high bootstrap support can be reliably reconstructed from the data, but a bootstrap support of 85 % does not mean that there is an 85 % chance for this branch to occur in the true evolutionary tree. We will carry out the first bootstrap analysis using MP. 1. Open PAUP and load the NEXUS file. 2. Should the program start to compute trees, stop it an change into the window that displays the NEXUS file. You can deactivate the PAUP block at the end, by putting it in square brackets. Save the file and execute it once more. 3. Type bootstrap nreps=500 search=heuristic; into the command line and press <Enter>. This will start a bootstrap analysis with 500 pseudoreplications (the smallest number considered to give reliable results) and a heuristic search for the best tree in every replicate. 4. After the analysis is finished the display shows the consensus tree and a table with all groups that were found in at least one bootstrap tree. The taxa are abbreviated as numbers above the columns. The respective groupings are marked by stars. The rightmost column shows the frequency (in percent) with which the group occurred among the bootstrap trees. Groups appearing in less than 5 % of the trees are not listed. Compare the list with the tree. 5. In order to save the consensus tree type the following commands: savetrees from=1 to=1 file=[File name] savebootp=nodelabels; <Enter> PAUP saves the bootstrap support as a label of each node. The command from=1 to=1 is necessary to circumvent a bug in PAUP. Without this command the file is empty. The tree file can be opened and edited in FigTree. Bootstrap Analyses Using Maximum Likelihood As you have probably seen by now, ML analyses and bootstrap analyses under MP can take very long with PAUP. Running a ML bootstrap analysis with a dataset like ours would soon exhaust your patience. Different methods have been developed to speed up the computation of ML trees. One of these programs is RAxML (Randomized Axelerated Maximum Likelihood). In contrast to PAUP it is also able to consider separate models for different dataset partitions. However, it only offers to select the same model for all data partitions. The difference to a single model for the whole dataset is that the parameter values of the model are estimated independently for all data partitions. RAxML calculates bootstrap support values and an ML tree separately and displays the bootstrap values on the ML tree, not the consensus tree. RAxML needs a dataset in PHYLIP format. 35 1. Copy the alignment (PHYLIP format) into a new folder, in which you also want to save the results of your RAxML analysis. 2. Start the program RAxMLGUI by double-clicking on the file “raxmlGUI.py”. 3. Load the alignment by >Alignment file. 4. A window with the heading “Open alignment file” will open. Navigate to the folder with your alignment, highlight it and >Open. 5. If the alignment contains identical sequences or positions that only consist of gaps a warning will appear. You may delete these “double” sequences and gap positions by choosing >yes. In our case we want to use all sequences and press >no. RAxMLGUI recognizes whether the dataset consists of DNA or protein sequence data. 6. Your PHYLIP file will appear in the disply window. 7. >Options>Set/Edit Partitions... opens a new window “Set partitions”. Here you can describe the different partitions of your data set. 8. RAxMLGUI can process many kinds of data. You can specify your data by pressing >DNA in the upper left corner. “BIN” means binary data (0,1), “MULTI” multistate characters (e. g. blue, red, green), and “AA” amino acid sequences. Multistate characters may have up to 32 possible character states (0–9, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V). Do not change any of the settings. 9. You can either define the limits of the data partitions “manually” or copy-paste the definition from PartitionFinder. Choose Options>Edit partitions. 10. Copy the last few lines from the file “best_schemes.txt” into the display of RAxMLGUI and correct line breaks if necessary. 11. >Set will save the partitions. 12. [If you want to enter partitions manually follow these instructions: The start of a partition cannot be changed. It is either 1 or follows from the end of the previous partition. Enter the last position of the first partition into the upper right field. 13. >no codon means that all positions with these limits follow the same model. Alternatively, for protein coding sequences, you may choose >codon specific if you want to use separate models for 1st, 2nd and 3rd positions or >3rd codon if 1st and 2nd positions together shall follow a model that differs from that for 3rd positions. 14. Press >Add. The larger window will now show a line describing the first partition as you defined it. Repeat steps 12-14 until all positions of the dataset are assigned to a gene partition. 15. A message appears “All characters have been assigned to partitions. Set partitions for the analysis?” If you do not want to change anything press >OK, otherwise >Abbrechen.] 16. After all partitions have been defined a new option will appear below “Run RAxML” called “per-partition brL”. Selecting this option will prompt the program to calculate branch lengths for each partition. The topologies of these trees will be identical, but the branch lengths will be optimized for each position separately. 17. In the lower menu bar you see the command >ML + rapid bootstrap. Here you can choose between a simple ML analysis or two different types of bootstrap analyses, “thorough” or “rapid”. Do not change the settings. 36 18. The number of bootstrap replicates may also be adjusted. If it is shortly before lunchtime choose 1000, otherwise leave it at 100. 19. Finally you may choose among different substitution models. For the sake of speed, RAxMLGUI offers a limited choice of alternatives. The GTRGAMMA model includes site specific rates, GTRGAMMAI a proportion of invariable sites in addition. The models GTRCAT and GTRCATI further speed up calculations but are relatively imprecise. This should particularly affect branch lengths. 20. >Run RAxML starts the analyses and opens a command window in which the progress of the analysis is reported. 21. RAxML produces a number of output files. The most important file “RAxML_bipartitions..tre” shows the ML tree with branch lengths and bootstrap support values. If you have chosen >per-partition brL before you will also get a separate tree with branch lengths for each partition. These partitions have the endings “PARTITION.0”, “PARTITION.1”, etc. 22. To finish your analysis, edit the tree in FigTree and save it as graphic. Bayesian Analysis For a number of reasons, the Bayesian method has become very popular among systematists. - - Although millions of trees are calculated, the method is very fast. This is mainly due to the fact that you can skip bootstrap analyses. The Bayesian analysis computes the probability of the tree and all parameters of the analysis simultaneously. From the tree sample you can also compute confidence limits and standard deviations for all parameters of the model (i.e. the tree topology with branch lengths and the substitution model). All methods that only search for a single optimal tree may fail under certain conditions. We have already seen the example of long branch attraction under MP. Using an inappropriate substitution model may also cause serious problems in ML. Bayesian methods are not free from these error sources, but instead of focussing on a single tree topology, we may rather consider the sample of trees as a whole and take phylogenetic uncertainty into account when interpreting the trees. For example, the consensus tree may show a certain taxon to be polyphyletic, but how high are the chances and can we completely dismiss the idea that it is actually monophyletic? Bayesian tree samples offer good possibilities to test competing phylogenetic hypotheses against each other. The main point of criticism against Bayesian analyses is that we need to define prior expectations for model parameters. For example, a grossly wrong mean value for the prior distribution of branch lengths may lead to absurdly long consensus trees when compared with ML trees. The solution in this case is to choose the mean length of branches from an ML analysis as the mean of the prior distribution in the Bayesian analysis. In the following, we will perform a simple Bayesian analysis, which also allows us to apply our skills in writing command blocks. 37 1. Copy the NEXUS file into the folder in which MrBayes is stored. 2. Open the NEXUS file in a text editor or in PAUP and change to the file view window (after eventually stopping automatic calculations). Put the PAUP block from the previous analysis into square brackets. 3. If anything is written between “#NEXUS” and “begin data;” delete this text. The first lines of the file shall now look like this: #NEXUS begin data; 4. Make sure that the next line looks like this: format datatype=dna interleave missing=N gap=-; 5. Move to the end of the file and add another line: begin mrbayes; 6. Just like the PAUP block that we have written before, this MrBayes block defines the settings and parameters for the analysis. Insert a line below this command that tells MrBayes to close the program window after the analysis is finished, and add another line specifying the name of a log file. This ensures that the screen output is also saved as a file for future reference. This file will be replaced if you re-run the analysis. If you do not want this write append instead. set autoclose=yes; log start filename=[file name]_log replace; 7. MrBayes allows the use of different substitution models for different gene loci or codon positions. We will use the partitioning scheme and the models from PartitionFinder. For each data partition add a line that starts with charset. Add a name for the data partition and describe its position in the alignment as follows. charset A = 1-xxxx\3; [matK 1st position] charset B = yyyz-zzzz\3 2-zzzz\3; [rbcL and atp1 3rd positions] ... To avoid typing errors keep the names as simple as possible and avoid blank spaces, numbers and full stops. Different parts of the alignment may be combined into one partition. Use a blank space, not a comma, to separate them. Finish each line with a semicolon. Include more descriptive names of partitions in square brackets to aid your memory. 8. MrBayes may successively analyse several datasets, for example calculate trees first for every partition separately and then a combined analysis. Defining the partitions is therefore not sufficient. We also have to tell MrBayes which partitions to use as the dataset for the analysis. The command for this is partition followed by a name and the number and names of the partitions to be used. Several different datasets may be defined. partition threegenes=5:matK_1, rbcL_2_atp1_2 ...; 9. Now tell MrBayes which dataset to analyse: set partition=threegenes; 10. Like in PAUP, the definition of substitution models is initiated by the command lset. You have to define as many models as there are partitions in the dataset, and then tell MrBayes for which partition the respective model is used by applyto=(number). Next 38 you define the number of different substitution rates (1, 2 or 6 as for the ML analysis). The rates command allows to distinguish between models with equal rates for all positions (equal), gamma distributed site specific rates (gamma), a proportion of invariable sites (propinv) and both (invgamma). The command lines look this: lset applyto=(1) nst=[value] rates=[value]; lset applyto=(2) nst=[Value] rates=[value]; ... 11. Models also differ in the way, in which they treat base frequencies. Some simple models assume that all nucleotides are equally frequent in the dataset. As a default, MrBayes assumes unequal base frequencies and tries to optimize the relative frequencies using a Dirichlet prior distribution. If some of your models require equal base frequencies, you have to tell MrBayes use a different “state frequency prior”: Make sure that the prior applies to the correct data partition. prset applyto=(1) statefreqpr=fixed(equal); prset applyto=(4) statefreqpr=fixed(equal); ... At this stage you may change the prior distributions of all possible parameters, including the branch length prior (see page 26). 12. If different models contain the same type of parameters, e. g. two models that both use gamma-distributed site specific rates, MrBayes will try to find a single value for these parameters which is then applied to all partitions. If you want MrBayes to infer the values for these parameters independently you have unlink these parameters across partitions. The command is: unlink [Parameter]=(all); In our case we want to optimize the shape parameter alpha, the proportion of invariable positions, the nucleotide frequencies and the substitution rates independently. This is done by the following row of commands: unlink unlink unlink unlink shape=(all); pinvar=(all); statefreq=(all); revmat=(all); Alternatively, you may only decouple a parameter between some of the partitions. This command would look like: unlink shape=(number,number,...); 13. You have probably noticed that, in spite of these additional options, the definition of substitution models is more unspecific in MrBayes than in PAUP or Garli. We only have to describe how many and which different parameters there are (e. g. the number of different substitution rates, the shape parameter of the gamma distribution etc.). The basic model is specified but we do not specify the parameters values as we have done for PAUP or Garli. These values are optimized by the Markov chain while it is running. 14. Next you have to define settings for the Markov chain. The most important settings concern the number of generations to be run (ngen) and the frequency at which trees are sampled (samplefreq). You may also determine how often the progress of the analysis 39 shall be printed to the screen (printfreq) and whether you want trees with branch lengths or without (savebrlens). A special command concerns the question of how many parallel Markov chains will be run (nchains). The following command line tells MrBayes to run 1 million generations using four parallel Markov chains, sample and print every 100th generation to the screen and save trees with branch lengths. Many more options may be specified. mcmc ngen=1000000 printfreq=100 samplefreq=100 nchains=4 savebrlens=yes filename=[file name]_out; END; 15. Finally, you want MrBayes to not just sample 10.000 trees but also to summarize them in a consensus tree (sumt) and also summarize the parameters of the substitution models in a separate file (sump). The respective command lines look like this: sump filename=[file name]_out relburnin=yes burninfrac=0.5; sumt filename=[file name]_out relburnin=yes burninfrac=0.5 contype=allcompat; The commands relburnin=yes and burninfrac=0.5 mean that MrBayes discards the first 50 % of the sampled trees as burnin. contype=allcompat tells the program to include clades in the consensus tree even if they have less than 50 % support, as long as they are compatible with the more highly supported groups. 16. The end of the MrBayes block is defined by : END; 17. Save the dataset in your MrBayes folder. 18. Open this folder and double click on the MrBayes executable. A command prompt will open in which you have to type execute [file name] to start the analysis. The program will begin to report every single step of the analysis on the screen. If no error messages appear, every 100th generation of the Markov chain will then be shown with likelihood values of the sampled trees in brackets. In the right column the program tells you how long it will still need to finish the analysis. After it has finished the program will summarize the results on the screen and saving them in a lot of different files. 19. Among others we see a rather crude plot on the screen in which MrBayes illustrates the development of the likelihood values during the run. After a steep increase these values will reach a plateau. You have to discard all trees that were sampled before the plateau. We have anticipated the decision by telling MrBayes to discard the first half of the trees. Depending on the number of different models and parameters the screen shows a long row of values for: tree length TL, transition/transversion rates (kappa), substitution rates (r…), nucleotide frequencies (pi…), gamma shape parameters (alpha) and proportions of invariable sites (pinvar). These data are also stored under the extension *.pstat. 20. After this, we get a list of taxa and clades (called “bipartitions”), another list with posterior probabilities of clades (bipartitions), a list of branch lengths with standard deviations, a tree with posterior clade probabilities (called “credibility values”) and another tree with branch lengths. These data are stored in the log-file and a file with the extension *.parts. 21. A list of the different trees sampled by the Markov chain is stored under the extension *.trprobs. This file also contains the probabilities of single trees and the cumulative probabilities of the trees. This allows us to not only concentrate on the consensus tree of the 40 analysis, but also evaluate sets of trees, for example those trees that together have a posterior probability of 95 or 99 %. 22. The most important file is the consensus tree with branch lengths and posterior probabilities (*,con.tre), which we open and edit in FigTree as usual. 23. If you want to check the probability of a clade that is not part of the consensus tree, you can do so by opening the *.parts-file with in Word. 24. The sampled trees are stored under *.t, and the sampled model parameters under *.p, separately for each run. The Molecular Clock Systematists are often interested in more than just the phylogenetic relationships of their pet organisms. For example, one might also want to date evolutionary events. Temporal information might help to answer questions regarding biogeography and evolutionary mechanisms. Zuckerkandl and Pauling (1965) supposed that the genome accumulates substitutions at a constant rate. This idea of a “molecular clock” was also part of the “neutral” theory of evolution by Kimura (1968, 1983). According to this theory molecular evolution is a random process that depends mainly on time. If the hypothesis of a molecular clock is true we would expect that the genetic distance among taxa is proportional to the time that has passed since these taxa have descended from their most recent common ancestor. In the tree below the divergence time between taxa A and O is identical to that between taxa B and O. Hence, we would expect that the genetic distances dAO and dBO are also equal. The following table shows the data that induced Zuckerkandl and Pauling (1965) to postulate the molecular clock. It shows differences between amino acid sequences of α-hemoglobin in shark, carp, newt, chicken, echidna, kangaroo, dog and humans. The paleonotological evidence suggests that the shark is the outgroup for all other taxa. Accordingly, the genetic distances between shark and the other vertebrates should be equal. The same logic suggests that the genetic distances between human and the other taxa increases with the distance of the relationship. Indeed we see similar distances in each row and more or less decreasing distances in columns. 41 Strict molecular clocks It is possible to compute phlyogenetic trees in which the distance from the root to the tips is identical for all taxa. In such an ultrametric tree the branch lengths are proportional to time. If we know the age of a single node in such a tree, we can calibrate all other nodes and make inferences about the evolution of groups for which we might not have any fossil evidence. This sounds like a great idea, but one has to be cautious. The substitution models that we have discussed above do not assume that substitutions were constant over all times and taxonomic groups. In fact we know that this is not the case. Rodents, for example, have much higher substitution rates than primates, partly because of their shorter generation times. Without the clock assumption, the branch lengths of a tree are adjusted freely to optimize the likelihood of the tree. They represent the product of time and (variable) substitution rate and can have very unequal lengths. Forcing the program to assume equal rates for all taxa and branches changes the substitution model substantially. We can compare the likelihood of a given tree with and without an enforced clock. The clock tree has always a lower likelihood than a tree in which branch lengths can vary freely. If the difference is not too big a strict molecular clock reconstruction may be justified. Unfortunately, in most cases the difference in likelihood is so big that there is simply no justification to use it. Relaxed molecular clocks Instead of simply assuming equal rates it is possible to treat substitution rate and time as different parameters to be optimized during the computation of the tree. Such models are called “relaxed molecular clocks” to distinguish them from the strict clocks. A relaxed clock tree enables us to date nodes while at the same time allows substitution rates to vary across branches. The most complex models assign a different rate to every branch. One of the most frequently used models is the so-called “uncorrelated log-normal (ULN) relaxed molecular clock” implemented in the computer program BEAST. The way in which BEAST assigns substitution rates to branches resembles the way in which site specific substitution rates are defined. Rates are drawn randomly from a logarithmic normal distribution (see the figure below) and assigned to branches of the tree. This is done by a Markov chain wich also varies the topology and branch lengths (and the parameters of the substitution model). 42 Calibrating the molecular clock A clock tree represents the relative ages of nodes in comparison to each other. Such a phylogeny is of little use when you cannot assign absolute ages (mostly in millions of years) to some of the nodes. One may use very different sources of information from geology and paleontology to date phylogenies. A dated fossil may be used if it shows at least one synapomorphy that is typical for a clade of the tree. This is taken as evidence that the respective clade already existed at the time of this fossil, which is then taken as the minimal age for the respective clade. Geological events such as the origin of volcanic islands, the uplift of mountains or the split up of distributional ranges due to continental drift are also used to calibrate clock trees. Some of these calibrations have to be treated with caution. Taking the age of an island for an age of an endemic taxon may underestimate the real age, because the taxon may have evolved much earlier and colonized the island at a later stage. The vast majority of molecular clocks is computed with Bayesian methods. One of the reasons is the possibility to consider the age of certain nodes as prior information, which should increase the accuracy of the reconstruction. This is one of the few cases where we can replace a noninformative prior with an informative one by restricting the age of certain nodes with prior distributions of a certain shape. The illustration below shows a few possible distributions. A uniform distribution with an upper and a lower limit can be selected when a minimum and maximum age of a fossil taxon is known. Lognormal and exponential distributions are suitable for fossils with a minimum age. The normal distribution is appropriate for tectonic or biogeographical data. Nodes for which no fossil (or other) data is available receive a non-informative prior age, i. e. their age is not restricted in any direction. The following molceular clock analysis requires the use of a whole armada of programs: BEAST, BEAUti, TreeAnnotator (from the BEAST package), Tracer and FigTree. BEAUti is used to produce an input file for BEAST. 43 1. Open BEAUti by double-clicking on the executable. On top of the central window you see a row of tabs. We will go through these tabs from left („Partitions“) to right („MCMC“) to produce the input file. First load the NEXUS-file by pressing >+ on the lower left of the screen and navigating to the Nexus file used in the Bayesian analysis. 2. BEAUTi recognizes the character sets defined in the MrBayes block. For every data partition you should see a separate white row in the grey central window. If you only see a single line with the file name in the first column you may have put the MrBayes block into square brackets or chosen a wrong file. In this case, close the file by pressing >-, choose the correct file or remove square brackets and load the file once more. 3. For each partition the number of taxa, sites (nucleotide positions) and a few descriptors of the substitution model are listed. In order to use the substitution and partitioning model calculated with PartitionFinder, mark all Partition Names and choose >Unlink Subst. Models. 4. The next tab >Taxa allows us to define nodes for calibrating the molecular clock. Press >+ and a taxon set “untitled” appears in the left field. You can see all taxa in the data set in the central field “Excluded Taxa”. Mark those that you want to combine in a node (e. g. all gymnosperms) and transfer them to the field “Included Taxa” by pressing the green arrow in the center. 5. Double-clicking “untitled” allows you to change the name of the taxon set. 6. Choose >Mono? if you want the group to appear monophyletic in your tree. 7. In the last column you can insert a phylogenetic age for the group. This age may refer to the base of the branch on which the group is sitting (the “stem age”) or to the node from which the group diversifies (the “crown age”). >Stem? allows you to define the age as the stem age of the group. 8. You may define as many taxa sets as you like by clicking >+. 9. Because we do not have age for the tips of our tree or traits to map, we skip the tabs “Tips” and “Traits” and go straight to >Sites. 10. Here we can define subsitution models for each of our data partitions. Check the output from PartitionFinder and choose the model parameters that come closest to the models there. 11. Under >Clocks you may choose a >Model for the calculation of molecular clocks, e. g. a >Strict Clock or a > Lognormal Relaxed Clock (the recommended option). 12. BEAST is also used for population genetic purposes. Genes are usually exchanged between members of a population but not between different taxa. Under >Trees we can therefore choose among different mathematical models. Select >Tree Prior >Speciation: Yule Process. This option considers only speciation but not extinctions, such as >Speciation: Birth-Death-Process. 13. We skip the >States tab, because we do not want to reconstruct ancestral character states on this tree. 14. The prior distributions of all model parameters have to be selected under >Priors. Luckily, BEAUTi suggests default options for most of these. The default options are marked by a * symbol and BEAUTi alerts you to check whether they are appropriate. You may verify this by clicking on the priors and selecting >ok in the pop-up window. For this course let us just assume that BEAUTi suggested meaningful priors. However, we must check priors that BEAUTi regards as inappropriate and marks in red. 44 15. We skip >Operators and move on to >MCMC. 16. We define the length of the Markov chain under >Length of chain: (at least 10 million generations for a good analysis). >Log parameters every: defines how often the chain saves trees and parameters while it is running. This depends on the length of the chain. For 10 million generations one should save every 1000th tree. 17. With >File name stem we can select a name for the BEAST input and all output files. Do not change the name, because in the past this used to cause program crashes etc. 18. >Generate BEAST File finally generates the input file (*.xml). A window opens and warns you about model priors that you have not yet checked. Press >Continue and choose a folder in which the file is saved. 19. Running BEAST is now easy. Close BEAUTi and double click on “BEAST v1.8.2.exe”. 20. >Choose File... opens another window. Load the *.xml input file generated by BEAUTi. 21. Because we have not installed the BEAGLE library we must deactivate >Use BEAGLE library if available. Start the analysis by pressing >Run. The front window will close. Similar to MrBayes the progress of the analysis will be printed to another window. 22. As soon as the analysis is finished a number of statistics appear in this window. BEAST also suggests which parameters should be changed to improve the analysis. 23. The program produces two output files (*.trees and *.log). The first contains all sampled phylogenies, the second all parameters that were sampled during the analysis. 24. The results of the analysis have to be summarized into a consensus tree. Open “TreeAnnotator” by double-clicking on the icon. 25. Insert the number of sampled trees that shall be discarded as burnin behind >Burnin (as states):. Remember that this is not the number of generations but the number of sampled trees to be discarded. Because our Markov chain was rather short, we discard 50% of the trees. 26. Behind >Input Tree File: select >Choose File... to open the tree file produced by BEAST (*.trees) and select a name and location behind >Output File:. Both times a pop-up window appears. Select a file as input and insert a name for the output file. TreeAnnotator will produce a file with this name. 27. >Run starts the calculations in a different window. After the output file has been written, open it in FigTree. The tree should contain node ages with 95% confidence intervals and posterior probabilities for all nodes. Try to find out how to depict this information in FigTree. Reconstruction of Character Evolution Phylogenetic trees may also be used to reconstruct the evolution of certain characters within a systematic group. This may help to understand evolutionary mechanisms, for example tell us in which order characters evolved and whether certain the development of a “key innovation” was a prerequisite for the diversification of a systematic group. In order to study the temporal sequence of a character’s evolution one must reconstruct the character states at the internal nodes or along the branches of the tree. The simplest method makes use of the parsimony cri- 45 terion. The character states are assigned to the internal nodes so that the number of evolutionary changes along the tree is minimized. This method only considers the topology of the tree and not the branch lengths. But branch lengths are proportional to the time that has passed between its origin and split-up. Not only nucleotide substitutions are more frequent along long branches. The same holds true for phenotypical characters. When a tree contains branches of very unequal length it would thus be better to use an evolutionary model instead of just counting state changes along the tree. Such a model is another form of likelihood function. Like all likelihood functions it computes the probability of observing the data (character states for the taxa) given a hypothesis (the phylogenetic tree). But it is much simpler, because we only have to consider a single node with its two descendant nodes and not a whole tree with all its internal nodes. For a character with two states 0 and 1 it looks like this: 0 1 L1 L2 A L(C) = P(A=0)(P0→0(L1) · P0→1(L2)) + P(a=1)(P1→0(L1) · P1→1(L2)) where P(A=0) and P(A=1) denote the probability of observing these character states at node A. P0→0(L1) is the probability that character state 0 does not mutate along the branch of length L1, P0→1(L2) the probability that it changes to 1 along the branch of length L2. This probability must be calculated for both character states at node A and the probabilities summed up because they represent mutually exclusive events. An evolutionary model is necessary to compute the probabilities of mutations within a certain time (or branch length) unit. For a character with two states the model only needs two parameters: q01, the mutation rate from 0 to 1, and q10, the rate from 1 to 0. A matrix of this model looks like this: 0 1 0 1 - q 01 q 01 1 q 10 1 - q 10 We want to reconstruct the character evolution in our dataset under parsimony and likelihood using the program Mesquite. Mesquite is a very complex program for phylogenetic data analysis with a fundamentally different structure than the programs we have used so far. Most other programs take one or two input files and write a variable number of output files (substitution models, log files, trees etc.). Mesquite combines several input files into so-called projects and analyzes them together. In order to reconstruct character evolution it needs phylogenetic trees and a text file with the character states for all taxa of the dataset. 46 1. Mesquite needs a character matrix in PHYLIP format that contains the character states for all taxa included in the analysis. The PHYLIP format is much simpler than the NEXUS format. The first line consists of two numbers, first the number of taxa and second, separated by one or several blanks, the number of characters. The following lines each contain a taxon name and a blank followed by the character states. The file must be in plain text format and can be written in any text editor or exported from a database or spreadsheet. In the latter case you need to carefully read the file and correct errors before you feed it into Mesquite. 100 Taxon1 Taxon2 Taxon3 ... 6 1 0 2 1 1 3 1 3 2 1 1 1 2 3 2 2 0 1 ... 2. Copy the ML tree without bootstrap values calculated by RAxML into the same folder as the PHYLIP file. 3. Open Mesquite with a double click. Several windows will open of which only a single one will remain. Above a display with several lines of text (the log output) you can see four tabs “Log”, “Projects and Files”, “Search” and “Simplification”. 4. Open the character matrix by >File>Open and selecting the respective file. 5. In a separate window, Mesquite will prompt you to “Please choose an interpreter for this file”. Select “PHYLIP (categorical data)” and >ok. 6. A further window will open with the request to “Save imported file as NEXUS file”. Select a name and a folder, in which the data is to be stored and >Save. (No project without NEXUS file. A new display appears marked by a tab “Project of [thenamechosenbyyou]. 7. To connect a phylogenetic tree with this project load the ML tree without bootstrap values (RAxML_bestTree...tre) by >File>Link File... 8. Once more, you have to select an interpreter. Select “PHYLIP (tree)” and a name for the file. 9. Below the line “Project” you can now see the names of the two NEXUS files and below that “Taxa”, “Character Matrix” and “Imported Trees”. 10. >View Trees opens a new window “Tree Window 1 showing stored trees” which partly conceals the previous window. You can toggle between windows by clicking on the tabs. 11. >Drawing>Tree Form offers you a range of possibilities to disply the tree. “Balls and Sticks” is particularly well suited for character state reconstructions. In this view the nodes are represented by circles that later turn into pie charts in which reconstructed character states are shown. This is only meaningful when more than one possible state are reconstructed at a single node. For example, under likelihood, the relative contribution of all states to the total likelihood at this node is shown. 12. On the left side of the window there are several tools with which you can manipulate the tree. Position the cursor on the symbols and read the description appearing in the lower part of the window. 13. The layout of the tree can be changed with several different options under >Drawing. 14. >Show Node Numbers numbers the nodes. 47 15. >Orientation>Right tips the tree to the right. 16. Try also the commands >Line Style>Square and >Balls on Internal Nodes. 17. >Branches Proportional to Lengths displays the tree with branch lengths. 18. >Centered Branches shifts the internal branches into the center of the ramification. 19. Character state reconstruction is started by >Analysis>Trace Character History. 20. In a new window you will be asked for the data source for the character states. Select “Stored Characters” and >OK. 21. To begin, select “Parsimony Ancestral States”. The colour of some nodes will change. 22. “Trace Character” in the upper left allows you to select the character that is displayed along the tree. Below this you can see the number of character state changes for the most parsimonious solution. Place your cursor on a node or a branch to see which state was reconstructed as the most parsimonious. 23. Reconstruct state changes under ML by >Analysis>Trace Character History. 24. A new tree appears and a new window “Trace Character” on top of the previous one. You can drage it with your mouse to see both windows. 25. To toggle between the two reconstructions choose >Tree>Bring to Front and one of the two “Trace Character Histories”. 26. You can close reconstructions by >Tree>Close/Remove and selecting one of the two entries. 27. Save all reconstructions as graphics (>File>Save Tree as pdf...). Mesquite only supports pdf as graphical format. The tree without pie charts may be exported in different formats that can be read by e. g. FigTree (>File>Export...). 48 References Avery, O. T., MacLeod, C. M. & McCarty, M. 1944. Studies on the chemical nature of the substance inducing transformation of pneumococcal types: Induction of transformation by a desoxyribonucleic acid fraction isolated from Pneumococcus type III. Journal of Experimental Medicine 79: 137-158. Camin, J.H. & Sokal, R.R. 1965. A method for deducing branching sequences in phylogeny. Evolution 19: 311-326. Cavalli-Sforza, L.L. & Edwards, A.W.F. 1967. Phylogenetic analysis: Models and estimation procedures. Evolution 21: 550-570. Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. 1994. The History and Geography of Human Genes. Princeton: University Press. Chargaff, E., Lipshitz, R., Green, C. & Hodes, M. E. 1951. The composition of the desoxyribonucleic acid of salmon sperm. Journal of Biological Chemistry 192: 223-230. Darwin, C. 1859. On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. London: John Murray. Edwards, A.W.F. & Cavalli-Sforza, L.L. 1963. The reconstruction of evolution. Heredity 18: 553. Edwards, A.W.F. & Cavalli-Sforza, L.L. 1964. Reconstruction of evolutionary trees. N: Heywood, V.H. & McNeill, J. (eds.) Phenetic and Phylogenetic Classification. London: Systematics Association Publ. 6, pp. 67-76. Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17: 368-376. Felsenstein, J. 2004. Inferring phylogenies. Sunderland, MA, Sinauer. Fitch, W.M. 1971. Toward defining the course of evolution: Minimum change for a specified tree topology. Systematic Zoology 20: 406-416. Fitch, W.M. & Margoliash, E. 1967. Construction of phylogenetic trees. Science 155: 279-284. Fox, G. E., Stackebrandt, E., Hespell, R. B., Gibson, J., Maniloff, J., Dyer, T. A., Wolfe, R. S., Balch, W. E., Tanner, R. S., Magrum, L. J., Zablen, L. B., Blakemore, R., Gupta, R., Bonen, L., Lewis, B. J., Stahl, D. A., Luehrsen, K. R., Chen, K. N. & Woese, C. R. 1980. The phylogeny of prokaryotes. Science 209: 457-463. Haeckel, E. 1866. Generelle Morphologie der Organismen. Allgemeine Grundzüge der organischen FormenWissenschaft. Berlin: Georg Reimer. Jukes, T.H. & Cantor, C.R. 1969. Evolution of protein molecules. In: Munro, M.N. (ed.) Mammalian Protein Metabolism. Vol. 3. New York: Academic Press, pp. 21-132. Kimura, M. 1968. Evolutionary rate at the molecular level. Nature 217: 624-626. Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge, University Press. Knoop, V., Müller, K. 2006. Gene und Stammbäume. Ein Handbuch zur molekularen Phylogenetik. Heidelberg, Elsevier, Spektrum Akademischer Verlag. Lanfear, R., Calcott, B., Ho, S. Y. W. & Guindon. S. 2012. PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution 29: 1695– 1701. Mendel, J.G. 1866. Versuche über Pflanzen-Hybriden. Verhandlungen des naturforschenden Vereines in Brünn 4: 3–47. Michener, C.D. & Sokal, R.R. 1957. A quantitative approach to a problem in classification. Evolution 11: 130-162. Miescher, F. 1874. Die Spermatozoen einiger Wirbelthiere. Ein Beitrag zur Histochemie. Verhandlungen der Naturforschenden Gesellschaft Basel 6: 138-208. Mullis, K., Faloona, F., Scharf, S., Saiki, R., Horn, G. & Erlich, H. 1986. Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction. Cold Spring Harbour Symposia on Quantitative Biology 51: 263-273. Mullis, K. & Faloona, F. 1987. Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction. Methods in Enzymology 155: 335-350. Saitou, N. & Nei, M. 1987. The neighbour-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406-425. 49 Sanger, F., Nicklen, S. & Coulson, A.R. 1977. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the USA 74: 5463-5467. Schulte, K., Naduvilezhath, L., Nierbauer, K.U., Silvestro, D., Michalak, I., Metzler, D., Zizka, G. & Printzen, C. 2013. E-volution: Der E-Learning-Kurs zur phylogenetischen Analyse. Beta-Version. URL: http://lernbar.unifrankfurt.de/. Sober, E. 2008. Evidence and Evolution. The logic behind the science. Cambridge University Press. Sokal, R. R. & Rohlf F. J. 1995. Biometry. The principles and practice of statistics in biological research. 3rd ed. New York, Freeman & Co. Stanier, R. Y., Doudoroff, M. & Adelberg, E. A. 1964. The Microbial world (2nd ed.). Englewood Cliffs N.J.: Prentice-Hall. Watson, J. D. & Crick F. H. 1953. Molecular structure of nucleic acids: A structure for deoxyribose nucleic acid. Nature 171: 737–738. Wu, R. & Kaiser, A. D. 1968. Structure and base sequence in the cohesive ends of bacteriophage lambda DNA. Journal of Molecular Biology 35: 523-537. Yang, Z. & Rannala, B. 1997. Bayesian phylogenetic inference using DNA sequences: A Markov Chain Monte Carlo method. Molecular Biology and Evolution 14: 717-724. Zar, J. H. 1999. Biostatistical analysis. 4th ed. Upper Saddle River, NJ, Prentice Hall. Zhu, L. & Gigerenzer G. 2006. Children can solve Bayesian problems: the role of representation in mental computation. Cognition 98: 287-308. Zuckerkandl, E. & Pauling, L. 1962. Molecular disease, evolution, and genetic heterogeneity. In: Kasha, M. & Pullman, B. (eds.). Horizons in Biochemistry. New York: Academic Press, pp. 189–225. Zuckerkandl, E. & Pauling, L. 1965. Evolutionary divergence and convergence in proteins. In: Bryson, V. & Vogel, H.J. (eds). Evolving genes and proteins. New York: Academic Press, pp 97–166. 50 Appendix 1: Commands to Define Substitution Models in MrBayes GTR lset lset lset lset applyto=() applyto=() applyto=() applyto=() nst=6 nst=6 rates=propinv nst=6 rates=gamma nst=6 rates=invgamma # # # # GTR GTR + I GTR + gamma GTR + I + gamma SYM lset applyto=() nst=6 prset applyto=() statefreqpr=fixed(equal) # SYM lset applyto=() nst=6 rates=propinv prset applyto=() statefreqpr=fixed(equal) # SYM + I lset applyto=() nst=6 rates=gamma prset applyto=() statefreqpr=fixed(equal) # SYM + gamma lset applyto=() nst=6 rates=invgamma prset applyto=() statefreqpr=fixed(equal) # SYM + I + gamma HKY lset lset lset lset applyto=() applyto=() applyto=() applyto=() nst=2 nst=2 rates=propinv nst=2 rates=gamma nst=2 rates=invgamma # # # # HKY HKY + I HKY + gamma HKY + I + gamma K2P (=K80) lset applyto=() nst=2 prset applyto=() statefreqpr=fixed(equal) # K2P lset applyto=() nst=2 rates=propinv prset applyto=() statefreqpr=fixed(equal) # K2P + I lset applyto=() nst=2 rates=gamma prset applyto=() statefreqpr=fixed(equal) # K2P + gamma lset applyto=() nst=2 rates=invgamma prset applyto=() statefreqpr=fixed(equal) # K2P + I + gamma F81 lset lset lset lset applyto=() applyto=() applyto=() applyto=() nst=1 nst=1 rates=propinv nst=1 rates=gamma nst=1 rates=invgamma # # # # 51 F81 F81 + I F81 + gamma F81 + I + gamma Jukes Cantor lset applyto=() nst=1 prset applyto=() statefreqpr=fixed(equal) # JC lset applyto=() nst=1 rates=propinv prset applyto=() statefreqpr=fixed(equal) # JC + I lset applyto=() nst=1 rates=gamma prset applyto=() statefreqpr=fixed(equal) # JC + gamma lset applyto=() nst=1 rates=incgamma prset applyto=() statefreqpr=fixed(equal) # JC + I + gamma 52 Appendix 2: Example MrBayes-block BEGIN MRBAYES; log start filename=example_log append; [MrBayes writes a “checkpoint” file, in which parameters of the analysis are stored every 100.000 generations. If the analysis is unexpectedly stopped append allows you to restart it from the last “checkpoint”.] set autoclose=yes; delete [taxon name]; [deletes single taxa from the dataset] outgroup [taxon name]; [specifies an outgroup] charset charset charset charset charset charset charset A B C D E F G = = = = = = = 1-118 270-371; 119-269 980-1881 1883-2581\3; 372-979; 1882-2581\3; 1884-2581\3 2582-3661\3; 2583-3661\3; 2584-3661\3; partition PartFind = 7: A, B, C, D, E, F, G; set partition = PartFind; lset applyto=(all) nucmodel=4by4; [other options for the nucleotide model are doublets for stem regions of ribosomal RNA (16 character states) and codon for protein sequences (64 character states). These models are computationally intensive!] lset applyto=(1) nst=6 rates=gamma ngammacat=6; [GTR+G] lset applyto=(2) nst=6 rates=invgamma ngammacat=6; [SYM+I+G, equal nucleotide frequencies, see prior settings below] lset applyto=(3) nst=2 rates=invgamma ngammacat=6; [HKY+I+G] lset applyto=(4) nst=2 rates=gamma ngammacat=6; [K2P+G, equal nucleotide frequencies, see prior settings below] lset applyto=(5) nst=2 rates=gamma ngammacat=6; [K2P+G, equal nucleotide frequencies, see prior settings below] lset applyto=(6) nst=6 rates=propinv; [GTR+I] lset applyto=(7) nst=2 rates=propinv; [HKY+I] prset applyto=(all) ratepr=variable brlenspr=unconstrained:exp(18) topology=uniform; [brlenspr sets the prior distribution from which branch lengths are drawn. the options are unconstrained and clock (for a molecular clock analysis). exp(18) means an exponential distribution with mean 1/18. The default for the mean is (10), but in some cases this may lead to grossly wrong branch lengths. A way to find a more suitable mean is to calculate a ML tree and then take the mean branch length of that tree as the mean for the branch length prior. Remember that the number in brackets is the reverse of this mean!] prset applyto=(1) shapepr=exponential(5); [shapepr sets the prior distribution for the shape parameter alpha of the gamma distribution.] prset prset prset prset applyto=(2) applyto=(3) applyto=(4) applyto=(5) statefreqpr=fixed(equal) shapepr=exponential(5); shapepr=exponential(5); statefreqpr=fixed(equal) shapepr=exponential(5); statefreqpr=fixed(equal) shapepr=exponential(5); unlink statefreq=(1,3,6,7) revmat=(1,2,6) tratio=(3,4,5,7) shape=(1,2,3,4,5) pinvar=(2,3,6,7); [State frequencies are fixed for partitions 2, 4 and 5. They only have to be unlinked between the other partitions. A full rate matrix (revmat) is only present for the GTR and SYM models (partitions 1, 2, 6). The HKY and K2P model (3, 4, 5, 7) only have a transition/transversion ratio.] mcmc append=yes ngen=40000000 nruns=3 nchains=6 temp=0.15 printfreq=1000 samplefreq=200 mcmcdiagn=yes diagnfreq=100000 diagnstat=avgstddev minpartfreq=0.10 rel- 53 burnin=yes burninfrac=0.5 stoprule=yes stopval=0.01 starttree=random filename=example_out; [append=yes allows you to restart the analysis if it is unexpectedly stopped (power failure etc.). diagnfreq specifies how often the average standard deviation of branch frequencies (avgstddev) is compared between runs. Poorly supported branches will vary widely between trees. minpartfreq=0.10 specifies that only branches with a support of 0.1 or higher are compared. Relburnin sets the burnin proportion for the comparison. Stoprule=yes tells MrBayes to stop the analysis when the difference among runs drops below a specified value (stopval).] plot filename=example_out; sumt filename= example_out relburnin=yes burninfrac=0.5 nruns=3 contype=allcompat; sump filename= example_out relburnin=yes burninfrac=0.5 nruns=3; log stop; END; 54
© Copyright 2025 ExpyDoc