This document provides a step-by-step tutorial to converting a NEXUS file with a DATA or CHARACTERS block into a BEAST XML file. This is done using the program BEAUti (this stands for Bayesian Evolutionary Analysis Utility). This is a user-friendly program for setting the evolutionary model and options for the MCMC.
The exact instructions for running BEAUti differs depending on which computer you are using. Please see the README text file that was distributed with the version you downloaded.
Once running, BEAUti will look similar irrespective of which computer system it is running on. For this tutorial, the Mac OS X version will be shown but the Linux & Windows versions will have exactly the same layout and functionality.
To load a NEXUS format alignment, simply select the Import NEXUS... option from the File menu:
The examples folder contains a file called Dengue4.env.nex. This file contains an alignment of 17 sequences from the env gene of Dengue-4 virus from various parts of the world with dates from 1956-1994. It starts out like this (the lines have been truncated):
#NEXUS BEGIN DATA; DIMENSIONS NTAX=17 NCHAR=1485; FORMAT MISSING=? GAP=- DATATYPE=DNA; MATRIX D4Brazi82 ATGCGATGCGTAGGAGTAGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4ElSal83 ATGCGATGCGTAGGAGTAGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4ElSal94 ATGCGATGCGTAGGAGTAGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4Indon76 ATGCGATGCGTAGGAGTAGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4Indon77 ATGCGATGCGTAGGAGTAGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4Mexico84 ATGCGATGCGTAGGAGTAGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4NewCal81 ATGCGATGCGTAGGAGTAGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4Philip64 ATGCGATGCGTGGGAGTGGGGAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4Philip56 ATGCGATGCGTGGGAGTGGGGAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4Philip84 ATGCGATGCGTAGGAGTGGGGAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4PRico86 ATGCGATGCGTAGGAGTAGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4SLanka78 ATGCGATGCGTGGGAGTGGGGAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4Tahiti79 ATGCGATGCGTAGGAGTAGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4Tahiti85 ATGCGATGCGTAGGAGTAGGAAACAGAGACTTTGTGGAAGGAGTTTCAGGTGGAGCATGGGTC... D4Thai63 ATGCGATGCGTAGGAGTGGGGAACAGGGACTTTGTGGAAGGAGTCTCAGGTGGAGCATGGGTC... D4Thai78 ATGCGATGCGTAGGAGTGGGGAACAGAGACTTTGTAGAAGGAGTCTCAGGTGGAGCATGGGTC... D4Thai84 ATGCGATGCGTAGGAGTAGGGAACAGAGACTTTGTAGAAGGAGTCTCAGGTGGAGCATGGGTC... ; END;
Once loaded, the list of taxa and the actual alignment will be displayed in the main window:
By default all the taxa are assumed to have a date of zero. This will be correct if all the sequences were sampled at approximately the same point in time (at least on the time scale of their evolutionary history). In this case, the Dengue-4 virus have been sampled at various dates going back to the 1950s. The actual year of sampling is given in the name of each taxon and we could simply edit the value in the Date column of the table to reflect these. However it is more convenient to use the Guess Dates button at the top of the window. Clicking this will make the following dialog box appear:
This operation attempts to guess what the dates are from information contained within the taxon names. Generally this works by trying to find a numerical field within each name. If the taxon names contain more than one numerical field (such as the Dengue-4 sequences, above) then you can specify how to find the one that corresponds to the date of sampling. You can either specify the order that the date field comes (e.g., first, last or various positions in between) or specify a prefix (some characters that come immediately before the date field in each name). For the Dengue-4 sequences you select last from the drop-down menu for the order. If you left it at first then all taxa would be given a date of "4" taken from the second character of each name.
In this dialog box, you can also get BEAUti to add a fixed value to each guessed date. In this case the value "1900" has been added to turn the dates from 2 digit years to 4 digit. Any dates in the taxon names given as "00" would thus become "1900". If these dates were actually referring to the year 2000 then selecting the next option would convert them correctly, adding 2000 to any date less than 07.
When you press OK the dates will appear in the appropriate column of the main window. You can then check these and edit them manually as required. At the top of the window you can set the units that the dates are given in (years, months, days) and whether they are specified relative to a point in the past (as would be the case for years such as 1984) or backwards in time from the present.
At the bottom of the main window is the option to translate the data into amino acid sequences. This will be done using the genetic code specified in the associated drop down menu. If the loaded sequence are not nucleotides then this option will be disabled.
The next thing to do is to click on the Model tab at the top of the main window. This will reveal the evolutionary model settings for BEAST. Exactly which options appear depend on whether the data are nucleotides or amino acids (or nucleotides translated into amino acids). The settings that will appear after loading the Dengue-4 data set will be as follows:
Select between the HKY and the GTR model for nucleotide sequences or a variety of common amino acid substitution models.
Site Heterogeneity Model
Select between gamma or invariant sites or both.
Number of Gamma Categories
The number of categories for the descrete approximation to the gamma distribution.
Partition into codon positions
For when the data are nucleotide sequences that have been aligned to the reading frame. This allows a separate substitution rate for each codon position.
If selected then the following options can also be used:
Unlink substition model across codon positions
Allows a different substitution model for each codon position. This means a separate kappa (for HKY) or rate matrix (for GTR).
Unlink rate heterogeneity model across codon positions
Allows a different rate heterogeneity model for each codon position. This means a separate alpha (for gamma) or Pinv (for invariant sites).
Fix mean substitution rate
When dates have been associated with the sequences, the substitution rate can be estimated on an absolute time scale. If this is on then the substitution rate is fixed to the value in the text field below otherwise it is estimated from the data.
Mean Substitution Rate
If there are no dates associated with the sequences than this can be used to specify an _a_priori_ substition rate that will be used to scale all the times in the tree to units of time. This can be used to calibrate the tree with an external rate. Doing this will mean abandoning any errors associated with this rate.
Choose between a constant population size, an exponentially growing population or a logistically growing population.
This tutorial assumes you are familiar with the evolutionary models available, however there are a couple of points to note:
The next tab allows priors to be specified for each parameter in the model:
In this tutorial, the default options will be used.
The next stage is to look at the operators for the MCMC. To do this select the Operators tab at the top of the main window. For the Dengue-4 dataset, with the model set up as shown in the screen shot above, you will see the following table:
Each parameter in the model has one or more "operators". The operators specify how the parameter changes as the MCMC runs. This table lists the parameters, their operators and the tuning settings for these operators. In the first column are the parameter names. These will be called things like hky1.kappa which means the HKY model's kappa parameter (the transition-transversion bias). The next column has the type of operators that are acting on each parameter. For example, the scale operator scales the parameter up or down by a proportion, the random walk operator adds or subtracts an amount to the parameter and the uniform operator simply picks a new value uniformally within a range. Some parameters relate to the tree or to the heights of the nodes of the tree and these have special operators.
The next column, labelled Tuning, gives a tuning setting to the operator. Some operators don't have any tuning settings so have n/a under this column. Changing the tuning setting will set how large a move that operator will make which will affect how often that change is accepted by the MCMC which will affect the efficency of the analysis. At the top of the window is an option called Auto Optimize which, when selected, will automatically adjust the tuning setting as the MCMC runs to try to achieve maximum efficiency. At the end of the run a table of the operators, their performance and the final values of these tuning settings will be displayed. These can then be used to set the starting tuning settings in order to minimize the amount of time taken to reach optimum performance in subsequent runs.
The next column, labelled Weight, specifies how often each operator is applied relative to each other. Some parameters have very little interaction with the rest of the model and as a result tend to be estimated very efficiently - an example is the kappaparameter - these parameters can have their operators down-weighted so that they are not changed as often.
The next tab, MCMC, provides settings to control the MCMC:
Firstly we have the Length of chain. This is the number of steps the MCMC will make in the chain before finishing. How long this should be depends on the size of the dataset, the complexity of the model and the quality of answer required. The default value of 10,000,000 is entirely arbitrary and should be adjusted according to the size of your dataset. In order examine whether a particular chain length is adequate, the resulting log file can be analysed using Tracer. The aim of setting the chain length is to achieve a reasonable Effective Sample Size (ESS). Ways of doing this are discussed in [another tutorial|Increasing ESSs].
The next options specify how often the current parameter values should be displayed on the screen and recorded in the log file. The screen output is simply for monitoring the programs progress so can be set to any value (although if set too small, the sheer quantity of information being displayed on the screen may actually slow the program down). For the log file, the value should be set relative to the total length of the chain. Sampling too often will result in very large files with little extra benefit in terms of the precision of the analysis. Sample too infrequently and the log file will not contain much information about the distributions of the parameters. You probably want to aim to store about 10,000 samples so this should be set to the chain length / 10,000.
For this dataset let's initially setting the chain length to 100,000 as this will run reasonably quickly on most modern computers.
The final two options give the file names of the log files for the parameters and the trees. These will be set to a default based on the name of the imported NEXUS file but feel free to change these.
If you select the Save option from the File menu this will save a document in BEAUti's own format. Note that although this is an XML document it is not in the format that BEAST understands (it is much simpler and specific to the user interface of BEAUti). The idea is that the settings and data in BEAUti can be saved and loaded at a later time. We suggest you save BEAUti files with the extension '.beauti'.
We are now ready to create the BEAST XML file. Select Generate BEAST File... from the File menu (or click the button at the bottom of the window) and save the file with an appropriate name (we usually end the filename with '.xml'). We are now ready to run the file through BEAST.
Now run BEAST according to the instructions provided (look at the README text file included in the package you downloaded), providing your newly created XML file as input.
+-----------------------------------------------\ | BEAST v1.3 2002-2006 || | Bayesian Evolutionary Analysis Sampling Trees || | BEAST Library: Build 1.10 2003/09/03 || | Alexei Drummond and Andrew Rambaut || | University of Oxford || | http://evolve.zoo.ox.ac.uk/Beast/ || \-----------------------------------------------\| \-----------------------------------------------\ +-----------------------------------------------+ | Components created by: | | Alexei Drummond | | Roald Forsberg | | Oliver Pybus | | Andrew Rambaut | | Thanks to (for use of their code): | | Korbinian Strimmer | | Oliver Pybus | +-----------------------------------------------+ Read alignment: Sequences = 17 Sites = 1485 Datatype = nucleotide Site patterns 'patterns1' created from positions 1-1485 of alignment 'alignment' only using every 3 site pattern count = 34 Site patterns 'patterns2' created from positions 2-1485 of alignment 'alignment' only using every 3 site pattern count = 17 Site patterns 'patterns3' created from positions 3-1485 of alignment 'alignment' only using every 3 site pattern count = 119 Native nucleotide likelihood core found TreeLikelihood using native nucleotide likelihood core. TreeLikelihood using native nucleotide likelihood core. TreeLikelihood using native nucleotide likelihood core. Pre-burnin... state likelihood treeModel.rootHeight treeLikelihood1 treeLikelihood2 treeLikelihood3 coalescent 0 -3,919.1283 159.9998 -995.1892 -838.3758 -1,989.881 -95.6820 1000 -3,806.6080 125.4206 -971.7499 -828.9929 -1,917.5387 -88.3264 2000 -3,784.8661 96.7894 -966.8750 -829.8393 -1,904.5461 -83.6057 . . .
Operator analysis Operator Pr(accept) Performance suggestion hky.kappa 0.331 0.2007 good siteModel1.mu 0.442 0.2753 good siteModel2.mu 0.319 0.2594 good siteModel3.mu 0.658 0.2326 good up:compoundParameter down:nodeHeights 0.885 0.1548 good constant.popSize 0.293 0.2561 good treeModel.rootHeight 0.697 0.1897 slightly low Try setting scaleFactor to about 0.7461 nodeHeights 0.4338 high subtreeSlide 7.198 0.1951 slightly low Try decreasing size to about 6.001352499826204 Narrow Exchange 0.2655 good Wide Exchange 0.0543 good wilsonBalding 0.0139 slightly low 23.212 seconds