Specifying data for BEAST
From BEAST Software
Formatting data for the BEAST input format
Most analyses done in BEAST will have a molecular sequence alignment as input data. Additional information may be supplied, such as dates of isolation of sequences or groupings of taxa. In some cases a phylogenetic tree will be provided, perhaps reconstructed in a different program. This document describes how data is defined in the XML format used by BEAST. For information about the structure of an XML document in general, and a BEAST document specifically, please read the Introduction to XML format.
In most cases the user-interface program BEAUti can be used to create these files even if further editing will be required. This documentation is mostly for reference.
Defining the taxa
A taxon defines a particular organism from which a sequence was isolated. Taxa are used to link sequences with the tips of a tree, or to link corresponding sequences in different alignments. A taxon can be given particular attributes such as a date or a continuous trait value. In its simplest form a taxon can be defined anywhere in a document in the following way:
<taxon id="a_taxon"/>
The id, "a_taxon", is a unique identifier by which that taxon can be referenced. Normally a set of related taxon elements would be grouped together as a set of taxa:
<taxa id="taxa1">
    <taxon id="Brazi82"><date value="1982" units="years" direction="forwards" /></taxon>
    <taxon id="ElSal83"><date value="1983" units="years" direction="forwards" /></taxon>
    <taxon id="ElSal94"><date value="1994" units="years" direction="forwards" /></taxon>
    <taxon id="Indon76"><date value="1976" units="years" direction="forwards" /></taxon>
    <taxon id="Indon77"><date value="1977" units="years" direction="forwards" /></taxon>
</taxa>
This set of taxa has an id, "taxa1", by which it can be referenced. Each of the taxon elements contains a date, in this case the date the taxon was isolated.
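As a hedged illustration of the groupings of taxa mentioned in the introduction, a second set can point at taxa that have already been defined, using the "idref" attribute (described in the alignment section) instead of defining them again. The id "ElSalvador" is a hypothetical name chosen for this sketch, and how such a subset is then used elsewhere in an analysis is not covered on this page:

```xml
<!-- Hypothetical subset: groups two taxa already defined in "taxa1" -->
<taxa id="ElSalvador">
    <taxon idref="ElSal83"/>
    <taxon idref="ElSal94"/>
</taxa>
```

Because the subset only holds references, the dates attached to the original taxon elements do not need to be repeated.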
Defining an alignment
A multiple sequence alignment is defined as a set of individual sequences of the same length. The standard one-letter codes are used for both nucleotides and amino acids, including ambiguity codes. The sequences can be in upper case, lower case, or an arbitrary mixture of the two. The type of data is specified using a 'dataType' attribute. In this case the sequences are nucleotides (either DNA or RNA; U and T are treated synonymously):
<alignment id="alignment1">
    <sequence dataType="nucleotides">
        <taxon idref="Brazi82"/>
        ATGCGATGCGTAG-----GGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCAT
    </sequence>
    <sequence dataType="nucleotides">
        <taxon idref="ElSal83"/>
        ATGCGATGCGTAGG---AGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCAT
    </sequence>
    <sequence dataType="nucleotides">
        <taxon idref="ElSal94"/>
        ATGCGATGCGTAGG---AGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCAT
    </sequence>
    <sequence dataType="nucleotides">
        <taxon idref="Indon76"/>
        ATGCGATGCGTAGG----GGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCAT
    </sequence>
    <sequence dataType="nucleotides">
        <taxon idref="Indon77"/>
        ATGCGATGCGTAGGAGTAGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCAT
    </sequence>
</alignment>
Note that as well as the sequence data, each sequence contains a taxon element. The taxon elements have an "idref" attribute which means that these are actually references to previously defined taxon elements (in this case those defined in the set of taxa, above). It is also possible to define the taxa 'in-line' within the sequences:
<sequence dataType="nucleotides">
    <taxon id="ElSal83"><date value="1983" units="years" direction="forwards" /></taxon>
    ATGCGATGCGTAGG---AGGAAACAGAGACTTTGTGGAAGGAGTCTCAGGTGGAGCAT
</sequence>
However, defining all the taxa first has some organisational advantages and is recommended.
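The case and ambiguity-code rules described at the start of this section can be sketched with a variant of the first sequence. This is an illustrative fragment only: the bases have been altered for the example and are not real data. It mixes lower and upper case and uses the standard IUPAC ambiguity codes R (A or G) and Y (C or T), while keeping the same length (58 characters, gaps included) as the other sequences in "alignment1":

```xml
<!-- Illustrative only: altered copy of the Brazi82 sequence -->
<sequence dataType="nucleotides">
    <taxon idref="Brazi82"/>
    atgcgatgcgTAG-----GGAAACAGRGACTTTGTGGAAGGAGTCTCAGGTGGAGCAY
</sequence>
```

Since every sequence in an alignment must have the same length, any edit of this kind has to replace characters one for one and must not insert or delete them.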
Defining a tree
A tree can be defined in one of two ways. The first, and simplest, way is to insert a NEWICK formatted tree. If you are editing BEAUti output, replace or comment out the <coalescentTree>....</coalescentTree> element and give the inserted tree the same name (id="startingTree"):
<newick id="Tree1">((Brazi82,(ElSal83,ElSal94)),(Indon76,Indon77));</newick>
The NEWICK format is used by many programs including PHYLIP and PAML and is embedded within the NEXUS format used by PAUP.
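NEWICK also allows a branch length to be written after each node, separated from it by a colon. The sketch below adds branch lengths to the same topology. The values are invented for illustration, chosen so that each root-to-tip distance (in years, from a hypothetical root at 1960) matches the sampling year of that tip. The id "Tree1b" is likewise hypothetical. How BEAST interprets branch lengths in a user-supplied tree alongside tip dates is not described on this page and should be checked before relying on it:

```xml
<!-- Illustrative branch lengths in years; root assumed at 1960 -->
<newick id="Tree1b">
    ((Brazi82:12,(ElSal83:3,ElSal94:14):10):10,(Indon76:6,Indon77:7):10);
</newick>
```

For example, the path to ElSal94 is 10 + 10 + 14 = 34 years, which places that tip at 1994.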
The second way of defining a tree is to use a complete XML format. This starts with an element representing the tree, which contains a node element representing the root of the tree. The root contains two or more node elements representing its descendants which, in turn, contain their own descendant nodes. A node that represents a sampled taxon contains no descendant nodes; instead it contains a taxon element (or a reference to one). Here is an example:
<tree id="Tree2">
    <node>
        <node>
            <node><taxon idref="Brazi82"/></node>
            <node>
                <node><taxon idref="ElSal83"/></node>
                <node><taxon idref="ElSal94"/></node>
            </node>
        </node>
        <node>
            <node><taxon idref="Indon76"/></node>
            <node><taxon idref="Indon77"/></node>
        </node>
    </node>
</tree>
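Because a node may contain two or more descendants, an unresolved relationship can be written by giving a node more than two child nodes. The following is a minimal sketch under that reading, using a hypothetical id "Tree3" and three of the taxa defined earlier. Whether a particular analysis accepts a non-binary tree is a separate question that this page does not address:

```xml
<!-- Hypothetical example: a root node with three descendants -->
<tree id="Tree3">
    <node>
        <node><taxon idref="Brazi82"/></node>
        <node><taxon idref="ElSal83"/></node>
        <node><taxon idref="ElSal94"/></node>
    </node>
</tree>
```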
Alexei Drummond and Andrew Rambaut
Copyright © 2002-2006 All rights reserved.

