


What can BEAST do?

A description of some of the analyses that can be performed using BEAST can be found here.

See also the list of tutorials.


Installing and running BEAST

How do I install and run BEAST?

That depends on the operating system you are using. Please look at the README file that was included in the package you downloaded.

Problems running under Linux

One problem you may encounter is the following error when attempting to run BEAST or BEAUti:

sr@tiny$ bin/beast examples/benchmark.xml bin/..

                  BEAST v1.4.7, 2002-2007


        Korbinian Strimmer

Random number seed: 1217151199607

Exception in thread "main" java.lang.NoClassDefFoundError:
   at java.lang.Class.initializeClass(
   at dr.util.MessageLogHandler.<init>(Unknown Source)
   at<init>(Unknown Source)
   at Source)
Caused by: java.lang.ClassNotFoundException: not found in
parent=gnu.gcj.runtime.ExtensionClassLoader{urls=[], parent=null}}
   at java.lang.ClassLoader.loadClass(
   at java.lang.ClassLoader.loadClass(
   at java.lang.Class.forName(
   at java.lang.Class.initializeClass(
   ...3 more

If you see this, you are trying to run the programs with the wrong JVM (the gnu.gcj classes in the trace above indicate the GNU Java runtime). Either you haven't installed the Java JVM, or you have installed it but haven't told Linux that you want to use it by default.

The way to fix this depends on which Linux distribution you are running. If you are running Debian, use the update-java-alternatives command:

$ sudo /usr/sbin/update-java-alternatives --config java

There are 5 alternatives which provide `java'.

  Selection    Alternative
          1    /usr/lib/jvm/java-6-sun/jre/bin/java
          2    /usr/bin/gij-4.3
          3    /usr/lib/jvm/ia32-java-6-sun/jre/bin/java
*+        4    /usr/lib/jvm/java-gcj/jre/bin/java
          5    /usr/bin/gij-4.2

Press enter to keep the default[*], or type selection number: 1
Using '/usr/lib/jvm/java-6-sun/jre/bin/java' to provide 'java'.

and you should be ready to run.

Note: it's best not to use the update-alternatives command for this. Use update-java-alternatives instead, because "The former just sets the symlinks for the /usr/bin/java alternative, whereas the update-java-alternatives sets all java-related symlink".


Not so sure about this: the command used to be setDefaultJava, but it has apparently been replaced by the update-alternatives command.


for more details

What is BEAGLE and should I use it?

BEAGLE is a high-performance library that can perform the core calculations at the heart of most Bayesian and Maximum Likelihood phylogenetics packages. It can make use of highly parallel processors such as those in 3D graphics boards (referred to as Graphics Processing Units or GPUs) found in many PCs. In general, using it (even without a GPU) will improve the performance of BEAST. However, installing it for BEAST to use is not presently a simple operation (we hope to fix this shortly), and it will not necessarily benefit all data sets. In particular, for the use of a GPU to be efficient (currently only NVIDIA GPUs are supported), long partitions are required (perhaps >500 unique site patterns). Only high-end GPUs such as the GTX285 or Tesla boards will provide sufficient benefits.

How do I use BEAGLE with BEAST? 

How it is installed and used with BEAST depends on the platform: 


Evolutionary rates and time scales

I have X sequences sampled over a Y year time span; are they enough to estimate the substitution rate?

It depends on the substitution rate. If the substitution rate is high enough to produce a substantial number of substitutions in the Y years then it may work. The easiest thing is to simply try it and see. It would be best to start with a simple model - HKY, constant population size and a strict molecular clock. If that seems to behave well then you might consider a more realistic model, depending on the question you are trying to answer. If the time span is insufficient to provide information about the substitution rate then BEAST will not converge and the age of the root of the tree will simply increase to a very large value (and the rate will drop towards zero).

What units are mutation rates, root heights and population sizes expressed in?

If the sequence data are all from one time point, then the overall evolutionary rate must be specified with a strong prior. The units implied by the prior on the evolutionary rate will determine the units of the node heights in the tree (including the age of the most recent common ancestor) as well as the units of the demographic parameters such as the population size parameter and the growth rate.

If the evolutionary rate is set to 1.0, then the node heights (and root height) will be in units of mutations per site (i.e. the units of branch lengths produced by common software packages such as MrBayes 3.0). Similarly, the population size parameter will be an estimate of Ne*mu (i.e., half of the standard population genetic parameter, theta=2*Ne*mu).

However, if, for example, the evolutionary rate is expressed in mutations per site per year, then the branches in the tree will be in units of years. Furthermore, the population size parameter of the demographic model will then be equal to Ne*tau, where Ne is the effective population size and tau is the generation length in years.

Finally, if the evolutionary rate is expressed in units of mutations per site per generation then the resulting tree will be in units of generations and the population parameter of the demographic model will be in natural units (i.e. will be equal to the effective number of reproducing individuals).
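As a worked example of these unit conversions, the following Python sketch turns a population size parameter estimated on a tree in years (Ne*tau) into Ne, and a per-year rate into a per-generation rate. The function names and the numbers are hypothetical illustrations and are not part of BEAST.

```python
def effective_population_size(ne_tau: float, generation_length_years: float) -> float:
    """Convert a population size parameter in years (Ne * tau) into Ne."""
    if generation_length_years <= 0:
        raise ValueError("generation length must be positive")
    return ne_tau / generation_length_years


def rate_per_generation(rate_per_year: float, generation_length_years: float) -> float:
    """Convert mutations/site/year into mutations/site/generation."""
    if generation_length_years <= 0:
        raise ValueError("generation length must be positive")
    return rate_per_year * generation_length_years


# Hypothetical values: an estimated Ne*tau of 5000 (years) and a
# generation length of 0.5 years.
print(effective_population_size(5000.0, 0.5))  # 10000.0
```

The conversion only makes sense if the generation length is known from outside the analysis, since BEAST estimates the product Ne*tau and not its two factors separately.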

What is the difference between the 'meanRate' and 'ucld.mean' rate parameters?

If r_i is the rate on the i'th branch then ucld.mean is simply the sum of r_i for all i divided by the number of branches (2n-2). It is the simple arithmetic mean of the branch rates. Since some branches represent much more time than others, this rate will not necessarily be the same as the total number of substitutions per site divided by the total amount of time that the tree represents. The meanRate parameter is in fact the total number of substitutions per site divided by the total amount of time that the tree represents. So the meanRate can be thought of as the mean of the r_i weighted by t_i (the length of time of the i'th branch). That is, it is the sum of r_i * t_i divided by the sum of t_i. 
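The difference between the two means described here can be seen with a small Python sketch. The branch rates and durations below are made-up values for a hypothetical 3-taxon tree (2n-2 = 4 branches), and the function names are illustrative only.

```python
def arithmetic_mean_rate(rates):
    """Unweighted mean of the branch rates r_i (the quantity described for ucld.mean)."""
    return sum(rates) / len(rates)


def time_weighted_mean_rate(rates, times):
    """Sum of r_i * t_i divided by the sum of t_i (the quantity described for meanRate)."""
    if len(rates) != len(times):
        raise ValueError("need one duration per branch")
    return sum(r * t for r, t in zip(rates, times)) / sum(times)


# Hypothetical branch rates (substitutions/site/year) and branch durations (years).
rates = [0.001, 0.002, 0.004, 0.001]
times = [10.0, 10.0, 5.0, 25.0]

print(f"{arithmetic_mean_rate(rates):.4f}")            # 0.0020
print(f"{time_weighted_mean_rate(rates, times):.4f}")  # 0.0015
```

In this example the fastest branch is also the shortest, so the time-weighted mean is lower than the simple arithmetic mean.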

If you have prior information on the overall rate it is best to reflect that with a prior distribution on the ucld.mean as that is the actual parameter of the model. This actually contradicts earlier advice to put it on the meanRate statistic.

How can I use substitution models that are not offered in BEAUti, for example the Jukes-Cantor model?

The Jukes-Cantor model is the simplest substitution model - the frequencies of the four nucleotides are assumed to be equal (0.25 each), and all of the rates in the transition matrix are assumed to be equal (A->C = A->G = A->T = C->G = C->T = G->T etc.).

This can easily be implemented in BEAUti:

1) In the "Site Model" pane in BEAUti, select "GTR" for the Substitution Model.

2) In the "Site Model" pane, select "All equal" for the Base Frequencies.

3) In the "Operators" pane, uncheck the boxes for "ac", "ag", "at", "cg", and "gt". This means that BEAST will not estimate the values of these parameters, and they will remain at their default starting values of 1.0.

Following these steps will implement the Jukes-Cantor model. Cheers, Simon
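To show what the equal-frequency, equal-rate assumptions imply, here is a minimal Python sketch of the standard closed-form Jukes-Cantor transition probabilities. It is an illustration of the model, not BEAST code; the function name is hypothetical and the branch length is assumed to be in expected substitutions per site.

```python
import math


def jc69_transition_probabilities(distance: float):
    """Return (p_same, p_diff) under Jukes-Cantor.

    p_same is the probability that a site has the same nucleotide at both
    ends of a branch; p_diff is the probability of each one of the three
    other nucleotides. `distance` is in expected substitutions per site.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    decay = math.exp(-4.0 * distance / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return p_same, p_diff


p_same, p_diff = jc69_transition_probabilities(0.1)
print(f"{p_same:.4f} {p_diff:.4f}")
```

Because every substitution type has the same rate, one number (p_diff) covers all twelve off-diagonal entries, and on long branches all four nucleotides approach the equal frequency of 0.25.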



What do all the parameters do/mean?

Look here for a description of the model parameters.

Optimizing operators

Do I need to worry about optimizing operators if my ESSs are okay?

No. Tuning the operators will only increase the efficiency of the sampling - resulting in better ESSs for the same chain length. If you are already getting suitable ESSs then that is fine. See this tutorial for more details about this subject.

Why does the operator analysis continue to suggest that I decrease my <subtreeSlide>'s size attribute in order to improve my acceptance probability?

The size value in the <subtreeSlide> operator should be proportional to the height of your tree (say about 10% initially). If the tree is uncalibrated then the height of the tree is given in substitutions per site, which can be very small.


Data and Experimental Design

How does BEAST treat gaps and ambiguities in sequence data?

BEAST treats a gap character ('-', '?' or 'N') as missing data, so it does not contribute any probability to the likelihood for that branch and site. This is the same as saying that there is equal marginal probability for the 4 nucleotides. Joe Felsenstein's book, 'Inferring Phylogenies', has a section describing this. One additional thing to note is that, by default, BEAST treats any of the IUPAC ambiguity codes as gaps or missing data (i.e., an R is treated as an N). This simple approximation allows a considerable (up to about 50%) speed-up in the likelihood calculation, but if ambiguities are important to your analysis you can override this behaviour. If you want BEAST to treat ambiguity codes exactly, you can add the following attribute in your XML:

<treeLikelihood id="treeLikelihood" useAmbiguities="true"> .... 
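The effect of this setting can be pictured with a small Python sketch of the per-tip state vector used in a likelihood calculation. This mirrors the behaviour described in this answer and is not taken from the BEAST source; the function name and the `use_ambiguities` argument (named after the XML attribute) are illustrative assumptions.

```python
NUCLEOTIDES = "ACGT"

# Standard IUPAC nucleotide codes and the states each one is compatible with.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "-": "ACGT", "?": "ACGT",
}


def tip_partials(code: str, use_ambiguities: bool = False):
    """Return the 0/1 vector over A, C, G, T for one character at a tip."""
    code = code.upper()
    if code not in IUPAC:
        raise ValueError(f"unknown character: {code!r}")
    states = IUPAC[code]
    if not use_ambiguities and len(states) > 1:
        # Default behaviour described above: any ambiguity counts as missing.
        states = NUCLEOTIDES
    return [1.0 if n in states else 0.0 for n in NUCLEOTIDES]


print(tip_partials("R"))                        # [1.0, 1.0, 1.0, 1.0]
print(tip_partials("R", use_ambiguities=True))  # [1.0, 0.0, 1.0, 0.0]
```

With the default, an R carries no more information than an N; with exact treatment it restricts the tip to A or G.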

Should I remove identical sequences?

BEAST is a method for sampling all trees that have a reasonable probability given the data. One of the assumptions underlying the BEAST program is that there is a binary tree that has generated the data. Just because (for example) three taxa have identical sequences doesn't mean that they are equally closely related in the true tree - it just means that there were no mutations (in the sampled part of the genome) down the ancestral history of those three taxa. In this case, BEAST would sample all three trees with equal probability: ((A,B),C), (A,(B,C)), ((A,C),B). If you summarize the BEAST output as a single tree (presumably using TreeAnnotator, which picks a specific tree from the trace that is representative and annotates it with posterior probabilities of clades) you will see some particular resolution of the identical sequences, based on the selected representative tree. But the posterior probability for that particular resolution will probably be low, since many other resolutions have also been sampled in the chain.

One of the results of the way that BEAST analyzes the data is that you get an estimate of how closely related these sequences are, even if the sequences are identical. This is possible because BEAST is essentially determining how old the common ancestor of these sequences could be given that no mutations were observed in the ancestral history of the identical sequences, and given the estimated substitution rate and sequence length. In terms of the identical sequences, the only node with the possibility of significant support would be the common ancestor of the identical sequences. If this is the case then you can confidently report the age of this node, but should not try to make any statements about relationships or divergence times within the group of identical sequences.

Finally, a population genetic reason not to remove identical sequences: imagine you have sampled 100 individuals and there are only 20 haplotypes. You are tempted to just analyze the 20 haplotypes. However, this is equivalent to tricking BEAST into thinking that you have randomly sampled 20 individuals from a population and found that every individual had a unique haplotype. If this was actually the case you would indeed conclude that the population must be very large! And so will BEAST. So by removing all the identical sequences you are tricking BEAST into estimating a larger population.

Why is the distribution of the root height statistic not uniform using 'sample from prior'?

Question: I have a slight problem with BEAST. I have generated a configuration file with BEAUti. The prior on the root height is set as uniform (in the [74,200] range). I also selected the option 'sample from prior'. I then ran BEAST and obtained a log file. The problem is that the distribution of the root height statistic is not uniform. Is this a known problem or did I do something wrong?

Answer: Your analysis also has a constant size coalescent prior specified on the tree. The "calibrated tree prior" is the product of the coalescent tree prior and the calibration density. Such a product-construction of the full tree prior can (and often does) lead to marginal prior distributions on the calibrated nodes that differ substantially from the calibration density specified on those nodes. This is not a bug unless you thought that you were specifying the marginal prior distribution of the node, rather than a calibration density that is considered an independent piece of prior information to the constant-size coalescent prior information. 
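A one-dimensional toy calculation in Python shows why the product-construction gives a non-uniform marginal. It multiplies the uniform [74,200] calibration density by a stand-in density for the root height implied by the tree prior. The exponential shape and its mean of 50 are pure assumptions chosen for illustration; the real density induced by a coalescent prior depends on the whole tree and is not computed here.

```python
import math


def toy_tree_prior(height: float, mean_height: float = 50.0) -> float:
    """Stand-in density for the root height implied by the tree prior (assumed)."""
    return math.exp(-height / mean_height) / mean_height


def product_marginal(grid, lower, upper, tree_prior_density):
    """Normalised product of a uniform calibration density and a tree prior density."""
    unnormalised = [
        tree_prior_density(h) if lower <= h <= upper else 0.0 for h in grid
    ]
    step = grid[1] - grid[0]
    total = sum(unnormalised) * step
    return [u / total for u in unnormalised]


# Grid of root heights from 74 to 200 in steps of 0.5.
grid = [74.0 + 0.5 * i for i in range(253)]
density = product_marginal(grid, 74.0, 200.0, toy_tree_prior)

# The marginal is far from flat: heights near 74 are favoured over heights near 200.
print(f"{density[0] / density[-1]:.2f}")
```

The uniform calibration only truncates the range; within that range the shape comes entirely from the other factor in the product.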

An alternative construction to the product-construction is the conditional-construction, where you specify the marginal prior distribution of the calibrated node and then compute the coalescent distribution *conditional* on the height H of the calibrated node. This gives priority of the calibration prior over the coalescent tree prior. Joseph and I have a paper accepted in Systematic Biology [1] on the difference between these two prior constructions and we show how to do the second type with certain tree priors, but the general conditional-construction with multiple calibrations and arbitrary tree prior process is too hard for us to work out.

I hope this helps, Cheers, Alexei.

Starting tree and fixing trees

Can you designate a user-defined starting tree?

Yes, you can insert a NEWICK format tree into the XML to act as a starting tree. Take a look at this tutorial.

How can you keep this topology constant while estimating other parameters, e.g., node height?

Remove all the operators that act on the <treeModel>. These are the <narrowExchange>, <wideExchange>, <wilsonBalding> and <subtreeSlide> operators. Without these operators the topology of the tree won't change.


Setting up models

How do I tell BEAST to use an outgroup?

The simple answer is that you may not want to - BEAST will sample the root position along with the rest of the nodes in the tree. If you then calculate the proportion of trees that have a particular root, you obtain a posterior probability for this root position. However if you have a strong prior for an outgroup then you can constrain the ingroup to be monophyletic. See this tutorial for details of how to do this.

How do I run BEAST without data to sample the Prior?

See this tutorial for details of how to do this.

Does it matter what order the Priors & Likelihoods come in the XML?

Yes. BEAST will evaluate each component in order, starting with the priors. If any of these evaluates to zero, the rest of the posterior is not calculated. Thus it is particularly important that constraints such as <booleanLikelihood> and <uniformPrior>, which may give zero probabilities, are put at the beginning of the <prior> section:

	<mcmc id="mcmc" chainLength="1000000" autoOptimize="true">
		<posterior id="posterior">
			<prior id="prior">
				<!-- hard constraint first, so a violation is detected cheaply -->
				<uniformPrior id="constraint" upper="100.0" lower="90.0"><parameter idref="rootHeight"/></uniformPrior>
				<coalescentLikelihood idref="coalescent"/>
			</prior>
			<likelihood id="likelihood">
				<treeLikelihood idref="treeLikelihood"/>
			</likelihood>
		</posterior>
		<!-- operators and loggers omitted -->
	</mcmc>
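The reason the order matters can be sketched in Python as short-circuit evaluation on the log scale. This is only an illustration of the idea described in this answer, not BEAST's implementation; the function names and the log-density values are hypothetical.

```python
import math


def uniform_log_prior(x: float, lower: float, upper: float) -> float:
    """Log density of a uniform prior; -inf (probability zero) outside the bounds."""
    if lower <= x <= upper:
        return -math.log(upper - lower)
    return -math.inf


def log_posterior(components):
    """Sum log densities in order, stopping at the first zero-probability component.

    `components` is a list of (name, zero-argument function) pairs.
    Returns the total and the names of the components actually evaluated.
    """
    total = 0.0
    evaluated = []
    for name, evaluate in components:
        value = evaluate()
        evaluated.append(name)
        if value == -math.inf:
            return -math.inf, evaluated
        total += value
    return total, evaluated


# Hypothetical state: a root height of 120 violates the [90, 100] constraint.
components = [
    ("constraint", lambda: uniform_log_prior(120.0, 90.0, 100.0)),
    ("coalescent", lambda: -35.2),
    ("treeLikelihood", lambda: -1234.5),  # stands in for the expensive calculation
]
print(log_posterior(components))  # (-inf, ['constraint'])
```

With the constraint first, the expensive tree likelihood is never computed for a state that will be rejected anyway; with the constraint last, all of that work is wasted.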

How do tree prior distributions affect estimation of rates and dates?

See this page for a discussion of this question.


Effective Sample Size (ESS) of parameters

What is an ESS?

The Effective Sample Size (ESS) of a parameter sampled from an MCMC (such as BEAST) is the number of effectively independent draws from the posterior distribution that the Markov chain is equivalent to.

How do I calculate an ESS?

The simplest way is to load your BEAST log files into Tracer.

Why do I need to increase it?

If the ESS of a parameter is small then the estimate of the posterior distribution of that parameter will be poor. In Tracer you can calculate the standard deviation of the estimated mean of a parameter (its standard error). If the ESS is small then this standard error will be large. This is directly analogous to the sample size of an experiment consisting of independent measurements.
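To make the idea concrete, here is a minimal Python sketch of one common ESS estimator: the chain length divided by (1 + 2 x the sum of the positive leading autocorrelations). It is a generic textbook-style estimator and is not claimed to be the algorithm Tracer uses; all function names are hypothetical, and the test chains are simulated.

```python
import math
import random
import statistics


def effective_sample_size(samples):
    """Estimate the ESS as n / (1 + 2 * sum of autocorrelations).

    The sum is truncated at the first lag whose sample autocorrelation
    is not positive (a simple, commonly used cut-off).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(samples) / n
    centred = [x - mean for x in samples]
    variance = sum(c * c for c in centred) / n
    if variance == 0.0:
        raise ValueError("trace is constant")
    rho_sum = 0.0
    for lag in range(1, n):
        autocov = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / n
        rho = autocov / variance
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def standard_error_of_mean(samples):
    """Standard deviation of the estimated mean, using the ESS in place of n."""
    return statistics.pstdev(samples) / math.sqrt(effective_sample_size(samples))


def ar1_chain(n, phi, seed):
    """Simulated autocorrelated trace: x[t] = phi * x[t-1] + noise."""
    rng = random.Random(seed)
    x, chain = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


independent = ar1_chain(2000, 0.0, seed=1)   # no autocorrelation
sticky = ar1_chain(2000, 0.9, seed=1)        # strongly autocorrelated
print(round(effective_sample_size(independent)), round(effective_sample_size(sticky)))
```

Both traces contain 2000 samples, but the strongly autocorrelated one is worth far fewer independent draws, so the standard error of its mean is correspondingly larger.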

What size ESS is adequate?

The larger the better. Tracer flags up ESSs < 100 but this may be liberal and > 200 would be better. On the other hand chasing ESSs > 10000 may be a waste of computational resources.

Do I need adequate ESSs for all my parameters?

Possibly not. Really low ESSs may be indicative of poor mixing but if a couple of parameters that you are not interested in are a little low it probably doesn't matter. The likelihoods (both of the tree and coalescent model) should have decent ESSs.

Is the ESS important if I am interested in the sample of trees?

Definitely. At the moment we don't have any way of directly examining the ESS of the tree or the clade frequencies. Therefore, it is important that the continuous parameters and likelihoods have adequate ESSs to demonstrate good mixing of the MCMC.

How do I increase the ESS of a parameter?

Take a look at this brief tutorial for ways of increasing the ESS.


Interpreting the results

How do I do model comparison?

See this page for a description of model comparison using BEAST and how to do it.

How do I summarize the posterior distribution of trees?

See this page for various approaches to summarizing trees.

What is the maximum clade credibility (MCC) tree produced by TreeAnnotator?

The tree produced by TreeAnnotator (denoted the maximum clade credibility or MCC tree) is not a consensus tree such as that produced by the 'sumt' command in MrBayes. Instead, TreeAnnotator picks one of the trees actually present in the sample produced by BEAST - thus it is a tree that was actually visited by the MCMC sampler. The tree it picks is the one that has the highest product of the clade credibilities (posterior probabilities) of all the clades in the tree. The motivation for this is to find a single 'point estimate' tree that is in some way central to the distribution of trees. This tree is then annotated with summary information from the full set of trees in the sample. For example, it is given average node heights, credible intervals for the node heights, rates, etc.
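The selection rule can be sketched in a few lines of Python. Here each sampled tree is reduced to the set of clades it contains, which is an assumed simplification; the data structures and function names are hypothetical and this is not TreeAnnotator's code. The product of credibilities is compared on the log scale so that it cannot underflow for large trees.

```python
import math
from collections import Counter


def clade_credibilities(trees):
    """Fraction of sampled trees that contain each clade."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: count / len(trees) for clade, count in counts.items()}


def maximum_clade_credibility_tree(trees):
    """Return the sampled tree with the highest product of clade credibilities."""
    credibility = clade_credibilities(trees)

    def log_score(tree):
        return sum(math.log(credibility[clade]) for clade in tree)

    return max(trees, key=log_score)


# Hypothetical sample of five 4-taxon trees, each given by its internal clades.
AB, CD, BC, ABC = (frozenset(s) for s in ("AB", "CD", "BC", "ABC"))
tree_x = frozenset({AB, ABC})   # (((A,B),C),D)
tree_y = frozenset({AB, CD})    # ((A,B),(C,D))
tree_z = frozenset({BC, ABC})   # (((B,C),A),D)
sample = [tree_x, tree_x, tree_x, tree_y, tree_z]

mcc = maximum_clade_credibility_tree(sample)
print(mcc == tree_x)  # True
```

In this sample the clades AB and ABC each appear in 4 of 5 trees, so tree_x scores 0.8 x 0.8 = 0.64, against 0.16 for each of the other two topologies.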

See this page for various other approaches to summarizing trees.

Why does my tree produced by TreeAnnotator have negative branch lengths?

MCC trees produced by TreeAnnotator can have a descendant node that is older than its direct ancestor (a negative branch length). This may seem like an error but is actually the correct behaviour. The MCC tree is, by default, annotated with node heights averaged across all trees in the sample that contain that clade. Negative branch lengths result when a clade is at low frequency and tends not to occur in those trees that have the MCC tree's ancestral clade (or vice versa). This means the average heights for the adjacent nodes are derived from different sets of trees and may not have any direct ancestor-descendant relationship.
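A small made-up example in Python shows how averaging over different subsets of trees produces a negative branch. Each sampled tree is represented as a mapping from clade to node height; the heights and names are hypothetical, and every individual tree has its child node younger than its parent.

```python
def mean_clade_heights(trees):
    """Average the height of each clade over the trees that contain it."""
    totals, counts = {}, {}
    for tree in trees:
        for clade, height in tree.items():
            totals[clade] = totals.get(clade, 0.0) + height
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: totals[clade] / counts[clade] for clade in totals}


# Hypothetical sample. Clade "AB" is young whenever it sits inside "ABC",
# but old in the two trees of the form ((A,B),(C,D)), which lack "ABC".
sample = [
    {"AB": 1.0, "ABC": 2.0},
    {"AB": 1.0, "ABC": 2.0},
    {"AB": 5.0, "CD": 1.0},
    {"AB": 6.0, "CD": 1.0},
    {"BC": 1.0, "ABC": 2.0},
]

heights = mean_clade_heights(sample)
# Branch from the parent clade ABC down to its child AB in the summary tree.
branch_length = heights["ABC"] - heights["AB"]
print(heights["AB"], heights["ABC"], branch_length)  # 3.25 2.0 -1.25
```

The mean height of AB is pulled up by trees that do not contain ABC at all, so the averaged child ends up older than the averaged parent.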

Error messages

What does the error message I am getting mean?

Look at the Error Messages page for details of the different error messages.

Why is BEAST telling me that "the initial model is invalid because the state has a zero likelihood"?

There are essentially three reasons that you can get the "initial model is invalid" error.

Firstly, your initial tree could violate one or more of the boolean priors (priors with hard bounds, including constraints on the monophyly of clades). This can be fixed by providing a starting tree that conforms to these constraints using the <newick> element in the XML file. See this tutorial for details about how to use a starting tree. The starting tree should contain all constrained clades, and the root node and any clade MRCAs should fit within any uniform priors (or translated exponential or lognormal priors). This means the provided NEWICK format tree should be rooted, with branch lengths in the units of time being used in the priors. Technically it doesn't matter whether the initial tree is ultrametric because BEAST will adjust it until it is. However, this process alters the ages of the nodes of the tree and thus could cause it to violate the hard bounds. If the constraint is on the age of the root of the tree, BEAST can be told to rescale the entire tree so that the root has a particular age. This is done by adding an attribute to the <newick> element:

<newick rescaleTree="100.0">
 ( newick tree here...);
</newick>

Secondly, the initial tree could be particularly far from the optimum which may cause the calculation of the likelihood of the sequences to fail. More technically, the likelihood of a particular site is calculated by traversing the tree and taking the product of the probability for each branch. If the individual probabilities are small (because the data doesn't fit the tree) then this product can rapidly approach, and will eventually be rounded to, zero. This is more likely to occur if there are many sequences as there are more probabilities to multiply. Once the likelihood for a given site is calculated, the logarithm is taken and then summed across all sites and thus long sequences do not cause this problem. The only solution to this problem is to start with a better tree (such as the UPGMA option in BEAUti) or a reasonably optimal starting tree. 
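The underflow described here is easy to reproduce in Python. The sketch below multiplies many small per-branch probabilities directly, and then again with periodic rescaling into a running logarithm. It illustrates the numerical principle only: the probabilities are made up, the threshold is an arbitrary assumption, and this is not BEAST's likelihood code.

```python
import math


def naive_likelihood(probabilities):
    """Plain product of probabilities; underflows to 0.0 when there are many small terms."""
    product = 1.0
    for p in probabilities:
        product *= p
    return product


def rescaled_log_likelihood(probabilities, threshold=1e-100):
    """Same product, returned as a logarithm, with rescaling to avoid underflow."""
    log_scale = 0.0
    product = 1.0
    for p in probabilities:
        if p == 0.0:
            return -math.inf
        product *= p
        if product < threshold:
            # Move the accumulated magnitude into the log-scale term.
            log_scale += math.log(product)
            product = 1.0
    return log_scale + math.log(product)


# Hypothetical poorly fitting tree: 200 branches, each contributing 0.001.
probabilities = [1e-3] * 200

print(naive_likelihood(probabilities))  # 0.0
print(f"{rescaled_log_likelihood(probabilities):.2f}")
```

The true value, 10^-600, is far below the smallest representable double, so the plain product is rounded to zero, whereas its logarithm is an ordinary number.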

Thirdly, if you are using calibrations and estimating or specifying a rate of substitution, it is possible that the initial value for the rate is too small or too large, which can also cause underflow problems when the branches are scaled from units of time to units of substitutions using the rate. One potential reason the rate might be inappropriate is if the calibrations and tree are given in units of millions of years (e.g., a root of 10My) whereas the rate is given in units of substitutions per year (e.g., a rate of 1.0E-8 subst. per year). If you multiply the initial rate by the initial age of the tree, the result should be in the sort of range you would expect for the genetic divergence of the gene you are using (probably 0.01-1.0 - genes are usually selected to have diversity but not so much that they are saturated).
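This rule of thumb is simple to check before starting a run. The Python sketch below multiplies the initial rate by the initial root age and tests the result against the 0.01-1.0 range suggested above; the function names are hypothetical and the bounds are only that suggested range, not a hard requirement.

```python
def expected_divergence(rate: float, root_age: float) -> float:
    """Initial rate multiplied by initial root age (substitutions per site)."""
    return rate * root_age


def units_look_consistent(rate: float, root_age: float,
                          low: float = 0.01, high: float = 1.0) -> bool:
    """True if rate * root age falls in the plausible divergence range."""
    return low <= expected_divergence(rate, root_age) <= high


rate_per_year = 1.0e-8

# Mismatched units: root age entered as 10 (millions of years), rate per year.
print(units_look_consistent(rate_per_year, 10.0))          # False

# Matching units: root age entered as 10,000,000 years.
print(units_look_consistent(rate_per_year, 10_000_000.0))  # True
```

In the mismatched case the implied divergence is about 10^-7 substitutions per site, which is a strong hint that the rate and the calibrations use different time units.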

We have also added a mechanism for rescaling the likelihood of the tree as it is calculated to avoid numerical underflows. Turning this on should help for large trees in cases where an initial likelihood was zero. To turn this on you need to add 'useScaling="true"' to the treeLikelihood element(s) in the BEAST XML file. Search for the line:

	<treeLikelihood id="treeLikelihood">

and change it to:

	<treeLikelihood id="treeLikelihood" useScaling="true">

If you are using codon partitioning there will be more than one treeLikelihood element.

BEAST seems to run but the posterior is reported as ? (question mark) and none of the values change

This can happen when using a complex coalescent prior such as the Logistic. The problem is that the particular set of starting parameters may result in probabilities close to zero. Try altering the starting values of the coalescent model parameters.

I am getting the error "java.lang.OutOfMemoryError: Java heap space". Is there a way I can increase the memory for the program?

Look at the Increasing Memory Usage page for details of increasing the memory available to BEAST and the other programs.

I still get an OutOfMemoryError in TreeAnnotator even when I use all the memory in the machine

This is usually due to very large tree files being loaded into TreeAnnotator. It is unlikely that more than 10,000 trees are necessary to give good estimates of well supported nodes so if you are trying to load hundreds of thousands of trees then you may well have problems (and reach the memory limitations of your computer). When running BEAST it is a good idea to adjust the sampling frequency of the log files to obtain about 10,000 samples when you change the chain length. For existing log files, you can use LogCombiner to 'thin' out the trees at a lower sampling frequency.
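The arithmetic for hitting roughly 10,000 samples is shown in the Python sketch below. The function names are hypothetical helpers, not BEAST or LogCombiner options; they only compute the sampling interval to set for a new run and the thinning factor to apply to an existing file.

```python
def sampling_interval(chain_length: int, target_samples: int = 10_000) -> int:
    """Number of MCMC steps between logged samples to get about target_samples."""
    if chain_length <= 0 or target_samples <= 0:
        raise ValueError("chain length and target must be positive")
    return max(1, chain_length // target_samples)


def thinning_factor(existing_samples: int, target_samples: int = 10_000) -> int:
    """Keep every k-th tree so that at most target_samples trees remain."""
    if existing_samples <= 0 or target_samples <= 0:
        raise ValueError("sample counts must be positive")
    return max(1, -(-existing_samples // target_samples))  # ceiling division


# Hypothetical run of 50,000,000 steps: log every 5000th state.
print(sampling_interval(50_000_000))  # 5000

# Hypothetical existing file with 200,000 trees: keep every 20th tree.
print(thinning_factor(200_000))       # 20
```

Remember to recompute the interval whenever the chain length is changed, otherwise a longer run silently produces a proportionally larger tree file.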


Questions that haven't been answered yet:

Creating XML Input

What is a "Double"?

Complex models and hypothesis testing

Is it possible to divide the data into partitions and assign different site models to each partition?

Yes, you can. Take a look at these tutorials.

The difference between the two tutorials is as follows. In Tutorial 8, the different genes (loci) have separate alignments. For example, gene1 is in "alignment1", gene2 is in "alignment2" and gene3 is in "alignment3".

I think that, following this tutorial, the alignments of the different genes (loci) can contain different taxa. For example, "alignment1" has taxa A, B, C and D; "alignment2" has taxa A, B, C and E; "alignment3" has taxa A, B, C, D and E.

In Tutorial 6, the different genes (loci) are all in one alignment. I think this tutorial will work well if all of your genes have the same taxa.

How do you test alternative demographic scenarios with BEAST?

