You are here

Model selection tutorial

Model selection tutorial

Recent years have seen the development of several new approaches to perform model selection in the field of phylogenetics, such as path sampling (under the term 'thermodynamic integration'; Lartillot and Philippe, 2006), stepping-stone sampling (Xie et al., 2011) and generalized stepping-stone sampling (Fan et al., 2011), with the latter approach currently only applicable on a fixed underlying topology. In a recently published paper in Molecular Biology and Evolution (Baele et al., 2012), the performance of 4 model selection approaches for demographic and molecular clock model comparison is investigated: the harmonic mean estimator (HME; Newton and Raftery, 1994), a posterior simulation-based analogue of Akaike's information criteration (AICM; Raftery et al., 2007), path sampling (PS) and stepping-stone sampling. Baele et al. (2012) 'confirm that the HME systematically overestimates the marginal likelihood and fails to yield reliable model classification and show that the AICM performs better and may be a useful initial evaluation of model choice but that it is also, to a lesser degree, unreliable'. The authors show 'that path sampling and stepping-stone sampling substantially outperform these estimators'.

The publication is accompanied by 2 example BEAST XML files to indicate how to use the different estimators in BEAST, which is also what this tutorial focuses on.

Below the mcmc block is provided of the BEAST XML example for a Bayesian skyline plot model (BSP) and an uncorrelated relaxed clock model with an underlying lognormal distribution (UCLD). This code runs an MCMC chain for 100 million iterations and stores samples from this chain in an output file. Besides collecting these samples, the definitions of prior and posterior in the code below are used when calculating the marginal likelihood using path sampling and stepping-stone sampling.

As shown in Baele et al. (2013), it is important to specify proper priors for all the parameters being estimated in the analyses, especially to avoid convergence issues and numerical problems when performing path sampling (PS) and stepping-stone sampling (SS), although it is generally considered to be good practice in any case.

<beast>
        <mcmc id="mcmc" chainLength="100000000" autoOptimize="true">
		<posterior id="posterior">
			<prior id="prior">
				<logNormalPrior mean="1.0" stdev="1.25" offset="0.0" meanInRealSpace="false">
					<parameter idref="kappa"/>
				</logNormalPrior>
				<exponentialPrior mean="0.3333333333333333" offset="0.0">
					<parameter idref="ucld.stdev"/>
				</exponentialPrior>
				<gammaPrior shape="0.0010" scale="1000.0" offset="0.0">
					<parameter idref="ucld.mean"/>
				</gammaPrior>
				<generalizedSkyLineLikelihood idref="skyline"/>
				<exponentialMarkovLikelihood idref="eml1"/>
			</prior>
			<likelihood id="likelihood">
				<treeLikelihood idref="treeLikelihood"/>
			</likelihood>
		</posterior>
                ...
                <!-- write log to file -->
		<log id="fileLog" logEvery="1000" fileName="intergenic_UCLD_BSP.log">
			<posterior idref="posterior"/>
			<prior idref="prior"/>
			<likelihood idref="likelihood"/>
			<parameter idref="treeModel.rootHeight"/>
			<parameter idref="skyline.popSize"/>
			<parameter idref="skyline.groupSize"/>
			<parameter idref="kappa"/>
			<parameter idref="ucld.mean"/>
			<parameter idref="ucld.stdev"/>
			<rateStatistic idref="meanRate"/>
			<rateStatistic idref="coefficientOfVariation"/>
			<rateCovarianceStatistic idref="covariance"/>
			<treeLikelihood idref="treeLikelihood"/>
			<generalizedSkyLineLikelihood idref="skyline"/>
		</log>
                ...
        </mcmc>
</beast>

First off: the harmonic mean estimator (HME). The following code runs the harmonic mean estimation on the samples collected in the mcmc and hence does not require additional calculations to be performed in order to estimate the marginal likelihood, which is considered a major advantage of this approach. NOTE: given that additional model selection approaches are available as of BEAST v1.7.0, the mcmc element for the harmonic mean estimator is now assumed to be the following for clarity purposes (although the previous XML-element is still supported):

<beast>
    <harmonicMeanAnalysis fileName="intergenic_UCLD_BSP.log" bootstrapLength="0" burnIn="10000000">
        <likelihoodColumn name="likelihood" />
    </harmonicMeanAnalysis>
</beast>

The smoothed harmonic mean estimation (sHME) on the samples collected in the mcmc is run as follows:

<beast>
    <harmonicMeanAnalysis fileName="intergenic_UCLD_BSP.log" bootstrapLength="0" burnIn="10000000" 
smoothedEstimate="true">
        <likelihoodColumn name="likelihood" />
    </harmonicMeanAnalysis>
</beast>

Another approach that is able to perform model selection using the samples collected during the MCMC run is the AICM. The AICM is also available in Tracer v1.6.0. NOTE: the AICM is NOT a marginal likelihood estimator. The AICM estimation on the samples collected in the mcmc is run using the following code:

<beast>
    <aicmAnalysis fileName="intergenic_UCLD_BSP.log" bootstrapLength="0" burnIn="10000000">
        <likelihoodColumn name="likelihood" />
    </aicmAnalysis>
</beast>

Path sampling and stepping-stone sampling require additional calculations to perform model selection. Both approaches use the same collection of samples to estimate the marginal likelihood but do so in a different way. Both approaches construct a path between prior and posterior (both defined in the mcmc block above) but, since path sampling uses Simpson's triangulation formula, it converges less quickly to the true value than does stepping-stone sampling. However, when a lot of samples along this path are collected, this difference becomes negligible. NOTE: while the HME and AICM are available in Tracer, path sampling and stepping-stone sampling are not, as they require additional calculations. The code below runs a series of 100 power posteriors along the path between prior en posterior, with the powers being determined according to evenly spaced quantiles of a Beta(0.3,1.0) distribution:

<beast>
    <marginalLikelihoodEstimator chainLength="1000000" pathSteps="100" pathScheme="betaquantile" alpha="0.30">
    	<samplers>
	    <mcmc idref="mcmc"/>
	</samplers>
        <pathLikelihood id="pathLikelihood">
            <source>
                <posterior idref="posterior"/>
            </source>
            <destination>
                <prior idref="prior"/>
            </destination>
        </pathLikelihood>
        <log id="MLE.bsp" logEvery="1000" fileName="MLE.bsp.log">
            <pathLikelihood idref="pathLikelihood"/>
        </log>
    </marginalLikelihoodEstimator>
</beast>

The samples collected from the power posteriors are stored in the MLE.bsp.log file, which is subsequently used as input for calculating the marginal likelihood using path sampling and stepping-stone sampling. The following code runs the path sampling calculation on the samples collected in the marginalLikelihoodEstimator element:

<beast>
    <pathSamplingAnalysis fileName="MLE.bsp.log">
        <likelihoodColumn name="pathLikelihood.delta"/>
        <thetaColumn name="pathLikelihood.theta"/>      
    </pathSamplingAnalysis>
</beast>

The code below then runs the stepping-stone sampling calculation on the same collection of samples gathered in the marginalLikelihoodEstimator element:

<beast>
    <steppingStoneSamplingAnalysis fileName="MLE.bsp.log">
        <likelihoodColumn name="pathLikelihood.delta"/>
        <thetaColumn name="pathLikelihood.theta"/>      
    </steppingStoneSamplingAnalysis>
</beast>

It is also possible to run several independent marginal likelihood calculations to obtain a larger number of samples to increase the accuracy of path sampling and stepping-stone sampling. For example, running the code above twice would result in two output files (e.g. MLE-1.bsp.log and MLE-2.bsp.log) containing different samples from a series of power posteriors. One can then combine these two output files (and hence the samples they contain) to estimate the marginal likelihood using path sampling and stepping-stone sampling by providing a list of the output files to the pathSamplingAnalysis and steppingStoneSamplingAnalysis XML elements. The following code runs the path sampling calculation on the two output files:

<beast>
    <pathSamplingAnalysis fileName="MLE-1.bsp.log MLE-2.bsp.log">
        <likelihoodColumn name="pathLikelihood.delta"/>
        <thetaColumn name="pathLikelihood.theta"/>      
    </pathSamplingAnalysis>
</beast>

The code below then runs the stepping-stone sampling calculation on these two output files:

<beast>
    <steppingStoneSamplingAnalysis fileName="MLE-1.bsp.log MLE-2.bsp.log">
        <likelihoodColumn name="pathLikelihood.delta"/>
        <thetaColumn name="pathLikelihood.theta"/>      
    </steppingStoneSamplingAnalysis>
</beast>

Note that if the two output files were obtained using independent runs, a separate BEAST XML file to combine these samples can easily be created. This XML file has no need for an mcmc or marginalLikelihoodEstimator block, since all the samples have already been collected and hence, no additional calculations are required.

If you use this code we would encourage you to cite the following papers:

Baele G, Lemey P, Bedford T, Rambaut A, Suchard MA and Alekseyenko AV (2012) 'Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty' Molecular Biology and Evolution 29(9), 2157-2167.

Baele G, Li WLS, Drummond AJ, Suchard MA and Lemey P. (2013) 'Accurate model selection of relaxed molecular clocks in Bayesian phylogenetics' Molecular Biology and Evolution 30(2), 239-243.

 

If you'd like to know more about the use of path sampling (PS) and stepping-stone sampling (SS) on whole genome data sets (i.e. involving a lot of parameters across different genes) and/or using multi-core hardware in BEAST/BEAGLE, we'd like to refer you to the following paper:

Baele G and Lemey P. (2013) 'Bayesian evolutionary model testing in the phylogenomics era: matching model complexity with computational efficiency' Bioinformatics 29(16), 1970-1979.

Listed below is a more technical paper on performing direct (log) Bayes factor estimation between two models using path sampling (PS) and stepping-stone sampling (SS), with an emphasis on the derivation of the statistical properties of the latter. This paper focuses on comparing high-dimensional substitution models, but currently does not offer a BEAST implementation:

Baele G, Lemey P and Vansteelandt S. (2013) 'Make the most of your samples: Bayes factor estimators for high-dimensional models of sequence evolution' BMC Bioinformatics 14, 85.