You are here

Continuous phylogeographic analysis

 

Setting up a standard continuous phylogeographic analysis 

This tutorial describes how to edit a BEAST XML file to set up a phylogeographic analysis in continuous space using a standard random walk[1]. The tutorial assumes you are using BEASTv1.5.4 or higher. The example XML file for this analysis can be found here [2] and considers raccoon rabies diffusion in north-eastern United States [3]. This tutorial sets up a standard random walk; for relaxed random walks (RRWs, [4]), we refer to a separate tutorial. First, we need to add the 2D locations as continuous "traits" to the tips. Latitude and longitude can be specified as a bivariate taxon attribute, in this case named "location". Although any order can be used as long as it is the same for all taxa, it would be useful to a stick to a latitude-longitude order for the sake of consistency:

	<taxa id="taxa">
		<taxon id="hOH10_97.2_41.053_80.706">
			<date value="1997.2" direction="forwards" units="years"/>
			<attr name="location">41.053 -80.706</attr>
		</taxon>
		<taxon id="hWVa01_93.2_40.505_80.575">
			<date value="1993.2" direction="forwards" units="years"/>
			<attr name="location">40.505 -80.575</attr>
		</taxon>
		.
		.
		.
		<taxon id="WVa22_02.7_39.635_79.995">
			<date value="2002.7" direction="forwards" units="years"/>
			<attr name="location">39.635 -79.995</attr>
		</taxon>
	</taxa>

Note that with the standard angular specification of latitude and longitude, we assume that phylogeographic diffusion occurs within a rectangle defined by -90,+90 latitude borders and -180,180 longitude borders.

Under a time-homogeneous spatial diffusion process, the additional parameters we must estimate are the unobserved locations of the sequence ancestors at all times along the phylogeny. Therefore, add the four nodeTraits elements to the treemodel (one for the rootTrait, the internalTraits, the leafTraits and a collection of allTraits):

	<treeModel id="treeModel">
		<coalescentTree idref="startingTree"/>
		<rootHeight>
			<parameter id="treeModel.rootHeight"/>
		</rootHeight>
		<nodeHeights internalNodes="true">
			<parameter id="treeModel.internalNodeHeights"/>
		</nodeHeights>
		<nodeHeights internalNodes="true" rootNode="true">
			<parameter id="treeModel.allInternalNodeHeights"/>
		</nodeHeights>
		<nodeTraits name="location" rootNode="true" internalNodes="false" leafNodes="false" traitDimension="2">
			<parameter id="rootTrait"/>
		</nodeTraits> 
		<nodeTraits name="location" rootNode="true" internalNodes="true" leafNodes="false" traitDimension="2">
			<parameter id="internalTraits"/>
		</nodeTraits>
		<nodeTraits name="location" rootNode="false" internalNodes="false" leafNodes="true" traitDimension="2">
			<parameter id="leafTraits"/>
		</nodeTraits>
		<nodeTraits name="location" rootNode="true" internalNodes="true" leafNodes="true" traitDimension="2">
			<parameter id="allTraits"/>
		</nodeTraits>
	</treeModel>

Sufficient statistics of the continuous process are the unobserved 2D locations at the root and internal nodes, and the infinitesimal precision matrix for a bivariate diffusion. Add a multivariateDiffusionModel (after the treeLikelihood is a good place), with the bivariate precision matrix:

	<multivariateDiffusionModel id="diffusionModel">
		<precisionMatrix>
			<matrixParameter id="precisionMatrix">
				<parameter id="col1" value="0.05 0.002"/>
				<parameter id="col2" value="0.002 0.05"/>
			</matrixParameter>
		</precisionMatrix>
	</multivariateDiffusionModel>

The diagonal elements of the precision matrix are the two strictly nonnegative marginal precisions in each spatial dimension. The off-diagonal elements determine the correlation, which can be obtained by the correlation statistic (see below). Following the multivariateDiffusionModel, we specify a multivariateWishartPrior over the precision matrix:

	<multivariateWishartPrior id="precisionPrior" df="2">
		<scaleMatrix>
			<matrixParameter>
				<parameter value="1.0 0.0"/>
				<parameter value="0.0 1.0"/>
			</matrixParameter>
		</scaleMatrix>
		<data>
			<matrixParameter idref="precisionMatrix"/>
		</data>
	</multivariateWishartPrior>

The Wishart distribution is a multivariate generalization of the gamma distribution and is characterized by a “degrees of freedom” (df="2") and scale matrix. Without further expert knowledge, we set these hyperparameters, such that the precisions are relatively uninformative. For the the root location, we specify a multivariateNormalPrior:

	 <multivariateNormalPrior id="rootPrior">
	   <meanParameter>
		 <parameter value="0.0 0.0"/>
		   </meanParameter>
	   	<precisionParameter>
	 		<matrixParameter>
	   			<parameter value="0.001 0.0"/>
	   			<parameter value="0.0 0.001"/>
	 		</matrixParameter>
	   	</precisionParameter>
	  	<data>
	 		<parameter idref="rootTrait"/>
	  	</data>
	 </multivariateNormalPrior>

Note that the prior expectation for the root is (0,0), but again in the absence of strong preexisting knowledge, we make the prior uninformative by granting it a large variance (with small precisions of 0.001). We complete the continuous model specification by setting up a Brownian diffusion likelihood:

	<multivariateTraitLikelihood id="traitLikelihood" traitName="location" 
	 						     useTreeLength="true" scaleByTime="true" 
                                 reportAsMultivariate="true">
		<multivariateDiffusionModel idref="diffusionModel"/>
		<treeModel idref="treeModel"/>
		<traitParameter>
			<parameter idref="allTraits"/>
		</traitParameter>
		<randomize upper="45 -70" lower="28 -122">
			<parameter idref="internalTraits"/>
		</randomize>
		<!-- a jitter adds random noise uniformily drawn from a particular window size for each dimension in case there are duplicate traits -->
		<!--
		<jitter window="0.5 0.5" duplicatesOnly="true">
			<parameter idref="leafTraits"/>
		</jitter>
		-->
    </multivariateTraitLikelihood>

In this multivariateTraitLikelihood, we can specify starting values for the locations at the internal nodes that are randomly drawn from a uniform distribution, in this case uniform(28,45) and uniform(-122,-70) for latitude and longitude respectively. As the case here, the boundaries can be set according to the distribution of 2D locations at the tips. The jitter element is commented out, but can be used to add random noise from a fixed window (here with size 0.5 for both latitude and longitude) to the duplicate traits at the tips of the tree. This is strongly recommended when resorting to RRWs. Each sequence has a unique location in this data set, so there are no duplicate traits in this case.

Additional statistics can be included to obtain useful summaries of the diffusion proces:

	<correlation id="locationCorrelation" dimension1="1" dimension2="2">
		<matrixParameter idref="precisionMatrix"/>
	</correlation>

	<treeLengthStatistic id="treeLength">
		<treeModel idref="treeModel"/>
	</treeLengthStatistic>

	<productStatistic id="treeLengthPrecision1">
		<treeLengthStatistic idref="treeLength"/>
		<subStatistic id="precision1" dimension="0"> 
			<parameter idref="col1"/>
		</subStatistic>
	</productStatistic>

	<productStatistic id="treeLengthPrecision2">
		<treeLengthStatistic idref="treeLength"/>
		<subStatistic id="precision2" dimension="1">
			<parameter idref="col2"/>
		</subStatistic>
	</productStatistic>

	<treeDispersionStatistic id="diffusionRate" greatCircleDistance="true">
		<treeModel idref="treeModel"/>
		<multivariateTraitLikelihood idref="traitLikelihood"/>
	</treeDispersionStatistic>

These include a standard correlation between longitude and latitude and a tree length statistic that is subsequently used in two productStatistics. The precision matrix scales arbitrarily in units-time. We measure the precision matrix in units-tree-length (see useTreeLength="true" scaleByTime="true" in multivariateTraitLikelihood) to maintain data likelihood identifiability. The productStatistics return the original precisions before rescaling. Finally a treeDispersionStatistic sumarizes the overall rate of diffusion throughout the tree. This is calculated as the total distance travelled accross the tree divided by the total tree length in time units. The distances for each branch are inferred from the location realizations at the parent and descendent node of the branches (in this case in great-circle distances [5]). Beware that the great-circle distance calculations assume that the 2D locations were entered in latitude-longitude order!

The Wishart distribution is conjugate to the BD likelihood, enabling us to construct a Gibbs sampler from the full-condition distribution of the precision matrix (precisionGibbsOperator). Because we assume a multivariate normal prior on the root, the full-conditional distributions for the root and internal node locations remain multivariate normally distributed under a Brownian diffusion process, allowing us to construct a Gibbs sampler for internal node locations. These samplers are added under the operators:

	<operators id="operators" optimizationSchedule="log">
		.
		.
		.
		<precisionGibbsOperator weight="15">
			<multivariateTraitLikelihood idref="traitLikelihood"/>
			<multivariateWishartPrior idref="precisionPrior"/>
		</precisionGibbsOperator>

		<traitGibbsOperator weight="50">
			<multivariateTraitLikelihood idref="traitLikelihood"/>
			<multivariateNormalPrior idref="rootPrior"/>
		</traitGibbsOperator>
		
    		<randomWalkOperator windowSize="10.0" weight="5">
			<parameter idref="rootTrait"/>
		</randomWalkOperator>
	</operators>

The randomWalkOperator on the rootTrait is not a necessity, but could assist in efficient mixing. The two priors specified above need to be included in the prior block:

			<prior id="prior">
				<oneOnXPrior>
					<parameter idref="hky.kappa"/>
				</oneOnXPrior>
				<generalizedSkyLineLikelihood idref="skyline"/>
				<exponentialMarkovLikelihood idref="eml1"/>
				<multivariateNormalPrior idref="rootPrior"/>
				<multivariateWishartPrior idref="precisionPrior"/>
			</prior>

Adding the Brownian diffusion likelihood to the likelihood block accomplishes the joint inference of sequences and tip traits:

			<likelihood id="likelihood">
				<treeLikelihood idref="treeLikelihood"/>
				<multivariateTraitLikelihood idref="traitLikelihood"/> 
			</likelihood>

Add the parameters (precisionMatrix, rootTrait), statistics (diffusionRate, productStatistics and correlation), and multivariateTraitLikelihood to the fileLog.

			
		<log id="fileLog" logEvery="50000" fileName="RacRABV_homogeneous.log">
			.
			.
			.
			<matrixParameter idref="precisionMatrix"/>
			<parameter idref="rootTrait"/>
			<treeDispersionStatistic idref="diffusionRate"/>
			<productStatistic idref="treeLengthPrecision1"/>
			<productStatistic idref="treeLengthPrecision2"/>
			<treeDispersionStatistic idref="diffusionRate"/>
 			<correlation idref="locationCorrelation"/>
 			<multivariateTraitLikelihood idref="traitLikelihood"/>
		</log>

Finally, to annotate the trees in the posterior distribution with the sampled continuous locations at each node, we need to add the multivariateTraitLikelihood to the logTree element. Including also the multivariateDiffusionModel will report additional information that may potentially be useful:

		<logTree id="treeFileLog" logEvery="50000" nexusFormat="true" fileName="RacRABV_homogeneous.trees" sortTranslationTable="true">
			<treeModel idref="treeModel"/>
			<posterior idref="posterior"/>
			<multivariateDiffusionModel idref="diffusionModel"/>   <!-- reports info about diffusion model -->
			<multivariateTraitLikelihood idref="traitLikelihood"/>
		</logTree>