You are here

Discrete Phylogeographic Analysis

BEAST Documentation->Tutorials-> Discrete Phylogeographic Analysis

Setting up a standard discrete phylogeographic analysis

This tutorial describes how to edit a BEAST XML file to set up a discrete phylogeographic analysis using a standard continuous-time Markov chain (CTMC, [1]). The tutorial assumes you are using BEASTv1.2 or higher. The example XML file for this analysis can be found here [2] and considers influenza A H5N1 diffusion among (K=7) locations. This tutorial sets up a standard CTMC, but points out the necessary additions for Bayesian stochastic search variable selection (BSSVS[3]) as well; a separate tutorial focuses on this procedure. Start by adding K location XML elements to the XML file (before the taxa is a good place):

	<!-- The list of sampling locations                                        -->
	<!-- K = 7                                                                 -->
	<location id="Fujian"/> 
	<location id="Guangdong"/> 
	<location id="Guangxi"/> 
	<location id="Hebei"/> 
	<location id="Henan"/> 
	<location id="HongKong"/> 
	<location id="Hunan"/> 

Add a location attribute to each taxon:

	<taxa id="taxa">
		<taxon id="A_chicken_Fujian_1042_2005">
			<date value="2005.0" direction="forwards" units="years"/>
			<attr name="location">Fujian</attr>
		</taxon>
		<taxon id="A_Duck_Fujian_1734_2005">
			<date value="2005.0" direction="forwards" units="years"/>
			<attr name="location">Fujian</attr>
		</taxon>
		.
		.
		.
		<taxon id="A_blackbird_Hunan_1_2004">
			<date value="2004.0" direction="forwards" units="years"/>
			<attr name="location">Hunan</attr>
		</taxon>
	</taxa>

Set up a general data type and a unique site pattern for the location data (after the alignment patterns would be good place):

	<!-- The general data type for locations                                   -->
	<generalDataType id="geography">
		<location idref="Fujian"/> 
		<location idref="Guangdong"/> 
		<location idref="Guangxi"/> 
		<location idref="Hebei"/> 
		<location idref="Henan"/> 
		<location idref="HongKong"/> 
		<location idref="Hunan"/> 
	</generalDataType>

	<!-- 1 unique site pattern for the locations                                   -->
	<attributePatterns id="geoPatterns" attribute="location">
		<generalDataType idref="geography"/>
		<taxa idref="taxa"/>
	</attributePatterns>

Add a svsGeneralSubstitutionModel and associated statistics (maybe after the treeLikelihood). The svsGeneralSubstitutionModel sets up the CTMC, equivalent to e.g. GTR for nucleotide substitution. The dimension of the frequencies in the CTMC needs to match the number of locations (K). Set the dimensions of the rates and indicators to K*(K-1)/2. The sumStatistic and productStatistic will be useful for the BSSVS extension of the standard model.

	<!-- CTMC model for discrete state reconstructions             -->
	<svsGeneralSubstitutionModel id="originModel" name="origin">
		<dataType idref="geography"/>
		<frequencies>
			<frequencyModel id="geoFreqs" normalize="true">
				<dataType idref="geography"/>
				<frequencies>
					<parameter id="geoFreqs.frequencies" dimension="7"/>
				</frequencies>
			</frequencyModel>
		</frequencies>
		<rates relativeTo="1">
			<parameter id="rates" dimension="21" value="1.0"/>
		</rates>
		<rateIndicator>
			<parameter id="indicators" dimension="21" value="1.0"/>
		</rateIndicator>
	</svsGeneralSubstitutionModel>

	<sumStatistic id="nonZeroRates" name="nonZeroRateCount" elementwise="true">
		<parameter idref="indicators"/>
	</sumStatistic>

	<productStatistic id="actualRates" elementwise="false">
		<parameter idref="indicators"/>
		<parameter idref="rates"/>
	</productStatistic>

Similar to the likelihood approach of nucleotide substitution, complete the model specification by adding a siteModel and phylogenetic likelihood (ancestralTreeLikelihood).

    <siteModel id="geoSiteModel">
        <substitutionModel>
            <svsGeneralSubstitutionModel idref="originModel"/>
        </substitutionModel>
        <mutationRate>
            <parameter id="geoSiteModel.mu" value="0.1" lower="0.0" upper="10.0"/>
        </mutationRate>
    </siteModel>

    <ancestralTreeLikelihood id="geoTreeLikelihood">
        <patterns idref="geoPatterns"/>
        <treeModel idref="treeModel"/>
        <siteModel idref="geoSiteModel"/>
        <svsGeneralSubstitutionModel idref="originModel"/>
    </ancestralTreeLikelihood>

The additional parameters we need to sample are the geoSiteModel.mu and rates. Therefore, complete the operator list with two scaleOperator elements, one on geoSiteModel.mu and one on the rates; allow both to have a relatively high weight, in particular the operator on the rates. The two operators at the bottom are required for BSSVS, but commented out for the time being.

	<operators id="operators" optimizationSchedule="log">
		.
		.
		.
		<!-- discrete state model operators -->
		<scaleOperator scaleFactor="0.75" weight="10">
			<parameter idref="geoSiteModel.mu"/>
		</scaleOperator>
		<scaleOperator scaleFactor="0.75" weight="30" scaleAllIndependently="true" autoOptimize="true">
			<parameter idref="rates"/>
		</scaleOperator>

		<!-- BSSVS operators	-->
		<!--
		<bitFlipOperator weight="30">
			<parameter idref="indicators"/>
		</bitFlipOperator>
		<bitFlipInSubstitutionModelOperator scaleFactor="0.75" weight="30" autoOptimize="true">
			<svsGeneralSubstitutionModel idref="originModel"/>
			<parameter idref="geoSiteModel.mu"/>
		</bitFlipInSubstitutionModelOperator>             -->
	</operators>

For the same parameters, we need to specify priors. To this purpose, we propose a gammaPrior (with shape = scale = 1.0) for the rates and an exponentialPrior (mean = 1) for geoSiteModel.mu. The additional poissonPrior is commented out and only used for BSSVS. Also put the model reference in the prior: <svsGeneralSubstitutionModel idref="originModel"/>. 

			<prior id="prior">
				.
				.
				.
				<cachedPrior>
					<gammaPrior shape="1.0" scale="1.0" offset="0.0">
						<parameter idref="rates"/>
					</gammaPrior>
					<parameter idref="rates"/>
				</cachedPrior>
				<exponentialPrior mean="1" offset="0">
					<parameter idref="geoSiteModel.mu"/>
				</exponentialPrior>
				<!-- BSSVS truncated Poisson prior	-->
				<!--
				<poissonPrior mean="0.693" offset="6.0">
					<statistic idref="nonZeroRates"/>
				</poissonPrior>
				-->
				<svsGeneralSubstitutionModel idref="originModel"/>
			</prior>

The additional phylogenetic likelihood we specified above needs to be added to the likelihood element.

			<likelihood id="likelihood">
				<treeLikelihood idref="treeLikelihood"/>
				<ancestralTreeLikelihood idref="geoTreeLikelihood"/>
			</likelihood>

Because the rates and indicators vectors may contain many parameters, we log those in a separate log file together with the geoSiteModel.mu, which completes the CTMC rate matrix:

		 <log id="rateMatrixLog" logEvery="10000" fileName="H5N1_HA_discrete_rateMatrix.log">
			<parameter idref="rates"/>
			<parameter idref="indicators"/>
   			<sumStatistic idref="nonZeroRates"/>
   			<parameter idref="geoSiteModel.mu"/>
		</log>

Note that in the standard model all indicators will be fixed at their initial value (=1), but logging those will be useful for BSSVS. For diagnostic purposes, we can also add ancestralTreeLikelihood and geoSiteModel.mu to the regular parameter log file.

		<log id="fileLog" logEvery="10000" fileName="H5N1_HA_discrete.log">
			.
			.
			.
			<parameter idref="geoSiteModel.mu"/>
			<ancestralTreeLikelihood idref="geoTreeLikelihood"/>
		</log>

Finally, to annotate the trees with with the discrete reconstructions using the tree-traversal alorithm we need to add ancestralTreeLikelihood to the logTree element:

		<logTree id="treeFileLog" logEvery="10000" nexusFormat="true" fileName="H5N1_HA_discrete.trees" sortTranslationTable="true">
			<treeModel idref="treeModel"/>
			<discretizedBranchRates idref="branchRates"/>
			<!-- this entry annotates trees with the discrete reconstructions -->
			<ancestralTreeLikelihood idref="geoTreeLikelihood"/>
			<posterior idref="posterior"/>
		</logTree>

Return to Tutorials

Philippe Lemey, 2009