
Benchmarks

This is a page where BEAST users can post benchmarks for their various machines. It is intended only as a rough guide to which combinations of JVM/processor/OS work best. The test files are the benchmark1.xml and benchmark2.xml example files (in /examples/Benchmarks/ in the BEAST distribution).

We have 2 benchmark files: the first, benchmark1.xml, is a very large tree with a modest amount of sequence data for each tip, and the second, benchmark2.xml, is a modest-sized tree with a large amount of data per tip.


Benchmark1 results 

The file benchmark1.xml contains 1441 influenza HA1 gene fragments (HA1 is a subdomain of the haemagglutinin gene). The model is an HKY substitution process with a constant size coalescent tree prior. This alignment has 593 unique site patterns (a total of 854,513 nucleotides).
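The quoted total appears to be the number of sequences multiplied by the number of unique site patterns, not by the full alignment length. A quick check of that reading follows; the function name is illustrative only and is not part of BEAST.

```python
def pattern_cells(num_sequences: int, num_unique_patterns: int) -> int:
    """Number of nucleotide states in the compressed (unique-pattern) data matrix."""
    return num_sequences * num_unique_patterns


# benchmark1.xml: 1441 sequences, 593 unique site patterns
print(f"{pattern_cells(1441, 593):,}")  # 854,513
```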

To reduce stochastic variation, all benchmarks are run using '-seed 666' so that, in theory, they perform an identical sequence of computations. However, a single precision implementation may not do so: rounding error can change the outcome of an acceptance/rejection step of the MCMC, causing the chain to take a different path.
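A minimal sketch of how this can happen, assuming a plain Metropolis test with a symmetric proposal. The log-likelihood values and the random draw are hypothetical, and only the log-likelihoods are rounded to single precision here, which is a simplification of what a real single precision likelihood core does.

```python
import math
import struct


def to_single(x: float) -> float:
    """Round a Python float (double precision) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accept(log_ratio: float, u: float) -> bool:
    """Metropolis test for a symmetric proposal: accept when log(u) < log_ratio."""
    return math.log(u) < log_ratio


# Hypothetical log-likelihoods of the current and the proposed state.
current_ll, proposed_ll = -3770.0, -3770.0001
u = 0.99995  # the same uniform random draw in both cases (same seed)

double_ratio = proposed_ll - current_ll
# Near 3770 single precision values are about 0.00024 apart,
# so the 0.0001 difference is rounded away.
single_ratio = to_single(proposed_ll) - to_single(current_ll)

print(accept(double_ratio, u))  # False: the move is rejected
print(accept(single_ratio, u))  # True: the same move is accepted
```

From that step on the two chains are in different states, so their later computations no longer match even though the seed is the same.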

Table 1.1 different implementations on the same hardware 

Time (minutes) | BEAST version | Java JVM | Operating System | Processor model | Processor speed | Computer model | Date | Tester | Notes
4.50 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore/NVidia GTX285 GPU | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | using BEAGLE: -beagle_gpu
4.55 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | using BEAGLE: -beagle_sse
4.63 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | using BEAGLE: -beagle_cpu
9.17 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | using BEAST native library

Notes & Analysis 

The use of BEAGLE provides considerable improvements in efficiency. The core computation for the option '-beagle_cpu' is extremely similar to the BEAST native library but involves only one JNI call per evaluation (rather than one for each node - 1440 of them - for BEAST native library). The SSE BEAGLE plugin provides a very modest improvement in speed for this benchmark - compare this with benchmark2.xml which has longer sequences. 

Using the GPU (a GeForce GTX 285 chip with 240 single precision cores) makes almost no difference in performance over the SSE implementation. This is presumably because of the per-call performance cost of using the GPU: with only 593 independent site patterns, it simply can't keep all the cores working flat-out.
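The relative speed-ups behind these remarks can be recomputed from the run times in Table 1.1. This is only a small sketch; the dictionary keys are labels chosen here for readability.

```python
# Run times in minutes, copied from Table 1.1.
times = {
    "BEAST native library": 9.17,
    "-beagle_cpu": 4.63,
    "-beagle_sse": 4.55,
    "-beagle_gpu": 4.50,
}


def speedup(baseline_minutes: float, minutes: float) -> float:
    """How many times faster a run is than the baseline run."""
    return baseline_minutes / minutes


baseline = times["BEAST native library"]
for name, minutes in times.items():
    print(f"{name:22s} {minutes:5.2f} min  x{speedup(baseline, minutes):.2f}")
```

All three BEAGLE options come out at roughly twice the speed of the native library, and the differences between them are only a few percent.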

Table 1.2 miscellaneous comparisons - benchmark1.xml 

593 unique site patterns, 1441 sequences. These benchmarks are run for 100,000 steps to increase resolution.

Time (minutes) | BEAST version | Java JVM | Operating System | Processor model | Processor speed | Computer model | Date | Tester | Notes
4.012 | 1.8pre | 1.7.0_17-b02 | RedHat Linux | Intel(R) Xeon(R) X5690 / NVidia Tesla K20 | 3.47GHz, 12 cores / 0.7GHz, 2496 cores | Dell Precision R5500 | 23/08/13 | Andrew Rambaut | using BEAGLE: -beagle_gpu -beagle_double -beagle_scaling dynamic
4.23 | 1.8pre | 1.7.0_17-b02 | RedHat Linux | Intel(R) Xeon(R) X5690 | 3.47GHz, 12 cores | Dell Precision R5500 | 23/08/13 | Andrew Rambaut | using BEAGLE: -beagle_sse -beagle_scaling dynamic
4.49 | 1.8pre | JSE 1.6.0_51 | Mac OS X 10.8.4 | Intel(R) Xeon(R) X5670 | 2.93GHz | Apple MacPro5,1 | 06/05/13 | Andrew Rambaut | using BEAGLE: -beagle_sse -beagle_scaling dynamic

Notes & Analysis 

These are other comparisons between architectures and OSs. The benchmarks are run for 100,000 steps, hence the proportionally longer runtimes.

 

Benchmark2 Results 

The file benchmark2.xml performs 0.1 million steps using GTR+gamma for a large data set consisting of 62 carnivore protein-coding mitochondrial genomes of length 10867 bp. There are 5565 unique site patterns (a total of 345,030 nucleotides). 

Table 2.1 different implementations on the same hardware 

Time (minutes) | BEAST version | Java JVM | Operating System | Processor model | Processor speed | Computer model | Date | Tester | Notes
0.84 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore/NVidia GTX285 GPU | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | using BEAGLE: -beagle_gpu
1.14 | 1.7Pre | 1.6.0_26 | Mac OS X 10.7.1 | Intel(R) Xeon(R) 6Core/NVidia Quadro 7000 GPU | 2.93GHz | Apple MacPro5,1 | 6/10/11 | Andrew Rambaut | using BEAGLE: -beagle_gpu
1.18 | 1.7Pre | 1.6.0_26 | Mac OS X 10.7.1 | Intel(R) Xeon(R) 6Core/NVidia Quadro 7000 GPU | 2.93GHz | Apple MacPro5,1 | 6/10/11 | Andrew Rambaut | using BEAGLE: -beagle_gpu -beagle_double
4.74 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | using BEAGLE: -beagle_sse
5.94 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | using BEAGLE: -beagle_cpu
7.08 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | using BEAST native library

Notes & Analysis 

The use of BEAGLE still provides a considerable speed improvement, but the tree is much smaller so the saving in JNI call overhead is smaller. The BEAGLE SSE plugin gives a further speed improvement because of the large number of site patterns. The SSE extensions should do 2 double precision calculations at once and so have a theoretical 2-fold improvement in speed. The 1.25-fold speed-up over '-beagle_cpu' on such large vectors suggests improvements could be made to this code.

In this case, use of the GeForce GTX285 gives a more than 5-fold increase in speed over the SSE implementation, showing that with sufficient independent site patterns the 240 cores can begin to make a real difference.

The Quadro 4000 has a 1GHz clock speed compared with the GTX285's 1.5GHz but is able to do double precision almost as fast as single.

Table 2.2 comparisons of GPGPU boards 

Time (minutes) | BEAST version | Java JVM | Operating System | Processor model | Processor speed | Computer model | Date | Tester | Notes
0.84 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore/NVidia GTX285 GPU | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | GTX285: 240 core, 1.48GHz, 1GB
1.14 | 1.7pre | 1.6.0_26 | Mac OS X 10.7.1 | Intel(R) Xeon(R) QuadCore/NVidia Quadro 7000 GPU | 2.93GHz | Apple MacPro5,1 | 6/10/11 | Andrew Rambaut | Q7000: 256 core, 0.95GHz, 2GB
4.99 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore/NVidia GT120 | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | GT120: 32 core, 1.4 GHz, 512MB

Notes & Analysis 

The GeForce GT120 has only 32 cores and, even with the long sequences fully saturating these, this GPU is slower than the SSE implementation on the CPU (4.99 versus 4.74 minutes in Table 2.1), although still faster than the plain '-beagle_cpu' implementation (5.94 minutes).

Table 2.3 multithreading comparisons 

Time (minutes) | BEAST version | Java JVM | Operating System | Processor model | Processor speed | Computer model | Date | Tester | Notes
1.80 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | using BEAGLE: -beagle_sse -beagle_instances 4
2.21 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | using BEAGLE: -beagle_sse -beagle_instances 3
2.71 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | using BEAGLE: -beagle_sse -beagle_instances 2
4.74 | 1.6.0 | 1.6.0_20 | Mac OS X 10.5.4 | Intel(R) Xeon(R) QuadCore | 2.93GHz | Apple MacPro4,1 | 28/08/10 | Andrew Rambaut | using BEAGLE: -beagle_sse

Notes & Analysis 

Here the runs make use of a feature of the BEAGLE options in BEAST which allows you to split the data into partitions of equal size and distribute them amongst instances of the BEAGLE library, using multithreading to evaluate them concurrently. The speed-ups for 2, 3 and 4 instances (x1.75, x2.14 & x2.63) are not as high as one might expect, but the feature is worth using if multiple cores are available.
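One way to see how far these figures fall short of ideal scaling is to divide each speed-up by the number of instances. This quantity is commonly called parallel efficiency (a term not used elsewhere on this page), and the sketch below simply applies it to the Table 2.3 run times.

```python
# Run times in minutes from Table 2.3, keyed by the number of BEAGLE instances.
minutes_by_instances = {1: 4.74, 2: 2.71, 3: 2.21, 4: 1.80}


def parallel_efficiency(serial_minutes: float, parallel_minutes: float, instances: int):
    """Return (speed-up, efficiency), where efficiency = speed-up / instances."""
    speedup = serial_minutes / parallel_minutes
    return speedup, speedup / instances


serial = minutes_by_instances[1]
for instances, minutes in sorted(minutes_by_instances.items()):
    s, e = parallel_efficiency(serial, minutes, instances)
    print(f"{instances} instance(s): x{s:.2f} speed-up, {e:.0%} efficiency")
```

Efficiency drops from about 87% with 2 instances to about 66% with 4, so each extra instance contributes less than the one before.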

Table 2.4 miscellaneous comparisons - benchmark2.xml 

5565 unique site patterns, 62 sequences. These benchmarks are run for 1,000,000 steps to increase resolution.

Time (minutes) | BEAST version | Java JVM | Operating System | Processor model | Processor speed | Computer model | Date | Tester | Notes
5.13 | 1.8pre | 1.7.0_17-b02 | RedHat Linux | Intel(R) Xeon(R) X5690 / NVidia Tesla K20 | 3.47GHz, 12 cores / 0.7GHz, 2496 cores | Dell Precision R5500 | 23/08/13 | Andrew Rambaut | using BEAGLE: -beagle_gpu -beagle_double
5.26 | 1.8pre | 1.7.0_17-b02 | RedHat Linux | Intel(R) Xeon(R) X5690 / 2 x NVidia Tesla K20 | 3.47GHz, 12 cores / 0.7GHz, 2496 cores | Dell Precision R5500 | 23/08/13 | Andrew Rambaut | using BEAGLE: -beagle_gpu -beagle_double -beagle_instances 2
24.80 | 1.8pre | 1.7.0_17-b02 | RedHat Linux | Intel(R) Xeon(R) X5690 | 3.47GHz, 12 cores | Dell Precision R5500 | 23/08/13 | Andrew Rambaut | using BEAGLE: -beagle_sse
26.20 | 1.8pre | JSE 1.6.0_45 | Mac OS X 10.8.3 | Intel(R) Xeon(R) X5670 | 2.93GHz | Apple MacPro5,1 | 06/05/13 | Andrew Rambaut | using BEAGLE: -beagle_sse -beagle_scaling dynamic
31.86 | 1.8pre | 1.6.0-openjdk.x86_64 | RedHat Linux | Intel(R) Xeon(R) CPU E5-4617 | 2.90GHz | Dell PowerEdge 820 | 06/05/13 | Andrew Rambaut | using BEAGLE: -beagle_sse -beagle_scaling dynamic

Notes & Analysis 

These are other comparisons between architectures and OSs. The benchmarks are run for 1,000,000 steps, hence the proportionally longer runtimes. An interesting comparison is between a single K20 (which has 2496 cores) handling all 5565 site patterns and two K20s handling about 2782 site patterns each. The latter is marginally slower, showing the need to saturate the available cores to gain maximum performance.
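As a crude illustration of the saturation argument, the ratio of site patterns to GPU cores can be computed for the two configurations. This ratio is only a rough proxy chosen here for illustration; it ignores everything else that affects GPU throughput.

```python
K20_CORES = 2496        # cores per Tesla K20, from Table 2.4
TOTAL_PATTERNS = 5565   # unique site patterns in benchmark2.xml


def patterns_per_core(patterns: float, cores: int) -> float:
    """Site patterns available per GPU core."""
    return patterns / cores


one_gpu = patterns_per_core(TOTAL_PATTERNS, K20_CORES)
two_gpus = patterns_per_core(TOTAL_PATTERNS / 2, K20_CORES)  # patterns split evenly
print(f"1 x K20: {one_gpu:.2f} patterns per core")
print(f"2 x K20: {two_gpus:.2f} patterns per core")
```

Splitting the data halves the work available to each card, while the per-call cost of using each GPU remains.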

Older results 

These are some results for an older form of benchmark2.xml, which had 3 data partitions. These results are not directly comparable to the benchmark2.xml results above.

This performs 0.1 million steps using GTR+gamma for each codon position of a large data set consisting of 62 carnivore protein-coding mitochondrial genomes of length 10867 bp. The unique site patterns are divided into 3 equal-sized subsets to allow comparison of threading and other parallel technologies. Using the '-threads 3' command line option will calculate the likelihood of each of these partitions in a separate thread on multi-core/multi-processor machines. This will only produce a proportional speed-up if the number of site patterns in each subset is equal, i.e., the threads are load-balanced. The benchmark2.xml file shows how to break up a set of patterns into equally sized chunks using the 'patternSubSet' element.
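The load-balancing idea can be sketched in a few lines. This is an illustration of splitting a list into near-equal subsets by dealing items out in turn; it is not a description of how the 'patternSubSet' element is implemented, and the function name is hypothetical.

```python
def split_patterns(patterns, num_subsets):
    """Deal patterns out in turn so that subset sizes differ by at most one."""
    return [patterns[i::num_subsets] for i in range(num_subsets)]


patterns = list(range(10))  # stand-in for a list of unique site patterns
subsets = split_patterns(patterns, 3)
print([len(s) for s in subsets])  # [4, 3, 3]
```

With subsets of near-equal size, each thread has about the same amount of likelihood work per step, which is the condition for a proportional speed-up described above.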

Time (minutes) | BEAST version | Java JVM | Operating System | Processor model | Processor speed | Computer model | Date | Tester | Notes
0.70 | 1.6.0 | 1.6.0_21 | Ubuntu 10.0.4 | Intel(R) Xeon(R) CPU X5670/NVidia Tesla C2050 | 2.93GHz | Dell Precision T7500 | 24/08/10 | Andrew Rambaut | using BEAGLE library -beagle_gpu (on Tesla C2050)
3.83 | 1.6.0 | 1.6.0_21 | Ubuntu 10.0.4 | Intel(R) Xeon(R) CPU X5670 | 2.93GHz | Dell Precision T7500 | 24/08/10 | Andrew Rambaut | using BEAGLE library -beagle_sse
4.80 | 1.5.4 with '-threads 3' | 1.6.0_16 | CentOS 5.4 | Intel(R) Xeon(R) CPU E5462 | 2.80GHz | SGI Altix XE 1300 | 24/08/10 | David Schibeci | libNucleotideLikelihoodCore.so compiled with ICC 10.1.025
5.84 | 1.5 with '-threads 3' | 1.6.0_07 | OSX 10.5.6 | Intel Xeon "Nehalem" | 2.93GHz | Apple Mac Pro Dual 4-core | 3/08/09 | Andrew Rambaut | This is about 2.5-fold faster than the single threaded run (the 14.54 minute run, below).
10.27 | 1.5.4 | 1.6.0_16 | CentOS 5.4 | Intel(R) Xeon(R) CPU E5462 | 2.80GHz | SGI Altix XE 1300 | 24/08/10 | David Schibeci | libNucleotideLikelihoodCore.so compiled with ICC 10.1.025
13.66 | 1.5 | 1.6.0-dp | OSX 10.4.10 | Intel Xeon "Woodcrest" | 3.0GHz | Apple Xserve 4-core | 3/08/09 | Andrew Rambaut |
14.54 | 1.5 | 1.6.0_07 | OSX 10.5.6 | Intel Xeon "Nehalem" | 2.93GHz | Apple Mac Pro Dual 4-core | 3/08/09 | Andrew Rambaut |

benchmark.xml 

This file is now called old_benchmark.xml

This performs 10 million steps using HKY+gamma, constant size coalescent on 17 sequences of dengue 4 virus envelope genes of length 1485 nucleotides (138 unique site patterns).

Time (minutes) | BEAST version | Java JVM | Operating System | Processor model | Processor speed | Computer model | Date | Tester | Notes
12.73 | 1.5.1 | 1.6.0_16 | CentOS 5.2 | Intel(R) Xeon(R) CPU E5462 | 2.80GHz | SGI Altix XE 1300 | 9/09/09 | David Schibeci | libNucleotideLikelihoodCore.so compiled with ICC 10.1.022
14.23 | 1.5.1 | 1.6.0_16 | CentOS 5.2 | Intel(R) Xeon(R) CPU E5462 | 2.80GHz | SGI Altix XE 1300 | 9/09/09 | David Schibeci |
14.89 | 1.5beta3 | 1.6.0-dp | OSX 10.4.10 | Intel Xeon "Woodcrest" | 3.0GHz | Apple Xserve 4-core | 13/04/09 | Andrew Rambaut |
15.16 | 1.4.4 | 1.6.0-dp | OSX 10.4.10 | Intel Xeon "Woodcrest" | 3.0GHz | Apple Xserve 4-core | 31/07/07 | Andrew Rambaut |
16.94 | 1.4.8 | 1.6.0_16 | CentOS 5.2 | Intel(R) Xeon(R) CPU E5462 | 2.80GHz | SGI Altix XE 1300 | 9/09/09 | David Schibeci | libNucleotideLikelihoodCore.so compiled with ICC 10.1.022
16.97 | 1.4.4 | 1.6.0-dp | OSX 10.4.10 | Intel Xeon "Clovertown" | 3.0GHz | Apple MacPro 8-core | 11/07/07 | Andrew Rambaut | The developers' preview of Java 1.6 on Mac provides a significant improvement
17.42 | 1.4.6 | 1.6.0_03-b05 | Ubuntu 7.10 | Intel Xeon "Clovertown" X5355 | 2.66GHz | Dell Precision 490 8-Core | 11/12/07 | Joseph Heled |
17.52 | 1.4.4 | 1.5.0_07 | OSX 10.4.10 | Intel Xeon "Woodcrest" | 3.0GHz | Apple Xserve 4-core | 31/07/07 | Andrew Rambaut | v1.4.4 computes the coalescent likelihood faster
17.72 | 1.4.8 | 1.6.0_16 | CentOS 5.2 | Intel(R) Xeon(R) CPU E5462 | 2.80GHz | SGI Altix XE 1300 | 9/09/09 | David Schibeci |
21.21 | 1.4.2 | 1.5.0_06 | OSX 10.4.9 | Intel Xeon "Clovertown" | 3.0GHz | Apple MacPro 8-core | 7/05/07 | Andrew Rambaut |
21.34 | 1.4.0 | 1.5.0_06 | OSX 10.4.x | Intel Xeon 'Woodcrest' | 3.0GHz | Apple MacPro 4-core | ? | Andrew Rambaut |
26.45 | 1.4.0 | 1.5.0_06 (server) | Redhat | Opteron | 2.4GHz | Unbranded cluster | ? | Andrew Rambaut |
27.42 | 1.4.2 | 1.5.0_07 | OSX 10.4.8 | Intel Core 2 Duo | 2.33GHz | Apple MacBookPro | 27/04/07 | Andrew Rambaut |
25.28 | 1.4.6 | 1.5.0.11 | Ubuntu 7.04 | Intel(R) Core(TM)2 CPU T7400 | 2.16GHz | Dell Latitude D820 | 15/11/07 | Joseph Heled |
29.83 | 1.4.2 | 1.5.0_07 | Mac OS X 10.4.8 | G5 | 2.5GHz | Apple G5 Quad | ? | Andrew Rambaut |
34.52 | 1.4.2 | 1.5.0_07 | OSX 10.4.10 | Intel Core Duo | 2.0GHz | Apple MacBookPro | 6/07/07 | Alexei Drummond | how sad is that :-(

  • Please give the exact BEAST version as given in the title when the program is run
  • To obtain the Java Virtual Machine version, type java -version
  • Operating System means Windows/Mac OS X/Linux etc. Please give exact version (and for Linux give the distribution).
  • For computer model please give details like Quad-core etc.

Notes

All benchmarks should be made using an unmodified version of the benchmark.xml BEAST file, which can be found in the examples folder of the BEAST package. In addition, the benchmark should be run from the command line. For UNIX/Linux or Mac OS X, please download the UNIX version, open a terminal and change directory into the lib/ directory in the installed package. Then type:

java -Xmx128M -jar beast.jar -seed 666 ../examples/benchmark.xml 

On Windows, a similar command can be run by bringing up a command-line (perhaps someone who knows about such things could post it here?).

The reason for using the command-line is to be able to specify the random number seed. This will ensure that each run will perform exactly the same sequence of calculations (i.e., the starting tree will be the same and the MCMC will propose exactly the same moves) which will make it easier to compare the computational performance.
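If it is convenient to record the elapsed time from a script, the run can be wrapped as in the sketch below. It assumes the command given above is run from the lib/ directory; the helper name is hypothetical, and the wall-clock figure it returns includes JVM start-up and XML parsing, so it may differ slightly from any run time the program itself reports.

```python
import subprocess
import time

# The command given above, split into arguments; run it from the lib/ directory.
BENCHMARK_CMD = ["java", "-Xmx128M", "-jar", "beast.jar",
                 "-seed", "666", "../examples/benchmark.xml"]


def time_command(cmd):
    """Run cmd to completion and return the elapsed wall-clock time in minutes."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError if the run fails
    return (time.perf_counter() - start) / 60.0


# Example use (uncomment to start the benchmark):
# print(f"{time_command(BENCHMARK_CMD):.2f} minutes")
```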

In order to check that the run is the same, you can compare the initial output to the screen (after the title information) to this:

                  BEAST v1.4.2, 2002-2007
       Bayesian Evolutionary Analysis Sampling Trees
                             by
           Alexei J. Drummond and Andrew Rambaut

               Department of Computer Science
                   University of Auckland
                  alexei@cs.auckland.ac.nz

             Institute of Evolutionary Biology
                  University of Edinburgh
                     a.rambaut@ed.ac.uk

Downloads, Help & Resources:
        http://beast.bio.ed.ac.uk/

Source code distributed under the GNU Lesser General Public License:
        http://code.google.com/p/beast-mcmc/

Additional programming & components created by:
        Roald Forsberg
        Gerton Lunter
        Sidney Markowitz
        Oliver Pybus

Thanks to (for use of their code):
        Korbinian Strimmer

Random number seed: 666

MacRoman
Parsing XML file: benchmark.xml
Read alignment, 'alignment':
  Sequences = 17
      Sites = 1485
   Datatype = nucleotide
Site patterns 'patterns' created from positions 1-1485 of alignment 'alignment'
  pattern count = 138
Creating the tree model, 'treeModel'
  initial tree topology = ((((((D4Indon76,D4Indon77),D4SLanka78),D4ElSal94),(D4Mexico84,D4Philip64)),((D4PRico86,D4Philip56),D4Thai63)),(((((D4Philip84,D4Thai84),D4Brazi82),(D4NewCal81,D4Thai78)),D4Tahiti85),(D4ElSal83,D4Tahiti79)))
Using strict molecular clock model.
Creating HKY substitution model. Initial kappa = 1.0
Creating site model: 
  4 category discrete gamma with initial shape = 0.5
TreeLikelihood using native nucleotide likelihood core
  Ignoring ambiguities in tree likelihood.
  Partial likelihood scaling off.
Branch rate model used: strictClockBranchRates
Creating the MCMC chain:
  chainLength=10000000
  autoOptimize=true

Pre-burnin (100000 states)
0              25             50             75            100
|--------------|--------------|--------------|--------------|
*************************************************************

state   Posterior       Root Height     L(tree)         L(coalecent)
0       -3,840.5443     70.3277         -3,770.5866     -69.9577    
10000   -3,850.8894     64.3348         -3,778.9309     -71.9585 

The final state should be:

10000000        -3,842.7260     69.6966         -3,771.7167     -71.0093

Your run should produce identical results. Check that the output includes the line TreeLikelihood using native nucleotide likelihood core. This specifies that BEAST is using the compiled 'C' native library which will produce the quickest results for the given platform. If you have any doubts please contact us rather than posting possibly misleading results (or post the results on the 'discussion' section for this page - see the tab above).
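These two checks can be automated once the screen output has been captured as text. The sketch below assumes the output is available as a single string; the function and constant names are hypothetical, and the expected values are those quoted above.

```python
NATIVE_CORE_LINE = "TreeLikelihood using native nucleotide likelihood core"
# Final state quoted above: state, posterior, root height, tree likelihood, coalescent likelihood.
EXPECTED_FINAL = ["10000000", "-3,842.7260", "69.6966", "-3,771.7167", "-71.0093"]


def check_run(screen_output: str) -> dict:
    """Report whether the native core was used and the final state matches."""
    rows = [line.split() for line in screen_output.splitlines()]
    return {
        "native_core": NATIVE_CORE_LINE in screen_output,
        "final_state_matches": EXPECTED_FINAL in rows,
    }
```

A run is comparable with the results on this page only when both entries are True.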