# BAMBE

## Bayesian Analysis in Molecular Biology and Evolution

### Version 2.03 beta, January 2001

© Copyright 2000, 2001, Donald Simon & Bret Larget, Department of Mathematics and Computer Science, Duquesne University.

### MCMC convergence

There is no guarantee that the algorithms here will correctly sample from the desired posterior distributions in a finite run. While theory says that long-run frequencies will converge to the desired posterior probabilities, inferences from insufficiently long runs can be biased if the chain has not moved far enough from its initial state to reach approximate stationarity. Obtaining consistent results from several different starting points with different random seeds is a minimum criterion for having confidence that the numerical results BAMBE provides are close to the true posterior values.
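One common way to quantify the agreement between runs started from different seeds is the Gelman-Rubin potential scale reduction statistic, which compares between-run and within-run variance of a scalar trace such as the log posterior density. This is a minimal sketch on synthetic chains, not a BAMBE feature:

```python
import random
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for several chains of
    equal length.  Values near 1.0 suggest the chains agree."""
    m = len(chains)               # number of chains
    n = len(chains[0])            # samples per chain
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    # Between-chain variance B and average within-chain variance W.
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    w = statistics.fmean([statistics.variance(c) for c in chains])
    var_hat = (n - 1) / n * w + b / n
    return (var_hat / w) ** 0.5

# Three synthetic "runs" drawn from the same distribution stand in for
# log posterior traces from BAMBE runs with different random seeds.
random.seed(1)
chains = [[random.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
rhat = gelman_rubin(chains)
print(round(rhat, 3))  # close to 1.0 when the runs agree
```

Values well above 1.0 for the log posterior density or any parameter suggest the runs have not yet converged to the same region.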

We generally follow a procedure like the following when analyzing new data sets.

1. Categorize the sites in a biologically relevant way.
2. Complete one short run with the `global` algorithm during burn-in and the `local` algorithm in the main loop, updating parameters every cycle and running long enough to reach approximate stationarity.
3. Graph the .lpd file. (The free software Gnuplot or packages such as S-PLUS or MATLAB are good choices.)
• If the graph does not finish in a plateau, the run was too short.
• Note the approximate level of the final plateau.
4. Look at the .out file.
• Verify that the number of unique sites, number of constant sites, and nucleotide base distributions are reasonable.
• Look at differences in the number of constant sites and in nucleotide base distributions across categories.
5. Look at the .par file.
• Adjust the initial parameter values in the rc file so they are approximately equal to the final values achieved by the simulation.
• Note the tree and parameter acceptance rates. It is good if they are between about 15% and 40%. Tuning parameters may need adjustment, but we will fine-tune them later.
6. Beginning at the last tree from the initial run, make several short runs (`burn=0`, `cycles=1000`) with `update-kappa=true` and all other parameter updating turned off. Adjust `kappa-delta` until the parameter acceptance rate is between about 50% and 70%. Values from 0.05 to 0.2 are usually acceptable.
7. Beginning at the last tree from the initial run, make several short runs (`burn=0`, `cycles=1000`) with `update-theta=true` and all other parameter updating turned off. Adjust `theta-const` until the parameter acceptance rate is between about 50% and 70%.
8. Beginning at the last tree from the initial run, make several short runs (`burn=0`, `cycles=1000`) with `update-pi=true` and all other parameter updating turned off. Adjust `pi-const` until the parameter acceptance rate is between about 50% and 70%. Values in the thousands are usually acceptable.
9. Beginning at the last tree from the initial run, make one slightly longer run (`burn=0`, `cycles=5000`) with all parameter updating turned on. The parameter acceptance rate should be between 15% and 40%. You may need to go back and adjust the tuning parameters.
10. With the run control file up to date with tuning parameters and initial values, make several (three or four) runs from random starts for several thousand cycles (perhaps `burn=5000` and `cycles=50000`).
11. Plot all .lpd files to make sure the plateaus are in about the same place. Determine a common number of cycles to discard for all runs so that each run is well into its plateau.
12. Run `meansd` on the tail of the .par files to check that the plateaus are actually similar and to readjust initial values for parameters.
13. Run `summarize` on the different .top files (discarding many initial sample points) to check if the same tree topologies are being selected at roughly the same posterior probabilities.
14. Examine the .sum files after running `summarize` to see if named clades are reasonably defined. Resummarize if necessary. Look at transition tables between subtree topologies to see if mixing is similar in different runs and adequate.
15. If at this point you find that the runs are converging to very different places, you may wish to start from a non-random tree found by another method. It may also be that the algorithms in BAMBE are inadequate for your data.
16. If all seems in order, it is time to do several (three or four) longer runs to save for inference. The sample rate should be as small as possible. The ultimate size of the files you produce should be the limiting factor. We often produce files with 200,000 tree topologies, and there is no reason not to make them bigger. They may be compressed later or discarded after the essential summarization has occurred. The runs should be long enough to achieve the accuracy you desire. You can judge time requirements from the previous exploratory runs.
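The plateau inspection in steps 3 and 11 can also be done numerically. The sketch below assumes the .lpd trace has already been read into a list of log posterior density values, one per sampled cycle (an assumption about the file's contents; check your own output), and simply compares the means of the last two windows:

```python
import random

def in_plateau(lpd, window=500, tol=2.0):
    """Crude plateau check for a log posterior trace: compare the means
    of the last two windows.  A large difference suggests the run was
    still climbing and is therefore too short."""
    if len(lpd) < 2 * window:
        return False
    a = sum(lpd[-2 * window:-window]) / window
    b = sum(lpd[-window:]) / window
    return abs(b - a) < tol

# Synthetic trace: a burn-in climb followed by noise around a plateau,
# standing in for the log posterior column of a BAMBE .lpd file.
random.seed(0)
trace = [-5000.0 + 3.0 * min(t, 1000) + random.gauss(0, 5) for t in range(4000)]
print(in_plateau(trace))         # True: the tail is flat
print(in_plateau(trace[:1200]))  # False: still climbing at the end
```

Applying the same check to all runs, with a common window, is one way to choose the common number of cycles to discard in step 11.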
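Steps 6-8 all follow the same pattern: rerun with one update enabled, then nudge the tuning constant until the acceptance rate lands in the 50-70% band. The sketch below illustrates that loop on a toy random-walk Metropolis sampler, with a hypothetical proposal half-width `delta` standing in for a constant like `kappa-delta`; BAMBE's actual proposal mechanisms differ:

```python
import math
import random

def acceptance_rate(delta, cycles=2000, seed=0):
    """Acceptance rate of a toy random-walk Metropolis sampler on a
    standard normal target, with proposal half-width delta.  This
    mimics the effect of a tuning constant such as kappa-delta."""
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(cycles):
        y = x + rng.uniform(-delta, delta)
        # Accept with probability min(1, target(y)/target(x)).
        if rng.random() < math.exp(min(0.0, (x * x - y * y) / 2.0)):
            x, accepted = y, accepted + 1
    return accepted / cycles

# Shrink or grow delta until the short-run acceptance rate falls in
# the 50-70% band recommended for the single-parameter tuning runs.
delta = 4.0
rate = acceptance_rate(delta)
while not 0.5 <= rate <= 0.7:
    delta *= 0.9 if rate < 0.5 else 1.1
    rate = acceptance_rate(delta)
print(round(delta, 2), round(rate, 2))
```

Larger proposal steps lower the acceptance rate and smaller steps raise it, which is why this multiplicative adjustment settles into the band after a few short runs.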
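The `meansd` check in step 12 amounts to computing the mean and standard deviation of each parameter column after discarding the pre-plateau samples. A minimal stand-in on a synthetic kappa trace (the .par file layout is not assumed here; the trace is generated in place):

```python
import random
import statistics

def meansd_tail(samples, discard):
    """Mean and standard deviation of a parameter trace after dropping
    the first `discard` samples, mimicking running meansd on the tail
    of a .par file."""
    tail = samples[discard:]
    return statistics.fmean(tail), statistics.stdev(tail)

# Synthetic kappa trace: early samples still drifting, later samples
# fluctuating around 4.0 (a stand-in for one column of a .par file).
random.seed(2)
kappa = [2.0 + t / 500.0 for t in range(1000)]          # drifting burn-in
kappa += [random.gauss(4.0, 0.2) for _ in range(4000)]  # stationary tail
mean, sd = meansd_tail(kappa, discard=1000)
print(round(mean, 2), round(sd, 2))
```

Tail means that disagree across runs by much more than their standard deviations are a sign the plateaus are not actually the same.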
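The cross-run comparison in step 13 boils down to tabulating how often each topology appears after the initial samples are discarded. A sketch using hypothetical Newick-style topology strings in place of real .top file contents:

```python
from collections import Counter

def topology_probabilities(topologies, discard):
    """Estimated posterior probability of each sampled tree topology
    after discarding an initial stretch of the chain, mimicking what
    summarize reports from a .top file."""
    kept = topologies[discard:]
    counts = Counter(kept)
    return {top: n / len(kept) for top, n in counts.most_common()}

# Hypothetical topology strings standing in for the contents of two
# .top files from independent runs.
run1 = ["((A,B),(C,D));"] * 700 + ["((A,C),(B,D));"] * 300
run2 = ["((A,B),(C,D));"] * 680 + ["((A,C),(B,D));"] * 320
p1 = topology_probabilities(run1, discard=100)
p2 = topology_probabilities(run2, discard=100)
print(p1)
print(p2)
```

If independent runs assign clearly different probabilities to the top topologies, longer runs or a different starting tree are needed before the samples can be trusted for inference.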