© Copyright 2000, 2001, Donald Simon & Bret Larget, Department of Mathematics and Computer Science, Duquesne University.

`summarize` acts on a tree topology file generated by BAMBE. It counts the occurrences of each tree topology, automatically identifies clades, reports how often each tree topology appears, and shows transitions between subtree topologies within clades.
We define a **named clade** by these criteria:

- A named clade must have at least two members.
- The members of a named clade must appear as a monophyletic group in at least a proportion *threshold* of all sampled tree topologies. The threshold must be greater than one half.
- There may not be more than *max_top* different subtree topologies among all sampled trees where the named clade is monophyletic.
- A named clade cannot be a proper subset of another named clade.

These rules for naming clades give the user some flexibility and make reading summaries of the posterior for large trees substantially easier.

The options for `summarize` are:

Option | Description |
---|---|
`-n` | Number of lines to skip from each input file. 0 is the default. |
`-p` | Threshold for named clade definition (greater than one half). 0.9 is the default. |
`-c` | Maximum number of subtree topologies within a named clade (not more than 10). 10 is the default. |

The most general way to call `summarize` is:

    % summarize [-nskip] [-pthreshold] [-cmax_tops] <file 1> <file 2> ...

The square brackets indicate optional arguments. If the symbol `-` is used in place of a file name, the program reads that input from standard input. For example, the command

    % summarize -n 200 -p .8 -c 8 run1.top run2.top run3.top > runs.sum

ignores the first 200 lines of each file and summarizes the concatenation of the remaining lines, using a threshold of 80% for named clades and allowing no more than eight subtree topologies per named clade in the combined sample.

The command

    % head -20000 run1.top | tail -10000 | summarize - > run1.sum

runs `summarize` on lines 10,001 through 20,000 of run1.top.
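
The line arithmetic of the `head`/`tail` pipeline can be checked with a short Python sketch. It assumes one sampled tree topology per line; the helper name `line_window` and the in-memory stand-in for a topology file are hypothetical.

```python
from itertools import islice


def line_window(lines, first, last):
    """Return an iterator over lines first..last (1-based, inclusive)."""
    return islice(lines, first - 1, last)


# In-memory stand-in for a topology file with 25,000 lines.
fake_file = [f"tree {i}\n" for i in range(1, 25001)]
window = list(line_window(fake_file, 10001, 20000))
print(len(window), window[0].strip(), window[-1].strip())
```

The selected lines could then be written to the standard input of `summarize -`, in the same way as the shell pipeline.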

The `summarize` program output contains several components. We describe this output using an example summary of a sample of 200,000 tree topologies describing the evolutionary relationships among fourteen taxa.

The first section of the summary output shows the classification of taxa into named clades and how often each named clade and subtree topology appears. Taxa that do not appear in a named clade are listed separately.

    ******************** Named clades ********************

    200000 A {1,2}
      200000 A1 (1,2)
    200000 B {4,5}
      200000 B1 (4,5)
    200000 C {6,7}
      200000 C1 (6,7)
    170693 D {8,9,10,11,12}
       99386 D1  (((8,9),10),(11,12))
       55889 D2  (((8,9),11),(10,12))
        5303 D3  (((8,9),(10,12)),11)
        3752 D4  ((8,9),(10,(11,12)))
        3236 D5  ((((8,9),10),12),11)
        1184 D6  ((8,9),((10,12),11))
        1019 D7  (((8,9),(11,12)),10)
         360 D8  ((((8,9),10),11),12)
         329 D9  ((((8,9),11),12),10)
         235 D10 ((((8,9),11),10),12)

    3 13 14

The next section of summary output lists every observed tree topology, sorted by decreasing count. The actual file contained 174 different tree topologies, of which we show the first 10 and the last 3. The first column is the raw count. The second column is the posterior probability of the tree topology. The third column is the cumulative posterior probability. Notice that the first ten tree topologies account for nearly 90% of the posterior probability. Subtrees within named clades are abbreviated by labels such as D1, so you must refer back to the named clades for a complete description. Notice that most of the uncertainty in the top ten trees involves the taxa in named clade D.

    ******************** Tree topologies ********************

     Count  Prob.   Cum.  Tree topology
     93239  0.466  0.466  (A1,(3,(B1,((C1,(D1,13)),14))))
     52018  0.260  0.726  (A1,(3,(B1,((C1,(D2,13)),14))))
      6268  0.031  0.758  (A1,(3,(B1,((C1,((((8,9),10),13),(11,12))),14))))
      5981  0.030  0.788  (A1,(3,(B1,((C1,(((8,9),10),((11,12),13))),14))))
      5004  0.025  0.813  (A1,(3,(B1,((C1,(D3,13)),14))))
      4084  0.020  0.833  (A1,(3,((B1,14),(C1,(D1,13)))))
      3546  0.018  0.851  (A1,(3,(B1,((C1,(D4,13)),14))))
      3098  0.015  0.866  (A1,(3,(B1,((C1,(D5,13)),14))))
      2726  0.014  0.880  (A1,(3,(B1,((C1,((((8,9),11),13),(10,12))),14))))
      2715  0.014  0.893  (A1,(3,((B1,14),(C1,(D2,13)))))
         .
         .
         .
         1  0.000  1.000  (A1,(3,((B1,14),(C1,((8,9),((10,(11,12)),13))))))
         1  0.000  1.000  ((A1,(B1,((C1,(((8,9),10),((11,12),13))),14))),3)
         1  0.000  1.000  (A1,(3,(B1,((C1,(((((8,9),13),10),11),12)),14))))
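
The three numeric columns follow directly from the raw counts. As a minimal sketch (the function name `topology_table` and the tiny sample are hypothetical, and it assumes that identical topologies are always written as identical strings), the table could be produced like this:

```python
from collections import Counter


def topology_table(topologies):
    """Return rows (count, probability, cumulative probability, topology),
    sorted by decreasing count."""
    counts = Counter(topologies)
    total = sum(counts.values())
    rows = []
    running = 0  # cumulative count, divided by total only at the end
    for topology, count in counts.most_common():
        running += count
        rows.append((count, count / total, running / total, topology))
    return rows


# Tiny made-up sample of ten topologies on three taxa.
sample = ["((1,2),3)"] * 6 + ["(1,(2,3))"] * 3 + ["((1,3),2)"]
for count, prob, cum, topology in topology_table(sample):
    print(f"{count:6d}  {prob:.3f}  {cum:.3f}  {topology}")
```

The same arithmetic reproduces the first row of the example: 93239 / 200000 rounds to 0.466.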

The next section of summary output is similar to bootstrap proportions given by other methods. For **every** clade in the most probable tree topology, named or not, the posterior probability is provided. Taxa 8, 9, 10, 11, 12, and 13 appear together 99.1% of the time. The program did not name them as a clade because the number of distinct subtree topologies among the sampled trees exceeded ten.

    ***** Posterior probabilities of clades in most probable tree topology *****

     Count  Prob.  Tree topology
    200000  1.000  {1,2,3,4,5,6,7,8,9,10,11,12,13,14}
    200000  1.000  {1,2}
    198620  0.993  {3,4,5,6,7,8,9,10,11,12,13,14}
    200000  1.000  {4,5,6,7,8,9,10,11,12,13,14}
    200000  1.000  {4,5}
    189343  0.947  {6,7,8,9,10,11,12,13,14}
    200000  1.000  {6,7,8,9,10,11,12,13}
    200000  1.000  {6,7}
    198142  0.991  {8,9,10,11,12,13}
    170693  0.853  {8,9,10,11,12}
    115978  0.580  {8,9,10}
    200000  1.000  {8,9}
    121991  0.610  {11,12}
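
One way to picture this computation is to extract the clades of each sampled topology and count how many samples contain each clade of the most probable one. The sketch below is an illustration under assumptions, not the program's code: the helper names `clades` and `clade_support` are hypothetical, the topologies are made up, and taxa are assumed to be written as plain labels inside nested parentheses.

```python
def clades(topology):
    """Return the set of clades (frozensets of taxon labels) in a
    parenthesized topology such as "((1,2),(3,4))"."""
    found = set()
    stack = []   # one set of taxa per currently open parenthesis
    token = ""   # taxon label being read
    for ch in topology:
        if ch == "(":
            stack.append(set())
        elif ch in ",)":
            if token:
                stack[-1].add(token)
                token = ""
            if ch == ")":
                group = stack.pop()
                found.add(frozenset(group))
                if stack:
                    stack[-1] |= group  # members also belong to the parent
        elif not ch.isspace():
            token += ch
    return found


def clade_support(sampled, reference):
    """For each clade of the reference topology, count the sampled
    topologies that also contain it."""
    support = {clade: 0 for clade in clades(reference)}
    for topology in sampled:
        present = clades(topology)
        for clade in support:
            if clade in present:
                support[clade] += 1
    return support


# Made-up sample: the first topology is the most probable one.
sample = ["((1,2),(3,4))"] * 3 + ["(((1,2),3),4)"]
support = clade_support(sample, reference="((1,2),(3,4))")
for clade, count in sorted(support.items(), key=lambda kv: -kv[1]):
    print(count, count / len(sample), sorted(clade))
```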

For each named clade we summarize the transitions between subtree topologies. These tables can be useful for examining mixing efficiency. Ideally, transitions would occur as frequently as one would expect from independent samples from the posterior, but this ideal is almost never approached. It is important that there be a sufficient number of transitions between the various likely subtree topologies. It is interesting that in this example there are never any direct transitions between subtrees D1 and D2. Presumably, a different sampler that allowed such transitions directly might greatly improve mixing.

    ******************** Clade transition matrices ********************

         |    D1    D2    D3    D4    D5    D6    D7    D8    D9   D10     -
    -----+------------------------------------------------------------------
    D1   | 98485     0     1   151    97     0    50    10     0     0   591
    D2   |     0 55411   191     0     0    57     0     0    17     9   204
    D3   |     0   202  4994     0    47    34     0     0     0     0    26
    D4   |   158     0     1  3514     0    10    36     0     0     0    33
    D5   |    91     0    49     0  3068     0     0    15     0     0    13
    D6   |     0    54    40    10     0  1060     0     0     0     0    20
    D7   |    49     0     0    38     0     0   929     0     3     0     0
    D8   |     9     0     0     0    15     0     0   329     0     5     2
    D9   |     0    12     0     1     0     0     4     0   305     7     0
    D10  |     0    13     0     0     0     0     0     4     4   214     0
    -    |   593   197    27    38     9    23     0     2     0     0 28418
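
A transition matrix of this kind can be pictured as a count of consecutive pairs in the sequence of sampled subtree topologies. The sketch below is an illustration only: the function name `transition_counts` and the short chain are made up. It also assumes that `-` marks samples in which the clade is not monophyletic, which is consistent with the `-` row above summing to 29307 = 200000 - 170693 but is not stated explicitly.

```python
from collections import Counter


def transition_counts(states):
    """Count transitions between consecutive states of a sampled chain.

    states is a list of subtree labels, one per sampled tree, in sampling
    order; "-" is assumed to mean the clade is absent from that sample.
    """
    return Counter(zip(states, states[1:]))


# Made-up chain of ten consecutive samples.
chain = ["D1", "D1", "D4", "D1", "-", "-", "D2", "D2", "D3", "D2"]
counts = transition_counts(chain)

labels = ["D1", "D2", "D3", "D4", "-"]
print("     |" + "".join(f"{label:>4}" for label in labels))
for row in labels:
    print(f"{row:<5}|" + "".join(f"{counts[(row, col)]:>4}" for col in labels))
```

A chain of n samples yields n - 1 transitions, and a pair that never occurs (such as D1 followed directly by D2 here) simply has a count of zero.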

The last portion of the summary output shows "clade trees", in which differences in subtree topology within named clades are ignored. We see here that, ignoring uncertainty in the tree topology of named clade D, the best clade tree appears nearly 80% of the time. The remaining uncertainty is mostly in the location of taxon 13. Lumping taxon 13 with clade D, we see that there is very little uncertainty in the tree.

    ******************** Clade tree topologies ********************

     Count  Prob.   Cum.  Tree topology
    159821  0.799  0.799  (A,(3,(B,((C,(D,13)),14))))
      7429  0.037  0.836  (A,(3,((B,14),(C,(D,13)))))
      6268  0.031  0.868  (A,(3,(B,((C,((((8,9),10),13),(11,12))),14))))
      5981  0.030  0.897  (A,(3,(B,((C,(((8,9),10),((11,12),13))),14))))
      2726  0.014  0.911  (A,(3,(B,((C,((((8,9),11),13),(10,12))),14))))
      2185  0.011  0.922  (A,(3,(B,((C,((((8,9),13),11),(10,12))),14))))
      2030  0.010  0.932  (A,(3,(B,((C,((((8,9),13),10),(11,12))),14))))
      1677  0.008  0.941  (A,(3,((B,(C,(D,13))),14)))
      1524  0.008  0.948  (A,(3,(B,((C,(((8,9),13),((10,12),11))),14))))
      1406  0.007  0.955  (A,(3,(B,((C,(((8,9),13),(10,(11,12)))),14))))
         .
         .
         .
         1  0.000  1.000  (A,(3,((B,14),(C,((8,9),((10,(11,12)),13))))))
         1  0.000  1.000  (A,(3,((B,14),(C,((8,9),(((10,12),11),13))))))
         1  0.000  1.000  (A,(3,((B,(C,((((8,9),13),11),(10,12)))),14)))
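
Collapsing tree topologies into clade trees amounts to dropping the subtree index from each named clade label and pooling the counts. The sketch below illustrates this with three rows from the tree topology table. The helper name `clade_tree` is hypothetical, and the pattern assumes that named clades are labeled with a single capital letter followed by a subtree number, as in this example.

```python
import re
from collections import Counter

# Assumed label form: one capital letter plus a subtree number, e.g. "D2".
SUBTREE_LABEL = re.compile(r"([A-Z])\d+")


def clade_tree(topology):
    """Replace labels such as "D2" by "D"; plain taxon numbers are untouched."""
    return SUBTREE_LABEL.sub(r"\1", topology)


# Three rows taken from the tree topology table of the example.
tree_counts = {
    "(A1,(3,(B1,((C1,(D1,13)),14))))": 93239,
    "(A1,(3,(B1,((C1,(D2,13)),14))))": 52018,
    "(A1,(3,(B1,((C1,(D3,13)),14))))": 5004,
}
clade_counts = Counter()
for topology, count in tree_counts.items():
    clade_counts[clade_tree(topology)] += count

for topology, count in clade_counts.most_common():
    print(count, topology)
```

These three rows alone pool to 150261 of the 159821 samples reported for the best clade tree; the remainder comes from the less frequent subtree topologies of clade D.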


This page was most recently updated on January 19, 2001.

bambe@mathcs.duq.edu