BAMBE

Bayesian Analysis in Molecular Biology and Evolution

Version 2.03 beta, January 2001

Sequence data format

The program bambe can read in data in two different formats. The first we call BAMBE format. It is a relaxed version of PHYLIP sequential format. An example is:

6 10 # This is a comment.  There are 6 taxa with 10 sites each.
alligator # This is not part of the taxon name.
AAAAGCTGCT
boa constrictor
G--AAGCGTG
cat
a--ggttccc
dog
aatg
  ggc  -x-
elephant #  Case is irrelevant.  Spaces, line breaks, and tabs are allowed.
A-?NYRGTTC
ferret
AACCGGTTGG
Any text may appear after the last taxon data.

The BAMBE data format carries these specifications:

The first line must contain two integers separated by white space (spaces, tabs, carriage returns). Any amount of white space may precede the first integer and any characters may follow the second integer. The first integer is the number of taxa. The second is the number of sites in the alignment.
The data for each taxon comes in a block of lines. The first line contains the taxon name and optional comments. Subsequent lines carry the sequence, optionally broken by white space.
# is the comment symbol. Any characters after # on a taxon name line are ignored. There should not be any comments on the sequence data lines.
Each taxon name may have up to 80 characters, including spaces, but not the symbol `#'. The taxon name consists of all characters before the first `#' (or before the end of the taxon name line if there is no `#'), with trailing white space removed.
Empty lines are allowed at any point in the data file.
It is an error if the end of file occurs before all data is read.
Sequence data may be upper or lower case.
In addition to the regular symbols `A', `G', `C', `T' (or `U'), the standard symbols indicating indeterminate bases are recognized. Alignment gaps are given by the symbol `-' and are treated identically to `N', `X', and `?' (although better models would treat alignment gaps differently).
The symbol `.' is an error if it occurs in the first taxon's data. For subsequent taxa it matches the symbol at the corresponding site of the first taxon.
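
The rules above can be sketched as a small reader. This is an illustrative Python sketch, not the code bambe itself uses. The function name parse_bambe and the use of ValueError are assumptions, as are two choices the specification leaves open: the header is taken from the first non-empty line, and symbols beyond the declared number of sites on a taxon's last sequence line are reported as an error.

```python
import re


def parse_bambe(text):
    """Parse BAMBE-format text into a list of (taxon name, sequence) pairs."""
    lines = iter(text.splitlines())

    def next_nonblank():
        # Empty lines are allowed anywhere; running out of lines is an error.
        for line in lines:
            if line.strip():
                return line
        raise ValueError("end of file before all data was read")

    # Header: two integers; anything after the second one is ignored.
    header = re.match(r"\s*(\d+)\s+(\d+)", next_nonblank())
    if header is None:
        raise ValueError("first line must hold the taxon and site counts")
    n_taxa, n_sites = int(header.group(1)), int(header.group(2))

    taxa = []
    for _ in range(n_taxa):
        # Name line: drop the comment, then trailing white space.
        name = next_nonblank().split("#", 1)[0].rstrip()
        if len(name) > 80:
            raise ValueError("taxon name longer than 80 characters")

        # Sequence lines: collect symbols, skipping white space, until full.
        symbols = []
        while len(symbols) < n_sites:
            symbols.extend("".join(next_nonblank().split()))
        if len(symbols) > n_sites:
            # Assumption: extra symbols are an error (the source is silent).
            raise ValueError("too many symbols for taxon " + repr(name))

        seq = "".join(symbols)
        if "." in seq:
            if not taxa:
                raise ValueError("'.' is not allowed in the first taxon")
            # A dot copies the first taxon's symbol at the same site.
            first = taxa[0][1]
            seq = "".join(f if s == "." else s for s, f in zip(seq, first))
        taxa.append((name, seq))

    # Any text after the last taxon's data is ignored.
    return taxa


example = """2 4 # two taxa, four sites
first # this comment is not part of the name
ACGT

second
A. .T
"""
print(parse_bambe(example))
# [('first', 'ACGT'), ('second', 'ACGT')]
```

The sketch preserves the case of the input, since case is irrelevant to interpretation, and leaves the meaning of each symbol to a later step.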

Data in interleaved format from the CLUSTALW software is also readable. The software recognizes CLUSTAL format if the string ``CLUSTAL'' appears in the first line.
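
The format check described above amounts to a test on the first line of the file. A minimal Python sketch follows. The function name detect_format and the return values are hypothetical, and the match is assumed to be case-sensitive, which the text does not state.

```python
def detect_format(text):
    """Return 'clustal' if the first line contains 'CLUSTAL', else 'bambe'."""
    # Only the first line is inspected, as described in the text.
    first_line = text.partition("\n")[0]
    # Assumption: the string must appear in upper case.
    return "clustal" if "CLUSTAL" in first_line else "bambe"
```

For example, a file whose first line begins with a CLUSTAL W banner would be classified as 'clustal', while a file starting with two integers would fall through to 'bambe'.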

The recognized symbols are tabulated below. A warning message is given for unrecognized symbols, which are treated as gaps.

Table of recognized characters.
Symbol	Mnemonic	Interpretation
A or a	Adenine	A
G or g	Guanine	G
C or c	Cytosine	C
T or t	Thymine	T
U or u	Uracil	T
R or r	Purine (large)	A or G
Y or y	Pyrimidine (small)	C or T
M or m	Amino (positive charge)	A or C
K or k	Ketone (negative charge)	G or T
W or w	Weak interaction	A or T
S or s	Strong interaction	G or C
B or b	Not A	G or C or T
H or h	Not G	A or C or T
D or d	Not C	A or G or T
V or v	Not T	A or G or C
-,N,n,X,x, or ?	Any	A or G or C or T
.	Dot	Same as first taxon
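
The table can be expressed as a lookup from each symbol to the set of bases it may stand for. The following Python sketch is illustrative only. The names BASES and interpret are hypothetical, and the symbol `.' is assumed to have been replaced by the first taxon's symbol before lookup.

```python
import warnings

# Symbol -> bases it may represent, transcribed from the table above.
BASES = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "W": "AT", "S": "GC",
    "B": "GCT", "H": "ACT", "D": "AGT", "V": "AGC",
    "-": "AGCT", "N": "AGCT", "X": "AGCT", "?": "AGCT",
}


def interpret(symbol):
    """Return the set of bases that one sequence symbol may represent."""
    key = symbol.upper()  # case is irrelevant
    if key not in BASES:
        # Unrecognized symbols draw a warning and are treated as gaps.
        warnings.warn("unrecognized symbol %r treated as a gap" % symbol)
        key = "-"
    return frozenset(BASES[key])
```

With this mapping, interpret("r") yields the set {A, G}, and any gap or fully indeterminate symbol yields all four bases.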

This page was most recently updated on January 19, 2001.

bambe@mathcs.duq.edu