This chapter describes the input file formats for different kinds of data and explains the in-memory data editing options. MEGA places no limit on the amount of molecular sequence or distance matrix data that can be analyzed; the size of a data set is constrained only by the available computer memory.
Either sequence data or distance data can be entered in MEGA as ASCII-text files. These data must be organized in a format specific to MEGA. These input file formats are consistent and flexible, and they include options for writing extensive comments in the data file.
Every data file must contain the key words #MEGA and TITLE. These key words can be written in any combination of lower- and upper-case letters.
#MEGA | This key word indicates that the data file is prepared for analysis using MEGA. It must be present on the very first line in the data file. |
TITLE | The word TITLE must be written on the second line. It may be followed by some description of data on the same line. This description is written in all the output files containing results. If the specified description exceeds 128 characters in length, the additional characters are ignored. |
After the MEGA format identifier (#MEGA) and the title (TITLE), the data should follow. Comments may be written on one or more lines right after the TITLE line and before the data (see examples in sections 2.2 and 2.3).
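The header rules above can be checked with a few lines of code. The following is a minimal Python sketch, not part of MEGA; the function name `read_mega_header` is hypothetical, and stripping a colon after TITLE is an assumption based on the examples in this chapter.

```python
def read_mega_header(lines):
    """Return the title description from the first two lines of a MEGA file.

    Raises ValueError if the #MEGA or TITLE key words are missing.
    """
    if len(lines) < 2:
        raise ValueError("a MEGA file needs at least two lines")
    # Key words may be written in any mix of upper and lower case.
    if lines[0].strip().lower() != "#mega":
        raise ValueError("the first line must be #MEGA")
    second = lines[1].strip()
    if not second.lower().startswith("title"):
        raise ValueError("the second line must begin with TITLE")
    # Assumption: a colon or blanks may separate TITLE from the description.
    description = second[len("title"):].lstrip(": \t")
    return description[:128]  # characters beyond 128 are ignored
```

For instance, `read_mega_header(["#mega", "TITLE: Noninterleaved sequence data"])` returns the description without the key word.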
Distance matrices as well as sequence data may come from species, populations, or individuals. These evolutionary entities are designated as OTUs (Operational Taxonomic Units). Each OTU must have an identification tag, i.e., an OTU label. In the input files prepared for use in MEGA, these labels should be written according to the following conventions.
'#' Sign | Every OTU label must be written on a new line, and a '#' sign must precede the label. OTU labels cannot be longer than 40 characters; extra characters are disregarded. OTU labels are not required to be unique, but identical labels may result in ambiguities. |
Forbidden Characters |
The '#' sign, blanks, and tabs cannot be a part of an OTU label. For multiple-word labels, an underscore can be used to represent a blank space. All underscores are converted into blank spaces, and subsequent displays of the OTU label show this change. For example, E._coli becomes E. coli. |
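As an illustration of these label conventions, here is a short Python sketch. It is not MEGA's own code; the function name `parse_otu_label` is hypothetical, and it assumes the label is the only item on the line.

```python
def parse_otu_label(line, max_len=40):
    """Return the display form of an OTU label line such as '#E._coli'."""
    text = line.strip()
    if not text.startswith("#"):
        raise ValueError("an OTU label must be preceded by '#'")
    label = text[1:]
    if "#" in label or " " in label or "\t" in label:
        raise ValueError("'#', blanks, and tabs cannot be part of a label")
    # Extra characters are disregarded; underscores are shown as blanks.
    return label[:max_len].replace("_", " ")
```

Calling `parse_otu_label("#E._coli")` gives the displayed label `E. coli`.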
The sequence data must consist of two or more sequences of equal length. All sequences must be aligned (MEGA does not include an alignment program) and should be arranged either in interleaved (block-wise) or in noninterleaved (continuous) format (see below).
Nucleotide or amino acid sequences should be written in IUPAC single-letter codes. In this system, A, T(U), C, and G represent the four different nucleotides, and all letters except B, J, O, U, X, and Z represent the twenty different amino acids (see Table 2.1). However, the use of N (and n) for ambiguous nucleotides and X (and x) for ambiguous amino acid residues must be avoided. Sequences can be written in any combination of upper- and lower-case letters. Special symbols for alignment gaps, missing data, and identical sites can also be included in the sequences.
Special Symbols | Blank spaces and tabs are frequently used to format data files, so they are simply ignored by MEGA. Any unique ASCII character other than a letter or '*' can be used as a special symbol for alignment gaps, missing-information sites, or identical sites. Frequently used symbols for identical sites, alignment gaps, and missing-information sites are '.', '-', and '?', respectively. |
Table 2.1 IUPAC single-letter codes used in MEGA
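The character rules described above can be expressed as a simple check. This Python sketch is illustrative only; the names `NUCLEOTIDES`, `AMINO_ACIDS`, and `unknown_symbols` are hypothetical, and the default special symbols ('-', '.', '?') are just the frequently used ones, since a data file may choose others.

```python
NUCLEOTIDES = set("ATUCG")
# All letters except B, J, O, U, X, and Z stand for the twenty amino acids.
AMINO_ACIDS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") - set("BJOUXZ")


def unknown_symbols(sequence, alphabet, special="-.?"):
    """Return the characters that are neither in the alphabet nor special symbols."""
    allowed = alphabet | set(special)
    # Case is ignored, and blanks and tabs are skipped.
    return sorted({ch for ch in sequence.upper()
                   if not ch.isspace() and ch not in allowed})
```

An empty result means every character is acceptable; for example, an `N` in a nucleotide sequence is reported because ambiguous codes must be avoided.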
Noninterleaved Format |
In the noninterleaved format, the complete sequence for an OTU is written on one or more lines following its label, as shown in the following example. |

    #mega
    TITLE: Noninterleaved sequence data
    #mouse
    AATTTTTACCCCGGGGGG
    AGGGGGGACCCCGGGGGG
    #human
    AACCCTTACCCCGGGGGG
    AGGGGGGACCCCGGGGGG
    #cat
    AATTTTTACAAAGGGGGG
    AGGGGGGACCCCGGGGGG

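A noninterleaved file like the example above can be read by collecting every sequence line under the most recent label. The Python sketch below is a simplified reader, not MEGA's parser; `parse_noninterleaved` is a hypothetical name, and the sketch assumes unique labels, no quoted comments, and that sequence text may also follow the label on the same line.

```python
def parse_noninterleaved(text):
    """Map each OTU label to its full sequence (blanks and tabs removed)."""
    sequences = {}
    current = None
    for raw in text.splitlines()[2:]:      # skip the #MEGA and TITLE lines
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line[1:].split(None, 1)
            if not parts:
                raise ValueError("missing OTU label after '#'")
            current = parts[0]
            sequences[current] = ""
            line = parts[1] if len(parts) > 1 else ""
        # Lines before the first label (comments) are skipped.
        if current is not None:
            sequences[current] += "".join(line.split())
    return sequences
```

Because labels are used as dictionary keys here, a repeated label would overwrite the earlier sequence, which is one way identical labels can cause ambiguity.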
In noninterleaved format there are alternate ways of writing the OTU label and the sequence: |
Interleaved Format |
In contrast to the noninterleaved format, interleaved sequences are arranged in blocks consisting of homologous sites for all OTUs. The sequences for all the OTUs must be present in the same order in every block, and these sequences should be written on consecutive lines in each of the blocks. Sequence blocks should be separated from each other by at least one blank line. |

    #mega
    TITLE: Interleaved sequence data

    #mouse AATTTTTACCCCGGGGGG
    #human AACCCTTACCCCGGGGGG
    #cat   AATTTTTACAAAGGGGGG

    #mouse AGGGGGGACCCCGG
    #human AGGGGGGACCCCGG
    #cat   AGGGGGGACCCCGG

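The block rules for interleaved data (same OTU order in every block, blocks separated by blank lines) can be sketched in Python as follows. This is an illustrative reader rather than MEGA's implementation; `parse_interleaved` is a hypothetical name, and the sketch assumes each label and its sequence segment share one line and that no quoted comments are present.

```python
def parse_interleaved(text):
    """Join interleaved blocks into one sequence per OTU, checking block order."""
    blocks, block = [], []
    for raw in text.splitlines()[2:]:          # skip the #MEGA and TITLE lines
        line = raw.strip()
        if line.startswith("#"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("missing OTU label after '#'")
            block.append((parts[0], "".join(parts[1:])))
        elif not line and block:               # a blank line closes a block
            blocks.append(block)
            block = []
    if block:
        blocks.append(block)
    if not blocks:
        return {}
    order = [label for label, _ in blocks[0]]
    sequences = {label: "" for label in order}
    for number, blk in enumerate(blocks, start=1):
        if [label for label, _ in blk] != order:
            raise ValueError(f"block {number} lists OTUs in a different order")
        for label, segment in blk:
            sequences[label] += segment
    return sequences
```

The result has the same shape as for noninterleaved data: one complete sequence per OTU, which makes the two layouts interchangeable for later analysis.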
Comments | Comments can be placed after the TITLE line and before the data as well as within the sequences. Comments inside a sequence must be contained within a pair of double quotation marks. |
Format type -->            #mega
Title -->                  TITLE: 2 exons from gene XYZ
Comments -->               Authors: James R. and Ray S., 1987
                           Sequencing procedure: PCR
Sequence & Comments -->    #cat    ATTCCCGGCCG "intron 10" ACCC
                           #rat    ATTCCCGGGGG "intron of length 8" ACCC
                           #rabbit ATTCCCGGGAA "no intron" ACCC
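Comments embedded in a sequence can be removed before analysis by deleting everything between a pair of double quotation marks. The following Python sketch shows one way to do this; `strip_comments` is a hypothetical helper, and it assumes quotation marks are always paired within the text it receives.

```python
import re


def strip_comments(sequence_text):
    """Remove double-quoted comments, blanks, and tabs from sequence text."""
    without_comments = re.sub(r'"[^"]*"', "", sequence_text)
    return "".join(without_comments.split())
```

Applied to the `#cat` line of the example, the sketch joins the two exons into a single run of sites.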
There are m(m-1)/2 pairwise distances for m OTUs. These distances can be arranged either in the lower-left or in the upper-right triangular matrix. Following the key word #MEGA on the first line and the TITLE on the second line, all OTU labels should be written on consecutive lines. OTU labels should be prefixed with the '#' mark and should be written according to the conventions described in section 2.1.2. This list should be separated from the following distance matrix by at least one blank line.
Format type -->                   #mega
Title -->                         TITLE: Upper-right triangular matrix
Blank 1 -->
OTU names on consecutive lines    #one
                                  #two
                                  #three
                                  #four
                                  #five
Blank 2 -->
OTU 1 vs. others, etc.            1.0 2.0 3.0 4.0
                                  3.0 2.5 4.6
                                  1.3 3.6
                                  4.2
In this example, blank line 1 is optional, but blank line 2 is required. The two alternate distance matrix formats are:
Lower-left matrix: | Upper-right matrix: |
Comments | In data files containing distance matrices, comments can only be placed after the TITLE line and before the OTU labels. |
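To make the m(m-1)/2 count and the upper-right layout concrete, the Python sketch below expands a flat list of upper-right triangular distances into a full symmetric matrix. It is an illustration only; `upper_right_to_full` is a hypothetical name, and placing zeros on the diagonal is an assumption of this sketch.

```python
def upper_right_to_full(values, m):
    """Build a symmetric m x m matrix from upper-right triangular distances."""
    expected = m * (m - 1) // 2
    if len(values) != expected:
        raise ValueError(f"{m} OTUs need {expected} distances, got {len(values)}")
    full = [[0.0] * m for _ in range(m)]   # assumed zero self-distances
    remaining = iter(values)
    # Row i of the upper-right matrix holds OTU i versus OTUs i+1 .. m.
    for i in range(m):
        for j in range(i + 1, m):
            full[i][j] = full[j][i] = next(remaining)
    return full
```

For the five-OTU example in this section, the ten values read row by row fill the matrix so that the distance between `#one` and `#five` is 4.0 and the distance between `#four` and `#five` is 4.2.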
Input sequence data consist of two or more aligned sequences of equal length. In MEGA, any subset of these data can be selected for analysis using options available in the Data menu. The Select OTUs and Select Sites/Codons commands are used to choose the desired subset. This subset is referred to as the current data, and it is maintained until it is modified.
Selecting Mode for Analysis |
The Select Mode command is used to select the protein coding or noncoding mode for nucleotide sequences. The coding mode provides codon-by-codon and site-by-site analyses, whereas the non-coding mode provides only site-by-site analysis. |
Selecting OTUs | By default, all OTUs are included in the current data. Some of these OTUs can be removed by using the Data|Select OTUs command. These OTUs will stay excluded until the Select OTUs command is used again. |
Selecting Sites or Codons |
Options for selecting domains as well as individual sites or codons are provided in MEGA. To start with, all the sites (codons) are included in the current data. With the Domains option, up to 10 nonoverlapping domains of sites (or codons) can be chosen. Individual sites (or codons) are chosen by using the Individual command. |
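The domain rules (at most 10 domains, none overlapping) can be sketched in Python as below. This does not reproduce MEGA's Domains dialog; `select_domains` is a hypothetical name, and describing each domain as a 1-based, inclusive (start, end) pair is an assumption of this sketch.

```python
def select_domains(sequence, domains, max_domains=10):
    """Return the sites of sequence covered by the given (start, end) domains.

    Positions are 1-based and inclusive (an assumption of this sketch).
    """
    if len(domains) > max_domains:
        raise ValueError(f"at most {max_domains} domains can be chosen")
    ordered = sorted(domains)
    # After sorting, an overlap shows up as a start at or before the previous end.
    for (_, prev_end), (start, _) in zip(ordered, ordered[1:]):
        if start <= prev_end:
            raise ValueError("domains must not overlap")
    return "".join(sequence[start - 1:end] for start, end in ordered)
```

For example, domains (1, 3) and (8, 10) of `AATTTTTACCCCGGGGGG` select the sites `AAT` and `ACC`.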
The options for including alignment gaps and missing-information sites and the choice of nucleotide positions in codons provide a second level of data editing. MEGA prompts for these options each time before an analysis begins (see section 4.5), and they affect only the current analysis.
Choosing Sites in Codons |
Any combination of first, second, and third nucleotide positions in the codons can be chosen if the nucleotide sequences are used in the protein coding mode. |
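Choosing codon positions amounts to keeping every site whose position within its codon is in the chosen set. The Python sketch below illustrates this; `select_codon_positions` is a hypothetical name, and the sketch assumes the reading frame starts at the first site of the sequence.

```python
def select_codon_positions(sequence, positions):
    """Keep only the chosen codon positions (1, 2, and/or 3) of a coding sequence."""
    wanted = set(positions)
    if not wanted or not wanted <= {1, 2, 3}:
        raise ValueError("positions must be a non-empty subset of {1, 2, 3}")
    # Site index 0 is codon position 1, index 1 is position 2, and so on.
    return "".join(base for index, base in enumerate(sequence)
                   if index % 3 + 1 in wanted)
```

With the sequence `ATGCCCGGG`, choosing position 3 alone keeps `GCG`, and choosing positions 1 and 2 keeps `ATCCGG`.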
Excluding Missing-Information Sites and Alignment Gaps |
In distance computation, alignment gaps and missing-information sites can be treated in two different ways. One is to eliminate all these gap and missing-information sites from all the sequences. The other is to ignore only the gap and missing-information sites that are involved in a particular pairwise comparison. MEGA usually prompts for these options before distance calculation and tree reconstruction. Detailed discussions on this topic are presented in the chapters on Distance Estimation and Phylogenetic Inference. |
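The difference between the two treatments can be seen in a small example. The Python sketch below uses the proportion of differing sites purely as an illustrative distance; the distance measures MEGA actually offers are described in the Distance Estimation chapter. The names `MISSING`, `p_distance`, and `complete_deletion` are hypothetical, and '-' and '?' are assumed to be the gap and missing-information symbols.

```python
MISSING = set("-?")  # assumed gap and missing-information symbols


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring sites that are a gap or
    missing in either of the two sequences (pairwise treatment)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b)
             if a not in MISSING and b not in MISSING]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def complete_deletion(sequences):
    """Drop every site at which any sequence has a gap or missing symbol."""
    keep = [i for i, column in enumerate(zip(*sequences))
            if not MISSING & set(column)]
    return ["".join(seq[i] for i in keep) for seq in sequences]
```

With the three sequences `ACGTAC`, `ACG-AT`, and `A?GTTC`, the pairwise treatment compares the first two over five sites (one difference, 0.2), whereas removing the gap and missing sites from all sequences first leaves four sites (one difference, 0.25).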
A set of OTUs can be selected in distance matrix data by using the Select OTUs command from the Data menu. The distance matrix is reduced automatically by removing rows and columns corresponding to the excluded OTUs.
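Reducing a distance matrix in this way can be sketched in Python as follows. The function name `reduce_matrix` is hypothetical, and the sketch assumes the distances are held as a full square matrix with one row and one column per OTU.

```python
def reduce_matrix(labels, matrix, excluded):
    """Remove the rows and columns of the excluded OTUs from a full distance matrix."""
    keep = [i for i, label in enumerate(labels) if label not in excluded]
    kept_labels = [labels[i] for i in keep]
    reduced = [[matrix[i][j] for j in keep] for i in keep]
    return kept_labels, reduced
```

Excluding one OTU from an m x m matrix therefore leaves an (m-1) x (m-1) matrix whose remaining distances are unchanged.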