2. Input Data and Formats


This chapter describes the input file formats for the different kinds of data and explains the in-memory data editing options. MEGA places no limit on the amount of molecular sequence or distance matrix data that can be analyzed; the size of the data set is constrained only by the available computer memory.

2.1 MEGA Format

Either sequence data or distance data can be entered in MEGA as ASCII-text files. These data must be organized in a format specific to MEGA. These input file formats are consistent and flexible, and they include options for writing extensive comments in the data file.

2.1.1 Key Words

Every data file must contain the key words #MEGA and TITLE. These key words can be written in any combination of lower- and upper-case letters.

#MEGA This key word indicates that the data file is prepared for analysis using MEGA. It must be present on the very first line in the data file.
TITLE The word TITLE must be written on the second line. It may be followed by some description of data on the same line. This description is written in all the output files containing results. If the specified description exceeds 128 characters in length, the additional characters are ignored.

After the MEGA format identifier (#MEGA) and the title (TITLE), the data should follow. Comments may be written on one or more lines right after the TITLE line and before the data (see examples in sections 2.2 and 2.3).
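The key-word rules above can be checked mechanically before a file is loaded. The following Python sketch is an illustration only, not part of MEGA; the function name read_mega_header is hypothetical, and treating a colon after TITLE as punctuation is an assumption based on the examples in this chapter.

```python
def read_mega_header(text):
    """Check the #MEGA and TITLE key words and return the title description."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("a MEGA file needs at least the #MEGA and TITLE lines")
    # Key words may be written in any mix of upper- and lower-case letters.
    if lines[0].strip().upper() != "#MEGA":
        raise ValueError("the first line must be #MEGA")
    second = lines[1].strip()
    if second[:5].upper() != "TITLE":
        raise ValueError("the second line must start with TITLE")
    # Assumption: a colon directly after TITLE is not part of the description.
    description = second[5:].lstrip(":").strip()
    # Characters beyond the first 128 are ignored.
    return description[:128]
```

For example, a file that begins with the lines "#mega" and "TITLE: Noninterleaved sequence data" yields the description "Noninterleaved sequence data".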

2.1.2 OTU Labels

Distance matrices as well as sequence data may come from species, populations, or individuals. These evolutionary entities are designated as OTUs (Operational Taxonomic Units). Each OTU must have an identification tag, i.e., an OTU label. In the input files prepared for use in MEGA, these labels should be written according to the following conventions.

'#' Sign Every OTU label must be written on a new line, and a '#' sign must precede the label. OTU labels cannot be longer than 40 characters; extra characters are disregarded. OTU labels are not required to be unique, but identical labels may result in ambiguities.
Forbidden Characters The '#' sign, blanks, and tabs cannot be a part of an OTU label. For multiple-word labels, an underscore can be used to represent a blank space. All underscores are converted into blank spaces, and subsequent displays of the OTU label show this change. For example, E._coli becomes E. coli.
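These label conventions can be sketched in a few lines of Python. The function clean_otu_label is a hypothetical helper, not a MEGA routine, and the order of operations (truncate to 40 characters first, then convert underscores) is an assumption that the text above does not settle.

```python
def clean_otu_label(line):
    """Turn a '#label' line into the label as it would be displayed."""
    token = line.strip()
    if not token.startswith("#"):
        raise ValueError("an OTU label must be preceded by '#'")
    label = token[1:]
    # '#', blanks, and tabs cannot be part of a label.
    if not label or any(ch in label for ch in "# \t"):
        raise ValueError("labels cannot contain '#', blanks, or tabs")
    # Assumption: extra characters are dropped before underscores are converted.
    return label[:40].replace("_", " ")
```

With this sketch, clean_otu_label("#E._coli") returns "E. coli".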

2.2 Sequence Input Formats

The sequence data must consist of two or more sequences of equal length. All sequences must be aligned (MEGA does not include an alignment program) and should be arranged either in interleaved (block-wise) or in noninterleaved (continuous) format (see below).

Nucleotide or amino acid sequences should be written in IUPAC single-letter codes. In this system, A, T(U), C, and G represent the four different nucleotides, and all letters except B, J, O, U, X, and Z represent the twenty different amino acids (see Table 2.1). However, the use of N (and n) for ambiguous nucleotides and X (and x) for ambiguous amino acid residues must be avoided. Sequences can be written in any combination of upper- and lower-case letters. Special symbols for alignment gaps, missing data, and identical sites can also be included in the sequences.

Special Symbols Blank spaces and tabs are frequently used to format data files, so they are simply ignored by MEGA. Distinct ASCII characters, other than letters and '*', can be used as special symbols for alignment gaps, missing-information sites, and identical sites. Frequently used symbols for identical sites, alignment gaps, and missing-information sites are '.', '-', and '?', respectively.

Table 2.1 IUPAC single-letter codes used in MEGA

Symbol  Name            Remarks
DNA and RNA
A       Adenine         Purine
G       Guanine         Purine
C       Cytosine        Pyrimidine
T       Thymine         Pyrimidine
U       Uracil          Pyrimidine
Amino Acids
A       Alanine         Ala
C       Cysteine        Cys
D       Aspartic acid   Asp
E       Glutamic acid   Glu
F       Phenylalanine   Phe
G       Glycine         Gly
H       Histidine       His
I       Isoleucine      Ile
K       Lysine          Lys
L       Leucine         Leu
M       Methionine      Met
N       Asparagine      Asn
P       Proline         Pro
Q       Glutamine       Gln
R       Arginine        Arg
S       Serine          Ser
T       Threonine       Thr
V       Valine          Val
W       Tryptophan      Trp
Y       Tyrosine        Tyr

Noninterleaved Format In the noninterleaved format, the complete sequence for an OTU is written on one or more lines following its label, as shown in the following example.
  #mega
  TITLE: Noninterleaved sequence data

  #mouse      AATTTTTACCCCGGGGGG
              AGGGGGGACCCCGGGGGG
  #human      AACCCTTACCCCGGGGGG
              AGGGGGGACCCCGGGGGG
  #cat        AATTTTTACAAAGGGGGG
              AGGGGGGACCCCGGGGGG
In noninterleaved format there are alternate ways of writing the OTU label and the sequence:
  (a)  
  #mouse      AATTTTTACCCCGGGGGG
  (b)  
  #mouse   
  AATTTTTACCCCGGGGGG
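As an illustration of how such a file can be read, the Python sketch below collects each OTU's sequence from the lines that follow its label and accepts both layouts (a) and (b). It is not MEGA's own reader: the name parse_noninterleaved is hypothetical, quoted comments inside sequences are not handled, and any text between the TITLE line and the first label is assumed to be a comment.

```python
def parse_noninterleaved(text):
    """Return a list of (label, sequence) pairs from noninterleaved MEGA text."""
    records = []
    for line in text.splitlines()[2:]:      # skip the #MEGA and TITLE lines
        parts = line.split()                # blanks and tabs are ignored
        if not parts:
            continue
        if parts[0].startswith("#"):        # a '#' starts a new OTU
            records.append([parts[0][1:], ""])
            parts = parts[1:]
        if not records:
            continue                        # assumption: comment before the data
        records[-1][1] += "".join(parts)    # append to the current OTU
    return [(label, seq) for label, seq in records]
```

A list of pairs is returned rather than a dictionary because OTU labels are not required to be unique.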
Interleaved Format In contrast to the noninterleaved format, interleaved sequences are arranged in blocks consisting of homologous sites for all OTUs. The sequences for all the OTUs must be present in the same order in every block, and these sequences should be written on consecutive lines within each block. Sequence blocks should be separated from each other by at least one blank line.
  #mega
  TITLE: Interleaved sequence data

  #mouse      AATTTTTACCCCGGGGGG
  #human      AACCCTTACCCCGGGGGG
  #cat        AATTTTTACAAAGGGGGG

  #mouse      AGGGGGGACCCCGG
  #human      AGGGGGGACCCCGG
  #cat        AGGGGGGACCCCGG
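Reading the interleaved layout amounts to joining, for each OTU, the pieces found at the same position in every block. The Python sketch below illustrates this under stated assumptions: parse_interleaved is a hypothetical name, unlabeled lines are treated as comments, quoted comments inside sequences are not handled, and a block whose labels do not match the order of the first block is rejected.

```python
def parse_interleaved(text):
    """Return a list of (label, sequence) pairs from interleaved MEGA text."""
    labels, seqs = [], []
    in_first_block = True
    row = 0                                  # position within the current block
    for line in text.splitlines()[2:]:       # skip the #MEGA and TITLE lines
        parts = line.split()
        if not parts:                        # a blank line ends a block
            if labels:
                in_first_block = False
            row = 0
            continue
        if not parts[0].startswith("#"):
            continue                         # assumption: unlabeled text is a comment
        label, chunk = parts[0][1:], "".join(parts[1:])
        if in_first_block:
            labels.append(label)
            seqs.append(chunk)
        else:
            # OTUs must appear in the same order in every block.
            if row >= len(labels) or labels[row] != label:
                raise ValueError("OTU order differs between blocks")
            seqs[row] += chunk
        row += 1
    return list(zip(labels, seqs))
```

Matching by position rather than by label keeps the sketch usable when two OTUs share a label.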
Comments Comments can be placed after the TITLE line and before the data as well as within the sequences. Comments inside a sequence must be contained within a pair of double quotation marks.
   Format type -->   #mega
   Title       -->   TITLE: 2 exons from gene XYZ
   Comments    -->      Authors: James R. and Ray S., 1987
                        Sequencing procedure: PCR
   Sequence &        #cat          ATTCCCGGCCG "intron  10" ACCC
   Comments    -->   #rat          ATTCCCGGGGG "intron  of length 8" ACCC
                     #rabbit       ATTCCCGGGAA "no intron" ACCC
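Because comments inside a sequence are always enclosed in double quotation marks, they can be removed before the sequence characters are collected. The following Python sketch shows one way to do this with a regular expression; strip_comments and sequence_of are hypothetical helper names and not part of MEGA.

```python
import re

def strip_comments(text):
    """Remove every "..." comment together with its quotation marks."""
    return re.sub(r'"[^"]*"', "", text)

def sequence_of(line):
    """Return (label, sequence) for one labeled line, ignoring comments and blanks."""
    parts = strip_comments(line).split()
    return parts[0].lstrip("#"), "".join(parts[1:])
```

Applied to the #rat line of the example above, sequence_of returns the label "rat" and the sequence "ATTCCCGGGGGACCC".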

2.3 Distance Input Formats

There are m(m-1)/2 pairwise distances for m OTUs. These distances can be arranged either in the lower-left or in the upper-right triangular matrix. Following the key word #MEGA on the first line and the TITLE on the second line, all OTU labels should be written on consecutive lines. OTU labels should be prefixed with the '#' mark and should be written according to the conventions described in section 2.1.2. This list should be separated from the following distance matrix by at least one blank line.

   Format type -->   #mega
   Title       -->   TITLE: Upper-right triangular matrix
   Blank 1     -->
   OTU names on      #one
   consecutive       #two
   lines             #three
                     #four
                     #five
   Blank 2     -->
   OTU 1 vs.         1.0   2.0   3.0   4.0
   others, etc.            3.0   2.5   4.6
                                 1.3   3.6
                                       4.2

In this example, blank line 1 is optional, but blank line 2 is required. The two alternate distance matrix formats are:

Lower-left matrix:        Upper-right matrix:
d12                       d12  d13  d14  d15
d13  d23                       d23  d24  d25
d14  d24  d34                       d34  d35
d15  d25  d35  d45                       d45
Comments In data files containing distance matrices, comments can only be placed after the TITLE line and before the OTU labels.
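The two triangular layouts can be told apart by the number of values on each row: m-1, m-2, ..., 1 for the upper-right form and 1, 2, ..., m-1 for the lower-left form. The Python sketch below uses that observation to build a full symmetric matrix. It is an illustration under stated assumptions rather than MEGA's reader: parse_distance_file is a hypothetical name, each matrix row is assumed to occupy one line, and the required blank line before the matrix is not enforced.

```python
def parse_distance_file(text):
    """Return (labels, full symmetric matrix) from MEGA-format distance text."""
    labels, rows = [], []
    for line in text.splitlines()[2:]:        # skip the #MEGA and TITLE lines
        parts = line.split()
        if not parts:
            continue
        if parts[0].startswith("#"):
            labels.append(parts[0][1:])
        elif labels:                          # text before the labels is a comment
            rows.append([float(x) for x in parts])
    m = len(labels)
    d = [[0.0] * m for _ in range(m)]
    lengths = [len(r) for r in rows]
    if lengths == list(range(m - 1, 0, -1)):  # upper-right triangle
        for i, row in enumerate(rows):
            for k, value in enumerate(row):
                d[i][i + 1 + k] = d[i + 1 + k][i] = value
    elif lengths == list(range(1, m)):        # lower-left triangle
        for i, row in enumerate(rows, start=1):
            for j, value in enumerate(row):
                d[i][j] = d[j][i] = value
    else:
        raise ValueError("expected m(m-1)/2 distances in triangular form")
    return labels, d
```

For the five OTUs in the example above, the ten values fill a 5 x 5 matrix with zeros on the diagonal.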

2.4 Editing Sequence Data

Input sequence data consist of two or more aligned sequences of equal length. In MEGA, any subset of these sequence data can be selected for analysis using options available in the Data menu. The Select OTUs and Select Sites/Codons commands are used to choose a desired subset of data. This subset is referred to as the current data, and it is maintained until it is modified.

Selecting Mode for Analysis The Select Mode command is used to select the protein coding or noncoding mode for nucleotide sequences. The coding mode provides codon-by-codon and site-by-site analyses, whereas the noncoding mode provides only site-by-site analysis.
Selecting OTUs By default, all OTUs are included in the current data. Some of these OTUs can be removed by using the Data|Select OTUs command. These OTUs will stay deleted until the Select OTUs command is used again.
Selecting Sites or Codons Options for selecting domains as well as individual sites or codons are provided in MEGA. To start with, all the sites (codons) are included in the current data. With the Domains option, up to 10 nonoverlapping domains of sites (or codons) can be chosen. Individual sites (or codons) are chosen by using the Individual command.

The options for including alignment gaps and missing-information sites and the choice of nucleotide positions in codons provide a second level of data editing. MEGA prompts for these options every time before an analysis begins (see section 4.5), and they affect only the current analysis.

Choosing Sites in Codons Any combination of first, second, and third nucleotide positions in the codons can be chosen if the nucleotide sequences are used in the protein coding mode.
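The effect of choosing codon positions can be pictured as a filter on site indices. The Python sketch below is an illustration only; select_codon_positions is a hypothetical name, and the reading frame is assumed to start at the first site of the sequence.

```python
def select_codon_positions(seq, positions):
    """Keep only the chosen codon positions (1, 2, or 3) of a coding sequence."""
    wanted = {p - 1 for p in positions}      # convert to 0-based offsets
    # Assumption: the first site of seq is the first position of a codon.
    return "".join(ch for i, ch in enumerate(seq) if i % 3 in wanted)
```

For the two codons ATG CCA, choosing positions 1 and 2 gives "ATCC", and choosing position 3 alone gives "GA".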
Excluding Missing-Information Sites and Alignment Gaps In distance computation, alignment gaps and missing-information sites can be treated in two different ways. One is to eliminate all these gap and missing-information sites from all the sequences. The other is to ignore only the gap and missing-information sites that are involved in a particular pairwise comparison. These options are usually prompted before distance calculation and tree reconstruction. Detailed discussions on this topic are presented in the chapters on Distance Estimation and Phylogenetic Inference.
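The difference between the two treatments can be sketched in Python as follows. This is an illustration, not MEGA's implementation: the function names are hypothetical, and '-' and '?' are assumed to be the gap and missing-information symbols, as in the frequently used convention mentioned in section 2.2.

```python
MISSING = set("-?")   # assumed symbols for alignment gaps and missing information

def complete_deletion(seqs):
    """Drop every site at which any sequence has a gap or missing information."""
    keep = [i for i in range(len(seqs[0]))
            if all(s[i] not in MISSING for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]

def pairwise_sites(a, b):
    """Keep the sites usable for one comparison: neither sequence is missing there."""
    return [(x, y) for x, y in zip(a, b) if x not in MISSING and y not in MISSING]
```

With the three sequences ACGT-A, AC?TTA, and ACGTTA, the first treatment leaves four sites for every comparison, whereas the second leaves five sites for the comparison of the first and third sequences.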

2.5 Editing Distance Data

A set of OTUs can be selected in distance matrix data by using the Select OTUs command from the Data menu. The distance matrix is reduced automatically by removing rows and columns corresponding to the excluded OTUs.
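The reduction of the matrix corresponds to deleting the rows and columns of the excluded OTUs. A minimal Python sketch, assuming the distances are held as a full symmetric list of lists and using the hypothetical name drop_otus:

```python
def drop_otus(labels, matrix, excluded):
    """Remove the rows and columns of the excluded OTUs from a distance matrix."""
    keep = [i for i, name in enumerate(labels) if name not in excluded]
    new_labels = [labels[i] for i in keep]
    new_matrix = [[matrix[i][j] for j in keep] for i in keep]
    return new_labels, new_matrix
```

The original labels and matrix are left unchanged, so a different subset can be selected later.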

