Converting GCG Format

GCG files consist of one or more groups of non-blank lines, with the groups separated by one or more blank lines. The non-blank lines look similar to this:

 

Chloroflex
Chloroflex Length: 428 Mon Sep 25 17:34:20 MDT 2000 Check: 0 ..
1 MSKEHVQTIA TDDVSKNGHT PPTNASTPPY PFVAIVGQAE LKLALLLCVV
51 NPTIGGVMVM GHRGTAKSTA VRALAAMLPP IKAVAGCPYS CAPDRTAGLC
101 DQCRALEQQS GKTKKPAVIN IPVPVVDLPL GATEDRVCGT LDIERALTQG
151 VQAFAPGLLA RANRGFLYID EVNLLEDHLV DVLLDVAASG VNVVEREGVS
201 VRHPARFVLV GSGNPEEGDL RPQLLDRFGL HARITTITDV SERVEIVKRR
251 REYDADPFAF VEKWAKETQK LQRKIKQAQR RLPEVILPDP VLYKIAELCV
301 KLEVDGHRGE LTLARA.ATA LAALEGRNEV TVQDVRRIAV LALRHRLRKD
351 PLETQD.... ...DAVRIER AVEEVLVP.. .......... ..........
401 .......... .......... ........

 

 

The “Check” tag near the end of a line marks the header line of a new sequence. The name of the sequence is taken from the line immediately preceding that header; the lines that follow, up to the next blank line, are read as the sequence. For each of these sequence lines, the leading digits (the position number) are stripped off and the rest of the line is used. The following shows a conversion of the above sequence.

 

#mega
Title: infile.gcg

#Chloroflex
MSKEHVQTIA TDDVSKNGHT PPTNASTPPY PFVAIVGQAE LKLALLLCVV
NPTIGGVMVM GHRGTAKSTA VRALAAMLPP IKAVAGCPYS CAPDRTAGLC
DQCRALEQQS GKTKKPAVIN IPVPVVDLPL GATEDRVCGT LDIERALTQG
VQAFAPGLLA RANRGFLYID EVNLLEDHLV DVLLDVAASG VNVVEREGVS
VRHPARFVLV GSGNPEEGDL RPQLLDRFGL HARITTITDV SERVEIVKRR
REYDADPFAF VEKWAKETQK LQRKIKQAQR RLPEVILPDP VLYKIAELCV
KLEVDGHRGE LTLARA.ATA LAALEGRNEV TVQDVRRIAV LALRHRLRKD
PLETQD.... ...DAVRIER AVEEVLVP.. .......... ..........
.......... .......... ........
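
The conversion rules described above can be illustrated with a short Python sketch. This is not the converter's actual implementation; the function name gcg_to_mega and its title parameter are hypothetical, and the sketch assumes that the tag appears literally as "Check:" and that the sequence name sits on the line directly before the header line.

```python
import re


def gcg_to_mega(gcg_text, title="infile.gcg"):
    """Convert GCG-style text to MEGA-style text (illustrative sketch only)."""
    lines = gcg_text.splitlines()
    out = ["#mega", f"Title: {title}", ""]
    i = 0
    while i < len(lines):
        # Assumption: a header line is any line containing the literal "Check:"
        # whose immediately preceding line is non-blank (the sequence name).
        if "Check:" in lines[i] and i > 0 and lines[i - 1].strip():
            out.append("#" + lines[i - 1].strip())
            i += 1
            # Everything up to the next blank line is sequence data.
            while i < len(lines) and lines[i].strip():
                # Strip the leading position number, keep the rest of the line.
                out.append(re.sub(r"^\s*\d+\s*", "", lines[i]).rstrip())
                i += 1
            out.append("")  # blank line between sequences
        else:
            i += 1
    return "\n".join(out).rstrip("\n") + "\n"
```

Calling gcg_to_mega on the text of a GCG file returns the converted text as a single string, with one "#name" block per sequence found. Residue spacing and gap characters such as "." are passed through unchanged, matching the example conversion above.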