User Tools

Site Tools


This is an old revision of the document!



Example 1 - 生物系統情報の取得


  use G;
  my $gb = load("g-language-1.8.10/share/genomes/ecoli.gbk");
  foreach(sort keys %{$gb->{TAXONOMY}}) {
    print "$_: ", $gb->{TAXONOMY}->{$_}, "\n";


                 G-language  Genome Analysis Environment v.1.8.10


            Please cite: 
               Arakawa K. et al. (2003) Bioinformatics.
               Arakawa K. et al. (2006) Journal of Pestice Science.
               Arakawa K. et al. (2008) Genes, Genomes and Genomics.

	      License: GNU General Public License
      Copyright (C) 2001-2009 G-language Project
      Institute for Advanced Biosciences, Keio University, JAPAN 


Accession Number: NC_000913

  Length of Sequence :   4639675
           A Content :   1142228 (24.62%)
           T Content :   1140970 (24.59%)
           G Content :   1176923 (25.37%)
           C Content :   1179554 (25.42%)
              Others :         0 (0.00%)
          AT Content :    49.21%
          GC Content :    50.79%

0: Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia.
1: Bacteria
2: Proteobacteria
3: Gammaproteobacteria
4: Enterobacteriales
5: Enterobacteriaceae
6: Escherichia.
all: Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia.
class: Gammaproteobacteria
domain: Bacteria
family: Enterobacteriaceae
genus: Escherichia
order: Enterobacteriales
phylum: Proteobacteria
species: coli



  $ grep -A 2 ORGANISM g-language-1.8.10/share/genomes/ecoli.gbk
    ORGANISM  Escherichia coli str. K-12 substr. MG1655
              Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
              Enterobacteriaceae; Escherichia.

Example 2 - 複数のGenbankファイルを用いた解析


  use G;
  my $dir = 'g-language-1.8.10/share/genomes/';
  opendir(DIR, $dir) || die "directory open error:$!";
  foreach my $file (sort readdir(DIR)) {
    next if ($file !~ /\.gbk$/);
    my $gb = load("$dir/$file", "no msg");
    print $gb->{DEFINITION}, "\n";


Borrelia burgdorferi B31, complete genome.
Bacillus subtilis subsp. subtilis str. 168, complete genome.
Buchnera aphidicola str. APS (Acyrthosiphon pisum), complete genome.
Synechococcus sp. WH 8102, complete genome.
Escherichia coli str. K-12 substr. MG1655, complete genome.
Mycoplasma genitalium G37, complete genome.
Plasmid F, complete sequence.
Pyrococcus furiosus DSM 3638, complete genome.


$ grep DEFINITION g-language-1.8.10/share/genomes/*.gbk




相対的同義語コドン使用頻度(Relative synonymous codon usage; RSCU)は予測されるコドン使用頻度との比として表されるため、コドンの使用頻度に偏りが生じないときは1となります。RSCUは以下のように求めることができます。


  &codon_compiler($gb, -output=>'stdout', -data=>'RSCU');

ライム病ボレリア(Borrelia burgdorferi)のゲノムに対するRSCU値は論文の表1に示しています(。

Example 3- Retrieving rRNA sequences and phylogenetic tree reconstruction

$gb→rRNA() returns the array of all feature IDs of rRNAs, and can specify rRNA species (e.g. '16S', '23S', '5S', 'SSU', 'LSU'). For exmaple, $gb→rRNA('16S') returns a list of feature IDs of 16S rRNAs, sorted by the nucleotide sequence lengths.

The following script outputs a fasta-formatted file of nucleotide sequences of the longest 16S rRNA genes for multiple bacterial strains.

use G;
my $rRNA = '16S';
my $dir = 'g-language-1.8.10/share/genomes/';
opendir(DIR, $dir) || die "directory open error:$!";
foreach my $file (sort readdir(DIR)){
    next if($file !~ /\.gbk$/);
    my $gb = load("$dir/$file","no msg");
    next unless($gb->rRNA($rRNA));
    print OUT ">$gb->{ORGANISM}\n", $gb->get_geneseq($gb->rRNA($rRNA)),"\n";

To check the number of sequences in the fasta-formatted output file (gene.fasta), type:

$ grep '>' gene.fasta | wc -l

Phylogenetic tree can be reconstructed by adding the following two lines: the first line invokes multiple alignment, and the second line calculates Neighbor joining (NJ) tree.

       clustalw -align -infile=gene.fasta -outfile=gene.aln -outorder=input
       clustalw -tree -infile=gene.aln

Executing this script generates the following three output files: the gene.aln file (multiple alignment), gene.dnd file (guide tree), and file (NJ tree). The following phylogram was drawn using the file as an input file of TreeView (

Example 4- Calculating genome statistics and their correlation analyses

The following script calculates various genome statistics such as rRNA gene number, tRNA gene number, protein-coding sequence (cds) number, genome size, GC content [100x(G+C)/(A+C+G+T)], and GC skew index (gcsi), which is a measure of the degree of GC skew [(G-C)/(G+C)] (

use G;
my @key = qw(rRNA tRNA cds size GC gcsi);
my %stat;
print OUT join("\t", @key), "\n";
my $dir = 'g-language-1.8.10/share/genomes/';
opendir(DIR, $dir) || die "directory open error:$!";
foreach my $file (sort readdir(DIR)){
    next if($file !~ /\.gbk$/);
    my $gb = load("$dir/$file","no msg");
    $stat{rRNA} = scalar $gb->rRNA(); #	rRNA gene number
    $stat{tRNA} = scalar $gb->tRNA(); #	tRNA gene number
    $stat{cds} = scalar $gb->cds(); # protein-coding sequence (cds) number
    $stat{size} = length($gb->{SEQ}); #	genome size
    my $acgt = $gb->{SEQ} =~ tr/acgt/acgt/; # A+C+G+T
    my $gc = $gb->{SEQ} =~ tr/gcGC/gcGC/; # G+C
    $stat{GC} = sprintf "%.1f", 100 * $gc / $acgt; # GC	content [100x(G+C)/(A+C+G+T)]
    $stat{gcsi} = (&gcsi($gb))[0]; # GC	skew index (gcsi)
    my @tmp;
    foreach (@key){ push(@tmp, $stat{$_}); }
    print OUT join("\t", @tmp), "\n";

Executing this script generates the following tab-delimited output file ('stat.txt').

rRNA    tRNA    cds     size    GC      gcsi
5       18      851     910724  28.6    0.44353725960019
30      86      4176    4215606 43.5    0.214855185905019
3       32      564     640681  26.3    0.126305531560014
6       44      2519    2434428 59.4    0.0837418081678612
22      89      4321    4639675 50.8    0.0966615833014818
3       36      476     580076  31.7    0.127587425631069
0       0       108     99159   48.2    0.20973363827319
4       46      2125    1908256 40.8    0.0203799266306591

Scatter plots and correlation coefficients between all pairs of these values can be computed by adding the following R commands (

my $rcmd = new Rcmd;
my @result = $rcmd->exec(
                         dat = read.delim("stat.txt")
                         postscript("", pointsize=10, width=10, height=50)
                         upPan <- function(...) points(..., col="darkblue")
                         lowPan <- function(x, y, ...) text(mean(par("usr")[1:2]), mean(par("usr")[3:4]),signif(cor(x, y),2),cex=2)
                         pairs(dat, pch=20, lower.panel=lowPan, upper.panel=upPan)

Executing this script generates the following '' file.

There are a strong positive correlation between rRNA gene number and tRNA gene number (see rRNA-tRNA plot, r=0.9), as reported previouly (, and between the number of protein-coding sequences and genome size (see cds-size plot, r=1), as reported previouly (

Example 5- Calculating gene statistics

The following script prints various statistics for protein-coding sequences (cds) in Mycoplasma genitalium genome such as start position (start), GC content at 3rd codon positions (gcc3), GC skew at 3rd codon positions (gcs3), the codon adaptation index (cai) (, and the weighted sum of relative entropy (Ew) as a measure of the degree of synonymous codon evenness (

use G;
my $gb = load("g-language-1.8.10/share/genomes/mgen.gbk");
foreach my $cds ($gb->cds()){
    &bui($gb, -output=>'n', -id=>$cds, -position=>3); # base usage indices (gcc3, gcs3) at 3rd codon posotions
&cai($gb, -output=>'n', -w_output=>'n');
&Ew($gb, -output=>'n');
my @key = qw(start gcc3 gcs3 cai Ew);
print OUT join("\t", @key), "\n";
foreach my $cds ($gb->cds()){
    my @tmp;
    foreach (@key){ push(@tmp, $gb->{$cds}->{$_}); }
    print OUT join("\t", @tmp), "\n";

Executing this script generates the following tab-delimited output file ('stat.txt').

$ head stat.txt
start   gcc3    gcs3    cai     Ew
686     0.1583  -0.1333 0.7351  0.6295
1828    0.1780  -0.1636 0.7443  0.6675
2845    0.2296  -0.1409 0.7257  0.7239
4812    0.1868  -0.0769 0.7164  0.7150
7294    0.1827  +0.0526 0.7566  0.6604
8551    0.1818  +0.0526 0.7391  0.6084
9156    0.1304  -0.3939 0.7644  0.5625
9923    0.1519  +0.0746 0.7635  0.6090
11251   0.1533  -0.1500 0.7432  0.6235

Scatter plots and correlation coefficients between all pairs of these values can be computed by adding the R commands described above (Example 4). Executing this script generates the following '' file.

M. genitalium has a very distinctive variation in GC content within the genome (see start-gcc3 plot), and GC content is correlated with codon usage bias (see gcc3-cai and gcc3-Ew plots), as reported previously (Figure 2 in In Borrelia burgdorferi ("g-language-1.8.10/share/genomes/bbur.gbk"), GC skew (gcs3) is correlated with codon usage bias.

variousperlscriptsjapanese.1285938245.txt.gz · Last modified: 2014/01/18 07:44 (external edit)