User Tools

Site Tools


introductiontog-languageenglish

An introduction of G-language System

What is G-language GAE?

G-language Genome Analysis Environment (G-language GAE) is a generic bioinformatics package developed since 2001 mainly by Institute for Advanced Biosciences, Keio University. At the moment more than 200 analyses program are included, with graphical user interface to make analysis easier. Also G-language GAE can be used as modules for Perl and has its own shell, making it very powerful for expansion.

References:

  • "G-language Genome Analysis Environment: a workbench for nucleotide sequence data mining", Arakawa K, Mori K, Ikeda K, Matsuzaki T, Kobayashi Y, Tomita M, Bioinformatics, 2003, 19(2):305-306 (PubMed)
  • "G-language System as a platform for large-scale analysis of high-throughput omics data", Arakawa K, Tomita M, Journal of Pesticide Science, 2006, 31(3):282-288(PDF)

About this tutorial

From here on specific example will be shown to explain what kind of bioinformatics research is possible with G-language GAE. A phenomena called GC skew in bacterium genome will be taken as account and a in a series of analysis there will be biological observations in every turn.

-Beginners- here only analysis that can be done by only using graphical user interface will be taken as account, it is to say only those functions that come ready to G-language GAE in standard.

-Intermediate- will introduce standard analysis with a bit of expansion using Perl that can also be done by writing it on the shell one line at the time.

Finally in -Advanced- explanation of high analysis that can be done by making Perl script using G-language GAE as a Perl module is given.

Basics of G-language GAE

Analysis by Graphical User Interface

Let’s first start up the system and run!

G-language GAE has a user interface (GUI) to highly analyze with generally only one click. To start up GUI only type command line glang.

unix % glang

then, three window: G-language Manager which is the main window to execute such as analysis, Console which shows system messages and Configurator which configures analysis in detail, will start up. In start up mode, G-language GAE already has a software called bacteria analysis, which is setup with mycoplasma bacteria genome as analysis subject and parameters for doing analysis. For now lets press the Execute button of G-language Manager and analyze with the initial settings.

Results come as soon as execution is done. Result as letter sequence will be outputted in the Text Output window and the others will be shown in different windows. G-language GAE converts and outputs the results in to graphs as possible as it can so that it can be understood in a glimpse. This enables researchers to continue their observation and research, minimizing the time it takes to accommodate data as possible.

G-language GAE has in total more than 200 analysis programs but the ones used here for bacteria analysis system, are especially ready by 30 appropriate analysis program for complete genome analysis. For example the codon_usage outputted today is for showing codon table, base_information_content is for calculating information amount around start codons, genomicsskew is for showing GC skew of various regions, view_cds is one which converts the amount of bases of 100 bp around the start codon and stop codon into a graph, and genome_map which converts and displays information of base amount in genome and gene as a map.

G-language Manager

G-language Manager, which is the main window of GUI, is ready with three buttons. Configure button opens the Configurator window, Execute button executes the selected analysis as we have already seen in the example. Generate button outputs the selected analysis to a Perl script instead of executing it. By editing the selected Perl script more detail settings and function expansion can be made.

In the Help menu it is possible to acquire information about G-language GAE and see the project web page. In the project web page other than various documents you can download the latest version. From the File menu, application termination and system analysis loading can be done. Analysis system file in G-language Configuration File (GCF) format can be selected and loaded from File→Load Script menu. In standard bacteria analysis system default.gcf is selected, however here in /usr/local/g-language, you can select any of the various GCF file, making it possible to do a large scale of analysis. Below is shown a GCF file list offered in standard. Additional GCF file can be obtained from the project web page.

GCF file name Explanation
all.gcf Contains the total, more than 200 analysis program included in G-language GAE.
default.gcf Analysis system for bacteria. Analysis adequate for complete bacteria genome.
oasys.gcf System for exhaustive calculation of information concerning to a specific gene.
CASYS.gcf cDNA analysis system. Clustering cDNA
Mapping.gcf System to map cDNA and EST data on the genome.
STeP.gcf STS-ePCR system. Using ePCR data such us OMIM is mapped in places between STS.

In File→Convert Perl Script menu, it is possible to load Perl program and immediately change it to GUI. The system automatically recognizes input and output Perl subroutines inside the program and converts it to GCF and loads it to GUI. By using File→Convert Perl Script menu and Generate button, it is possible to improve software repeating interface such as from Perl script to GUI and then again to Perl script.

Configurator

In Configurator it is possible to select a program for analysis and set its options in detail. To do this, specify the filename of the database to be used in the $gb part in the top, select the analysis program from the left side with the On/Off button. You can specify the sequence of analysis by typing a number to the space on the right side.

If more detailed analysis is desirable, click any analysis Edit button, and a list of useable programs for that analysis will show up. Here you can input the values of the options to be desired to be change. Press the Save button once configuration of programs that will be used and its option is done and the modified settings will be saved. It will only remain to press the Execute button to start the analysis.

There are many types of analysis programs but have many things in common, such as instance_G (normally as $gb and no modification is necessary), which shows the genome and/or the base sequence, -filename which is the file name of the outputted file, the –Return button which specifies a variable to store the returned value and –output to specify the format of the result (f is for file output, g is for graph output, show is for after graph output and stdout is for standard output).

Customize your analysis

Let only gcwin which plots the transition of GC content, be switched On and all the others switched Off in Configurator. Save it in that condition and execute.

Next, press Edit to show options, change window to for example 1000 to 5000 and Save/Execute it. The window option is to specify the window size to divide the genome and in default it is set to 10000. Notice that since the genome size of mycoplasma is really small the default value is too big.

After finding the optimal window change the at option to 1. When this options is 1 you can see the transition of AT amount instead of GC amount.

Similarly, switch any analysis of interest to On, customize various options and experiment.

G-language Shell

First, let’s start it up and use it

G-language GAE is ready with an shell for analyzing like a conversation. In this shell every Perl functions, any G-language GAE related functions and some easy UNIX shell functions are useable, making it a very powerful interface. To start up type the command line G.

unix % G 
           __/__/__/__/__/__/__/__/__/__/__/__/__/ 
              
           G-language Genome Analysis Environment 
            Version: 1.51 
            License: GPL 
 
            Copyright (C) 2001-2005 
            G-language Project 
            Institute for Advanced Biosciences, 
            Keio University, JAPAN 
               http://www.g-language.org/ 
           __/__/__/__/__/__/__/__/__/__/__/__/__/ 
          
 G > 

G> will be shown on the head of the line and results will instantly show up if a Perl program or any supported shell command( ls, cd, cp, rm, less, convert) is typed in continuation. There is a complement function with tab for file name even in times in the middle of commanding Perl. As in UNIX by pressing the up key, because of the history function, it is possible to call the last command and erase a row with Ctrl-K.

G >$value = 256; 
G >print $value * 4; 
1024 
G >ls 
bmp  docs  g2s  gcf  mgen.gbk 
 G >$gb = new G("/usr/local/g-language/mgen.gbk"); 
Accession Number: NC_000908 
  Length of Sequence :    580074 
         A Content :    200543 (34.57%) 
         T Content :    195695 (33.74%) 
         G Content :     92312 (15.91%) 
         C Content :     91524 (15.78%) 
            Others :         0 (0.00%) 
        AT Content :    68.31% 
        GC Content :    31.69% 
G > 

To end the shell, use the quit command. Then, a message of whether to save the workspace will appear. By typing y all the variables used while working is save for eternity and the next time you start up it will be available immediately. In a normal program the memory content is gone after ending, but the function of the eternal memory of the shell keeps it even after ending and is automatically loaded the next time it is started up.

G >quit 
  save value ?  y/n ? 
quit >>y 
zux100: ~/ ? G 
           __/__/__/__/__/__/__/__/__/__/__/__/__/ 
              
           G-language Genome Analysis Environment 
            Version: 1.51 
            License: GPL 

            Copyright (C) 2001-2005 
            G-language Project 
            Institute for Advanced Biosciences, 
            Keio University, JAPAN 
               http://www.g-language.org/ 
           __/__/__/__/__/__/__/__/__/__/__/__/__/ 
           
G >$gb->seq_info() 
Accession Number: NC_000908 
  Length of Sequence :    580074 
         A Content :    200543 (34.57%) 
         T Content :    195695 (33.74%) 
         G Content :     92312 (15.91%) 
         C Content :     91524 (15.78%) 
            Others :         0 (0.00%) 
        AT Content :    68.31% 
        GC Content :    31.69% 
G > 

Also by typing command “makelog”, after finishing any work in the shell, every input of the work is saved in to an executable Perl script. This log is saved into a file called G_INTERNAL.log inside the worked directory.

G > makelog 

Using bacteria analysis system in a command line

Although bacteria analysis system used with GUI had analysis functions defined by GCF, to repeatedly do a sequence of work, there are cases where command lines are better suited than GUI. In such cases, it will work by giving the GCF file as a parameter to the G command of the shell.

unix % G /usr/local/g-language/gcf/default.gcf 

Similarly, by giving a Perl script with “.pl” extension as parameter, it is handled as a loaded Perl module inside G-language GAE like perl –MG.

unix % G test.pl

To use analysis programs independently, write it as a function inside the shell. Generally loaded genome file is given as new G() to the first parameter.

G >$gb = new G("/usr/local/g-language/mgen.gbk"); 
Accession Number: NC_000908 
  Length of Sequence :    580074 
         A Content :    200543 (34.57%) 
         T Content :    195695 (33.74%) 
         G Content :     92312 (15.91%) 
         C Content :     91524 (15.78%) 
            Others :         0 (0.00%) 
        AT Content :    68.31% 
        GC Content :    31.69% 
G >gcwin($gb); 

Option setting is done by writing an option name followed by “⇒” to specify value.

G > gcwin($gb, -window=>1000); 

G-language practice –Beginners-

GC skew

What is GC skew?

GC skew is a parameter that represents the amount bias of G and C in a single DNA molecule strand and its formula is (C-G)/(C+G). Chargaff's base distribution law, famous for contributing in the discovery of the structure of the DNA by Watson and Crick, which states that “the amount of As and Cs and of Gs and Cs inside a cell is almost the same” has later been added a second term that defines by experience that “the quantity of Gs and Cs, and As and Ts in one strand of DNA is almost the same”. This phenomenon occurs in the whole genome, but bias can be seen in small set of regions. In fact in bacterium this tendency is clearly not relative in various places, and it is known that those places match replication starting/ending points. There is two theories about how GC skew phenomena occurs, these are different mutation probability in leading strand and lagging strand and mutation bias by usage of codons.

The GC skew phenomenon was known since early after the first time the complete genome of bacteria was read and it is still one of a basic analysis to understand the tendency of an entire genome. G-language GAE is ready with various tools to earn information concerning to GC skew always needed to be considered when a whole genome is analyzed.

Reference:

  • Lobry, J.R., 1996. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 13, 660-665

Analysis

Now, let’s start GC skew analysis. To compare various genome tendencies this time complete genomes of 5 different species will be used.

Name of organism Name of specie Accession number Explanation
Escherichia coli Escherichia coli NC_000913 The most known model organism
Bacillus subtillis Bacillus subtillis NC_000964 A relative of bacillus subtillis used for fermentation of natto. It shows an especially strong GC skew.
Cyanobacteria Synechococcus sp. NC_005070 Bacteria that does photosynthesis.
Pyrococcus furiosus Pyrococcus furiosus NC_003413 An archae resistant to high temperature
Mycoplasma Mycoplasma genitalium NC_000908 Microorganism, with the shortest complete genome among the completely read.

First start up the G-language shell and download the genome of these 5 from the database and make it in a state that is useable.

unix %  G 
           __/__/__/__/__/__/__/__/__/__/__/__/__/ 
              
           G-language Genome Analysis Environment 
            Version: 1.51 
            License: GPL 
 
            Copyright (C) 2001-2005 
            G-language Project 
            Institute for Advanced Biosciences, 
            Keio University, JAPAN 
               http://www.g-language.org/ 
           __/__/__/__/__/__/__/__/__/__/__/__/__/ 
           
G >$eco = new G("NC_000913", "no msg"); 
G >$bsub = new G("NC_000964", "no msg"); 
G >$mgen = new G("NC_000908", "no msg"); 
G >$cyano = new G("NC_005070", "no msg"); 
G >$pyro = new G("NC_003413", "no msg"); 
G >

Some time might take but by typing the above commands the 5 genome will be downloaded from the data base and stored into $eco, $bsub, $mgen, $cyano, $pyro for later analysis. Normally the new G() command reads the genome data and simultaneously shows the base ratio composition, however with “no msg” option it is set so that these data will not be shown.

Now, let’s see the GC skew of these genomes. The function that calculates GC skew is gcskew($gb).

G >gcskew($eco) 
G >gcskew($bsub) 
G >gcskew($mgen) 
G >gcskew($cyano) 
G >gcskew($pyro) 

By typing as the above, 5 new windows will show up each with a graph. GC skew function divides the genome every 10,000bp calculates the (C-G)/(C+G) of those regions and plots it. Thus every region with more Cs than Gs has a positive value and negative values happen in the opposite.

In colon bacillus and bacillus subtillis, it can be seen that regions with relatively large amount of Gs and relatively C rich region are divided beautifully. In contrast this tendency is not observed in Cyanobacteria and Pyrococcus. Mean while the number of data point in the mycoplasma is fewer than the others. It seems that 10,000bp window was to long for mycoplasma 580,000bp(other genomes had 2.5Mbp~4.6Mbp). Lets retry to graph shortening the widow size to 1,000bp in the option.

G > gcskew($mgen, -window=>1000) 

With this it has been drawn with almost the same detail as the other graphs. However as with Cyanobacteria and Pyrococcus mycoplasma GC skew does not have any clear shift point. Should it be that in these organisms GC skew does not exist? It is known that accumulation plot is effective for seeing switching in number from positive to negative such as GC skew. This is done not by plotting each value per window, rather by adding values consecutively. The function, which calculates GC skew accumulation, is cum_gcskew($gb). As the anterior let’s see the GC skew accumulation with this function (changing window length to 1,000bp for mycoplasma).

G >cum_gcskew($eco) 
G >cum_gcskew($bsub) 
G >cum_gcskew($mgen, -window=>1000) 
G >cum_gcskew($cyano) 
G >cum_gcskew($pyro) 

Shift point of Colon bacillus and Bacillus subtillis, which showed clear GC skew is emphasized by plotting the accumulation. Also for mycoplasma and Cyanobacteria that had too strong noise to observe GC skew can now be observed that it has a GC skew shift point and thus understand that it has a weak GC skew. However Pyrococcus does not show any clear shift point even with the accumulation plot.

Generally the leading strand in bacteria is rich in Gs than in its lagging strand. Thus it is said that the GC skew shift point is the borderline of the leading strand and the lagging strand, in other words the replication start point and ending point. In GC skew accumulation, the highest point corresponds to the estimated replication start point and the lowest point corresponds to the replication termination point. The program that finds these two points is named find_ori_ter($gb). Let’s predict the replication starting point and finishing point of colon bacillus and bacillus subtillis, which are already known.

G > find_ori_ter($eco) 
find_ori_ter: 
   Window size = 1000 
   Predicted Origin:   3923500 
   Predicted Terminus: 1549500 
G > find_ori_ter($bsub) 
find_ori_ter: 
   Window size = 1000 
   Predicted Origin:   4213500 
   Predicted Terminus: 1941500

In E.coli, the replication starting point is near 3,923,500bp and the ending point is near 1,588,800bp and in the case of Bacillus subtillis the replication starting point is near 1bp and the ending point is near 2,017,000bp. It is not that it entirely matches, but it can be said that the prediction of the replication starting point and finishing point is relatively highly accurate. (The replication point of Bacillus Subtillis predicted is in the last part of the genome but since the genome is round shape, it is fair to say that it is equal to the first part. Thus it is very near to the real replication starting point, which is 1bp). It is known that the replication staring point in mycoplasma is also in 1bp and the replication finishing point is unknown, however at least the peak of the GC skew accumulation is near 1bp and it quite matches the phenomena. Mean while from the fact that in many Archaea's replication point exists in various places, it is reasonable that clear shift point and GC skew cannot be observed.

Next, let’s find out the mutation regions that corresponds to the GC skew phenomena. From Colon Bacillus and Bacillus Subtillis, which had a pretty strong GC skew, let’s plot simultaneously the whole genome, only the code area, only between genes and only the third base of codons relatively susceptible to mutation. This time every length of the plotted will be different, so we will set the window to be the 1/250 of the entire length of each. The function that will do this is genomicskew($gb).

G >genomicskew($eco)
G >genomicskew($bsub)

Because the region between genes and the third base of the codon is much shorter than the code region and the entire genome, their swing lengths are wider. However, all of these sections demonstrate that GC skew exists. Thus, it could be suggested that mutation responsible for GC skew occurs almost equally in the leading strand and not specifically in a particular section of the genome.

Now, what would be the difference between species that have strong GC skew tendency and species that do not have it? It is yet not understood clearly, however at least it seems to be correlated with the speed of reproduction of the bacteria. The time it takes for a cell division in Bacillus Subtillis is approximately 0.45/h, 0.35h for Colon Bacillus, 12h for Mycoplasma and 8h for Cyanobacteria, and it can be seen that these speed and GC skew value have a correlation.

References:

  • Rocha, E.P.C., 2004. Codon usage bias from tRNA's point of view: Redundancy, specialization, and efficient decoding for translation optimization. Genome Res. 14, 2279-2286(reproduction speed)
  • Freeman, J.M., et al. 1998. Patterns of Genome Organization in Bacteria. Science. 279, 1827a(replication starting/ending point)
  • Grigoriev, A., 1998. Analyzing genomes with cumulative skew diagrams. Nucleic Acids Res. 26, 2286-2290(accumulation plotting graph)

G-language practice –Intermediate-

Tendency of oligo signal sequence

Oligo signal sequence: what isχsequence?

In Colon Bacillus about once in every replication procedure, the replication fork progress is impeded and it results in the cut in the double stranded DNA chain. In these occasions RecBCD enzyme responsible for replication fork repair to continue replication by enhancing homologous recombination becomes very important. RecBCD recognizes the 8 bases long χ (kai) arrangement 5’-GCTGGTGG-3’, as the hot spot for recombination in the genome and regulates its own function as well, howeverχsequence is only functional when it faces a specific direction, thus it is famous as a signal arrangement facing meaningfully the replication direction in the genome.

References:

  • Kowalczykowski, S.C., 2000. Initiation of genetic recombination and recombination-dependent replication.Trends Biochem. Sci. 25, 156-165
  • Uno, R. et al., 2000. The orientation bias of Chi sequence is a general tendency of G-rich oligomers. Gene. 259, 207-215

Here it is shown the way to calculate the ratio of a specific oligo arrangement facing the replication direction, it is to say, the ration it exists in the leading the strand.

Analysis

It is necessary to have the leading strand sequence obviously because it is for calculating the ratio in the leading strand. In G-language GAE by using the leading_strand($gb) function as below it is possible to get both the clockwise and the counterclockwise leading strand from the replication starting point to the replication finishing point. You can get a genome sequence of only leading strands by connecting these.

G > ($seq1, $seq2) = leading_strand($eco) 
G > $leading = $seq1 . $seq2 

It seems that by using the new G($gb) function to read the genome database it is shown a simple statistic data of that genome base sequence, however in reality this is done by calling function seqinfo($gb). Let’s verify that the G amount in the leading strand is numerous as shown in GC skew using seqinfo($gb). Because functions that use genome information or sequence as argument in G-language GAE also automatically deduce model, here not only structure body that has the genome information annotation but also scalar value with only base sequence is inputted.

G >seqinfo($leading) 
  Length of Sequence :   4639675 
           A Content :   1137315 (24.51%) 
           T Content :   1145883 (24.70%) 
           G Content :   1216052 (26.21%) 
           C Content :   1140425 (24.58%) 
              Others :         0 (0.00%) 
          AT Content :    49.21% 
          GC Content :    50.79% 

It can be seen that compared to the entire genome, although the amount of A and T does not change, G increases and C decreases. Becauseχsequence has a biased composition that is 5 out of 8 being G, it can be inferred that it exists more in the leading strand than in the lagging strand. Then, let’s count the number ofχsequences. To count short sequence that exists inside long sequences, it is used the oligomer_counter($gb, “oligomer”).

To count the ones facing the replication direction, the sequence GCTGGTGG that is in the leading strand is counted. Mean while, to count the ones facing reversely it is possible by counting the ones existing in the lagging strand but these will be the complementary sequence of GCTGGTGG of the leading strand, in other words the number of CCACCAGC. Easy complementary sequence can be obtained by using function complement ($seq). When the ratio of the ones that face the replication direction is calculated the result becomes as the bellow.

G > print oligomer_counter($leading, "gctggtgg") 
761 
G > print oligomer_counter($leading, complement("gctggtgg")) 
247 
G > print 761+247 
1008 
G > print 761/1008 * 100 
75.4960317460317

Let’s calculate similarly any other using 8 bases sequence. By experimenting some, soon you would probably notice some of the bellow.

  • Many of the 8 bases sequence exists around 100~300 in the genome, thus 1008 is too much.
  • Similarly the number 75% that face the replication direction is very high.

What would happen if we look at this in a statistical way? First, to calculate the number found in a genome, the probability that a combination of 8 bases exists in a given sequence, because there are 4 types of bases, would be about 0.25^8= 0.000015. Colon Bacillus has 4.6Mbp so; the number of a specific 8 bases sequence existing here in one strand is about 70. Taking into account the number that would exist in the complementary strand, it would be approximately 140. Even thinking from this perspective the number ofχsequence can be said that is meaningfully numerous.

Also, for the replication direction, from the base amount the probability that the sequence exists in the leading strand and the probability that it exists in the lagging strand is each;

F1 = [A]a[T]t[G]g[C]c

And

F2= [A]a'[T]t'[G]g'[C]c'

([A][T][G][C] is the amount of each base in the leading strand, atgc is the number of bases in the oligo sequence, a’t’g’c’ is the number of bases in the complement of the oligo sequence) By calculating

F1/(F1+F2)
G >$f1 = 0.2451 ** 0 * 0.2470 ** 2 * 0.2621 ** 5 * 0.2458 ** 1 
G >$f2 = 0.2451 ** 2 * 0.2470 ** 0 * 0.2621 ** 1 * 0.2458 ** 5 
G >print $f1/($f1 + $f2) * 100 
56.7651511506961 

F1/(F1+F2) the ratio of can be obtained. Applying this to the χsequence, it can be estimated that approximately 57% is the ratio, and can be deduced that 75% largely surpass it. However provided that in this analysis it is only an estimate from the amount of each base, and in reality it is known that a Marcov quality that includes more than one base arrangement is involved in a genome, so be careful not to deduce any conclusion without using higher dimension of information such as the previous.

Reference:

  • Lobry, L.R. et al., 2003. Polarisation of prokaryotic chromosomes. Curr. Opin. Microbiol. 6, 101-108

G-language practice –Advanced-

Replication bias of every oligo

We have investigated “signal oligo sequence bias” in 8 base sequences called χsequence. However, to statistically compare, it is required to do the same analysis to all 65536 other 8 base sequences, and see the tendency of each of them. Easy repetitive task such as this can be automated using a simple program. Because G-language GAE can be handled as a Perl module as well, let’s write a Perl script and calculate the replication bias of all oligo sequences.

First, before ending the shell save the genome data of Colon Bacillus. To save the genome data give the file name as a parameter to function “output ()” by calling it from the structural body made by “new G()”. In this moment, from the file extension, G-language GAE automatically discriminates the output format and it saves it in an appropriate format. Here let’s save the file in a GenBank format.

G > $eco->output("ecoli.gbk") 

Now let’s make a Perl script. Let’s make a file named for example test1.pl and open it with emacs editor. And to declare that you are going to use G-language GAE, write down “use G;”. After that, load with “new G() “the previously saved Colon Bacillus genome data.

  use G; 
  $gb = new G("ecoli.gbk", "no msg"); 

Next as we previously did with the shell, let’s obtain the leading strand.

  ($seq1, $seq2) = leading_strand($gb); 
  $leading = $seq1 . $seq2; 

From here on, process by Perl will start. First it will count the number of every 8 base in the leading strand. This can count what type of 8 base sequences exists shifting one letter at a time from the head to the tail of the leading strand. To do a repetition as this we use “for”. To express with “for” from the number 0 (the head) to the last part where 8 bases can be taken in a window, one letter at a time, it is expressed as below.

  for ($i = 0; $i <= length($leading) - 8; $i ++){ 
  } 

The commands in this “for” are repeated each time a letter is shifted. Here it is necessary to write to cut 8 letters from every $”i-th” letter from the leading strand and add one to the count number. To cut a specific letter sequence, use the standard Perl function “substr()”, and use hash to count letter sequence.

  for ($i = 0; $i <= length($leading) - 8; $i ++){ 
      $oligo{substr($leading, $i, 8)} ++; 
  }

Now we have succeeded counting every 8 base sequence in the leading strand and storing it to a hash called “%oligo”. After that it only remains to calculate the ratio between all the 8 base sequence (including those in the complementary strand) and the ones that face the replication direction. As already done with theχsequence using the shell this part is written as the below.

    $total = $oligo{$nuc} + $oligo{complement($nuc)}; 
    $percent = $oligo{$nuc} / $total * 100; 
    printf "$nuc, $total, $percent¥n"; 

To finish all you need to do now is to do the same work with all the 8 base sequence. To do repetitively with all the elements in hash we use “foreach”.

  foreach $nuc (sort keys %oligo){ 
      $total = $oligo{$nuc} + $oligo{complement($nuc)}; 
      $percent = $oligo{$nuc} / $total * 100; 
      printf "$nuc, $total, $percent¥n"; 
  } 

The script is complete with the above. Bellow is the entire script.

  use G; 
  $gb = new G("ecoli.gbk", "no msg"); 
  ($seq1, $seq2) = leading_strand($gb); 
  $leading = $seq1 . $seq2; 
  for ($i = 0; $i <= length($leading) - 8; $i ++){ 
      $oligo{substr($leading, $i, 8)} ++; 
  } 
  foreach $nuc (sort keys %oligo){ 
      $total = $oligo{$nuc} + $oligo{complement($nuc)}; 
      $percent = $oligo{$nuc} / $total * 100; 
      printf "$nuc, $total, $percent¥n"; 
  } 

Execute the script and output the result into a file named “out.csv”. Data separated by commas are useful when saved in CSV format for later usage with Excel or other table calculation programs. unix % perl test1.pl >out.csv

If the program successfully ends, you can see the result as the bellow when you use the “less” command for out.csv. Let’s search what sequence was the most biased, what is the rank ofχsequence among all the other oligo sequences, what rank does the χsequence have with respect to the number of sequences, downloading it to a table calculating software.

aaaaaaaa, 242, 38.4297520661157 
aaaaaaac, 351, 38.4615384615385 
aaaaaaag, 318, 47.1698113207547
aaaaaaat, 502, 48.804780876494 
aaaaaaca, 418, 48.5645933014354 
aaaaaacc, 375, 39.4666666666667 
aaaaaacg, 387, 53.2299741602067 
aaaaaact, 286, 49.6503496503497 
aaaaaaga, 429, 51.2820512820513 
aaaaaagc, 563, 49.911190053286 
aaaaaagg, 336, 52.3809523809524 
aaaaaagt, 285, 51.9298245614035 
aaaaaata, 492, 49.7967479674797 
... 

Gene replication direction bias

Until now we have seen the relation between base and oligo sequence and replication direction, but what would it be happening to the direction of the genes?

To process repeatedly with every oligo sequence, move each base we used the “for” sentence and to see inside the hash for each oligo sequence we used the “foreach” sentence, but to see every gene in the genome, the bellow sentence structure is used in G-language GAE.

  foreach $cds ($gb->cds()){ 
  } 

In “$gb→cds() $cds”, a pointer to point every gene in a genome is stored, and by using “$cds” you can gain many information about the gene. For example the main ones are as the bellow.

Notation Explanation
$gb→{$cds}→{start} Gene starting position
$gb→{$cds}→{end} Gene ending position
$gb→{$cds}→{direction} The strand where the gene is located (direct or complement)
$gb→{$cds}→{gene} Gene name
$gb→{$cds}→{locus_tag} Gene number
$gb→{$cds}→{product} Gene product
$gb→{$cds}→{db_xref} Link to outside data base
$gb→{$cds}→{translation} Translated amino acid sequence
$gb→get_geneseq($cds) Gene base sequence
$gb→startcodon($cds) Start codon
$gb→stopcodon($cds) Stop codon
$gb→before_startcodon($cds, 100) 100 bp upstream base sequence of the gene
$gb→after_stopcodon($cds, 100) 100 bp downstream base sequence of the gene
$gb→get_intron($cds) Intron sequence
$gb→get_exon($cds) Exon sequence

Let’s calculate the ratio of genes facing the replication direction, by finding in which strand each gene is in and increasing the count each time it is in the leading strand. The total number of genes can be obtained with “scalar($gb→cds())”. To find out in which strand each are we use formula “query_strand($gb, <position>, -direction⇒<direction>)”. Here “<position>” specifies the position of it in the genome represented by number, and this time we will use the starting position of the gene. Also in “<direction>” it is used the direction of the gene.

Taking into account the above, the script should be as the below.

  use G; 
  $gb = new G("ecoli.gbk", "no msg"); 
  foreach $cds ($gb->cds()){ 
      my $strand = query_strand($gb, $gb->{$cds}->{start},  
                                           -direction=>$gb->{$cds}->{direction}); 
      if($strand eq 'leading'){ 
          $i ++; 
      } 
  } 
  print $i/scalar($gb->cds());

Let’s use this script and calculate the ratio of the genes facing the replication direction.

Gene expression amount and replication direction bias

What is Codon Adaptation Index?

Until now we have seen various relation between the genome’s constituting elements and the replication direction, however, to end let’s see the expression level of a gene which is a more functional information, and analyze the relation between gene expression level and replication direction. To see the expression level of a gene it is preferable to use exhaustive high throughput experiment techniques such as micro-array, however, here we will use Codon Adaptation Index (CAI), a representative indicator to estimate gene expression level from genome information. CAI is a way to estimate the relative expression level from correlation between codon bias of the target gene and codon bias based in synonymous codon usage in high expressional genes such as ribosome related genes. Certainly quantitative and precise gene expression level is impossible to predict, however, approximate expression level criterion can be obtained and at least in bacteria with strong codon bias direction, that criterion is generally considered to be useful for criterion purpose.

Reference:

  • Sharp, P.M. et al., 1987. The codon Adaptation Index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15, 1281-1295

Analysis

The tendency of the whole genome nevertheless does have an influence to some genes and short base sequences. That is because of base bias by GC skew or/and the difference in number of genes existing between the leading strand and the lagging strand. Colon Bacillus does cell division every 30 minutes so the replication procedure in bacteria has a very deep meaning to the bacteria genome. Let’s hypothesize that there is nevertheless some kind of influences to the gene expression level by the leading strand and the lagging strand.

When using CAI in G-language GAE use function cai($gb). For every gene CAI returns a number between 0 and 1(0 for those not correlated to strongly expressional genes, and 1 for correlation). Other than that with cai($gb) one can get a list of CAI value of every gene, new gene CAI value is added to “$gb→{$cds}→{cai}”. This time we will use this value and since the list is unnecessary, with “–output” option we adjust so that the list will not be outputted. cai($gb, -output⇒"/dev/null");

Now, as we saw in GC skew, to see the tendency in the genome region, it is useful to divide the graph into short sequences, give each a value and plot it to a graph based by those values. Also to emphasize the tendencies, accumulation plotting is appropriate. Considering the above let’s plot the CAI value. However two things must be paid attention.

First CAI value will take values between 0 and 1, however, to do accumulation plotting the values must be positive and negative, so arrangements are necessary. Because the CAI value of all the genes of Colon Bacillus is 0.485, by subtracting this value from the CAI value the average will be 0 the maximum 0.5 and the minimum will be –0.5.

Next, until now it was only GC amount so only one strand of DNA was enough to be considered, however, since genes exists in both strands, the plot must be divided to leading strand and lagging strand. To discriminate which one to use, function “query_strand($gb)” already used for gene direction analysis is useful.

For each gene we calculate the CAI value arranged to have a 0 average and divide it into leading strand and lagging strand. Because later on these will be graphed we store each using arrays. To add elements into arrays we use Perl standard function “push()” . Not only CAI values but let’s also store the position of the gene in the genome. Programming the above will result as the below.

  foreach $cds ($gb->cds()){ 
      my $strand = query_strand($gb, $gb->{$cds}->{start}, 
                                            -direction=>$gb->{$cds}->{direction}); 
      if ($strand eq 'leading'){ 
          $cai_leading += $gb->{$cds}->{cai} - 0.485; 
      }else{      
          $cai_lagging += $gb->{$cds}->{cai} - 0.485; 
      } 
      push(@data1, $cai_leading); 
      push(@data2, $cai_lagging); 
      push(@pos, $gb->{$cds}->{start}); 
  } 

Finally, plot the graph from the obtained data. In G-language GAE there is a function named **grapher() that makes graphs easily, so let’s input data in here. Just type each data as reference, each data x coordinate list (the genome position stored above) and each data y coordinate list (the CAI value of the leading strand stored above).

  grapher(¥@pos, ¥@data1, ¥@data2); 

By only this procedure a graph will be made inside a folder in the work directory named grph.png. To make the graph more comprehensive, add title, data label, x axis and y-axis label.

  grapher(¥@pos, ¥@data1, ¥@data2, -x1=>"leading", -x2=>"lagging", 
                                   -title=>"caiskew", -x=>"position", -y=>"cai");

Resuming the above a program as the below is made.

  use G; 
  $gb = new G("ecoli.gbk", "no msg"); 
  cai($gb, -output=>"/dev/null"); 
 
  foreach $cds ($gb->cds()){ 
      my $strand = query_strand($gb, $gb->{$cds}->{start}, 
                                            -direction=>$gb->{$cds}->{direction}); 
      if ($strand eq 'leading'){ 
          $cai_leading += $gb->{$cds}->{cai} - 0.485; 
      }else{      
          $cai_lagging += $gb->{$cds}->{cai} - 0.485; 
      } 
      push(@data1, $cai_leading); 
      push(@data2, $cai_lagging); 
      push(@pos, $gb->{$cds}->{start}); 
  } 
  grapher(¥@pos, ¥@data1, ¥@data2, -x1=>"leading", -x2=>"lagging", 
                                   -title=>"caiskew", -x=>"position", -y=>"cai"); 

To display the graph, type in command “display”.

unix % display graph/graph.png 

Then a graph as below will be shown.

As we have already seen, the replication starting point in Colon Bacillus is near 3,923,500 bp and the ending point is near 1,588,800 bp. First if you see the lagging strand, almost the entire area is negative. This means that genes with CAI values lower then average are numerous in the lagging strand, it is to say gene expression is relatively low in the lagging strand.

Mean while the graph of the leading strand changes briskly. And if we focus the relation between the replication starting point and ending point, we can verify that near the replication ending point the graph continues to decline, and near the replication starting point it largely inclines. Thus it seems that in the leading strand high expression genes are concentrated near the replication starting point and expression level lower as it heads to the ending point.

Now, in this graph near 3.5Mbp in the leading strand the value increasingly inclines. CAI value is defined as ribosome related genes as high expression genes, and many times ribosome related genes gather up and creates gene cluster, and it seems that it is affecting the data. So let’s redraw the graph taking out the ribosome related genes. In the first part of the “foreach” sentence write:

  next if ($gb->{$cds}->{product} =~ /ribosom/); 

And it will be enough to take down all the ribosome related genes. How did the graph change?

Reference:

  • Daubin, V. et al., 2003. G+C3 Structuring Along the Genome: A Common Feature in Prokaryotes. Molecular Biology and Evolution. 20, 471-483

Concluding Remarks

The example analyses shown, although the later were complex analysis, are scripts with less then 20 lines. As these, with G-language GAE complex analysis can be done efficiently and easily.

So now, it only remains for you to explore the truth of life using this G-language Genome Analysis Environment.

introductiontog-languageenglish.txt · Last modified: 2014/01/18 07:44 (external edit)