Problem 1-2 - Basic DNA sequence analysis (2)

Refine the previous program so that it loads the whole genome sequence of M. genitalium, and compare the difference (bias) in nucleotide usage between the two sequences. Problems 1-1 and 1-2 are similar; the major difference is the form of the DNA sequence data. In Problem 1-1 the data were already prepared by cutting and pasting the target sequence, but in Problem 1-2 users need to extract the DNA sequence from a genome data file that also contains comments and other information. Once the DNA sequence is available, the counting process is the same in both cases. Again, read the passage, think logically about what it means and which procedures are needed, and then make a basic design of the program.

1 Programming design

1.1 Improve data loading

The script for Problem 1-1 looks like this.


open(FILE, ?????);

while (<FILE>) {
  ?????  # Remove linefeed code
  ?????  # Join a sequence into variable $seq
}

$A = ?????;
$T = ?????;
$G = ?????;
$C = ?????;
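
One possible way to fill in the blanks is sketched below. It is only an illustration, not the intended answer: the helper names read_plain_sequence and count_bases are hypothetical, the tr/// counting idiom is an assumption about how the counting might be done, and an in-memory string stands in for the real data file so that the sketch runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: read a plain sequence from a filehandle, line by line.
sub read_plain_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        chomp $line;       # Remove linefeed code
        $seq .= $line;     # Join a sequence into variable $seq
    }
    return $seq;
}

# Hypothetical helper: count a, t, g and c in a lowercase sequence.
# tr/x// with an empty replacement only counts; it does not change $seq.
sub count_bases {
    my ($seq) = @_;
    my %count;
    $count{a} = ($seq =~ tr/a//);
    $count{t} = ($seq =~ tr/t//);
    $count{g} = ($seq =~ tr/g//);
    $count{c} = ($seq =~ tr/c//);
    return %count;
}

# Demo data (an assumption): two short lines held in memory instead of a file.
my $data = "taagttatta\ntttagttaat\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory data: $!";
my $seq = read_plain_sequence($fh);
close($fh);

my %count = count_bases($seq);
print "$_: $count{$_}\n" for qw(a t g c);
```

To use a real file, the in-memory open would be replaced by an open on the file name used for Problem 1-1.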

Look at the genome data to see what needs to be done to extract the DNA sequence.

LOCUS       L43967     580074 bp    DNA   circular  BCT       05-NOV-1998
DEFINITION  Mycoplasma genitalium G37 complete genome.
        1 taagttatta tttagttaat acttttaaca atattattaa ggtatttaaa aaatactatt
   580021 gaaatgatca tatatttaaa tgattataat atttctttaa tactaaaaaa atac

First, the data begin with comment lines such as “LOCUS”, “DEFINITION” and “ACCESSION”. The target DNA sequence is in the last part of the file, and the line just above it is the comment “ORIGIN”, so skip lines until that comment appears. Next, look at the very last line of the sequence: two slashes (“//”) indicate the end of the DNA sequence.

Based on these observations, data loading in the program can be improved as follows.



open(FILE, ?????);

while (<FILE>) {
     if (?????) {
          while (<FILE>) {
               last if (/\/\//);  # Leave the loop when the line contains "//"
               ?????              # Remove linefeed code
               ?????              # Join a sequence into variable $seq
          }
     }
}

$A = ?????;
$T = ?????;
$G = ?????;
$C = ?????;
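
The loading outline above could be filled in roughly as follows. This is a sketch under assumptions, not the intended answer: read_genbank_sequence is a hypothetical helper name, the demo record is a shortened in-memory excerpt of the genome data with the “ORIGIN” and “//” lines that the text describes, and the step that strips position numbers and spaces is an addition made so that $seq holds nucleotide letters only.

```perl
use strict;
use warnings;

# Hypothetical helper: return the sequence found between "ORIGIN" and "//".
sub read_genbank_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        next unless $line =~ /^ORIGIN/;   # Skip comments until ORIGIN appears
        while ($line = <$fh>) {
            last if $line =~ m{^//};      # "//" marks the end of the sequence
            chomp $line;                  # Remove linefeed code
            $line =~ s/[^a-zA-Z]//g;      # Assumption: drop position numbers and spaces
            $seq .= $line;                # Join a sequence into variable $seq
        }
        last;                             # Sequence has been read; stop scanning
    }
    return $seq;
}

# Shortened demo record held in memory (an assumption; the real file is much longer).
my $record = <<'END';
LOCUS       L43967     580074 bp    DNA   circular  BCT       05-NOV-1998
DEFINITION  Mycoplasma genitalium G37 complete genome.
ORIGIN
        1 taagttatta tttagttaat acttttaaca atattattaa ggtatttaaa aaatactatt
   580021 gaaatgatca tatatttaaa tgattataat atttctttaa tactaaaaaa atac
//
END

open(my $fh, '<', \$record) or die "Cannot open in-memory record: $!";
my $seq = read_genbank_sequence($fh);
close($fh);

print "Loaded ", length($seq), " nucleotides\n";
```

Stripping the non-letter characters does not change the base counts, but it matters later when length($seq) is used as the total number of nucleotides.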

2 Comparing both results

2.1 Decimal calculation

In Problem 1-2, the result must be compared with the result of Problem 1-1. The two sequences differ in length, so output the results as percentages this time; calculating to the second decimal place is precise enough.

To calculate the percentage of adenine, for instance, the script should look like this.
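
A minimal sketch of that calculation, assuming $seq holds only nucleotide letters and $A holds the adenine count. The helper name percent_of, the short demo sequence and the tr/// counting idiom are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: percentage of $count relative to the sequence length.
sub percent_of {
    my ($count, $seq) = @_;
    my $total = length($seq);          # Number of characters in $seq
    return 0 if $total == 0;           # Avoid dividing by zero on an empty sequence
    return 100 * $count / $total;
}

my $seq     = "taagttatta";            # Demo sequence (an assumption)
my $A       = ($seq =~ tr/a//);        # Count adenine
my $percent = percent_of($A, $seq);

print "A: $percent\n";
```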


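The length() function returns the length of a variable. If the variable is a string, it returns the number of characters, so length() applied to $seq returns the number of nucleotides in M. genitalium, provided that $seq contains only nucleotide letters (that is, the position numbers and spaces in the genome data were removed during loading).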

2.2 Output of decimals

Output in Perl can be formatted with the printf() function, as in C. So let’s print the value to the second decimal place.

printf("A:   %.2f\n", $percent);

The “.2” in front of the f means to print the number rounded to the second decimal place. In the same way, “.4” means to print the number to the fourth decimal place.
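
The effect of the precision can be seen in a short sketch. The value 100 * 2 / 3 is an arbitrary demo number and format_percent is a hypothetical helper; sprintf() is used because it returns the formatted string instead of printing it.

```perl
use strict;
use warnings;

# Hypothetical helper: format a value with a chosen number of decimal places.
sub format_percent {
    my ($value, $places) = @_;
    return sprintf("%.${places}f", $value);
}

my $percent = 100 * 2 / 3;             # 66.666..., an arbitrary demo value

printf("A:   %.2f\n", $percent);       # prints "A:   66.67"
printf("A:   %.4f\n", $percent);       # prints "A:   66.6667"

print "A:   ", format_percent($percent, 2), "\n";
```

Note that the value is rounded, not truncated, at the requested decimal place.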

3 Advanced: Combine two scripts into a single process

Users have now made two programs, one for Problem 1-1 and the other for Problem 1-2. These two programs are very similar, so as an advanced problem let’s combine the two into one. One idea for the implementation is to ask the user which sequence to analyze and, if nothing is typed, to analyze a default sequence.
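
One way this idea might be shaped is sketched below. Everything specific in it is an assumption: the default file name mgenitalium.gbk, the helper names, and the rule that input containing an “ORIGIN” line is treated as genome data while anything else is treated as a plain sequence. An in-memory handle stands in for the keyboard so that the sketch runs without user input; passing \*STDIN instead would read what the user actually types.

```perl
use strict;
use warnings;

# Hypothetical default file name; the text does not name one.
my $DEFAULT_FILE = 'mgenitalium.gbk';

# Read one answer line from $in; fall back to the default when it is empty.
sub ask_for_file {
    my ($in) = @_;
    my $answer = <$in>;
    $answer = '' unless defined $answer;   # End of input counts as "nothing typed"
    $answer =~ s/^\s+|\s+$//g;             # Trim whitespace and the newline
    return $answer eq '' ? $DEFAULT_FILE : $answer;
}

# Accept either a plain sequence or genome data with ORIGIN ... // markers.
sub extract_sequence {
    my ($text) = @_;
    if ($text =~ m{^ORIGIN[^\n]*\n(.*?)^//}ms) {
        $text = $1;                        # Keep only the part between ORIGIN and //
    }
    $text =~ s/[^a-zA-Z]//g;               # Drop numbers, spaces and newlines
    return lc $text;
}

# Percentage of each base, returned as a hash keyed by a, t, g, c.
sub base_percentages {
    my ($seq) = @_;
    my $total = length($seq);
    my %percent;
    for my $base (qw(a t g c)) {
        my $count = () = $seq =~ /$base/g; # Count matches of this base
        $percent{$base} = $total ? 100 * $count / $total : 0;
    }
    return %percent;
}

# Demo: the user "types" nothing, so the default file name is chosen.
my $typed = "\n";
open(my $in, '<', \$typed) or die "Cannot open in-memory input: $!";
my $file = ask_for_file($in);
close($in);
print "Would analyze: $file\n";

# Demo: analyze a tiny in-memory record instead of opening $file.
my %percent = base_percentages(extract_sequence("ORIGIN\n        1 taagttatta\n//\n"));
printf("%s:   %.2f\n", uc($_), $percent{$_}) for qw(a t g c);
```

In a full program, the contents of the chosen file would be read and passed to extract_sequence() in place of the demo record.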

problem_12.1290036414.txt.gz · Last modified: 2014/01/18 07:44 (external edit)