
Problem 2-2 & 2-3 - CDS search and translation (2)

[Problem 2-2]

Refine the script from the previous problems so that it translates the target coding region in all three possible reading frames.

[Problem 2-3]

First, write a program that generates the complementary strand of a given DNA sequence. Then search for coding regions in the complementary strand and translate it in all three reading frames. Combine these with the three sequences obtained from the template strand in Problem 2-2, and choose the most suitable coding region among the six possible reading frames.

Problem 2-2

The previous program translated codons into an amino acid chain starting from the beginning of the DNA sequence. But consider carefully: what would happen if translation began not at the first position but at the second, or at the third?

~~~~~ ~~~~~ ~~~~~
His    Ala   Asp
  ~~~~~ ~~~~~ ~~~~~
   Met   Leu   Thr
    ~~~~~~ ~~~~~ ~~~~~~
      Cys   Stop  

In the figure above, the first and third reading frames yield amino acids just as the second one does. However, the second reading frame clearly differs in that its sequence starts with methionine. Moreover, the third reading frame has a stop codon in its second position, which means translation would terminate there if it were a coding region. Therefore, users need to examine all three possible reading frames when searching for a coding region.

There are two ways to implement this.

  1. Create two additional sequences from the given DNA sequence, one without its first base and one without its first two bases.
  2. Translate all three reading frames within the program, using only the given DNA sequence.

Users can choose either approach, but we will demonstrate the second, since the resulting program can be reused for any DNA sequence.

Refine previous program

An easy way to solve this problem is to modify the subroutine written earlier that translates codons into amino acids. The key point is to shift the starting position in the target DNA sequence by one or two bases, so that users can search for suitable coding regions among the three possible reading frames.

More precisely, add another for statement that loops from 0 to 2, one iteration per reading frame. It is convenient to store the translation of each reading frame in an array.
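The loop over the three offsets can be sketched as follows. This is only a minimal illustration, not the subroutine from the previous problems: the names translate_frame and three_frames, the compact codon table built from the standard genetic code, and the sample sequence are all assumptions made here. Incomplete codons at the end of a frame are simply skipped.

```perl
use strict;
use warnings;

# Standard genetic code, with bases ordered t, c, a, g at each codon position.
# One-letter amino acid codes; '*' marks a stop codon.
my @bases = qw(t c a g);
my $amino = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($amino, $i++, 1);
        }
    }
}

# Translate $seq starting at offset $frame (0, 1 or 2).
# Hypothetical helper; unknown codons are reported as 'X'.
sub translate_frame {
    my ($seq, $frame) = @_;
    $seq = lc $seq;
    my $protein = '';
    for (my $pos = $frame; $pos + 3 <= length $seq; $pos += 3) {
        my $codon = substr($seq, $pos, 3);
        $protein .= $codon_table{$codon} // 'X';
    }
    return $protein;
}

# Return the translations of all three reading frames as a list.
sub three_frames {
    my ($seq) = @_;
    return map { translate_frame($seq, $_) } 0 .. 2;
}

# Illustrative sequence chosen for this sketch.
my @frames = three_frames('catgctgac');
print "frame $_: $frames[$_]\n" for 0 .. 2;
```

Keeping the offset as an argument means the same subroutine can later be applied to the complementary strand without modification.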

Problem 2-3

In Problem 2-2, users dealt with three possible reading frames, but there is one more source of reading frames to consider: the complementary strand.

The complementary strand is the strand that pairs with the template strand.

  5'   G T A C G A C T G   3'
  3'   C A T G C T G A C   5'

Each strand of the DNA double helix has a direction, running from the 5’ end to the 3’ end. Because of this constraint, a gene is always read in the direction from the 5’ end to the 3’ end.

The double helix consists of two strands running in opposite directions, and the strands are complementary: adenine pairs with thymine, and guanine with cytosine.

To obtain the complementary strand from the template DNA sequence, users can simply apply this complementarity.

  1. Reverse the template DNA sequence
  2. Substitute each nucleotide with its complement

Putting everything together, there are six possible reading frames: three from the template strand and another three from the complementary strand.

Define new subroutine

Now, let’s write a new Perl subroutine that generates the complementary strand from the template strand. A good design is to pass $seq as an argument and receive the complementary sequence as the return value. Generation takes two steps: reversing the given sequence, and substituting each nucleotide with its complement (A with T, and G with C, in both directions).

The main flow of the program is the same as in the previous problems.

sub complemental {
       my $nuc = shift;          # template sequence passed as the argument
       my $complement = '';

       ??????; # Reverse the sequence
       ??????; # Substitute nucleotides

       return $complement;
}

Perl has a handy built-in function, reverse(), which reverses a string when called in scalar context. For example, $a = reverse("foobar") assigns "raboof" to $a. So use reverse() to reverse the template sequence.

For the substitution, use the tr/// operator, which replaces characters one for one.

$nuc =~ tr/atgc/tacg/;

This sample code replaces “a” with “t”, “t” with “a”, “g” with “c”, and “c” with “g” in $nuc. Since Perl excels at text processing, a deeper understanding of the language helps users carry out more advanced analyses in genomic studies.
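One possible way to combine reverse() and tr/// into the subroutine is sketched below. It is an assumption about how the two blanks might be filled, not the reference answer; handling upper-case letters as well is an extra added here.

```perl
use strict;
use warnings;

# Return the complementary strand of $nuc, written 5' to 3'.
sub complemental {
    my $nuc = shift;

    # Assigning to a scalar puts reverse() in scalar context,
    # so it reverses the characters of the string.
    my $complement = reverse $nuc;

    # Swap a<->t and g<->c; upper-case handling is an addition in this sketch.
    $complement =~ tr/atgcATGC/tacgTACG/;

    return $complement;
}

print complemental('gtacgactg'), "\n";   # prints cagtcgtac
```

The input here is the upper strand of the figure above, and the output is the lower strand read from its 5’ end.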

Search coding regions

All possible reading frames are now ready. Next, how can users estimate the most suitable coding region among the six?

Of course, a coding region ultimately has to be confirmed experimentally, but here users are going to estimate which parts are coding regions from a bioinformatics point of view.

Users will follow the basic rules below to search for a suitable reading frame. These rules are not perfect, but they fit most cases.

  1. A coding region begins with “atg”, which codes for methionine (M or Met).
  2. In bacterial genomes, a coding region sometimes begins with “gtg”, which normally codes for valine (V or Val); when used as a start codon, however, it is translated as methionine.
  3. The stop codons are “taa”, “tga”, and “tag”, so a coding region should end with one of these sequences.
  4. If more than one candidate satisfies the conditions above, select the longest sequence.
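The rules above can be sketched as a scan over all six reading frames. This is a minimal illustration under several assumptions: the names find_orfs and longest_orf are hypothetical, each candidate runs from the first start codon found to the next in-frame stop codon, candidates without a stop codon are discarded, and the sample sequence is made up for the example.

```perl
use strict;
use warnings;

# Complementary strand, written 5' to 3' (lower-case input assumed).
sub complemental {
    my $nuc = shift;
    my $complement = reverse $nuc;
    $complement =~ tr/atgc/tacg/;
    return $complement;
}

# Collect candidate coding regions in one reading frame:
# from a start codon (atg or gtg) through the next in-frame stop codon.
sub find_orfs {
    my ($seq, $frame) = @_;
    my @orfs;
    my $start;
    for (my $pos = $frame; $pos + 3 <= length $seq; $pos += 3) {
        my $codon = substr($seq, $pos, 3);
        if (!defined $start && $codon =~ /^(?:atg|gtg)$/) {
            $start = $pos;
        }
        elsif (defined $start && $codon =~ /^(?:taa|tga|tag)$/) {
            push @orfs, substr($seq, $start, $pos + 3 - $start);
            undef $start;
        }
    }
    return @orfs;
}

# Search both strands and all three offsets; keep the longest candidate.
# Returns (sequence, strand name, frame offset); ('', '', -1) if none found.
sub longest_orf {
    my ($seq) = @_;
    $seq = lc $seq;
    my %strands = (template => $seq, complement => complemental($seq));
    my ($best, $best_strand, $best_frame) = ('', '', -1);
    for my $name (sort keys %strands) {
        for my $frame (0 .. 2) {
            for my $orf (find_orfs($strands{$name}, $frame)) {
                ($best, $best_strand, $best_frame) = ($orf, $name, $frame)
                    if length $orf > length $best;
            }
        }
    }
    return ($best, $best_strand, $best_frame);
}

# Made-up example sequence for this sketch.
my ($orf, $strand, $frame) = longest_orf('ccatgaaatttggctaacc');
print "$strand strand, frame $frame: $orf\n";
```

Returning the strand name and offset along with the sequence makes it easy to report where the chosen coding region came from.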

Advanced programming

Computational analysis is a powerful approach in biology in its own right. In actual research, however, writing programs comes second to the primary work.

Therefore, programs should support and aid the researcher’s work. To achieve this, sharing programs and writing readable, reusable code become important.

Try to keep programs general-purpose, and move redundant parts of a script into subroutines. Once users have become accustomed to Perl, try reading some technical books on the language. Effective Perl is one of the best books for Perl programmers taking the next step. It should help users accomplish tasks at a much lower development cost.

Thank you for joining us for this series of exercises. We wish you continued success.

problem_2b.1290429545.txt.gz · Last modified: 2014/01/18 07:44 (external edit)