problem_13

====== Practice 1-3 - Basic DNA sequence analysis (3) ======

**A consecutive pair of nucleotides is called a dinucleotide, and there are 16 types of dinucleotides, one for each ordered pair of the four bases. In Practice 1-3, modify the previous program to analyze dinucleotide frequency. Calculate the dinucleotide composition of the //Mycoplasma// genome and compare it with the result of Practice 1-2, using the result of Practice 1-2 to derive the expected values.
The key points of this practice are (1) the calculation of the expected value and (2) how to compare the observed and expected values. The O/E (observed/expected ratio) value is defined as the observed value divided by the expected value.
By using O/E values, one can identify a bias of the observed value relative to the expected value, for example: "AA is only half as abundant as expected, but CT is twice as abundant as expected."**

===== 1 Counting the dinucleotide frequency =====

==== 1.1 Setting variables ====

  my %dinuc;   # Numbers of dinucleotides
  my %diexp;   # Expected values of dinucleotides
  my %diobs;   # Observed frequency of dinucleotides
  my %oe;      # O/E value for dinucleotides

The variables above are needed in this analysis. A “%” in front of a variable name indicates a "hash" in Perl. A hash is similar to an array, which has “@” in front of its name, but differs in that an array is indexed only by integers while a hash also accepts character strings as keys.

  @array = (1, 2, 3);

The above is an example of an array. Perl counts array positions from 0, so to get the value 2 one accesses $array[1].

  %hash = ('string'=>1, 'message'=>2, 'line'=>3);

With a hash, one can access the value 2 with the key “message” like this: $hash{'message'}.
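The difference between array and hash access described above can be tried with a minimal sketch like the following. The variable names $second and $value are illustrative only and do not appear in the original page.

```perl
use strict;
use warnings;

# An array is indexed by integers, starting at 0.
my @array = (1, 2, 3);
my $second = $array[1];          # second element of the array

# A hash is indexed by string keys. Note the parentheses, not braces:
# braces would create a hash reference instead of a hash.
my %hash = ('string' => 1, 'message' => 2, 'line' => 3);
my $value = $hash{'message'};    # value stored under the key 'message'

print "array: $second, hash: $value\n";
```

Both lookups return the value 2, one by position and the other by key.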
==== 1.2 Improving the counting ====

Since a dinucleotide is composed of two characters, one cannot use tr///, which only counts single characters. s/// is not suitable either, because its matches do not overlap (in “aaa” it would find “aa” only once). To solve this problem, let's use the "for" statement and slide along the sequence one character at a time, taking two characters as a frame. For example, if the sequence is “atgcggctg”, the first frame will be “at” and the second will be “tg”. In this way, the sequence is checked from the first character to the next-to-last character. Note, however, that Perl position numbers start from 0 and not 1, so the last start position for counting dinucleotides is the length of the sequence minus two.

There is a function in Perl called substr() which cuts out a partial sequence from the given sequence. To make a frame composed of two characters, use substr() to select two characters.

  $parts = substr($seq, 0, 2);

Here, use a hash and increment its count by supplying the dinucleotide as a key.
  $diobs{$parts} ++;

$parts in the above code is a frame composed of two characters, a dinucleotide, used as a key for the hash.
Make sure that “$” is in front of the variable name, and not “%”, because the value being counted is a scalar element of the hash, and not the hash itself.

===== 2 Calculating the O/E value =====

==== 2.1 Calculating the expected value (Basic) ====

What is an expected value? Literally, it is the value that is expected for an event. For dinucleotides, the theoretical expectation for each of them is 1/16. But as you have seen in Practices 1-1 and 1-2, every species has its own A, T, G and C composition. Some nucleotides are therefore less frequent than others, and the same can be said of the dinucleotides.

Therefore, one can calculate the expected value of a dinucleotide by multiplying the frequencies of its first and second nucleotides. For example, if A makes up 30% and T makes up 20% of the genome, the expected frequency of AT is 0.3 × 0.2 = 0.06.
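The sliding two-character frame and the basic expected-value calculation can be sketched together as below. This is only one possible arrangement, not the original program. The hash %mono (single-nucleotide counts) and the variables $len and $frames are names assumed for this sketch, and the expected value is expressed here as a count (frequency product times number of frames) so that it is directly comparable with the counts in %diobs, which is also an assumption. The toy sequence is the one used in the text; a real run would read the //Mycoplasma// genome instead.

```perl
use strict;
use warnings;

# Toy sequence from the text; a real analysis would load a genome sequence.
my $seq = 'atgcggctg';

my %mono;    # counts of single nucleotides (hypothetical helper hash)
my %diobs;   # observed counts of dinucleotides
my %diexp;   # expected counts of dinucleotides

# Count single nucleotides.
$mono{$_}++ for split //, $seq;

# Slide a two-character frame; the last start position is length - 2.
for (my $i = 0; $i <= length($seq) - 2; $i++) {
    my $parts = substr($seq, $i, 2);
    $diobs{$parts}++;
}

# Expected count = freq(first base) * freq(second base) * number of frames.
my $len    = length($seq);
my $frames = $len - 1;
for my $first (qw(a c g t)) {
    for my $second (qw(a c g t)) {
        my $p1 = ($mono{$first}  || 0) / $len;
        my $p2 = ($mono{$second} || 0) / $len;
        $diexp{$first . $second} = $p1 * $p2 * $frames;
    }
}

printf "tg observed: %d, expected: %.3f\n", $diobs{'tg'}, $diexp{'tg'};
```

For this toy sequence "tg" occurs twice, while its expected count is (2/9) × (4/9) × 8, about 0.790.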
==== 2.2 Calculating the expected value (Advanced) ====

Perl has a "foreach" statement to access the values in an array. In a foreach loop over an array, each element is stored in turn in a loop variable such as $content. The values of a hash can be accessed in a similar way by looping over its keys.

“sort keys” applied to a hash sorts the keys of the hash in alphabetical order. Now use a foreach statement to access the values in the hash %diobs and calculate the O/E values.

==== 2.3 Calculating the O/E value ====

The O/E value is calculated by dividing the observed value by the expected value. Calculate the O/E value from the previously acquired observed and expected values, and store it in a hash.

===== 3 Output =====

All the values are now stored in the hash %oe with dinucleotides as its keys (aa, at, …). For each line of output, let's print out “Name of dinucleotide”, “O/E value”, “Observed value” and “Expected value”.

O/E: Observation/Expectation
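The O/E calculation and the output described above can be sketched as follows. The numbers in %diobs and %diexp are toy values, not real genome data. They are chosen only to mirror the "half as abundant" and "twice as abundant" example from the introduction. The tab-separated column layout is likewise an assumption of this sketch.

```perl
use strict;
use warnings;

# Toy numbers (not real genome data), mirroring the example in the text.
my %diobs = ('aa' => 50,  'ct' => 200, 'gg' => 100);
my %diexp = ('aa' => 100, 'ct' => 100, 'gg' => 100);
my %oe;

# O/E = observed / expected; skip keys whose expected value is missing or zero.
foreach my $key (sort keys %diobs) {
    next unless $diexp{$key};
    $oe{$key} = $diobs{$key} / $diexp{$key};
}

# One line per dinucleotide: name, O/E value, observed value, expected value.
foreach my $key (sort keys %oe) {
    printf "%s\t%.2f\t%d\t%.2f\n", $key, $oe{$key}, $diobs{$key}, $diexp{$key};
}
```

With these toy values, "aa" gets an O/E of 0.50 and "ct" an O/E of 2.00, and the lines are printed in alphabetical order of the keys because of "sort keys".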
problem_13.txt · Last modified: 2014/01/18 07:44 (external edit)
