G::Seq Primitive
SummaryIncluded librariesPackage variablesDescriptionGeneral documentationMethods
Summary
  G::Seq::Primitivew - Basic sequence analysis methods
Package variables
Privates (from "my" definitions)
@CodonTables = ('',\% CodonTable2,\% CodonTable3,\% CodonTable4,\% CodonTable5,\% CodonTable6, '', '',\% CodonTable9,\% CodonTable10,\% CodonTable11,\% CodonTable12,\% CodonTable13,\% CodonTable14, '',\% CodonTable16, '', '', '', '',\% CodonTable21,\% CodonTable22,\% CodonTable23,\% CodonTable24,\% CodonTable25)
Included modules
G::Messenger
SelfLoader
SubOpt
Inherit
Exporter
Synopsis
No synopsis!
Description
           This class is a part of G-language Genome Analysis Environment,
           collecting basic sequence analysis methods.
Methods
DoubleHelixDescriptionCode
complementDescriptionCode
shuffleseqDescriptionCode
splitprintseqDescriptionCode
to_fastaDescriptionCode
to_fastqDescriptionCode
translateDescriptionCode
Methods description
DoubleHelixcode    nextTop
  Name: DoubleHelix   -   we all love the DNA molecule!

  Description:
    This method prints the given sequence as an ASCII text-based graphic
    depicting a DNA molecule. Enjoy :-)
    
  Usage: 
    DoubleHelix($seq);

  Options:
    -speed      controls the time to display one base-pair.
                default is 0.05 (second)

  REST:
    http://rest.g-language.org/NR_029721/DoubleHelix/speed=0
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20060511-01 initial posting
complementcodeprevnextTop
   Name: complement   -   get the complementary nucleotide sequence

   Description:
         Given a sequence, returns its complement.

         eg. complement('atgc');  # returns 'gcat'

   REST: 
      http://rest.g-language.org/complement/atgc 
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20010727-01 initial posting
shuffleseqcodeprevnextTop
  Name: shuffleseq   -   create randomized sequence with conserved k-mer composition

  Description:
    Shuffle and randomize the given sequence, conserving the nucleotide/peptide
    k-mer content of the original sequence. 

    For k=1, i.e. shuffling sequencing preserving single nucleotide composition, 
    Fisher-Yates Algorithm is employed. 

    For k>1, shuffling preserves all k-mers (all k where k=1~k). For example,
    k=3 preserves all triplet, doublet, and single nucleotide composition. 
    Algorithm for k-mer preserved shuffling is non-trivial, which is solved
    by graph theoretical approach with Eulerian random walks in the graph of 
    k-1-mers. See references 3-5 for details of this algorithm. 
    
  Usage:
    $shuffled_seq = shuffleseq($gb);

  Options:
    -k            k-mer to preserve the composition (default: 1)
                  k=1 means only the base composition is preserved and the 
                  order is shuffled. k=1 preserves dinucleotide composition
                  as well as the single nucleotide composotion. Similarly,
                  k=n preserves all of 1~n -mers. Note that k < 5
                  with nucleotide sequence will take several minutes to 
                  finish. 

  References:
    1. Fisher R.A. and Yates F. (1938) "Example 12", Statistical Tables, London
    2. Durstenfeld R. (1964) "Algorithm 235: Random permutation", CACM 7(7):420
    3. Jiang M., Anderson J., Gillespie J., and Mayne M. (2008) "uShuffle: 
       a useful tool for shuffling biological sequences while preserving the k-let
       counts", BMC Bioinformatics 9:192
    4. Kandel D., Matias Y., Unver R., and Winker P. (1996) "Shuffling biological 
       sequences", Discrete Applied Mathematics 71(1-3):171-185  
    5. Propp J.G. and Wilson D.B. (1998) "How to get a perfectly random sample 
       from a generic Markov chain and generate a random spanning tree of a 
       directed graph", Journal of Algorithms 27(2):170-217

  Author:
    Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20080421-01 added k-mer count preservation option 20070612-01 initial posting
splitprintseqcodeprevnextTop
  Name: splitprintseq   -   format sequence data for printing

  Description:
    This method splits the given sequence string in segments of 
    certain length and inserts newline code ("\n") to print the
    sequence in a formatted way. By default, this function splits
    the string in segments of 60 letters. This length can be
    changed by supplying the second argument.

    
  Usage : 
    print splitprintseq($seq); # print sequence in a formatted way.
      or
    $formatted_seq = splitprintseq($seq, 100);  
    # split into segments of 100 characters and add "\n" to each line.

 Options:
    Length of segments can be specified as the second argument.
    Default is 60 characters to match GenBank.

  Author: 
    Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20050116-01 initial posting
to_fastacodeprevnextTop
  Name: to_fasta   -   output given sequence to a FASTA file

  Description:
    This method outputs the given sequence as a FASTA file.
    
  Usage : 
    $fasta_string = to_fasta($seq, -name=>"My sequence"); # output the sequence to out.fasta

    $fasta_string = to_fasta(%seq); # return FASTA formatted string of multi-fasta

 Options:
    -name        string for FASTA header (default: sequence)
                 if the first argument is an instance of new G(), 
                 $gb->{LOCUS}->{id} is used as default
    -length      number of characters per line (default: 60)
    -amino       when the given sequence is a G-language object, outputs the "translation" 
                 feature of all CDS as multi-FASTA format.
    -filter      reference to a list of FASTA IDs to output, when a hash of
                 multiple sequences are given. Only the sequences with IDs
                 supplied by this option is returned.
    -rfilter     reference to a list of FASTA IDs to NOT output, when a hash of
                 multiple sequences are given. Only the sequences not matching to
                 the supplied list  is returned.
    -filename    output filename (default: "out.fasta")
    -output      "f" to output to file, and return the fasta as a string.
                 "return" to just return the fasta as a string (default: "return")

  Author: 
    Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20120306-01 added -amino option 20110504-01 added -rfilter option 20110428-01 added support for multi-fasta given as a hash, and -filter option 20090312-01 changed the deafult behavior to return fasta as a string as well as outputting to a file 20070612-01 added -output option 20050116-01 initial posting
to_fastqcodeprevnextTop
  Name: to_fastq   -   output given sequence to a FASTQ file

  Description:
    This method outputs the given sequence as a FASTQ file. If the given sequence
    object does not contain $file->{QUAL} section, 'X' is used for quality score
    (corresponding to 50 in Phred quality).
    
  Usage : 
    $fastq_string = to_fastq($seq, -name=>"My sequence"); # output the sequence to out.fasta

 Options:
    -name        string for FASTQ header (default: sequence)
                 if the first argument is an instance of new G(), 
                 $gb->{LOCUS}->{id} is used as default
    -length      number of characters per line (default: 60)
    -filename    output filename (default: "out.fastq")
    -output      "f" to output to file, and return the fasta as a string.
                 "return" to just return the fasta as a string (default: "return")

  Author: 
    Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20091201-01 initial posting
translatecodeprevnextTop
  Name: translate   -   translate nucleotide sequence to amino acid sequence

  Description:
    This method translates a given nucleotide sequence into amino acid 
    sequence, using standard codon table. Change to the codon table or
    custom codon table can be supplied as optional arguments. Stop codons
    are denoted by '/' by default.

    eg. translate('ctggtg');  # returns 'LV'
    
  Usage : 
    $amino = translate($nucleotide);

    # to set stop codons to 'X'
    $amino = translate($nucleotide, tga=>"X", taa=>"X", tag=>"X");

    # to use custom codon table
    $amino = translate($nucleotide, %customCodonTable);

    # to use alternative codon table (2, Vertebrate Mitochondrial Code)
    $amino = translate($nucleotide, -table=>2);

  Options:
    Custom codon table can be optionally used (see above).

    Alternative Codon Tables can be used with:
    -table   number of alternative codon table (default: 1)
             Supported tables are listed below:

      1. The Standard Code
      2. The Vertebrate Mitochondrial Code
      3. The Yeast Mitochondrial Code
      4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
      5. The Invertebrate Mitochondrial Code
      6. The Ciliate, Dasycladacean and Hexamita Nuclear Code
      9. The Echinoderm and Flatworm Mitochondrial Code
      10. The Euplotid Nuclear Code
      11. The Bacterial, Archaeal and Plant Plastid Code
      12. The Alternative Yeast Nuclear Code
      13. The Ascidian Mitochondrial Code
      14. The Alternative Flatworm Mitochondrial Code
      16. Chlorophycean Mitochondrial Code
      21. Trematode Mitochondrial Code
      22. Scenedesmus obliquus Mitochondrial Code
      23. Thraustochytrium Mitochondrial Code
      24. Pterobranchia Mitochondrial Code
      25. Candidate Division SR1 and Gracilibacteria Code
  
      see http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details of the tables.
REST: http://rest.g-language.org/translate/ctggtg
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20160104-01 fixed a serious bug in the previous revision, where the default codon table was set to be number 25 instead of 1 20141114-01 added alternative codon tables 20091203-01 added cutom codon table option 20010727-01 initial posting
Methods code
DoubleHelixdescriptionprevnextTop
sub DoubleHelix {
    SubOpt::opt_default(speed=>0.05);
    my @args = opt_get(@_);  
    my $gb = opt_as_gb(shift @args);
    my $speed = opt_val("speed");

    $| = 1;

    my $i;
    my (@offset) = qw/1 0 0 0 1 2 3 4 5 5 4 3 2 1 0 0 0 1/;
    my (@dist)   = qw/0 2 3 4 4 4 3 2 0 0 2 3 4 4 4 3 2 0/;

    foreach my $base (split(//, $gb->{SEQ})){
        my $msg = "         " .  q// /x$offset[($i%scalar(@offset))];
$msg .= uc($base); $i ++; $msg .= q//-/x$dist[($i%scalar(@dist))];
$msg .= uc(complement($base)) . "\n"; msg_send($msg); select(undef,undef,undef,$speed) unless($speed == 0); }
}
complementdescriptionprevnextTop
sub complement {
    my $nuc = reverse(shift);

    $nuc =~ tr
[acgturymkdhbvwsnACGTURYMKDHBVWSN]
[tgcaayrkmhdvbwsnTGCAAYRKMHDVBWSN];
return $nuc;
}
shuffleseqdescriptionprevnextTop
sub shuffleseq {
    opt_default("k"=>1);
    my @args = SubOpt::opt_get(@_);
    my $gb = SubOpt::opt_as_gb(shift @args);
    my $k = opt_val("k");

    if($k == 1){
	return join('', shuffle(split(//, $gb->{SEQ})));
    }

    my $graph = {};
    my $revgraph = {};
    my $arborescence = {};

    my $length = length($gb->{SEQ});
    my %oligo = oligomer_counter($gb, -length=>$k);
    
    foreach my $nuc (keys %oligo){
	$graph->{substr($nuc, 0, $k - 1)}->{substr($nuc, -1, 1)} = $oligo{$nuc};
	$revgraph->{substr($nuc, 1, $k - 1)}->{substr($nuc, 0, 1)} = $oligo{$nuc};
    }
    my $numoligos = keys(%$graph);
    undef %oligo;


    msg_error("Compiling arborescene...\n");
    
    my %edged;
    my $count = 1;
    $edged{substr($gb->{SEQ}, - ($k - 1))} ++;
    my %visited = %edged;

    while($numoligos > $count){
	my @keys = keys %edged;
	my $node = $keys[int(rand(@keys))];
	my @nucs = keys %{$revgraph->{$node}};
	my $rnuc = $nucs[int(rand(@nucs))];

	my $newnode = $rnuc . substr($node, 0, $k - 2);
	unless($visited{$newnode}){
	    $count ++;
	    my $first = substr($node, -1, 1);
	    $visited{$newnode} ++;
	    $edged{$newnode} ++;
	    $arborescence->{$newnode} = $first;
	    $graph->{$newnode}->{$first} --;
	}
	delete($revgraph->{$node}->{$rnuc});
	delete($edged{$node}) if(scalar(@nucs) == 1);
    }
    undef $revgraph;

    msg_error("Compilation of arborescence finished.\nNow generating shuffled sequence.\n");

    my $seed = substr($gb->{SEQ}, 0, $k - 1);
    my $seq = $seed;

    while(length($seq) < $length){
	my $num = 0;
	foreach my $val (values %{$graph->{$seed}}){
	    $num += $val;
	}
	if ($num < 1){
	    if(defined $arborescence->{$seed}){
		my $oldseed = $seed;

		$seq .= $arborescence->{$seed};
		$seed = substr($seed, 1) . $arborescence->{$seed};

		delete($arborescence->{$oldseed});

		next;
	    }else{
		die("Dead end: failed to trace the Eulerian. Try shuffling again.");
		#Mathematically, this should not happen.
} } my $rand = int(rand($num)); foreach my $key (keys %{$graph->{$seed}}){ if($rand <= $graph->{$seed}->{$key}){ $graph->{$seed}->{$key} --; delete($graph->{$seed}->{$key}) if($graph->{$seed}->{$key} < 1); $seed = substr($seed, 1) . $key; $seq .= $key; last; }else{ $rand -= $graph->{$seed}->{$key}; } } } msg_error("Done.\n"); return $seq;
}
splitprintseqdescriptionprevnextTop
sub splitprintseq {
    my $seq = shift;
    my $len = shift || 60;
    my $ret = '';

    while(length $seq){
	$ret .= substr($seq, 0, $len) . "\n";
	substr($seq, 0, $len) = '';
    }
    
    return $ret;
}
to_fastadescriptionprevnextTop
sub to_fasta {
    opt_default(length=>60, filename=>"out.fasta", name=>"sequence", output=>"return");
    my @args     = opt_get(@_);
    my $gb       = $args[0];
    my $filename = opt_val("filename");
    my $name     = opt_val("name");
    my $length   = opt_val("length");
    my $output   = opt_val("output");
    my $filter   = opt_val("filter");
    my $rfilter  = opt_val("rfilter");
    my $amino    = opt_val("amino");
    my $sequence = $gb;

    if($filename ne 'out.fasta'){
	$output = 'f';
    }

    my (%flt, %rflt);
    if(length($filter)){
	map{$flt{$_}++} @$filter;
    }elsif(length($rfilter)){
	map{$rflt{$_}++} @$rfilter;
    }
	
    if(length $gb->{LOCUS}->{id}){
	if($amino){
	    my %tmp;
	    for my $cds ($gb->cds()){
		$tmp{$gb->{$cds}->{gene}} = $gb->{$cds}->{translation};
	    }
	    @args = %tmp;
	}else{
	    $name = $gb->{LOCUS}->{id} if ($name eq 'sequence');
	    $sequence  = $gb->{SEQ};
	}
    }
    die("Error: No sequence passed to to_fasta") unless(length($sequence));
    
    my $result;
    if(scalar(@args) > 1){
	my %seq = (@args);
	
	for my $id (sort keys %seq){
	    if(length($filter)){
		next unless($flt{$id});
	    }elsif(length($rfilter)){
		next if($rflt{$id});
	    }
	    $result .= sprintf ">%s\n%s", $id, splitprintseq($seq{$id}, $length);
	}
    }else{
	$result = sprintf ">%s\n%s", $name, splitprintseq($sequence, $length);
    }
    
    if($output eq 'f'){
	open(OUT, ">$filename") || die($!);
	printf OUT $result;
	close(OUT);
    }
    
    return $result;
}
to_fastqdescriptionprevnextTop
sub to_fastq {
    SubOpt::opt_default(length=>75, filename=>"out.fastq", name=>"sequence", output=>"return");
    my @args  = SubOpt::opt_get(@_);
    my $first = @args[0];
    die("Error: No sequence passed to to_fastq") unless(length($first));
    my $filename = SubOpt::opt_val("filename");
    my $name     = SubOpt::opt_val("name");
    my $length   = SubOpt::opt_val("length");
    my $output   = SubOpt::opt_val("output");

    $name = $first->{LOCUS}->{id} if($name eq 'sequence' && length $first->{LOCUS}->{id});
    $first->{QUAL} = 'X' x length($first->{SEQ}) unless(length($first->{QUAL}));

    my $result;
    if(scalar(@args) > 1){
	my %seq = (@args);
	
	for my $id (sort keys %seq){
	    $result .= 	sprintf("\@%s\n%s\+%s\n%s", $id, splitprintseq(uc($seq{$id}), $length), $name, splitprintseq('i' x length($seq{$id}), $length));
	}
    }else{
	$result = sprintf("\@%s\n%s\+%s\n%s", $name, splitprintseq(uc($first->{SEQ}), $length), $name, splitprintseq($first->{QUAL}, $length));
    }
    
    if($output eq 'f'){
	open(OUT, ">$filename") || die($!);
	printf OUT $result;
	close(OUT);
    }
    
    return $result;
}
translatedescriptionprevnextTop
sub translate {
    opt_default("table"=>1);
    my @args  = opt_get(@_);
    my $seq   = lc(shift @args);
    my $table = opt_val("table"); 
    my %option = @args;

    my %codons = %CodonTable;
    %codons = %{$CodonTables[$table - 1]} if(length($CodonTables[$table - 1]));

    foreach my $key (%option){
	$codons{$key} = $option{$key};
    }

    my $amino = '';

    while(3 <= length($seq)){
	my $codon = substr($seq, 0, 3);

	substr($seq, 0, 3) = '';
	if ($codon =~ /[^atgc]/){
	    $amino .= 'X';
	}else{
	    $amino .= $codons{$codon};
	}
#	p %codons;
# say $amino;
} msg_error("Translation: illegal length.\n") if(length($seq) && $option{"silent"} != 1); return $amino;
}
General documentation
No general documentation available.