Primitive documentation.

Name: DoubleHelix - we all love the DNA molecule! Description: This method prints the given sequence as an ASCII text-based graphic depicting a DNA molecule. Enjoy :-) Usage: DoubleHelix($seq); Options: -speed controls the time to display one base-pair. default is 0.05 (second) REST: http://rest.g-language.org/NR_029721/DoubleHelix/speed=0
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20060511-01 initial posting

Name: complement - get the complementary nucleotide sequence Description: Given a sequence, returns its complement. eg. complement('atgc'); # returns 'gcat' REST: http://rest.g-language.org/complement/atgc
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20010727-01 initial posting

Name: shuffleseq - create randomized sequence with conserved k-mer composition Description: Shuffle and randomize the given sequence, conserving the nucleotide/peptide k-mer content of the original sequence. For k=1, i.e. shuffling sequencing preserving single nucleotide composition, Fisher-Yates Algorithm is employed. For k>1, shuffling preserves all k-mers (all k where k=1~k). For example, k=3 preserves all triplet, doublet, and single nucleotide composition. Algorithm for k-mer preserved shuffling is non-trivial, which is solved by graph theoretical approach with Eulerian random walks in the graph of k-1-mers. See references 3-5 for details of this algorithm. Usage: $shuffled_seq = shuffleseq($gb); Options: -k k-mer to preserve the composition (default: 1) k=1 means only the base composition is preserved and the order is shuffled. k=1 preserves dinucleotide composition as well as the single nucleotide composotion. Similarly, k=n preserves all of 1~n -mers. Note that k < 5 with nucleotide sequence will take several minutes to finish. References: 1. Fisher R.A. and Yates F. (1938) "Example 12", Statistical Tables, London 2. Durstenfeld R. (1964) "Algorithm 235: Random permutation", CACM 7(7):420 3. Jiang M., Anderson J., Gillespie J., and Mayne M. (2008) "uShuffle: a useful tool for shuffling biological sequences while preserving the k-let counts", BMC Bioinformatics 9:192 4. Kandel D., Matias Y., Unver R., and Winker P. (1996) "Shuffling biological sequences", Discrete Applied Mathematics 71(1-3):171-185 5. Propp J.G. and Wilson D.B. (1998) "How to get a perfectly random sample from a generic Markov chain and generate a random spanning tree of a directed graph", Journal of Algorithms 27(2):170-217 Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20080421-01 added k-mer count preservation option 20070612-01 initial posting

Name: splitprintseq - format sequence data for printing Description: This method splits the given sequence string in segments of certain length and inserts newline code ("\n") to print the sequence in a formatted way. By default, this function splits the string in segments of 60 letters. This length can be changed by supplying the second argument. Usage : print splitprintseq($seq); # print sequence in a formatted way. or $formatted_seq = splitprintseq($seq, 100); # split into segments of 100 characters and add "\n" to each line. Options: Length of segments can be specified as the second argument. Default is 60 characters to match GenBank. Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20050116-01 initial posting

Name: to_fasta - output given sequence to a FASTA file Description: This method outputs the given sequence as a FASTA file. Usage : $fasta_string = to_fasta($seq, -name=>"My sequence"); # output the sequence to out.fasta $fasta_string = to_fasta(%seq); # return FASTA formatted string of multi-fasta Options: -name string for FASTA header (default: sequence) if the first argument is an instance of new G(), $gb->{LOCUS}->{id} is used as default -length number of characters per line (default: 60) -amino when the given sequence is a G-language object, outputs the "translation" feature of all CDS as multi-FASTA format. -filter reference to a list of FASTA IDs to output, when a hash of multiple sequences are given. Only the sequences with IDs supplied by this option is returned. -rfilter reference to a list of FASTA IDs to NOT output, when a hash of multiple sequences are given. Only the sequences not matching to the supplied list is returned. -filename output filename (default: "out.fasta") -output "f" to output to file, and return the fasta as a string. "return" to just return the fasta as a string (default: "return") Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20120306-01 added -amino option 20110504-01 added -rfilter option 20110428-01 added support for multi-fasta given as a hash, and -filter option 20090312-01 changed the deafult behavior to return fasta as a string as well as outputting to a file 20070612-01 added -output option 20050116-01 initial posting

Name: to_fastq - output given sequence to a FASTQ file Description: This method outputs the given sequence as a FASTQ file. If the given sequence object does not contain $file->{QUAL} section, 'X' is used for quality score (corresponding to 50 in Phred quality). Usage : $fastq_string = to_fastq($seq, -name=>"My sequence"); # output the sequence to out.fasta Options: -name string for FASTQ header (default: sequence) if the first argument is an instance of new G(), $gb->{LOCUS}->{id} is used as default -length number of characters per line (default: 60) -filename output filename (default: "out.fastq") -output "f" to output to file, and return the fasta as a string. "return" to just return the fasta as a string (default: "return") Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20091201-01 initial posting

Name: translate - translate nucleotide sequence to amino acid sequence Description: This method translates a given nucleotide sequence into amino acid sequence, using standard codon table. Change to the codon table or custom codon table can be supplied as optional arguments. Stop codons are denoted by '/' by default. eg. translate('ctggtg'); # returns 'LV' Usage : $amino = translate($nucleotide); # to set stop codons to 'X' $amino = translate($nucleotide, tga=>"X", taa=>"X", tag=>"X"); # to use custom codon table $amino = translate($nucleotide, %customCodonTable); # to use alternative codon table (2, Vertebrate Mitochondrial Code) $amino = translate($nucleotide, -table=>2); Options: Custom codon table can be optionally used (see above). Alternative Codon Tables can be used with: -table number of alternative codon table (default: 1) Supported tables are listed below: 1. The Standard Code 2. The Vertebrate Mitochondrial Code 3. The Yeast Mitochondrial Code 4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code 5. The Invertebrate Mitochondrial Code 6. The Ciliate, Dasycladacean and Hexamita Nuclear Code 9. The Echinoderm and Flatworm Mitochondrial Code 10. The Euplotid Nuclear Code 11. The Bacterial, Archaeal and Plant Plastid Code 12. The Alternative Yeast Nuclear Code 13. The Ascidian Mitochondrial Code 14. The Alternative Flatworm Mitochondrial Code 16. Chlorophycean Mitochondrial Code 21. Trematode Mitochondrial Code 22. Scenedesmus obliquus Mitochondrial Code 23. Thraustochytrium Mitochondrial Code 24. Pterobranchia Mitochondrial Code 25. Candidate Division SR1 and Gracilibacteria Code see http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details of the tables.
REST: http://rest.g-language.org/translate/ctggtg
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History: 20160104-01 fixed a serious bug in the previous revision, where the default codon table was set to be number 25 instead of 1 20141114-01 added alternative codon tables 20091203-01 added cutom codon table option 20010727-01 initial posting

sub DoubleHelix {

    SubOpt::opt_default(speed=>0.05);
    my @args = opt_get(@_);  
    my $gb = opt_as_gb(shift @args);
    my $speed = opt_val("speed");

    $| = 1;

    my $i;
    my (@offset) = qw/1 0 0 0 1 2 3 4 5 5 4 3 2 1 0 0 0 1/;
    my (@dist)   = qw/0 2 3 4 4 4 3 2 0 0 2 3 4 4 4 3 2 0/;

    foreach my $base (split(//, $gb->{SEQ})){
        my $msg = "         " .  q// /x$offset[($i%scalar(@offset))];
	$msg .= uc($base);
	$i ++;
	$msg .= q//-/x$dist[($i%scalar(@dist))];
	$msg .=  uc(complement($base)) .  "\n";
	msg_send($msg);
	select(undef,undef,undef,$speed) unless($speed == 0);
    }

}

sub shuffleseq {

    opt_default("k"=>1);
    my @args = SubOpt::opt_get(@_);
    my $gb = SubOpt::opt_as_gb(shift @args);
    my $k = opt_val("k");

    if($k == 1){
	return join('', shuffle(split(//, $gb->{SEQ})));
    }

    my $graph = {};
    my $revgraph = {};
    my $arborescence = {};

    my $length = length($gb->{SEQ});
    my %oligo = oligomer_counter($gb, -length=>$k);
    
    foreach my $nuc (keys %oligo){
	$graph->{substr($nuc, 0, $k - 1)}->{substr($nuc, -1, 1)} = $oligo{$nuc};
	$revgraph->{substr($nuc, 1, $k - 1)}->{substr($nuc, 0, 1)} = $oligo{$nuc};
    }
    my $numoligos = keys(%$graph);
    undef %oligo;


    msg_error("Compiling arborescene...\n");
    
    my %edged;
    my $count = 1;
    $edged{substr($gb->{SEQ}, - ($k - 1))} ++;
    my %visited = %edged;

    while($numoligos > $count){
	my @keys = keys %edged;
	my $node = $keys[int(rand(@keys))];
	my @nucs = keys %{$revgraph->{$node}};
	my $rnuc = $nucs[int(rand(@nucs))];

	my $newnode = $rnuc . substr($node, 0, $k - 2);
	unless($visited{$newnode}){
	    $count ++;
	    my $first = substr($node, -1, 1);
	    $visited{$newnode} ++;
	    $edged{$newnode} ++;
	    $arborescence->{$newnode} = $first;
	    $graph->{$newnode}->{$first} --;
	}
	delete($revgraph->{$node}->{$rnuc});
	delete($edged{$node}) if(scalar(@nucs) == 1);
    }
    undef $revgraph;

    msg_error("Compilation of arborescence finished.\nNow generating shuffled sequence.\n");

    my $seed = substr($gb->{SEQ}, 0, $k - 1);
    my $seq = $seed;

    while(length($seq) < $length){
	my $num = 0;
	foreach my $val (values %{$graph->{$seed}}){
	    $num += $val;
	}
	if ($num < 1){
	    if(defined $arborescence->{$seed}){
		my $oldseed = $seed;

		$seq .= $arborescence->{$seed};
		$seed = substr($seed, 1) . $arborescence->{$seed};

		delete($arborescence->{$oldseed});

		next;
	    }else{
		die("Dead end: failed to trace the Eulerian. Try shuffling again.");
		#Mathematically, this should not happen.
	    }
	}

	my $rand = int(rand($num));
	foreach my $key (keys %{$graph->{$seed}}){
	    if($rand <= $graph->{$seed}->{$key}){
		$graph->{$seed}->{$key} --;
		delete($graph->{$seed}->{$key}) if($graph->{$seed}->{$key} < 1);
		
		$seed = substr($seed, 1) . $key;
		$seq .= $key;
		    
		last;
	    }else{
		$rand -= $graph->{$seed}->{$key};
	    }
	}
    }

    msg_error("Done.\n");
    return $seq;

}

sub to_fasta {

    opt_default(length=>60, filename=>"out.fasta", name=>"sequence", output=>"return");
    my @args     = opt_get(@_);
    my $gb       = $args[0];
    my $filename = opt_val("filename");
    my $name     = opt_val("name");
    my $length   = opt_val("length");
    my $output   = opt_val("output");
    my $filter   = opt_val("filter");
    my $rfilter  = opt_val("rfilter");
    my $amino    = opt_val("amino");
    my $sequence = $gb;

    if($filename ne 'out.fasta'){
	$output = 'f';
    }

    my (%flt, %rflt);
    if(length($filter)){
	map{$flt{$_}++} @$filter;
    }elsif(length($rfilter)){
	map{$rflt{$_}++} @$rfilter;
    }
	
    if(length $gb->{LOCUS}->{id}){
	if($amino){
	    my %tmp;
	    for my $cds ($gb->cds()){
		$tmp{$gb->{$cds}->{gene}} = $gb->{$cds}->{translation};
	    }
	    @args = %tmp;
	}else{
	    $name = $gb->{LOCUS}->{id} if ($name eq 'sequence');
	    $sequence  = $gb->{SEQ};
	}
    }
    die("Error: No sequence passed to to_fasta") unless(length($sequence));
    
    my $result;
    if(scalar(@args) > 1){
	my %seq = (@args);
	
	for my $id (sort keys %seq){
	    if(length($filter)){
		next unless($flt{$id});
	    }elsif(length($rfilter)){
		next if($rflt{$id});
	    }
	    $result .= sprintf ">%s\n%s", $id, splitprintseq($seq{$id}, $length);
	}
    }else{
	$result = sprintf ">%s\n%s", $name, splitprintseq($sequence, $length);
    }
    
    if($output eq 'f'){
	open(OUT, ">$filename") || die($!);
	printf OUT $result;
	close(OUT);
    }
    
    return $result;

}

sub to_fastq {

    SubOpt::opt_default(length=>75, filename=>"out.fastq", name=>"sequence", output=>"return");
    my @args  = SubOpt::opt_get(@_);
    my $first = @args[0];
    die("Error: No sequence passed to to_fastq") unless(length($first));
    my $filename = SubOpt::opt_val("filename");
    my $name     = SubOpt::opt_val("name");
    my $length   = SubOpt::opt_val("length");
    my $output   = SubOpt::opt_val("output");

    $name = $first->{LOCUS}->{id} if($name eq 'sequence' && length $first->{LOCUS}->{id});
    $first->{QUAL} = 'X' x length($first->{SEQ}) unless(length($first->{QUAL}));

    my $result;
    if(scalar(@args) > 1){
	my %seq = (@args);
	
	for my $id (sort keys %seq){
	    $result .= 	sprintf("\@%s\n%s\+%s\n%s", $id, splitprintseq(uc($seq{$id}), $length), $name, splitprintseq('i' x length($seq{$id}), $length));
	}
    }else{
	$result = sprintf("\@%s\n%s\+%s\n%s", $name, splitprintseq(uc($first->{SEQ}), $length), $name, splitprintseq($first->{QUAL}, $length));
    }
    
    if($output eq 'f'){
	open(OUT, ">$filename") || die($!);
	printf OUT $result;
	close(OUT);
    }
    
    return $result;

}

sub translate {

    opt_default("table"=>1);
    my @args  = opt_get(@_);
    my $seq   = lc(shift @args);
    my $table = opt_val("table"); 
    my %option = @args;

    my %codons = %CodonTable;
    %codons = %{$CodonTables[$table - 1]} if(length($CodonTables[$table - 1]));

    foreach my $key (%option){
	$codons{$key} = $option{$key};
    }

    my $amino = '';

    while(3 <= length($seq)){
	my $codon = substr($seq, 0, 3);

	substr($seq, 0, 3) = '';
	if ($codon =~ /[^atgc]/){
	    $amino .= 'X';
	}else{
	    $amino .= $codons{$codon};
	}
#	p %codons;
#	say $amino;
    }

    msg_error("Translation: illegal length.\n") if(length($seq) && $option{"silent"} != 1);

    return $amino;

}

DoubleHelix	Description	Code
complement	Description	Code
shuffleseq	Description	Code
splitprintseq	Description	Code
to_fasta	Description	Code
to_fastq	Description	Code
translate	Description	Code