G::Seq
Primitive
Summary
G::Seq::Primitivew - Basic sequence analysis methods
Package variables
Privates (from "my" definitions)
@CodonTables = ('',\% CodonTable2,\% CodonTable3,\% CodonTable4,\% CodonTable5,\% CodonTable6, '', '',\% CodonTable9,\% CodonTable10,\% CodonTable11,\% CodonTable12,\% CodonTable13,\% CodonTable14, '',\% CodonTable16, '', '', '', '',\% CodonTable21,\% CodonTable22,\% CodonTable23,\% CodonTable24,\% CodonTable25)
Included modules
Inherit
Exporter
Synopsis
No synopsis!
Description
This class is a part of G-language Genome Analysis Environment,
collecting basic sequence analysis methods.
Methods
Methods description
Name: DoubleHelix - we all love the DNA molecule!
Description:
This method prints the given sequence as an ASCII text-based graphic
depicting a DNA molecule. Enjoy :-)
Usage:
DoubleHelix($seq);
Options:
-speed controls the time to display one base-pair.
default is 0.05 (second)
REST:
http://rest.g-language.org/NR_029721/DoubleHelix/speed=0
Author:
Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History:
20060511-01 initial posting |
Name: shuffleseq - create randomized sequence with conserved k-mer composition
Description:
Shuffle and randomize the given sequence, conserving the nucleotide/peptide
k-mer content of the original sequence.
For k=1, i.e. shuffling sequencing preserving single nucleotide composition,
Fisher-Yates Algorithm is employed.
For k>1, shuffling preserves all k-mers (all k where k=1~k). For example,
k=3 preserves all triplet, doublet, and single nucleotide composition.
Algorithm for k-mer preserved shuffling is non-trivial, which is solved
by graph theoretical approach with Eulerian random walks in the graph of
k-1-mers. See references 3-5 for details of this algorithm.
Usage:
$shuffled_seq = shuffleseq($gb);
Options:
-k k-mer to preserve the composition (default: 1)
k=1 means only the base composition is preserved and the
order is shuffled. k=1 preserves dinucleotide composition
as well as the single nucleotide composotion. Similarly,
k=n preserves all of 1~n -mers. Note that k < 5
with nucleotide sequence will take several minutes to
finish.
References:
1. Fisher R.A. and Yates F. (1938) "Example 12", Statistical Tables, London
2. Durstenfeld R. (1964) "Algorithm 235: Random permutation", CACM 7(7):420
3. Jiang M., Anderson J., Gillespie J., and Mayne M. (2008) "uShuffle:
a useful tool for shuffling biological sequences while preserving the k-let
counts", BMC Bioinformatics 9:192
4. Kandel D., Matias Y., Unver R., and Winker P. (1996) "Shuffling biological
sequences", Discrete Applied Mathematics 71(1-3):171-185
5. Propp J.G. and Wilson D.B. (1998) "How to get a perfectly random sample
from a generic Markov chain and generate a random spanning tree of a
directed graph", Journal of Algorithms 27(2):170-217
Author:
Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History:
20080421-01 added k-mer count preservation option
20070612-01 initial posting |
Name: splitprintseq - format sequence data for printing
Description:
This method splits the given sequence string in segments of
certain length and inserts newline code ("\n") to print the
sequence in a formatted way. By default, this function splits
the string in segments of 60 letters. This length can be
changed by supplying the second argument.
Usage :
print splitprintseq($seq); # print sequence in a formatted way.
or
$formatted_seq = splitprintseq($seq, 100);
# split into segments of 100 characters and add "\n" to each line.
Options:
Length of segments can be specified as the second argument.
Default is 60 characters to match GenBank.
Author:
Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History:
20050116-01 initial posting |
Name: to_fasta - output given sequence to a FASTA file
Description:
This method outputs the given sequence as a FASTA file.
Usage :
$fasta_string = to_fasta($seq, -name=>"My sequence"); # output the sequence to out.fasta
$fasta_string = to_fasta(%seq); # return FASTA formatted string of multi-fasta
Options:
-name string for FASTA header (default: sequence)
if the first argument is an instance of new G(),
$gb->{LOCUS}->{id} is used as default
-length number of characters per line (default: 60)
-amino when the given sequence is a G-language object, outputs the "translation"
feature of all CDS as multi-FASTA format.
-filter reference to a list of FASTA IDs to output, when a hash of
multiple sequences are given. Only the sequences with IDs
supplied by this option is returned.
-rfilter reference to a list of FASTA IDs to NOT output, when a hash of
multiple sequences are given. Only the sequences not matching to
the supplied list is returned.
-filename output filename (default: "out.fasta")
-output "f" to output to file, and return the fasta as a string.
"return" to just return the fasta as a string (default: "return")
Author:
Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History:
20120306-01 added -amino option
20110504-01 added -rfilter option
20110428-01 added support for multi-fasta given as a hash, and -filter option
20090312-01 changed the deafult behavior to return fasta as a string as well as outputting to a file
20070612-01 added -output option
20050116-01 initial posting |
Name: to_fastq - output given sequence to a FASTQ file
Description:
This method outputs the given sequence as a FASTQ file. If the given sequence
object does not contain $file->{QUAL} section, 'X' is used for quality score
(corresponding to 50 in Phred quality).
Usage :
$fastq_string = to_fastq($seq, -name=>"My sequence"); # output the sequence to out.fasta
Options:
-name string for FASTQ header (default: sequence)
if the first argument is an instance of new G(),
$gb->{LOCUS}->{id} is used as default
-length number of characters per line (default: 60)
-filename output filename (default: "out.fastq")
-output "f" to output to file, and return the fasta as a string.
"return" to just return the fasta as a string (default: "return")
Author:
Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History:
20091201-01 initial posting |
Name: translate - translate nucleotide sequence to amino acid sequence
Description:
This method translates a given nucleotide sequence into amino acid
sequence, using standard codon table. Change to the codon table or
custom codon table can be supplied as optional arguments. Stop codons
are denoted by '/' by default.
eg. translate('ctggtg'); # returns 'LV'
Usage :
$amino = translate($nucleotide);
# to set stop codons to 'X'
$amino = translate($nucleotide, tga=>"X", taa=>"X", tag=>"X");
# to use custom codon table
$amino = translate($nucleotide, %customCodonTable);
# to use alternative codon table (2, Vertebrate Mitochondrial Code)
$amino = translate($nucleotide, -table=>2);
Options:
Custom codon table can be optionally used (see above).
Alternative Codon Tables can be used with:
-table number of alternative codon table (default: 1)
Supported tables are listed below:
1. The Standard Code
2. The Vertebrate Mitochondrial Code
3. The Yeast Mitochondrial Code
4. The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
5. The Invertebrate Mitochondrial Code
6. The Ciliate, Dasycladacean and Hexamita Nuclear Code
9. The Echinoderm and Flatworm Mitochondrial Code
10. The Euplotid Nuclear Code
11. The Bacterial, Archaeal and Plant Plastid Code
12. The Alternative Yeast Nuclear Code
13. The Ascidian Mitochondrial Code
14. The Alternative Flatworm Mitochondrial Code
16. Chlorophycean Mitochondrial Code
21. Trematode Mitochondrial Code
22. Scenedesmus obliquus Mitochondrial Code
23. Thraustochytrium Mitochondrial Code
24. Pterobranchia Mitochondrial Code
25. Candidate Division SR1 and Gracilibacteria Code
see http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details of the tables.
REST:
http://rest.g-language.org/translate/ctggtg
Author:
Kazuharu Arakawa (gaou@sfc.keio.ac.jp)
History:
20160104-01 fixed a serious bug in the previous revision, where the default codon table
was set to be number 25 instead of 1
20141114-01 added alternative codon tables
20091203-01 added cutom codon table option
20010727-01 initial posting |
Methods code
sub DoubleHelix
{ SubOpt::opt_default(speed=>0.05);
my @args = opt_get(@_);
my $gb = opt_as_gb(shift @args);
my $speed = opt_val("speed");
$| = 1;
my $i;
my (@offset) = qw/1 0 0 0 1 2 3 4 5 5 4 3 2 1 0 0 0 1/;
my (@dist) = qw/0 2 3 4 4 4 3 2 0 0 2 3 4 4 4 3 2 0/;
foreach my $base (split(//, $gb->{SEQ})){
my $msg = " " . q// /x$offset[($i%scalar(@offset))]; $msg .= uc($base);
$i ++;
$msg .= q//-/x$dist[($i%scalar(@dist))]; $msg .= uc(complement($base)) . "\n";
msg_send($msg);
select(undef,undef,undef,$speed) unless($speed == 0);
} } |
sub complement
{ my $nuc = reverse(shift);
$nuc =~ tr [acgturymkdhbvwsnACGTURYMKDHBVWSN] [tgcaayrkmhdvbwsnTGCAAYRKMHDVBWSN];
return $nuc; } |
sub shuffleseq
{ opt_default("k"=>1);
my @args = SubOpt::opt_get(@_);
my $gb = SubOpt::opt_as_gb(shift @args);
my $k = opt_val("k");
if($k == 1){
return join('', shuffle(split(//, $gb->{SEQ})));
}
my $graph = {};
my $revgraph = {};
my $arborescence = {};
my $length = length($gb->{SEQ});
my %oligo = oligomer_counter($gb, -length=>$k);
foreach my $nuc (keys %oligo){
$graph->{substr($nuc, 0, $k - 1)}->{substr($nuc, -1, 1)} = $oligo{$nuc};
$revgraph->{substr($nuc, 1, $k - 1)}->{substr($nuc, 0, 1)} = $oligo{$nuc};
}
my $numoligos = keys(%$graph);
undef %oligo;
msg_error("Compiling arborescene...\n");
my %edged;
my $count = 1;
$edged{substr($gb->{SEQ}, - ($k - 1))} ++;
my %visited = %edged;
while($numoligos > $count){
my @keys = keys %edged;
my $node = $keys[int(rand(@keys))];
my @nucs = keys %{$revgraph->{$node}};
my $rnuc = $nucs[int(rand(@nucs))];
my $newnode = $rnuc . substr($node, 0, $k - 2);
unless($visited{$newnode}){
$count ++;
my $first = substr($node, -1, 1);
$visited{$newnode} ++;
$edged{$newnode} ++;
$arborescence->{$newnode} = $first;
$graph->{$newnode}->{$first} --;
}
delete($revgraph->{$node}->{$rnuc});
delete($edged{$node}) if(scalar(@nucs) == 1);
}
undef $revgraph;
msg_error("Compilation of arborescence finished.\nNow generating shuffled sequence.\n");
my $seed = substr($gb->{SEQ}, 0, $k - 1);
my $seq = $seed;
while(length($seq) < $length){
my $num = 0;
foreach my $val (values %{$graph->{$seed}}){
$num += $val;
}
if ($num < 1){
if(defined $arborescence->{$seed}){
my $oldseed = $seed;
$seq .= $arborescence->{$seed};
$seed = substr($seed, 1) . $arborescence->{$seed};
delete($arborescence->{$oldseed});
next;
}else{
die("Dead end: failed to trace the Eulerian. Try shuffling again.");
}
}
my $rand = int(rand($num));
foreach my $key (keys %{$graph->{$seed}}){
if($rand <= $graph->{$seed}->{$key}){
$graph->{$seed}->{$key} --;
delete($graph->{$seed}->{$key}) if($graph->{$seed}->{$key} < 1);
$seed = substr($seed, 1) . $key;
$seq .= $key;
last;
}else{
$rand -= $graph->{$seed}->{$key};
}
}
}
msg_error("Done.\n");
return $seq; } |
sub splitprintseq
{ my $seq = shift;
my $len = shift || 60;
my $ret = '';
while(length $seq){
$ret .= substr($seq, 0, $len) . "\n";
substr($seq, 0, $len) = '';
}
return $ret; } |
sub to_fasta
{ opt_default(length=>60, filename=>"out.fasta", name=>"sequence", output=>"return");
my @args = opt_get(@_);
my $gb = $args[0];
my $filename = opt_val("filename");
my $name = opt_val("name");
my $length = opt_val("length");
my $output = opt_val("output");
my $filter = opt_val("filter");
my $rfilter = opt_val("rfilter");
my $amino = opt_val("amino");
my $sequence = $gb;
if($filename ne 'out.fasta'){
$output = 'f';
}
my (%flt, %rflt);
if(length($filter)){
map{$flt{$_}++} @$filter;
}elsif(length($rfilter)){
map{$rflt{$_}++} @$rfilter;
}
if(length $gb->{LOCUS}->{id}){
if($amino){
my %tmp;
for my $cds ($gb->cds()){
$tmp{$gb->{$cds}->{gene}} = $gb->{$cds}->{translation};
}
@args = %tmp;
}else{
$name = $gb->{LOCUS}->{id} if ($name eq 'sequence');
$sequence = $gb->{SEQ};
}
}
die("Error: No sequence passed to to_fasta") unless(length($sequence));
my $result;
if(scalar(@args) > 1){
my %seq = (@args);
for my $id (sort keys %seq){
if(length($filter)){
next unless($flt{$id});
}elsif(length($rfilter)){
next if($rflt{$id});
}
$result .= sprintf ">%s\n%s", $id, splitprintseq($seq{$id}, $length);
}
}else{
$result = sprintf ">%s\n%s", $name, splitprintseq($sequence, $length);
}
if($output eq 'f'){
open(OUT, ">$filename") || die($!);
printf OUT $result;
close(OUT);
}
return $result; } |
sub to_fastq
{ SubOpt::opt_default(length=>75, filename=>"out.fastq", name=>"sequence", output=>"return");
my @args = SubOpt::opt_get(@_);
my $first = @args[0];
die("Error: No sequence passed to to_fastq") unless(length($first));
my $filename = SubOpt::opt_val("filename");
my $name = SubOpt::opt_val("name");
my $length = SubOpt::opt_val("length");
my $output = SubOpt::opt_val("output");
$name = $first->{LOCUS}->{id} if($name eq 'sequence' && length $first->{LOCUS}->{id});
$first->{QUAL} = 'X' x length($first->{SEQ}) unless(length($first->{QUAL}));
my $result;
if(scalar(@args) > 1){
my %seq = (@args);
for my $id (sort keys %seq){
$result .= sprintf("\@%s\n%s\+%s\n%s", $id, splitprintseq(uc($seq{$id}), $length), $name, splitprintseq('i' x length($seq{$id}), $length));
}
}else{
$result = sprintf("\@%s\n%s\+%s\n%s", $name, splitprintseq(uc($first->{SEQ}), $length), $name, splitprintseq($first->{QUAL}, $length));
}
if($output eq 'f'){
open(OUT, ">$filename") || die($!);
printf OUT $result;
close(OUT);
}
return $result; } |
sub translate
{ opt_default("table"=>1);
my @args = opt_get(@_);
my $seq = lc(shift @args);
my $table = opt_val("table");
my %option = @args;
my %codons = %CodonTable;
%codons = %{$CodonTables[$table - 1]} if(length($CodonTables[$table - 1]));
foreach my $key (%option){
$codons{$key} = $option{$key};
}
my $amino = '';
while(3 <= length($seq)){
my $codon = substr($seq, 0, 3);
substr($seq, 0, 3) = '';
if ($codon =~ /[^atgc]/){
$amino .= 'X';
}else{
$amino .= $codons{$codon};
}
}
msg_error("Translation: illegal length.\n") if(length($seq) && $option{"silent"} != 1);
return $amino; } |
General documentation
No general documentation available.