| Summary | Included libraries | Package variables | Description | General documentation | Methods |
G::Seq::Util - Miscellaneous analysis methods related to sequence analysis.
This class is a part of G-language Genome Analysis Environment,
collecting miscellaneous sequence analysis methods.
| filter_cds_by_atg | Description | Code |
| find_king_of_gene | No description | Code |
| generate_oligomers | Description | Code |
| longest_ORF | Description | Code |
| molecular_weight | Description | Code |
| oligomer_translation | No description | Code |
| refseq2gbk | Description | Code |
| seqinfo | Description | Code |
| view_cds | Description | Code |
| filter_cds_by_atg | code | next | Top |
Name: filter_cds_by_atg - filter CDS by the presence of ATG near start codon
Description:
This method filters out the list of CDSs in a given genome, when
ATG is present near the start codon within the same coding frame.
Returns the list of CDSs that do not have nearby ATG, and sets
$genome->{$cds}->{on} to 0 for all CDSs that contain nearby ATG.
Usage:
@list_of_feature_ids = filter_cds_by_atg($gb)
Options:
-upstream length of upstream region (default:50)
-downstream length of downstream region (default:50)
-codon start codon to search (default: "atg")
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp) |
| generate_oligomers | code | prev | next | Top |
Name: generate_oligomers - generate all oligomers of given length
Description:
This method retuns a list of all oligomers of given length.
For example, for length 2, this method returns a list of all
di-nucleotides (aa,at,ag,ac,ta,tt,tg,tc,ga,gt,gg,gc,ca,ct,cg,cc),
and for length 8, returns all 65536 octamers.
Usage:
@oligomers = generate_oligomers($length)
Options:
None.
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp) |
| longest_ORF | code | prev | next | Top |
Name: longest_ORF - find longest ORFs within a given sequence
Description:
This method searches for putative coding regions by longest ORF
method, the longest spanning sequence seqment starting from "atg"
and ending with "tag", "taa", or "tgg" in the same coding frame.
Returning object is the G-language genome data object, so in
Shell, typing the following after calling this method shows the
list of putative ORFs:
$annotated_data->find(-type=>"CDS");
Usage:
$genome = longest_ORF($sequence);
Options:
none
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp) |
| molecular_weight | code | prev | next | Top |
Name: molecular_weight - calculates the molecular weight of given sequence or molecule
Description:
This method calculates the molecular weight of the given sequence, either ssDNA, dsDNA, RNA, Peptide,
or that of molecular compounds from chemical formula.
For RNA,
M.W. = (An x 329.2) + (Un x 306.2) + (Cn x 305.2) + (Gn x 345.2) + 159
For ssDNA,
M.W. = (An x 313.2) + (Tn x 304.2) + (Cn x 289.2) + (Gn x 329.2) + 79.0
For dsDNA,
M.W. of dsDNA = (nucleotides x 607.4) + 157.9
For peptides and chemical formula, mass is calculated by the cumulative addition of each
amino acid or atoms. For example,
molecular_weight("C6H12O6") #glucose
returns 180.0633881166
Usage:
$mass = molecular_weight($sequence);
Options:
-type automatically guessed, but can be optionally specified
to be "ssDNA" (or just "DNA"), "dsDNA", "RNA", "peptide", or "molecule"
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp) |
| refseq2gbk | code | prev | next | Top |
Name: refseq2gbk - convert RefSeq and GenBank IDs of prokaryotes
Description:
This method converts the RefSeq and GenBank IDs of prokaryotic genomes
based on the list provided by NCBI at
ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt |
| seqinfo | code | prev | next | Top |
Name: seqinfo - prints out basic nucleotide sequence statistics
Description:
This method prints out basic compositional statistics of the
given nucleotide sequence, in a format similar to the one printed
right after calling load(). Returns the number of A, T, G, and C bases.
Usage:
($a, $t, $g, $c) = seqinfo($genome)
Options:
Supply something as the second argument to suppress standard output.
ex. ($a, $t, $g, $c) = seqinfo($genome, 1)
This is useful for simple counting of the four nucleotides.
Author: Kazuharu Arakawa (gaou@sfc.keio.ac.jp) |
| view_cds | code | prev | next | Top |
Name: view_cds - displays a graph of nucleotide contents around start and stop codons
Description:
This method creates a graph showing the average A,T,G,C contents
around start/stop codons. This is useful to view consensus around
start/stop codons and to find characteristic pattern in CDS.
Usage :
view_cds(G instance);
Options:
-length length in bases to show around start/stop codons
(default: 100)
-gap gap shown in graph in between start/stop codon neighbors
(default: 3)
-filename outfile name (default: view_cds.png for graph,
view_cds.csv for file)
-output "f" for file, "g" for graph, "show" to display graph.
(default: "show")
Author:
Kazuharu Arakawa (gaou@sfc.keio.ac.jp) |
| filter_cds_by_atg | description | prev | next | Top |
opt_default(upstream=>50, downstream=>50, codon=>"atg"); my @args = opt_get(@_); my $gb = opt_as_gb(shift @args); my $upstream = int(opt_val("upstream") / 3);}
my $downstream = int(opt_val("downstream") / 3);
my $codon = opt_val("codon"); die ("codon must be 3 letters") if (length($codon) != 3); foreach my $cds ($gb->cds()){ my $seq = $gb->around_startcodon($cds, $upstream * 3, $downstream * 3, "without"); for(my $j = 0; $j <= length($seq) - 3; $j +=3){ if(substr($seq, $j, 3) eq $codon){ $gb->{$cds}->{on} = 0; last; } } } return $gb->cds();
| find_king_of_gene | description | prev | next | Top |
my $nuc=shift; my $gene='you have just found the king of genes.'."\n"; system("wget http://www.stagnightout.com/pics/what-to-wear/21280.jpg -O /tmp/afro.jpg -q"); msg_gimv('/tmp/afro.jpg'); return $gene;}
| generate_oligomers | description | prev | next | Top |
my $num = shift || 8; local *transferase = sub{ my @result; foreach my $key (@_){ push(@result, $key . $_) for (qw/a t g c/); } return @result; }; my @result = (''); for (1..$num){ @result = &transferase(@result); } return @result;}
| longest_ORF | description | prev | next | Top |
my $this = opt_as_gb(shift @_); my $new = new G::IO("blessed"); annotate_with_LORF($new, $this); return $new;}
| molecular_weight | description | prev | next | Top |
my @args = opt_get(@_); my $seq = shift @args; my $strand = opt_val("strand"); my $type = lc(opt_val("type")); unless(length($type)){ if($seq =~ /[0-9]/){ $type = "molecule"; }elsif(lc($seq) =~ /^[atgcn]+$/){ $type = "ssdna"; }elsif(lc($seq) =~ /^[augcn]+$/){ $type = "rna"; }else{ $type = "peptide"; } } $type = "ssdna" if($type eq 'dna'); if($type eq "molecule"){ my %atom; for my $line (readFile(dist_file('g-language', 'data/atom.csv')), 1){ my ($atomm, $mass) = split(/,/, $line,2); $atom{$atomm} = $mass; } my $res; while($seq =~ /([A-Z][a-z]*)(\d*)/g){ my $prod = $2 || 1; $res += $atom{$1} * $prod; } return $res; }else{ my (%count, $res); $count{$_}++ for (split(//, lc($seq))); if($type eq "peptide"){ require G::Seq::AminoAcid; my %mass = G::Seq::AminoAcid::mass(); for my $key (keys %count){ $res += $mass{uc($key)} * $count{$key}; } return $res; }elsif($type eq "ssdna"){ return $count{'a'} * 313.2 + $count{'u'} * 304.2 + $count{'t'} * 304.2 + $count{'c'}*289.2 + $count{'g'}*329.2 + $count{'n'} * 303.7 + 79; }elsif($type eq "rna"){ return $count{'a'} * 329.2 + $count{'u'} * 306.2 + $count{'t'} * 306.2 + $count{'c'}*305.2 + $count{'g'}*345.2 + $count{'n'} * 320.5 + 159; }elsif($type eq "dsdna"){ return length($seq) * 607.4 + 157.9; }else{ die("Unknown type $type in molecular_weight()"); } }}
| oligomer_translation | description | prev | next | Top |
my @args = opt_get(@_); my $seq = shift @args; my $frame = shift @args; my $len = length($seq); if ($frame > 3){ $seq = G::Seq::Util::complement($seq); $frame -= 3; } my %CodonTable = ( 'gac', 'D', 'caa', 'Q', 'gca', 'A', 'ctg', 'L', 'gat', 'D', 'cag', 'Q', 'gcc', 'A', 'ctt', 'L', 'gaa', 'E', 'agc', 'S', 'gcg', 'A', 'ata', 'I', 'gag', 'E', 'agt', 'S', 'gct', 'A', 'atc', 'I', 'aga', 'R', 'tca', 'S', 'gga', 'G', 'att', 'I', 'agg', 'R', 'tcc', 'S', 'ggc', 'G', 'cca', 'P', 'cga', 'R', 'tcg', 'S', 'ggg', 'G', 'ccc', 'P', 'cgc', 'R', 'tct', 'S', 'ggt', 'G', 'ccg', 'P', 'cgg', 'R', 'aca', 'T', 'gta', 'V', 'cct', 'P', 'cgt', 'R', 'acc', 'T', 'gtc', 'V', 'atg', 'M', 'aaa', 'K', 'acg', 'T', 'gtg', 'V', 'tgg', 'W', 'aag', 'K', 'act', 'T', 'gtt', 'V', 'tgc', 'C', 'cac', 'H', 'tac', 'Y', 'tta', 'L', 'tgt', 'C', 'cat', 'H', 'tat', 'Y', 'ttg', 'L', 'taa', '/', 'aac', 'N', 'ttc', 'F', 'cta', 'L', 'tag', '/', 'aat', 'N', 'ttt', 'F', 'ctc', 'L', 'tga', '/' ); my $return = ''; my $i; for ($i = 0; $i < $len; $i ++){ if ($i < $frame - 1){ $return .= substr($seq, $i, $frame - 1) . '-'; $i += $frame - 2; } elsif ($i + 3 <= $len){ $return .= $CodonTable{substr($seq, $i, 3)}; $i += 2; $return .= '-' unless ($i >= $len - 1); } else { $return .= substr($seq, $i); last; } } return $return;}
| refseq2gbk | description | prev | next | Top |
my $id = shift; $id =~ s/\.\d+//g; require LWP::Simple; my $url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt'; my $dir = _sdb_path() . '/data/prokaryote.txt'; my $rc = LWP::Simple::mirror($url, $dir); die("refseq2gbk: cannot retrieve data from NCBI.") unless(-e $dir); my %refseq2gbk; for my $line (readFile($dir)){ my @F = split(/\t/, $line); $F[8] =~ s/-//g; $F[9] =~ s/-//g; $F[8] =~ s/\.\d+//g; $F[9] =~ s/\.\d+//g; if($F[8] =~ /,/ || $F[9] =~ /,/){ my $i = 0; my @right = split(/,/, $F[9]); for my $tmp (split(/,/, $F[8])){ $refseq2gbk{$tmp} = $right[$i]; $refseq2gbk{$right[$i]} = $tmp; $i ++; } }else{ $refseq2gbk{$F[8]} = $F[9]; $refseq2gbk{$F[9]} = $F[8]; } } # p %refseq2gbk;}
return $refseq2gbk{$id};
| seqinfo | description | prev | next | Top |
return opt_as_gb(shift @_)->seq_info(shift @_);}
| view_cds | description | prev | next | Top |
opt_default(length=>100, filename=>"view_cds.png", gap=>3, output=>"show", application=>"gimv"); my @args = opt_get(@_); my $gb = opt_as_gb(shift @args); my $data = {}; my @pos; my $numcds = scalar $gb->cds(); my $i = 0; my $gap = opt_val("gap"); my $length = opt_val("length"); my $filename = opt_val("filename"); my $output = opt_val("output"); my $application = opt_val("application"); $filename = "view_cds.csv" if ($output eq "f" && opt_val("filename") eq "view_cds.png"); for my $nuc (qw/a t g c/){ for ($i = 0; $i < $length * 4 + 6 + $gap; $i++){ $data->{$nuc}->[i] = 0; } } foreach my $cds ($gb->cds()){ my $seq = $gb->around_startcodon($cds, $length, $length); for ($i = 0; $i < length($seq); $i ++){ $data->{substr($seq, $i, 1)}->[$i] += 100/$numcds;}
} $seq = $gb->around_stopcodon($cds, $length, $length); for ($i = 0; $i < length($seq); $i ++){ $data->{substr($seq, $i, 1)}->[$i + length($seq) + $gap] += 100/$numcds;
} } for ($i = 1; $i <= $length * 4 + 6 + $gap; $i ++){ push(@pos, $i); } if ($output eq "g" || $output eq "show"){ _UniMultiGrapher(\@ pos, -x => "position", -y => "percentage", [@{$data->{a}}], [@{$data->{t}}], [@{$data->{g}}], [@{$data->{c}}], -x1=>"A", -x2=>"T", -x3=>"G", -x4=>"C", -filename => $filename, -title => "Base Contents Around Start/Stop Codons" ); msg_gimv("graph/$filename") if($output eq "show"); }elsif ($output eq "f"){ open(OUT, '>data/' . $filename); print OUT "position,A,T,G,C\n"; for ($i = 0; $i < $length * 4 + 6 + $gap; $i ++){ printf OUT "%d,%3.2f,%3.2f,%3.2f,%3.2f\n", $i + 1, $data->{a}[$i], $data->{t}[$i], $data->{g}[$i], $data->{c}[$i]; } close(OUT); }