aerial.bio.utils.files

Clojure only.

canonical-csv-entry-info
check-sto
chunk-genome-fnas
collapse-group
collapse-one
entry-file->fasta-file
entry-parts
fastq->fna
fastqs->fnas
fna->fastq
fnas->fastqs
gaisr-seq-set?
gbank->fna
gbank-loci-sane-loci
gbanks->fnas
gen-entry-file
gen-entry-nv-file
gen-name-seq
gen-name-seq-pairs
gen-nc-genome-fnas
genbank-recs
genbk2gtf
get-csv-entry-info
get-ent-as-csv-info
get-entries
get-full-comma-sep-stg
get-gaisr-csv-info
get-legacy-csv-info
get-selection-fna
get-sto-as-csv-info
has-loc?
join-sto-fasta-file
join-sto-fasta-lines
make-entry
map-aln-seqs
map-seqs
nms-sqs->fasta-file
print-sto
read-aln-seqs
read-dirs-aln-seqs
read-farec
read-farecs
read-fqrec
read-fqrecs
read-seqs
reduce-aln-seqs
reduce-seqs
sample-fna
sample-fq
seqline-info-mapper
split-join-fasta-file
split-join-ncbi-fasta-file
sto->aln
sto->aln-blocked
sto->fna
sto-GC-and-seq-lines
write-farec
write-farecs
write-fqrec
write-fqrecs
write-sto

Various bio sequence file format readers, writers, verifiers, and manipulators.

canonical-csv-entry-info^clj

(canonical-csv-entry-info entries & {:keys [ev] :or {ev 0.0}})

check-sto^clj

(check-sto sto & {printem :printem :or {printem true}})

Checks a sto file to ensure that valid characters are being used in the sequences consensus structure line. Prints out errors in the sto file by sequence number. The input must be a sto file.

chunk-genome-fnas^clj

(chunk-genome-fnas genome-fna-dir & {:keys [chunk-size] :or {chunk-size 100}})

Take the set of fnas in directory GENOME-FNA-DIR (presumably created by split-join-ncbi-fasta-file or similar) and aggregate them into a new smaller set of files, where each new file contains the contents of CHUNK-SIZE input files (with the possible exception of the last file having a smaller number). This is useful for creating custom data sets for search.
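
The grouping step can be pictured with a minimal sketch. This is not the library's implementation: `chunk-file-names` is a hypothetical helper that only partitions a list of file names, and it assumes the last group is simply allowed to be shorter.

```clojure
;; Hypothetical helper, not part of aerial.bio.utils.files.
;; Groups file names into runs of chunk-size; the last run may be shorter.
(defn chunk-file-names
  [file-names chunk-size]
  (map vec (partition-all chunk-size file-names)))

(chunk-file-names ["a.fna" "b.fna" "c.fna"] 2)
```

Each resulting group would then be concatenated into one output file.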

collapse-group^clj

(collapse-group pairs)

collapse-one^clj

(collapse-one fqa fasta)

Collapse the sequences in fqa, a fastq or fasta file and write the collapsed value as a fasta record to fasta, a file spec for fasta output.

entry-file->fasta-file^clj

(entry-file->fasta-file efile & {:keys [names-only]})

entry-parts^clj

(entry-parts entry & {:keys [ldelta rdelta] :or {ldelta 0 rdelta 0}})

ENTRY is a string "name/range/strand", where name is a genome name, range is of the form start-end and strand is 1 or -1. At least name must be supplied. LDELTA and RDELTA are integers: LDELTA is subtracted from start and RDELTA is added to end.

Returns a triple [name [start end] strand]
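
A minimal sketch of the entry format described above, assuming nothing about the real implementation. `parse-entry` is a hypothetical name; returning nil for a missing range and defaulting a missing strand to 1 are assumptions made only for this illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for illustration; not the library's entry-parts.
(defn parse-entry
  [entry & {:keys [ldelta rdelta] :or {ldelta 0 rdelta 0}}]
  (let [[nm rng strand] (str/split entry #"/")
        [s e] (when rng (map #(Long/parseLong %) (str/split rng #"-")))]
    [nm
     ;; ldelta widens the range on the left, rdelta on the right
     (when rng [(- s ldelta) (+ e rdelta)])
     ;; assumption: a missing strand is treated as 1
     (if strand (Long/parseLong strand) 1)]))

(parse-entry "NC_000913/100-200/-1" :ldelta 10 :rdelta 5)
```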

fastq->fna^clj

(fastq->fna fq faot)

Convert a fastq format file to a fasta file. Fastq files have 4 lines per 'record' (id, sq, qcid, qc/phred-scores), while the corresponding fasta has only the id and sq lines per 'record'. Both FQ and FAOT are file specs.
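
The record mapping can be sketched on an in-memory seq of lines. `fastq-lines->fasta-lines` is a hypothetical helper, and it assumes well-formed, unblocked fastq input (exactly 4 lines per record, id lines starting with `@`).

```clojure
;; Hypothetical helper; works on lines, not on file specs.
(defn fastq-lines->fasta-lines
  [lines]
  (mapcat (fn [[id sq _qcid _qc]]
            ;; keep only the id and sq lines, swapping the leading @ for >
            [(str ">" (subs id 1)) sq])
          (partition 4 lines)))

(fastq-lines->fasta-lines ["@r1" "ACGT" "+" "IIII"
                           "@r2" "GGCC" "+" "!!!!"])
```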

fastqs->fnas^clj

(fastqs->fnas fqs & {:keys [outdir]})

Convert the fastqs in FQS (a seq) to corresponding fasta files. The fasta files are named as the fastqs but with file type .fna. If outdir is given, place the fastas there, otherwise place in same directory as corresponding fastqs. The items in fqs are file specs!

fna->fastq^clj

(fna->fastq infna outfq bc%)

Convert a fasta format file into a fastq format file. infna and outfq are file specifications (strings) of the input fasta and the output fastq.

The issue here is simply what the quality control (Phred) scores should be. We assume the user understands the issue and provide a 'slight' helper for this: bc% is the probability that the bases (all!!) are correct. We don't accommodate the case of providing a separate value for each base, as that would indicate you already have a fastq version of the sq in question (or certainly should have such). The bc% is converted to the corresponding Phred score and then Sanger encoded (0 -> 33, with every score written as its corresponding ASCII character).

NOTE: output is unblocked even if input fasta is in blocked format!
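
The score conversion can be sketched as follows. This uses the standard Phred definition Q = -10 log10(1 - p) and the Sanger offset of 33 mentioned above. The function names are hypothetical, and treating bc% as a fraction in [0, 1) rather than a percentage is an assumption of this sketch.

```clojure
;; Hypothetical helpers; p is assumed to be a fraction in [0, 1).
(defn prob-correct->phred
  [p]
  ;; Phred score from the error probability (1 - p), rounded to an integer
  (Math/round (* -10.0 (Math/log10 (- 1.0 p)))))

(defn phred->sanger-char
  [q]
  ;; Sanger encoding: score 0 maps to ASCII 33
  (char (+ q 33)))

(defn uniform-quality-line
  [sq p]
  ;; one identical quality character per base
  (apply str (repeat (count sq) (phred->sanger-char (prob-correct->phred p)))))

(uniform-quality-line "ACGT" 0.999)
```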

fnas->fastqs^clj

(fnas->fastqs fastas fastqdir bc%)

Convert fastas (collection of fasta filespecs as strings) to corresponding fastqs in directory fastqdir. Conversion is by fna->fastq.

gaisr-seq-set?^clj

(gaisr-seq-set? x)

Returns true if any of the following holds:

X is a filespec string with type extension fna, aln, sto, or gma
X is a java.io.File (presumed to be of one of the above formats)
X is a collection (presumed to have sequences as elements)

gbank->fna^clj

(gbank->fna gbfile fafile)

Obtains the genomic sequence (the 'ORIGIN' record) of a genbank format file gbfile and writes it as the sequence (unblocked) of a fasta file located in fafile. The id line of the fasta consists of the 'LOCUS' fields of the gbfile.

gbank-loci-sane-loci^clj

(gbank-loci-sane-loci gbloci)

GBFF loci are often, if not incoherent, certainly vague and not very useful. This function takes such and turns them into nice clean loci with simple start, end and strand. BUG: this will lose some information in certain cases, but should not with CDS and genes.

gbanks->fnas^clj

(gbanks->fnas gbfiles outdir)

Calls gbank->fna on all genbank format files in gbfiles, a collection of fully qualified file specifications of genbank files. The corresponding fasta file specification is composed of the basename of a genbank file, with '.fna' file type and placed in outdir. Outdir is a string which is an output directory specification, which must exist and be a directory.

gen-entry-file^clj

(gen-entry-file entries file)

Take entries (a collection of nm/s-e/std - see make-entry / entry-parts) and write them out to file. Returns file.

gen-entry-nv-file^clj

(gen-entry-nv-file entries file)

Take entries (a collection of entry [,v]* where v(s) are optional values (JSD, EVals, etc)) and write them as a csv file.

gen-name-seq^clj

(gen-name-seq entry
              &
              {:keys [basedir ldelta rdelta rna]
               :or {basedir (pams/default-genome-fasta-dir entry)
                    ldelta 0
                    rdelta 0
                    rna true}})

Generate a pair [entry genome-seq], from ENTRY as possibly modified by [L|R]DELTA and RNA. ENTRY is a string "name/range/strand", where

name is the genome NC name (and we only currently support NCs),

range is of the form start-end, where start and end are integers (1 based...) for the start and end coordinates of name's sequence to return. NOTE: start < end, as reverse complement information comes from strand.

strand is either -1 for reverse complement or 1 for standard 5'->3'

LDELTA is an integer, which will be subtracted from start. So, ldelta < 0 removes |ldelta| bases from 5', ldelta > 0 'tacks on' ldelta extra bases to the 5' end. Defaults to 0 (no change).

RDELTA is an integer, which will be added to the end. So, rdelta < 0 removes |rdelta| bases from 3', rdelta > 0 'tacks on' rdelta extra bases to the 3' end. Defaults to 0 (no change).

If RNA is true, change Ts to Us, otherwise return unmodified sequence.

BASEDIR is the location of the NC fasta files. Generally, this should always be the default location.
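
The coordinate arithmetic can be sketched on an in-memory sequence. `extract-region` and `reverse-complement` are hypothetical helpers, not the library's code. The sketch assumes 1-based inclusive coordinates, clamps the widened range to the sequence bounds, and applies the deltas to start and end before any reverse complementing; the real function reads the sequence from the NC fasta files under BASEDIR.

```clojure
(require '[clojure.string :as str])

(def complement-base {\A \T, \T \A, \C \G, \G \C, \N \N})

(defn reverse-complement
  [sq]
  ;; unknown characters are passed through unchanged
  (apply str (map #(get complement-base % %) (reverse sq))))

;; Hypothetical sketch of the range/strand/RNA handling.
(defn extract-region
  [genome start end strand
   & {:keys [ldelta rdelta rna] :or {ldelta 0 rdelta 0 rna true}}]
  (let [s  (max 1 (- start ldelta))              ; ldelta is subtracted from start
        e  (min (count genome) (+ end rdelta))   ; rdelta is added to end
        sq (subs genome (dec s) e)               ; 1-based inclusive -> subs indices
        sq (if (= strand -1) (reverse-complement sq) sq)]
    (if rna (str/replace sq "T" "U") sq)))

(extract-region "AACCGGTTAC" 7 10 -1)
```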

gen-name-seq-pairs^clj

(gen-name-seq-pairs entries
                    &
                    {:keys [basedir strand ldelta rdelta rna]
                     :or {basedir pams/default-genome-fasta-dir
                          strand 0
                          ldelta 0
                          rdelta 0
                          rna true}})

Generate a sequence of pairs [name genome-seq] from the given collection denoted by entries, which is either a collection or a string denoting an entry filespec (either hand made or via gen-entry-file or similar).

See gen-name-seq for details. Basically this is (map gen-name-seq entries) with some options. In particular, entries may lack a strand component (again see gen-name-seq for format details), and STRAND here would be used to indicate all entries are to have this strand. If entries have a strand component, strand should be 0, otherwise you will force all entries to the strand given. Strand is either -1 (reverse complement) or 1 for standard 5'->3', or 0 meaning 'use strand component in entries'.

BASEDIR is the location of the NC fasta files. Generally, this should always be the default location.

gen-nc-genome-fnas^clj

(gen-nc-genome-fnas full-nc-genomes-filespec)

One shot genome sequence fnas generator. Typically used once per data update. Needs to be generalized to be able to use Genbank fna archives. Currently assumes one fna AND ONLY NC_* genomes.

genbank-recs^clj

(genbank-recs gbfile
              &
              {:keys [feats attrs]
               :or {feats ["gene" "CDS" "misc_feature" "rRNA" "tRNA" "tmRNA"
                           "ncRNA"]
                    attrs ["gene" "locus_tag" "old_locus_tag" "protein_id"
                           "condon_start" "gene_synonym" "db_xref"]}})

Takes a genbank file, gbfile - a string file specification, and a set of features (feats) and associated attributes (attrs), so called qualifiers and values, and returns a vector where the first element describes the LOCUS of the gbfile and each subsequent element is a triple [feature loci attr-map], where feature (a string) is one of feats, loci is a vector [start end strand], (all strings, strand = 1|-1), and attr-map is a map with keys in attrs and their associated values from the genbank record.

genbk2gtf^clj

(genbk2gtf gbfile gtfout)

(genbk2gtf gbfile gtfout options)

Extract a GTF (gene transfer format) file from the record information (features and their attributes) given in the genbank format file gbfile. The GTF file also includes the p_id attribute on CDS records, as required by the cuff* suite of software (this is no longer all that meaningful as the cuff* suite of tools is basically deprecated as 'not good enough').

options is a map of options to use. Currently supports keys :id-order and :feats.

If :id-order is given, the value should be a vector of "locus_tag", "old_locus_tag", "gene", and "protein_id"; the order given determines which of these is used for gene_id and transcript_id. The default value is ["locus_tag" "old_locus_tag" "gene" "protein_id"].

If :feats is given, the value should be a vector of feature type names taken from this list:

"gene", "CDS", "misc_feature", "rRNA", "tRNA", "tmRNA", "ncRNA"

This will determine what features to include and, importantly, the order will determine the feature type to encode them as. Features in genbank files are often encoded under multiple types, for example, rRNAs are also listed as genes. If you gave :feats ["CDS" "rRNA" "gene"] the rRNA records will have a feature type of rRNA instead of gene as rRNA occurs before gene in the list.
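
The ordering rule for :feats can be illustrated with a small sketch. `preferred-feature-type` is a hypothetical helper that only models the "first match in the list wins" rule stated above; it does not read genbank files.

```clojure
;; Hypothetical illustration of the :feats ordering rule.
(defn preferred-feature-type
  [feats types-at-locus]
  ;; walk feats in order and take the first type the locus is listed under
  (first (filter (set types-at-locus) feats)))

;; An rRNA that is also listed as a gene:
(preferred-feature-type ["CDS" "rRNA" "gene"] ["gene" "rRNA"])
(preferred-feature-type ["gene" "rRNA"] ["gene" "rRNA"])
```

With "rRNA" ahead of "gene" the first call yields "rRNA"; with the order reversed the second call yields "gene".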

get-csv-entry-info^clj

(get-csv-entry-info csv-hit-file)

get-ent-as-csv-info^clj

(get-ent-as-csv-info ent-file)

get-entries^clj

(get-entries filespec & [seqs])

get-full-comma-sep-stg^clj

(get-full-comma-sep-stg stg rdr)

get-gaisr-csv-info^clj

(get-gaisr-csv-info rows)

get-legacy-csv-info^clj

(get-legacy-csv-info rows)

get-selection-fna^clj

(get-selection-fna selections)

Get a fasta file for SELECTIONS. If selections is a file, assumes it is in fact the fasta file. If selections is a collection, assumes the collection is a set of pairs [nm sq], and converts to corresponding fasta file. Returns full path of result file.

get-sto-as-csv-info^clj

(get-sto-as-csv-info stofile & {:keys [ev] :or {ev 0.0}})

has-loc?^clj

(has-loc? entries)

join-sto-fasta-file^clj

(join-sto-fasta-file in-filespec
                     out-filespec
                     &
                     {origin :origin :or {origin ""}})

Joins (de-blocks) blocked sequence lines in a sto file or fasta file. If in-filespec is a sto file, ORIGIN is a #=GF line indicating tool origin of file. For example, '#=GF AU Infernal 1.0.2'. For stos defaults to nothing, for fastas, not used.

join-sto-fasta-lines^clj

(join-sto-fasta-lines infilespec origin)

make-entry^clj

(make-entry evec)

(make-entry nm s e st)

'Inverse' of entry-parts. EVEC is a vector of shape [nm [s e] strd], where nm is the entry name, S and E are the start and end coordinates in the genome, and strd is the strand marker, 1 or -1. Returns the full entry as: nm/s-e/strd
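
The two arities can be sketched as below. `entry-string` is a hypothetical stand-in that only shows the nm/s-e/strd formatting described above.

```clojure
;; Hypothetical stand-in for illustration; mirrors the two documented arities.
(defn entry-string
  ([[nm [s e] strd]] (entry-string nm s e strd))
  ([nm s e strd] (str nm "/" s "-" e "/" strd)))

(entry-string ["NC_000913" [100 200] -1])
(entry-string "NC_000913" 100 200 1)
```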

map-aln-seqs^clj

(map-aln-seqs f cols filespec)

(map-aln-seqs f par cols filespec & filespecs)

map-seqs^clj

(map-seqs f filespec)

(map-seqs f par filespec & filespecs)

nms-sqs->fasta-file^clj

(nms-sqs->fasta-file nms-sqs filespec)

print-sto^clj

(print-sto seq-lines structure)

Takes sequence lines and a structure line and writes them out in sto file format. SEQ-LINES must be a collection of [name sequence] pairs; STRUCTURE is a string. The output is simply printed to the REPL.

read-aln-seqs^clj

(read-aln-seqs filespec & {cols :cols :or {cols false}})

Read the aligned sequences in FILESPEC and return them in a Clojure seq. Filespec can denote either an aln, sto, or gma file format file. If COLS is true, return the columns of the alignment (including gap characters).
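
What "columns of the alignment" means can be shown with a minimal sketch on in-memory sequences. `alignment-columns` is a hypothetical helper; returning each column as a string is an assumption, and the library may represent columns differently.

```clojure
;; Hypothetical helper: transpose equal-length aligned sequences into columns.
(defn alignment-columns
  [aligned-seqs]
  ;; map over all sequences in parallel, one position at a time
  (apply map str aligned-seqs))

(alignment-columns ["AC-G" "A-CG" "ACCG"])
```

Gap characters are kept, so every column has one element per sequence.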

read-dirs-aln-seqs^clj

(read-dirs-aln-seqs dir
                    &
                    {dirdir :dirdir
                     cols :cols
                     ftypes :ftypes
                     :or {dirdir false cols false ftypes ["sto"]}})

Apply read-aln-seqs across FTYPES in DIR. FTYPES is a vector of one or more file formats (as type extensions) taken from #{"fna", "aln", "sto", "gma"}, producing a seq of sequence sets, each set being a seq of the sequences in a matching file. NOTE: each such set can be viewed as a 'matrix' of the elements (bases/gaps) of the sequences.

If COLS is true, return the transpose of each obtained sequence 'matrix', i.e., return the seq of column seqs. NOTE: as the formats are alignment formats, all sequences in a file are of the same length (including gaps).

If DIRDIR is true, dir is taken as a directory of directories, each of which will have read-aln-seqs applied to it per the above description and the return will be a seq of all such applications.

So, if dirdir is false, the result will have the form:

(seqs-from-file1 ... seqs-from-filen), where filei is in dir

If dirdir is true, the result will be nested one level more:

((seqs-from-dir1-file1 ... seqs-from-dir1-filek)
 ...
 (seqs-from-dirn-file1 ... seqs-from-dirn-filel))

read-farec^clj

(read-farec in)

Read a fasta 'record' from a file. IN is an input file descriptor (an already opened input-stream reader). Returns a pair [id sq] defining the next fasta record from IN.

read-farecs^clj

(read-farecs in n)

Read n fasta 'records' from a file. IN is an input file descriptor (an already opened input-stream reader), and N is the number of records (2 line chunks) to read. Returns a vector of vector pairs [id sq], each pair representing the id line and sequence line.

read-fqrec^clj

(read-fqrec in)

Read a fastq 'record' from a file. IN is an input file descriptor (an already opened input-stream reader). Returns a quad [id sq aux qc] defining the next fastq record from IN.

read-fqrecs^clj

(read-fqrecs in n)

Read n fastq 'records' from a file. IN is an input file descriptor (an already opened input-stream reader), and N is the number of records (4 line chunks) to read. Returns a vector of vector quads [id sq aux qc], each quad representing the id line, sequence line, auxiliary line and quality control line (Phred scores).
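
Reading fixed-size records from an already opened reader can be sketched as follows. `read-record` and `read-records` are hypothetical helpers built on `java.io.BufferedReader`; stopping early at end of input is an assumption of this sketch, not documented behavior of read-fqrecs.

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical helper: read one record of n-lines lines, or nil at end of input.
(defn read-record
  [^BufferedReader in n-lines]
  (let [lines (vec (repeatedly n-lines #(.readLine in)))]
    (when (every? some? lines) lines)))

;; Hypothetical helper: read up to n records (4 lines each for fastq).
(defn read-records
  [in n-lines n]
  (vec (take-while some? (repeatedly n #(read-record in n-lines)))))

(with-open [rdr (BufferedReader.
                 (StringReader. "@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n"))]
  (read-records rdr 4 5))
```

The same shape with 2 lines per record would cover unblocked fasta input.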

read-seqs^clj

(read-seqs input & {:keys [info ftype] :or {info :data}})

Read the sequences in FILESPEC and return them as a lazy seq. Filespec can denote a fna, fa, hitfna, aln, sto, or gma format file.

reduce-aln-seqs^clj

(reduce-aln-seqs f fr cols filespecs)

(reduce-aln-seqs f fr v cols filespecs)

reduce-seqs^clj

(reduce-seqs f fr filespecs)

(reduce-seqs f fr v filespecs)

sample-fna^clj

(sample-fna p f)

(sample-fna p f sampfa)

Sample the sequences in f, a fasta file, with probability p. Returns a seq of pairs suitable for writing a fasta file: [id, sq]. The id is the corresponding id of the sq in f. In the 3 arg case, sampfa is a filespec for an output fasta file where the sampling is written.
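
Sampling with probability p can be sketched on an in-memory seq of records. `sample-records` is a hypothetical helper; it assumes each record is kept independently with probability p, which is one reading of the description above.

```clojure
;; Hypothetical helper: keep each record independently with probability p.
(defn sample-records
  [p recs]
  ;; (rand) is uniform on [0, 1), so p = 1.0 keeps everything and p = 0.0 nothing
  (filter (fn [_] (< (rand) p)) recs))

(sample-records 0.5 [[">a" "ACGT"] [">b" "GGCC"] [">c" "TTAA"]])
```

The result is random, so the number of records returned varies from call to call.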

sample-fq^clj

(sample-fq p f)

(sample-fq p f sampfq)

Sample the sequences in f, a fastq file, with probability p. Returns a seq of quadruples suitable for writing a fastq file: [id, sq, qcdesc, qc]. The id is the corresponding id of the sq in f. qcdesc and qc are the corresponding quality description line and the quality score line. In the 3 arg case, sampfq is a filespec for an output fastq file where the sampling is written.

seqline-info-mapper^clj

(seqline-info-mapper type info)

Helper function for READ-SEQS. Returns the function to map over seq lines to obtain the requested info. TYPE is a supported seq file type (aln, sto, fna, fa, gma). INFO is either :name for the sequence identifier, :data for the sequence data, or :both for name and data.

Impl Note: while this almost begs for multimethods, that would actually increase the complexity as it would mean 14 methods to cover the cases...

split-join-fasta-file^clj

(split-join-fasta-file
  in-file
  &
  {:keys [base pat namefn entryfn testfn]
   :or {base "" pat #"^>gi" entryfn identity testfn (fn [x y] true)}})

split-join-ncbi-fasta-file^clj

(split-join-ncbi-fasta-file in-file)

Split a fasta file IN-FILE into the individual sequences and unblock the sequence if blocked. The resulting individual [nm sq] pairs are written to files named for the NC name in the gi line of in-file and in the DEFAULT-GENOME-FASTA-DIR location.

The main use of this function is to take a refseq fasta db (composed of many multi seq fasta files) and split the db into a normed set of named sequence files for quick access to sequence per name in various other processing (see gen-name-seq for example).

Canonical use case example:

(fs/dodir "/data2/BioData/Fasta" ; RefSeqxx fasta files
          #(fs/directory-files % "fna")
          #(split-join-ncbi-fasta-file %))

sto->aln^clj

(sto->aln stoin alnout & {blocked :blocked :or {blocked false}})

Convert a Stockholm format alignment file into its ClustalW equivalent ALN format. STOIN is the filespec for the Stockholm format file and ALNOUT is the filespec for the resulting conversion (it is overwritten if it already exists!).

BLOCKED is a boolean indicating whether the output should be blocked (60 chars per chunk). Default is unblocked.
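
The 60-character blocking can be pictured with a minimal sketch. `block-seq` is a hypothetical helper that only cuts a single sequence string into chunks; how the real converter interleaves the blocks of several sequences in the ALN output is not shown here.

```clojure
;; Hypothetical helper: cut one sequence into width-sized chunks.
(defn block-seq
  [sq width]
  ;; partition-all keeps a shorter final chunk instead of dropping it
  (map #(apply str %) (partition-all width sq)))

;; A width of 4 is used here only to keep the example short;
;; blocked ALN output would use 60.
(block-seq "ACGUACGUAC" 4)
```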

sto->aln-blocked^clj

(sto->aln-blocked stoin alnout)

Convert a Stockholm format alignment file into its ClustalW equivalent BLOCKED ALN format. Blocking is done in 60 character chunks. STOIN is the filespec for the Stockholm format file and ALNOUT is the filespec for the resulting conversion (it is overwritten if it already exists!).

sto->fna^clj

(sto->fna stoin fnaout)

Convert a sto file into a fasta file. Split seq lines into names and seq data and interleave these. Seq data has all gap characters removed.
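The conversion described above can be approximated on in-memory lines as follows. This is a sketch under stated assumptions rather than the library's implementation: `sto-lines->fasta-lines` is a hypothetical name, markup lines are assumed to start with `#`, the terminator is assumed to be `//`, and `.` and `-` are assumed to be the gap characters.

```clojure
(require '[clojure.string :as str])

(defn sto-lines->fasta-lines
  "Hypothetical helper: keep only sequence lines, split each into name
  and sequence data, strip gap characters, and interleave the results
  as fasta header and sequence lines."
  [lines]
  (->> lines
       ;; Drop blank lines, '#' markup lines and the '//' terminator.
       (remove #(or (str/blank? %)
                    (str/starts-with? % "#")
                    (str/starts-with? % "//")))
       (mapcat (fn [l]
                 (let [[nm sq] (str/split (str/trim l) #"\s+" 2)]
                   ;; Assumption: '.' and '-' are the gap characters.
                   [(str ">" nm) (str/replace sq #"[.\-]" "")])))))

;; Made-up example alignment, not real data.
(def example-sto
  ["# STOCKHOLM 1.0"
   ""
   "seq1/1-8    AC..GU-A"
   "seq2/3-9    A-CGGU.A"
   "#=GC SS_cons <<....>>"
   "//"])

(def fasta-lines (sto-lines->fasta-lines example-sto))
```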

sto-GC-and-seq-lines^clj

(sto-GC-and-seq-lines stofilespec)

write-farec^clj

(write-farec ot rec)

Write a fasta 'record' to a file. OT is an output file descriptor (an already opened output-stream writer). REC is a vector pair [id sq], representing the id line and the sequence line.
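To make the [id sq] record shape concrete, here is a minimal stand-in that writes one pair to an already opened writer. `write-fasta-pair` is hypothetical and is not the library function; it assumes the id string already carries the leading `>` and that each element is written on its own line.

```clojure
(import '(java.io StringWriter))

(defn write-fasta-pair
  "Hypothetical stand-in: write the id line and the sequence line of the
  pair [id sq] to the open writer OT, one per line."
  [^java.io.Writer ot [id sq]]
  (.write ot (str id "\n"))
  (.write ot (str sq "\n")))

;; A StringWriter stands in for an opened file writer; the record is made up.
(def fasta-text
  (let [ot (StringWriter.)]
    (write-fasta-pair ot [">seq1 example" "ACGTACGT"])
    (str ot)))
```

With a real file, the writer would typically be opened and closed by the caller, for example with `clojure.java.io/writer` inside `with-open`.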

write-farecs^clj

(write-farecs ot recs)

Write fasta 'records' to file. OT is an output file descriptor (an already opened output-stream writer). RECS is a vector/sequence of pairs [id sq], each representing the id line and the sequence line.

write-fqrec^clj

(write-fqrec ot rec)

Write a fastq 'record' to a file. OT is an output file descriptor (an already opened output-stream writer). REC is a vector quad [id sq aux qc], representing the id line, the sequence line, the auxiliary information line and the quality control line for a fastq format file.

write-fqrecs^clj

(write-fqrecs ot recs)

Write fastq 'records' to file. OT is an output file descriptor (an already opened output-stream writer). RECS is a vector/sequence of quads [id sq aux qc], each representing the id line, the sequence line, the auxiliary information line and the quality control line for a fastq format file.
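The quad layout can be illustrated with a small stand-in that writes each [id sq aux qc] record as four consecutive lines. `write-fastq-quad` is a hypothetical helper, not the library function, and the reads below are invented for the example.

```clojure
(import '(java.io StringWriter))

(defn write-fastq-quad
  "Hypothetical stand-in: write the four lines of a fastq record
  [id sq aux qc] to the open writer OT, in that order."
  [^java.io.Writer ot [id sq aux qc]]
  (doseq [l [id sq aux qc]]
    (.write ot (str l "\n"))))

;; Two made-up records written to an in-memory writer.
(def fastq-text
  (let [ot (StringWriter.)]
    (doseq [rec [["@read1" "ACGT" "+" "IIII"]
                 ["@read2" "GGCA" "+" "FFFF"]]]
      (write-fastq-quad ot rec))
    (str ot)))
```

Each record contributes exactly four lines, so a file written this way can be read back by grouping its lines in fours.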

write-sto^clj

(write-sto newsto auth-lines comment-lines nm-sq-pairs ss-lines)

A work in progress... Write a new sto composed of the various given parts to the file spec given as NEWSTO. AUTH-LINES are the authoring header lines - including the STOCKHOLM line. Generally there are two of these - the STOCKHOLM line (with version) and the originating author or program that generated the content (for example, Infernal).

COMMENT-LINES is a collection of the #=GF/GC lines, with the exception of the GC SS_cons and RF lines. Comment-lines may be empty (for example, []).

NM-SQ-PAIRS is a collection (typically vector/list) of pairs of the entries (name/start-end/strand) and the associated sequence (in gapped form). If this is created via JOIN-STO-FASTA-LINES, the vector of [id sq] pairs that is the sequence part of the nm-sq-pair will have the id part filtered out automatically.

SS-LINES is the set of 'secondary structure' lines. These are the GC SS_cons and RF lines. SS-LINES may or may not contain the final '//' line. If it does not, the '//' is still written to the file; if it does, only the one '//' is written.
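The way the four parts combine, including the single trailing '//', can be sketched as a pure function over lines. This is an assumption-laden illustration: `sto-lines` is a hypothetical helper, the two-space separator between name and sequence is arbitrary, and the blank lines or column alignment that the real writer may emit are not modelled.

```clojure
(require '[clojure.string :as str])

(defn sto-lines
  "Hypothetical helper: assemble the lines of a sto file from its parts,
  ending with exactly one '//' whether or not SS-LINES already has one."
  [auth-lines comment-lines nm-sq-pairs ss-lines]
  (concat auth-lines
          comment-lines
          (map (fn [[nm sq]] (str nm "  " sq)) nm-sq-pairs)
          ;; Drop any '//' supplied by the caller, then add a single one.
          (remove #(= "//" (str/trim %)) ss-lines)
          ["//"]))

;; Made-up parts; SS-LINES is given once with and once without '//'.
(def with-term
  (sto-lines ["# STOCKHOLM 1.0" "#=GF AU Infernal"]
             []
             [["seq1/1-8/1" "AC..GU-A"]]
             ["#=GC SS_cons <<....>>" "//"]))

(def without-term
  (sto-lines ["# STOCKHOLM 1.0" "#=GF AU Infernal"]
             []
             [["seq1/1-8/1" "AC..GU-A"]]
             ["#=GC SS_cons <<....>>"]))
```

Both calls produce the same sequence of lines, which is the behaviour the SS-LINES paragraph describes.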
