aerial.bio.utils.infoth

Various information theory computations, calculations, and results as applied to biological sequences and alignments thereof. Includes entropy, joint entropy, conditional entropy, mutual information, conditional mutual information, etc.

aln-conditional-mutual-information (clj)

(aln-conditional-mutual-information
  seqset
  &
  {par :par
   nogaps :nogaps
   pgap :pgap
   cols :cols
   norm :norm
   sym? :sym?
   :or {par 1 nogaps true pgap 0.25 cols true norm true sym? true}})

Mutual information of all two-column pairs in an alignment, conditioned on the residual (unordered) bases of the remaining columns. Let colpairs be (combins 2 (transpose aln)). For any pair of columns [X Y] in colpairs, let Z be colpairs - {X Y}. Compute I(X;Y|Z), the mutual information of X and Y given Z.
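
As a rough illustration of the quantity I(X;Y|Z), the sketch below computes it from joint entropies via the standard identity I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z). It is not the library's implementation: entropy and cond-mutual-info are hypothetical helpers, log base 2 is an assumption, and zs is simply taken as a given per-row conditioning value (how the library builds Z from the remaining columns is not shown here).

```clojure
;; Illustrative sketch only; these helpers are hypothetical and are not
;; part of aerial.bio.utils.infoth. Log base 2 is an assumption.
(defn entropy
  "Shannon entropy (bits) of the empirical distribution of items in coll."
  [coll]
  (let [n (double (count coll))]
    (- (reduce + 0.0
               (map (fn [c]
                      (let [p (/ c n)]
                        (* p (/ (Math/log p) (Math/log 2)))))
                    (vals (frequencies coll)))))))

(defn cond-mutual-info
  "I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) for equal-length
  columns xs, ys and a conditioning column zs."
  [xs ys zs]
  (- (+ (entropy (map vector xs zs))
        (entropy (map vector ys zs)))
     (entropy (map vector xs ys zs))
     (entropy zs)))

;; With a constant Z, the result reduces to the plain mutual information
;; of X and Y (about 2 bits for two identical 4-symbol columns).
(cond-mutual-info "ACGU" "ACGU" "AAAA")
```

When Z already determines X (for example zs equal to xs), the same function returns a value of about 0, since nothing about X is left for Y to explain.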

aln-entropy (clj)

(aln-entropy n seqset & args)

Compute the entropy of each column Ci of an alignment given in SEQSET, a gaisr-seq-set. Entropy is based on the freqs and probs of elements of Ci taken n at a time. ARGS are any keyword arguments taken by aln-freqs-probs.

The manner of taking the elements is determined by the fsps-fn argument of aln-freqs-probs. The default, cc-combins-freqs-probs, is based on the combins function, which generates all n-element subsets of Ci. The alternative, cc-freqs-probs, is based on freqn, which generates the sliding windows of n elements from Ci.

Returns [cols-entropies total-entropy tcnt], where

cols-entropies is a seq of entropies for each Ci
total-entropy is the total entropy over all the columns (from total probs)
tcnt is the total over all columns of elements counted
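
To make the difference between the two ways of taking elements concrete, here is a small sketch for n = 2 on a single column. The helpers pair-subsets and sliding-windows are hypothetical stand-ins written in plain Clojure; they only mimic what the description says combins and freqn produce and are not the library's functions.

```clojure
;; Illustrative sketch only; hypothetical helpers, not library functions.
(defn pair-subsets
  "All 2-element subsets of coll, chosen by position."
  [coll]
  (let [v (vec coll)
        n (count v)]
    (for [i (range n)
          j (range (inc i) n)]
      [(v i) (v j)])))

(defn sliding-windows
  "All windows of n consecutive elements of coll."
  [n coll]
  (partition n 1 coll))

(pair-subsets "ACGU")
;; => ([\A \C] [\A \G] [\A \U] [\C \G] [\C \U] [\G \U])

(sliding-windows 2 "ACGU")
;; => ((\A \C) (\C \G) (\G \U))
```

For a column of length m, the subset approach yields m(m-1)/2 pairs while the sliding window yields m-1, so the resulting frequency maps (and entropies) generally differ.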

aln-freqs-probs (clj)

(aln-freqs-probs n
                 seqset
                 &
                 {:keys [fsps-fn cols nogaps norm pgap par]
                  :or {fsps-fn cc-combins-freqs-probs
                       cols false
                       nogaps true
                       norm true
                       pgap 0.7
                       par 1}})

Take the sequences in SEQSET, a collection of sequences or a string denoting a legal format sequence file (see read-seqs), and treat them as a matrix M encoding an alignment (so all seqs in the set must have equal length). If NORM is true, normalize all base characters in M to uppercase. PGAP is the gap-percentage cutoff for columns. Filter M by removing every column Ci for which (> (gap-percent Ci) pgap), giving M'.

Uses FSPS-FN, a function taking a combination count or window width of N and a sequence collection, to compute the frequencies and probabilities of columns Ci in M' by means of seqs-freqs-probs. Let Ci-fs-ps be the results for Ci. Typically such a result would be a triple [fs ps cnt], where fs and ps are maps of frequencies and corresponding probabilities keyed by the n-tuple of bases (and possibly gaps) underlying fsps-fn (see, for example, cc-freqs-probs).

If NOGAPS is true, remove all items with gaps from maps and recompute new probabilities for resulting reduced frequencies sets.

For large (count seqset) with an expensive fsps-fn, use par to parallelize computation over par chunks.

Returns [ccfs&ps allfs allps tcount], where

ccfs&ps is the seq Ci-fs-ps calculated from M'
allfs is the map of freqs obtained by reducing over all Ci-fs maps
allps is the map of probs obtained by reducing over all Ci-ps maps
tcount is the total item (obtained n-tuples) count.
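
The normalization and column-filtering step described above can be sketched as follows. This is only an approximation under stated assumptions: gap-fraction and filter-columns are hypothetical names, gaps are taken to be the characters . and -, and the cutoff is treated as a fraction between 0 and 1 (suggested by the 0.7 default, but not confirmed by the docstring).

```clojure
(require '[clojure.string :as str])

;; Illustrative sketch only; hypothetical helpers, not library functions.
(def default-gaps #{\. \-})   ; assumed gap characters

(defn gap-fraction
  "Fraction of entries in column col that are gap characters."
  [col]
  (/ (count (filter default-gaps col)) (double (count col))))

(defn filter-columns
  "Uppercase seqs, view them as columns, and drop every column whose
  gap fraction exceeds pgap."
  [pgap seqs]
  (->> (map str/upper-case seqs)
       (apply map vector)                     ; transpose rows into columns
       (remove #(> (gap-fraction %) pgap))))

(filter-columns 0.7 ["ac-u" "a--u" "ag-u"])
;; => ([\A \A \A] [\C \- \G] [\U \U \U])
```

The all-gap third column is removed, while the second column (one gap in three) stays because its gap fraction is below the cutoff.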

aln-joint-entropy (clj)

(aln-joint-entropy seqset & args)

Application of aln-entropy with cc-combins-freqs-probs and n=2, giving the joint entropy of each column with itself and the overall totals.

aln-mutual-information (clj)

(aln-mutual-information
  seqset
  &
  {:keys [par nogaps pgap cols norm sym?]
   :or {par 1 nogaps true pgap 0.25 cols false norm true sym? true}})

aln-shannon-entropy (clj)

(aln-shannon-entropy seqset & args)

Application of aln-entropy with cc-freqs-probs and n=1, giving the Shannon entropy of each column and the totals over all columns.
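
For intuition, per-column Shannon entropy with n=1 can be sketched in plain Clojure as below. This is not the library's code: shannon-entropy here is a local stand-in, column-entropies is a hypothetical helper, log base 2 is an assumption, and no gap handling or normalization is performed.

```clojure
;; Illustrative sketch only; not the library's implementation.
(defn shannon-entropy
  "Entropy in bits of the empirical distribution of the items in coll."
  [coll]
  (let [n (double (count coll))]
    (reduce + 0.0
            (for [c (vals (frequencies coll))
                  :let [p (/ c n)]]
              (- (* p (/ (Math/log p) (Math/log 2))))))))

(defn column-entropies
  "Entropy of each column of an alignment given as equal-length seqs."
  [seqs]
  (map shannon-entropy (apply map vector seqs)))

;; First column is fully conserved (entropy 0); the second has
;; C twice and G, U once each (about 1.5 bits).
(column-entropies ["AC" "AG" "AU" "AC"])
```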

bg-freqs (clj)

(bg-freqs n
          filespecs
          &
          {:keys [fsps-fn ftypes dirdir cols sym? nogaps norm par]
           :or {fsps-fn cc-freqs-probs
                ftypes [".sto"]
                dirdir false
                cols false
                sym? false
                nogaps false
                norm true
                par 1}})

Perform a background frequency distribution calculation over the sequences in FILESPECS (a coll of legal format sequence files, a directory of such files, or, if dirdir is true, a directory of directories of such files; see read-seqs). FTYPES gives the file types to use when filespecs is a directory or a directory of directories.

By default, the distributions are computed with a sliding window of length N. So, on DNA/RNA sequences, 1 gives base probabilities, 2 gives a dinucleotide distribution, etc. To change this, supply a different freq&prob calculation function for fsps-fn. For more information, see the seqs-freqs-probs description.

If cols is true, computation is over the columns of the sequences. If sym? is true, treat reversible keys as equal. If nogaps is true, remove the default gap characters from the calculation. If nogaps is a coll (for example, (keys +NONSTD-RNA+)), remove all those characters from the calculation. If norm is true, normalize characters to uppercase.
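
A minimal sketch of the sliding-window counting and the sym? option follows. It rests on assumptions: window-freqs, symmetric-key, and symmetrize are hypothetical helpers, and "reversible keys" is read as pooling a key with its reverse (for example "AU" and "UA"). Which of the two the library keeps as the representative key is not stated, so the lexicographically smaller one is used here purely for illustration.

```clojure
(require '[clojure.string :as str])

;; Illustrative sketch only; hypothetical helpers, not library functions.
(defn window-freqs
  "Frequencies of all length-n sliding windows of string s."
  [n s]
  (frequencies (map #(apply str %) (partition n 1 s))))

(defn symmetric-key
  "Canonical form of k: the smaller of k and its reverse (an assumption)."
  [k]
  (let [r (str/reverse k)]
    (if (neg? (compare k r)) k r)))

(defn symmetrize
  "Pool the counts of each key with those of its reverse."
  [fm]
  (reduce-kv (fn [m k v] (update m (symmetric-key k) (fnil + 0) v))
             {}
             fm))

(window-freqs 1 "AUAUG")
;; => {"A" 2, "U" 2, "G" 1}

(window-freqs 2 "AUAUG")
;; => {"AU" 2, "UA" 1, "UG" 1}

(symmetrize (window-freqs 2 "AUAUG"))
;; => {"AU" 3, "GU" 1}
```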

bg-freqs-probs (clj)

(bg-freqs-probs n
                filespecs
                &
                {:keys [fsps-fn ftypes dirdir cols sym? nogaps norm par]
                 :or {fsps-fn cc-freqs-probs
                      ftypes [".sto"]
                      dirdir false
                      cols false
                      sym? false
                      nogaps false
                      norm true
                      par 1}})

Like bg-freqs, but with an additional final step that computes the probability distribution from the frequency distribution.
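
Turning a frequency map into a probability map amounts to dividing each count by the total. The helper freqs->probs below is a hypothetical name used for illustration; the library's own conversion function is not shown on this page.

```clojure
;; Illustrative sketch only; freqs->probs is a hypothetical helper.
(defn freqs->probs
  "Divide every count in frequency map fm by the total count."
  [fm]
  (let [total (double (reduce + (vals fm)))]
    (into {} (map (fn [[k v]] [k (/ v total)])) fm)))

(freqs->probs {"A" 2 "U" 2 "G" 1})
;; => {"A" 0.4, "U" 0.4, "G" 0.2}
```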

bp-stats (clj)

(bp-stats bp-freq-map)

degap-freqs (clj)

(degap-freqs freq-map)
(degap-freqs freq-map gap-chars)

Take a frequency map and remove all elements whose keys contain gap characters. Gap chars are either the defaults \. \- or those in gap-chars (a seqable collection).
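
The behavior described can be approximated in a few lines. The function below is a sketch named degap-freq-map to make clear that it is a hypothetical stand-in rather than the library's degap-freqs; it assumes keys are strings (or other seqs of characters).

```clojure
;; Illustrative sketch only; a stand-in for the described behavior.
(defn degap-freq-map
  "Remove every entry of fm whose key contains a gap character."
  ([fm] (degap-freq-map fm [\. \-]))
  ([fm gap-chars]
   (let [gaps (set gap-chars)]
     (into {} (remove (fn [[k _]] (some gaps k))) fm))))

(degap-freq-map {"AU" 3 "A-" 2 ".G" 1 "GC" 4})
;; => {"AU" 3, "GC" 4}

(degap-freq-map {"AN" 1 "AU" 2} [\N])
;; => {"AU" 2}
```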

degap-tuples (clj)

(degap-tuples tuple-of-sqs)
(degap-tuples tuple-of-sqs gap-chars)

Remove gap characters from a tuple of sequences. Typically this is a pair of sequences (as, for example, arising from (combins 2 some-seq-set)). The degapping works for gaps in any (through all) of the elements and preserves the correct bases and their order in cases where gaps line up with non-gaps. Gap chars are either the defaults \. \- or those in gap-chars (a seqable collection).

EX:

(degap-tuples
  ["CAAAUAAAAUAUAAUUUUAUAAUAAUAAGAAUAUAUAAAAAAUAUUAUAUAAAAGAAA"
   "GGGAGGGGGGGGGGG-GGGGG-GGAGGGGGGG--GGGG-GGGGGAGG-GGGG-GGGG-"])
=> ("CAAAUAAAAUAUAAUUUAUAUAAUAAGAAUAUAAAAAUAUUAAUAAAGAA"
    "GGGAGGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGAGGGGGGGGGG")
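
One way to read the example is that any position where at least one sequence has a gap is dropped from every sequence in the tuple, which keeps the remaining bases aligned. The sketch below implements that reading; degap-tuple is a hypothetical name, and this interpretation is inferred from the example rather than taken from the library source.

```clojure
;; Illustrative sketch only; interpretation inferred from the example.
(defn degap-tuple
  "Drop every position at which any sequence in sqs has a gap char."
  ([sqs] (degap-tuple sqs [\. \-]))
  ([sqs gap-chars]
   (let [gaps (set gap-chars)
         ;; columns of the tuple that contain no gap character
         kept (remove #(some gaps %) (apply map vector sqs))]
     ;; rebuild one string per input sequence from the kept columns
     (map (fn [i] (apply str (map #(nth % i) kept)))
          (range (count sqs))))))

(degap-tuple ["AC-GU" "A-CGU"])
;; => ("AGU" "AGU")
```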

seq-pairs-bpfreqs (clj)

(seq-pairs-bpfreqs seq-pairs
                   &
                   {:keys [nogaps sym? par] :or {nogaps true sym? true par 4}})

seq-vi (clj)

(seq-vi seqx seqy & {nogaps :nogaps :or {nogaps true}})

seqs-freqs-probs (clj)

(seqs-freqs-probs n
                  seqset
                  &
                  {:keys [fsps-fn nogaps norm par]
                   :or {fsps-fn cc-freqs-probs nogaps true norm true par 1}})

Return sequence frequencies and probabilities over the set of sequences in SEQSET, a collection of sequences or a string denoting a legal format sequence file (see read-seqs). FSPS-FN is a function taking a combination count or window width of N and a sequence collection (here, seqset) and optional par parameter. Applies fsps-fn to n and seqset.

For large (count seqset) with an expensive fsps-fn, use par to parallelize computation over par chunks.

Returns [ccfs&ps allfs allps tcount], where

ccfs&ps is a seq of triples [fs ps cnt], one for each C in seqset,
allfs is the map of freqs over all Cs,
allps is the map of probs over all Cs and
tcount is the total items over coll.

NOTE: item keys are "stringified", i.e., if k is a key from a map produced by fsps-fn, ensures that all returned maps use (apply str k) for all keys.
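
The key "stringification" in the note can be pictured with the following sketch, where stringify-keys is a hypothetical helper that applies (apply str k) to every key of a map.

```clojure
;; Illustrative sketch only; stringify-keys is a hypothetical helper.
(defn stringify-keys
  "Replace every key k of map m with (apply str k)."
  [m]
  (into {} (map (fn [[k v]] [(apply str k) v])) m))

(stringify-keys {[\A \U] 3, [\G \C] 1})
;; => {"AU" 3, "GC" 1}

;; Keys that are already strings are left unchanged.
(stringify-keys {"AU" 3})
;; => {"AU" 3}
```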

seqs-shannon-entropy (clj)

(seqs-shannon-entropy seqset)

Returns the Shannon entropy of the set of sequences in SEQSET, a collection of sequences or a string denoting a legal format sequence file (see read-seqs). Returns a pair [ses total-ses], where

ses is (map shannon-entropy seqset) and total-ses is the total over all of seqset.
