Information-theoretic computations and results as applied to biological sequences and their alignments. Includes entropy, joint entropy, conditional entropy, mutual information, conditional mutual information, etc.
(aln-conditional-mutual-information
seqset
&
{par :par
nogaps :nogaps
pgap :pgap
cols :cols
norm :norm
sym? :sym?
:or {par 1 nogaps true pgap 0.25 cols true norm true sym? true}})
Mutual information of all column pairs in an alignment, conditioned on the residual (unordered) bases of the remaining columns. Let colpairs be (combins 2 (transpose aln)). For any pair of columns [X Y] in colpairs, let Z be colpairs - {X Y}. Compute I(X;Y|Z), the mutual information of X and Y given Z.
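As a rough illustration of the quantity involved, conditional mutual information can be written in terms of joint entropies: I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z). The sketch below computes it for three equal-length columns using only clojure.core. The names log2, entropy and cond-mutual-info are hypothetical helpers, not part of this namespace, and representing Z as a single column is a simplifying assumption, not necessarily how the library treats the residual columns.

```clojure
;; Hypothetical helpers for illustration only; not this library's API.
(defn log2 [x] (/ (Math/log x) (Math/log 2.0)))

(defn entropy
  "Shannon entropy, in bits, of the empirical distribution of the items in coll."
  [coll]
  (let [n (double (count coll))]
    (- (reduce + 0.0
               (for [c (vals (frequencies coll))
                     :let [p (/ c n)]]
                 (* p (log2 p)))))))

(defn cond-mutual-info
  "I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) for equal-length columns.
  Assumption: Z is a single column, which simplifies the docstring's residual."
  [xs ys zs]
  (- (+ (entropy (map vector xs zs))
        (entropy (map vector ys zs)))
     (entropy (map vector xs ys zs))
     (entropy zs)))

;; X and Y identical, Z constant: the result is H(X), approximately 1.0 bit.
(cond-mutual-info "AACC" "AACC" "GGGG")
```

When Z already determines X and Y (for example, Z equal to X), the same expression evaluates to approximately zero, since nothing is left for X and Y to share.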
(aln-entropy n seqset & args)
Compute the entropy of each column Ci of an alignment given in SEQSET, a gaisr-seq-set. Entropy is based on the freqs and probs of elements of Ci taken n at a time. ARGS are any keyword arguments taken by aln-freqs-probs.
The manner of taking the elements is determined by the fsps-fn argument of aln-freqs-probs. The default, cc-combins-freqs-probs, is based on the combins function, which generates all n-element subsets of Ci. The alternative, cc-freqs-probs, is based on freqn, which generates the sliding windows of n elements from Ci.
Returns [cols-entropies total-entropy tcnt], where
cols-entropies is a seq of entropies for each Ci
total-entropy is the total entropy over all the columns (from total probs)
tcnt is the total over all columns of elements counted
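The difference between the two ways of taking elements can be seen on a small column. The following is a minimal sketch using clojure.core only; sliding-windows and pair-subsets are hypothetical stand-ins for the freqn-style and combins-style enumerations, restricted here to n=2 for the subset case, and the library's own functions may order or represent their results differently.

```clojure
;; Hypothetical helpers; the library's freqn and combins may differ in detail.
(defn sliding-windows
  "All contiguous windows of n elements of col, as strings."
  [n col]
  (map #(apply str %) (partition n 1 col)))

(defn pair-subsets
  "All 2-element subsets of col, in positional order, as strings."
  [col]
  (let [v (vec col)]
    (for [i (range (count v))
          j (range (inc i) (count v))]
      (str (v i) (v j)))))

(sliding-windows 2 "ACGU") ; ("AC" "CG" "GU")
(pair-subsets "ACGU")      ; ("AC" "AG" "AU" "CG" "CU" "GU")
```

A column of length k yields k-1 windows of width 2 but k(k-1)/2 pair subsets, so the element count tcnt depends on which enumeration is used.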
(aln-freqs-probs n
seqset
&
{:keys [fsps-fn cols nogaps norm pgap par]
:or {fsps-fn cc-combins-freqs-probs
cols false
nogaps true
norm true
pgap 0.7
par 1}})
Take the sequences in SEQSET, a collection of sequences or a string denoting a legal format sequence file (see read-seqs), and treat them as a matrix M encoding an alignment (so all seqs in the set must have equal length). If NORM, normalize all base characters in M to uppercase. PGAP is the gap-percentage cutoff for columns. Filter M by removing all columns Ci for which (> (gap-percent Ci) pgap), yielding M'.
Uses FSPS-FN, a function taking a combination count or window width of N and a sequence collection, to compute the frequencies and probabilities of columns Ci in M' by means of seqs-freqs-probs. Let Ci-fs-ps be the results for Ci. Typically such a result would be a triple [fs ps cnt], where fs and ps are maps of frequencies and corresponding probabilities keyed by the n-tuple of bases (and possibly gaps) underlying fsps-fn (see, for example, cc-freqs-probs).
If NOGAPS is true, remove all items with gaps from the maps and recompute the probabilities for the resulting reduced frequency sets.
For large (count seqset) with an expensive fsps-fn, use par to parallelize the computation over par chunks.
Returns [ccfs&ps allfs allps tcount], where
ccfs&ps is the seq of Ci-fs-ps calculated from M'
allfs is the map of freqs obtained by reducing over all Ci-fs maps
allps is the map of probs obtained by reducing over all Ci-ps maps
tcount is the total item (obtained n-tuples) count.
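The normalization and column-filtering step described above can be sketched as follows. This is an illustration using clojure.core and clojure.string only: gap-fraction and filter-columns are hypothetical names, the gap characters are assumed to be the defaults \. and \-, and pgap is assumed to be a fraction between 0 and 1 (as the default values suggest).

```clojure
(require '[clojure.string :as string])

;; Assumed default gap characters.
(def gap-chars #{\. \-})

(defn gap-fraction
  "Fraction of the elements of col that are gap characters."
  [col]
  (/ (double (count (filter gap-chars col))) (count col)))

(defn filter-columns
  "Uppercase seqs, transpose them into columns, and drop every column
  whose gap fraction exceeds pgap. Returns the kept columns as strings."
  [seqs pgap]
  (->> (map string/upper-case seqs)
       (apply map vector)
       (remove #(> (gap-fraction %) pgap))
       (map #(apply str %))))

(filter-columns ["AC-G" "A--G" "a.-G"] 0.7)  ; ("AAA" "C-." "GGG")
(filter-columns ["AC-G" "A--G" "a.-G"] 0.25) ; ("AAA" "GGG")
```

The second column has a gap fraction of 2/3, so it survives a cutoff of 0.7 but not one of 0.25; the all-gap third column is dropped in both cases.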
(aln-joint-entropy seqset & args)
Application of aln-entropy with cc-combins-freqs-probs and n=2. That is, the joint entropy of each column with itself, plus overall totals.
(aln-mutual-information
seqset
&
{:keys [par nogaps pgap cols norm sym?]
:or {par 1 nogaps true pgap 0.25 cols false norm true sym? true}})
(aln-shannon-entropy seqset & args)
Application of aln-entropy with cc-freqs-probs and n=1. That is, the Shannon entropy of each column, plus totals over all columns.
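For n=1 the per-column computation reduces to the entropy of the base probabilities in each column. A minimal sketch, assuming entropy is measured in bits and that columns are obtained by transposing equal-length sequences; probs, entropy-of-probs and column-entropies are hypothetical names, not functions of this namespace.

```clojure
;; Hypothetical helpers for illustration only.
(defn log2 [x] (/ (Math/log x) (Math/log 2.0)))

(defn probs
  "Map of item -> probability for the items in coll."
  [coll]
  (let [n (double (count coll))]
    (into {} (map (fn [[k c]] [k (/ c n)]) (frequencies coll)))))

(defn entropy-of-probs
  "Shannon entropy, in bits, of a probability map."
  [ps]
  (- (reduce + 0.0 (map #(* % (log2 %)) (vals ps)))))

(defn column-entropies
  "Entropy of each column of an alignment given as equal-length seqs."
  [seqs]
  (map (comp entropy-of-probs probs) (apply map vector seqs)))

;; First column is all A (entropy 0); second is half C, half G (1 bit).
(column-entropies ["AC" "AG" "AC" "AG"])
```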
(bg-freqs n
filespecs
&
{:keys [fsps-fn ftypes dirdir cols sym? nogaps norm par]
:or {fsps-fn cc-freqs-probs
ftypes [".sto"]
dirdir false
cols false
sym? false
nogaps false
norm true
par 1}})
Perform a background frequency distribution calculation over the sequences in FILESPECS (a coll of legal format sequence files, or a directory of such, or, if dirdir is true, a directory of directories of such; see read-seqs). FTYPES gives the file types in the cases where filespecs is a dir or dirdir.
By default, the distributions are computed with a sliding window of length N. So, on DNA/RNA sequences, 1 gives base probabilities, 2 gives a dinucleotide distribution, etc. To change this, supply a different freq&prob calculation function for fsps-fn. For more information see the seqs-freqs-probs description.
If cols is true, computation is over the columns of the sequences. If sym? is true, treat reversible keys as equal. If nogaps is true, remove the default gap characters from the calculation. If nogaps is a coll (for example, (keys +NONSTD-RNA+)), remove all those characters from the calculation. If norm is true, normalize characters to upper case.
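The effect of the sliding window and of sym? can be sketched on in-memory sequences. This is an illustration under stated assumptions: bg-window-freqs and canonical-key are hypothetical names, file reading is omitted, and "reversible keys as equal" is interpreted here as counting a key and its reverse (for example "AC" and "CA") under one canonical key, which may not match the library's exact convention.

```clojure
(defn canonical-key
  "Hypothetical symmetric key: the lexicographically smaller of k and its reverse."
  [k]
  (let [r (apply str (reverse k))]
    (if (neg? (compare k r)) k r)))

(defn bg-window-freqs
  "Frequencies of all width-n sliding windows over all of seqs.
  When sym? is true, a window and its reverse are counted together."
  [n seqs sym?]
  (let [keyfn (if sym? canonical-key identity)]
    (->> seqs
         (mapcat #(partition n 1 %))
         (map #(keyfn (apply str %)))
         frequencies)))

(bg-window-freqs 2 ["ACA" "CAC"] false) ; {"AC" 2, "CA" 2}
(bg-window-freqs 2 ["ACA" "CAC"] true)  ; {"AC" 4}
```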
(bg-freqs-probs n
filespecs
&
{:keys [fsps-fn ftypes dirdir cols sym? nogaps norm par]
:or {fsps-fn cc-freqs-probs
ftypes [".sto"]
dirdir false
cols false
sym? false
nogaps false
norm true
par 1}})
Like bg-freqs, but with the additional final computation of the probability distribution corresponding to the frequency distribution.
(bp-stats bp-freq-map)
(degap-freqs freq-map)
(degap-freqs freq-map gap-chars)
Take a frequency map, and remove all elements whose keys contain gap characters. Gap chars are either the defaults \. and \- or those in gap-chars (a seqable collection).
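The behavior described can be sketched in a few lines of clojure.core. The function name degap-freqs-sketch is hypothetical (chosen to avoid implying it is the library's implementation), and keys are assumed to be strings or other seqables of characters.

```clojure
(defn degap-freqs-sketch
  "Remove every entry of freq-map whose key contains a gap character."
  ([freq-map] (degap-freqs-sketch freq-map [\. \-]))
  ([freq-map gap-chars]
   (let [gap? (set gap-chars)]
     (into {} (remove (fn [[k _]] (some gap? k)) freq-map)))))

(degap-freqs-sketch {"AC" 2 "A-" 1 ".G" 3 "GG" 1})  ; {"AC" 2, "GG" 1}
(degap-freqs-sketch {"AN" 1 "A-" 2} [\N])           ; {"A-" 2}
```

Passing an explicit gap-chars collection replaces the defaults instead of extending them, which is why "A-" survives in the second call.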
(degap-tuples tuple-of-sqs)
(degap-tuples tuple-of-sqs gap-chars)
Remove gap characters from a tuple of sequences. Typically this is a pair of sequences (as, for example, arising from (combins 2 some-seq-set)). The degapping works for gaps in any (through all) of the elements and preserves the correct bases and their order in cases where gaps line up with non-gaps. Gap chars are either the defaults \. and \- or those in gap-chars (a seqable collection).
EX:
(degap-tuples ["CAAAUAAAAUAUAAUUUUAUAAUAAUAAGAAUAUAUAAAAAAUAUUAUAUAAAAGAAA" "GGGAGGGGGGGGGGG-GGGGG-GGAGGGGGGG--GGGG-GGGGGAGG-GGGG-GGGG-"])
=> ("CAAAUAAAAUAUAAUUUAUAUAAUAAGAAUAUAAAAAUAUUAAUAAAGAA" "GGGAGGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGAGGGGGGGGGG")
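One way to obtain this behavior is to treat the tuple as aligned columns and drop every column in which any element is a gap. The sketch below does that with clojure.core only; degap-tuples-sketch is a hypothetical name and the approach is an assumption about how such degapping can be done, not a description of the library's implementation.

```clojure
(defn degap-tuples-sketch
  "Drop every aligned position at which any sequence of the tuple has a gap.
  Returns the degapped sequences as strings, in the original order."
  ([sqs] (degap-tuples-sketch sqs [\. \-]))
  ([sqs gap-chars]
   (let [gap? (set gap-chars)
         ;; Transpose to columns and keep only the gap-free ones.
         cols (remove #(some gap? %) (apply map vector sqs))]
     (map (fn [i] (apply str (map #(nth % i) cols)))
          (range (count sqs))))))

(degap-tuples-sketch ["AC-GU" "G.CAA"]) ; ("AGU" "GAA")
```

Positions 2 and 3 are removed from both sequences because one of the pair has a gap there, so the remaining bases stay aligned with each other.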
(seq-pairs-bpfreqs seq-pairs
&
{:keys [nogaps sym? par] :or {nogaps true sym? true par 4}})
(seq-vi seqx seqy & {nogaps :nogaps :or {nogaps true}})
(seqs-freqs-probs n
seqset
&
{:keys [fsps-fn nogaps norm par]
:or {fsps-fn cc-freqs-probs nogaps true norm true par 1}})
Return sequence frequencies and probabilities over the set of sequences in SEQSET, a collection of sequences or a string denoting a legal format sequence file (see read-seqs). FSPS-FN is a function taking a combination count or window width of N and a sequence collection (here, seqset) and optional par parameter. Applies fsps-fn to n and seqset.
For large (count seqset) with an expensive fsps-fn, use par to parallelize the computation over par chunks.
Returns [ccfs&ps allfs allps tcount], where
ccfs&ps is a seq of triples [fs ps cnt], one for each C in seqset
allfs is the map of freqs over all Cs
allps is the map of probs over all Cs
tcount is the total item count over the collection.
NOTE: item keys are "stringified", i.e., if k is a key from a map produced by fsps-fn, ensures that all returned maps use (apply str k) for all keys.
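The shape of the return value can be illustrated with a sliding-window counting function. This is a minimal sketch: window-fs-ps-cnt and seqs-fs-ps are hypothetical names standing in for an fsps-fn and for the aggregation step, parallelization and gap handling are omitted, and keys are stringified with (apply str k) as the note above describes.

```clojure
(defn window-fs-ps-cnt
  "Return the triple [fs ps cnt] for the width-n sliding windows over sq."
  [n sq]
  (let [fs  (frequencies (map #(apply str %) (partition n 1 sq)))
        cnt (reduce + (vals fs))
        ps  (into {} (map (fn [[k c]] [k (/ (double c) cnt)]) fs))]
    [fs ps cnt]))

(defn seqs-fs-ps
  "Return [triples allfs allps tcount] over all of seqs."
  [n seqs]
  (let [triples (map #(window-fs-ps-cnt n %) seqs)
        allfs   (apply merge-with + (map first triples))
        tcount  (reduce + (map last triples))
        allps   (into {} (map (fn [[k c]] [k (/ (double c) tcount)]) allfs))]
    [triples allfs allps tcount]))

;; "AACA" gives windows AA, AC, CA; "ACA" gives AC, CA.
(let [[_ allfs _ tcount] (seqs-fs-ps 2 ["AACA" "ACA"])]
  [allfs tcount]) ; [{"AA" 1, "AC" 2, "CA" 2} 5]
```

Note that allps is recomputed from the pooled frequencies and the total count, not averaged from the per-sequence probability maps, so longer sequences carry proportionally more weight.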
(seqs-shannon-entropy seqset)
Returns the Shannon entropy of the set of sequences in SEQSET, a collection of sequences or a string denoting a legal format sequence file (see read-seqs). Returns a pair [ses total-ses], where
ses is (map shannon-entropy seqset) and total-ses is the total over all of seqset.