Various bio sequence file format readers, writers, verifiers, and manipulators.
(canonical-csv-entry-info entries & {:keys [ev] :or {ev 0.0}})
(check-sto sto & {printem :printem :or {printem true}})
Checks a sto file to ensure that only valid characters are used in the sequences consensus structure line. Prints any errors found in the sto file, by sequence number. The input must be a sto file.
(chunk-genome-fnas genome-fna-dir & {:keys [chunk-size] :or {chunk-size 100}})
Take the set of fnas in directory GENOME-FNA-DIR (presumably created by split-join-ncbi-fasta-file or similar) and aggregate them into a new smaller set of files, where each new file contains the contents of CHUNK-SIZE input files (with the possible exception of the last file having a smaller number). This is useful for creating custom data sets for search.
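The grouping described above can be sketched with partition-all. This is only an illustration of the chunking step, not the library's code; chunk-files is a hypothetical helper name, and the file names are made up.

```clojure
;; Hypothetical helper (not part of the library): group file names into
;; consecutive chunks of chunk-size; the last chunk may be smaller.
(defn chunk-files
  [files chunk-size]
  (map vec (partition-all chunk-size files)))

(chunk-files ["a.fna" "b.fna" "c.fna" "d.fna" "e.fna"] 2)
;; => (["a.fna" "b.fna"] ["c.fna" "d.fna"] ["e.fna"])
```

Each resulting chunk would then be concatenated into one output file.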
(collapse-group pairs)
(collapse-one fqa fasta)
Collapse the sequences in fqa, a fastq or fasta file, and write the collapsed value as a fasta record to fasta, a file spec for fasta output.
(entry-file->fasta-file efile & {:keys [names-only]})
(entry-parts entry & {:keys [ldelta rdelta] :or {ldelta 0 rdelta 0}})
ENTRY is a string "name/range/strand", where name is a genome name, range is of the form start-end and strand is 1 or -1. At least name must be supplied. LDELTA and RDELTA are integers (default 0): ldelta is subtracted from start and rdelta is added to end.
Returns a triple [name [start end] strand].
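A minimal sketch of this kind of parsing, for illustration only. parse-entry is a hypothetical name, and several details are assumptions rather than documented behavior: coordinates are returned as integers, a missing range yields nil, and a missing strand defaults to 1.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the parsing described above.
;; Assumptions: numeric results, nil for a missing range, strand defaults to 1.
(defn parse-entry
  [entry & {:keys [ldelta rdelta] :or {ldelta 0 rdelta 0}}]
  (let [[nm rng strand] (str/split entry #"/")
        [s e] (when rng (map #(Long/parseLong %) (str/split rng #"-")))]
    [nm
     (when s [(- s ldelta) (+ e rdelta)])   ; widen (or shrink) the range
     (if strand (Long/parseLong strand) 1)]))

(parse-entry "NC_000913/100-200/-1" :ldelta 10 :rdelta 5)
;; => ["NC_000913" [90 205] -1]
```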
(fastq->fna fq faot)
Convert a fastq format file to a fasta file. Fastq files have 4 lines per 'record' (id, sq, qcid, qc/phred-scores), while the corresponding fasta has only the id and sq lines per 'record'. Both FQ and FAOT are file specs.
(fastqs->fnas fqs & {:keys [outdir]})
Convert the fastqs in FQS (a seq) to corresponding fasta files. The fasta files are named as the fastqs but with file type .fna. If outdir is given, place the fastas there, otherwise place in same directory as corresponding fastqs. The items in fqs are file specs!
(fna->fastq infna outfq bc%)
Convert a fasta format file into a fastq format file. infna and outfq are file specifications (strings) of the input fasta and the output fastq.
The issue here is simply what the quality control values (Phred scores) should be. We assume the user understands the issue and provide a 'slight' helper for this: bc% is the probability that the bases (all!!) are correct. We don't accommodate the case of providing a value for each base, as that would indicate you already have a fastq version of the sq in question (or certainly should have such). The bc% is converted to the corresponding Phred score and then Sanger encoded (0 -> 33, with every score written as its corresponding ASCII character).
NOTE: output is unblocked even if input fasta is in blocked format!
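The conversion described above can be sketched as follows. This uses the standard Phred definition Q = -10 * log10(1 - p) and the Sanger offset of 33; the function names are hypothetical, and it is an assumption that the probability is passed as a fraction (0.99) rather than a percentage, and that the score is rounded to the nearest integer.

```clojure
;; Hypothetical helpers illustrating the Phred/Sanger arithmetic only.
(defn prob->phred
  "Phred score for probability p (0 <= p < 1) that a base is correct."
  [p]
  (Math/round (* -10.0 (Math/log10 (- 1.0 p)))))

(defn phred->sanger-char
  "Sanger encoding: score 0 maps to ASCII 33, score q to ASCII (33 + q)."
  [q]
  (char (+ 33 q)))

(defn qual-line
  "Quality line for sequence sq where every base gets the same score."
  [sq p]
  (apply str (repeat (count sq) (phred->sanger-char (prob->phred p)))))

(prob->phred 0.99)      ;; => 20
(qual-line "ACGT" 0.99) ;; => "5555"
```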
(fnas->fastqs fastas fastqdir bc%)
Convert fastas (collection of fasta filespecs as strings) to corresponding fastqs in directory fastqdir. Conversion is by fna->fastq.
(gaisr-seq-set? x)
Returns true if any of the following holds:
X is a filespec string with type extension fna, aln, sto, or gma;
X is a java.io.File (presumed to be of one of the above formats);
X is a collection (presumed to have sequences as elements).
(gbank->fna gbfile fafile)
Obtains the genomic sequence (the 'ORIGIN' record) of a genbank format file gbfile and writes it as the sequence (unblocked) of a fasta file located in fafile. The id line of the fasta consists of the 'LOCUS' fields of the gbfile.
(gbank-loci-sane-loci gbloci)
GBFF loci are often, if not incoherent, certainly vague and not very useful. This function takes such and turns them into nice clean loci with simple start, end and strand. BUG: this will lose some information in certain cases, but should not with CDS and genes.
(gbanks->fnas gbfiles outdir)
Calls gbank->fna on all genbank format files in gbfiles, a collection of fully qualified file specifications of genbank files. The corresponding fasta file specification is composed of the basename of a genbank file, with '.fna' file type and placed in outdir. Outdir is a string which is an output directory specification, which must exist and be a directory.
(gen-entry-file entries file)
Take entries (a collection of nm/s-e/std - see make-entry / entry-parts) and write them out to file. Returns file
(gen-entry-nv-file entries file)
Take entries (a collection of entry [,v]*, where the v(s) are optional values such as JSD, EVals, etc.) and write them as a csv file.
(gen-name-seq entry
&
{:keys [basedir ldelta rdelta rna]
:or {basedir (pams/default-genome-fasta-dir entry)
ldelta 0
rdelta 0
rna true}})
Generate a pair [entry genome-seq], from ENTRY as possibly modified by [L|R]DELTA and RNA. ENTRY is a string "name/range/strand", where
name is the genome NC name (and we only currently support NCs),
range is of the form start-end, where start and end are integers (1 based...) for the start and end coordinates of name's sequence to return. NOTE: start < end, as reverse complement information comes from strand.
strand is either -1 for reverse complement or 1 for standard 5'->3'
LDELTA is an integer, which will be subtracted from start. So, ldelta < 0 removes |ldelta| bases from 5', ldelta > 0 'tacks on' ldelta extra bases to the 5' end. Defaults to 0 (no change).
RDELTA is an integer, which will be added to the end. So, rdelta < 0 removes |rdelta| bases from 3', rdelta > 0 'tacks on' rdelta extra bases to the 3' end. Defaults to 0 (no change).
If RNA is true, change Ts to Us, otherwise return unmodified sequence.
BASEDIR is the location of the NC fasta files. Generally, this should always be the default location.
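The coordinate handling described above can be sketched on an in-memory string. This is not the library's implementation: sub-seq is a hypothetical name, the genome is passed directly instead of being read from BASEDIR, and it is an assumption that the deltas are applied to the coordinates before any reverse complementing and that the range is clamped to the sequence bounds.

```clojure
(require '[clojure.string :as str])

(def complement-base {\A \T, \T \A, \C \G, \G \C})

;; Hypothetical sketch: 1-based inclusive coordinates on a genome string.
(defn sub-seq
  [genome start end strand
   & {:keys [ldelta rdelta rna] :or {ldelta 0 rdelta 0 rna true}}]
  (let [s  (max 1 (- start ldelta))               ; ldelta is subtracted from start
        e  (min (count genome) (+ end rdelta))    ; rdelta is added to end
        sq (subs genome (dec s) e)
        sq (if (= strand -1)
             ;; reverse complement; unknown characters are left unchanged
             (apply str (map #(complement-base % %) (reverse sq)))
             sq)]
    (if rna (str/replace sq "T" "U") sq)))

(sub-seq "AAACCCGGGTTT" 7 12 1)              ;; => "GGGUUU"
(sub-seq "AAACCCGGGTTT" 7 12 -1 :rna false)  ;; => "AAACCC"
(sub-seq "AAACCCGGGTTT" 4 6 1 :ldelta 2 :rdelta 1) ;; => "AACCCG"
```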
(gen-name-seq-pairs entries
&
{:keys [basedir strand ldelta rdelta rna]
:or {basedir pams/default-genome-fasta-dir
strand 0
ldelta 0
rdelta 0
rna true}})
Generate a sequence of pairs [name genome-seq] from the given collection denoted by entries, which is either a collection or a string denoting an entry filespec (either hand made or via gen-entry-file or similar).
See gen-name-seq for details. Basically this is (map gen-name-seq entries) with some options. In particular, entries may lack a strand component (again see gen-name-seq for format details), and STRAND here would be used to indicate all entries are to have this strand. If entries have a strand component, strand should be 0, otherwise you will force all entries to the strand given. Strand is either -1 (reverse complement) or 1 for standard 5'->3', or 0 meaning 'use strand component in entries'.
BASEDIR is the location of the NC fasta files. Generally, this should always be the default location.
(gen-nc-genome-fnas full-nc-genomes-filespec)
One shot genome sequence fnas generator. Typically used once per data update. Needs to be generalized to be able to use Genbank fna archives. Currently assumes one fna AND ONLY NC_* genomes.
(genbank-recs gbfile
&
{:keys [feats attrs]
:or {feats ["gene" "CDS" "misc_feature" "rRNA" "tRNA" "tmRNA"
"ncRNA"]
attrs ["gene" "locus_tag" "old_locus_tag" "protein_id"
"condon_start" "gene_synonym" "db_xref"]}})
Takes a genbank file, gbfile - a string file specification, and a set of features (feats) and associated attributes (attrs), so called qualifiers and values, and returns a vector where the first element describes the LOCUS of the gbfile and each subsequent element is a triple [feature loci attr-map], where feature (a string) is one of feats, loci is a vector [start end strand], (all strings, strand = 1|-1), and attr-map is a map with keys in attrs and their associated values from the genbank record.
(genbk2gtf gbfile gtfout)
(genbk2gtf gbfile gtfout options)
Extract a GTF (gene transfer format) file from the record information (features and their attributes) given in the genbank format file gbfile. The GTF file also includes the p_id attribute on CDS records, as required by the cuff* suite of software (this is no longer all that meaningful as the cuff* suite of tools is basically deprecated as 'not good enough').
options is a map of options to use. Currently supports keys :id-order and :feats.
If :id-order is given, the value should be a vector of "locus_tag", "old_locus_tag", "gene", and "protein_id"; the order given determines which of these is used for gene_id and transcript_id. The default value is ["locus_tag" "old_locus_tag" "gene" "protein_id"].
If :feats is given, the value should be a vector of feature type names taken from this list:
"gene", "CDS", "misc_feature", "rRNA", "tRNA", "tmRNA", "ncRNA"
This will determine what features to include and, importantly, the order will determine the feature type to encode them as. Features in genbank files are often encoded under multiple types, for example, rRNAs are also listed as genes. If you gave :feats ["CDS" "rRNA" "gene"] the rRNA records will have a feature type of rRNA instead of gene as rRNA occurs before gene in the list.
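The ordering rule for :feats can be illustrated with a small sketch. feature-type is a hypothetical helper, not a function of this namespace; it only shows how "first match in the given order wins" could be expressed.

```clojure
;; Hypothetical helper: pick the first entry of feats (in the order given)
;; under which a record is listed; nil if none of them applies.
(defn feature-type
  [feats listed-types]
  (some (set listed-types) feats))

;; A record listed both as a gene and as an rRNA:
(feature-type ["CDS" "rRNA" "gene"] ["gene" "rRNA"]) ;; => "rRNA"
(feature-type ["gene" "rRNA"] ["gene" "rRNA"])       ;; => "gene"
```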
(get-csv-entry-info csv-hit-file)
(get-ent-as-csv-info ent-file)
(get-entries filespec & [seqs])
(get-full-comma-sep-stg stg rdr)
(get-gaisr-csv-info rows)
(get-legacy-csv-info rows)
(get-selection-fna selections)
Get a fasta file for SELECTIONS. If selections is a file, assumes it is in fact the fasta file. If selections is a collection, assumes the collection is a set of pairs [nm sq], and converts to corresponding fasta file. Returns full path of result file.
(get-sto-as-csv-info stofile & {:keys [ev] :or {ev 0.0}})
(has-loc? entries)
(join-sto-fasta-file in-filespec
out-filespec
&
{origin :origin :or {origin ""}})
Joins (de-blocks) blocked sequence lines in a sto file or fasta file. If in-filespec is a sto file, ORIGIN is a #=GF line indicating the tool origin of the file, for example '#=GF AU Infernal 1.0.2'. For stos it defaults to nothing; for fastas it is not used.
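For the fasta case, de-blocking amounts to concatenating the sequence lines that follow each id line. A minimal sketch, assuming well-formed input in which every id line (starting with ">") is followed by at least one sequence line; join-fasta-lines is a hypothetical name and the sto case is not covered.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch: turn blocked fasta lines into [id sq] pairs,
;; one joined sequence string per id line.
(defn join-fasta-lines
  [lines]
  (->> lines
       (remove str/blank?)
       (partition-by #(str/starts-with? % ">"))   ; (ids) (sq lines) (ids) ...
       (partition 2)
       (map (fn [[[id] sq-lines]] [id (apply str sq-lines)]))))

(join-fasta-lines [">s1" "ACGT" "TTAA" ">s2" "GG"])
;; => ([">s1" "ACGTTTAA"] [">s2" "GG"])
```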
(join-sto-fasta-lines infilespec origin)
(make-entry evec)
(make-entry nm s e st)
'Inverse' of entry-parts. EVEC is a vector of shape [nm [s e] strd], where nm is the entry name, S and E are the start and end coordinates in the genome, and strd is the strand marker, 1 or -1. Returns the full entry as: nm/s-e/strd
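The string layout nm/s-e/strd can be sketched as below. entry-string is a hypothetical stand-in covering only the single-vector form, shown to make the format concrete.

```clojure
;; Hypothetical stand-in for the one-argument form described above.
(defn entry-string
  [[nm [s e] strd]]
  (str nm "/" s "-" e "/" strd))

(entry-string ["NC_000913" [100 200] -1])
;; => "NC_000913/100-200/-1"
```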
(map-aln-seqs f cols filespec)
(map-aln-seqs f par cols filespec & filespecs)
(map-seqs f filespec)
(map-seqs f par filespec & filespecs)
(nms-sqs->fasta-file nms-sqs filespec)
(print-sto seq-lines structure)
Takes sequence lines and a structure line and prints them in sto format. SEQ-LINES must be a collection of [name sequence] pairs; STRUCTURE is a string. The output is simply printed to the REPL rather than written to a file.
(read-aln-seqs filespec & {cols :cols :or {cols false}})
Read the aligned sequences in FILESPEC and return them in a Clojure seq. Filespec can denote either an aln, sto, or gma file format file. If COLS is true, return the columns of the alignment (including gap characters).
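Because aligned sequences all have the same length, the column view is a transpose of the row view. A minimal sketch of that transpose; alignment-columns is a hypothetical name, and returning each column as a string (rather than a seq of characters) is an assumption made for readability.

```clojure
;; Hypothetical sketch: transpose equal-length aligned sequences into columns.
(defn alignment-columns
  [seqs]
  (apply map str seqs))

(alignment-columns ["AC-G" "A-UG" "ACUG"])
;; => ("AAA" "C-C" "-UU" "GGG")
```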
(read-dirs-aln-seqs dir
&
{dirdir :dirdir
cols :cols
ftypes :ftypes
:or {dirdir false cols false ftypes ["sto"]}})
Apply read-aln-seqs across FTYPES in DIR. FTYPES is a vector of one or more file formats (as type extensions) taken from #{"fna", "aln", "sto", "gma"}, producing a seq of sequence sets, each set being a seq of the sequences in a matching file. NOTE: each such set can be viewed as a 'matrix' of the elements (bases/gaps) of the sequences.
If COLS is true, return the transpose of each obtained sequence 'matrix', i.e., return the seq of column seqs. NOTE: as the formats are alignment formats, all sequences in a file are of the same length (including gaps).
If DIRDIR is true, dir is taken as a directory of directories, each of which will have read-aln-seqs applied to it per the above description and the return will be a seq of all such applications.
So, if dirdir is false, the result will have the form:
(seqs-from-file1 ... seqs-from-filen), where filei is in dir
If dirdir is true, the result will be nested one level more:
((seqs-from-dir1-file1 ... seqs-from-dir1-filek) ... (seqs-from-dirn-file1 ... seqs-from-dirn-filel))
(read-farec in)
Read a fasta 'record' from a file. IN is an input file descriptor (an already opened input-stream reader). Returns a pair [id sq] defining the next fasta record from IN.
(read-farecs in n)
Read n fasta 'records' from a file. IN is an input file descriptor (an already opened input-stream reader), and N is the number of records (2 line chunks) to read. Returns a vector of vector pairs [id sq], each pair representing the id line and sequence line.
(read-fqrec in)
Read a fastq 'record' from a file. IN is an input file descriptor (an already opened input-stream reader). Returns a quad [id sq aux qc] defining the next fastq record from IN.
(read-fqrecs in n)
Read n fastq 'records' from a file. IN is an input file descriptor (an already opened input-stream reader), and N is the number of records (4 line chunks) to read. Returns a vector of vector quads [id sq aux qc], each quad representing the id line, sequence line, auxiliary line and quality control line (Phred scores).
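Reading 4-line records from an open reader might look like the sketch below. The function names are hypothetical, and two behaviors are assumptions: lines are returned unmodified (including the leading "@"), and reading stops early at end of input or on an incomplete trailing record.

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical sketch: one fastq record is four consecutive lines.
(defn read-fq-record
  [^BufferedReader in]
  (let [rec (vec (repeatedly 4 #(.readLine in)))]
    (when (every? some? rec) rec)))   ; nil at end of input or on a partial record

(defn read-fq-records
  [in n]
  (vec (take-while some? (repeatedly n #(read-fq-record in)))))

(with-open [in (BufferedReader.
                (StringReader. "@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n5555\n"))]
  (read-fq-records in 2))
;; => [["@r1" "ACGT" "+" "IIII"] ["@r2" "GGCC" "+" "5555"]]
```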
(read-seqs input & {:keys [info ftype] :or {info :data}})
Read the sequences in FILESPEC and return them as a lazy seq. FILESPEC can denote a fna, fa, hitfna, aln, sto, or gma format file.
(reduce-aln-seqs f fr cols filespecs)
(reduce-aln-seqs f fr v cols filespecs)
(sample-fna p f)
(sample-fna p f sampfa)
Sample the sequences in f, a fasta file, with probability p. Returns a seq of pairs suitable for writing a fasta file: [id, sq]. The id is the corresponding id of the sq in f. In the 3 arg case, sampfa is a filespec for an output fasta file where the sampling is written.
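One way to read "sample with probability p" is an independent keep/drop decision per record. The sketch below works on in-memory [id sq] pairs under that assumption; sample-records and its :rng option are hypothetical and not part of this namespace.

```clojure
;; Hypothetical sketch: keep each record independently with probability p.
;; An explicit java.util.Random can be passed for reproducible sampling.
(defn sample-records
  [p recs & {:keys [rng] :or {rng (java.util.Random.)}}]
  (filter (fn [_] (< (.nextDouble ^java.util.Random rng) p)) recs))

;; Roughly half of the records are kept; which ones depends on the rng.
(sample-records 0.5 [[">s1" "ACGT"] [">s2" "GGCC"] [">s3" "TTAA"]]
                :rng (java.util.Random. 42))
```

The same idea would apply to fastq quads, since the decision does not depend on the record's shape.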
(sample-fq p f)
(sample-fq p f sampfq)
Sample the sequences in f, a fastq file, with probability p. Returns a seq of quadruples suitable for writing a fastq file: [id, sq, qcdesc, qc]. The id is the corresponding id of the sq in f. qcdesc and qc are the corresponding quality description line and the quality score line. In the 3 arg case, sampfq is a filespec for an output fastq file where the sampling is written.
(seqline-info-mapper type info)
Helper function for READ-SEQS. Returns the function to map over seq lines to obtain the requested info. TYPE is supported seq file type (aln, sto, fna, fa, gma). INFO is either :name for the sequence identifier, :data for the sequence data, or :both for name and data.
Impl Note: while this almost begs for multimethods, that would actually increase the complexity as it would mean 14 methods to cover the cases...
(split-join-fasta-file
in-file
&
{:keys [base pat namefn entryfn testfn]
:or {base "" pat #"^>gi" entryfn identity testfn (fn [x y] true)}})
(split-join-ncbi-fasta-file in-file)
Split a fasta file IN-FILE into the individual sequences and unblock the sequence if blocked. The resulting individual [nm sq] pairs are written to files named for the NC name in the gi line of in-file and in the DEFAULT-GENOME-FASTA-DIR location.
The main use of this function is to take a refseq fasta db (composed of many multi seq fasta files) and split the db into a normed set of named sequence files for quick access to sequence per name in various other processing (see gen-name-seq for example).
Canonical use case example:
(fs/dodir "/data2/BioData/Fasta" ; RefSeqxx fasta files
          #(fs/directory-files % "fna")
          #(split-join-ncbi-fasta-file %))
(sto->aln stoin alnout & {blocked :blocked :or {blocked false}})
Convert a Stockholm format alignment file into its ClustalW equivalent ALN format. STOIN is the filespec for the Stockholm format file and ALNOUT is the filespec for the resulting conversion (it is overwritten if it already exists!).
BLOCKED is a boolean indicating whether the output should be blocked (60 chars per chunk). Default is unblocked.
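Blocking a sequence into fixed-width chunks can be sketched with partition-all. block-seq is a hypothetical helper that shows only the chunking step (the text above uses a width of 60); it does not produce a complete ALN file.

```clojure
;; Hypothetical sketch: split a sequence string into chunks of at most
;; width characters; the last chunk may be shorter.
(defn block-seq
  [sq width]
  (map #(apply str %) (partition-all width sq)))

(block-seq "ACGUACGUAC" 4)
;; => ("ACGU" "ACGU" "AC")
```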
(sto->aln-blocked stoin alnout)
Convert a Stockholm format alignment file into its ClustalW equivalent BLOCKED ALN format. Blocking is done in 60 character chunks. STOIN is the filespec for the Stockholm format file and ALNOUT is the filespec for the resulting conversion (it is overwritten if it already exists!).
(sto->fna stoin fnaout)
Convert a sto file into a fasta file. Split seq lines into names and seq data and interleave these. Seq data has all gap characters removed.
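The split, de-gap and interleave steps can be sketched per sequence line. The names are hypothetical, and it is an assumption that the gap characters are "-" and "." and that the name and sequence data are separated by whitespace.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch: one sto sequence line becomes a fasta id line
;; plus a sequence line with gap characters removed.
(defn sto-line->fasta-lines
  [line]
  (let [[nm sq] (str/split (str/trim line) #"\s+" 2)]
    [(str ">" nm) (str/replace sq #"[-.]" "")]))

(defn sto-lines->fasta-lines
  [seq-lines]
  (mapcat sto-line->fasta-lines seq-lines))

(sto-line->fasta-lines "NC_000913/100-120/1   AC-GU..ACG-U")
;; => [">NC_000913/100-120/1" "ACGUACGU"]
```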
(sto-GC-and-seq-lines stofilespec)
(write-farec ot rec)
Write a fasta 'record' to a file. OT is an output file descriptor (an already opened output-stream writer). REC is a vector pair [id sq], representing the id line and the sequence line.
(write-farecs ot recs)
Write fasta 'records' to file. OT is an output file descriptor (an already opened output-stream writer). RECS is a vector/sequence of pairs [id sq], each representing the id line and the sequence line.
(write-fqrec ot rec)
Write a fastq 'record' to a file. OT is an output file descriptor (an already opened output-stream writer). REC is a vector quad [id sq aux qc], representing the id line, the sequence line, the auxiliary information line and the quality control line for a fastq format file.
(write-fqrecs ot recs)
Write fastq 'records' to file. OT is an output file descriptor (an already opened output-stream writer). RECS is a vector/sequence of quads [id sq aux qc], each representing the id line, the sequence line, the auxiliary information line and the quality control line for a fastq format file.
(write-sto newsto auth-lines comment-lines nm-sq-pairs ss-lines)
A work in progress... Write a new sto composed of the various given parts to the file spec given as NEWSTO. AUTH-LINES are the authoring header lines - including the STOCKHOLM line. Generally there are two of these - the STOCKHOLM line (with version) and the originating author or program that generated the content (for example, Infernal).
COMMENT-LINES is a collection of the #=GF/GC lines, with the exception of the GC SS_cons and RF lines. Comment-lines may be empty (for example, []).
NM-SQ-PAIRS is a collection (typically vector/list) of pairs of the entries (name/start-end/strand) and the associated sequence (in gapped form). If this is created via JOIN-STO-FASTA-LINES, the vector of [id sq] pairs that is the sequence part of the nm-sq-pair, will have the id part filtered out automatically.
SS-LINES is the set of 'secondary structure' lines. These are the GC SS_cons and RF lines. SS-LINES may or may not contain the final '//' line. If it does not, the '//' line is still written to the file; if it does, only the one '//' is written.