(->column--coalesce-blocks col-name container-type data-type tfidf-data key)
(->tfidf tidy-text
&
{:keys [container-type column-container-type combine-method
datatype-meta]
:or {combine-method :coalesce-blocks!
column-container-type :jvm-heap
container-type :jvm-heap
datatype-meta :object}})
Transforms a dataset in tidy-text format into a bag-of-words representation, including the TF-IDF calculation for the tokens.
tidy-text
needs to be a dataset with columns
:document
:token-idx
:token-pos
The following three options can be used to move
data off heap during calculations.
They can make a dramatic difference in performance (faster or slower)
and in memory usage.
container-type
decides whether the intermediate results are stored on-heap (:jvm-heap, the default),
off-heap (:native-heap) or in a memory-mapped file (:mmap)
column-container-type
likewise decides whether the resulting dataset is stored on-heap (:jvm-heap, the default),
off-heap (:native-heap) or in a memory-mapped file (:mmap)
combine-method
How to combine the intermediate containers, either :concat-buffers or :coalesce-blocks!
Returns a dataset with columns:
:document document id
:token-idx the token as id
:token-count how often the token appears in a 'document'
:tf :token-count divided by document length
:tfidf TF-IDF value for the token
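A minimal usage sketch (the namespace alias is an assumption; adjust the require to the actual namespace in your project). tidy is assumed to be a dataset with :document, :token-idx and :token-pos columns, e.g. as produced by ->tidy-text:

(require '[scicloj.metamorph.ml.text :as text])

;; assumption: `tidy` was produced by ->tidy-text
(def tfidf
  (text/->tfidf tidy
                ;; optional: keep intermediate results off heap
                :container-type :native-heap
                :combine-method :coalesce-blocks!))
;; => dataset with :document :token-idx :token-count :tf :tfidf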
(->tidy-text lines-source
line-seq-fn
line-split-fn
line-tokenizer-fn
&
{:keys [skip-lines max-lines container-type datatype-document
datatype-token-pos datatype-meta datatype-token-idx
compacting-document-intervall combine-method
token->index-map column-container-type new-token-behaviour]
:or {datatype-token-idx :int16
max-lines Integer/MAX_VALUE
datatype-document :int16
container-type :jvm-heap
datatype-meta :object
datatype-token-pos :int16
compacting-document-intervall 10000
skip-lines 0
column-container-type :jvm-heap
combine-method :coalesce-blocks!
new-token-behaviour :store
token->index-map (Object2IntOpenHashMap. 10000)}})
Reads, parses, and tokenizes a text file or a TMD dataset into a seq of tech.v3.dataset datasets in the tidy-text format, i.e. one word per row. It does the parsing and conversion strictly line-based, so it should work for large documents.
Initial tests show that each byte of text needs 1.5 bytes of memory on average, so an 8 GB text file can be successfully loaded with at least 12 GB available.
lines-source
Either a buffered reader or a TMD dataset
line-seq-fn
A function which returns a lazy seq of lines, given the lines-source
line-split-fn
A fn which should separate a single line of input into 'text' and 'meta'.
It is supposed to return a seq of size 2, where the first element is the 'text'
of the line and the second is the meta, which can be anything non-nil
(map, vector, scalar). Its value will be returned in column :meta and is
supposed to be further processed later. meta can always be nil, in which case
no column :meta is created.
text-tokenizer-fn
A function which will be called for the text as obtained by line-split-fn.
It should split the text at word boundaries and return the obtained tokens as a seq of strings.
It can do any text normalisation desired.
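As an illustration, hypothetical implementations of both functions, assuming tab-separated input lines of the form "label<TAB>text":

(require '[clojure.string :as str])

;; returns [text meta]; here the label becomes the :meta value
(defn line-split-fn [line]
  (let [[label text] (str/split line #"\t" 2)]
    [text label]))

;; lower-case and split at word boundaries; any further
;; normalisation could happen here
(defn text-tokenizer-fn [text]
  (str/split (str/lower-case text) #"\W+"))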
Optional options
are:
skip-lines
0 Number of lines to skip at the beginning
max-lines
MAX_INT Maximum number of lines to return
The following can be used to optimize heap usage for larger texts. They can be tuned depending on how many documents, how many words per document, and how many tokens overall are in the text corpus.
datatype-document
:int16 Datatype of :document column (:int16 or :int32)
datatype-token-pos
:int16 Datatype of :token-pos column (:int16 or :int32)
datatype-meta
:object Datatype of :meta column (anything; needs to match what line-split-fn
returns as 'meta')
datatype-token-idx
:int16 Datatype of :token-idx column (:int16 or :int32)
The following options can be used to move
data off heap during
calculations. They can make a dramatic difference in performance (faster or slower)
and in memory usage.
column-container-type
:jvm-heap Whether the resulting table is created on-heap (:jvm-heap) or off-heap (:native-heap)
container-type
:jvm-heap Same as column-container-type,
but for intermediate results, per interval
compacting-document-intervall
10000 After how many lines the data is written into a continuous block
combine-method
:coalesce-blocks! Which method to use to combine blocks (:coalesce-blocks! or :concat-buffers)
One or the other might need less RAM in certain scenarios.
token->index-map
Object2IntOpenHashMap Can be overridden with an own object->int map implementation (maybe off-heap).
It can also be a map obtained from a previous run, in order to guarantee the same mappings.
new-token-behaviour
:store How to react when new tokens appear which are not in token->index-map.
Either :store (default), :fail (throw an exception) or :as-unknown (use the specific token [UNKNOWN])
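A sketch of tuning these options for a larger corpus (the values are illustrative assumptions, not recommendations; rdr is assumed to be an open reader):

(text/->tidy-text rdr
                  line-seq
                  line-split-fn
                  text-tokenizer-fn
                  :datatype-token-idx :int32            ;; more than 32767 distinct tokens
                  :container-type :native-heap          ;; intermediate results off heap
                  :compacting-document-intervall 20000)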
Function returns a map of :datasets and :token-lookup-table
:datasets is a seq of TMD datasets each having 4 columns which represent the input text in the tidy-text format:
:document The 'document/line' a token is coming from
:token-idx The token/word (as int), which is also present in the returned token->int lookup table
:token-pos The position of the token in the document
:meta The meta value, if returned by line-split-fn
Assuming that the text-tokenizer-fn
does no text normalisation, the table is an exact representation
of the input text. It also contains the word order in column :token-pos,
so re-sorting the table recovers the original text.
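A minimal end-to-end sketch (the namespace alias and file name are assumptions; line-split-fn and text-tokenizer-fn as sketched above):

(require '[clojure.java.io :as io]
         '[scicloj.metamorph.ml.text :as text])

(with-open [rdr (io/reader "corpus.txt")]
  (let [{:keys [datasets token-lookup-table]}
        (text/->tidy-text rdr
                          line-seq        ;; line-seq-fn
                          line-split-fn
                          text-tokenizer-fn
                          :datatype-token-idx :int32)]
    ;; realize inside with-open in case the seq is lazy
    {:datasets (doall datasets)
     :token-lookup-table token-lookup-table}))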
(libsvm->tidy reader)
Takes a reader (usually of a file) and reads it as libsvm-formatted data.
Returns a dataset with columns :instance :label :index :value
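For reference, each libsvm line has the shape "label index:value index:value ...". A usage sketch (alias and file name are assumptions):

(require '[clojure.java.io :as io]
         '[scicloj.metamorph.ml.text :as text])

(with-open [rdr (io/reader "data.libsvm")]
  (text/libsvm->tidy rdr))
;; => dataset with columns :instance :label :index :value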
(process-line token-lookup-table
line-split-fn
text-tokenizer-fn
datatype-document
datatype-token-pos
datatype-meta
datatype-token-idx
container-type
compacting-document-intervall
combine-method
new-token-behaviour
acc
line)
(tidy->libsvm! tfidf-ds writer column)
Writes a tfidf dataset to a writer in the libsvm text format
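A usage sketch (aliases and the choice of :tfidf as the value column are assumptions):

(require '[clojure.java.io :as io]
         '[scicloj.metamorph.ml.text :as text])

(with-open [w (io/writer "out.libsvm")]
  ;; assumption: `column` selects which column is written as the value,
  ;; here the :tfidf column of a dataset produced by ->tfidf
  (text/tidy->libsvm! tfidf-ds w :tfidf))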