Configure the Stanford CoreNLP parser.

This provides a plugin architecture for natural language processing tasks in a pipeline. A parser takes either a human language utterance or previously annotated data parsed from an utterance.

### Parser Libraries

Each parser provides a set of *components* that make up the pipeline. Each component (e.g. [[tokenize]]) is a function that returns a map containing the keys:

* **component** a key that's the name of the component to create.
* **parser** a key that is the name of the parser it belongs to.

For example, the Stanford CoreNLP word tokenizer has the following return map:

* **:component** :tokenize
* **:lang** *lang-code* (e.g. `en`)
* **:parser** :stanford

The map also has additional key/value pairs that represent the remaining configuration given to the parser library used to create its pipeline components. All parse library names (keys) are given in [[all-parsers]].

Use [[register-library]] to add your library with the key name of your parser.

### Usage

You can create your own custom parser configuration with [[create-parse-config]] and then create its respective context with [[create-context]]. If you do this, then each parse call needs to be in a [[with-context]] lexical context. If you don't, a default context is created and used for each parse invocation.

Once configured, use [[zensols.nlparse.parse/parse]] to invoke the parsing pipeline.
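The custom-configuration flow above can be sketched as follows. Only the function names ([[create-parse-config]], [[create-context]], [[with-context]], the component functions) come from the docs; the exact arities, and in particular the `:pipeline` keyword argument, are assumptions:

```clojure
;; minimal sketch of a custom parser configuration; the :pipeline
;; keyword argument and component arities are assumptions
(require '[zensols.nlparse.config :as conf
           :refer [create-parse-config create-context with-context]]
         '[zensols.nlparse.parse :refer [parse]])

;; build a context from only the components we need
(def context
  (-> (create-parse-config
        :pipeline [(conf/tokenize "en")
                   (conf/sentence)
                   (conf/part-of-speech "english.tagger")])
      create-context))

;; each parse call must be wrapped in the context's lexical scope
(with-context context
  (parse "I am Paul Landes."))
```

Skipping the custom configuration entirely and calling `parse` directly uses the default context instead.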
Parse a pipeline configuration.

This namespace supports a simple DSL for parsing a pipeline configuration (see [[zensols.nlparse.config]]). The *configuration string* is a set of comma-separated *forms*, each representing a component. For example the forms:

```
zensols.nlparse.config/tokenize("en"),zensols.nlparse.config/sentence,part-of-speech("english.tagger"),zensols.nlparse.config/morphology
```

creates a pipeline that tokenizes and adds POS tags and lemmas when called with [[parse]]. Note the double quotes in the `tokenize` and `part-of-speech` mnemonics. The [[parse]] function does this by calling in order:

* ([[zensols.nlparse.config/tokenize]] "en")
* ([[zensols.nlparse.config/sentence]])
* ([[zensols.nlparse.config/part-of-speech]] "english.tagger")
* ([[zensols.nlparse.config/morphology]])

Some configuration functions are parameterized by positions or maps. Positional arguments are shown in the above example; a map configuration follows:

```
parse-tree({:use-shift-reduce? true :maxtime 1000})
```

which creates a shift-reduce parser that times out after a second (per sentence).

Note that arguments (the parenthetical portion of the form) are optional, and so is the namespace, which defaults to `zensols.nlparse.config`. To use a separate namespace for custom plug-and-play components (see [[zensols.nlparse.config/register-library]]), you can specify your own namespace with a `/`, for example:

```
example.namespace/myfunc(arg1,arg2)
```
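Since the namespace prefix defaults to `zensols.nlparse.config`, the example pipeline above can be written with short mnemonics. A sketch of calling the DSL's [[parse]] function; the require form and the shape of its return value (assumed: a sequence of component configurations) are assumptions:

```clojure
;; sketch: parsing a DSL configuration string; the exact return
;; value of parse is an assumption
(require '[zensols.nlparse.config-parse :as cp])

;; short mnemonics resolve against zensols.nlparse.config by default
(cp/parse "tokenize(\"en\"),sentence,part-of-speech(\"english.tagger\"),morphology")
```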
Feature utility functions. In this library, all references to `panon` stand for *parsed annotation*, which is returned from [[zensols.nlparse.parse/parse]].
Feature utility functions. See [[zensols.nlparse.feature.lang]].
Parse an utterance using the Stanford CoreNLP and the ClearNLP SRL.

This is the main client entry point to the package. A default out-of-the-box parser comes with the components listed in [[zensols.nlparse.config/all-components]].

If you want to customize or add your own parser plug-in, see the [[zensols.nlparse.config]] namespace.
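The default entry point can be sketched as below; no context is configured, so a default one is created and reused per the config docs:

```clojure
;; sketch of default out-of-the-box usage; the shape of the returned
;; map is not shown here because it is not documented in this excerpt
(require '[zensols.nlparse.parse :refer [parse]])

;; `panon` is the "parsed annotation" referenced by the feature namespaces
(def panon (parse "I am Paul Landes."))
```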
Configure environment for the NLP pipeline.
Wrap the ClearNLP SRL.

Currently the PropBank-trained model is used. The main classification function is [[label]].
Wraps the Stanford CoreNLP parser.
This namespace provides ways of filtering *stop word* tokens.

To avoid the double negative in function names, *go words* are defined to be the complement of a vocabulary with a stop word list. Functions like [[go-word?]] tell whether or not a token is a stop word. Stop words are defined to be:

* stop words (predefined list)
* punctuation
* numbers
* non-alphabetic characters
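A sketch of filtering a parsed utterance down to its go words. The [[go-word?]] name comes from the docs, but the token traversal (the `:sents` and `:tokens` keys) is an assumption about the parsed-annotation shape:

```clojure
;; sketch: keep only go words (non-stop-words) from a parse;
;; the :sents/:tokens access path is an assumption
(require '[zensols.nlparse.stopword :refer [go-word?]]
         '[zensols.nlparse.parse :refer [parse]])

(->> (parse "The 3 dogs ran, quickly!")
     :sents
     (mapcat :tokens)
     (filter go-word?))
```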
This namespace extends the NER system to easily add any regular expression using the [Stanford TokensRegex](http://nlp.stanford.edu/software/tokensregex.html) API. This takes a sequence of regular expressions and entity metadata as input and produces a file format the TokensRegex API consumes to tag entities. [This](https://github.com/plandes/clj-nlp-parse/blob/v0.0.11/test-resources/token-regex.txt) is an example of the output.