
Text API

Overview

The mxnet.contrib.text APIs refer to classes and functions related to text data processing, such as building indices and loading pre-trained embedding vectors for text tokens, and storing them in the mxnet.ndarray.NDArray format.

.. warning:: This package contains experimental APIs and may change in the near future.

This document lists the text APIs in mxnet:

.. autosummary::
    :nosignatures:

    mxnet.contrib.text.embedding
    mxnet.contrib.text.vocab
    mxnet.contrib.text.utils

All the code demonstrated in this document assumes that the following modules or packages are imported.

>>> from mxnet import gluon
>>> from mxnet import nd
>>> from mxnet.contrib import text
>>> import collections

Looking up pre-trained word embeddings for indexed words

As a common use case, let us look up pre-trained word embedding vectors for indexed words in just a few lines of code.

To begin with, suppose that we have a simple text data set in the string format. We can count word frequency in the data set.

>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)
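
The returned counter is a collections.Counter, so we can inspect it directly; its repr should look like the following (tokens with equal counts keep their insertion order):

>>> counter
Counter({'world': 3, 'hello': 2, 'nice': 1, 'hi': 1})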

The obtained counter has key-value pairs whose keys are words and values are word frequencies. Suppose that we want to build indices for all the keys in counter and load a fastText word embedding for all such indexed words. First, we need a Vocabulary object with counter as its argument.

>>> my_vocab = text.vocab.Vocabulary(counter)

We can create a fastText word embedding object by specifying the embedding name 'fasttext' and the pre-trained file 'wiki.simple.vec'. We also specify that the indexed tokens for loading the fastText word embedding come from the defined Vocabulary object my_vocab.

>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
...     vocabulary=my_vocab)

Now we are ready to look up the fastText word embedding vectors for indexed words, such as 'hello' and 'world'.

>>> my_embedding.get_vecs_by_tokens(['hello', 'world'])

[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
    ...
   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
 [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
    ...
   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

Using pre-trained word embeddings in gluon

To demonstrate how to use pre-trained word embeddings in the gluon package, let us first obtain indices of the words 'hello' and 'world'.

>>> my_embedding.to_indices(['hello', 'world'])
[2, 1]

We can obtain the vector representations of the words 'hello' and 'world' by specifying their indices (2 and 1) and using my_embedding.idx_to_vec as the weight of an mxnet.gluon.nn.Embedding layer.

>>> layer = gluon.nn.Embedding(len(my_embedding), my_embedding.vec_len)
>>> layer.initialize()
>>> layer.weight.set_data(my_embedding.idx_to_vec)
>>> layer(nd.array([2, 1]))

[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
    ...
   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
 [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
    ...
   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

Vocabulary

The vocabulary builds indices for text tokens. Such indexed tokens can be used by token embedding instances. The input counter, whose keys are candidate tokens for indexing, may be obtained via count_tokens_from_str.

.. currentmodule:: mxnet.contrib.text.vocab
.. autosummary::
    :nosignatures:

    Vocabulary

Suppose that we have a simple text data set in the string format. We can count word frequency in the data set.

>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)

The obtained counter has key-value pairs whose keys are words and values are word frequencies. Suppose that we want to build indices for the 2 most frequent keys in counter with the unknown token representation '<unk>' and a reserved token '<pad>'.

>>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2, unknown_token='<unk>',
...     reserved_tokens=['<pad>'])

We can access properties such as token_to_idx (mapping tokens to indices), idx_to_token (mapping indices to tokens), unknown_token (representation of any unknown token), and reserved_tokens.

>>> my_vocab.token_to_idx
{'<unk>': 0, '<pad>': 1, 'world': 2, 'hello': 3}
>>> my_vocab.idx_to_token
['<unk>', '<pad>', 'world', 'hello']
>>> my_vocab.unknown_token
'<unk>'
>>> my_vocab.reserved_tokens
['<pad>']
>>> len(my_vocab)
4

Besides the specified unknown token '<unk>' and the reserved token '<pad>', the 2 most frequent words, 'world' and 'hello', are also indexed.
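
The Vocabulary methods to_indices and to_tokens (see the API reference below) convert between tokens and indices; any out-of-vocabulary word, such as 'hi' here, maps to the index of the unknown token:

>>> my_vocab.to_indices(['hello', 'world', 'hi'])
[3, 2, 0]
>>> my_vocab.to_tokens([3, 2, 0])
['hello', 'world', '<unk>']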

Text token embedding

To load token embeddings from an externally hosted pre-trained token embedding file, such as those of GloVe and FastText, use embedding.create(embedding_name, pretrained_file_name).

To get all the available values of embedding_name and pretrained_file_name, use embedding.get_pretrained_file_names().

>>> text.embedding.get_pretrained_file_names()
{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', ...],
'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec', ...]}
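
To list the pre-trained files of a single embedding, the embedding name can be passed instead; the call should then return just a list of file names (a sketch, assuming the optional embedding_name argument documented in the API reference below):

>>> text.embedding.get_pretrained_file_names(embedding_name='fasttext')
['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec', ...]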

Alternatively, to load embedding vectors from a custom pre-trained text token embedding file, use CustomEmbedding.
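
For instance, given a hypothetical local file my_embeddings.txt in which each line holds a token followed by its vector elements separated by elem_delim, loading it might look like:

>>> # 'my_embeddings.txt' is a hypothetical file: one token plus its vector elements per line.
>>> my_custom_embedding = text.embedding.CustomEmbedding('my_embeddings.txt', elem_delim=' ')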

Moreover, to load composite embedding vectors, for example by concatenating multiple embedding vectors, use CompositeEmbedding.
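
For instance, to concatenate GloVe and fastText vectors for the tokens indexed in my_vocab above, one might write (a sketch, assuming CompositeEmbedding accepts a vocabulary and a list of token embedding instances):

>>> glove_emb = text.embedding.create('glove', pretrained_file_name='glove.6B.50d.txt')
>>> fasttext_emb = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec')
>>> my_composite_embedding = text.embedding.CompositeEmbedding(vocabulary=my_vocab,
...     token_embeddings=[glove_emb, fasttext_emb])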

The indexed tokens in a text token embedding may come from a vocabulary or from the loaded embedding vectors. In the former case, only the indexed tokens in the vocabulary are associated with the loaded embedding vectors, such as those loaded from a pre-trained token embedding file. In the latter case, all the tokens from the loaded embedding vectors, such as those loaded from a pre-trained token embedding file, are taken as the indexed tokens of the embedding.

.. currentmodule:: mxnet.contrib.text.embedding
.. autosummary::
    :nosignatures:

    register
    create
    get_pretrained_file_names
    GloVe
    FastText
    CustomEmbedding
    CompositeEmbedding

Indexed tokens are from a vocabulary

One can specify that only the indexed tokens in a vocabulary are associated with the loaded embedding vectors, such as those loaded from a pre-trained token embedding file.

To begin with, suppose that we have a simple text data set in the string format. We can count word frequency in the data set.

>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)

The obtained counter has key-value pairs whose keys are words and values are word frequencies. Suppose that we want to build indices for the 2 most frequent keys in counter and load the fastText word embedding with the pre-trained file wiki.simple.vec for these 2 words.

>>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2)
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
...     vocabulary=my_vocab)

Now we are ready to look up the fastText word embedding vectors for indexed words.

>>> my_embedding.get_vecs_by_tokens(['hello', 'world'])

[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
    ...
   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
 [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
    ...
   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

We can also access properties such as token_to_idx (mapping tokens to indices), idx_to_token (mapping indices to tokens), and vec_len (length of each embedding vector).

>>> my_embedding.token_to_idx
{'<unk>': 0, 'world': 1, 'hello': 2}
>>> my_embedding.idx_to_token
['<unk>', 'world', 'hello']
>>> len(my_embedding)
3
>>> my_embedding.vec_len
300

If a token is unknown to the vocabulary, its embedding vector is initialized according to the default specification (all elements are 0).


>>> my_embedding.get_vecs_by_tokens('nice')

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  ...
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
<NDArray 300 @cpu(0)>

Indexed tokens are from the loaded embedding vectors

One can also use all the tokens from the loaded embedding vectors, such as those loaded from a pre-trained token embedding file, as the indexed tokens of the embedding.

To begin with, we can create a fastText word embedding object by specifying the embedding name 'fasttext' and the pre-trained file 'wiki.simple.vec'. The argument init_unknown_vec specifies the default vector representation for any unknown token. To index all the tokens from this pre-trained word embedding file, we do not need to specify any vocabulary.

>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
...     init_unknown_vec=nd.zeros)

We can access properties such as token_to_idx (mapping tokens to indices), idx_to_token (mapping indices to tokens), vec_len (length of each embedding vector), and unknown_token (representation of any unknown token, default value is '<unk>').

>>> my_embedding.token_to_idx['nice']
2586
>>> my_embedding.idx_to_token[2586]
'nice'
>>> my_embedding.vec_len
300
>>> my_embedding.unknown_token
'<unk>'

For every unknown token, if its representation '<unk>' is encountered in the pre-trained token embedding file, index 0 of the property idx_to_vec maps to the pre-trained token embedding vector loaded from the file; otherwise, index 0 of idx_to_vec maps to the default token embedding vector specified via init_unknown_vec (set to nd.zeros here). Since the pre-trained file does not have a vector for the token '<unk>', index 0 maps to an additional token '<unk>', and the total number of tokens in the embedding is 111,052.

>>> len(my_embedding)
111052
>>> my_embedding.idx_to_vec[0]

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  ...
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
<NDArray 300 @cpu(0)>
>>> my_embedding.get_vecs_by_tokens('nice')

[ 0.49397001  0.39996001  0.24000999 -0.15121    -0.087512    0.37114
  ...
  0.089521    0.29175001 -0.40917999 -0.089206   -0.1816     -0.36616999]
<NDArray 300 @cpu(0)>
>>> my_embedding.get_vecs_by_tokens(['unknownT0kEN', 'unknownT0kEN'])

[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   ...
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   ...
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
<NDArray 2x300 @cpu(0)>

Implement a new text token embedding

To implement a new text token embedding, create a subclass of mxnet.contrib.text.embedding._TokenEmbedding. Also add @mxnet.contrib.text.embedding._TokenEmbedding.register before this class. See embedding.py for examples.
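
A minimal sketch of such a subclass, with hypothetical names throughout:

>>> from mxnet.contrib.text import embedding
>>> @embedding._TokenEmbedding.register
... class MyEmbedding(embedding._TokenEmbedding):
...     """A hypothetical token embedding loaded from a local file."""
...     def __init__(self, pretrained_file_name='my_embedding.vec', **kwargs):
...         super(MyEmbedding, self).__init__(**kwargs)
...         # A real subclass would locate the pre-trained file and load its
...         # vectors here, following the built-in classes in embedding.py.

Once registered, the subclass should be retrievable by name via text.embedding.create, following the lowercase naming used for the built-in GloVe ('glove') and FastText ('fasttext') classes.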

Text utilities

The following functions provide utilities for text data processing.

.. currentmodule:: mxnet.contrib.text.utils
.. autosummary::
    :nosignatures:

    count_tokens_from_str
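
Beyond the default usage shown earlier, count_tokens_from_str accepts keyword arguments such as to_lower and counter_to_update (a sketch, assuming those arguments as documented in the API reference below):

>>> text.utils.count_tokens_from_str(' Hello world \n hello WORLD \n', to_lower=True)
Counter({'hello': 2, 'world': 2})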

API Reference


.. automodule:: mxnet.contrib.text.embedding
    :members: register, create, get_pretrained_file_names
.. autoclass:: mxnet.contrib.text.embedding.GloVe
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
.. autoclass:: mxnet.contrib.text.embedding.FastText
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
.. autoclass:: mxnet.contrib.text.embedding.CustomEmbedding
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
.. autoclass:: mxnet.contrib.text.embedding.CompositeEmbedding
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens

.. automodule:: mxnet.contrib.text.vocab
.. autoclass:: mxnet.contrib.text.vocab.Vocabulary
    :members: to_indices, to_tokens

.. automodule:: mxnet.contrib.text.utils
    :members: count_tokens_from_str
