The ``mxnet.contrib.text`` APIs refer to classes and functions related to text data processing, such
as building indices and loading pre-trained embedding vectors for text tokens, and storing them in the
``mxnet.ndarray.NDArray`` format.

.. warning:: This package contains experimental APIs and may change in the near future.

This document lists the text APIs in mxnet:
.. autosummary::
    :nosignatures:

    mxnet.contrib.text.embedding
    mxnet.contrib.text.vocab
    mxnet.contrib.text.utils
All the code demonstrated in this document assumes that the following modules or packages are imported.
>>> from mxnet import gluon
>>> from mxnet import nd
>>> from mxnet.contrib import text
>>> import collections
As a common use case, let us look up pre-trained word embedding vectors for indexed words in just a few lines of code.
To begin with, suppose that we have a simple text data set in the string format. We can count word frequency in the data set.
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)
The obtained ``counter`` has key-value pairs whose keys are words and values are word frequencies.
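As a sketch of what such a counter contains, the same counts can be reproduced with the standard-library ``collections.Counter`` (the mxnet utility returns a compatible counter object; the whitespace-splitting shown here is an assumption about its default tokenization):

```python
import collections

text_data = " hello world \n hello nice world \n hi world \n"
# Split on whitespace and count occurrences of each token.
counter = collections.Counter(text_data.split())
print(counter['world'])  # 3
print(counter['hello'])  # 2
```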
Suppose that we want to build indices for all the keys in ``counter`` and load the defined fastText
word embedding for all such indexed words. First, we need a ``Vocabulary`` object with ``counter``
as its argument.
>>> my_vocab = text.vocab.Vocabulary(counter)
We can create a fastText word embedding object by specifying the embedding name ``fasttext`` and
the pre-trained file ``wiki.simple.vec``. We also specify that the indexed tokens for loading the
fastText word embedding come from the defined ``Vocabulary`` object ``my_vocab``.
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
... vocabulary=my_vocab)
Now we are ready to look up the fastText word embedding vectors for indexed words, such as 'hello' and 'world'.
>>> my_embedding.get_vecs_by_tokens(['hello', 'world'])
[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>
To demonstrate how to use pre-trained word embeddings in the ``gluon`` package, let us first obtain
the indices of the words 'hello' and 'world'.
>>> my_embedding.to_indices(['hello', 'world'])
[2, 1]
We can obtain the vector representation for the words 'hello' and 'world' by specifying their
indices (2 and 1) and the weight matrix ``my_embedding.idx_to_vec`` in ``mxnet.gluon.nn.Embedding``.
>>> layer = gluon.nn.Embedding(len(my_embedding), my_embedding.vec_len)
>>> layer.initialize()
>>> layer.weight.set_data(my_embedding.idx_to_vec)
>>> layer(nd.array([2, 1]))
[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>
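The ``Embedding`` layer above is just a table lookup: for each input index, it returns the corresponding row of its weight matrix. A minimal pure-Python sketch of that behavior (the toy ``idx_to_vec`` table below is made up for illustration, with a vector length of 3 instead of 300):

```python
# Each row of the weight table is one token's embedding vector.
idx_to_vec = [
    [0.0, 0.0, 0.0],  # index 0: '<unk>'
    [0.1, 0.2, 0.3],  # index 1: 'world'
    [0.4, 0.5, 0.6],  # index 2: 'hello'
]

def embed(indices, table):
    # Equivalent to layer(nd.array(indices)): select one row per input index.
    return [table[i] for i in indices]

print(embed([2, 1], idx_to_vec))  # [[0.4, 0.5, 0.6], [0.1, 0.2, 0.3]]
```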
The vocabulary builds indices for text tokens. Such indexed tokens can be used by token embedding
instances. The input counter whose keys are candidate indices may be obtained via
``count_tokens_from_str``.
.. currentmodule:: mxnet.contrib.text.vocab
.. autosummary::
    :nosignatures:

    Vocabulary
Suppose that we have a simple text data set in the string format. We can count word frequency in the data set.
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)
The obtained ``counter`` has key-value pairs whose keys are words and values are word frequencies.
Suppose that we want to build indices for the 2 most frequent keys in ``counter`` with the unknown
token representation '<unk>' and a reserved token '<pad>'.
>>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2, unknown_token='<unk>',
... reserved_tokens=['<pad>'])
We can access properties such as ``token_to_idx`` (mapping tokens to indices), ``idx_to_token``
(mapping indices to tokens), ``unknown_token`` (representation of any unknown token), and
``reserved_tokens``.
>>> my_vocab.token_to_idx
{'<unk>': 0, '<pad>': 1, 'world': 2, 'hello': 3}
>>> my_vocab.idx_to_token
['<unk>', '<pad>', 'world', 'hello']
>>> my_vocab.unknown_token
'<unk>'
>>> my_vocab.reserved_tokens
['<pad>']
>>> len(my_vocab)
4
Besides the specified unknown token '<unk>' and the reserved token '<pad>', the 2 most frequent words 'world' and 'hello' are also indexed.
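The indexing rule can be sketched in plain Python (a hypothetical re-implementation for illustration, not the actual ``Vocabulary`` code): the unknown token takes index 0, reserved tokens come next, and then the most frequent words in descending frequency:

```python
import collections

def build_index(counter, most_freq_count, unknown_token, reserved_tokens):
    # Unknown token first, then reserved tokens, then the top-frequency words.
    idx_to_token = [unknown_token] + list(reserved_tokens)
    idx_to_token += [tok for tok, _ in counter.most_common(most_freq_count)]
    return {tok: idx for idx, tok in enumerate(idx_to_token)}

counter = collections.Counter(" hello world \n hello nice world \n hi world \n".split())
print(build_index(counter, 2, '<unk>', ['<pad>']))
# {'<unk>': 0, '<pad>': 1, 'world': 2, 'hello': 3}
```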
To load token embeddings from an externally hosted pre-trained token embedding file, such as those
of GloVe and FastText, use ``embedding.create(embedding_name, pretrained_file_name)``.

To get all the available ``embedding_name`` and ``pretrained_file_name`` values, use
``embedding.get_pretrained_file_names()``.
>>> text.embedding.get_pretrained_file_names()
{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', ...],
'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec', ...]}
Alternatively, to load embedding vectors from a custom pre-trained text token
embedding file, use ``CustomEmbedding``.

Moreover, to load composite embedding vectors, such as to concatenate embedding vectors,
use ``CompositeEmbedding``.

The indexed tokens in a text token embedding may come from a vocabulary or from the loaded embedding vectors. In the former case, only the indexed tokens in a vocabulary are associated with the loaded embedding vectors, such as those loaded from a pre-trained token embedding file. In the latter case, all the tokens from the loaded embedding vectors are taken as the indexed tokens of the embedding.
.. currentmodule:: mxnet.contrib.text.embedding
.. autosummary::
    :nosignatures:

    register
    create
    get_pretrained_file_names
    GloVe
    FastText
    CustomEmbedding
    CompositeEmbedding
One can specify that only the indexed tokens in a vocabulary are associated with the loaded embedding vectors, such as those loaded from a pre-trained token embedding file.
To begin with, suppose that we have a simple text data set in the string format. We can count word frequency in the data set.
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)
The obtained ``counter`` has key-value pairs whose keys are words and values are word frequencies.
Suppose that we want to build indices for the 2 most frequent keys in ``counter`` and load the
defined fastText word embedding with the pre-trained file ``wiki.simple.vec`` for these 2 words.
>>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2)
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
... vocabulary=my_vocab)
Now we are ready to look up the fastText word embedding vectors for indexed words.
>>> my_embedding.get_vecs_by_tokens(['hello', 'world'])
[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>
We can also access properties such as ``token_to_idx`` (mapping tokens to indices), ``idx_to_token``
(mapping indices to tokens), and ``vec_len`` (length of each embedding vector).
>>> my_embedding.token_to_idx
{'<unk>': 0, 'world': 1, 'hello': 2}
>>> my_embedding.idx_to_token
['<unk>', 'world', 'hello']
>>> len(my_embedding)
3
>>> my_embedding.vec_len
300
If a token is unknown to ``my_embedding``, its embedding vector is initialized according to the
default specification for unknown tokens (all elements are 0).
>>> my_embedding.get_vecs_by_tokens('nice')
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
...
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 300 @cpu(0)>
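The lookup behavior can be sketched as a dictionary fallback (hypothetical helper names, not the library's internals): a token missing from ``token_to_idx`` resolves to index 0, whose vector is all zeros here:

```python
# Toy tables with vector length 2; out-of-vocabulary tokens fall back to
# the unknown-token row at index 0.
token_to_idx = {'<unk>': 0, 'world': 1, 'hello': 2}
idx_to_vec = [[0.0, 0.0], [0.3, 0.4], [0.1, 0.2]]

def get_vec(token):
    return idx_to_vec[token_to_idx.get(token, token_to_idx['<unk>'])]

print(get_vec('nice'))   # [0.0, 0.0] -> unknown token, zero vector
print(get_vec('hello'))  # [0.1, 0.2]
```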
One can also use all the tokens from the loaded embedding vectors, such as loaded from a pre-trained token embedding file, as the indexed tokens of the embedding.
To begin with, we can create a fastText word embedding object by specifying the embedding name
'fasttext' and the pre-trained file 'wiki.simple.vec'. The argument ``init_unknown_vec`` specifies
the default vector representation for any unknown token. To index all the tokens from this
pre-trained word embedding file, we do not need to specify any vocabulary.
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
... init_unknown_vec=nd.zeros)
We can access properties such as ``token_to_idx`` (mapping tokens to indices), ``idx_to_token``
(mapping indices to tokens), ``vec_len`` (length of each embedding vector), and ``unknown_token``
(representation of any unknown token, default value is '<unk>').
>>> my_embedding.token_to_idx['nice']
2586
>>> my_embedding.idx_to_token[2586]
'nice'
>>> my_embedding.vec_len
300
>>> my_embedding.unknown_token
'<unk>'
For every unknown token, if its representation '<unk>' is encountered in the pre-trained token
embedding file, index 0 of the property ``idx_to_vec`` maps to the pre-trained token embedding
vector loaded from the file; otherwise, index 0 of ``idx_to_vec`` maps to the default token
embedding vector specified via ``init_unknown_vec`` (set to ``nd.zeros`` here). Since the
pre-trained file does not have a vector for the token '<unk>', index 0 has to map to an additional
token '<unk>', and the number of tokens in the embedding is 111,052.
>>> len(my_embedding)
111052
>>> my_embedding.idx_to_vec[0]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
...
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 300 @cpu(0)>
>>> my_embedding.get_vecs_by_tokens('nice')
[ 0.49397001 0.39996001 0.24000999 -0.15121 -0.087512 0.37114
...
0.089521 0.29175001 -0.40917999 -0.089206 -0.1816 -0.36616999]
<NDArray 300 @cpu(0)>
>>> my_embedding.get_vecs_by_tokens(['unknownT0kEN', 'unknownT0kEN'])
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
...
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
...
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
<NDArray 2x300 @cpu(0)>
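How index 0 is seeded can be sketched as follows (an assumption-level sketch with hypothetical names, not the actual loader): use the file's '<unk>' vector if the file provides one, otherwise call ``init_unknown_vec``:

```python
def seed_unknown_vector(file_vectors, unknown_token, vec_len, init_unknown_vec):
    # file_vectors: dict of token -> vector parsed from the pre-trained file.
    if unknown_token in file_vectors:
        return file_vectors[unknown_token]
    return init_unknown_vec(vec_len)

def zeros(n):
    return [0.0] * n

# wiki.simple.vec has no '<unk>' entry, so the default (zeros) is used.
print(seed_unknown_vector({'nice': [0.5, 0.4]}, '<unk>', 2, zeros))  # [0.0, 0.0]
```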
To implement a custom token embedding that can be created via ``embedding.create``, define a
subclass of ``mxnet.contrib.text.embedding._TokenEmbedding`` and decorate it with
``@mxnet.contrib.text.embedding._TokenEmbedding.register``. See ``embedding.py`` for examples.
The following functions provide utilities for text data processing.
.. currentmodule:: mxnet.contrib.text.utils
.. autosummary::
    :nosignatures:

    count_tokens_from_str
.. automodule:: mxnet.contrib.text.embedding
    :members: register, create, get_pretrained_file_names

.. autoclass:: mxnet.contrib.text.embedding.GloVe
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens

.. autoclass:: mxnet.contrib.text.embedding.FastText
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens

.. autoclass:: mxnet.contrib.text.embedding.CustomEmbedding
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens

.. autoclass:: mxnet.contrib.text.embedding.CompositeEmbedding
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens

.. automodule:: mxnet.contrib.text.vocab

.. autoclass:: mxnet.contrib.text.vocab.Vocabulary
    :members: to_indices, to_tokens

.. automodule:: mxnet.contrib.text.utils
    :members: count_tokens_from_str
Contributors: Aston Zhang, Sheng Zha & Aaron Markham