Language Processing

Overview of language processing in DL4J

Although not designed to be comparable to tools such as Stanford CoreNLP or NLTK, Deeplearning4j does include some core text processing tools, which are described here.

Deeplearning4j's NLP support consists of interfaces for different NLP libraries. A user wraps third-party libraries via these interfaces. As of M1, Deeplearning4j does not support any third-party libraries directly, because of the lack of maintenance and the custom work needed to make them work well for users. Instead, we expose interfaces that allow users to implement their own tokenizers.

SentenceIterator

There are several steps involved in processing natural language. The first is to iterate over your corpus to create a list of documents, each of which can be as short as a tweet or as long as a newspaper article. This is performed by a SentenceIterator, which looks like this:

// Gets Path to Text file
String filePath = new File(dataLocalPath,"raw_sentences.txt").getAbsolutePath();
// Strip white space before and after for each line
SentenceIterator iter = new BasicLineIterator(filePath);

The SentenceIterator encapsulates a corpus or text, organizing it, say, as one Tweet per line. It is responsible for feeding text piece by piece into your natural language processor. The SentenceIterator should not be confused with the similarly named DataSetIterator, which creates a dataset for training a neural net. Instead, it creates a collection of strings by segmenting a corpus.
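To make the "one document per line" idea concrete, here is a minimal sketch in plain Java. It is not DL4J code: the class name LineSentenceSketch and its method are hypothetical, and skipping blank lines is an assumption made for the sketch rather than documented BasicLineIterator behavior.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a line-based sentence iterator; not a DL4J class.
class LineSentenceSketch {

    // Treats every non-blank line of the corpus as one "sentence" (document).
    static List<String> sentences(Reader corpus) {
        return new BufferedReader(corpus).lines()
                .map(String::trim)                  // strip white space before and after each line
                .filter(line -> !line.isEmpty())    // assumption of this sketch: blank lines are skipped
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String corpus = "  First tweet.  \n\nSecond tweet.\n";
        for (String sentence : sentences(new StringReader(corpus))) {
            System.out.println(sentence);
        }
    }
}
```

Because the method only needs a Reader, the same sketch works for a file, a network stream or an in-memory string.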

Tokenizer

A Tokenizer further segments the text at the level of single words, or alternatively as n-grams. ClearTK contains the underlying tokenizers, along with tools such as part-of-speech (PoS) tagging and parse trees, which allow for both dependency and constituency parsing, like that employed by a recursive neural tensor network (RNTN).

A Tokenizer is created and wrapped by a TokenizerFactory. The default tokens are words separated by spaces. The tokenization process also involves some machine learning to disambiguate symbols like the period (.), which both ends sentences and abbreviates words such as Mr. and vs.

Both Tokenizers and SentenceIterators work with Preprocessors to deal with anomalies in messy text like Unicode, and to render such text, say, as lowercase characters uniformly.

 public static void main(String[] args) throws Exception {

        dataLocalPath = DownloaderUtility.NLPDATA.Download();
        // Gets Path to Text file
        String filePath = new File(dataLocalPath,"raw_sentences.txt").getAbsolutePath();

        log.info("Load & Vectorize Sentences....");
        // Strip white space before and after for each line
        SentenceIterator iter = new BasicLineIterator(filePath);
        // Split on white spaces in the line to get words
        TokenizerFactory t = new DefaultTokenizerFactory();

        /*
            CommonPreprocessor will apply the following regex to each token: [\d\.:,"'\(\)\[\]|/?!;]+
            So, effectively all numbers, punctuation symbols and some special symbols are stripped off.
            Additionally it forces lower case for all tokens.
         */
        t.setTokenPreProcessor(new CommonPreprocessor());
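
The effect of the preprocessing step can be approximated in plain Java. The sketch below is not DL4J source code: it simply applies the regex quoted in the comment above and lower-cases each token. The class name PreprocessSketch is hypothetical, and dropping tokens that become empty is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; mirrors the behavior described in the comment above, not DL4J's source.
class PreprocessSketch {

    // The regex quoted in the example's comment, escaped for a Java string literal.
    static final String STRIP = "[\\d\\.:,\"'\\(\\)\\[\\]|/?!;]+";

    // Removes digits and the listed punctuation, then forces lower case.
    static String preProcess(String token) {
        return token.replaceAll(STRIP, "").toLowerCase(Locale.ROOT);
    }

    // Splits on white space and preprocesses every token.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String raw : sentence.trim().split("\\s+")) {
            String token = preProcess(raw);
            if (!token.isEmpty()) {   // assumption of this sketch: emptied tokens are dropped
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [mr, smith, paid, dollars, didnt, he]
        System.out.println(tokenize("Mr. Smith paid 42 dollars, didn't he?"));
    }
}
```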

Vocab

Each document has to be tokenized to create a vocab, the set of words that matter for that document or corpus. Those words are stored in the vocab cache, which contains statistics about the subset of words in the document that "matter". The line separating significant and insignificant words is movable, but the basic idea behind distinguishing between the two groups is that words occurring only once (or fewer than, say, five times) are hard to learn, and their presence represents unhelpful noise.
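
The frequency cutoff can be sketched in a few lines of plain Java. This is an illustration only, not the vocab cache implementation: the class MinFrequencySketch is hypothetical, and the parameter name minWordFrequency is borrowed from the Word2Vec builder used in the example further down.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a frequency cutoff; not part of DL4J.
class MinFrequencySketch {

    // Counts every token, then keeps only those seen at least minWordFrequency times.
    static Map<String, Integer> buildVocab(List<String> tokens, int minWordFrequency) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        counts.values().removeIf(count -> count < minWordFrequency);
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "cat", "sat", "on", "the", "mat", "the", "cat");
        // prints {the=3, cat=2}
        System.out.println(buildVocab(tokens, 2));
    }
}
```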

The vocab cache stores metadata for methods such as Word2vec and Bag of Words, which treat words in radically different ways. Word2vec creates representations of words, or neural word embeddings, in the form of vectors that are hundreds of coefficients long. Those coefficients help neural nets predict the likelihood of a word appearing in any given context; for example, after another word. Here's Word2vec, configured:

package org.deeplearning4j.examples.nlp.word2vec;

import org.deeplearning4j.examples.download.DownloaderUtility;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.util.Collection;

/**
 * Created by agibsonccc on 10/9/14.
 *
 * Neural net that processes text into wordvectors. See below url for an in-depth explanation.
 * https://deeplearning4j.org/word2vec.html
 */
public class Word2VecRawTextExample {

    private static Logger log = LoggerFactory.getLogger(Word2VecRawTextExample.class);

    public static String dataLocalPath;


    public static void main(String[] args) throws Exception {

        dataLocalPath = DownloaderUtility.NLPDATA.Download();
        // Gets Path to Text file
        String filePath = new File(dataLocalPath,"raw_sentences.txt").getAbsolutePath();

        log.info("Load & Vectorize Sentences....");
        // Strip white space before and after for each line
        SentenceIterator iter = new BasicLineIterator(filePath);
        // Split on white spaces in the line to get words
        TokenizerFactory t = new DefaultTokenizerFactory();

        /*
            CommonPreprocessor will apply the following regex to each token: [\d\.:,"'\(\)\[\]|/?!;]+
            So, effectively all numbers, punctuation symbols and some special symbols are stripped off.
            Additionally it forces lower case for all tokens.
         */
        t.setTokenPreProcessor(new CommonPreprocessor());

        log.info("Building model....");
        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)
                .iterations(1)
                .layerSize(100)
                .seed(42)
                .windowSize(5)
                .iterate(iter)
                .tokenizerFactory(t)
                .build();

        log.info("Fitting Word2Vec model....");
        vec.fit();

        // Prints out the closest 10 words to "day". An example of what to do with these word vectors.
        log.info("Closest Words:");
        Collection<String> lst = vec.wordsNearestSum("day", 10);
        log.info("10 Words closest to 'day': {}", lst);
    }
}

Once you obtain word vectors, you can feed them into a deep net for classification, prediction, sentiment analysis and the like.
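
To give an intuition for what a "closest words" lookup does, the sketch below ranks words by cosine similarity over a tiny hand-made vector table. It is a toy under stated assumptions: the class NearestWordsSketch, the two-dimensional vectors and the choice of cosine similarity are all illustrative, and nothing here claims to reproduce how wordsNearestSum is implemented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy; the vectors are made up and far shorter than real embeddings.
class NearestWordsSketch {

    // Cosine similarity of two vectors of equal length (undefined for zero vectors).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns up to n words ranked by similarity to the query word, excluding the query itself.
    // The query word is assumed to be present in the table.
    static List<String> nearest(Map<String, double[]> vectors, String word, int n) {
        double[] query = vectors.get(word);
        List<String> candidates = new ArrayList<>(vectors.keySet());
        candidates.remove(word);
        candidates.sort(Comparator.comparingDouble((String w) -> -cosine(query, vectors.get(w))));
        return candidates.subList(0, Math.min(n, candidates.size()));
    }

    public static void main(String[] args) {
        Map<String, double[]> toy = new LinkedHashMap<>();
        toy.put("day", new double[] {1.0, 0.1});
        toy.put("night", new double[] {0.9, 0.2});
        toy.put("car", new double[] {0.0, 1.0});
        // prints [night, car]
        System.out.println(nearest(toy, "day", 2));
    }
}
```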

Doc2Vec

Doc2Vec and arbitrary documents for language processing in DL4J.

The main purpose of Doc2Vec is associating arbitrary documents with labels, so labels are required. Doc2Vec is an extension of Word2Vec that learns to correlate labels and words, rather than words with other words. Deeplearning4j's implementation is intended to serve the Java, Scala and Clojure communities.

The first step is coming up with a vector that represents the "meaning" of a document, which can then be used as input to a supervised machine learning algorithm to associate documents with labels.

In the ParagraphVectors builder pattern, the labels() method points to the labels to train on. In the example below, you can see labels related to sentiment analysis:

    .labels(Arrays.asList("negative", "neutral","positive"))

Here's an example of classification with paragraph vectors, written as a test method:

public void testDifferentLabels() throws Exception {
    ClassPathResource resource = new ClassPathResource("/labeled");
    File file = resource.getFile();
    LabelAwareSentenceIterator iter = LabelAwareUimaSentenceIterator.createWithPath(file.getAbsolutePath());

    TokenizerFactory t = new UimaTokenizerFactory();

    ParagraphVectors vec = new ParagraphVectors.Builder()
            .minWordFrequency(1)
            .labels(Arrays.asList("negative", "neutral", "positive"))
            .layerSize(100)
            .stopWords(new ArrayList<String>())
            .windowSize(5)
            .iterate(iter)
            .tokenizerFactory(t)
            .build();

    vec.fit();

    // Each label's vector should differ from the vector of the unknown-word token.
    assertNotEquals(vec.lookupTable().vector("UNK"), vec.lookupTable().vector("negative"));
    assertNotEquals(vec.lookupTable().vector("UNK"), vec.lookupTable().vector("positive"));
    assertNotEquals(vec.lookupTable().vector("UNK"), vec.lookupTable().vector("neutral"));
}
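
The idea of representing a document as a vector and then associating it with a label can be pictured with a deliberately crude stand-in. The sketch below averages made-up word vectors into a document vector and picks the label whose vector is most similar. This is not how ParagraphVectors is trained; the class DocLabelSketch, the vectors and the averaging shortcut are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy; a simplification, not the Paragraph Vectors algorithm.
class DocLabelSketch {

    // Crude document vector: the element-wise mean of the vectors of its known words.
    static double[] meanVector(List<String> words, Map<String, double[]> wordVectors, int size) {
        double[] mean = new double[size];
        int known = 0;
        for (String word : words) {
            double[] v = wordVectors.get(word);
            if (v == null) continue;                 // unknown words are skipped
            for (int i = 0; i < size; i++) mean[i] += v[i];
            known++;
        }
        for (int i = 0; i < size && known > 0; i++) mean[i] /= known;
        return mean;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the label whose vector is most similar to the document vector,
    // or null when no similarity can be computed (for example, an all-zero document vector).
    static String nearestLabel(double[] doc, Map<String, double[]> labelVectors) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : labelVectors.entrySet()) {
            double score = cosine(doc, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> words = new LinkedHashMap<>();
        words.put("great", new double[] {1, 0});
        words.put("awful", new double[] {-1, 0});
        words.put("movie", new double[] {0, 1});
        Map<String, double[]> labels = new LinkedHashMap<>();
        labels.put("negative", new double[] {-1, 0.5});
        labels.put("neutral", new double[] {0, 1});
        labels.put("positive", new double[] {1, 0.5});
        double[] doc = meanVector(Arrays.asList("great", "movie"), words, 2);
        System.out.println(nearestLabel(doc, labels)); // prints positive
    }
}
```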

Sentence Iterator

Iteration of words, documents, and sentences for language processing in DL4J.

A sentence iterator is used in both Word2vec and Bag of Words.

It feeds bits of text into a neural network in the form of vectors, and also covers the concept of documents in text processing.

In natural-language processing, a document or sentence is typically used to encapsulate a context which an algorithm should learn.

Examples include analyzing Tweets and full-blown news articles. The purpose of the sentence iterator is to divide text into processable bits. Note that the sentence iterator is input agnostic, so bits of text (a document) can come from a file system, the Twitter API or Hadoop.

Depending on how input is processed, the output of a sentence iterator will then be passed to a tokenizer for the processing of individual tokens, which are usually words, but could also be ngrams, skipgrams or other units. The tokenizer is created on a per-sentence basis by a tokenizer factory. The tokenizer factory is what is passed into a text-processing vectorizer.

Some typical examples are below:

SentenceIterator iter = new LineSentenceIterator(new File("your file"));

This assumes that each line in a file is a sentence.

You can also treat a list of strings as sentences, as follows:

Collection<String> sentences = ...;
SentenceIterator iter = new CollectionSentenceIterator(sentences);

This will assume that each string is a sentence (document). Remember this could be a list of Tweets or articles -- both are applicable.

You can iterate over files as follows:

SentenceIterator iter = new FileSentenceIterator(new File("your dir or file"));

This will parse the files line by line and return each line as an individual sentence.

For anything more complex, we recommend a pipeline that offers more in-depth support than space-separated tokens.
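
The hand-off from sentences to per-sentence tokenizers, including units other than single words, can be sketched in plain Java. Everything here is hypothetical: PipelineSketch is not a DL4J class, and a java.util.function.Function stands in for the tokenizer factory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a sentence-to-token pipeline; not DL4J code.
class PipelineSketch {

    // Plays the role of a tokenizer factory: each call tokenizes one sentence on white space.
    static final Function<String, List<String>> WORDS =
            sentence -> Arrays.asList(sentence.trim().split("\\s+"));

    // Builds word n-grams of exactly n tokens from a token list.
    static List<String> nGrams(List<String> tokens, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }

    // Feeds every sentence through the factory, one tokenization per sentence.
    static List<List<String>> run(Iterable<String> sentences, Function<String, List<String>> factory) {
        List<List<String>> out = new ArrayList<>();
        for (String sentence : sentences) {
            out.add(factory.apply(sentence));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("the quick brown fox", "hello world");
        // prints [[the, quick, brown, fox], [hello, world]]
        System.out.println(run(sentences, WORDS));
        // prints [[the quick, quick brown, brown fox], [hello world]]
        System.out.println(run(sentences, WORDS.andThen(tokens -> nGrams(tokens, 2))));
    }
}
```

Swapping the function passed to run changes the unit of tokenization without touching the code that iterates over sentences, which is the same separation the sentence iterator and tokenizer factory provide.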

Tokenization

Breaking text into individual words for language processing in DL4J.

Notes to write on: 1. Tokenizer factory interface 2. Tokenizer interface 3. How to write your own factory and tokenizer

What is Tokenization?

Tokenization is the process of breaking text down into individual words. Word windows are also composed of tokens. Word2Vec can output text windows that comprise training examples for input into neural nets.

Example

Here's an example of tokenization done with DL4J tools:

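TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
// create(...) returns a Tokenizer for the given string
Tokenizer tokenizer = tokenizerFactory.create("mystring");

// Option 1: iterate over the tokens one at a time
while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
}

// Option 2: get the whole list of tokens at once. A fresh tokenizer is used here,
// since the loop above may already have consumed the tokens of the first one.
List<String> tokens = tokenizerFactory.create("mystring").getTokens();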

The above snippet creates a default tokenizer, which splits text on white space; it does not perform stemming by itself.

In Word2Vec, a tokenizer capable of stemming is the recommended way of creating a vocabulary, because it averts various vocabulary quirks, such as the singular and plural of the same noun being counted as two different words.
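
The singular/plural quirk is easy to see with a toy normalizer. The sketch below is not a real stemmer and not DL4J code: PluralFoldSketch and its single rule (drop one trailing "s") are hypothetical and exist only to show how folding word forms shrinks the vocabulary.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy; a real stemmer handles far more cases than this one rule.
class PluralFoldSketch {

    // Toy rule: drop one trailing "s" from words longer than three letters, unless they end in "ss".
    static String fold(String token) {
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Collects the distinct words, optionally folding plurals first.
    static Set<String> vocab(List<String> tokens, boolean foldPlurals) {
        Set<String> vocab = new LinkedHashSet<>();
        for (String token : tokens) {
            vocab.add(foldPlurals ? fold(token) : token);
        }
        return vocab;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("dog", "dogs", "glass", "cats", "cat");
        System.out.println(vocab(tokens, false)); // prints [dog, dogs, glass, cats, cat]
        System.out.println(vocab(tokens, true));  // prints [dog, glass, cat]
    }
}
```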

Vocabulary Cache

Mechanism for handling general NLP tasks in DL4J.

The vocabulary cache, or vocab cache, is a mechanism for handling general-purpose natural-language tasks in Deeplearning4j, including normal TF-IDF, word vectors and certain information-retrieval techniques. The goal of the vocab cache is to be a one-stop shop for text vectorization, encapsulating techniques common to bag of words and word vectors, among others.

Vocab cache handles storage of tokens, word-count frequencies, inverse-document frequencies and document occurrences via an inverted index. The InMemoryLookupCache is the reference implementation.

In order to use a vocab cache as you iterate over text and index tokens, you need to figure out whether each token should be included in the vocab. The usual criterion is whether the token occurs with more than a certain pre-configured frequency in the corpus. Below that frequency, an individual token isn't a vocab word, and it remains just a token.

We track tokens as well as vocab words. When you add a vocab word, adding the word to the index sets the index. Then you declare it as a vocab word. (Declaring it as a vocab word will pull the word from the index.)
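
The distinction between a tracked token and a vocab word can be sketched with a small in-memory structure. This is a toy under stated assumptions: ToyVocabCache and all of its method names are hypothetical and do not correspond to DL4J's vocab cache API. It only illustrates token counts, document occurrences via an inverted index, and promotion to an indexed vocab word once a frequency threshold is reached.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy cache; the names below are not DL4J's vocab cache API.
class ToyVocabCache {
    private final Map<String, Integer> tokenCounts = new HashMap<>();       // every token seen
    private final Map<String, Set<Integer>> docsByToken = new HashMap<>();  // inverted index
    private final Map<String, Integer> indexByWord = new HashMap<>();       // vocab words only
    private final List<String> wordByIndex = new ArrayList<>();
    private final int minWordFrequency;

    ToyVocabCache(int minWordFrequency) {
        this.minWordFrequency = minWordFrequency;
    }

    // Tracks one occurrence of a token in a document and promotes it once it is frequent enough.
    void track(String token, int docId) {
        int count = tokenCounts.merge(token, 1, Integer::sum);
        docsByToken.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
        if (count >= minWordFrequency && !indexByWord.containsKey(token)) {
            indexByWord.put(token, wordByIndex.size()); // adding the word to the index sets the index
            wordByIndex.add(token);                     // from now on it counts as a vocab word
        }
    }

    boolean isVocabWord(String token) { return indexByWord.containsKey(token); }

    int indexOf(String word) { return indexByWord.getOrDefault(word, -1); }

    int wordFrequency(String token) { return tokenCounts.getOrDefault(token, 0); }

    int docFrequency(String token) {
        Set<Integer> docs = docsByToken.get(token);
        return docs == null ? 0 : docs.size();
    }

    public static void main(String[] args) {
        ToyVocabCache cache = new ToyVocabCache(2);
        String[][] docs = {{"the", "cat"}, {"the", "dog"}};
        for (int d = 0; d < docs.length; d++) {
            for (String token : docs[d]) {
                cache.track(token, d);
            }
        }
        // prints true 0 2
        System.out.println(cache.isVocabWord("the") + " " + cache.indexOf("the") + " " + cache.docFrequency("the"));
        // prints false 1
        System.out.println(cache.isVocabWord("cat") + " " + cache.wordFrequency("cat"));
    }
}
```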
