Iteration of words, documents, and sentences for language processing in DL4J.
A sentence iterator is used in both Word2vec and Bag of Words.
It feeds pieces of text into a neural network in the form of vectors, and it also covers the concept of documents in text processing.
In natural-language processing, a document or sentence is typically used to encapsulate a context which an algorithm should learn.
Examples range from analyzing Tweets to full-length news articles. The purpose of the sentence iterator is to divide text into processable pieces. Note that the sentence iterator is input-agnostic, so the text (a document) can come from a file system, the Twitter API, or Hadoop.
Depending on how the input is processed, the output of a sentence iterator is then passed to a tokenizer, which handles individual tokens; these are usually words, but could also be n-grams, skip-grams, or other units. The tokenizer is created on a per-sentence basis by a tokenizer factory. The tokenizer factory is what is passed into a text-processing vectorizer.
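The flow described above (sentence iterator, then a tokenizer factory that builds one tokenizer per sentence) can be sketched with the JDK alone. This is only an illustration of the shape of the pipeline: `PipelineSketch`, `whitespaceTokenizerFactory`, and `tokenizeAll` are hypothetical names, plain `Iterator` and `Function` stand in for the real types, and whitespace splitting is an assumed tokenization rule, not DL4J's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in types for illustration only; this is not DL4J's API.
final class PipelineSketch {

    /** Plays the "tokenizer factory" role: builds a fresh token iterator for one sentence. */
    static Function<String, Iterator<String>> whitespaceTokenizerFactory() {
        // Assumed rule: tokens are separated by runs of whitespace.
        return sentence -> Arrays.asList(sentence.trim().split("\\s+")).iterator();
    }

    /** Feeds every sentence through a per-sentence tokenizer and collects the tokens. */
    static List<List<String>> tokenizeAll(Iterator<String> sentences,
                                          Function<String, Iterator<String>> factory) {
        List<List<String>> result = new ArrayList<>();
        while (sentences.hasNext()) {
            // One tokenizer is created for each sentence.
            Iterator<String> tokenizer = factory.apply(sentences.next());
            List<String> tokens = new ArrayList<>();
            tokenizer.forEachRemaining(tokens::add);
            result.add(tokens);
        }
        return result;
    }

    public static void main(String[] args) {
        Iterator<String> sentences = Arrays.asList("the cat sat", "dogs bark loudly").iterator();
        System.out.println(tokenizeAll(sentences, whitespaceTokenizerFactory()));
    }
}
```

Swapping in a different factory (for example, one that emits n-grams) would change the token units without touching the sentence source.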
Some typical examples are below:
This assumes that each line in a file is a sentence.
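The line-per-sentence assumption can be made concrete with a small, library-free sketch. `LineSentences` is a hypothetical class written for this illustration; it does not reproduce DL4J's own line-based iterator, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical class for illustration; not DL4J's own line iterator.
final class LineSentences implements Iterator<String>, AutoCloseable {
    private final BufferedReader reader;
    private String nextLine; // the line that next() will return, or null at end of file

    LineSentences(Path file) throws IOException {
        this.reader = Files.newBufferedReader(file, StandardCharsets.UTF_8); // assumed encoding
        advance();
    }

    // Reads one line ahead so hasNext() can answer without consuming input.
    private void advance() {
        try {
            nextLine = reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return nextLine != null;
    }

    @Override
    public String next() {
        if (nextLine == null) {
            throw new NoSuchElementException();
        }
        String sentence = nextLine; // each line is treated as one sentence
        advance();
        return sentence;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```

Because the class holds an open reader, it is best used inside a try-with-resources block.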
You can also use a list of strings as the sentences, as follows:
This will assume that each string is a sentence (document). Remember this could be a list of Tweets or articles -- both are applicable.
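As a rough, library-free sketch of the same idea, the class below walks an in-memory collection and treats each string as one sentence (document). `ListSentences` and its `reset()` method are hypothetical names for this illustration, not DL4J's collection-backed iterator; `reset()` is included on the assumption that a training loop may want more than one pass over the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

// Hypothetical class for illustration; not DL4J's collection-backed iterator.
final class ListSentences implements Iterator<String> {
    private final List<String> sentences;
    private Iterator<String> cursor;

    ListSentences(Collection<String> sentences) {
        this.sentences = new ArrayList<>(sentences); // defensive copy of the input
        this.cursor = this.sentences.iterator();
    }

    @Override
    public boolean hasNext() {
        return cursor.hasNext();
    }

    @Override
    public String next() {
        return cursor.next(); // each string is one sentence (document)
    }

    /** Rewinds to the first string so the collection can be read again. */
    void reset() {
        cursor = sentences.iterator();
    }

    public static void main(String[] args) {
        ListSentences tweets = new ListSentences(Arrays.asList("first tweet", "second tweet"));
        while (tweets.hasNext()) {
            System.out.println(tweets.next());
        }
    }
}
```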
You can iterate over files as follows:
This will parse the files line by line and return each line as an individual sentence.
For anything more complex, we recommend a full machine-learning-level pipeline, represented by the UimaSentenceIterator.
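The behavior described here, reading every file line by line, can be approximated with the JDK's file APIs. This is a simplified sketch under stated assumptions: `FileSentences.readSentences` is a hypothetical helper, it loads all lines eagerly instead of streaming them, it skips blank lines, and it assumes UTF-8 text. It is not DL4J's file-based iterator.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not DL4J's file-based iterator.
final class FileSentences {

    /** Collects non-blank lines from a single file, or from every regular file under a directory. */
    static List<String> readSentences(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            // Sort paths so the sentence order is deterministic across runs.
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> sentences = new ArrayList<>();
        for (Path file : files) {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    sentences.add(trimmed); // one line is one sentence
                }
            }
        }
        return sentences;
    }
}
```

For a large corpus, a lazy iterator that opens one file at a time would be the better design; the eager version is used here only to keep the sketch short.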
The UimaSentenceIterator is capable of tokenization, part-of-speech tagging and lemmatization, among other things. The UimaSentenceIterator iterates over a set of files and can segment sentences. You can customize its behavior based on the AnalysisEngine passed into it.
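To make "segmenting sentences" concrete without a UIMA setup, the sketch below uses the JDK's `java.text.BreakIterator`. This is a swapped-in, much simpler rule-based technique, shown only to illustrate what sentence segmentation produces; it is not what the UimaSentenceIterator does internally. `SentenceSplitSketch` is a hypothetical name, and `Locale.US` is an assumed locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: JDK rule-based segmentation, not a UIMA analysis engine.
final class SentenceSplitSketch {

    /** Splits free text into sentences using the JDK's locale-aware boundary rules. */
    static List<String> split(String text) {
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US); // assumed locale
        boundaries.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundaries.first();
        // Walk consecutive boundary pairs; each pair delimits one sentence.
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("The cat sat. The dog barked.")) {
            System.out.println(s);
        }
    }
}
```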
The AnalysisEngine is the UIMA concept of a text-processing pipeline. Deeplearning4j comes with standard analysis engines for all of these common tasks, allowing you to customize which text is passed in and how you define sentences. The AnalysisEngines are thread-safe versions of the OpenNLP pipelines. We also include ClearTK-based pipelines for handling common tasks.
For those using UIMA or curious about it, this employs the ClearTK type system for tokens, sentences, and other annotations.
Here's how to create a UimaSentenceIterator.
You can also instantiate directly:
For those familiar with UIMA, this uses uimaFIT extensively to create analysis engines. You can also create custom sentence iterators by extending SentenceIterator.
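The general shape of a custom iterator can be sketched as follows. To keep the example self-contained, it defines its own simplified contract, `MiniSentenceIterator`, which is hypothetical; DL4J's actual SentenceIterator may declare different or additional methods, so check the interface in your version before extending it. The example implementation, `ParagraphSentences`, treats each blank-line-separated paragraph as one sentence (document), which is an assumed rule chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical, simplified contract; DL4J's SentenceIterator may differ.
interface MiniSentenceIterator {
    boolean hasNext();
    String nextSentence();
    void reset();
}

// Custom iterator: each blank-line-separated paragraph counts as one sentence (document).
final class ParagraphSentences implements MiniSentenceIterator {
    private final List<String> paragraphs = new ArrayList<>();
    private int position = 0;

    ParagraphSentences(String text) {
        // Split on a newline, optional whitespace, then another newline.
        for (String block : text.split("\\n\\s*\\n")) {
            String normalized = block.trim().replaceAll("\\s+", " "); // collapse inner line breaks
            if (!normalized.isEmpty()) {
                paragraphs.add(normalized);
            }
        }
    }

    @Override
    public boolean hasNext() {
        return position < paragraphs.size();
    }

    @Override
    public String nextSentence() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return paragraphs.get(position++);
    }

    @Override
    public void reset() {
        position = 0; // start over from the first paragraph
    }
}
```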