Iteration of words, documents, and sentences for language processing in DL4J.
A sentence iterator is used in both Word2vec and bag-of-words models.
It feeds bits of text into a neural network in the form of vectors, and it also defines what counts as a document in text processing.
In natural-language processing, a document or sentence is typically used to encapsulate a context which an algorithm should learn.
A few examples include analyzing Tweets and full-blown news articles. The purpose of the sentence iterator is to divide text into processable bits. Note that the sentence iterator is input-agnostic: bits of text (i.e. documents) can come from a file system, the Twitter API, or Hadoop.
Depending on how the input is processed, the output of a sentence iterator is then passed to a tokenizer, which processes individual tokens; these are usually words, but could also be n-grams, skip-grams, or other units. The tokenizer is created on a per-sentence basis by a tokenizer factory, and the tokenizer factory is what gets passed into a text-processing vectorizer.
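To make this flow concrete, here is a minimal sketch that wires a sentence iterator and a tokenizer factory into DL4J's Word2Vec. The file name and the hyperparameter values are placeholders, not recommendations:

```java
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

import java.io.File;

public class Word2VecPipeline {
    public static void main(String[] args) {
        // Sentence iterator: each line of the (placeholder) file is one sentence.
        SentenceIterator iter = new LineSentenceIterator(new File("raw_sentences.txt"));

        // Tokenizer factory: creates a tokenizer per sentence;
        // CommonPreprocessor lower-cases tokens and strips punctuation.
        TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();
        tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

        // The vectorizer receives both the iterator and the factory.
        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)   // illustrative values
                .layerSize(100)
                .windowSize(5)
                .iterate(iter)
                .tokenizerFactory(tokenizerFactory)
                .build();
        vec.fit();
    }
}
```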
Some typical examples are below:
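A minimal sketch using LineSentenceIterator (the file path is a placeholder):

```java
import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

import java.io.File;

SentenceIterator iter = new LineSentenceIterator(new File("your/file.txt"));
```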
This assumes that each line in a file is a sentence.
You can also treat a list of strings as sentences, as follows:
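A sketch using CollectionSentenceIterator; the strings here are placeholders:

```java
import org.deeplearning4j.text.sentenceiterator.CollectionSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

import java.util.Arrays;
import java.util.Collection;

// Each string in the collection is treated as one sentence (document).
Collection<String> sentences = Arrays.asList(
        "This is the first document.",
        "This is the second document.");
SentenceIterator iter = new CollectionSentenceIterator(sentences);
```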
This will assume that each string is a sentence (document). Remember, this could be a list of Tweets or of articles; both are applicable.
You can iterate over files as follows:
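A sketch using FileSentenceIterator, pointed at a placeholder directory:

```java
import org.deeplearning4j.text.sentenceiterator.FileSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

import java.io.File;

// Iterates over every file in the directory, reading each one line by line.
SentenceIterator iter = new FileSentenceIterator(new File("your/dir"));
```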
This will parse the files line by line and return individual sentences from each one.
For anything more complex, we recommend a pipeline that offers more in-depth support than space-separated tokens.
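One such option is DL4J's UimaSentenceIterator, which delegates sentence segmentation to UIMA. A sketch, assuming the deeplearning4j-nlp-uima module is on the classpath and using a placeholder path:

```java
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.sentenceiterator.UimaSentenceIterator;

// Uses UIMA to segment the text files under the given path into sentences.
SentenceIterator iter = UimaSentenceIterator.createWithPath("path/to/your/text/documents");
```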