Breaking text into individual words for language processing in DL4J.
What is Tokenization?
Tokenization is the process of breaking text down into individual words, called tokens. Word windows are also composed of tokens. Word2Vec can output text windows that comprise training examples for input into neural nets.
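To make the two ideas concrete, here is a minimal, library-free sketch in plain Java. It is not the DL4J API: the class `TokenizationSketch` and its methods `tokenize` and `windows` are hypothetical names chosen for illustration, and the splitting rule (lowercase, then split on anything that is not a letter or digit) is an assumption. The fixed-size sliding window is a simplification of how word windows might be built from tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Library-free illustration only; this is NOT the DL4J tokenizer API.
class TokenizationSketch {

    // Split text into lowercase word tokens on any run of characters
    // that are neither letters nor digits (assumed splitting rule).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) { // skip the empty piece produced by a leading separator
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Slide a fixed-size window over the tokens, one position at a time.
    // Each window is a list of consecutive tokens.
    static List<List<String>> windows(List<String> tokens, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("window size must be at least 1");
        }
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start + size <= tokens.size(); start++) {
            result.add(new ArrayList<>(tokens.subList(start, start + size)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("The quick brown fox jumps.");
        System.out.println(tokens);             // [the, quick, brown, fox, jumps]
        System.out.println(windows(tokens, 3)); // [[the, quick, brown], [quick, brown, fox], [brown, fox, jumps]]
    }
}
```

A real tokenizer usually does more than this sketch, for example handling punctuation, contractions, or lemmatization, so treat the output above as a rough picture of what tokens and windows look like rather than what a production pipeline emits.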
Here's an example of tokenization done with DL4J tools:
// tokenization with lemmatization, part-of-speech tagging, and sentence segmentation