Tokenization
Breaking text into individual words for language processing in DL4J.
What is Tokenization?
Example
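import java.util.List;
import org.deeplearning4j.text.tokenization.tokenizer.Tokenizer;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.UimaTokenizerFactory;

// Tokenization with lemmatization, part-of-speech tagging, and sentence segmentation
TokenizerFactory tokenizerFactory = new UimaTokenizerFactory();
Tokenizer tokenizer = tokenizerFactory.create("mystring");
// Iterate over the tokens one at a time
while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
}
// Or get the whole list of tokens at once
List<String> tokens = tokenizer.getTokens();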
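To make the iteration pattern in the example easier to try without any DL4J dependency, here is a minimal sketch of a whitespace tokenizer that exposes the same three method names (hasMoreTokens, nextToken, getTokens). The classes SimpleTokenizer and SimpleTokenizerDemo are hypothetical and are not part of DL4J. The sketch only splits on whitespace and does none of the lemmatization, part-of-speech tagging, or sentence segmentation that the UIMA-based tokenizer provides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; it is NOT part of DL4J.
class SimpleTokenizer {
    private final List<String> tokens = new ArrayList<>();
    private int position = 0;

    SimpleTokenizer(String text) {
        // Trim first so leading whitespace does not produce an empty token,
        // then split on runs of one or more whitespace characters.
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            for (String piece : trimmed.split("\\s+")) {
                tokens.add(piece);
            }
        }
    }

    // True while there are tokens that nextToken() has not yet returned.
    boolean hasMoreTokens() {
        return position < tokens.size();
    }

    // Returns the next token and advances the cursor.
    String nextToken() {
        return tokens.get(position++);
    }

    // Returns every token, regardless of how far iteration has advanced.
    List<String> getTokens() {
        return Collections.unmodifiableList(tokens);
    }
}

class SimpleTokenizerDemo {
    public static void main(String[] args) {
        SimpleTokenizer tokenizer = new SimpleTokenizer("Deep learning for Java");
        // Consume tokens one at a time
        while (tokenizer.hasMoreTokens()) {
            System.out.println(tokenizer.nextToken());
        }
        // The full list is still available after iteration
        System.out.println(tokenizer.getTokens());
    }
}
```

In this sketch, iterating with nextToken() does not consume the list returned by getTokens(); whether a given DL4J tokenizer behaves the same way is an assumption to verify against the implementation you use.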