Deeplearning4j
Community ForumND4J JavadocDL4J Javadoc
EN 1.0.0-beta6
EN 1.0.0-beta6
  • Eclipse DeepLearning4J
  • Getting Started
    • Quickstart
    • Tutorials
      • Quickstart with MNIST
      • MultiLayerNetwork And ComputationGraph
      • Logistic Regression
      • Built-in Data Iterators
      • Feed Forward Networks
      • Basic Autoencoder
      • Advanced Autoencoder
      • Convolutional Networks
      • Recurrent Networks
      • Early Stopping
      • Layers and Preprocessors
      • Hyperparameter Optimization
      • Using Multiple GPUs
      • Clinical Time Series LSTM
      • Sea Temperature Convolutional LSTM
      • Sea Temperature Convolutional LSTM 2
      • Instacart Multitask Example
      • Instacart Single Task Example
      • Cloud Detection Example
    • Core Concepts
    • Cheat Sheet
    • Examples Tour
    • Deep Learning Beginners
    • Build from Source
    • Contribute
      • Eclipse Contributors
    • Benchmark Guide
    • About
    • Release Notes
  • Configuration
    • Backends
      • CPU and AVX
      • cuDNN
      • Performance Issues
    • Memory Management
      • Memory Workspaces
    • Snapshots
    • Maven
    • SBT, Gradle, & Others
  • Models
    • Autoencoders
    • Multilayer Network
    • Computation Graph
    • Convolutional Neural Network
    • Recurrent Neural Network
    • Layers
    • Vertices
    • Iterators
    • Listeners
    • Custom Layers
    • Model Persistence
    • Activations
    • Updaters
  • Model Zoo
    • Overview
    • Zoo Models
  • ND4J
    • Overview
    • Quickstart
    • Backend
    • Basics
    • Elementwise Operations
    • Matrix Manipulation
    • Syntax
    • Tensors
  • SAMEDIFF
    • Importing TensorFlow models
    • Variables
    • Ops
    • Adding Ops
  • Tuning & Training
    • Evaluation
    • Visualization
    • Trouble Shooting
    • Early Stopping
    • t-SNE Visualization
    • Transfer Learning
  • DISTRIBUTED DEEP LEARNING
    • Introduction/Getting Started
    • Technical Explanation
    • Spark Guide
    • Spark Data Pipelines Guide
    • API Reference
    • Parameter Server
  • Keras Import
    • Overview
    • Get Started
    • Supported Features
      • Activations
      • Losses
      • Regularizers
      • Initializers
      • Constraints
      • Optimizers
    • Functional Model
    • Sequential Model
    • API Reference
      • Core Layers
      • Convolutional Layers
      • Embedding Layers
      • Local Layers
      • Noise Layers
      • Normalization Layers
      • Pooling Layers
      • Recurrent Layers
      • Wrapper Layers
      • Advanced Activations
  • Arbiter
    • Overview
    • Layer Spaces
    • Parameter Spaces
  • Datavec
    • Overview
    • Records
    • Reductions
    • Schema
    • Serialization
    • Transforms
    • Analysis
    • Readers
    • Conditions
    • Executors
    • Filters
    • Operations
    • Normalization
    • Visualization
  • Language Processing
    • Overview
    • Word2Vec
    • Doc2Vec
    • Sentence Iteration
    • Tokenization
    • Vocabulary Cache
  • Mobile (Android)
    • Setup
    • Tutorial: First Steps
    • Tutorial: Classifier
    • Tutorial: Image Classifier
  • FAQ
  • Press
  • Support
  • Why Deep Learning?
Powered by GitBook
On this page
  • What is Tokenization?
  • Example

Was this helpful?

Edit on Git
Export as PDF
  1. Language Processing

Tokenization

Breaking text into individual words for language processing in DL4J.

PreviousSentence IterationNextVocabulary Cache

Last updated 5 years ago

Was this helpful?

What is Tokenization?

Tokenization is the process of breaking text down into individual words. Word windows are also composed of tokens. can output text windows that comprise training examples for input into neural nets, as seen here.

Example

Here's an example of tokenization done with DL4J tools:

//tokenization with lemmatization,part of speech taggin,sentence segmentation
TokenizerFactory tokenizerFactory = new UimaTokenizerFactory();
Tokenizer tokenizer = tokenizerFactory.tokenize("mystring");

//iterate over the tokens
while(tokenizer.hasMoreTokens()) {
      String token = tokenizer.nextToken();
}

//get the whole list of tokens
List<String> tokens = tokenizer.getTokens();

The above snippet creates a tokenizer capable of stemming.

In Word2Vec, that's the recommended a way of creating a vocabulary, because it averts various vocabulary quirks, such as the singular and plural of the same noun being counted as two different words.

Word2Vec