Vocabulary Cache

Mechanism for handling general NLP tasks in DL4J.

The vocabulary cache, or vocab cache, is Deeplearning4j's mechanism for general-purpose natural-language tasks, including TF-IDF, word vectors, and certain information-retrieval techniques. It is meant to be a one-stop shop for text vectorization, encapsulating techniques common to bag-of-words and word-vector models, among others.

The vocab cache stores tokens, word-count frequencies, inverse document frequencies, and document occurrences via an inverted index. InMemoryLookupCache is the reference implementation.
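To make the bookkeeping concrete, here is a minimal sketch of an inverted index that records word counts and document occurrences and derives an inverse document frequency from them. It uses only the Java standard library and is not DL4J's InMemoryLookupCache; the class name, the method names, and the log(N / df) formula (one common IDF variant) are assumptions chosen for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not the DL4J implementation.
class InvertedIndexSketch {
    // word -> ids of the documents that contain it (the inverted index)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // word -> total number of occurrences across all documents
    private final Map<String, Integer> wordCounts = new HashMap<>();
    private int numDocs = 0;

    // Indexes one tokenized document and returns the id assigned to it.
    int addDocument(String[] tokens) {
        int docId = numDocs++;
        for (String token : tokens) {
            wordCounts.merge(token, 1, Integer::sum);
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    int wordFrequency(String word) {
        return wordCounts.getOrDefault(word, 0);
    }

    Set<Integer> documents(String word) {
        return postings.getOrDefault(word, Collections.<Integer>emptySet());
    }

    int docFrequency(String word) {
        return documents(word).size();
    }

    // Assumed IDF variant: log(N / df); returns 0 for words never seen.
    double idf(String word) {
        int df = docFrequency(word);
        return df == 0 ? 0.0 : Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(new String[] {"the", "cat"});
        index.addDocument(new String[] {"the", "dog"});
        System.out.println("documents containing 'the': " + index.documents("the"));
        System.out.println("idf of 'cat': " + index.idf("cat"));
    }
}
```

A word that appears in every document gets an IDF of zero under this variant, which is why very common words contribute little to TF-IDF scores.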

To use a vocab cache as you iterate over text and index tokens, you need to decide whether each token should be included in the vocab. The usual criterion is whether the token occurs in the corpus more often than a preconfigured frequency. A token below that frequency is not a vocab word; it remains just a token.
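The distinction between tokens and vocab words can be sketched with a small standalone class. This is a hedged illustration in plain Java, not DL4J code: the class name, the frequencyThreshold parameter, and the strict "more than" comparison are assumptions taken from the description above rather than from the library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not DL4J's InMemoryLookupCache.
class MinFrequencyVocabSketch {
    private final int frequencyThreshold;                       // assumed preconfigured frequency
    private final Map<String, Integer> tokenCounts = new HashMap<>();
    private final Map<String, Integer> vocabIndex = new HashMap<>();
    private final List<String> indexToWord = new ArrayList<>();

    MinFrequencyVocabSketch(int frequencyThreshold) {
        this.frequencyThreshold = frequencyThreshold;
    }

    // Every token is counted; it is promoted to a vocab word only once
    // its count exceeds the threshold.
    void addToken(String token) {
        int count = tokenCounts.merge(token, 1, Integer::sum);
        if (count > frequencyThreshold && !vocabIndex.containsKey(token)) {
            vocabIndex.put(token, indexToWord.size());
            indexToWord.add(token);
        }
    }

    boolean hasToken(String token) {
        return tokenCounts.containsKey(token);
    }

    boolean isVocabWord(String token) {
        return vocabIndex.containsKey(token);
    }

    int wordFrequency(String token) {
        return tokenCounts.getOrDefault(token, 0);
    }

    // Returns the vocab index of the word, or -1 if it is only a token.
    int indexOf(String token) {
        return vocabIndex.getOrDefault(token, -1);
    }

    int numWords() {
        return indexToWord.size();
    }

    public static void main(String[] args) {
        MinFrequencyVocabSketch vocab = new MinFrequencyVocabSketch(1);
        for (String token : "the cat sat on the mat".split(" ")) {
            vocab.addToken(token);
        }
        System.out.println("the: vocab word? " + vocab.isVocabWord("the"));
        System.out.println("cat: vocab word? " + vocab.isVocabWord("cat"));
    }
}
```

With a threshold of 1, only words seen at least twice are promoted, so rare tokens stay countable without taking up an index in the vocab.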

Tokens are tracked as well. To track a token, do the following:

// 1.0 is the initial word frequency; "myword" is the token text
addToken(new VocabWord(1.0, "myword"));

When you want to add a vocab word, do the following:

addWordToIndex(0, Word2Vec.UNK);
putVocabWord(Word2Vec.UNK);

Adding the word to the index assigns its index. You then declare it as a vocab word, which pulls the word from the index.
