Doc2Vec and arbitrary documents for language processing in DL4J.
The main purpose of Doc2Vec is associating arbitrary documents with labels, so labels are required. Doc2Vec is an extension of Word2Vec that learns to correlate labels with words, rather than words with other words. Deeplearning4j's implementation is intended to serve the Java, Scala and Clojure communities.
The first step is coming up with a vector that represents the "meaning" of a document, which can then be used as input to a supervised machine learning algorithm to associate documents with labels.
In the ParagraphVectors builder pattern, the labels()
method points to the labels to train on. In the example below, you can see labels related to sentiment analysis:
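For instance, here is an excerpt from a builder chain setting sentiment labels (the label strings are illustrative, and newer releases may expose this through labelsSource() instead):

```java
.labels(Arrays.asList("negative", "neutral", "positive"))
```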
Here's a full working example of classification with paragraph vectors:
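The sketch below reconstructs such an example under stated assumptions: a labeled/ directory with one subfolder per label (e.g. labeled/positive), the FileLabelAwareIterator and DefaultTokenizerFactory classes from deeplearning4j-nlp, and classification by cosine similarity between an inferred document vector and each label's vector:

```java
import java.io.File;
import java.util.Arrays;

import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.documentiterator.FileLabelAwareIterator;
import org.deeplearning4j.text.documentiterator.LabelAwareIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.ops.transforms.Transforms;

public class ParagraphVectorsClassifierSketch {
    public static void main(String[] args) {
        // One subdirectory per label: labeled/positive, labeled/negative, labeled/neutral
        LabelAwareIterator iterator = new FileLabelAwareIterator.Builder()
                .addSourceFolder(new File("labeled"))
                .build();

        TokenizerFactory tokenizerFactory = new DefaultTokenizerFactory();

        ParagraphVectors vectors = new ParagraphVectors.Builder()
                .minWordFrequency(1)
                .layerSize(100)
                .epochs(20)
                .iterate(iterator)
                .trainWordVectors(true)
                .tokenizerFactory(tokenizerFactory)
                .build();
        vectors.fit();

        // Infer a vector for unseen text and score it against each label's vector
        INDArray inferred = vectors.inferVector("This movie was surprisingly good");
        for (String label : Arrays.asList("positive", "negative", "neutral")) {
            double sim = Transforms.cosineSim(inferred, vectors.getWordVectorMatrix(label));
            System.out.println(label + ": " + sim);
        }
    }
}
```

The label with the highest cosine similarity is taken as the predicted class for the document.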
Iteration of words, documents, and sentences for language processing in DL4J.
A sentence iterator is used in both Word2vec and Bag of Words.
It feeds bits of text into a neural network in the form of vectors, and also covers the concept of documents in text processing.
In natural-language processing, a document or sentence is typically used to encapsulate a context which an algorithm should learn.
A few examples include analyzing Tweets and full-blown news articles. The purpose of the sentence iterator is to divide text into processable bits. Note that the sentence iterator is input-agnostic, so bits of text (a document) can come from a file system, the Twitter API or Hadoop.
Depending on how input is processed, the output of a sentence iterator will then be passed to a tokenizer for the processing of individual tokens, which are usually words, but could also be ngrams, skipgrams or other units. The tokenizer is created on a per-sentence basis by a tokenizer factory. The tokenizer factory is what is passed into a text-processing vectorizer.
Some typical examples are below:
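A minimal sketch, assuming a plain-text file and the LineSentenceIterator from the deeplearning4j-nlp sentence-iterator package (the file path is a placeholder):

```java
import java.io.File;
import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

// Treat each line of the file as one sentence
SentenceIterator iter = new LineSentenceIterator(new File("your.txt"));
```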
This assumes that each line in a file is a sentence.
You can also use a list of strings as sentences, as follows:
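For example, using CollectionSentenceIterator over an in-memory collection (the strings here are illustrative):

```java
import java.util.Arrays;
import java.util.Collection;
import org.deeplearning4j.text.sentenceiterator.CollectionSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

Collection<String> sentences = Arrays.asList("Here's a tweet.", "Here's a news article.");
SentenceIterator iter = new CollectionSentenceIterator(sentences);
```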
This will assume that each string is a sentence (document). Remember this could be a list of Tweets or articles -- both are applicable.
You can iterate over files as follows:
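A sketch using FileSentenceIterator, which walks every file in a directory (the path is a placeholder):

```java
import java.io.File;
import org.deeplearning4j.text.sentenceiterator.FileSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

// Iterates over all files in the directory, yielding sentences from each
SentenceIterator iter = new FileSentenceIterator(new File("/path/to/dir"));
```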
This will parse the files line by line and return individual sentences on each one.
For anything more complex, we recommend a pipeline that implements more in-depth support than space-separated tokens.
Breaking text into individual words for language processing in DL4J.
This section covers the TokenizerFactory interface, the Tokenizer interface, and how to write your own factory and tokenizer.
Tokenization is the process of breaking text down into individual words. Word windows are also composed of tokens. Word2Vec can output text windows that comprise training examples for input into neural nets.
Here's an example of tokenization done with DL4J tools:
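A minimal sketch using DefaultTokenizerFactory with a custom TokenPreProcess that does crude suffix stripping as a stand-in for a real stemmer (the input string and the stripping rule are illustrative):

```java
import org.deeplearning4j.text.tokenization.tokenizer.Tokenizer;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

TokenizerFactory factory = new DefaultTokenizerFactory();

// A custom TokenPreProcess: lowercase each token and strip a trailing plural "s"
factory.setTokenPreProcessor(token -> {
    String t = token.toLowerCase();
    if (t.length() > 3 && t.endsWith("s")) {
        t = t.substring(0, t.length() - 1);
    }
    return t;
});

Tokenizer tokenizer = factory.create("Dogs chase cats");
while (tokenizer.hasMoreTokens()) {
    System.out.println(tokenizer.nextToken()); // dog, chase, cat
}
```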
The above snippet creates a tokenizer capable of stemming.
In Word2Vec, that's the recommended way of creating a vocabulary, because it averts various vocabulary quirks, such as the singular and plural of the same noun being counted as two different words.
Overview of language processing in DL4J
Although not designed to be comparable to tools such as Stanford CoreNLP or NLTK, Deeplearning4j does include some core text-processing tools, which are described here.
Deeplearning4j's NLP support contains interfaces for different NLP libraries; users wrap third-party libraries via these interfaces. As of M1, Deeplearning4j does not support any third-party libraries directly, due to the lack of maintenance and the custom work needed to make them work well for users. Instead, we expose interfaces that allow users to implement their own tokenizers.
There are several steps involved in processing natural language. The first is to iterate over your corpus to create a list of documents, which can be as short as a tweet, or as long as a newspaper article. This is performed by a SentenceIterator, which will appear like this:
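A minimal sketch (the file path is a placeholder; BasicLineIterator is one of the bundled implementations):

```java
import java.io.File;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

SentenceIterator iter = new BasicLineIterator(new File("path/to/corpus.txt"));
while (iter.hasNext()) {
    String sentence = iter.nextSentence(); // one document/sentence per iteration
}
```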
The SentenceIterator encapsulates a corpus or text, organizing it, say, as one Tweet per line. It is responsible for feeding text piece by piece into your natural language processor. The SentenceIterator is not analogous to a similarly named class, the DatasetIterator, which creates a dataset for training a neural net. Instead it creates a collection of strings by segmenting a corpus.
A Tokenizer further segments the text at the level of single words, or alternatively as n-grams. ClearTK contains the underlying tokenizers, such as parts of speech (PoS) and parse trees, which allow for both dependency and constituency parsing, like that employed by a recursive neural tensor network (RNTN).
A Tokenizer is created and wrapped by a TokenizerFactory. The default tokens are words separated by spaces. The tokenization process also involves some machine learning to differentiate between ambiguous symbols, such as the period, which ends sentences but also abbreviates words such as "Mr." and "vs."
Both Tokenizers and SentenceIterators work with Preprocessors to deal with anomalies in messy text like Unicode, and to render such text, say, as lowercase characters uniformly.
Each document has to be tokenized to create a vocab, the set of words that matter for that document or corpus. Those words are stored in the vocab cache, which contains statistics about a subset of words counted in the document, the words that "matter". The line separating significant and insignificant words is mobile, but the basic idea of distinguishing between the two groups is that words occurring only once (or less than, say, five times) are hard to learn, and their presence represents unhelpful noise.
The vocab cache stores metadata for methods such as Word2vec and Bag of Words, which treat words in radically different ways. Word2vec creates representations of words, or neural word embeddings, in the form of vectors that are hundreds of coefficients long. Those coefficients help neural nets predict the likelihood of a word appearing in any given context; for example, after another word. Here's Word2vec, configured:
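A hedged configuration sketch; the hyperparameter values are illustrative, and iter and t stand for a SentenceIterator and TokenizerFactory you have already constructed:

```java
import org.deeplearning4j.models.word2vec.Word2Vec;

Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)   // ignore words seen fewer than 5 times
        .layerSize(100)        // number of coefficients per word vector
        .windowSize(5)         // context window width
        .seed(42)
        .iterate(iter)         // your SentenceIterator
        .tokenizerFactory(t)   // your TokenizerFactory
        .build();
vec.fit();
```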
Once you obtain word vectors, you can feed them into a deep net for classification, prediction, sentiment analysis and the like.
Quickstart for Java using Maven
This is everything you need to run DL4J examples and begin your own projects.
We recommend that you join our community forums. There you can request help and give feedback, but please do use this guide before asking questions we've answered below. If you are new to deep learning, we've included a road map for beginners with links to courses, readings and other resources.
We are currently reworking the Getting Started Guide.
If you find that you have trouble following along here, take a look at the Konduit blog, as it features step-by-step tutorials.
Deeplearning4j is a domain-specific language to configure deep neural networks, which are made of multiple layers. Everything starts with a MultiLayerConfiguration
, which organizes those layers and their hyperparameters.
Hyperparameters are variables that determine how a neural network learns. They include how many times to update the weights of the model, how to initialize those weights, which activation function to attach to the nodes, which optimization algorithm to use, and how fast the model should learn. This is what one configuration would look like:
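Below is a hedged sketch of such a configuration; the layer sizes suit an MNIST-style input (784 features, 10 classes), and the updater and activation choices are illustrative rather than mandatory:

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.DenseLayer;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Nesterovs;
import org.nd4j.linalg.lossfunctions.LossFunctions;

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(123)                          // fix the random seed for reproducibility
        .updater(new Nesterovs(0.01, 0.9))  // learning rate 0.01, momentum 0.9
        .list()
        .layer(0, new DenseLayer.Builder()  // zero-indexed input layer
                .nIn(784).nOut(100)
                .activation(Activation.RELU)
                .build())
        .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nIn(100).nOut(10)
                .activation(Activation.SOFTMAX)
                .build())
        .build();
```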
With Deeplearning4j, you add a layer by calling layer
on the NeuralNetConfiguration.Builder()
, specifying its place in the order of layers (the zero-indexed layer below is the input layer), the number of input and output nodes, nIn
and nOut
, as well as the type: DenseLayer
.
Once you've configured your net, you train the model with model.fit
.
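For instance, assuming the conf object from above and a DataSetIterator named trainData that you construct for your dataset:

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;

MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
model.fit(trainData); // trains the network on your DataSetIterator
```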
You should have these installed to use this QuickStart guide. DL4J targets professional Java developers who are familiar with production deployments, IDEs and automated build tools. Working with DL4J will be easiest if you already have experience with these.
Please make sure you have a 64-bit version of Java installed, as you will see an error telling you no jnind4j in java.library.path if you try to use a 32-bit version instead. Make sure the JAVA_HOME environment variable is set.
If you are working on a Mac, you can simply enter the following into the command line:
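One common way to set JAVA_HOME, assuming the default bash shell (adapt the profile file if you use zsh):

```shell
echo 'export JAVA_HOME=$(/usr/libexec/java_home)' >> ~/.bash_profile
```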
The latest version of Mac's Mojave OS breaks git, producing the following error message:
xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
This can be fixed by running:
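```shell
xcode-select --install
```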
Use the command line to enter the following:
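```shell
git clone https://github.com/deeplearning4j/dl4j-examples.git
cd dl4j-examples/
mvn clean install
```

(The repository URL reflects the 'dl4j-examples' project referenced below; if the examples have since moved, substitute the current URL.)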
Open IntelliJ and choose Import Project. Then select the main 'dl4j-examples' directory. (Note: the example in the illustration below refers to an outdated repository named dl4j-0.4-examples. However, the repository that you will download and install will be called dl4j-examples.)

![select directory](../../.gitbook/assets/install_intj_1%20(2).png)
Choose 'Import project from external model' and ensure that Maven is selected.
![select directory](../../.gitbook/assets/install_intj_2%20(2).png)
Continue through the wizard's options. Select the SDK that begins with jdk
. (You may need to click on a plus sign to see your options...) Then click Finish. Wait a moment for IntelliJ to download all the dependencies. You'll see the horizontal bar working on the lower right.
Pick an example from the file tree on the left. Right-click the file to run.
![run IntelliJ example](../../.gitbook/assets/install_intj_3%20(3).png)
To run DL4J in your own projects, we highly recommend using Maven for Java users, or a tool such as SBT for Scala. The basic set of dependencies and their versions are shown below. This includes:
deeplearning4j-core
, which contains the neural network implementations
nd4j-native-platform
, the CPU version of the ND4J library that powers DL4J
datavec-api
- DataVec is our library for vectorizing and loading data
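A sketch of the corresponding pom.xml section; define a dl4j.version property with the current release from Maven Central (DL4J, ND4J and DataVec share release versions):

```xml
<dependencies>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>${dl4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-native-platform</artifactId>
        <version>${dl4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.datavec</groupId>
        <artifactId>datavec-api</artifactId>
        <version>${dl4j.version}</version>
    </dependency>
</dependencies>
```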
To run the example, right click on it and select the green button in the drop-down menu. You will see, in IntelliJ's bottom window, a series of scores. The rightmost number is the error score for the network's classifications. If your network is learning, then that number will decrease over time with each batch it processes. At the end, this window will tell you how accurate your neural-network model has become:
![](../../.gitbook/assets/mlp_classifier_results%20(4).png)
In another window, a graph will appear, showing you how the multilayer perceptron (MLP) has classified the data in the example. It will look like this:
Congratulations! You just trained your first neural network with Deeplearning4j.
Q: I'm using a 64-Bit Java on Windows and still get the no jnind4j in java.library.path
error
A: You may have incompatible DLLs on your PATH. To tell DL4J to ignore those, you have to add the following as a VM parameter (Run -> Edit Configurations -> VM Options in IntelliJ):
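The commonly documented workaround is an empty java.library.path, which keeps the JVM from picking up the incompatible DLLs:

```
-Djava.library.path=""
```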
Q: SPARK ISSUES: I am running the examples and having issues with the Spark-based examples, such as distributed training or DataVec transform options.
Windows users might be seeing something like:
Now that you've learned how to run the different examples, we've made a template available for you that has a basic MNIST trainer with simple evaluation code.
To use the template:
Copy the standalone-sample-project
from the examples and give it the name of your project.
Import the folder into IntelliJ.
Start coding!
Mechanism for handling general NLP tasks in DL4J.
The vocabulary cache, or vocab cache, is a mechanism for handling general-purpose natural-language tasks in Deeplearning4j, including normal TF-IDF, word vectors and certain information-retrieval techniques. The goal of the vocab cache is to be a one-stop shop for text vectorization, encapsulating techniques common to bag of words and word vectors, among others.
The vocab cache handles the storage of tokens, word-count frequencies, inverse-document frequencies and document occurrences via an inverted index. The InMemoryLookupCache is the reference implementation.
In order to use a vocab cache as you iterate over text and index tokens, you need to figure out if the tokens should be included in the vocab. The criterion is usually if tokens occur with more than a certain pre-configured frequency in the corpus. Below that frequency, an individual token isn't a vocab word, and it remains just a token.
We track tokens as well. In order to track tokens, do the following:
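A minimal sketch, assuming the InMemoryLookupCache reference implementation mentioned above (the word and its frequency are illustrative):

```java
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.models.word2vec.wordstore.VocabCache;
import org.deeplearning4j.models.word2vec.wordstore.inmemory.InMemoryLookupCache;

VocabCache<VocabWord> cache = new InMemoryLookupCache();
cache.addToken(new VocabWord(1.0, "myword")); // track the token and its frequency
```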
When you want to add a vocab word, do the following:
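Continuing the sketch, the token is promoted to a vocab word in two steps:

```java
cache.addWordToIndex(0, "myword"); // assign the word an index
cache.putVocabWord("myword");      // declare it a vocab word (pulls the word from the index)
```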
Adding the word to the index sets the index. Then you declare it as a vocab word. (Declaring it as a vocab word will pull the word from the index.)
Java 1.7 or later (only 64-bit versions supported)
Apache Maven (automated build and dependency manager)
IntelliJ IDEA or Eclipse
If you are new to Java or unfamiliar with these tools, read the details below for help with installation and setup. Otherwise, skip ahead to the examples.
If you don't have Java 1.7 or later, download the current Java Development Kit (JDK). To check if you have a compatible version of Java installed, use the following command:
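```shell
java -version
```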
Maven is a dependency management and automated build tool for Java projects. It works well with IDEs such as IntelliJ and lets you install DL4J project libraries easily. Install or update Maven to the latest release by following the instructions for your system. To check if you have the most recent version of Maven installed, enter the following:
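```shell
mvn --version
```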
Maven is widely used among Java developers and it's pretty much mandatory for working with DL4J. If you come from a different background and Maven is new to you, check out Apache's Maven overview and our introduction to Maven for non-Java programmers, which includes some additional troubleshooting tips. Other build tools such as Ivy and Gradle can also work, but we support Maven best.
An Integrated Development Environment (IDE) allows you to work with our API and configure neural networks in a few steps. We strongly recommend using IntelliJ IDEA, which communicates with Maven to handle dependencies. The community edition of IntelliJ is free.
There are other popular IDEs such as Eclipse and NetBeans. However, IntelliJ is preferred, and using it will make finding help on the community forums easier if you need it.
Install the latest version of Git. If you already have Git, you can update to the latest version using Git itself:
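```shell
# one option: fetch Git's own source to build the latest release
git clone https://github.com/git/git.git
```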
Every Maven project has a POM file. Here is how the POM file should appear when you run your examples.
Within IntelliJ, you will need to choose the first Deeplearning4j example you're going to run. We suggest MLPClassifierLinear
, as you will almost immediately see the network classify two groups of data in our UI. The file is available on GitHub.
Join our community forums.
Read the introduction to deep neural networks.
Check out the more detailed documentation.
Python folks: If you plan to run benchmarks on Deeplearning4j comparing it to well-known Python framework [x], please read these guidelines on how to optimize heap space, garbage collection and ETL on the JVM. By following them, you will see at least a 10x speedup in training time.
A: You may be missing some dependencies that Spark requires. See this discussion of potential dependency issues. Windows users may need the winutils.exe from Hadoop.
Download winutils.exe and put it into null/bin/winutils.exe (or create a hadoop folder, place it there, and add that to HADOOP_HOME).
If that is the issue, see the troubleshooting notes. In this case, replace the backend with "Nd4jCpu".
The Quickstart template is available in the examples repository.