How to run deeplearning4j tests
test.heap.size: The heap size used for maven surefire plugin subprocesses
test.offheap.size: The off-heap size used for maven surefire plugin subprocesses. This is very important to configure correctly (especially on gpu systems).
In order to run the deeplearning4j tests, many pretrained models and other resources are required. Ensure the dl4j test resources are a dependency on your classpath. This is a large repository that needs to be mvn clean installed in order to run the tests properly. Add -Ptestresources to your test execution when running the tests from maven.
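A minimal sketch of that setup, assuming the test resources live in a local checkout named dl4j-test-resources (the directory names here are assumptions):

```bash
# install the test resources into the local maven repository
cd dl4j-test-resources   # assumed checkout directory
mvn clean install -DskipTests

# then run the tests with the test resources on the classpath
cd ../deeplearning4j
mvn test -Ptestresources
```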
When running deeplearning4j's tests, there are 2 main profiles to be aware of: nd4j-tests-cpu and nd4j-tests-cuda. These each enable running cpu or gpu tests respectively across the whole code base. Please ensure one of these is selected when running tests.
testresources: Used to add the test resources used for nd4j.
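For example, a cpu test run combining these profiles with the heap settings described above might look like the following; this assumes test.heap.size and test.offheap.size can be overridden as maven properties on the command line:

```bash
# run cpu tests with test resources and explicit surefire memory settings
mvn test -Pnd4j-tests-cpu,testresources \
  -Dtest.heap.size=4g \
  -Dtest.offheap.size=4g
```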
Deeplearning4j uses junit 5's tags to categorize tests into different types. All of the tag names used throughout the code base can be found here. Nd4j-common-tests is included as a dependency for all tests and has a few reusable utilities used throughout the code base for tests. This makes it a great location for common utilities we want to use throughout the code base. The tag names mainly exist to categorize tests that can take longer or use more resources, so we can avoid running those dynamically depending on the size of the machine the tests run on.
Note that when running gpu tests on a box with more than 1 gpu, tests can and will run out of memory if test.heap.size is not at least 4g.
How to conduct a release to Maven Central
Deeplearning4j has several steps to a release. Below is a brief outline with follow-on descriptions.
1. Compile libnd4j for different cpu architectures
2. Ensure the current javacpp dependencies such as python, mkldnn, cuda, .. are up to date
3. Run all integration tests on core platforms (windows, mac, linux) with both cpu and gpu
4. Create a staging repository for testing, using github actions run manually on each platform
5. Update the examples to be compatible with the latest release
6. Run the deeplearning4j-examples as a litmus test on all platforms (including embedded) to sanity check platform specific numerical bugs using the staging repository
7. Double check any user related bugs to see if they should block a release
8. Hit the release button
9. Perform a follow-up release of the -platform projects under the same version
10. Tag the release
Compiling libnd4j on different cpu architectures ensures there is platform-optimized math in c++ for each platform. The single code base is a self-contained cmake project that can be run on different platforms. Each github actions workflow contains steps for deploying for each platform.
At the core of compiling libnd4j from source is a maven pom.xml, run as part of the overall build process, that invokes our build script with various parameters which then get passed to our overall cmake structure for compilation. This script exists to formalize some of the required parameters for invoking cmake. Any developer is welcome to invoke cmake directly.
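For illustration, invoking the build script directly (rather than through maven) might look like the sketch below. The -b (build type) and -c (chip/backend) flag names are assumptions here; consult buildnativeoperations.sh itself for the options it actually accepts:

```bash
cd libnd4j
# assumed flags: release build of the cpu backend
./buildnativeoperations.sh -b release -c cpu
```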
Platform compatibility
We currently compile libnd4j on ubuntu 16.04. This means glibc 2.23.
For our cuda builds, we use gcc7.
Users of older glibc versions may need to compile from source. For our standard release, we try to keep the toolchain reasonably old, but we do not support end-of-life linux distributions for public builds.
Platform specific helpers
Each build of libnd4j links against an accelerated backend for blas and convolution operations such as onednn, cudnn, or armcompute. The implementations for each platform can be found here.
This step just ensures that the dl4j release matches the current state of the dependencies provided by javacpp on maven central. This affects every module including python4j, nd4j-native/cuda, and datavec-image, among others. The versions of everything can be found in the top level deeplearning4j pom. The general convention is the library version followed by a - and the version of javacpp that that artifact uses (for example, an openblas 0.3.x build paired with javacpp 1.5.x would be published as 0.3.x-1.5.x).
Of note here is that certain older versions of libraries can use older javacpp versions. It is recommended that the desired version be up to date if possible. Otherwise, if an older version of javacpp is the only version available, this is generally ok.
We run all of the major integration tests on the core major platforms where higher end compute is accessible; this generally means a bigger machine. Expect some builds to take up to 2 hours depending on the specs of the machine.
This step may also involve invoking tests with specific tags if only running a subset of tests is desired. This can be achieved using the surefire plugin's -Dgroups flag.
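A sketch of such a tagged run; the tag name long-running is a made-up placeholder, so substitute one of the real tag names from the code base:

```bash
# run only tests carrying a specific junit 5 tag via surefire's groups support
mvn test -Pnd4j-tests-cpu,testresources -Dgroups="long-running"
```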
To ensure the examples stay compatible with the current release, we also update the examples' dl4j version to the latest version found on maven central. This step may also involve adding or removing examples for new or deprecated features respectively. Combinations to verify include:
- Different supported cuda versions with and without cudnn
- Onednn and associated classifiers per platform
- Testing on the android emulator
The examples contain a set of tests which allow us to run maven clean test on a small number of examples. Instead of picking examples manually, we can run mvn clean test on any platform we need by specifying a version of dl4j to depend on and, usually, a staging repository.
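A sketch of such a litmus run, assuming the staging repository has already been added as a repository entry in the examples' pom.xml or settings.xml, and that the dl4j version is exposed as a maven property (the property name dl4j.version is an assumption):

```bash
# smoke test the examples against a release candidate version
cd deeplearning4j-examples
mvn clean test -Ddl4j.version=1.0.0-XX   # placeholder version
```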
Sometimes users will raise issues right before a release that can be critical. It is at the sole discretion of the maintainers to ask the user to use snapshots or to wait for a follow-on version. For certain fixes, we will publish quick bugfix releases. If your team has specific requirements on a release, please contact us on the community forums.
This means that after closing a staging repository, hitting the release button initiates a sync of the staging repository with the desired version to maven central. The sync usually takes 2 hours or less.
After a release happens, a version update to the stable version plus a github tag needs to happen. In the desktop app this is achieved by:
1. History
2. Right click on the target commit you want to tag
3. Click tag
4. Push the revision
5. Update the version back to snapshot after the tag
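The equivalent from the command line uses standard git commands (the version number is a placeholder):

```bash
# tag the release commit and push the tag to github
git tag -a 1.0.0-XX -m "Release 1.0.0-XX"
git push origin 1.0.0-XX
```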
Github Actions Configuration Overview
Each workflow has 10 parameters for manually invoking builds. The reason this is manual is the different ways a release can break. Being manual also allows us to re-invoke only the parts of a build we need, rather than the whole release pipeline.
Most workflows implement a matrix structure for handling different combinations of builds related to the following:
1. Platform specific optimizations: On windows/linux/mac we allow cpu + optional linking against mkldnn. Each combination is enumerated and run as part of a matrix build on github actions.
2. Cuda, optional cudnn: We also allow optional linking against cudnn for gpu routines.
buildThreads: The number of build threads used for compilation in libnd4j. This is the equivalent of make -j. For specific platforms that use more memory, 1 is the recommended value. On self-hosted setups, you may use more threads to make builds run faster.
deployToReleaseStaging: 0 or 1. If 1, this will create a staging repository on oss sonatype. Otherwise, it will deploy to ossrh snapshots. Snapshots is the default.
releaseVersion: The intended release version to convert to from snapshots. The version-change script is run to convert the versions of every module to this specific version intended for release. This is what gets uploaded to a staging repository for release. Otherwise, all versions should remain SNAPSHOT.
snapshotVersion: The current in development snapshot version
releaseRepoId: If blank, a new staging repository for the version is created. Otherwise, a staging repository id should be obtained from the sonatype ossrh nexus interface. This releaseRepoId should be passed to subsequent builds so all of the artifacts associated with a version get propagated to one place.
serverId: This should be ossrh 90% of the time. A github profile is also available for use with github actions.
modules: The maven modules to build. This is fairly raw and error prone. Typical usage is to skip libnd4j builds, which can speed builds up significantly; a sketch is shown below.
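A sketch of skipping the libnd4j compile via maven's project selection; the ! exclusion syntax requires maven 3.2.1 or newer:

```bash
# build everything except the libnd4j module
mvn -pl '!libnd4j' clean install -DskipTests
```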
libnd4jDownload/libnd4jUrl: In tandem with modules, you can specify a previously compiled libnd4j zip file distribution for download. The builds will download that libnd4j distribution and use it for linking. This can be handy when recompiling the nd4j-native/nd4j-cuda backends for a specific platform without needing to recompile the whole c++ codebase. In a matrix build, the url is sourced from a hard coded file name; each file name is updated to point to a zip file distribution appropriate for an individual matrix build. This was done because one url is not suitable for all the individual matrix builds.
runsOn: This is the operating system upon which to run the build. For linux, this defaults to ubuntu-16.04. For windows, windows-2019. self-hosted can also be specified for faster builds.
Many configurations on cpu and cuda require a matrix based build structure to capture the various combinations of optimizations and software versions people may want to use. In order to accommodate these workflows, we attach variables proxying the values of the manual inputs to the individual matrix workers themselves. These parameters are analogous to the parameters described above; we will not repeat the descriptions here. They appear in the form of $SOME_VALUE, where SOME_VALUE is one of the values above.
The configuration to look for is as follows:
- CUDA: Most cuda builds take 4-5 hours. Both windows and linux on GH actions just download the cuda distribution and compile things on their respective platforms.
- CPU builds: From-scratch libnd4j + cpu builds typically take 1-2 hours max. If a build takes much longer than that, something may be wrong.
- Out of disk: It is very common for a github actions VM to run out of disk. If a build fails with no logs and all steps terminated, this may be one of the reasons.
- Out of memory: Sometimes builds run out of memory. A few common causes include:
  - Clang on android: depending on the number of build threads assigned, it is easy for clang to run out of memory.
  - Maven javadoc: the maven javadoc plugin for bigger projects can use a ton of ram and crash a job.
- Network failures: Maven can sometimes (rarely) fail to download certain dependencies in the middle of a job.
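Two common mitigations for the memory issues above, using standard maven mechanisms:

```bash
# skip javadoc generation, which can exhaust ram on large modules
mvn clean install -Dmaven.javadoc.skip=true

# cap the heap available to the maven process itself
export MAVEN_OPTS="-Xmx2g"
```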
MAVEN_GPG_KEY: The maven gpg key secret for a release
CROSS_COMPILER_DIR: For the pi_build.sh script in libnd4j. This contains the root directory for cross compiler invocation. We need this because all cross compilation for various libnd4j builds happens on x86. We cross compile for speed reasons, and it also easily allows us to run on github actions.
DEBIAN_FRONTEND: Ensures that all debian commands by default don't prompt for yes/no confirmation.
GITHUB_TOKEN: This is for authentication with github actions
BUILD_USING_MAVEN: For pi_build.sh. Toggles (0 or 1) whether to use maven or to call buildnativeoperations.sh in the libnd4j root directory directly.
NDK_VERSION: Default is r21d. Libnd4j's android builds are currently compiled with android NDK r21.
CURRENT_TARGET: This variable is for pi_build.sh. It tells pi_build.sh which architecture to build for.
PUBLISH_TO: The repo to publish to for releases or snapshots. Valid values are github or ossrh.
These are repositories defined in the deeplearning4j root pom.
OPENBLAS_PATH: We compile libnd4j against openblas for several different cpus. Openblas is manually downloaded and linked against.
This specifies the path to the download for the libnd4j cmake invocation.
MAVEN_USERNAME: The username used to log in to the ossrh maven repository
MAVEN_PASSWORD: The password used to log in to the ossrh maven repository
MAVEN_GPG_PASSPHRASE: The gpg passphrase for signing artifacts uploaded to maven central
DEPLOY_TO: Valid values are either ossrh or github.
LIBND4J_BUILD_THREADS: This is the equivalent of make -j. It specifies the number of threads
that should be used to compile libnd4j
PERFORM_RELEASE: Whether to perform a release or not (0 or 1)
RELEASE_VERSION: The version to be released to maven central. change-versions.sh will be run to change versions throughout the code base from the snapshot version to the intended release version.
SNAPSHOT_VERSION: The current snapshot version to be changed when performing a release.
After a release is conducted, this should generally be the next development version.
RELEASE_REPO_ID: Leave this empty when first creating a release repository in combination with DEPLOY set to 1. Afterwards, note which staging repository id gets created in the ossrh interface when publishing to maven central. Use that id for further builds to ensure that all uploads for one version are synchronized to one staging repository.
MODULES: Extra maven flags for pi_build.sh if more flags are needed (such as for debugging or only building specific modules)
LIBND4J_URL: Used when building nd4j-native. If you do not want to recompile libnd4j for a particular build, you can skip that step and specify a libnd4j zip file download (generally built with the maven assembly plugin) instead.
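As an illustration of how these variables fit together for an embedded build, a hypothetical invocation might look like the following. Every value shown is a placeholder, not a known-good setting; consult the github actions workflows for real values:

```bash
export CROSS_COMPILER_DIR=/opt/cross   # placeholder cross compiler root
export CURRENT_TARGET=arm32            # placeholder architecture name
export BUILD_USING_MAVEN=1             # drive the build through maven
export DEPLOY_TO=ossrh
cd libnd4j && ./pi_build.sh
```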
DL4J and Javacpp
DL4J heavily depends on javacpp for its interop between java and platform-optimized c++ libraries. However, due to our usage of JNI, this comes with certain complexities in the build that anyone should be aware of.
The following modules rely on javacpp as part of their build process:
1. nd4j-native
2. nd4j-native-presets
3. nd4j-cuda
4. nd4j-cuda-presets
These libraries together comprise our nd4j backends. Leveraging libnd4j, javacpp handles linking each nd4j backend against the libnd4j c++ codebase. This linking is done using a libnd4j home, which contains all of the include files and necessary binary files for specific platforms. By default, nd4j backends and the libnd4j code base are compiled within the same build step; this is the recommended default. For specific circumstances, a libnd4j release uploaded to maven central as a zip file can be used in place of libnd4j compilation. See the libnd4jUrl parameter in our Github actions overview for more information on this.
Each backend consists of 2 modules:
1. The codebase: This represents the actual nd4j backend logic for specific platforms. Conceptually, this is anything a developer should need to control, such as memory management, environment variables, or other execution logic.
2. The presets: This is similar in spirit to the official javacpp presets. In order to avoid a race condition between the backend and the presets compilation, this is a separate dependency that just exists to handle interop between the libnd4j code base and the java frontend. The backend above then contains the rest of the logic needed for execution of the math operations on specific platforms.
After a libnd4j build is executed for a specific platform, we leverage javacpp to link against libnd4j and create a complete nd4j backend. When invoking a maven build, the javacpp maven plugin is used to perform this step. The presets are compiled first. Generally the presets are just 1 or 2 classes containing a description of how to map the actual nd4j code base to the libnd4j codebase.
Next, the actual backend is compiled with a dependency on the above presets code base. The javacpp plugin leverages the description from the presets we specify as a dependency and facilitates linking against a LIBND4J_HOME (a folder which contains the platform specific libnd4j binaries and include sources) specified by the user. In the actual plugin declaration in the backend's pom.xml we include the target presets class to use for our particular backend.
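A sketch of building a backend against a pre-built libnd4j, assuming LIBND4J_HOME is picked up from the environment (how the home directory is actually passed may differ):

```bash
# point the build at an existing libnd4j build (assumed mechanism)
export LIBND4J_HOME=/path/to/libnd4j
# compile the presets first, then the backend that depends on them
mvn clean install -pl :nd4j-native-presets,:nd4j-native -DskipTests
```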
Note: This still requires the native platform specific tools to be installed since binaries are generated for each platform. Please see our github actions for instructions on specific platforms.
Nd4j reuses javacpp's notion of a -platform library. This is a curated set of dependencies most users will use as part of a build. Each backend will have an associated -platform artifact so users don't have to deal with maven classifiers. See docs from javacpp for how to leverage this artifact.
Caution to users: By default, this means that a large number of dependencies for all platforms will be included. If you do not need dependencies for all platforms, then please read the above documentation to figure out how to build a jar for your specific platform.
Generally, the main thing to know is that when you build your application, you can restrict the native dependencies to your platform, for example:
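A likely invocation using javacpp's documented javacpp.platform property (the platform value is just an example):

```bash
# restrict native dependencies to a single platform instead of all platforms
mvn clean package -Djavacpp.platform=linux-x86_64
```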
A comprehensive list of classifiers can be found here. Note that each library we link against, such as openblas, will also have a similar set of classifiers.
Throughout the dl4j pom.xml files, platform specific profiles that set up dependencies exist. An example can be found here. This helps us dynamically figure out which platform someone is building for.
A testing setup the team uses for testing android involves lineageos, termux, and some arm32 based openjdk debian files that can be found here.
In order to bootstrap this environment, a from-scratch install of the latest lineageos flashed onto an sd card for the raspberry pi is suggested.
Afterwards, install the openjdk debian files mentioned above.
In order to properly set up the test environment, you need to execute your tests from the command line as follows:
A proper execution environment, after the above jdk is installed, involves manually setting the environment as follows:
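A sketch of such an environment setup; the jdk path is device specific, and the ssl-related flags are standard maven wagon properties for skipping certificate validation:

```bash
# point the shell and maven at the installed arm32 jdk (path is illustrative)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf
export PATH="$JAVA_HOME/bin:$PATH"
# tell maven's wagon transport to ignore ssl errors (cacerts + termux issues)
export MAVEN_OPTS="-Dmaven.wagon.http.ssl.insecure=true \
  -Dmaven.wagon.http.ssl.allowall=true \
  -Dmaven.wagon.http.ssl.ignore.validity.dates=true"
```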
This will set up the jdk + maven to ignore ssl errors due to issues with cacerts + termux. This is largely irrelevant for our small testing use case, but is not recommended for production environments.
Redist artifacts are easy ways of distributing dependencies without installation.
Note that for the presets that are part of nd4j (nd4j-cuda-presets and nd4j-native-presets), only the latest versions support redist artifacts. The presets' preload versions only support preloading (e.g. linking against libraries from the javacpp cache) against the latest version. This is because certain version numbers are checked during preloading.