From Python To Android: HF Sentence Transformers (Embeddings)

Bring all-MiniLM-L6-v2 sentence embeddings to Android with ONNX and Rust

Shubham Panchal
ProAndroidDev
10 min read · Jun 30, 2024


Photo by John Tuesday on Unsplash

A while ago, I developed an Android app, Android-Doc-QA, which is an instance of on-device RAG for PDF/DOCX documents. It used ObjectBox as the vector database, the Gemini Cloud API as the LLM and Mediapipe’s Text Embedder as the embedding provider.

For RAG applications to perform well, a good embedding model that transforms text into a vector/embedding is crucial. The embeddings must capture the semantics of the text, as they are used to find the most similar chunks (text subsequences from the given documents) in the database through a vector search. The chunks most similar to the query serve as context for the LLM and go into the prompt directly.
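To make the vector-search step concrete, here is a minimal Kotlin sketch (the helper names are mine, not from the app) that ranks chunks by cosine similarity against a query embedding and keeps the top-k as context:

// Cosine similarity between two embeddings of equal length
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (kotlin.math.sqrt(normA) * kotlin.math.sqrt(normB))
}

// Rank document chunks by similarity to the query embedding
// and keep the top-k as context for the LLM prompt
fun topKChunks(
    queryEmbedding: FloatArray,
    chunks: List<Pair<String, FloatArray>>, // (chunk text, chunk embedding)
    k: Int = 3
): List<String> =
    chunks.sortedByDescending { (_, emb) -> cosineSimilarity(queryEmbedding, emb) }
        .take(k)
        .map { (text, _) -> text }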

From my experience in testing the app, Mediapipe’s Text Embedder, which uses Google’s Universal Sentence Encoder, was not good enough at understanding the semantics of a chunk or query. It felt more like a text search matching word occurrences rather than one understanding the true meaning of the sentences. Moreover, the embedding size was only 100, which felt small compared to models from HuggingFace sentence-transformers.

The all-MiniLM-L6-v2 model from sentence-transformers seemed powerful in the RAG applications that I had built in Python. It has an embedding size of 384, which can theoretically capture more information than an embedding of size 100. I wanted to use this model in my Android app, but I couldn’t find any clear implementation for doing so.

In this blog, I’ll share how I solved the problem of bringing all-MiniLM-L6-v2 to an Android app and built an Android library and demo app for the same,

An app demonstrating the power of sentence-embeddings in Android. Test device: Samsung M13, armeabi-v7a (32-bit) with 4GB RAM

Here’s the GitHub repository for the library and demo app:

Approach

In order to execute any ML model on Android, it is good to have the model converted to the ONNX or TensorFlow Lite format. For ONNX, Microsoft provides onnxruntime, and the TensorFlow team provides an Android library to execute TFLite models. Both of these libraries are well-documented and have numerous examples for models operating on different modalities.

Solving Model Execution — ONNX

In our case, the sentence-transformers team has already provided an ONNX version of the all-MiniLM-L6-v2 model in their HuggingFace repository. With onnxruntime's Python package, we can load the ONNX model and check its input/output shapes, data-types and tensor names,

import onnxruntime as ort

# Download ONNX model from
# https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/blob/main/onnx/model.onnx
session = ort.InferenceSession("model.onnx")
print( [ x.shape for x in session.get_inputs() ] )
print( [ x.name for x in session.get_inputs() ] )
print( [ x.type for x in session.get_inputs() ] )
print( [ x.shape for x in session.get_outputs() ] )
print( [ x.name for x in session.get_outputs() ] )
print( [ x.type for x in session.get_outputs() ] )

# Output:
[['batch_size', 'sequence_length'], ['batch_size', 'sequence_length']]
['input_ids', 'attention_mask']
['tensor(int64)', 'tensor(int64)']
[['batch_size', 'sequence_length', 384], ['batch_size', 384]]
['token_embeddings', 'sentence_embedding']
['tensor(float)', 'tensor(float)']

The onnxruntime library, like its Python counterpart above, is also built for Android apps and hosted on Maven Central. In Android, we can load this model, provide the input_ids and attention_mask, and get the sentence_embedding as an output. Hence, we have solved the core model-execution problem with onnxruntime.
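For reference, the Android package can be pulled from Maven Central with a Gradle dependency along these lines (the version shown is only an example; check Maven Central for the latest release):

// build.gradle.kts (module-level); the version is illustrative
dependencies {
    implementation("com.microsoft.onnxruntime:onnxruntime-android:1.18.0")
}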

The model does not take a string or sequence of characters as input directly. Rather, it accepts input_ids, which is a sequence of integers representing the tokens derived from the text corpus during training. Each transformer model on HuggingFace comes with a tokenizer.json file that defines the type of the tokenizer, the token-to-id map and other text-processing parameters. Replicating and building a tokenizer that can perform operations following tokenizer.json outside of Python (in Android) was a challenge that had to be solved to get the input_ids and attention_mask required for executing the ONNX model.

Solving Tokenization — Rust and JNI

The source code of HuggingFace Tokenizers is written in Rust, with Python bindings available. Moreover, we can use tokenizers as a crate in our own Rust project and read tokenizer.json to create the appropriate tokenizer for the model in consideration. Here's a simple example that loads tokenizer.json from the all-MiniLM-L6-v2 repository and encodes a sentence,

use tokenizers::tokenizer::Tokenizer;

fn main() {
    let sent_bert_tokenizer = Tokenizer::from_file("hf-tokenizer/tokenizer.json").unwrap();
    let encoding = sent_bert_tokenizer.encode("what is the population of london", true).unwrap();
    let ids: Vec<i64> = encoding.get_ids().iter().map(|id| *id as i64).collect();
    let attention_mask: Vec<i64> = encoding
        .get_attention_mask()
        .iter()
        .map(|x| *x as i64)
        .collect();

    println!("{:?}", ids);
    println!("{:?}", attention_mask);
}

// Output:
[101, 2054, 2003, 1996, 2313, 1997, 2414, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
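In the output above, 101 and 102 are the IDs of the [CLS] and [SEP] special tokens from the BERT WordPiece vocabulary, the trailing zeros are [PAD] tokens added by the tokenizer's padding configuration, and the attention mask marks which positions hold real tokens (1) versus padding (0).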

Rust supports compilation to Android-based targets and can also build dynamic/static C libraries. Similar to C/C++, we can make use of JNI (Java Native Interface) in Rust and expose functions that interact with the Tokenizer. Java/Kotlin classes can declare these functions as native / external and search for their implementations in the dynamic libraries (shared objects, .so) we'll build from the Rust source code.

This Rust-Java interface will help us access the Rust-based Tokenizer from Android and we will be able to tokenize a String and obtain input_ids and attention_mask for the ONNX model.

JNI and Rust Library

To start building the Rust source code, we add tokenizers, jni, serde and serde_json as dependencies in our crate and set the crate-type to cdylib (C dynamic library),

[dependencies]
tokenizers = "0.19.1"
jni = { version = "0.21.1" }
serde = { version = "1.0.192" , features = ["derive"] }
serde_json = "1.0.108"
onig = "6.4.0"

[lib]
crate-type = ["cdylib"]
name = "hftokenizer"

Next, in src/lib.rs, with JNI we define functions that create and destroy the tokenizer instance and that use the instance to tokenize the given text. By instance, we mean a pointer/address that references the Tokenizer object in memory. The pointer is produced with a Box in Rust and represented as a Long in Kotlin. Note that the exported function names encode the Java package and class (underscores in the package name are escaped as _1), so Java_com_ml_shubham0204_sentence_1embeddings_HFTokenizer_createTokenizer corresponds to the createTokenizer method of com.ml.shubham0204.sentence_embeddings.HFTokenizer. The first function, createTokenizer, takes the file bytes of tokenizer.json and returns the instance pointer,

use jni::objects::{JByteArray, JClass, JString, ReleaseMode};
use jni::sys::jlong;
use jni::JNIEnv;
use serde::Serialize;
use serde_json;
use tokenizers::Tokenizer;

// JSON-serializable struct holding the output of `Tokenizer::encode`,
// consumed by the Kotlin side as `ids` and `attention_mask`
#[derive(Serialize)]
struct TokenizationResult {
    ids: Vec<u32>,
    attention_mask: Vec<u32>,
}

#[no_mangle]
pub extern "C" fn Java_com_ml_shubham0204_sentence_1embeddings_HFTokenizer_createTokenizer<'a>(
    mut env: JNIEnv<'a>,
    _: JClass<'a>,
    tokenizer_bytes: JByteArray<'a>,
) -> jlong {
    let tokenizer_bytes_rs: Vec<u8> = env
        .get_array_elements(&tokenizer_bytes, ReleaseMode::CopyBack)
        .expect("Could not read tokenizer_bytes")
        .into();
    let tokenizer =
        Tokenizer::from_bytes(&tokenizer_bytes_rs).expect("Could not create tokenizer from bytes");
    Box::into_raw(Box::new(tokenizer)) as jlong
}

Next, the tokenize function accepts an instance pointer and a text, returning a JSON string containing the ids and attention_mask (serialized from the TokenizationResult struct defined above),

#[no_mangle]
pub extern "C" fn Java_com_ml_shubham0204_sentence_1embeddings_HFTokenizer_tokenize<'a>(
    mut env: JNIEnv<'a>,
    _: JClass<'a>,
    tokenizer_ptr: jlong,
    text: JString<'a>,
) -> JString<'a> {
    let tokenizer = unsafe { &mut *(tokenizer_ptr as *mut Tokenizer) };
    let text: String = env
        .get_string(&text)
        .expect("Could not convert text to Rust String")
        .into();
    let encoding = tokenizer.encode(text, true).expect("Could not encode text");
    let result = TokenizationResult {
        ids: encoding.get_ids().to_vec(),
        attention_mask: encoding.get_attention_mask().to_vec(),
    };
    let result_json_str =
        serde_json::to_string(&result).expect("Could not convert tokenization result to JSON");
    env.new_string(result_json_str)
        .expect("Could not convert result_json_str to jstring")
}

Lastly, we define a deleteTokenizer function which allows Rust to deallocate the Tokenizer instance,

#[no_mangle]
pub extern "C" fn Java_com_ml_shubham0204_sentence_1embeddings_HFTokenizer_deleteTokenizer(
    _: JNIEnv,
    _: JClass,
    tokenizer_ptr: jlong,
) {
    let _ptr = unsafe { Box::from_raw(tokenizer_ptr as *mut Tokenizer) };
    // _ptr takes back ownership of the Tokenizer; it is dropped (and the
    // Tokenizer deallocated) when it goes out of scope at the end of this function
}

To compile a dynamic library for Android architectures, we add the necessary targets with rustup,

$> rustup target add aarch64-linux-android 
$> rustup target add armv7-linux-androideabi
$> rustup target add i686-linux-android
$> rustup target add x86_64-linux-android

We also need to add paths to the clang and clang++ compilers from the Android NDK in .cargo/config.toml. Crates like onig require these compilers to build the C/C++ code they depend on. Paths to the linker for each architecture also have to be included.

[target.aarch64-linux-android]
linker = "~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android33-clang"

[target.armv7-linux-androideabi]
linker = "~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/armv7a-linux-androideabi33-clang"

[target.i686-linux-android]
linker = "~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/i686-linux-android33-clang"

[target.x86_64-linux-android]
linker = "~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/x86_64-linux-android33-clang"

[env]
AR_aarch64-linux-android="~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-ar"
CC_aarch64-linux-android="~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android33-clang"
CXX_aarch64-linux-android="~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android33-clang++"

AR_armv7-linux-androideabi="~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-ar"
CC_armv7-linux-androideabi="~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/armv7a-linux-androideabi33-clang"
CXX_armv7-linux-androideabi="~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/armv7a-linux-androideabi33-clang++"

AR_i686-linux-android="~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-ar"
CC_i686-linux-android="~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/i686-linux-android33-clang"
CXX_i686-linux-android="~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/i686-linux-android33-clang++"

AR_x86_64-linux-android="~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-ar"
CC_x86_64-linux-android="~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/x86_64-linux-android33-clang"
CXX_x86_64-linux-android="~/android-ndk-r26d/toolchains/llvm/prebuilt/linux-x86_64/bin/x86_64-linux-android33-clang++"

Our build setup for Android is now complete. We execute cargo build, passing the --release and --target flags for each architecture,

$> cargo build --release --target armv7-linux-androideabi
$> cargo build --release --target aarch64-linux-android
$> cargo build --release --target i686-linux-android
$> cargo build --release --target x86_64-linux-android

The .so dynamic library for each target can then be found at target/<target>/release/libhftokenizer.so.

You can view the Rust source code here,

Android Library

Once we’ve built the dynamic libraries from our Rust code, we add them to our library module at the following path: sentence_embeddings/src/main/jniLibs/<architecture>/libhftokenizer.so, where <architecture> is one of x86, x86_64, armeabi-v7a and arm64-v8a.
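For reference, each Rust build output maps to an Android ABI directory as follows: aarch64-linux-android → arm64-v8a, armv7-linux-androideabi → armeabi-v7a, i686-linux-android → x86 and x86_64-linux-android → x86_64.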

Next, we create a Kotlin class HFTokenizer that declares the JNI functions and holds tokenizerPtr, an address/pointer referencing the Tokenizer instance created by the library and returned by createTokenizer,

import org.json.JSONObject

class HFTokenizer(tokenizerBytes: ByteArray) {

    data class Result(
        val ids: LongArray = longArrayOf(),
        val attentionMask: LongArray = longArrayOf()
    )

    private val tokenizerPtr: Long = createTokenizer(tokenizerBytes)

    fun tokenize(
        text: String
    ): Result {
        val output = tokenize(tokenizerPtr, text)
        // Deserialize the string
        // and read `ids` and `attention_mask` as LongArray
        val jsonObject = JSONObject(output)
        val idsArray = jsonObject.getJSONArray("ids")
        val ids = LongArray(idsArray.length())
        for (i in 0 until idsArray.length()) {
            ids[i] = (idsArray.get(i) as Int).toLong()
        }
        val attentionMaskArray = jsonObject.getJSONArray("attention_mask")
        val attentionMask = LongArray(attentionMaskArray.length())
        for (i in 0 until attentionMaskArray.length()) {
            attentionMask[i] = (attentionMaskArray.get(i) as Int).toLong()
        }
        return Result(ids, attentionMask)
    }

    fun close() {
        deleteTokenizer(tokenizerPtr)
    }

    // Given the bytes of the file `tokenizer.json`,
    // return a pointer
    private external fun createTokenizer(
        tokenizerBytes: ByteArray
    ): Long

    // Given the pointer to `Tokenizer` and the text,
    // return `ids` and `attention_mask` in JSON format
    private external fun tokenize(
        tokenizerPtr: Long,
        text: String
    ): String

    // Pass the `tokenizerPtr` which is then deallocated
    // by the library
    private external fun deleteTokenizer(
        tokenizerPtr: Long
    )

    companion object {
        init {
            System.loadLibrary("hftokenizer")
        }
    }
}

The external tokenize method returns a String that contains ids and attention_mask encoded as JSON. We deserialize the string, read the arrays with keys ids and attention_mask as LongArrays and pack them into the Result data class.
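As a quick illustration, here is a minimal usage sketch, assuming tokenizer.json is bundled in the app's assets and a Context is available (the asset name is my own choice):

// Load tokenizer.json from the app's assets
val tokenizerBytes = context.assets.open("tokenizer.json").use { it.readBytes() }
val tokenizer = HFTokenizer(tokenizerBytes)

val result = tokenizer.tokenize("what is the population of london")
println("ids = ${result.ids.contentToString()}")
println("attention_mask = ${result.attentionMask.contentToString()}")

// Release the native Tokenizer once it is no longer needed
tokenizer.close()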

Next, we create the SentenceEmbedding class that uses onnxruntime to execute the model and HFTokenizer to tokenize the given text,

package com.ml.shubham0204.sentence_embeddings

import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession
import ai.onnxruntime.providers.NNAPIFlags
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import java.nio.LongBuffer
import java.util.EnumSet

class SentenceEmbedding {

    private lateinit var hfTokenizer: HFTokenizer
    private lateinit var ortEnvironment: OrtEnvironment
    private lateinit var ortSession: OrtSession

    suspend fun init(
        modelBytes: ByteArray,
        tokenizerBytes: ByteArray,
        useFP16: Boolean = false,
        useXNNPack: Boolean = false
    ) = withContext(Dispatchers.IO) {
        // Initialize HFTokenizer and OrtSession
        hfTokenizer = HFTokenizer(tokenizerBytes)
        ortEnvironment = OrtEnvironment.getEnvironment()
        val options = OrtSession.SessionOptions().apply {
            if (useFP16) {
                addNnapi(EnumSet.of(NNAPIFlags.USE_FP16, NNAPIFlags.CPU_DISABLED))
            }
            if (useXNNPack) {
                addXnnpack(mapOf(
                    "intra_op_num_threads" to "2"
                ))
            }
        }
        ortSession = ortEnvironment.createSession(modelBytes, options)
    }

    suspend fun encode(
        sentence: String
    ): FloatArray = withContext(Dispatchers.IO) {
        val result = hfTokenizer.tokenize(sentence)
        // Create input tensors for `ids` and `attention_mask`
        val idsTensor =
            OnnxTensor.createTensor(
                ortEnvironment,
                LongBuffer.wrap(result.ids),
                longArrayOf(1, result.ids.size.toLong()),
            )
        val attentionMaskTensor =
            OnnxTensor.createTensor(
                ortEnvironment,
                LongBuffer.wrap(result.attentionMask),
                longArrayOf(1, result.attentionMask.size.toLong()),
            )
        val outputs =
            ortSession.run(mapOf("input_ids" to idsTensor, "attention_mask" to attentionMaskTensor))
        val embeddingTensor = outputs.get("sentence_embedding").get() as OnnxTensor
        return@withContext embeddingTensor.floatBuffer.array()
    }
}

The library can then be distributed through Maven Central or JitPack, and sentence embeddings can be obtained using the SentenceEmbedding class.
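Here is a minimal usage sketch, assuming model.onnx and tokenizer.json are bundled in the app's assets and the calls are made from a coroutine (the asset names and the cosineSimilarity helper sketched earlier are my own additions):

// Inside a coroutine, since init() and encode() are suspend functions
val modelBytes = context.assets.open("model.onnx").use { it.readBytes() }
val tokenizerBytes = context.assets.open("tokenizer.json").use { it.readBytes() }

val sentenceEmbedding = SentenceEmbedding()
sentenceEmbedding.init(modelBytes, tokenizerBytes)

val e1 = sentenceEmbedding.encode("What is the population of London?")
val e2 = sentenceEmbedding.encode("How many people live in London?")

// Compare the two 384-dim embeddings with the cosineSimilarity helper from earlier
println("similarity = ${cosineSimilarity(e1, e2)}")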

Extended Resources

You may find these additional blogs on compiling Rust for Android apps helpful; the process is very similar to that for C/C++ with CMake,

Here are some helpful resources on using onnxruntime in Android,

Conclusion

I’m an on-device ML enthusiast, and developing ML apps on Android is my passion. If you would like to view my other on-device ML projects, head over here,

Do check out my website and share your thoughts on this story. Keep learning and have a nice day ahead!
