Describe the issue
Problem Description
ONNX Runtime Android cannot create tensors with a zero-sized dimension, which prevents proper initialization of `past_key_values` in Transformer models (such as Qwen2-VL, LLaMA, etc.) during the first inference step.
Expected Behavior:
According to the ONNX specification and Python implementations, `past_key_values` should be initialized with `seq_len = 0` (an empty cache) for the first inference step:

```python
past_key_values = np.zeros((1, num_heads, 0, head_dim), dtype=np.float16)
# seq_len = 0 ✅ supported by Python/NumPy
```

Actual Behavior:
ONNX Runtime Android throws an exception when trying to create a tensor with a dimension of 0:
```kotlin
val shape = longArrayOf(1, 2, 0, 128) // seq_len = 0
val tensor = OnnxTensor.createTensor(ortEnv!!, FloatBuffer.wrap(data), shape)
// ❌ Exception: cannot create a tensor with a zero dimension
```

Workaround Attempted:
We must use `seq_len = 1` as a placeholder:

```kotlin
val shape = longArrayOf(1, 2, 1, 128) // seq_len = 1 (placeholder)
```

However, this causes sequence-length mismatches during inference, because the model expects:
- Total sequence length = past_seq_len + current_seq_len
- With the `seq_len = 1` placeholder: 1 + N = N + 1
- But `inputs_embeds` only has N tokens, causing shape-mismatch errors
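The mismatch is pure bookkeeping, so it can be illustrated without any ONNX Runtime calls. A minimal Kotlin sketch (the token count 605 is taken from the error example in this report):

```kotlin
// Shows how a placeholder KV cache of length 1 shifts the total
// sequence length the model expects, while inputs_embeds stays at N.
fun totalSeqLen(pastSeqLen: Int, currentSeqLen: Int): Int =
    pastSeqLen + currentSeqLen

fun main() {
    val n = 605                    // tokens in inputs_embeds
    println(totalSeqLen(0, n))     // empty cache: 605, matches inputs_embeds
    println(totalSeqLen(1, n))     // placeholder cache: 606, off by one
}
```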
Impact:
This limitation affects all Transformer models that use past_key_values for KV caching on Android, preventing proper inference.
Error Example:
```
ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION
Shape mismatch attempting to re-use buffer. {1,12,605,128} != {1,12,606,128}
```

The model expects sequence length 606 (past_seq_len = 1 + current_seq_len = 605), but `inputs_embeds` only has 605 tokens.
Questions:
- Is this a known limitation of ONNX Runtime Android? Are there plans to support tensors with zero-sized dimensions?
- Is there a workaround to properly handle `past_key_values` initialization on Android?
- Are there configuration options in `SessionOptions` that could help with this?
To reproduce
Steps to Reproduce
1. Set up ONNX Runtime Android:
   - Add the dependency: `com.microsoft.onnxruntime:onnxruntime-android:1.20.0`
   - Initialize the ONNX Runtime environment in Kotlin/Java
2. Attempt to create a tensor with a zero dimension:

   ```kotlin
   val ortEnv = OrtEnvironment.getEnvironment()
   val shape = longArrayOf(1, 2, 0, 128) // seq_len = 0 (zero dimension)
   val data = FloatArray(0)              // empty backing array
   val tensor = OnnxTensor.createTensor(ortEnv, FloatBuffer.wrap(data), shape)
   ```

3. Expected: the tensor is created successfully (as in Python/NumPy)
4. Actual: an exception is thrown because ONNX Runtime Android rejects the zero dimension
Minimal Code Example
```kotlin
import ai.onnxruntime.*
import java.nio.FloatBuffer

fun main() {
    val ortEnv = OrtEnvironment.getEnvironment()
    // This works in Python/NumPy but fails in ONNX Runtime Android
    try {
        val shape = longArrayOf(1, 2, 0, 128) // seq_len = 0
        val data = FloatArray(0)
        val tensor = OnnxTensor.createTensor(ortEnv, FloatBuffer.wrap(data), shape)
        println("Success: tensor created with zero dimension")
    } catch (e: Exception) {
        // Observed: exception stating the zero dimension is not supported
        println("Error: ${e.message}")
    }
}
```

Model Context
This issue occurs when initializing `past_key_values` for Transformer models (e.g., Qwen2-VL, LLaMA) where the KV cache should be empty (`seq_len = 0`) for the first inference step. The model is exported correctly according to the ONNX specification, with `past_sequence_length = 0`.
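For reference, the cache lifecycle this export assumes can be sketched with stubs; `runPrefill` and `runDecode` are hypothetical placeholders for the actual `OrtSession.run` calls and model only the cache-length bookkeeping:

```kotlin
// Hypothetical stubs: only the seq_len bookkeeping of a KV-cached
// Transformer is modeled here, not real inference.
data class KvCache(val seqLen: Int)

// First step: takes an EMPTY cache (seq_len = 0) plus the full prompt and
// returns a cache covering every prompt token. Creating that empty cache
// tensor is the part that fails on Android.
fun runPrefill(promptLen: Int): KvCache = KvCache(seqLen = promptLen)

// Subsequent steps: one new token is appended to the cache per call.
fun runDecode(cache: KvCache): KvCache = KvCache(seqLen = cache.seqLen + 1)

fun main() {
    var cache = runPrefill(promptLen = 605)
    repeat(3) { cache = runDecode(cache) }
    println(cache.seqLen)  // 608
}
```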
Urgency
This issue is urgent as it blocks critical functionality for our project.
Why it's urgent:
1. Core functionality blocked: We are developing an Android application that uses a Qwen2-VL model for vision-language understanding. The model cannot run inference on Android due to this limitation.
2. Widespread impact: This affects all Transformer models that use `past_key_values` for KV caching on Android, not just our specific use case. It is a fundamental limitation that prevents many modern Transformer models from working on Android.
3. Project deadline: This is blocking our core feature development and preventing us from deploying our application.
4. No workaround available: We've tried multiple approaches (padding inputs, adjusting attention masks, using `seq_len = 1` as a placeholder, etc.), but all fail due to this limitation. We cannot use `seq_len = 0` as required by the model specification, and using `seq_len = 1` causes sequence-length mismatches during inference.
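To illustrate why the mask-adjustment attempt above still fails: even when the placeholder position is masked out, the total length disagrees with `inputs_embeds`. A sketch of that attempt (the mask layout is illustrative, not the model's actual input format):

```kotlin
// Build an attention mask that zeroes out placeholderLen fake KV
// positions and keeps currentLen real token positions.
fun buildMask(placeholderLen: Int, currentLen: Int): IntArray =
    IntArray(placeholderLen) { 0 } + IntArray(currentLen) { 1 }

fun main() {
    val mask = buildMask(placeholderLen = 1, currentLen = 605)
    println(mask.size)  // 606 - still one longer than the 605 inputs_embeds tokens
    println(mask[0])    // 0 - the masked placeholder position
}
```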
Impact: Without a solution, we cannot deploy Transformer models with KV caching on Android using ONNX Runtime, which severely limits the framework's usefulness for mobile applications.
Platform
Android
OS Version
Android 15
ONNX Runtime Installation
Released Package
Compiler Version (if 'Built from Source')
No response
Package Name (if 'Released Package')
onnxruntime-android
ONNX Runtime Version or Commit ID
com.microsoft.onnxruntime:onnxruntime-android:1.20.0
ONNX Runtime API
Java/Kotlin
Architecture
ARM64
Execution Provider
Default CPU
Execution Provider Library Version
N/A (CPU EP)