Documentation · Quickstart · Models · Community
GenieX is an on-device Gen AI inference runtime for Qualcomm devices. Bring almost any GGUF model from Hugging Face, or a pre-compiled bundle from Qualcomm AI Hub, and run it locally on the Hexagon NPU, Adreno GPU, or CPU in a few lines of code. One C SDK underneath, exposed through a CLI, Python, Kotlin/Java, Docker, and an OpenAI-compatible server. It is the community version of Qualcomm GENIE.
GenieX runs only on Qualcomm Snapdragon. Find your platform, then jump straight to the interface you want to use.
| Platform | Example devices | Jump to a quickstart |
|---|---|---|
| 🪟 Windows ARM64 (Compute) | Snapdragon X · X Elite | CLI · Python · Local server |
| 🤖 Android (Mobile) | Snapdragon 8 Elite · 8 Elite Gen 5 | Android SDK |
| 🐧 Linux ARM64 (IoT) | Dragonwing QCS9075 | CLI · Docker · Python |
No device on hand? Spin up a remote session on Qualcomm Device Cloud.
Pick your interface below. Each one follows the same three steps (Install, Run, and Docs) and shows both runtimes: a GGUF model from Hugging Face (llama_cpp) and a pre-compiled bundle from Qualcomm AI Hub (qairt, NPU).
Install
- Windows ARM64: download the installer, run it, then open a new terminal.
- Linux ARM64: one line, no sudo:
  curl -fsSL https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-geniex/install.sh | sh
Run: chat with any model in one line (drag in an image for VLMs).
# GGUF from Hugging Face, via llama.cpp (NPU / GPU / CPU)
geniex infer google/gemma-4-E4B-it-qat-q4_0-gguf
# Pre-compiled bundle from Qualcomm AI Hub, via Qualcomm AI Engine Direct (NPU)
geniex infer ai-hub-models/Qwen2.5-VL-7B-Instruct
Docs: Install · Quickstart · Command reference
Install
pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple geniex
Run: the API mirrors Hugging Face transformers (from_pretrained() → .generate()).
# GGUF from Hugging Face, via llama.cpp
from geniex import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen3.5-2B-GGUF", precision="Q4_0")
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = model.tokenizer.apply_chat_template(messages, add_generation_prompt=True)
for chunk in model.generate(prompt, max_new_tokens=256, stream=True):
    print(chunk, end="", flush=True)
model.close()

# Pre-compiled bundle from Qualcomm AI Hub, via Qualcomm AI Engine Direct (NPU)
from geniex import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ai-hub-models/Qwen3-4B")
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = model.tokenizer.apply_chat_template(messages, add_generation_prompt=True)
for chunk in model.generate(prompt, max_new_tokens=256, stream=True):
    print(chunk, end="", flush=True)
model.close()
Docs: Install · Quickstart · API reference
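Both Python snippets follow the same pattern, so they can be folded into one helper. The following is a minimal sketch that relies only on the calls shown in this quickstart (tokenizer.apply_chat_template, generate with stream=True, and close). The names chat_once, FakeTokenizer, and FakeModel are hypothetical: FakeModel is a stand-in so the sketch runs without a Snapdragon device, and the sketch assumes that streamed chunks are plain strings.

```python
from contextlib import closing


def chat_once(model, user_text, max_new_tokens=256):
    """Stream one reply to the terminal and return it as a single string.

    `model` is any object exposing the interface used in the quickstart:
    tokenizer.apply_chat_template(...) and generate(..., stream=True).
    """
    messages = [{"role": "user", "content": user_text}]
    prompt = model.tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    chunks = []
    for chunk in model.generate(prompt, max_new_tokens=max_new_tokens, stream=True):
        print(chunk, end="", flush=True)  # assumption: each chunk is a str
        chunks.append(chunk)
    print()
    return "".join(chunks)


class FakeTokenizer:
    """Hypothetical stand-in for model.tokenizer."""

    def apply_chat_template(self, messages, add_generation_prompt=True):
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


class FakeModel:
    """Hypothetical stand-in that mimics the quickstart interface."""

    def __init__(self):
        self.tokenizer = FakeTokenizer()
        self.closed = False

    def generate(self, prompt, max_new_tokens=256, stream=True):
        yield from ["2 + 2", " = ", "4"]

    def close(self):
        self.closed = True


fake = FakeModel()
# closing() calls model.close() on exit, even if generation raises.
with closing(fake) as model:
    reply = chat_once(model, "What is 2+2?")
```

On a device, FakeModel() would be replaced by the from_pretrained(...) call from either snippet above; the rest of the helper stays the same.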
Install: ships with the CLI (see the CLI install step above).
Run: pull any model (GGUF or Qualcomm AI Hub bundle), then serve an OpenAI-compatible API.
geniex pull ai-hub-models/Qwen3-4B-Instruct-2507
geniex serve # serves http://127.0.0.1:18181/v1
curl http://127.0.0.1:18181/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai-hub-models/Qwen3-4B-Instruct-2507",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Point any OpenAI client at http://127.0.0.1:18181/v1; no code changes are needed.
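For scripts that should not depend on a client library, the curl call can be reproduced with the Python standard library. This is a minimal sketch, not an official client: build_chat_request, extract_reply, and chat are hypothetical helper names, and the response layout (choices[0].message.content) is assumed from the OpenAI chat completions format that the server is described as compatible with.

```python
import json
import urllib.request

# Address from the quickstart; `geniex serve` must be running for chat() to work.
BASE_URL = "http://127.0.0.1:18181/v1"


def build_chat_request(model, user_text, base_url=BASE_URL):
    """Build the same POST request as the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat completion body.

    Assumption: the server returns the usual choices[0].message.content layout.
    """
    return response_body["choices"][0]["message"]["content"]


def chat(model, user_text, base_url=BASE_URL, timeout=120):
    """Send one user message and return the reply text."""
    request = build_chat_request(model, user_text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With the server running, chat("ai-hub-models/Qwen3-4B-Instruct-2507", "Hello!") sends the same request as the curl command and returns only the reply text.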
Docs: Local server guide
Install: add the SDK to your app module's build.gradle.kts.
dependencies {
implementation("com.qualcomm.qti:geniex-android:0.3.1")
}
Run: the fastest path is the sample app (chat UI, model picker for GGUF + Qualcomm AI Hub bundles, VLM support).
The Android demo app lives in qualcomm/ai-hub-apps. Clone it, open the sample app in Android Studio, and hit Run.
Docs: Install · Quickstart · API reference
Install
docker pull docker.io/qualcomm/geniex:latest
Run: the container wraps the CLI, so geniex infer … works exactly as above.
Docs: Docker guide
Install: link against the single C header sdk/include/geniex.h; every other interface is a thin wrapper over it.
Docs: sdk/README.md · notes/build.md
GenieX has two runtimes so you get broad model coverage and peak Snapdragon performance in one stack. Both LLMs and VLMs are supported.
| | llama.cpp (llama_cpp) | Qualcomm AI Engine Direct (qairt) |
|---|---|---|
| Get models from | Hugging Face (any GGUF) | Qualcomm AI Hub (pre-compiled) |
| Format | GGUF | Per-chipset bundle |
| Compute units | NPU · GPU · CPU | NPU only |
| Best for | Bringing your own GGUF | Highest NPU performance |
For llama.cpp, pick the Q4_0 precision when prompted; it has the best Hexagon NPU support. See the Models guide for the full list, precisions, and how to run a local model.
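When a script has to choose between the two runtimes, the comparison table can be written down as data. The sketch below is only an illustration of the table: RUNTIMES, runtimes_for, and guess_runtime are hypothetical names, not part of the GenieX API. In particular, guess_runtime assumes that Qualcomm AI Hub bundles use the ai-hub-models/ prefix, which is inferred from the example model IDs in this README and is not a documented rule.

```python
# The runtime comparison table, expressed as a dictionary.
RUNTIMES = {
    "llama_cpp": {
        "source": "Hugging Face (any GGUF)",
        "format": "GGUF",
        "compute_units": ("NPU", "GPU", "CPU"),
    },
    "qairt": {
        "source": "Qualcomm AI Hub (pre-compiled)",
        "format": "Per-chipset bundle",
        "compute_units": ("NPU",),
    },
}


def runtimes_for(compute_unit):
    """List the runtimes that the table says can target a compute unit."""
    unit = compute_unit.upper()
    return [name for name, info in RUNTIMES.items() if unit in info["compute_units"]]


def guess_runtime(model_id):
    """Guess the runtime from a model ID.

    Assumption (not a documented rule): AI Hub bundles are published under
    the "ai-hub-models/" prefix; every other ID is treated as a GGUF repo.
    """
    return "qairt" if model_id.startswith("ai-hub-models/") else "llama_cpp"
```

For example, runtimes_for("GPU") yields only llama_cpp, which matches the "NPU only" entry for qairt in the table.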
Contributions are welcome! Before opening a PR, please read CONTRIBUTING.md for branch naming, commit / PR title format, pre-commit checks, and the FFI-update rule for public SDK headers.
| Topic | Document |
|---|---|
| Build the CLI, SDK, or Python bindings | notes/build.md |
| | notes/run.md |
| Release: SemVer tags, channels, HTP signing | notes/release.md |
| All developer docs | docs/README.md |
Questions, ideas, or want to show off what you built? Come say hi.
- Slack: ask questions and chat with the community in real time.
- GitHub Issues: report a bug or request a feature.
- LinkedIn: follow Qualcomm AI Hub for news and updates.
Thanks to everyone building GenieX.
BSD 3-Clause. See LICENSE and NOTICE.
Use of this project is also subject to Qualcomm's Terms of Use.
