---
title: Customer Support Chatbot with Llama and ExecuTorch on Arm-Based Mobile Devices (with Agentic AI Capabilities)
minutes_to_complete: 60

who_is_this_for: This learning path is designed for developers with basic knowledge of Python, mobile development, and machine learning concepts. It guides you through creating an on-device customer support chatbot using Meta's Llama models deployed via PyTorch's ExecuTorch runtime, with a focus on Arm-based Android devices. The chatbot handles common customer queries (for example, product information and troubleshooting) with low latency, on-device privacy (no cloud dependency), and optimized performance. It also incorporates agentic AI capabilities, transforming the chatbot from reactive (simple Q&A) to proactive and autonomous. Agentic AI enables the bot to plan multi-step actions, use external tools, reason over user intent, and adapt responses dynamically. This is achieved by extending the core LLM with tool-calling mechanisms and multi-agent orchestration.

learning_objectives:
- Explain the architecture and capabilities of Llama models (e.g., Llama 3.2 1B/3B) for mobile use.
- Master the process of quantizing LLMs (e.g., 4-bit PTQ) to reduce model size and enable efficient inference on resource-constrained mobile devices.
- Gain proficiency in using ExecuTorch to export PyTorch models to .pte format for on-device deployment.
- Learn to leverage Arm-specific optimizations (e.g., XNNPACK, KleidiAI) to achieve 2-3x faster inference on Arm-based Android devices.
- Implement real-time inference with Llama models, enabling seamless customer support interactions (e.g., handling FAQs, troubleshooting).

prerequisites:
- Basic understanding of machine learning and deep learning (familiarity with concepts such as supervised learning, neural networks, and transfer learning, and an understanding of model training, validation, and overfitting).
- Familiarity with deep learning frameworks (experience with PyTorch for building and training neural networks, and knowledge of Hugging Face Transformers for working with pre-trained LLMs).
- An Arm-powered smartphone with the i8mm feature running Android, with 16GB of RAM.
- A USB cable to connect your smartphone to your development machine.
- An AWS Graviton4 r8g.16xlarge instance to test Arm performance optimizations, or any [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider, an on-premises Arm server, or an Arm-based laptop.
- Android Debug Bridge (adb) installed on your development machine. Follow the steps in [adb](https://developer.android.com/tools/adb) to install Android SDK Platform Tools; the adb tool is included in this package.
- Java 17 JDK. Follow the steps in [Java 17 JDK](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) to download and install the JDK on your host machine.
- Android Studio. Follow the steps in [Android Studio](https://developer.android.com/studio) to download and install Android Studio for host.
- Python 3.10.

author: Parichay Das

### Tags
skilllevels: Introductory
subjects: ML
armips:
- Neoverse

tools_software_languages:
- LLM
- GenAI
- Python
- PyTorch
- ExecuTorch
operatingsystems:
- Linux
- Windows
- Android


further_reading:
- resource:
title: Hugging Face Documentation
link: https://huggingface.co/docs
type: documentation
- resource:
title: PyTorch Documentation
link: https://pytorch.org/docs/stable/index.html
type: documentation
- resource:
title: Android
link: https://www.android.com/
type: website


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
---
title: Overview
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Understanding Llama: Meta’s Large Language Model
Llama is a family of large language models trained using publicly available datasets. These models demonstrate strong performance across a range of natural language processing (NLP) tasks, including language translation, question answering, and text summarization.

In addition to their analytical capabilities, Llama models can generate human-like, coherent, and contextually relevant text, making them highly effective for applications that rely on natural language generation. Consequently, they serve as powerful tools in areas such as chatbots, virtual assistants, and language translation, as well as in creative and content-driven domains where producing natural and engaging text is essential.

Please note that the models are subject to the [acceptable use policy](https://github.com/meta-llama/llama/blob/main/USE_POLICY.md) and the [responsible use guide](https://github.com/meta-llama/llama/blob/main/RESPONSIBLE_USE_GUIDE.md).



## Quantization
A practical approach to making models fit within smartphone memory constraints is 4-bit groupwise per-token dynamic quantization of all linear layers. In this technique, dynamic quantization is applied to the activations, meaning the quantization parameters are computed at runtime from the observed minimum and maximum activation values. The model weights, in contrast, are statically quantized, with each channel quantized in groups using 4-bit signed integers. This significantly reduces memory usage while maintaining model performance on resource-constrained devices.

For further information, refer to [torchao: PyTorch Architecture Optimization](https://github.com/pytorch-labs/ao/).
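
To make the scheme concrete, the snippet below is a minimal, self-contained sketch of symmetric groupwise 4-bit weight quantization of a single weight matrix in plain PyTorch. It is illustrative only; the production path in torchao and ExecuTorch also quantizes activations dynamically per token and packs the 4-bit values differently.

```python
import torch

def quantize_weights_groupwise(weight: torch.Tensor, group_size: int = 128):
    """Symmetric 4-bit groupwise quantization of a 2-D weight matrix (sketch)."""
    rows, cols = weight.shape
    w = weight.reshape(rows, cols // group_size, group_size)
    # One scale per group, chosen so the largest magnitude maps to 7 (int4 range is -8..7)
    scales = (w.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    return q.reshape(rows, cols), scales

def dequantize_groupwise(q: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
    rows, cols = q.shape
    w = q.reshape(rows, cols // group_size, group_size).float() * scales
    return w.reshape(rows, cols)

w = torch.randn(32, 256)
q, s = quantize_weights_groupwise(w)
w_hat = dequantize_groupwise(q, s)
print("max abs error:", (w - w_hat).abs().max().item())
```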

The table below shows WikiText perplexity evaluated with [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness).

The results are for two different group sizes, with max_seq_len 2048 and 1000 samples:

| Model | Baseline (FP32) | Groupwise 4-bit (group size 128) | Groupwise 4-bit (group size 256) |
|------------|-----------------|----------------------------------|----------------------------------|
| Llama 2 7B | 9.2 | 10.2 | 10.7 |
| Llama 3 8B | 7.9 | 9.4 | 9.7 |

Note that a group size smaller than 128 was not enabled in this example because the resulting model was still too large. This is because current efforts have focused on enabling FP32, and support for FP16 is under way.
---
title: Environment Setup
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Android NDK and Android Studio - Environment Setup

#### Platform Requirements
- An AWS Graviton4 r8g.16xlarge instance to test Arm performance optimizations, or any [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider, an on-premises Arm server, or an Arm-based laptop.
- An Arm-powered smartphone with the i8mm feature running Android, with 16GB of RAM.
- A USB cable to connect your smartphone to your development machine.

The installation and configuration of Android Studio can be accomplished through the following steps:
1. Download and install the latest version of [Android Studio](https://developer.android.com/studio).
2. Launch Android Studio and access the Settings dialog.
3. Navigate to Languages & Frameworks → Android SDK.
4. Under the SDK Platforms tab, ensure that Android 14.0 (“UpsideDownCake”) is selected.

Next, proceed to install the required version of the Android NDK by first setting up the Android Command Line Tools.
Linux:
```bash
curl https://dl.google.com/android/repository/commandlinetools-linux-11076708_latest.zip -o commandlinetools.zip
unzip commandlinetools.zip
./cmdline-tools/bin/sdkmanager --install "ndk;26.1.10909697"
```
Install the NDK in the same directory where Android Studio installed the SDK, which is located at ~/Library/Android/sdk by default. Then configure the necessary environment variables as follows:
```bash
export ANDROID_HOME="$(realpath ~/Library/Android/sdk)"
export PATH=$ANDROID_HOME/cmdline-tools/bin/:$PATH
sdkmanager --sdk_root="${ANDROID_HOME}" --install "ndk;28.0.12433566"
export ANDROID_NDK=$ANDROID_HOME/ndk/28.0.12433566/
```

#### Install Java 17 JDK
1. Open the Java SE 17 Archive [Downloads](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) page in your browser.
2. Select an appropriate download for your development machine operating system.

#### Install Git and cmake
```bash
sudo apt-get install git cmake
```

#### Install Python 3.10
```bash
sudo apt-get install python3.10
```

#### Set up ExecuTorch
ExecuTorch is an end-to-end framework designed to facilitate on-device inference across a wide range of mobile and edge platforms, including wearables, embedded systems, and microcontrollers. As a component of the PyTorch Edge ecosystem, it streamlines the efficient deployment of PyTorch models on edge devices. For further details, refer to the [ExecuTorch Overview](https://pytorch.org/executorch/stable/overview/).

It is recommended to create an isolated Python environment to install the ExecuTorch dependencies. Instructions are available for setting up either a Python virtual environment or a Conda virtual environment—you only need to choose one of these options.

##### Install Required Tools (Python environment setup)
```bash
python3 -m venv exec_env
source exec_env/bin/activate
pip install torch torchvision torchaudio
pip install executorch
```
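
To confirm that the ExecuTorch installation works, you can export a trivial module to a .pte file. The snippet below is a minimal sketch of the general torch.export → to_edge → to_executorch flow; the exact APIs can vary between ExecuTorch releases, so treat it as illustrative rather than definitive.

```python
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

# Capture the model with torch.export, lower it to the Edge dialect,
# then compile it into an ExecuTorch program.
example_inputs = (torch.randn(1, 8),)
exported = torch.export.export(TinyModel().eval(), example_inputs)
executorch_program = to_edge(exported).to_executorch()

with open("tiny_model.pte", "wb") as f:
    f.write(executorch_program.buffer)
print("Wrote tiny_model.pte")
```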
##### Clone Required Repositories
```bash
git clone https://github.com/pytorch/executorch.git
git clone https://github.com/pytorch/text.git
```
##### Download Pretrained Model (Llama 3.1 Instruct)
Download the quantized model weights optimized for mobile deployment from either the Meta AI Hub or Hugging Face.
```bash
git lfs install
git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
```

##### Verify Android SDK and NDK Paths
```bash
ANDROID_SDK_ROOT=/Users/<you>/Library/Android/sdk
ANDROID_NDK_HOME=$ANDROID_SDK_ROOT/ndk/26.1.10909125
```
---
title: Model Preparation and Conversion
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

To begin working with Llama 3, the pre-trained model parameters can be accessed through Meta’s Llama Downloads page. Users are required to request access by submitting their details and reviewing and accepting the Responsible Use Guide. Upon approval, a license and a download link—valid for 24 hours—are provided. For this exercise, the Llama 3.2 1B Instruct model is utilized; however, the same procedures can be applied to other available variants with only minor modifications.

Convert the model into an ExecuTorch-compatible format optimized for Arm devices.
## Script the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# torch.jit.script does not support most Hugging Face models; load with
# torchscript=True and trace the model with example inputs instead.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, torchscript=True
)
model.eval()

example = tokenizer("Hello, how can I help you?", return_tensors="pt")
scripted_model = torch.jit.trace(model, (example.input_ids,))
scripted_model.save("llama_exec.pt")
```

Install the llama-stack package with pip.
```bash
pip install llama-stack
```

Run the following command to download the model, and paste the download link from the email when prompted.
```bash
llama model download --source meta --model-id Llama3.2-1B-Instruct
```

When the download is finished, the installation path is printed as output.
```output
Successfully downloaded model to /<path-to-home>/.llama/checkpoints/Llama3.2-1B-Instruct
```
Verify by viewing the downloaded files under this path:
```bash
ls $HOME/.llama/checkpoints/Llama3.2-1B-Instruct
```

The output is similar to:

```output
checklist.chk consolidated.00.pth params.json tokenizer.model
```

Export the model and generate a .pte file by running the appropriate Python command. This command will export the model and save the resulting file in your current working directory.
```bash
python3 -m examples.models.llama.export_llama \
--checkpoint $HOME/.llama/checkpoints/Llama3.2-1B-Instruct/consolidated.00.pth \
--params $HOME/.llama/checkpoints/Llama3.2-1B-Instruct/params.json \
-kv --use_sdpa_with_kv_cache -X --xnnpack-extended-ops -qmode 8da4w \
--group_size 64 -d fp32 \
--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001, 128006, 128007]}' \
--embedding-quantize 4,32 \
--output_name="llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte" \
--max_seq_length 1024 \
--max_context_length 1024
```

Because Llama 3 has a larger vocabulary size, it is recommended to quantize the embeddings using the parameter --embedding-quantize 4,32. This helps to further optimize memory usage and reduce the overall model size.
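
After the export completes, you can sanity-check the generated file. The snippet below is a small illustrative check (the file name matches the --output_name used above):

```python
import os

# Quick check that the exported .pte exists and is much smaller than the FP32 checkpoint
pte = "llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte"
size_gb = os.path.getsize(pte) / 1e9
print(f"{pte}: {size_gb:.2f} GB")
```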


###### Load a fine-tuned model (from Hugging Face)
- Example: meta-llama/Meta-Llama-3-8B-Instruct or a customer-support fine-tuned variant

###### Model Optimization for Arm (Understanding Quantization)
- Reduces model precision (e.g., 32-bit → 8-bit)
- Decreases memory footprint (~4x reduction; see the sketch below)
- Speeds up inference on CPU
- Minimal accuracy loss for most tasks
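
As a rough back-of-the-envelope illustration of the ~4x figure, the snippet below compares the weight storage of a model in FP32 and INT8 (weights only; activations and the KV cache add to the real footprint):

```python
# Approximate weight memory for a 1B-parameter model (illustrative numbers)
params = 1_000_000_000
fp32_gb = params * 4 / 1e9   # 4 bytes per FP32 weight
int8_gb = params * 1 / 1e9   # 1 byte per INT8 weight

print(f"FP32: {fp32_gb:.1f} GB, INT8: {int8_gb:.1f} GB, reduction: {fp32_gb / int8_gb:.0f}x")
```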

###### Apply Dynamic Quantization
- Create optimize_model.py

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.quantization import quantize_dynamic
import time
import os

def load_base_model(model_name):
    """Load the base model"""
    print(f"Loading base model: {model_name}")

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float32,
        device_map=None,
        low_cpu_mem_usage=True
    )
    model.eval()

    return model, tokenizer

def apply_quantization(model):
    """Apply dynamic quantization"""
    print("Applying dynamic quantization...")

    quantized_model = quantize_dynamic(
        model,
        {torch.nn.Linear},  # Quantize linear layers
        dtype=torch.qint8
    )

    return quantized_model

def test_model(model, tokenizer, prompt):
    """Test model with a sample prompt"""
    inputs = tokenizer(prompt, return_tensors="pt")

    start_time = time.time()
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_new_tokens=100,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )
    inference_time = time.time() - start_time

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response, inference_time

def main():
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

    # Load base model
    base_model, tokenizer = load_base_model(model_name)

    # Test base model
    test_prompt = "How do I track my order?"
    print("\nTesting base model...")
    response, base_time = test_model(base_model, tokenizer, test_prompt)
    print(f"Base model inference time: {base_time:.2f}s")

    # Apply quantization
    quantized_model = apply_quantization(base_model)

    # Test quantized model
    print("\nTesting quantized model...")
    response, quant_time = test_model(quantized_model, tokenizer, test_prompt)
    print(f"Quantized model inference time: {quant_time:.2f}s")
    print(f"Speedup: {base_time / quant_time:.2f}x")

    # Save quantized model
    save_dir = "./models/quantized_llama3"
    os.makedirs(save_dir, exist_ok=True)

    torch.save(quantized_model.state_dict(), f"{save_dir}/model.pt")
    tokenizer.save_pretrained(save_dir)

    print(f"\nQuantized model saved to: {save_dir}")

if __name__ == "__main__":
    main()
```
---
title: Building the Chatbot Logic
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Conversation Framework (Python prototype)
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def generate_response(model, query, context):
prompt = f"### Context:\n{context}\n### User Query:\n{query}\n### Assistant Response:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
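
As a quick sanity check, you can call generate_response with a model loaded through Transformers. This is a minimal sketch; the context string is a made-up example, and on the device you would run the exported .pte model through the ExecuTorch runtime rather than the full-precision checkpoint.

```python
from transformers import AutoModelForCausalLM

# Desktop-side test only; assumes access to the gated Hugging Face checkpoint
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

context = "Order #1234 shipped on Monday via standard delivery."  # hypothetical context
query = "How do I track my order?"
print(generate_response(model, query, context))
```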

###### Context Memory (Simple JSON Store)

```python
import json
import os

def update_memory(user_id, query, response, path="chat_memory.json"):
    # Load the existing memory store, or start with an empty one
    memory = json.load(open(path)) if os.path.exists(path) else {}
    # Append the latest exchange for this user
    memory.setdefault(user_id, []).append({"query": query, "response": response})
    with open(path, "w") as f:
        json.dump(memory, f, indent=2)
```
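
To feed this history back into the model, a matching helper (hypothetical, not part of the store above) can read a user's recent turns and format them as the context string expected by generate_response:

```python
import json

def get_context(user_id, path="chat_memory.json", max_turns=3):
    """Return the user's most recent exchanges as a plain-text context block (sketch)."""
    try:
        with open(path, "r") as f:
            memory = json.load(f)
    except FileNotFoundError:
        return ""
    turns = memory.get(user_id, [])[-max_turns:]
    return "\n".join(f"User: {t['query']}\nAssistant: {t['response']}" for t in turns)
```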
