---
title: Customer Support Chatbot with Llama and ExecuTorch on Arm-Based Mobile Devices (with Agentic AI Capabilities)
minutes_to_complete: 60

who_is_this_for: This learning path is designed for developers with basic knowledge of Python, mobile development, and machine learning concepts. It guides you through creating an on-device customer support chatbot using Meta's Llama models deployed via PyTorch's ExecuTorch runtime, with a focus on Arm-based Android devices. The chatbot handles common customer queries (for example, product information and troubleshooting) with low latency, on-device privacy (no cloud dependency), and optimized performance. It also incorporates agentic AI capabilities, transforming the chatbot from reactive (simple Q&A) to proactive and autonomous. Agentic AI enables the bot to plan multi-step actions, use external tools, reason over user intent, and adapt responses dynamically. This is achieved by extending the core LLM with tool-calling mechanisms and multi-agent orchestration.

learning_objectives:
- Explain the architecture and capabilities of Llama models (e.g., Llama 3.2 1B/3B) for mobile use.
- Master the process of quantizing LLMs (e.g., 4-bit PTQ) to reduce model size and enable efficient inference on resource-constrained mobile devices.
- Gain proficiency in using ExecuTorch to export PyTorch models to .pte format for on-device deployment.
- Learn to leverage Arm-specific optimizations (e.g., XNNPACK, KleidiAI) to achieve 2-3x faster inference on Arm-based Android devices.
- Implement real-time inference with Llama models, enabling seamless customer support interactions (e.g., handling FAQs, troubleshooting).

prerequisites:
- Basic understanding of machine learning and deep learning (familiarity with concepts such as supervised learning, neural networks, and transfer learning, and an understanding of model training, validation, and overfitting).
- Familiarity with deep learning frameworks (experience with PyTorch for building and training neural networks, and knowledge of Hugging Face Transformers for working with pre-trained LLMs).
- An Arm-powered smartphone with the i8mm feature running Android, with 16GB of RAM.
- A USB cable to connect your smartphone to your development machine.
- An AWS Graviton4 r8g.16xlarge instance to test Arm performance optimizations, or any [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider, an on-premises Arm server, or an Arm-based laptop.
- Android Debug Bridge (adb) installed on your development machine. Follow the steps in [adb](https://developer.android.com/tools/adb) to install Android SDK Platform Tools; the adb tool is included in this package.
- Java 17 JDK. Follow the steps in [Java 17 JDK](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) to download and install the JDK on your host machine.
- Android Studio. Follow the steps in [Android Studio](https://developer.android.com/studio) to download and install Android Studio for host.
- Python 3.10.

author: Parichay Das

### Tags
skilllevels: Introductory
subjects: ML
armips:
- Neoverse

tools_software_languages:
- LLM
- GenAI
- Python
- PyTorch
- ExecuTorch
operatingsystems:
- Linux
- Windows
- Android


further_reading:
- resource:
title: Hugging Face Documentation
link: https://huggingface.co/docs
type: documentation
- resource:
title: PyTorch Documentation
link: https://pytorch.org/docs/stable/index.html
type: documentation
- resource:
title: Android
link: https://www.android.com/
type: website


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
---
# ================================================================================
# FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps" # Always the same, html page title.
layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
---
---
title: Overview
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Understanding Llama: Meta’s Large Language Model
Llama is a family of large language models trained using publicly available datasets. These models demonstrate strong performance across a range of natural language processing (NLP) tasks, including language translation, question answering, and text summarization.

In addition to their analytical capabilities, Llama models can generate human-like, coherent, and contextually relevant text, making them highly effective for applications that rely on natural language generation. Consequently, they serve as powerful tools in areas such as chatbots, virtual assistants, and language translation, as well as in creative and content-driven domains where producing natural and engaging text is essential.

Please note that the models are subject to the [acceptable use policy](https://github.com/meta-llama/llama/blob/main/USE_POLICY.md) and the [responsible use guide](https://github.com/meta-llama/llama/blob/main/RESPONSIBLE_USE_GUIDE.md).



## Quantization
A practical approach to making models fit within smartphone memory constraints is 4-bit groupwise per-token dynamic quantization of all linear layers. In this technique, dynamic quantization is applied to the activations, meaning the quantization parameters are computed at runtime from the observed minimum and maximum activation values. The model weights, in contrast, are statically quantized, with each channel quantized in groups using 4-bit signed integers. This significantly reduces memory usage while maintaining model performance on resource-constrained devices.

For further information, refer to [torchao: PyTorch Architecture Optimization](https://github.com/pytorch-labs/ao/).
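
To make the scheme concrete, the snippet below is a minimal, self-contained sketch of symmetric groupwise 4-bit weight quantization of a single weight matrix in plain PyTorch. It is illustrative only; the production path in torchao and ExecuTorch also quantizes activations dynamically per token and packs the 4-bit values differently.

```python
import torch

def quantize_weights_groupwise(weight: torch.Tensor, group_size: int = 128):
    """Symmetric 4-bit groupwise quantization of a 2-D weight matrix (sketch)."""
    rows, cols = weight.shape
    w = weight.reshape(rows, cols // group_size, group_size)
    # One scale per group, chosen so the largest magnitude maps to 7 (int4 range is -8..7)
    scales = (w.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    return q.reshape(rows, cols), scales

def dequantize_groupwise(q: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
    rows, cols = q.shape
    w = q.reshape(rows, cols // group_size, group_size).float() * scales
    return w.reshape(rows, cols)

w = torch.randn(32, 256)
q, s = quantize_weights_groupwise(w)
w_hat = dequantize_groupwise(q, s)
print("max abs error:", (w - w_hat).abs().max().item())
```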

The table below shows WikiText perplexity evaluated with [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness).

The results are for two different group sizes, with max_seq_len 2048 and 1000 samples:

| Model | Baseline (FP32) | Groupwise 4-bit (group size 128) | Groupwise 4-bit (group size 256) |
|------------|-----------------|----------------------------------|----------------------------------|
| Llama 2 7B | 9.2 | 10.2 | 10.7 |
| Llama 3 8B | 7.9 | 9.4 | 9.7 |

Note that a group size smaller than 128 was not enabled in this example because the resulting model was still too large. This is because current efforts have focused on enabling FP32, and support for FP16 is under way.
---
title: Environment Setup
weight: 3

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Android NDK and Android Studio - Environment Setup

#### Platform Requirements
- An AWS Graviton4 r8g.16xlarge instance to test Arm performance optimizations, or any [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider, an on-premises Arm server, or an Arm-based laptop.
- An Arm-powered smartphone with the i8mm feature running Android, with 16GB of RAM.
- A USB cable to connect your smartphone to your development machine.

The installation and configuration of Android Studio can be accomplished through the following steps:
1. Download and install the latest version of [Android Studio](https://developer.android.com/studio).
2. Launch Android Studio and access the Settings dialog.
3. Navigate to Languages & Frameworks → Android SDK.
4. Under the SDK Platforms tab, ensure that Android 14.0 (“UpsideDownCake”) is selected.

Next, proceed to install the required version of the Android NDK by first setting up the Android Command Line Tools.
Linux:
```bash
curl https://dl.google.com/android/repository/commandlinetools-linux-11076708_latest.zip -o commandlinetools.zip
unzip commandlinetools.zip
./cmdline-tools/bin/sdkmanager --install "ndk;26.1.10909697"
```
Install the NDK in the same directory where Android Studio installed the SDK, which is located at ~/Library/Android/sdk by default. Then configure the necessary environment variables as follows:
```bash
export ANDROID_HOME="$(realpath ~/Library/Android/sdk)"
export PATH=$ANDROID_HOME/cmdline-tools/bin/:$PATH
sdkmanager --sdk_root="${ANDROID_HOME}" --install "ndk;28.0.12433566"
export ANDROID_NDK=$ANDROID_HOME/ndk/28.0.12433566/
```

#### Install Java 17 JDK
1. Open the Java SE 17 Archive [Downloads](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) page in your browser.
2. Select an appropriate download for your development machine operating system.

#### Install Git and cmake
```bash
sudo apt-get install git cmake
```

#### Install Python 3.10
```bash
sudo apt-get install python3.10
```

#### Set up ExecuTorch
ExecuTorch is an end-to-end framework designed to facilitate on-device inference across a wide range of mobile and edge platforms, including wearables, embedded systems, and microcontrollers. As a component of the PyTorch Edge ecosystem, it streamlines the efficient deployment of PyTorch models on edge devices. For further details, refer to the [ExecuTorch Overview](https://pytorch.org/executorch/stable/overview/).

It is recommended to create an isolated Python environment to install the ExecuTorch dependencies. Instructions are available for setting up either a Python virtual environment or a Conda virtual environment—you only need to choose one of these options.

##### Install Required Tools (Python environment setup)
```bash
python3 -m venv exec_env
source exec_env/bin/activate
pip install torch torchvision torchaudio
pip install executorch
```
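
To confirm that the ExecuTorch installation works, you can export a trivial module to a .pte file. The snippet below is a minimal sketch of the general torch.export → to_edge → to_executorch flow; the exact APIs can vary between ExecuTorch releases, so treat it as illustrative rather than definitive.

```python
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

# Capture the model with torch.export, lower it to the Edge dialect,
# then compile it into an ExecuTorch program.
example_inputs = (torch.randn(1, 8),)
exported = torch.export.export(TinyModel().eval(), example_inputs)
executorch_program = to_edge(exported).to_executorch()

with open("tiny_model.pte", "wb") as f:
    f.write(executorch_program.buffer)
print("Wrote tiny_model.pte")
```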
##### Clone Required Repositories
```bash
git clone https://github.com/pytorch/executorch.git
git clone https://github.com/pytorch/text.git
```
##### Download Pretrained Model (Llama 3.1 Instruct)
Download the quantized model weights optimized for mobile deployment from either the Meta AI Hub or Hugging Face.
```bash
git lfs install
git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
```

##### Verify Android SDK and NDK Paths
```bash
ANDROID_SDK_ROOT=/Users/<you>/Library/Android/sdk
ANDROID_NDK_HOME=$ANDROID_SDK_ROOT/ndk/26.1.10909125
```
---
title: Model Preparation and Conversion
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

To begin working with Llama 3, the pre-trained model parameters can be accessed through Meta’s Llama Downloads page. Users are required to request access by submitting their details and reviewing and accepting the Responsible Use Guide. Upon approval, a license and a download link—valid for 24 hours—are provided. For this exercise, the Llama 3.2 1B Instruct model is utilized; however, the same procedures can be applied to other available variants with only minor modifications.

Convert the model into an ExecuTorch-compatible format optimized for Arm devices.
## Script the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# torch.jit.script does not support most Hugging Face models; load with
# torchscript=True and trace the model with example inputs instead.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, torchscript=True
)
model.eval()

example = tokenizer("Hello, how can I help you?", return_tensors="pt")
scripted_model = torch.jit.trace(model, (example.input_ids,))
scripted_model.save("llama_exec.pt")
```

Install the llama-stack package with pip.
```bash
pip install llama-stack
```

Run the following command to download the model, and paste the download link from the email when prompted.
```bash
llama model download --source meta --model-id Llama3.2-1B-Instruct
```

When the download is finished, the installation path is printed as output.
```output
Successfully downloaded model to /<path-to-home>/.llama/checkpoints/Llama3.2-1B-Instruct
```
Verify by viewing the downloaded files under this path:
```bash
ls $HOME/.llama/checkpoints/Llama3.2-1B-Instruct
```

The output is similar to:

```output
checklist.chk consolidated.00.pth params.json tokenizer.model
```

Export the model and generate a .pte file by running the appropriate Python command. This command will export the model and save the resulting file in your current working directory.
```bash
python3 -m examples.models.llama.export_llama \
--checkpoint $HOME/.llama/checkpoints/Llama3.2-1B-Instruct/consolidated.00.pth \
--params $HOME/.llama/checkpoints/Llama3.2-1B-Instruct/params.json \
-kv --use_sdpa_with_kv_cache -X --xnnpack-extended-ops -qmode 8da4w \
--group_size 64 -d fp32 \
--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001, 128006, 128007]}' \
--embedding-quantize 4,32 \
--output_name="llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte" \
--max_seq_length 1024 \
--max_context_length 1024
```

Because Llama 3 has a larger vocabulary size, it is recommended to quantize the embeddings using the parameter --embedding-quantize 4,32. This helps to further optimize memory usage and reduce the overall model size.
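
After the export completes, you can sanity-check the generated file. The snippet below is a small illustrative check (the file name matches the --output_name used above):

```python
import os

# Quick check that the exported .pte exists and is much smaller than the FP32 checkpoint
pte = "llama3_1B_kv_sdpa_xnn_qe_4_64_1024_embedding_4bit.pte"
size_gb = os.path.getsize(pte) / 1e9
print(f"{pte}: {size_gb:.2f} GB")
```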


###### Load a fine-tuned model (from Hugging Face)
- Example: meta-llama/Meta-Llama-3-8B-Instruct or a customer-support fine-tuned variant

###### Model Optimization for Arm (Understanding Quantization)
- Reduces model precision (e.g., 32-bit → 8-bit)
- Decreases memory footprint (~4x reduction; see the sketch below)
- Speeds up inference on CPU
- Minimal accuracy loss for most tasks
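
As a rough back-of-the-envelope illustration of the ~4x figure, the snippet below compares the weight storage of a model in FP32 and INT8 (weights only; activations and the KV cache add to the real footprint):

```python
# Approximate weight memory for a 1B-parameter model (illustrative numbers)
params = 1_000_000_000
fp32_gb = params * 4 / 1e9   # 4 bytes per FP32 weight
int8_gb = params * 1 / 1e9   # 1 byte per INT8 weight

print(f"FP32: {fp32_gb:.1f} GB, INT8: {int8_gb:.1f} GB, reduction: {fp32_gb / int8_gb:.0f}x")
```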

###### Apply Dynamic Quantization
- Create optimize_model.py

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.quantization import quantize_dynamic
import time
import os

def load_base_model(model_name):
    """Load the base model"""
    print(f"Loading base model: {model_name}")

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float32,
        device_map=None,
        low_cpu_mem_usage=True
    )
    model.eval()

    return model, tokenizer

def apply_quantization(model):
    """Apply dynamic quantization"""
    print("Applying dynamic quantization...")

    quantized_model = quantize_dynamic(
        model,
        {torch.nn.Linear},  # Quantize linear layers
        dtype=torch.qint8
    )

    return quantized_model

def test_model(model, tokenizer, prompt):
    """Test model with a sample prompt"""
    inputs = tokenizer(prompt, return_tensors="pt")

    start_time = time.time()
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_new_tokens=100,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )
    inference_time = time.time() - start_time

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response, inference_time

def main():
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct"

    # Load base model
    base_model, tokenizer = load_base_model(model_name)

    # Test base model
    test_prompt = "How do I track my order?"
    print("\nTesting base model...")
    response, base_time = test_model(base_model, tokenizer, test_prompt)
    print(f"Base model inference time: {base_time:.2f}s")

    # Apply quantization
    quantized_model = apply_quantization(base_model)

    # Test quantized model
    print("\nTesting quantized model...")
    response, quant_time = test_model(quantized_model, tokenizer, test_prompt)
    print(f"Quantized model inference time: {quant_time:.2f}s")
    print(f"Speedup: {base_time / quant_time:.2f}x")

    # Save quantized model
    save_dir = "./models/quantized_llama3"
    os.makedirs(save_dir, exist_ok=True)

    torch.save(quantized_model.state_dict(), f"{save_dir}/model.pt")
    tokenizer.save_pretrained(save_dir)

    print(f"\nQuantized model saved to: {save_dir}")

if __name__ == "__main__":
    main()
```
---
title: Building the Chatbot Logic
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Conversation Framework (Python prototype)
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def generate_response(model, query, context):
prompt = f"### Context:\n{context}\n### User Query:\n{query}\n### Assistant Response:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
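
As a quick sanity check, you can call generate_response with a model loaded through Transformers. This is a minimal sketch; the context string is a made-up example, and on the device you would run the exported .pte model through the ExecuTorch runtime rather than the full-precision checkpoint.

```python
from transformers import AutoModelForCausalLM

# Desktop-side test only; assumes access to the gated Hugging Face checkpoint
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

context = "Order #1234 shipped on Monday via standard delivery."  # hypothetical context
query = "How do I track my order?"
print(generate_response(model, query, context))
```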

###### Context Memory (Simple JSON Store)

```python
import json
import os

def update_memory(user_id, query, response, path="chat_memory.json"):
    # Load the existing memory store, or start with an empty one
    memory = json.load(open(path)) if os.path.exists(path) else {}
    # Append the latest exchange for this user
    memory.setdefault(user_id, []).append({"query": query, "response": response})
    with open(path, "w") as f:
        json.dump(memory, f, indent=2)
```
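
To feed this history back into the model, a matching helper (hypothetical, not part of the store above) can read a user's recent turns and format them as the context string expected by generate_response:

```python
import json

def get_context(user_id, path="chat_memory.json", max_turns=3):
    """Return the user's most recent exchanges as a plain-text context block (sketch)."""
    try:
        with open(path, "r") as f:
            memory = json.load(f)
    except FileNotFoundError:
        return ""
    turns = memory.get(user_id, [])[-max_turns:]
    return "\n".join(f"User: {t['query']}\nAssistant: {t['response']}" for t in turns)
```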
