
TinyML & Edge AI: On-device inference, model quantization, embedded ML, ultra-low-power AI for microcontrollers and IoT devices.


🚀 AI Edge Computing & TinyML

Comprehensive Guide to State-of-the-Art Edge AI



🌟 Latest Update: January 2025

Production-ready Python implementation with modern tooling (Hatch, Ruff, Mypy):

  • 62/62 tests passing
  • 81.76% code coverage
  • Zero security issues
  • State-of-the-art algorithms & trends for Edge AI and embedded systems


📋 Table of Contents

🚀 Getting Started

🔥 Core Topics

🛠️ Frameworks & Tools

📚 Documentation

📚 Resources

🎓 Community


🚀 Quick Start & Development


📦 Installation

This project uses modern Python tooling with Hatch for dependency management and development workflows.

# Clone the repository
git clone https://github.com/umitkacar/ai-edge-computing-tiny-embedded.git
cd ai-edge-computing-tiny-embedded

# Install Hatch (it manages project dependencies and environments)
pip install hatch

# Run tests
hatch run test

# Run full CI pipeline
hatch run ci

🛠️ Development Setup

Modern Python Stack:

  • Build System: Hatch - Modern Python project manager
  • Linting: Ruff - Extremely fast Python linter (10-100x faster than flake8)
  • Formatting: Black - The uncompromising code formatter
  • Type Checking: Mypy - Static type checker (strict mode)
  • Testing: Pytest - Comprehensive test framework
  • Security: Bandit - Security vulnerability scanner
  • Pre-commit: Automated quality checks on commit/push

Available Commands:

# Linting & Formatting
hatch run lint          # Run Ruff linter
hatch run format        # Format code with Black
hatch run format-check  # Check formatting without changes

# Type Checking
hatch run type-check    # Run Mypy strict type checking

# Testing
hatch run test                    # Run tests (sequential)
hatch run test-parallel           # Run tests with auto workers
hatch run test-parallel-cov       # Parallel tests with coverage

# Security
hatch run security      # Run Bandit security audit

# Complete CI Pipeline
hatch run ci           # Run all checks (format, lint, type-check, security, test)

📊 Project Structure

ai-edge-computing-tiny-embedded/
├── src/ai_edge_tinyml/          # Source code (src layout)
│   ├── __init__.py              # Package initialization
│   ├── quantization.py          # INT8/INT4/FP16 quantization
│   ├── model_optimizer.py       # Model optimization pipeline
│   ├── utils.py                 # Utility functions
│   └── py.typed                 # PEP 561 marker (typed package)
├── tests/                       # Test suite (62 tests, 81.76% coverage)
│   ├── conftest.py              # Pytest configuration & fixtures
│   ├── test_quantization.py     # Quantization tests (21 tests)
│   ├── test_model_optimizer.py  # Optimizer tests (19 tests)
│   └── test_utils.py            # Utility tests (22 tests)
├── pyproject.toml               # Project configuration (single source of truth)
├── .pre-commit-config.yaml      # Pre-commit hooks configuration
├── CHANGELOG.md                 # Detailed change history
├── LESSONS-LEARNED.md           # Best practices & insights
├── DEVELOPMENT.md               # Development guidelines
└── README.md                    # This file

Quality Assurance

This project maintains production-ready code quality:

Check              Status    Details
Ruff Linting       ✅ PASS   50+ rules, zero errors
Black Formatting   ✅ PASS   Line length: 100
Mypy Type Check    ✅ PASS   Strict mode enabled
Bandit Security    ✅ PASS   0 vulnerabilities
Test Suite         ✅ PASS   62/62 tests passing
Code Coverage      ✅ PASS   81.76% (exceeds 80%)
Pre-commit Hooks   ✅ PASS   15+ automated checks

Test Results:

tests/test_quantization.py      21 passed
tests/test_model_optimizer.py   19 passed
tests/test_utils.py             22 passed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total: 62 passed in 0.50s ✅
Coverage: 81.76% (exceeds 80% threshold) ✅

🔒 Security

  • Bandit Security Audit: Zero vulnerabilities detected
  • Type Safety: Full type annotations with mypy strict mode
  • Dependency Scanning: Automated security checks in CI
  • Pre-commit Hooks: Security validations before commit

📚 Documentation

  • CHANGELOG.md - Detailed version history and changes
  • LESSONS-LEARNED.md - Best practices, insights, and technical decisions
  • DEVELOPMENT.md - Comprehensive development guidelines
  • API Documentation: Auto-generated from Google-style docstrings

🎯 Features

Quantization Support:

  • ✅ INT8 Quantization (8-bit integers)
  • ✅ INT4 Quantization (4-bit integers)
  • ✅ FP16 Quantization (16-bit floats)
  • ✅ Dynamic Quantization
  • ✅ Symmetric & Asymmetric modes
  • ✅ Per-tensor & per-channel quantization

Model Optimization:

  • ✅ Weight quantization with 6 different modes
  • ✅ Compression ratio analysis
  • ✅ Model size calculation
  • ✅ Type-safe APIs with full annotations
  • ✅ Comprehensive error handling

Example Usage:

import numpy as np
from ai_edge_tinyml import Quantizer, QuantizationConfig, QuantizationMode

# Create quantization config
config = QuantizationConfig(
    mode=QuantizationMode.INT8,
    symmetric=True,
    per_channel=False
)

# Initialize quantizer
quantizer = Quantizer(config)

# Quantize weights
weights = np.random.randn(100, 100).astype(np.float32)
quantized = quantizer.quantize(weights)

# Dequantize for inference
dequantized = quantizer.dequantize(quantized)

# Calculate compression
from ai_edge_tinyml.utils import calculate_compression_ratio
ratio = calculate_compression_ratio(weights, quantized)
print(f"Compression ratio: {ratio:.2f}x")

🔥 SOTA Models & Algorithms (2024-2025)



🎯 Object Detection Models

🥇 YOLOv11 (YOLO11)


🚀 State-of-the-art real-time object detection with transformer-based improvements

✨ Key Features:

  • ⚡ Transformer-based backbone with C3k2 blocks
  • 🎯 Partial Self-Attention (PSA) mechanism
  • 🔥 NMS-free training with dual label assignment
  • 📉 25-40% lower latency vs YOLOv10
  • 📊 10-15% improvement in mAP
  • 60+ FPS processing capability

📚 Resources:

📖 Ultralytics Docs → https://docs.ultralytics.com/models/
📄 YOLO Evolution → https://arxiv.org/html/2510.09653v2

🥈 YOLOv10


⚡ Eliminates NMS for end-to-end real-time detection

📊 Performance Metrics:

  • 🔸 YOLOv10s: 1.8x faster than RT-DETR-R18
  • 🔸 YOLOv10b: 46% less latency, 25% fewer parameters than YOLOv9-C
  • 🔸 mAP Range: 38.5 - 54.4

📚 Resources:

📄 Paper → https://arxiv.org/pdf/2405.14458
📖 Docs → https://docs.ultralytics.com/models/yolov10/

🤖 RT-DETR & RT-DETRv2


🎯 First practical real-time detection transformer

Model       AP Score   FPS    Device
RT-DETR     53.1%      108    NVIDIA T4
RT-DETRv2   >55%       108+   NVIDIA T4

🔗 Resources:


📱 Efficient Vision Models for Edge

graph LR
    A[🖼️ Input Image] --> B[📱 MobileNetV4]
    A --> C[⚡ EfficientViT]
    B --> D[🎯 87% Accuracy]
    C --> E[🔥 3.8ms Latency]
    D --> F[📲 Edge TPU]
    E --> F
    style A fill:#e1f5ff
    style B fill:#ffe1f5
    style C fill:#f5ffe1
    style D fill:#ffe1e1
    style E fill:#e1ffe1
    style F fill:#ffd700

📱 MobileNetV4


🌐 Universal efficient architecture for mobile ecosystem

🎨 Innovations:

  • 🔹 Universal Inverted Bottleneck (UIB) block
  • ⚡ Mobile MQA attention (39% speedup)
  • 🎯 Optimized NAS recipe
  • 🏆 87% ImageNet accuracy @ 3.8ms (Pixel 8 EdgeTPU)

📚 Resources:

EfficientViT


🧠 Lightweight multi-scale attention for high-resolution tasks

✨ Features:

  • 🔸 Memory-efficient Vision Transformer
  • 🔸 Cascaded group attention
  • 🔸 Dense prediction tasks optimized
  • 🔸 High-resolution image processing

🤖 Small Language Models (SLMs) for Edge



🧠 Microsoft Phi-3


📊 Variants:

Model: Phi-3-mini
Parameters: 3.8B
Context: Up to 128K tokens
Deployment: GPU, CPU, Mobile
Status: ✅ Production Ready

🎯 Optimized For:

  • 💻 GPU acceleration
  • 🖥️ CPU inference
  • 📱 Mobile deployment
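
As a quick orientation, here is a minimal Hugging Face transformers sketch for running Phi-3-mini (the model ID below is Microsoft's published Hugging Face ID; a recent transformers release that includes the Phi-3 architecture is assumed):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # published Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
# older transformers versions may need trust_remote_code=True here
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Edge AI lets devices", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))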

🔗 Resources:

🦙 TinyLlama


📊 Specifications:

Parameters: 1.1B
Target: Mobile/Edge devices
Performance: High for size class
Year: 2024
Status: ✅ Active

✨ Highlights:

  • 🔸 Compact architecture
  • 🔸 Edge-optimized
  • 🔸 Strong performance/size ratio

🌟 Google Gemini Nano


📱 On-device AI for Smartphones

Variants:

  • 📊 1.8B parameters (lightweight)
  • 📊 3.25B parameters (standard)

🎯 Capabilities:

  • ✅ Context-aware reasoning
  • ✅ Real-time translation
  • ✅ Text summarization
  • ✅ Edge-optimized for phones/IoT

🦙 Meta Llama 3.2


🖼️ Edge AI & Vision Capabilities

Features:

  • ⚡ Edge deployment optimized
  • 👁️ Vision-language capabilities
  • 📱 Mobile-friendly variants
  • 🔥 Latest architecture

🔗 Resources:


📷 MobileVLM


🎨 Efficient vision-language model for mobile devices

Specifications:

  • 🔹 MobileLLaMA: 2.7B parameters
  • 🔹 Trained from scratch on open datasets
  • 🔹 Fully optimized for mobile deployment
  • 🔹 Vision + Language capabilities

State Space Models: Efficient Alternatives to Transformers



🐍 Mamba


⚡ Linear-time sequence modeling with selective state spaces

🚀 Performance Highlights:

Metric       Performance
Throughput   5x higher than Transformers
Scaling      Linear in sequence length
Quality      Mamba-3B outperforms same-size Transformers
             Matches Transformers 2x its size

📊 Advantages:

+ ✅ Linear time complexity
+ ✅ 5x throughput improvement
+ ✅ Efficient long sequences
+ ✅ Lower memory footprint
- ❌ Newer architecture (less tested)
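
To see where the linear scaling comes from, here is a deliberately simplified NumPy sketch of a state-space recurrence. Real Mamba makes the parameters input-dependent (the "selective" part) and uses a hardware-aware parallel scan; this fixed-parameter loop only illustrates the O(L) cost:

import numpy as np

def ssm_scan(x, A, B, C):
    """h_t = A @ h_{t-1} + B * x_t ;  y_t = C @ h_t  -- one state update per token."""
    h = np.zeros(B.shape[0])
    y = np.empty_like(x)
    for t, x_t in enumerate(x):   # O(L) in sequence length, O(1) state per step
        h = A @ h + B * x_t
        y[t] = C @ h
    return y

L, N = 1024, 16                   # sequence length, state dimension
y = ssm_scan(np.random.randn(L), 0.1 * np.random.randn(N, N),
             np.random.randn(N), np.random.randn(N))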

📚 Resources:

📱 eMamba


🔧 Edge-optimized Mamba acceleration framework

✨ Features:

Design: End-to-end hardware acceleration
Target: Edge platforms
Complexity: Linear time
Status: 2024 Release

🎯 Optimizations:

  • 🔹 Hardware-aware design
  • 🔹 Edge platform specific
  • 🔹 Leverages linear complexity
  • 🔹 Memory efficient

📚 Resources:


🚀 Inference Frameworks & Runtimes



TensorRT-LLM


🏆 High-performance LLM inference on NVIDIA GPUs

📊 Performance:

+ 70% faster than llama.cpp on RTX 4090
+ State-of-the-art optimizations
+ Quality maintained across precisions

✨ Features:

  • 🔸 Python & C++ API
  • 🔸 Multi-precision support
  • 🔸 Advanced kernel optimization
  • 🔸 Production-grade quality

🔗 Resources:

📄 vLLM


💡 High-throughput LLM serving with PagedAttention

🎯 Innovations:

  • ⚡ PagedAttention memory management
  • 🔸 Optimized KV cache handling
  • 🌐 Multi-platform support

🖥️ Supported Hardware:

AMD: GPU support
Google: TPU support
AWS: Inferentia support
Base: PyTorch
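
A minimal offline-inference sketch with vLLM (the model ID is a placeholder; any Hugging Face causal LM that vLLM supports works):

from vllm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")   # placeholder model ID
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["TinyML makes it possible to"], params)
print(outputs[0].outputs[0].text)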

🔗 Resources:

🦙 ExecuTorch


📱 Efficient LLM execution on edge devices

Features:

  • 🔹 Lightweight edge runtime
  • 🔹 Static memory planning
  • 🔹 Multi-platform support
  • 🔹 TorchAO quantization

💻 Hardware Support:

  • ✅ CPU
  • ✅ GPU
  • ✅ AI Accelerators
  • ✅ Mobile devices

🔗 Resources:

💻 llama.cpp


⚡ CPU-optimized LLM inference

Advantages:

+ ✅ Lower memory usage
+ ✅ No GPU required
+ ✅ Fast generation
+ ✅ Cross-platform
+ ✅ Wide model support
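
A quick sketch using the community llama-cpp-python bindings (a separate project that wraps llama.cpp; the GGUF path is a placeholder for any quantized model file):

from llama_cpp import Llama

llm = Llama(model_path="model-q4_k_m.gguf", n_ctx=2048)  # placeholder GGUF file
result = llm("Q: What is TinyML? A:", max_tokens=64, stop=["Q:"])
print(result["choices"][0]["text"])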

🔗 Comparison:


🔧 Model Compression & Optimization



📉 Advanced Quantization Techniques

🏆 AWQ


Activation-aware Weight Quantization

🎯 MIT HAN Lab Innovation

Key Concept:

# Pseudocode: not all weights are equal!
# A small fraction (~1%) of weights is "salient" -- judged by the
# magnitude of the activations they see, not by the weights themselves.
if is_salient(weight):
    skip_quantization()   # protect salient weights at higher precision
else:
    quantize_weight()     # aggressively quantize the rest

Features:

  • ⚡ Protects critical weights
  • 🎯 Activation-aware
  • 🔥 State-of-the-art results
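
A rough NumPy sketch of the activation-aware idea, assuming per-input-channel activation magnitude as the saliency signal. Note that the published AWQ method protects salient channels via per-channel scaling rather than literally skipping quantization, which keeps the format hardware-friendly:

import numpy as np

def salient_channels(x_calib: np.ndarray, keep_frac: float = 0.01) -> np.ndarray:
    """Rank input channels by mean activation magnitude on calibration data."""
    importance = np.abs(x_calib).mean(axis=0)    # (d_in,) per-channel saliency
    k = max(1, int(keep_frac * importance.size))
    return np.argsort(importance)[-k:]           # indices of the ~1% salient channels

x_calib = np.random.randn(512, 768)              # calibration activations (n, d_in)
protect = salient_channels(x_calib)              # quantize these channels more carefully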

🔗 Resources:

💎 GPTQ


GPU-Focused Quantization

Features:

  • 🔸 Row-wise quantization
  • 🔸 Hessian optimization
  • 🔸 GPU inference focused
  • 🔸 175B models supported

Achievements:

Models: BLOOM, OPT-175B
Precision: 4-bit
Platform: GPU optimized

🔬 QLoRA


Efficient Fine-tuning

Innovations:

  • ✨ 4-bit NormalFloat (NF4)
  • ✨ Double quantization
  • ✨ LoRA adapters
  • ✨ Single GPU fine-tuning

Capability:

+ Fine-tune 65B model
+ On single GPU
+ Maintain quality
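
The adapter math is easy to sketch. In QLoRA the frozen base weight is stored in 4-bit NF4 and dequantized on the fly; this NumPy toy keeps everything in FP32 and only shows the low-rank update path:

import numpy as np

def lora_forward(x, W0, A, B, alpha=16.0):
    """y = x @ W0.T + (alpha / r) * x @ A.T @ B.T  -- only A and B are trained."""
    r = A.shape[0]
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

d_in, d_out, r = 768, 768, 8
x = np.random.randn(4, d_in)
W0 = np.random.randn(d_out, d_in)      # frozen base weight (NF4 in real QLoRA)
A = 0.01 * np.random.randn(r, d_in)    # adapters add ~2% of the base parameter count
B = np.zeros((d_out, r))               # B starts at zero, so training starts from W0
y = lora_forward(x, W0, A, B)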

🆕 Unsloth Dynamic 4-bit


🔥 Latest quantization innovation

Features:

  • Built on bitsandbytes
  • Dynamic parameter quantization
  • Per-parameter optimization

📚 Comprehensive Guides:


🔬 Neural Architecture Search (NAS)


🤖 Automate neural network architecture design

🎯 Once-for-All (OFA)

Concept: Train once, deploy everywhere

graph TD
    A[🌐 Supernet Training] --> B[📦 Weight Sharing]
    B --> C[📱 Mobile]
    B --> D[💻 Desktop]
    B --> E[⚡ Edge]
    style A fill:#e1f5ff
    style B fill:#ffe1f5
    style C fill:#f5ffe1
    style D fill:#ffe1e1
    style E fill:#ffd700

Features:

  • 🔹 Weight-sharing supernetwork
  • 🔹 Represents any architecture in search space
  • 🔹 Massive computational savings
  • 🔹 Applied to ImageNet with ProxylessNAS & MobileNetV3
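
A toy sketch of the weight-sharing idea: every subnet is carved out of one supernet tensor, so no subnet stores its own weights. (The real OFA additionally uses kernel-transformation matrices for elastic kernels and a progressive-shrinking training schedule.)

import numpy as np

def sample_subnet(w_super, out_ch, in_ch, kernel):
    """Slice a (C_out, C_in, K, K) supernet conv weight down to a smaller subnet."""
    K = w_super.shape[-1]
    s = (K - kernel) // 2                          # take the centered kernel window
    return w_super[:out_ch, :in_ch, s:s + kernel, s:s + kernel]

w_super = np.random.randn(256, 256, 7, 7)          # trained once
w_edge = sample_subnet(w_super, 64, 64, 3)         # deployed on a small device
w_server = sample_subnet(w_super, 256, 256, 7)     # full-capacity variant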

🔗 Resources:


🎓 Knowledge Distillation & Pruning

🔬 TinyBERT


📚 Two-stage distillation approach

Performance Metrics:

Accuracy: 96.8% of BERT-base
Size: 7.5x smaller (4 layers)
Energy: Lowest variability (0.1032 kWh SD)
Stages: Task-agnostic + Task-specific

Advantages:

  • ✅ Dual-stage distillation
  • ✅ Ultra-low energy variability
  • ✅ Compact architecture
  • ✅ High performance retention

📖 DistilBERT


⚡ Single-phase task-agnostic distillation

Performance Metrics:

Accuracy: 97% of BERT
Size Reduction: 40% smaller
Speed: 60% faster
Use Case: General-purpose

Recent Research (2025):

  • 🔸 32% energy reduction with pruning
  • 🔸 Iterative distillation + adaptive pruning
  • 🔸 Nature Scientific Reports
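
At the core of both models is a soft-target loss: the student matches the teacher's temperature-scaled output distribution. Below is a minimal NumPy sketch of Hinton-style distillation (the actual TinyBERT/DistilBERT objectives add embedding- and hidden-state-matching terms):

import numpy as np

def softmax(z, T=1.0):
    z = z / T                                    # temperature softening
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    p_t = softmax(teacher_logits, T)
    kl = p_t * (np.log(p_t + 1e-9) - np.log(softmax(student_logits, T) + 1e-9))
    return (T * T) * kl.sum(axis=-1).mean()

loss = distillation_loss(np.random.randn(8, 30522), np.random.randn(8, 30522))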

📚 Resources:


🎯 TinyML & MCU-specific Advances



🧠 MCUNet Series - MIT HAN Lab

📱 MCUNetV1


Foundation:

  • 🔸 Neural architecture for MCUs
  • 🔸 Co-designed model + inference engine
  • 🔸 Ultra-low memory footprint

🚀 MCUNetV2


Achievements:

ImageNet: 71.8% accuracy
Visual Wake Words: >90% (32kB SRAM)
Capability: Object detection
Platform: Tiny devices

MCUNetV3


Latest:

  • 🔸 Enhanced efficiency
  • 🔸 State-of-the-art MCU AI
  • 🔸 Production ready

🎓 Additional MCU Tools

🔧 TinyTL

  • Tiny transfer learning for MCUs
  • On-device learning capabilities
  • Minimal resource overhead

⚙️ PockEngine

  • Inference engine optimization
  • MCU-specific acceleration
  • Memory-efficient execution

📚 Resources:


🔬 TinyDL (Tiny Deep Learning)


🎯 Evolution from TinyML to deep learning on edge

Focus Areas:

  • 🔹 Deep learning on ultra-constrained hardware
  • 🔹 Power consumption in mW range
  • 🔹 On-device sensor analytics
  • 🔹 Real-time inference

📄 Resources:


🔩 Hardware Acceleration & Platforms



🖥️ Edge AI Platforms

🟢 NVIDIA Jetson Orin Nano Super


Specifications:

Compute: 67 INT8 TOPS
Performance: 1.7x generative AI performance vs. original Orin Nano
Price: $249
Release: Late 2024
Status: ✅ Available

Features:

  • ⚡ Generative AI optimized
  • 🎯 Edge AI development kit
  • 💰 Affordable price point

🔷 Edge TPU & Neural Accelerators

Hardware Platforms:

Google

  • Google Pixel EdgeTPU
  • Coral Dev Board

Apple

  • Apple Neural Engine
  • A-series chips

Generic

  • Specialized NPUs
  • Custom ASICs

📱 Mobile Deployment Targets

Platform         Architecture        Use Case
🔧 ARM CPUs      ARM Cortex          General compute
📡 Mobile DSPs   Qualcomm/MediaTek   Signal processing
🎮 Mobile GPUs   Mali/Adreno         Graphics + AI
🧠 NPUs          Custom ASICs        Neural processing

🛠️ Implementation Resources & Tools



🔷 ONNX Runtime

Cross-platform inference with ONNX models
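
A minimal inference sketch ("model.onnx", the input shape, and the provider choice are placeholders for your exported model):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name                # discover the graph input name

x = np.random.randn(1, 3, 224, 224).astype(np.float32)   # placeholder image tensor
outputs = session.run(None, {input_name: x})             # None = return all outputs
print(outputs[0].shape)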

📚 Documentation & Tutorials

🔧 Compatibility

💻 Example Implementations

📦 Model Repositories


📉 ONNX Runtime Quantization
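
As one example of the tooling, ONNX Runtime ships a post-training dynamic quantizer. A minimal sketch (file paths are placeholders):

from onnxruntime.quantization import QuantType, quantize_dynamic

# Weights are converted to INT8 offline; activations are quantized at runtime.
quantize_dynamic(
    model_input="model.onnx",         # placeholder: FP32 source model
    model_output="model.int8.onnx",   # quantized result
    weight_type=QuantType.QInt8,
)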


Tools & Resources:


🎯 YOLO Implementations


🟣 YOLO-NAS with ONNX

🟢 YOLO + TensorRT (Detection, Pose, Segmentation)

🔵 YOLO + ONNXRuntime (All Tasks)

🌐 Community Resources


TensorRT


🚀 NVIDIA's high-performance deep learning inference optimizer

Resources:


🌐 Edge Deployment Frameworks



🚀 FastDeploy - PaddlePaddle


📦 Easy-to-use deployment toolbox for AI models

Resources:


💎 DeepSparse & SparseML - Neural Magic


🖥️ CPU-optimized inference with sparsity

Features:

  • ⚡ CPU inference acceleration
  • 🔸 Sparsity-aware optimization
  • 📊 YOLOv5 CPU benchmarks

Resources:

📱 NCNN - Tencent


🎯 High-performance neural network inference for mobile

Resources:


🔧 MACE - Xiaomi


🤖 Mobile AI Compute Engine

Resources:


🍎 CoreML - Apple


🎨 Machine learning framework for iOS/macOS


🎨 Model Collections

🛠️ Tools & Documentation

🎨 Stable Diffusion on CoreML


⚙️ Compilers & Low-Level Frameworks



🔧 TVM - Apache


🎯 End-to-end deep learning compiler stack

Resources:


🔨 LLVM


⚙️ Compiler infrastructure project

Resources:


XNNPACK - Google


🚀 High-efficiency floating-point neural network operators

Resources:

🔷 ARM-NN


💪 Inference engine for ARM platforms

Resources:


🧠 CMSIS-NN


📱 Efficient neural network kernels for ARM Cortex-M

Resources:


📱 Samsung ONE


🔧 On-device Neural Engine compiler

Resources:


💼 Industry & Commercial Solutions

Industry


🚀 Deeplite


🎯 AI-Driven Optimizer for Deep Neural Networks

Focus:


  • ⚡ Faster inference
  • 📦 Smaller models
  • 🔋 Energy efficient
  • ☁️ Cloud to edge
  • 🎯 Maintain accuracy

🔗 Resources:


🔧 Utility Frameworks & Tools



👁️ OpenCV


📷 Computer vision library with C++ support

Resources:

🎬 VQRF - Video Compression

VQRF

📹 Vector Quantized Radiance Fields

Resources:


🖼️ Additional Model Architectures



🎯 PP-PicoDet


📱 Lightweight real-time object detector for mobile

Resources:

🔬 EtinyNet


🎯 Extremely tiny network for TinyML

Resources:



🧠 Computing Architectures & APIs



  • ARM: Mobile & embedded
  • RISC-V: Open-source ISA
  • CUDA: NVIDIA GPU
  • Metal: Apple GPU
  • OpenCL: Cross-platform
  • Vulkan: Graphics & compute


📚 Research Papers & Academic Resources



📖 Foundational Surveys (2024-2025)


🌐 Edge Computing & Deep Learning

🔬 TinyML Specific

State Space Models & Efficient Architectures

👁️ Vision Models

🔧 Model Compression & Optimization

📚 Collections


🎓 Contributing & Community



This repository serves as a comprehensive resource for AI edge computing and TinyML practitioners.

Contributions, updates, and corrections are welcome! 🚀


📊 Repository Stats



🏷️ Keywords

TinyML · Edge AI · Embedded ML · Model Compression · Quantization · Neural Architecture Search · YOLO · MobileNet · Transformer · State Space Models · ONNX Runtime · TensorRT · Inference Optimization · MCU · IoT · Real-Time AI


📅 Last Updated

January 2025