diff --git a/docs/build/eps.md b/docs/build/eps.md
index 9fc7a2510eeb8..2128a56f54ca6 100644
--- a/docs/build/eps.md
+++ b/docs/build/eps.md
@@ -337,19 +337,20 @@ See more information on the OpenVINO™ Execution Provider [here](../execution-p
 ### Prerequisites
 {: .no_toc }
 
-1. Install the OpenVINO™ offline/online installer from Intel® Distribution of OpenVINO™TM Toolkit **Release 2024.3** for the appropriate OS and target hardware:
-   * [Windows - CPU, GPU, NPU](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/download.html?PACKAGE=OPENVINO_BASE&VERSION=v_2024_3_0&OP_SYSTEM=WINDOWS&DISTRIBUTION=ARCHIVE).
-   * [Linux - CPU, GPU, NPU](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/download.html?PACKAGE=OPENVINO_BASE&VERSION=v_2024_3_0&OP_SYSTEM=LINUX&DISTRIBUTION=ARCHIVE)
-
-   Follow [documentation](https://docs.openvino.ai/2024/home.html) for detailed instructions.
-
-   *2024.5 is the current recommended OpenVINO™ version. [OpenVINO™ 2024.5](https://docs.openvino.ai/2024/index.html) is minimal OpenVINO™ version requirement.*
+1. Install the OpenVINO™ offline/online installer from the Intel® Distribution of OpenVINO™ Toolkit **Release 2025.3** for the appropriate OS and target hardware:
+   * [Windows - CPU, GPU, NPU](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/download.html?PACKAGE=OPENVINO_BASE&VERSION=v_2025_3_0&OP_SYSTEM=WINDOWS&DISTRIBUTION=ARCHIVE).
+   * [Linux - CPU, GPU, NPU](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/download.html?PACKAGE=OPENVINO_BASE&VERSION=v_2025_3_0&OP_SYSTEM=LINUX&DISTRIBUTION=ARCHIVE)
+
+   Follow the [documentation](https://docs.openvino.ai/2025/index.html) for detailed instructions.
+
+   *2025.3 is the current recommended OpenVINO™ version. [OpenVINO™ 2025.0](https://docs.openvino.ai/2025/index.html) is the minimum required OpenVINO™ version.*
 
-2. Configure the target hardware with specific follow on instructions:
-   * To configure Intel® Processor Graphics(GPU) please follow these instructions: [Windows](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html#windows), [Linux](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html#linux)
+2. Install CMake 3.28 or higher. Download from the [official CMake website](https://cmake.org/download/).
+3. Configure the target hardware with the device-specific instructions below:
+   * To configure Intel® Processor Graphics (GPU), please follow these instructions: [Windows](https://docs.openvino.ai/2025/get-started/install-openvino/configurations/configurations-intel-gpu.html#windows), [Linux](https://docs.openvino.ai/2025/get-started/install-openvino/configurations/configurations-intel-gpu.html#linux)
 
-3. Initialize the OpenVINO™ environment by running the setupvars script as shown below. This is a required step:
+4. Initialize the OpenVINO™ environment by running the setupvars script as shown below. This is a required step:
    * For Windows:
    ```
    C:\\setupvars.bat
@@ -358,7 +359,7 @@ See more information on the OpenVINO™ Execution Provider [here](../execution-p
    ```
    $ source /setupvars.sh
    ```
-   **Note:** If you are using a dockerfile to use OpenVINO™ Execution Provider, sourcing OpenVINO™ won't be possible within the dockerfile. You would have to explicitly set the LD_LIBRARY_PATH to point to OpenVINO™ libraries location. Refer our [dockerfile](https://github.com/microsoft/onnxruntime/blob/main/dockerfiles/Dockerfile.openvino).
+ ### Build Instructions {: .no_toc } @@ -366,7 +367,7 @@ See more information on the OpenVINO™ Execution Provider [here](../execution-p #### Windows ``` -.\build.bat --config RelWithDebInfo --use_openvino --build_shared_lib --build_wheel +.\build.bat --config Release --use_openvino --build_shared_lib --build_wheel ``` *Note: The default Windows CMake Generator is Visual Studio 2019, but you can also use the newer Visual Studio 2022 by passing `--cmake_generator "Visual Studio 17 2022"` to `.\build.bat`* @@ -374,14 +375,14 @@ See more information on the OpenVINO™ Execution Provider [here](../execution-p #### Linux ```bash -./build.sh --config RelWithDebInfo --use_openvino --build_shared_lib --build_wheel +./build.sh --config Release --use_openvino --build_shared_lib --build_wheel ``` * `--build_wheel` Creates python wheel file in dist/ folder. Enable it when building from source. * `--use_openvino` builds the OpenVINO™ Execution Provider in ONNX Runtime. -* ``: Specifies the default hardware target for building OpenVINO™ Execution Provider. This can be overriden dynamically at runtime with another option (refer to [OpenVINO™-ExecutionProvider](../execution-providers/OpenVINO-ExecutionProvider.md#summary-of-options) for more details on dynamic device selection). Below are the options for different Intel target devices. +* ``: Specifies the default hardware target for building OpenVINO™ Execution Provider. This can be overriden dynamically at runtime with another option (refer to [OpenVINO™-ExecutionProvider](../execution-providers/OpenVINO-ExecutionProvider.md#configuration-options) for more details on dynamic device selection). Below are the options for different Intel target devices. -Refer to [Intel GPU device naming convention](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html#device-naming-convention) for specifying the correct hardware target in cases where both integrated and discrete GPU's co-exist. +Refer to [Intel GPU device naming convention](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html#device-naming-convention) for specifying the correct hardware target in cases where both integrated and discrete GPU's co-exist. | Hardware Option | Target Device | | --------------- | ------------------------| @@ -390,37 +391,20 @@ Refer to [Intel GPU device naming convention](https://docs.openvino.ai/2024/open | GPU.0 | Intel® Integrated Graphics | | GPU.1 | Intel® Discrete Graphics | | NPU | Intel® Neural Processor Unit | -| HETERO:DEVICE_TYPE_1,DEVICE_TYPE_2,DEVICE_TYPE_3... | All Intel® silicons mentioned above | -| MULTI:DEVICE_TYPE_1,DEVICE_TYPE_2,DEVICE_TYPE_3... | All Intel® silicons mentioned above | -| AUTO:DEVICE_TYPE_1,DEVICE_TYPE_2,DEVICE_TYPE_3... | All Intel® silicons mentioned above | - -Specifying Hardware Target for HETERO or Multi or AUTO device Build: - -HETERO:DEVICE_TYPE_1,DEVICE_TYPE_2,DEVICE_TYPE_3... -The DEVICE_TYPE can be any of these devices from this list ['CPU','GPU', 'NPU'] -A minimum of two device's should be specified for a valid HETERO or MULTI or AUTO device build. - -``` -Example's: HETERO:GPU,CPU or AUTO:GPU,CPU or MULTI:GPU,CPU -``` #### Disable subgraph partition Feature -* Builds the OpenVINO™ Execution Provider in ONNX Runtime with sub graph partitioning disabled. - -* With this option enabled. Fully supported models run on OpenVINO Execution Provider else they completely fall back to default CPU EP. 
+* Builds the OpenVINO™ Execution Provider in ONNX Runtime with graph partitioning disabled. With this option, fully supported models run entirely on the OpenVINO™ Execution Provider; otherwise they fall back completely to the default CPU EP.
 
 * To enable this feature during build time. Use `--use_openvino ` `_NO_PARTITION`
 
 ```
-Usage: --use_openvino CPU_FP32_NO_PARTITION or --use_openvino GPU_FP32_NO_PARTITION or
-       --use_openvino GPU_FP16_NO_PARTITION
+Usage: --use_openvino CPU_NO_PARTITION or --use_openvino GPU_NO_PARTITION or --use_openvino NPU_NO_PARTITION
 ```
 
-For more information on OpenVINO™ Execution Provider's ONNX Layer support, Topology support, and Intel hardware enabled, please refer to the document [OpenVINO™-ExecutionProvider](../execution-providers/OpenVINO-ExecutionProvider.md)
+For more information on OpenVINO™ Execution Provider's ONNX layer support, topology support, and supported Intel hardware, please refer to the document [OpenVINO™-ExecutionProvider](../execution-providers/OpenVINO-ExecutionProvider.md#support-coverage)
 
 ---
-
 
 ## QNN
 
 See more information on the QNN execution provider [here](../execution-providers/QNN-ExecutionProvider.md).
 
@@ -895,4 +879,4 @@ build.bat --config --build_shared_lib --build_whe
 ```bash
 ./build.sh --config --build_shared_lib --build_wheel --use_azure
-```
\ No newline at end of file
+```
diff --git a/docs/execution-providers/OpenVINO-ExecutionProvider.md b/docs/execution-providers/OpenVINO-ExecutionProvider.md
index efbb2ebca5577..04b37aa2c516d 100644
--- a/docs/execution-providers/OpenVINO-ExecutionProvider.md
+++ b/docs/execution-providers/OpenVINO-ExecutionProvider.md
@@ -19,258 +19,463 @@ Accelerate ONNX models on Intel CPUs, GPUs, NPU with Intel OpenVINO™ Execution
 
 ## Install
 
-Pre-built packages and Docker images are published for OpenVINO™ Execution Provider for ONNX Runtime by Intel for each release.
-* OpenVINO™ Execution Provider for ONNX Runtime Release page: [Latest v5.6 Release](https://github.com/intel/onnxruntime/releases)
+Intel publishes pre-built OpenVINO™ Execution Provider packages for ONNX Runtime with each release.
+* OpenVINO™ Execution Provider for ONNX Runtime Release page: [Latest v5.8 Release](https://github.com/intel/onnxruntime/releases)
 * Python wheels Ubuntu/Windows: [onnxruntime-openvino](https://pypi.org/project/onnxruntime-openvino/)
-* Docker image: [openvino/onnxruntime_ep_ubuntu20](https://hub.docker.com/r/openvino/onnxruntime_ep_ubuntu20)
 
 ## Requirements
-ONNX Runtime OpenVINO™ Execution Provider is compatible with three lastest releases of OpenVINO™.
+
+ONNX Runtime OpenVINO™ Execution Provider is compatible with the three latest releases of OpenVINO™.
 
 |ONNX Runtime|OpenVINO™|Notes|
 |---|---|---|
+|1.23.0|2025.3|[Details](https://github.com/intel/onnxruntime/releases/tag/v5.8)|
+|1.22.0|2025.1|[Details](https://github.com/intel/onnxruntime/releases/tag/v5.7)|
 |1.21.0|2025.0|[Details](https://github.com/intel/onnxruntime/releases/tag/v5.6)|
-|1.20.0|2024.4|[Details](https://github.com/intel/onnxruntime/releases/tag/v5.5)|
-|1.19.0|2024.3|[Details](https://github.com/intel/onnxruntime/releases/tag/v5.4)|
-|1.18.0|2024.1|[Details](https://github.com/intel/onnxruntime/releases/tag/v5.3)|
-|1.17.1|2023.3|[Details](https://github.com/intel/onnxruntime/releases/tag/v5.2)|
 
 ## Build
 
-For build instructions, please see the [BUILD page](../build/eps.md#openvino).
+For build instructions, refer to the [BUILD page](../build/eps.md#openvino).
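+
+After installing or building the package, a quick sanity check (a minimal sketch, assuming the onnxruntime-openvino wheel is installed and the OpenVINO™ environment is initialized) is to confirm that the provider is registered:
+
+```python
+import onnxruntime as ort
+
+# The OpenVINO™ EP should appear in the list of available providers
+print(ort.get_available_providers())
+# Expected to include: 'OpenVINOExecutionProvider'
+```
+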
 ## Usage
 
-**Set OpenVINO™ Environment for Python**
+**Python Package Installation**
 
-Please download onnxruntime-openvino python packages from PyPi.org:
+For Python users, install the onnxruntime-openvino package:
 ```
 pip install onnxruntime-openvino
 ```
 
+**Set OpenVINO™ Environment Variables**
+
+To use the OpenVINO™ Execution Provider with any programming language (Python, C++, C#), you must set up the OpenVINO™ environment variables using the full installer package of OpenVINO™.
+
 * **Windows**
+```
+C:\ \setupvars.bat
+```
+* **Linux**
+```
+$ source /setupvars.sh
+```
-   To enable OpenVINO™ Execution Provider with ONNX Runtime on Windows it is must to set up the OpenVINO™ Environment Variables using the full installer package of OpenVINO™.
-   Initialize the OpenVINO™ environment by running the setupvars script as shown below. This is a required step:
-   ```
-   C:\ \setupvars.bat
-   ```
-* **Linux**
 
+**Set OpenVINO™ Environment for C#**
 
-   OpenVINO™ Execution Provider with Onnx Runtime on Linux, installed from PyPi.org comes with prebuilt OpenVINO™ libs and supports flag CXX11_ABI=0. So there is no need to install OpenVINO™ separately.
+To use the C# API with the OpenVINO™ Execution Provider, create a custom NuGet package. Follow the instructions [here](../build/inferencing.md#build-nuget-packages) to install the prerequisites for NuGet creation. Once the prerequisites are installed, follow the instructions to [build the OpenVINO™ Execution Provider](../build/eps.md#openvino) and add the extra flag `--build_nuget` to create the NuGet packages. Two NuGet packages will be created: Microsoft.ML.OnnxRuntime.Managed and Intel.ML.OnnxRuntime.Openvino.
 
-   But if there is need to enable CX11_ABI=1 flag of OpenVINO, build Onnx Runtime python wheel packages from source. For build instructions, please see the [BUILD page](../build/eps.md#openvino).
 
-   OpenVINO™ Execution Provider wheels on Linux built from source will not have prebuilt OpenVINO™ libs so we must set the OpenVINO™ Environment Variable using the full installer package of OpenVINO™:
-   ```
-   $ source /setupvars.sh
-   ```
+## Configuration Options
 
-**Set OpenVINO™ Environment for C++**
+These are runtime parameters, set during OpenVINO™ Execution Provider initialization, that control the inference flow.
 
-For Running C++/C# ORT Samples with the OpenVINO™ Execution Provider it is must to set up the OpenVINO™ Environment Variables using the full installer package of OpenVINO™.
-Initialize the OpenVINO™ environment by running the setupvars script as shown below. This is a required step:
-   * For Windows run:
-   ```
-   C:\ \setupvars.bat
-   ```
-   * For Linux run:
-   ```
-   $ source /setupvars.sh
-   ```
-   **Note:** If you are using a dockerfile to use OpenVINO™ Execution Provider, sourcing OpenVINO™ won't be possible within the dockerfile. You would have to explicitly set the LD_LIBRARY_PATH to point to OpenVINO™ libraries location. Refer our [dockerfile](https://github.com/microsoft/onnxruntime/blob/main/dockerfiles/Dockerfile.openvino).
+| **Key** | **Type** | **Allowable Values** | **Value Type** | **Description** | +|---------|----------|---------------------|----------------|-----------------| +| [**device_type**](#device_type) | string | CPU, NPU, GPU, GPU.0, GPU.1, HETERO, MULTI, AUTO | string | Specify intel target H/W device | +| [**precision**](#precision) | string | FP32, FP16, ACCURACY | string | Set inference precision level | +| [**num_of_threads**](#num_of_threads--num_streams) | string | Any positive integer > 0 | size_t | Control number of inference threads | +| [**num_streams**](#num_of_threads--num_streams) | string | Any positive integer > 0 | size_t | Set parallel execution streams for throughput | +| [**cache_dir**](#cache_dir) | string | Valid filesystem path | string | Enable openvino model caching for improved latency | +| [**load_config**](#load_config) | string | JSON file path | string | Load and set custom/HW specific OpenVINO properties from JSON | +| [**enable_qdq_optimizer**](#enable_qdq_optimizer) | string | True/False | boolean | Enable QDQ optimization for NPU | +| [**disable_dynamic_shapes**](#disable_dynamic_shapes) | string | True/False | boolean | Convert dynamic models to static shapes | +| [**reshape_input**](#reshape_input) | string | input_name[shape_bounds] | string | Specify upper and lower bound for dynamic shaped inputs for improved performance with NPU | +| [**layout**](#layout) | string | input_name[layout_format] | string | Specify input/output tensor layout format | -**Set OpenVINO™ Environment for C#** +**Deprecation Notice** -To use csharp api for openvino execution provider create a custom nuget package. Follow the instructions [here](../build/inferencing.md#build-nuget-packages) to install prerequisites for nuget creation. Once prerequisites are installed follow the instructions to [build openvino execution provider](../build/eps.md#openvino) and add an extra flag `--build_nuget` to create nuget packages. Two nuget packages will be created Microsoft.ML.OnnxRuntime.Managed and Microsoft.ML.OnnxRuntime.Openvino. +The following provider options are **deprecated** and should be migrated to `load_config` for better compatibility with future releases. -## Features +| Deprecated Provider Option | `load_config` Equivalent | Recommended Migration | +|---------------------------|------------------------|----------------------| +| `precision="FP16"` | `INFERENCE_PRECISION_HINT` | `{"GPU": {"INFERENCE_PRECISION_HINT": "f16"}}` | +| `precision="FP32"` | `INFERENCE_PRECISION_HINT` | `{"GPU": {"INFERENCE_PRECISION_HINT": "f32"}}` | +| `precision="ACCURACY"` | `EXECUTION_MODE_HINT` | `{"GPU": {"EXECUTION_MODE_HINT": "ACCURACY"}}` | +| `num_of_threads=8` | `INFERENCE_NUM_THREADS` | `{"CPU": {"INFERENCE_NUM_THREADS": "8"}}` | +| `num_streams=4` | `NUM_STREAMS` | `{"GPU": {"NUM_STREAMS": "4"}}` | -### OpenCL queue throttling for GPU devices +Refer to [Examples](#examples) for usage. -Enables [OpenCL queue throttling](https://docs.openvino.ai/2024/api/c_cpp_api/group__ov__runtime__ocl__gpu__prop__cpp__api.html) for GPU devices. Reduces CPU utilization when using GPUs with OpenVINO EP. +## Configuration Descriptions -### Model caching +### `device_type` -OpenVINO™ supports [model caching](https://docs.openvino.ai/2024/openvino-workflow/running-inference/optimize-inference/optimizing-latency/model-caching-overview.html). +Specify the target hardware device for compilation and inference execution. 
The OpenVINO Execution Provider supports the following devices for deep learning model execution: **CPU**, **GPU**, and **NPU**. Configuration supports both single device and multi-device setups, enabling: +- Automatic device selection +- Heterogeneous inference across devices +- Multi-device parallel execution -Model caching feature is supported on CPU, NPU, GPU along with kernel caching on iGPU, dGPU. +**Supported Devices:** -This feature enables users to save and load the blob file directly on to the hardware device target and perform inference with improved Inference Latency. +- `CPU` — Intel CPU +- `GPU` — Intel integrated GPU or discrete GPU +- `GPU.0`, `GPU.1` — Specific GPU when multiple GPUs are available +- `NPU` — Intel Neural Processing Unit -Kernel Caching on iGPU and dGPU: +**Multi-Device Configurations:** -This feature also allows user to save kernel caching as cl_cache files for models with dynamic input shapes. These cl_cache files can be loaded directly onto the iGPU/dGPU hardware device target and inferencing can be performed. +OpenVINO offers the option of running inference with the following inference modes: -#### Enabling Model Caching via Runtime options using C++/python API's. +- `AUTO:,...` — Automatic Device Selection +- `HETERO:,...` — Heterogeneous Inference +- `MULTI:,...` — Multi-Device Execution -This flow can be enabled by setting the runtime config option 'cache_dir' specifying the path to dump and load the blobs (CPU, NPU, iGPU, dGPU) or cl_cache(iGPU, dGPU) while using the C++/python API'S. +Minimum **two devices** required for multi-device configurations. -Refer to [Configuration Options](#configuration-options) for more information about using these runtime options. +**Examples:** +- `AUTO:GPU,NPU,CPU` +- `HETERO:GPU,CPU` +- `MULTI:GPU,CPU` -### Support for INT8 Quantized models +**Automatic Device Selection** -Int8 models are supported on CPU, GPU and NPU. +Automatically selects the best device available for the given task. It offers many additional options and optimizations, including inference on multiple devices at the same time. AUTO internally recognizes CPU, integrated GPU, discrete Intel GPUs, and NPU, then assigns inference requests to the best-suited device. -### Support for Weights saved in external files +**Heterogeneous Inference** -OpenVINO™ Execution Provider now supports ONNX models that store weights in external files. It is especially useful for models larger than 2GB because of protobuf limitations. +Enables splitting inference among several devices automatically. If one device doesn't support certain operations, HETERO distributes the workload across multiple devices, utilizing accelerator power for heavy operations while falling back to CPU for unsupported layers. -See the [OpenVINO™ ONNX Support documentation](https://docs.openvino.ai/2024/openvino-workflow/model-preparation/convert-model-onnx.html). +**Multi-Device Execution** -Converting and Saving an ONNX Model to External Data: -Use the ONNX API's.[documentation](https://github.com/onnx/onnx/blob/master/docs/ExternalData.md#converting-and-saving-an-onnx-model-to-external-data). +Runs the same model on multiple devices in parallel to improve device utilization. MULTI automatically groups inference requests to improve throughput and performance consistency via load distribution. -Example: +**Note:** Deprecated options `CPU_FP32`, `GPU_FP32`, `GPU_FP16`, `NPU_FP16` are no longer supported. Use `device_type` and `precision` separately. 
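+
+A minimal sketch of selecting a device at session creation (assuming a model file `model.onnx` is available; `HETERO:GPU,CPU` is one of the multi-device combinations listed above):
+
+```python
+import onnxruntime as ort
+
+# Split the graph: run supported operators on the GPU, fall back to the CPU for the rest
+options = {"device_type": "HETERO:GPU,CPU"}
+session = ort.InferenceSession("model.onnx",
+                               providers=[("OpenVINOExecutionProvider", options)])
+```
+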
-```python -import onnx -onnx_model = onnx.load("model.onnx") # Your model in memory as ModelProto -onnx.save_model(onnx_model, 'saved_model.onnx', save_as_external_data=True, all_tensors_to_one_file=True, location='data/weights_data', size_threshold=1024, convert_attribute=False) -``` +--- -Note: -1. In the above script, model.onnx is loaded and then gets saved into a file called 'saved_model.onnx' which won't have the weights but this new onnx model now will have the relative path to where the weights file is located. The weights file 'weights_data' will now contain the weights of the model and the weights from the original model gets saved at /data/weights_data. - -2. Now, you can use this 'saved_model.onnx' file to infer using your sample. But remember, the weights file location can't be changed. The weights have to be present at /data/weights_data - -3. Install the latest ONNX Python package using pip to run these ONNX Python API's successfully. - -### Support for IO Buffer Optimization - -To enable IO Buffer Optimization we have to set OPENCL_LIBS, OPENCL_INCS environment variables before build. For IO Buffer Optimization, the model must be fully supported on OpenVINO™ and we must provide in the remote context cl_context void pointer as C++ Configuration Option. We can provide cl::Buffer address as Input using GPU Memory Allocator for input and output. - -Example: -```bash -//Set up a remote context -cl::Context _context; -..... -// Set the context through openvino options -std::unordered_map ov_options; -ov_options[context] = std::to_string((unsigned long long)(void *) _context.get()); -..... -//Define the Memory area -Ort::MemoryInfo info_gpu("OpenVINO_GPU", OrtAllocatorType::OrtDeviceAllocator, 0, OrtMemTypeDefault); -//Create a shared buffer , fill in with data -cl::Buffer shared_buffer(_context, CL_MEM_READ_WRITE, imgSize, NULL, &err); -.... -//Cast it to void*, and wrap it as device pointer for Ort::Value -void *shared_buffer_void = static_cast(&shared_buffer); -Ort::Value inputTensors = Ort::Value::CreateTensor( - info_gpu, shared_buffer_void, imgSize, inputDims.data(), - inputDims.size(), ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT); -``` +### `precision` +**DEPRECATED:** This option is deprecated and can be set via `load_config` using the `INFERENCE_PRECISION_HINT` property. +- Controls numerical precision during inference, balancing **performance** and **accuracy**. -### Multi-threading for OpenVINO™ Execution Provider +**Precision Support on Devices:** -OpenVINO™ Execution Provider for ONNX Runtime enables thread-safe deep learning inference +- **CPU:** `FP32` +- **GPU:** `FP32`, `FP16`, `ACCURACY` +- **NPU:** `FP16` -### Multi streams for OpenVINO™ Execution Provider -OpenVINO™ Execution Provider for ONNX Runtime allows multiple stream execution for difference performance requirements part of API 2.0 +**ACCURACY Mode** -### Auto-Device Execution for OpenVINO™ Execution Provider +- Maintains original model precision without conversion, ensuring maximum accuracy. -Use `AUTO:,..` as the device name to delegate selection of an actual accelerator to OpenVINO™. Auto-device internally recognizes and selects devices from CPU, integrated GPU, discrete Intel GPUs (when available) and NPU (when available) depending on the device capabilities and the characteristic of ONNX models, for example, precisions. Then Auto-device assigns inference requests to the selected device. +**Note 1:** `FP16` generally provides ~2x better performance on GPU/NPU with minimal accuracy loss. 
-From the application point of view, this is just another device that handles all accelerators in full system. -For more information on Auto-Device plugin of OpenVINO™, please refer to the -[Intel OpenVINO™ Auto Device Plugin](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html#automatic-device-selection). -### Heterogeneous Execution for OpenVINO™ Execution Provider +--- +### `num_of_threads` & `num_streams` -The heterogeneous execution enables computing for inference on one network on several devices. Purposes to execute networks in heterogeneous mode: +**DEPRECATED:** These options are deprecated and can be set via `load_config` using the `INFERENCE_NUM_THREADS` and `NUM_STREAMS` properties respectively. -* To utilize accelerator's power and calculate the heaviest parts of the network on the accelerator and execute unsupported layers on fallback devices like the CPU to utilize all available hardware more efficiently during one inference. +**Multi-Threading** -For more information on Heterogeneous plugin of OpenVINO™, please refer to the -[Intel OpenVINO™ Heterogeneous Plugin](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.html). +- Controls the number of inference threads for CPU execution (default: `8`). OpenVINO EP provides thread-safe inference across all devices. -### Multi-Device Execution for OpenVINO™ Execution Provider +**Multi-Stream Execution** -Multi-Device plugin automatically assigns inference requests to available computational devices to execute the requests in parallel. Potential gains are as follows: +Manages parallel inference streams for throughput optimization (default: `1` for latency-focused execution). -* Improved throughput that multiple devices can deliver (compared to single-device execution) -* More consistent performance, since the devices can now share the inference burden (so that if one device is becoming too busy, another device can take more of the load) +- **Multiple streams:** Higher throughput for batch workloads +- **Single stream:** Lower latency for real-time applications -For more information on Multi-Device plugin of OpenVINO™, please refer to the -[Intel OpenVINO™ Multi Device Plugin](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html#multi-stream-execution). -### Export OpenVINO Compiled Blob -Export the OpenVINO compiled blob as an ONNX model. Using this ONNX model for subsequent inferences avoids model recompilation and could have a positive impact on Session creation time. This feature is currently enabled for fully supported models only. It complies with the ORT session config keys -``` - Ort::SessionOptions session_options; +--- - // Enable EP context feature to dump the partitioned graph which includes the EP context into Onnx file. - // "0": disable. (default) - // "1": enable. +### `cache_dir` - session_options.AddConfigEntry(kOrtSessionOptionEpContextEnable, "1"); +**DEPRECATED:** This option is deprecated and can be set via `load_config` using the `CACHE_DIR` property. - // Flag to specify whether to dump the EP context into single Onnx model or pass bin path. - // "0": dump the EP context into separate file, keep the file name in the Onnx model. - // "1": dump the EP context into the Onnx model. (default). +Enables model caching to significantly reduce subsequent load times. Supports CPU, NPU, and GPU devices with kernel caching on iGPU/dGPU. 
- session_options.AddConfigEntry(kOrtSessionOptionEpContextEmbedMode, "1"); +**Benefits** +- Saves compiled models and `cl_cache` files for dynamic shapes +- Eliminates recompilation overhead on subsequent runs +- Particularly useful for complex models and frequent application restarts - // Specify the file path for the Onnx model which has EP context. - // Defaults to /original_file_name_ctx.onnx if not specified - session_options.AddConfigEntry(kOrtSessionOptionEpContextFilePath, ".\ov_compiled_epctx.onnx"); +--- - sess = onnxruntime.InferenceSession(, session_options) -``` -Refer to [Session Options](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h) for more information about session options. +### `load_config` + +**Recommended Configuration Method** for setting OpenVINO runtime properties. Provides direct access to OpenVINO properties through a JSON configuration file during runtime. + +#### Overview -### Enable QDQ Optimizations Passes -Optimizes ORT quantized models for the NPU device to only keep QDQs for supported ops and optimize for performance and accuracy.Generally this feature will give better performance/accuracy with ORT Optimizations disabled. -Refer to [Configuration Options](#configuration-options) for more information about using these runtime options. +`load_config` enables fine-grained control over OpenVINO inference behavior by loading properties from a JSON file. This is the **preferred method** for configuring advanced OpenVINO features, offering: -### Loading Custom JSON OpenVINO™ Config During Runtime -The `load_config` feature is developed to facilitate loading of OpenVINO EP parameters using a JSON input schema, which mandatorily follows below format - +- Direct access to OpenVINO runtime properties +- Device-specific configuration +- Better compatibility with future OpenVINO releases +- No property name translation required + +#### JSON Configuration Format +```json +{ + "DEVICE_NAME": { + "PROPERTY_KEY": "value" + } +} ``` + +**Supported Device Names:** +- `"CPU"` - Intel CPU +- `"GPU"` - Intel integrated/discrete GPU +- `"NPU"` - Intel Neural Processing Unit +- `"AUTO"` - Automatic device selection + + +#### Popular OpenVINO Properties + +The following properties are commonly used for optimizing inference performance. For complete property definitions and all possible values, refer to the [OpenVINO properties](https://github.com/openvinotoolkit/openvino/blob/master/src/inference/include/openvino/runtime/properties.hpp) header file. +##### Performance & Execution Hints + +| Property | Valid Values | Description | +|----------|-------------|-------------| +| `PERFORMANCE_HINT` | `"LATENCY"`, `"THROUGHPUT"` | High-level performance optimization goal | +| `EXECUTION_MODE_HINT` | `"ACCURACY"`, `"PERFORMANCE"` | Accuracy vs performance trade-off | +| `INFERENCE_PRECISION_HINT` | `"f32"`, `"f16"`, `"bf16"` | Explicit inference precision | + + +**PERFORMANCE_HINT:** +- `"LATENCY"`: Optimizes for low latency +- `"THROUGHPUT"`: Optimizes for high throughput + +**EXECUTION_MODE_HINT:** +- `"ACCURACY"`: Maintains model precision, dynamic precision selection +- `"PERFORMANCE"`: Optimizes for speed, may use lower precision + +**INFERENCE_PRECISION_HINT:** +- `"f16"`: FP16 precision +- `"f32"`: FP32 precision - highest accuracy +- `"bf16"`: BF16 precision - balance between f16 and f32 + +**Important:** Use either `EXECUTION_MODE_HINT` OR `INFERENCE_PRECISION_HINT`, not both. 
These properties control similar behavior and should not be combined. + +**Note:** CPU accepts `"f16"` hint in configuration but will upscale to FP32 during execution, as CPU only supports FP32 precision natively. + + +##### Threading & Streams + +| Property | Valid Values | Description | +|----------|-------------|-------------| +| `NUM_STREAMS` | Positive integer (e.g., `"1"`, `"4"`, `"8"`) | Number of parallel execution streams | +| `INFERENCE_NUM_THREADS` | Integer | Maximum number of inference threads | +| `COMPILATION_NUM_THREADS` | Integer | Maximum number of compilation threads | + +**NUM_STREAMS:** +- Controls parallel execution streams for throughput optimization +- Higher values increase throughput for batch processing +- Lower values optimize latency for real-time inference + +**INFERENCE_NUM_THREADS:** +- Controls CPU thread count for inference execution +- Explicit value: Fixed thread count (e.g., `"4"` limits to 4 threads) + +##### Caching Properties + +| Property | Valid Values | Description | +|----------|-------------|-------------| +| `CACHE_DIR` | File path string | Model cache directory | +| `CACHE_MODE` | `"OPTIMIZE_SIZE"`, `"OPTIMIZE_SPEED"` | Cache optimization strategy | + +**CACHE_MODE:** +- `"OPTIMIZE_SPEED"`: Faster cache creation, larger cache files +- `"OPTIMIZE_SIZE"`: Slower cache creation, smaller cache files +##### Logging Properties + +| Property | Valid Values | Description | +|----------|-------------|-------------| +| `LOG_LEVEL` | `"LOG_NONE"`, `"LOG_ERROR"`, `"LOG_WARNING"`, `"LOG_INFO"`, `"LOG_DEBUG"`, `"LOG_TRACE"` | Logging verbosity level | + +**Note:** `LOG_LEVEL` is not supported on GPU devices. + +##### AUTO Device Properties + +| Property | Valid Values | Description | +|----------|-------------|-------------| +| `ENABLE_STARTUP_FALLBACK` | `"YES"`, `"NO"` | Enable device fallback during model loading | +| `ENABLE_RUNTIME_FALLBACK` | `"YES"`, `"NO"` | Enable device fallback during inference runtime | +| `DEVICE_PROPERTIES` | Nested JSON string | Device-specific property configuration | + +**DEVICE_PROPERTIES Syntax:** + +Used to configure properties for individual devices when using AUTO mode. +```json { - "DEVICE_KEY": {"PROPERTY": "PROPERTY_VALUE"} + "AUTO": { + "DEVICE_PROPERTIES": "{CPU:{PROPERTY:value},GPU:{PROPERTY:value}}" + } } ``` -where "DEVICE_KEY" can be CPU, NPU or GPU , "PROPERTY" must be a valid entity defined in [OpenVINO™ supported properties](https://github.com/openvinotoolkit/openvino/blob/releases/2025/1/src/inference/include/openvino/runtime/properties.hpp) & "PROPERTY_VALUE" must be a valid corresponding supported property value passed in as a string. -If a property is set using an invalid key (i.e., a key that is not recognized as part of the `OpenVINO™ supported properties`), it will be ignored & a warning will be logged against the same. However, if a valid property key is used but assigned an invalid value (e.g., a non-integer where an integer is expected), the OpenVINO™ framework will result in an exception during execution. +#### Property Reference Documentation + +For complete property definitions and advanced options, refer to the official OpenVINO properties header: + +**[OpenVINO Runtime Properties](https://github.com/openvinotoolkit/openvino/blob/master/src/inference/include/openvino/runtime/properties.hpp)** + +Property keys used in `load_config` JSON must match the string literal defined in the properties header file. 
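+
+As a minimal sketch (assuming an NPU target, a placeholder `model.onnx`, and a writable `./npu_cache` directory), the same pattern shown in the [Examples](#examples) section can combine several of the properties above in one JSON file:
+
+```python
+import json
+import onnxruntime as ort
+
+# Cache compiled models and reduce log noise on the NPU
+npu_config = {
+    "NPU": {
+        "CACHE_DIR": "./npu_cache",
+        "PERFORMANCE_HINT": "LATENCY",
+        "LOG_LEVEL": "LOG_WARNING"
+    }
+}
+with open("npu_config.json", "w") as f:
+    json.dump(npu_config, f)
+
+options = {"device_type": "NPU", "load_config": "npu_config.json"}
+session = ort.InferenceSession("model.onnx",
+                               providers=[("OpenVINOExecutionProvider", options)])
+```
+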
+ + + + + +--- + + +### `enable_qdq_optimizer` + +**DEPRECATED:** This option is deprecated and can be set via `load_config` using the `NPU_QDQ_OPTIMIZATION` property. + +NPU-specific optimization for Quantize-Dequantize (QDQ) operations in the inference graph. This optimizer enhances ORT quantized models by: + +- Retaining QDQ operations only for supported operators +- Improving inference performance on NPU devices +- Maintaining model accuracy while optimizing execution + +--- + +### `disable_dynamic_shapes` + +**Dynamic Shape Management** + +- Handles models with variable input dimensions. +- Provides the option to convert dynamic shapes to static shapes when beneficial for performance optimization. + +--- + +### `reshape_input` + +**NPU Shape Bounds Configuration** + +- Use `reshape_input` to explicitly set dynamic shape bounds for NPU devices. + +**Format:** +- Range bounds: `input_name[lower..upper]` +- Fixed shape: `input_name[fixed_shape]` + +This configuration is required for optimal NPU memory allocation and management. + +--- + +### `model_priority` + +**DEPRECATED:** This option is deprecated and can be set via `load_config` using the `MODEL_PRIORITY` property. + +Configures resource allocation priority for multi-model deployment scenarios. + +**Priority Levels:** + +| Level | Description | +|-------|-------------| +| **HIGH** | Maximum resource allocation for critical models | +| **MEDIUM** | Balanced resource sharing across models | +| **LOW** | Minimal allocation, yields resources to higher priority models | +| **DEFAULT** | System-determined priority based on workload | + + +--- + +### `layout` + +- Provides explicit control over tensor memory layout for performance optimization. +- Helps OpenVINO optimize memory access patterns and tensor operations. + +**Layout Characters:** + +- **N:** Batch dimension +- **C:** Channel dimension +- **H:** Height dimension +- **W:** Width dimension +- **D:** Depth dimension +- **T:** Time dimension +- **?:** Unknown/dynamic dimension + +**Format:** + +`input_name[LAYOUT],output_name[LAYOUT]` + +**Example:** -The valid properties are of two types viz. Mutable (Read/Write) & Immutable (Read only) these are also governed while setting the same. If an Immutable property is being set, we skip setting the same with a similar warning. +`input_image[NCHW],output_tensor[NC]` -For setting appropriate `"PROPERTY"`, refer to OpenVINO config options for [CPU](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/cpu-device.html#supported-properties), [GPU](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html#supported-properties), [NPU](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/npu-device.html#supported-features-and-properties) and [AUTO](https://docs.openvino.ai/2025/openvino-workflow/running-inference/inference-devices-and-modes/auto-device-selection.html#using-auto). 
+--- + +## Examples + +### Python -Example: +#### Using load_config with JSON file +```python +import onnxruntime as ort +import json + +# Create config file +config = { + "AUTO": { + "PERFORMANCE_HINT": "THROUGHPUT", + "PERF_COUNT": "NO", + "DEVICE_PROPERTIES": "{CPU:{INFERENCE_PRECISION_HINT:f32,NUM_STREAMS:3},GPU:{INFERENCE_PRECISION_HINT:f32,NUM_STREAMS:5}}" + } +} -The usage of this functionality using onnxruntime_perf_test application is as below – +with open("ov_config.json", "w") as f: + json.dump(config, f) +# Use config with session +options = {"device_type": "AUTO", "load_config": "ov_config.json"} +session = ort.InferenceSession("model.onnx", + providers=[("OpenVINOExecutionProvider", options)]) ``` -onnxruntime_perf_test.exe -e openvino -m times -r 1 -i "device_type|NPU load_config|test_config.json" model.onnx -``` -### OpenVINO Execution Provider Supports EP-Weight Sharing across sessions -The OpenVINO Execution Provider (OVEP) in ONNX Runtime supports EP-Weight Sharing, enabling models to efficiently share weights across multiple inference sessions. This feature enhances the execution of Large Language Models (LLMs) with prefill and KV cache, reducing memory consumption and improving performance when running multiple inferences. +#### Using load_config for CPU +```python +import onnxruntime as ort +import json + +# Create CPU config +config = { + "CPU": { + "INFERENCE_PRECISION_HINT": "f32", + "NUM_STREAMS": "3", + "INFERENCE_NUM_THREADS": "8" + } +} -With EP-Weight Sharing, prefill and KV cache models can now reuse the same set of weights, minimizing redundancy and optimizing inference. Additionally, this ensures that EP Context nodes are still created even when the model undergoes subgraph partitioning. +with open("cpu_config.json", "w") as f: + json.dump(config, f) -These changes enable weight sharing between two models using the session context option: ep.share_ep_contexts. -Refer to [Session Options](https://github.com/microsoft/onnxruntime/blob/5068ab9b190c549b546241aa7ffbe5007868f595/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h#L319) for more details on configuring this runtime option. +options = {"device_type": "CPU", "load_config": "cpu_config.json"} +session = ort.InferenceSession("model.onnx", + providers=[("OpenVINOExecutionProvider", options)]) +``` -### OVEP supports CreateSessionFromArray API -The OpenVINO Execution Provider (OVEP) in ONNX Runtime supports creating sessions from memory using the CreateSessionFromArray API. This allows loading models directly from memory buffers instead of file paths. The CreateSessionFromArray loads the model in memory then creates a session from the in-memory byte array. - -Note: -Use the -l argument when running the inference with perf_test using CreateSessionFromArray API. +#### Using load_config for GPU +```python +import onnxruntime as ort +import json + +# Create GPU config with caching +config = { + "GPU": { + "INFERENCE_PRECISION_HINT": "f16", + "CACHE_DIR": "./model_cache", + "PERFORMANCE_HINT": "LATENCY" + } +} -## Configuration Options +with open("gpu_config.json", "w") as f: + json.dump(config, f) + +options = {"device_type": "GPU", "load_config": "gpu_config.json"} +session = ort.InferenceSession("model.onnx", + providers=[("OpenVINOExecutionProvider", options)]) +``` -OpenVINO™ Execution Provider can be configured with certain options at runtime that control the behavior of the EP. 
These options can be set as key-value pairs as below:- +--- ### Python API Key-Value pairs for config options can be set using InferenceSession API as follow:- @@ -279,6 +484,7 @@ session = onnxruntime.InferenceSession(, providers=['OpenVIN ``` *Note that the releases from (ORT 1.10) will require explicitly setting the providers parameter if you want to use execution providers other than the default CPU provider (as opposed to the current behavior of providers getting set/registered by default based on the build flags) when instantiating InferenceSession.* +--- ### C/C++ API 2.0 The session configuration options are passed to SessionOptionsAppendExecutionProvider API as shown in an example below for GPU device type: @@ -294,7 +500,7 @@ options[enable_qdq_optimizer] = "True"; options[load_config] = "config_path.json"; session_options.AppendExecutionProvider_OpenVINO_V2(options); ``` - +--- ### C/C++ Legacy API Note: This API is no longer officially supported. Users are requested to move to V2 API. @@ -302,13 +508,14 @@ The session configuration options are passed to SessionOptionsAppendExecutionPro ``` OrtOpenVINOProviderOptions options; -options.device_type = "GPU_FP32"; +options.device_type = "CPU"; options.num_of_threads = 8; options.cache_dir = ""; options.context = 0x123456ff; options.enable_opencl_throttling = false; SessionOptions.AppendExecutionProvider_OpenVINO(session_options, &options); ``` +--- ### Onnxruntime Graph level Optimization OpenVINO™ backend performs hardware, dependent as well as independent optimizations on the graph to infer it on the target hardware with best possible performance. In most cases it has been observed that passing the ONNX input graph as it is without explicit optimizations would lead to best possible optimizations at kernel level by OpenVINO™. For this reason, it is advised to turn off high level optimizations performed by ONNX Runtime for OpenVINO™ Execution Provider. This can be done using SessionOptions() as shown below:- @@ -324,38 +531,7 @@ OpenVINO™ backend performs hardware, dependent as well as independent optimiza ``` SessionOptions::SetGraphOptimizationLevel(ORT_DISABLE_ALL); ``` - -## Summary of options - -The following table lists all the available configuration options for API 2.0 and the Key-Value pairs to set them: - -| **Key** | **Key type** | **Allowable Values** | **Value type** | **Description** | -| --- | --- | --- | --- | --- | -| device_type | string | CPU, NPU, GPU, GPU.0, GPU.1 based on the available GPUs, NPU, Any valid Hetero combination, Any valid Multi or Auto devices combination | string | Overrides the accelerator hardware type with these values at runtime. If this option is not explicitly set, default hardware specified during build is used. | -| precision | string | FP32, FP16, ACCURACY based on the device_type chosen | string | Supported precisions for HW {CPU:FP32, GPU:[FP32, FP16, ACCURACY], NPU:FP16}. Default precision for HW for optimized performance {CPU:FP32, GPU:FP16, NPU:FP16}. To execute model with the default input precision, select ACCURACY precision type. | -| num_of_threads | string | Any unsigned positive number other than 0 | size_t | Overrides the accelerator default value of number of threads with this value at runtime. If this option is not explicitly set, default value of 8 during build time will be used for inference. | -| num_streams | string | Any unsigned positive number other than 0 | size_t | Overrides the accelerator default streams with this value at runtime. 
If this option is not explicitly set, default value of 1, performance for latency is used during build time will be used for inference. | -| cache_dir | string | Any valid string path on the hardware target | string | Explicitly specify the path to save and load the blobs enabling model caching feature.| -| context | string | OpenCL Context | void* | This option is only available when OpenVINO EP is built with OpenCL flags enabled. It takes in the remote context i.e the cl_context address as a void pointer.| -| enable_opencl_throttling | string | True/False | boolean | This option enables OpenCL queue throttling for GPU devices (reduces CPU utilization when using GPU). | -| enable_qdq_optimizer | string | True/False | boolean | This option enables QDQ Optimization to improve model performance and accuracy on NPU. | -| load_config | string | Any custom JSON path | string | This option enables a feature for loading custom JSON OV config during runtime which sets OV parameters. | -| disable_dynamic_shapes | string | True/False | boolean | This option enables rewriting dynamic shaped models to static shape at runtime and execute. | -| model_priority | string | LOW, MEDIUM, HIGH, DEFAULT | string | This option configures which models should be allocated to the best resource. | - - -Valid Hetero or Multi or Auto Device combinations: -`HETERO:,...` -The `device` can be any of these devices from this list ['CPU','GPU', 'NPU'] - -A minimum of two DEVICE_TYPE'S should be specified for a valid HETERO, MULTI, or AUTO Device Build. - -Example: -HETERO:GPU,CPU AUTO:GPU,CPU MULTI:GPU,CPU - -Deprecated device_type option : -CPU_FP32, GPU_FP32, GPU_FP16, NPU_FP16 are no more supported. They will be deprecated in the future release. Kindly upgrade to latest device_type and precision option. - +--- ## Support Coverage **ONNX Layers supported using OpenVINO** @@ -612,28 +788,35 @@ For NPU if model is not supported we fallback to CPU. **Note:** We have added support for INT8 models, quantized with Neural Network Compression Framework (NNCF). To know more about NNCF refer [here](https://github.com/openvinotoolkit/nncf). -## OpenVINO™ Execution Provider Samples Tutorials +--- + +# OpenVINO™ Execution Provider Samples & Tutorials + +In order to showcase what you can do with the OpenVINO™ Execution Provider for ONNX Runtime, we have created a few samples that show how you can get that performance boost you're looking for with just one additional line of code. -In order to showcase what you can do with the OpenVINO™ Execution Provider for ONNX Runtime, we have created a few samples that shows how you can get that performance boost you’re looking for with just one additional line of code. 
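+
+In practice, that "one additional line" is simply the provider selection when the session is created, for example (a minimal sketch, assuming the onnxruntime-openvino package is installed and `model.onnx` is a placeholder path):
+
+```python
+import onnxruntime as ort
+
+# The only OpenVINO™-specific change: request the OpenVINO™ Execution Provider
+session = ort.InferenceSession("model.onnx", providers=["OpenVINOExecutionProvider"])
+```
+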
+## Samples ### Python API -[Object detection with tinyYOLOv2 in Python](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/python/OpenVINO_EP/tiny_yolo_v2_object_detection) -[Object detection with YOLOv4 in Python](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/python/OpenVINO_EP/yolov4_object_detection) +- [Object detection with tinyYOLOv2 in Python](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/python/OpenVINO_EP/tiny_yolo_v2_object_detection) +- [Object detection with YOLOv4 in Python](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/python/OpenVINO_EP/yolov4_object_detection) ### C/C++ API -[Image classification with Squeezenet in CPP](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/c_cxx/OpenVINO_EP) -### Csharp API -[Object detection with YOLOv3 in C#](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/c_sharp/OpenVINO_EP/yolov3_object_detection) +- [Image classification with Squeezenet in C++](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/c_cxx/OpenVINO_EP) -## Blogs/Tutorials +### C# API + +- [Object detection with YOLOv3 in C#](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/c_sharp/OpenVINO_EP/yolov3_object_detection) + +## Blogs & Tutorials ### Overview of OpenVINO Execution Provider for ONNX Runtime -[OpenVINO Execution Provider](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/faster-inferencing-with-one-line-of-code.html) -### Tutorial on how to use OpenVINO™ Execution Provider for ONNX Runtime Docker Containers -[Docker Containers](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/openvino-execution-provider-docker-container.html) +[OpenVINO Execution Provider](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/faster-inferencing-with-one-line-of-code.html) - Learn about faster inferencing with one line of code + +### Python Pip Wheel Packages + +[Tutorial: Using OpenVINO™ Execution Provider for ONNX Runtime Python Wheel Packages](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/openvino-execution-provider-for-onnx-runtime.html) -### Tutorial on how to use OpenVINO™ Execution Provider for ONNX Runtime python wheel packages -[Python Pip Wheel Packages](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/openvino-execution-provider-for-onnx-runtime.html) +--- \ No newline at end of file