This article outlines methods for improving frames-per-second (FPS) performance in ML deployments leveraging Vitis AI across MPSoC, Kria, and Versal platforms.
If your AI/ML solution needs to run at much higher FPS or with much lower overall latency, here are some ways to achieve those improvements. In this article, we have segmented the potential optimizations into three categories: hardware design level, ML model preparation level, and FPGA deployment level.
Hardware (Vivado or Vitis Overlay):
- Increase the clock frequency of the DPU. In general, the DPU frequency can go up to 350 MHz (with a 2x clock of 700 MHz), while for Versal the recently released NPU clock (Vitis AI 5.1 support) can go considerably higher.
- Enable a multi-core DPU/NPU, as shown in the configuration sketch after this list. Reference – Changing the number of DPU cores in the DPU TRD designs – Vivado/Vitis.
- Use the largest DPU or NPU architecture that fits on the MPSoC/Kria or Versal device (for example, B4096 for the DPU on MPSoC/Kria).
- Offload the pre-processing and post-processing logic to the PL fabric – follow the Whole App Acceleration (WAA) flow for reference. A related article – Whole App Acceleration for MPSoC Boards – Flow Summary.
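In the Vitis (overlay) flow, the DPU core count and clocks from the first two points are set in the v++ linker configuration (the prj_config file of the DPU TRD). The excerpt below is a hedged sketch: the DPUCZDX8G kernel name applies to MPSoC/Kria, and the exact clock and instance names vary between TRD versions, so treat it as a template rather than a drop-in file.

```
# Hypothetical v++ prj_config excerpt: two DPU cores at 300/600 MHz.
# Kernel, instance, and clock names depend on the DPU TRD version in use.
[clock]
freqHz=300000000:DPUCZDX8G_1.aclk
freqHz=600000000:DPUCZDX8G_1.ap_clk_2
freqHz=300000000:DPUCZDX8G_2.aclk
freqHz=600000000:DPUCZDX8G_2.ap_clk_2

[connectivity]
# Instantiate two DPU cores instead of one.
nk=DPUCZDX8G:2:DPUCZDX8G_1.DPUCZDX8G_2
```

In the Vivado flow, the equivalent settings (core count, architecture, clocking) are made in the DPU IP configuration GUI instead.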
ML Model Preparation (during training):
- Choose a lightweight neural network variant that matches the classification/detection requirement.
- Train the neural network with a low- or medium-range input shape, e.g., 256 x 256, 320 x 320, or 416 x 416; the sketch after this list shows why this matters. Reference discussion – How to achieve higher FPS on YOLO nano models at 1280×720 or higher on ZCU106?
- Perform pruning if required. Reference example – Pruning a YOLO network and deploying with Vitis AI on Kria/MPSoC.
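To see why the input shape matters, the toy PyTorch sketch below (a hypothetical three-layer convolutional stack, not an actual YOLO) times the same network at three input resolutions. Convolution cost scales roughly with height x width, so a 416 x 416 input needs about 5x less compute than 1280 x 720.

```python
# Toy demonstration of input-shape vs. compute cost; assumes PyTorch is
# installed. The layer stack is illustrative, not a real detector.
import time
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
).eval()

for h, w in [(720, 1280), (416, 416), (256, 256)]:
    x = torch.randn(1, 3, h, w)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(10):
            net(x)
        per_frame = (time.perf_counter() - start) / 10
    print(f"{h} x {w}: {per_frame * 1e3:.1f} ms per frame")
```

The same ratio carries over to the DPU: for convolution-dominated networks, halving each input dimension roughly quadruples the achievable FPS.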
During Inference or Model Deployment:
- Implement the inference logic with multi-threading across the CPU (PS) and DPU; a sketch follows this list. The Vitis AI runtime examples show how to implement this – check the VART examples (available in Python and C++): https://github.com/Xilinx/Vitis-AI/blob/3.0/examples/vai_runtime
- Reduce other CPU-intensive tasks during deployment, for example by limiting CPU/PS involvement in displaying or visualizing results, since visualization can be done at a lower resolution.
- Implement the inference logic in C++ instead of Python; see the Vitis AI runtime examples.
- To run multiple ML models, leverage the VVAS framework. Reference – VVAS 3.0 – Feature Summary & Supported ML Model Class.
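Below is a minimal Python sketch of the multi-threading idea, patterned after the VART examples linked above. It assumes a compiled model.xmodel and feeds dummy int8 frames; in a real pipeline each thread would pull preprocessed frames from a queue, and buffer dtypes/shapes depend on your model.

```python
# Minimal multi-threaded VART inference sketch. "model.xmodel" and the
# zeroed input frames are placeholders for a real model and real data.
import threading
import time
import numpy as np
import vart
import xir

def get_dpu_subgraph(graph):
    # The DPU-executable part of a compiled model is the child subgraph
    # whose "device" attribute is "DPU".
    return [s for s in graph.get_root_subgraph().toposort_child_subgraph()
            if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]

def worker(subgraph, num_frames):
    # Each thread owns its own runner so jobs from different threads
    # can overlap and keep all DPU cores busy.
    runner = vart.Runner.create_runner(subgraph, "run")
    in_buf = np.zeros(tuple(runner.get_input_tensors()[0].dims), dtype=np.int8)
    out_buf = np.empty(tuple(runner.get_output_tensors()[0].dims), dtype=np.int8)
    for _ in range(num_frames):
        job = runner.execute_async([in_buf], [out_buf])
        runner.wait(job)
        # post-process out_buf here (ideally on yet another thread)

graph = xir.Graph.deserialize("model.xmodel")
subgraph = get_dpu_subgraph(graph)
num_threads, frames_per_thread = 4, 100
threads = [threading.Thread(target=worker, args=(subgraph, frames_per_thread))
           for _ in range(num_threads)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
fps = num_threads * frames_per_thread / (time.perf_counter() - start)
print(f"approx. throughput: {fps:.1f} FPS")
```

Tuning the thread count against the number of DPU cores is an empirical exercise: a few more threads than cores helps hide pre/post-processing time, while too many only add CPU contention.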
Tracing Tools for Analyzing Performance or Latency:
- On the PetaLinux image, one can use the Vitis AI Profiler (vaitrace) or generic time-measurement logic to determine the per-layer time or the latency of a full ML model run; a sketch of the latter follows this list. A tutorial for the Vitis AI Profiler is available here.
- Identify the bottleneck point or region and focus the optimization effort there.
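As a minimal sketch of such generic time-measurement logic (assuming the same hypothetical model.xmodel as in the multi-threading sketch above), the snippet below averages the DPU round-trip over many runs. It measures whole-model latency only; per-layer breakdowns are what the Vitis AI Profiler adds on top.

```python
# Coarse latency/FPS measurement around the DPU runner; this times the
# full execute/wait round-trip, not individual network layers.
import time
import numpy as np
import vart
import xir

graph = xir.Graph.deserialize("model.xmodel")
subgraph = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
            if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]
runner = vart.Runner.create_runner(subgraph, "run")
in_buf = np.zeros(tuple(runner.get_input_tensors()[0].dims), dtype=np.int8)
out_buf = np.empty(tuple(runner.get_output_tensors()[0].dims), dtype=np.int8)

runs = 100
start = time.perf_counter()
for _ in range(runs):
    job = runner.execute_async([in_buf], [out_buf])
    runner.wait(job)
elapsed = time.perf_counter() - start
print(f"avg latency: {elapsed / runs * 1e3:.2f} ms, "
      f"throughput: {runs / elapsed:.1f} FPS")
```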
The above-mentioned approaches are compatible with both the Vivado and Vitis flows for Vitis AI/DPU-based AI/ML acceleration; the only difference between the two flows is how the system is prepared or made ready.
LogicTronix AI-ML Acceleration flow:

This article mainly highlights the workflow and optimization strategies for accelerating machine-learning workloads and deploying models on AMD-Xilinx Adaptive SoCs and FPGA platforms. An alternative method for AI/ML acceleration involves leveraging pre-built accelerator modules or implementing custom ML accelerators in RTL.
At LogicTronix, we deliver tailored AI/ML acceleration solutions aligned with project performance needs, spanning the Vitis AI toolchain, accelerator-module integration flows, and custom RTL-based accelerator design.
LogicTronix is an AMD-Xilinx Partner for FPGA Design and AI/ML Acceleration!
Would you like to accelerate an ML model on AMD-Xilinx FPGA/MPSoC/Kria or Versal devices, and are you looking for a partner for the acceleration/deployment? If so, please fill out the following form.
