BEVFormer on Axera NPU

This repository contains the BEVFormer model converted for high-performance inference on the Axera NPU. BEVFormer is a transformer-based framework for 3D object detection that learns unified spatio-temporal bird's-eye-view (BEV) representations from multi-camera inputs.

This version is quantized to w8a16 (8-bit weights, 16-bit activations) and is compatible with Pulsar2 version 4.2.

Conversion Tool Links

For model conversion and deployment guidance:
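The conversion itself is done with the Pulsar2 toolchain mentioned above. As a rough sketch of that step (the ONNX file name and config file below are placeholders, not files shipped in this repository; consult the Pulsar2 documentation for the actual build configuration), a typical build invocation looks like:

pulsar2 build --input bevformer_tiny.onnx --config build_config.json --output_dir build_output --output_name compiled.axmodel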

Supported Platforms

| Chip  | Model Variant  | NPU1 Latency (Per Frame) | NPU3 Latency (Per Frame) |
|-------|----------------|--------------------------|--------------------------|
| AX650 | BEVFormer-Tiny | 253.966 ms               | 91.209 ms                |

How to Use

BEVFormer requires multi-view camera inputs (typically 6 views: front, front-left, front-right, back, back-left, back-right).
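For illustration, here is a minimal sketch of how six camera views might be stacked into a single model input. The view order, resolution, and tensor layout below are assumptions for illustration only, not the repository's actual preprocessing:

```python
# Sketch: stack six camera views into one (6, 3, H, W) float tensor.
# View order and target size are assumptions, not the script's values.
import cv2
import numpy as np

VIEW_ORDER = ["front", "front_left", "front_right",
              "back", "back_left", "back_right"]  # assumed order

def stack_views(image_paths, size=(800, 450)):
    """image_paths: dict mapping view name -> image file path."""
    views = []
    for name in VIEW_ORDER:
        img = cv2.imread(image_paths[name])   # BGR, HWC, uint8
        img = cv2.resize(img, size)           # cv2 takes (W, H)
        views.append(img.transpose(2, 0, 1))  # HWC -> CHW
    return np.stack(views).astype(np.float32)
```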

Prerequisites

  1. Environment: Ensure you have the required Python environment activated (e.g., using Conda or a virtual environment) with the following core packages installed:

    • NPU Runtime: axengine (PyAXEngine)
    • Core Libraries: numpy (>= 1.22.0), opencv-python (cv2), tqdm, and cffi.
    • (Recommended: use a dedicated Conda environment to manage these dependencies. A quick import check is sketched after this list.)
  2. Model/Data: Ensure the compiled .axmodel, inference_config.json, and input data (inference_data/) are available on the host.
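As a convenience (not part of the repository), a minimal check that the packages listed above are importable, using the numpy version floor from this README:

```python
# Verify the core prerequisites are installed and report their versions.
import importlib

for pkg in ("axengine", "numpy", "cv2", "tqdm", "cffi"):
    mod = importlib.import_module(pkg)  # raises ImportError if missing
    print(pkg, getattr(mod, "__version__", "unknown"))

import numpy as np
assert tuple(int(x) for x in np.__version__.split(".")[:2]) >= (1, 22)
```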

Inference Command

Run the inference script by providing the compiled model, configuration, and data directory.

python inference_axmodel.py compiled.axmodel inference_config.json inference_data/ --output-dir inference_results
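Under the hood, the script loads the model through axengine (PyAXEngine), which exposes an onnxruntime-style API. The sketch below shows that core loop in isolation; input names, shapes, and dtypes vary per compiled model, so query the session rather than relying on the zero-filled placeholder feed shown here:

```python
# Minimal sketch of axmodel inference via PyAXEngine (illustration only).
import axengine as axe
import numpy as np

session = axe.InferenceSession("compiled.axmodel")

for inp in session.get_inputs():
    print(inp.name, inp.shape)  # discover the model's real input spec

# Zero-filled feeds; quantized .axmodel inputs are often uint8, so match
# the dtype your session actually reports.
feeds = {inp.name: np.zeros(inp.shape, dtype=np.uint8)
         for inp in session.get_inputs()}
outputs = session.run(None, feeds)
print([o.shape for o in outputs])
```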

Inference on an AX650 Host

(base) root@ax650:~/data# python inference_axmodel.py compiled.axmodel inference_config.json inference_data/ --output-dir ./inference_results
[INFO] Available providers:  ['AXCLRTExecutionProvider', 'AxEngineExecutionProvider']
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.12.0s
[INFO] Model type: 0 (single core)
[INFO] Compiler version: 5.1-patch1 82190926
Processing scene 1/2: fcbccedd61424f1b85dcbf8f897f9754 (40 frames)
Scene fcbccedd61424f1b85dcbf8f897f9754:  28%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                              | 11/40 [00:12<00:33,  1.15s/it]
/root/guofangming/inference_axmodel.py:389: RuntimeWarning: invalid value encountered in cast
  corners = imgfov_pts_2d[i].astype(np.int32)
Scene fcbccedd61424f1b85dcbf8f897f9754: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 40/40 [00:47<00:00,  1.18s/it]
Processing scene 2/2: 325cef682f064c55a255f2625c533b75 (41 frames)
Scene 325cef682f064c55a255f2625c533b75: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 41/41 [00:48<00:00,  1.18s/it]
Creating video: fcbccedd61424f1b85dcbf8f897f9754_result.mp4: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 40/40 [00:08<00:00,  4.83it/s]
βœ“ Scene fcbccedd61424f1b85dcbf8f897f9754: 40 frames, video: ./inference_results/fcbccedd61424f1b85dcbf8f897f9754/fcbccedd61424f1b85dcbf8f897f9754_result.mp4
Save scene results:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                    | 1/2 [00:23<00:23, 23.05s/it]
Creating video: 325cef682f064c55a255f2625c533b75_result.mp4:   7%|β–ˆβ–‹                     | 3/41 [00:00<00:07,  4.92it/s]
Creating video: 325cef682f064c55a255f2625c533b75_result.mp4: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 41/41 [00:08<00:00,  4.78it/s]
βœ“ Scene 325cef682f064c55a255f2625c533b75: 41 frames, video: ./inference_results/325cef682f064c55a255f2625c533b75/325cef682f064c55a255f2625c533b75_result.mp4
Save scene results: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:47<00:00, 23.79s/it]
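The RuntimeWarning in the log above comes from casting projected box corners that contain NaN or inf (e.g., points behind the camera) to int32; it does not stop the run. A guard along these lines would avoid it (the array name matches the traceback; the skip-the-box policy is an assumption, not the script's actual fix):

```python
import numpy as np

def safe_corners(imgfov_pts_2d, i):
    """Cast projected corners to int32 only when all values are finite."""
    pts = imgfov_pts_2d[i]
    if not np.all(np.isfinite(pts)):  # NaN/inf from invalid projections
        return None                    # caller skips drawing this box
    return np.rint(pts).astype(np.int32)
```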

Results

The model generates 3D detections projected onto the bird's-eye-view plane. Results are saved as images and videos that visualize the ego vehicle and the surrounding detected objects.
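For reference, a minimal sketch of how such a BEV visualization can be drawn with OpenCV. The canvas size, range, and the (x, y, width, length, yaw) box format are assumptions for illustration, not the script's actual code:

```python
# Sketch: draw ego vehicle and detected boxes on a top-down BEV canvas.
import cv2
import numpy as np

def draw_bev(boxes, canvas_size=800, range_m=51.2):
    """boxes: iterable of (x, y, w, l, yaw) in ego coordinates (meters)."""
    img = np.zeros((canvas_size, canvas_size, 3), np.uint8)
    scale = canvas_size / (2 * range_m)             # pixels per meter
    centre = canvas_size // 2
    cv2.drawMarker(img, (centre, centre), (0, 255, 0),
                   cv2.MARKER_TRIANGLE_UP, 16, 2)   # ego vehicle
    for x, y, w, l, yaw in boxes:
        rect = ((centre + x * scale, centre - y * scale),
                (w * scale, l * scale), -np.degrees(yaw))
        pts = cv2.boxPoints(rect).astype(np.int32)  # rotated box corners
        cv2.polylines(img, [pts], True, (0, 128, 255), 2)
    return img
```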

Example Visualization: BEVFormer Detection Result GIF
