Project RP013: Optimizing PyTorch ATen Operators for High-Performance RISC-V Hardware
Bidding Starts: 4/9/2025
Bidding Ends: 5/23/2025
SUMMARY:
To stay competitive, RISC-V hardware must not only support but also deliver high-performance, out-of-the-box compatibility with essential AI tools. To advance this goal, RISE is funding the optimization of PyTorch ATen operators for RISC-V.
This RFP aims to optimize PyTorch ATen operators for RISC-V, specifically leveraging the RISC-V Vector (RVV) architecture. The goal is to enhance performance and compatibility of PyTorch on RISC-V hardware, ensuring it runs efficiently out of the box. The work involves optimizing PyTorch ATen operators for RVV, implementing vector-length agnostic (VLA) support, and improving OpenBLAS for common matrix shapes used in machine learning models. Contributions will be upstreamed to the PyTorch community, targeting the BPI-F3 board and using only ratified RISC-V extensions.
Milestones to Deliver:
Optimize PyTorch ATen operators on CPU for RVV
Optimize PyTorch ATen operators for the RVV architecture to leverage its vector processing capabilities.
Optimize for Vector Length Agnostic (VLA) operation.
Based on the work done in https://github.com/pytorch/pytorch/pull/135570, add a VLA implementation of the aten/vec library.
Ensure compatibility and performance across various PyTorch models and workloads on RVV-enabled CPUs.
Contribute optimizations and bug fixes back to the PyTorch upstream community.
Current plan: Target the BPI-F3 board. Only ratified extensions are to be used (RVA23 mandatory extensions, no AME/IME currently)
The torch.compile feature is out of scope for this milestone.
Single-core vs multi-core: initial optimization is for single-core; the expectation is that it will scale sufficiently well.
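To illustrate the VLA requirement, the sketch below shows the strip-mining loop pattern that vector-length-agnostic code follows. It is plain C, not actual RVV intrinsics: `set_vl` is a hypothetical scalar stand-in for the `__riscv_vsetvl_e32m1` intrinsic, and the inner loop stands in for vector load/add/store instructions. The key property is that the loop never hard-codes a vector width, so the same code is correct on any RVV implementation.

```c
#include <stddef.h>

/* Hypothetical stand-in for the RVV vsetvl instruction: given the number
 * of elements remaining, return the vector length processed this
 * iteration. The 8-lane cap simulates one possible hardware VLMAX. */
static size_t set_vl(size_t remaining) {
    const size_t vlmax = 8; /* hardware-dependent on real RVV */
    return remaining < vlmax ? remaining : vlmax;
}

/* Vector-length-agnostic element-wise add: correct for any n and any
 * vector length, with no separate tail loop. */
void vla_add(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    while (i < n) {
        size_t vl = set_vl(n - i);       /* __riscv_vsetvl_e32m1(n - i) on RVV */
        for (size_t j = 0; j < vl; ++j)  /* vle32 / vfadd / vse32 on RVV */
            out[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}
```

On real hardware the scalar inner loop becomes a handful of vector instructions, and the final iteration simply runs with a shorter `vl` instead of falling through to scalar cleanup code.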
Optimize OpenBLAS for various matrix shapes for PyTorch ATen operators
We focus on the aten::mm and aten::addmm operators, which call into OpenBLAS for the types float32, float64, complex64, and complex128. The common matrix shapes we want to optimize are listed here for aten::mm and here for aten::addmm.
Improve tail handling in the current kernel implementations for 128-bit and 256-bit vector widths.
The current implementation processes the remaining elements at the end of rows or columns using loops with a decreasing number of elements. For instance, the sgemm 256-bit vector-wide kernel handles the final M<16 columns with separate loops for 8, 4, 2, and 1 element(s). To simplify and improve performance, we aim to utilize RVV's tail-handling features instead of these decreasing-element loops. A similar approach should be taken for the last N<8 rows.
Changes are expected in the following source files (including, but not limited to; contributions may be needed in other files):
Success criteria - Upon completion and integration of these modifications, PyTorch will leverage the optimized kernels developed during this stage for enhanced performance.
The performance will be measured with torchbench.py before and after the proposed changes.
The models listed in torchperf are used to measure performance uplift. The reproduction steps are documented in https://gitlab.com/riseproject/torchperf
Interested vendors should submit their proposals including:
Technical approach and implementation plan.
Please provide a breakdown of the total cost along with the individual costs and durations for each milestone.
Proficiency in Chinese/Mandarin is highly desirable, as the position requires regular collaboration with Chinese-speaking stakeholders.
Please read the RISE RFP instructions PRIOR to bidding.
Some things to note include:
Contracts will be written using the Standard Linux Foundation Europe Paper with the SOW and payment schedule added as an addendum.
Please review prior to your bid submission to address any concerns.
Contract Language is not negotiable as Linux Foundation will be contracting the work and paying the invoices.
Contracts are milestone based, not hourly.
Biweekly progress reporting is a requirement of this contract.
Bidding Closed