Project RP014: Optimizing Llama.cpp and GGML for RVV

Bidding Starts: 8/12/2025

Bidding Ends: 9/02/2025

 

SUMMARY:

The goal of this Request for Proposal (RFP) is to bring RISC-V support within the Llama.cpp and GGML ecosystem on par with established architectures such as x86 and ARM: robust, performant, and well-supported. The work involves comprehensive support for the RISC-V Vector (RVV) 1.0 extension, with optimized, VLEN-agnostic implementations. Contractors are expected to make their best effort to upstream all developed contributions to the Llama.cpp community, targeting the BPI-F3 board and using only ratified RISC-V extensions.

Current State of RISC-V Vector Support in Llama.cpp/GGML:

Current RVV support in Llama.cpp/GGML is incremental and kernel-specific, primarily focusing on 128-bit VLEN for a subset of quantization kernels.

Key Gaps Identified:

  • Incomplete 128-bit RVV Support: RVV acceleration needs systematic extension to cover diverse quantization types (e.g., Q2_K, Q3_K, Q5_K, Q6_K, Q8_0, FP16, FP32) and a broader range of GGML operations beyond initial dot product implementations.

  • Absence of Mature VLEN-Agnostic RVV Support: Critical lack of optimization for VLEN > 128-bit, essential for leveraging emerging wider vector hardware (e.g., BananaPi BPI-F3).

Milestone to Deliver: VLEN-Agnostic RVV Implementation

  • Action: Implement VLEN-Agnostic RVV versions for performance-critical GGML functions. This extends work initiated in PR #12530 and PR #13892.

    • It’s important to implement VLEN-agnostic code, since many RISE member companies sell RISC-V core products with VLENs ranging from 64 bits up to 1024 bits.

    • Add dynamic dispatch that selects, based on the VLEN of the hardware the code is currently running on, among implementations each optimized for a specific VLEN. This mechanism would be similar to ifunc in glibc; a minimal sketch follows below.
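
    As an illustration, here is a minimal resolver sketch in C using the standard RVV intrinsics. The kernel names are hypothetical, and detecting whether the V extension is present at all (e.g., via Linux hwprobe) is omitted; __riscv_vsetvlmax_e8m1() returns VLMAX for e8/m1, which equals VLEN/8.

        /* Hypothetical ifunc-style resolver: select a kernel once, based on
           the hardware VLEN recovered from VLMAX(e8, m1) = VLEN / 8. */
        #if defined(__riscv_v_intrinsic)
        #include <riscv_vector.h>
        #endif
        #include <stddef.h>

        typedef void (*vec_dot_fn)(int n, float *s, const void *x, const void *y);

        /* Illustrative per-VLEN implementations, defined elsewhere. */
        void vec_dot_rvv256(int n, float *s, const void *x, const void *y);
        void vec_dot_rvv128(int n, float *s, const void *x, const void *y);

        /* Portable fallback when RVV is unavailable. */
        static void vec_dot_scalar(int n, float *s, const void *x, const void *y) {
            const float *xf = (const float *)x, *yf = (const float *)y;
            float sum = 0.0f;
            for (int i = 0; i < n; i++) sum += xf[i] * yf[i];
            *s = sum;
        }

        static vec_dot_fn resolve_vec_dot(void) {
        #if defined(__riscv_v_intrinsic)
            size_t vlen_bits = __riscv_vsetvlmax_e8m1() * 8;
            if (vlen_bits >= 256) return vec_dot_rvv256;
            if (vlen_bits >= 128) return vec_dot_rvv128;
        #endif
            return vec_dot_scalar;
        }

        static vec_dot_fn vec_dot_impl = NULL;

        void vec_dot(int n, float *s, const void *x, const void *y) {
            if (!vec_dot_impl) vec_dot_impl = resolve_vec_dot();
            vec_dot_impl(n, s, x, y);
        }

    In production the resolver would typically run once at load time, as glibc's ifunc does, rather than being checked on every call.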

  • Scope: This includes the following functions and their purposes:

    • Quantization/Dequantization Kernels: Functions such as quantize_row_q8_K, ggml_quantize_mat_q8_0_4x4, ggml_quantize_mat_q8_0_4x8, ggml_quantize_mat_q8_K_4x8
      These are crucial for converting floating-point data to various quantized integer formats and vice versa, directly impacting memory footprint and inference speed.

    • Vector Dot Products (Quantized): Functions like ggml_vec_dot_tq1_0_q8_K, ggml_vec_dot_tq2_0_q8_K, ggml_vec_dot_iq2_xxs_q8_K, ggml_vec_dot_iq2_xs_q8_K, ggml_vec_dot_iq2_s_q8_K, ggml_vec_dot_iq3_xxs_q8_K, ggml_vec_dot_iq3_s_q8_K, ggml_vec_dot_iq1_s_q8_K, ggml_vec_dot_iq1_m_q8_K, ggml_vec_dot_iq4_nl_q8_0, ggml_vec_dot_iq4_xs_q8_K
      These are core operations for efficient matrix multiplication within quantized LLMs, performing element-wise multiplication and summation of vectors (a VLEN-agnostic stripmining sketch follows this list).

    • Matrix-Vector Multiplication (GEMV): Functions including ggml_gemv_q4_0_4x4_q8_0, ggml_gemv_q4_0_4x8_q8_0, ggml_gemv_q4_K_8x8_q8_K, ggml_gemv_iq4_nl_4x4_q8_0
      These are fundamental operations for applying weights to input vectors in neural network layers. 

    • Matrix-Matrix Multiplication (GEMM): Functions such as ggml_gemm_q4_0_4x4_q8_0, ggml_gemm_q4_0_4x8_q8_0, ggml_gemm_q4_K_8x8_q8_K, ggml_gemm_iq4_nl_4x4_q8_0
      These are critical for the linear layers and attention mechanisms in LLMs.

    • Activation Functions and Utilities: This category includes ggml_vec_dot_bf16, ggml_vec_silu_f32, and ggml_vec_soft_max_f32, as well as ensuring that GGML_SIMD and all related GGML_* defines in simd-mappings.h are correctly utilized, and llamafile_sgemm (based on tinyBLAS) for general matrix multiplication. These cover various non-linear activations and general SIMD utility functions.
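
    To make "VLEN-agnostic" concrete, the following sketch shows the stripmining pattern on a plain FP32 dot product, using the standard RVV C intrinsics. Nothing hard-codes a vector width: each vsetvl call lets the hardware pick the strip length from its own VLEN, so the same binary exploits 128-bit and 1024-bit implementations alike. The quantized kernels listed above follow the same pattern with additional packing and scaling steps.

        #include <riscv_vector.h>
        #include <stddef.h>

        /* VLEN-agnostic FP32 dot product: vsetvl chooses each strip length
           from the hardware VLEN, so no vector width is hard-coded. */
        float vec_dot_f32_rvv(size_t n, const float *x, const float *y) {
            /* Running sum held in element 0 of a vector register. */
            vfloat32m1_t acc = __riscv_vfmv_v_f_f32m1(0.0f, __riscv_vsetvlmax_e32m1());
            for (size_t i = 0, vl; i < n; i += vl) {
                vl = __riscv_vsetvl_e32m8(n - i);
                vfloat32m8_t vx = __riscv_vle32_v_f32m8(x + i, vl);
                vfloat32m8_t vy = __riscv_vle32_v_f32m8(y + i, vl);
                vfloat32m8_t prod = __riscv_vfmul_vv_f32m8(vx, vy, vl);
                /* Fold this strip's products into acc[0]. */
                acc = __riscv_vfredusum_vs_f32m8_f32m1(prod, acc, vl);
            }
            return __riscv_vfmv_f_s_f32m1_f32(acc);
        }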

  • Target Hardware: We are currently targeting the BananaPi BPI-F3 board. Only ratified extensions are to be used (RVA23 mandatory extensions; no AME/IME currently). Every change must be functionally tested under QEMU across multiple hardware VLEN configurations (at minimum: no RVV support, plus 128, 256, 512, and 1024 bits). This can be done with QEMU user-mode (syscall) emulation by setting the QEMU_CPU environment variable, for example to rv64,v=true,vlen=1024,vext_spec=v1.0 to emulate an RVV 1.0 machine with a 1024-bit VLEN; see the example below.
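
    For example, assuming a cross-compiled build, one of Llama.cpp's existing test binaries can be swept across emulated VLENs like this (the binary name is illustrative; any test executable works the same way):

        for vlen in 128 256 512 1024; do
            QEMU_CPU="rv64,v=true,vlen=$vlen,vext_spec=v1.0" \
                qemu-riscv64 ./build/bin/test-quantize-fns
        done
        # Plus one run without RVV to exercise the scalar fallback:
        QEMU_CPU="rv64,v=false" qemu-riscv64 ./build/bin/test-quantize-fns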

  • Success criteria:

    • Demonstrate performance improvements with the RVV-optimized code on the TinyLlama and BERT-Large-uncased models for inference, with FP32, FP16, INT8, and Q4_K_M datatypes on the BananaPi BPI-F3.

    • Deliver a benchmarking and test harness, as open source software, to enable RISE and the broader open source community to verify and reproduce the performance gains from this work. This should include both an overall end-to-end benchmark and integration test framework, plus a framework that can be used to run and measure the performance of individual kernels or operators (see the example after this list). RISE expects the contractor to work with the Llama.cpp community to upstream this test framework into mainline Llama.cpp.

    • Write a short summary describing the work done, along with any notable results and areas for potential improvement.

    • Post a pull request with the work to upstream Llama.cpp, and address all maintainer feedback.
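
    As a starting point for the end-to-end side of the harness, Llama.cpp's existing llama-bench tool already reports prompt-processing and token-generation throughput and could be wrapped or extended (the model filename below is a placeholder):

        ./build/bin/llama-bench -m tinyllama-1.1b-q4_k_m.gguf -p 512 -n 128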

 

Interested vendors should submit proposals that include:

  1. Technical approach and implementation plan.

  2. A breakdown of the total cost, along with individual costs and durations for each milestone.


Please read the RISE RFP instructions PRIOR to bidding.

Some things to note include:

  • Contracts will be written using the Standard Linux Foundation Europe Paper with the SOW and payment schedule added as an addendum. 

    • Please review it prior to your bid submission and raise any concerns.

    • Contract language is not negotiable, as the Linux Foundation will be contracting the work and paying the invoices.

  • Contracts are milestone based, not hourly.

  • Biweekly progress reporting is a requirement of this contract.


Bidding is Closed