

# Unlocking 15% More Performance: A Case Study in LLVM Optimization for RISC-V

Mikhail R. Gadelha, PhD, Igalia









































































































- This RISE project was a focused, ten-month effort to optimize the LLVM compiler for RISC-V.
- Our target board was the Banana Pi BPI-F3 with a SpacemiT-X60 8-core RISC-V processor:
  - In-order processor.
  - Supports the RVA22U64 Profile and 256-bit RVV 1.0 standard.
- Our goal: to close the performance gap between LLVM and the GCC compiler.
- Our result: individual contributions boosted performance by up to 16% on SPEC CPU® 2017 benchmarks.

## The Project






- Prior to this work, a clear performance gap existed between code generated by LLVM and GCC for RISC-V.
- There is no single solution to close the gap, as improvements and regressions occur daily within the codebase.
- This presentation will focus on our three main contributions to help close the gap:
  - Introducing a scheduling model for the SpacemiT-X60.
  - Improvements to vectorization across calls.
  - Register Allocation with IPRA Support for RISC-V.



## **Our Contributions**

(major contributions first)





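- Instruction scheduling == performance.
- Wrong latencies/resources → compiler makes poor choices.

Before: `fadd.d` needs `ft0`, so it stalls until the load completes, and the independent `fmul.d` waits behind it.

```
fld ft0, 0(a0)
fadd.d ft1, ft0, ft2
fmul.d ft3, ft4, ft5
```

After: the independent `fmul.d` is moved between the load and its use, hiding part of the load latency.

```
fld ft0, 0(a0)
fmul.d ft3, ft4, ft5
fadd.d ft1, ft0, ft2
```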
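To see why the reordering helps on an in-order core, the two sequences can be run through a toy single-issue pipeline model. This is only a sketch: the latencies below are hypothetical values picked for illustration, not measured X60 numbers, and `total_cycles` is a made-up helper, not part of LLVM.

```python
# Hypothetical latencies in cycles, for illustration only (not measured X60 values).
LATENCY = {"fld": 3, "fadd.d": 4, "fmul.d": 4}


def total_cycles(program):
    """Toy single-issue, in-order model.

    An instruction issues at the earliest cycle where the previous
    instruction has already issued and all its source registers are ready.
    Returns the cycle at which the last result becomes available.
    """
    ready = {}       # register -> cycle when its value is available
    next_issue = 0   # earliest cycle the next instruction may issue
    finish = 0
    for op, dst, srcs in program:
        start = max([next_issue] + [ready.get(reg, 0) for reg in srcs])
        done = start + LATENCY[op]
        ready[dst] = done
        next_issue = start + 1   # in-order: later instructions cannot overtake
        finish = max(finish, done)
    return finish


BEFORE = [
    ("fld", "ft0", ["a0"]),
    ("fadd.d", "ft1", ["ft0", "ft2"]),
    ("fmul.d", "ft3", ["ft4", "ft5"]),
]
AFTER = [
    ("fld", "ft0", ["a0"]),
    ("fmul.d", "ft3", ["ft4", "ft5"]),
    ("fadd.d", "ft1", ["ft0", "ft2"]),
]

print(total_cycles(BEFORE), total_cycles(AFTER))  # prints "8 7"
```

With these assumed latencies the reordered sequence finishes one cycle earlier. The gain depends entirely on the latencies fed to the model, which is why accurate numbers in the scheduling model matter.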






#### **How We Got the Numbers**






- We built custom microbenchmarks to measure instruction latencies.
- Most of the instruction throughput data is available at <a href="https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.html">https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.html</a>.
- It's RISC but:
  - 201 scalar instructions.
  - 82 floating-point instructions.
  - 9185 RVV instructions (because of the combination of different LMULs and SEWs).
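The RVV count grows so quickly because each vector instruction has to be characterised once per SEW/LMUL configuration. The sketch below enumerates the configurations, assuming ELEN = 64 and the RVV 1.0 rule that a fractional LMUL supports SEW values up to LMUL × ELEN. It does not attempt to reproduce the 9185 figure, which also depends on how many instructions and operand variants are modelled.

```python
from fractions import Fraction

ELEN = 64  # assumption: widest supported element, in bits
SEWS = [8, 16, 32, 64]
LMULS = [Fraction(1, 8), Fraction(1, 4), Fraction(1, 2),
         Fraction(1), Fraction(2), Fraction(4), Fraction(8)]


def valid_configs():
    """All (SEW, LMUL) pairs with SEW <= LMUL * ELEN.

    For LMUL >= 1 every SEW qualifies; for fractional LMUL only the
    narrower element widths do.
    """
    return [(sew, lmul) for lmul in LMULS for sew in SEWS
            if sew <= lmul * ELEN]


configs = valid_configs()
print(len(configs))  # prints 22
```

Under these assumptions, a single vector opcode can need up to 22 separate latency entries, before counting operand variants such as vector-vector, vector-scalar, and vector-immediate forms.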

#### **RVA22U64 SPEC exec time, O3+LTO+mcpu=spacemit-x60**










#### RVA22U64\_V SPEC exec time, O3+LTO+mcpu=spacemit-x60





#### RVA22U64 vs RVA22U64\_V, O3+LTO+mcpu=spacemit-x60





- Scheduling nearly eliminated the gap between scalar and vector configs.
- On in-order processors like the X60, scheduling is critical; on out-of-order processors, its impact would be smaller and vectorization would be more decisive.








- Initial surprise: vectorized code sometimes underperformed scalar.
- Root cause: poor cost modeling and suboptimal spill behavior.
- The extra cycles were due to register spilling, particularly around function call boundaries.



```
declare void @g()

define void @f(i1 %c, ptr %p, ptr %q) {
entry:
  %x0 = load i64, ptr %p                  ; loads first value from %p
  %p1 = getelementptr i64, ptr %p, i64 1
  %x1 = load i64, ptr %p1                 ; loads second value from %p
  br i1 %c, label %foo, label %bar        ; conditional jump
foo:
  call void @g()                          ; conditional call to g()
  br label %baz
bar:
  call void @g()                          ; conditional call to g()
  br label %baz
baz:
  store i64 %x0, ptr %q                   ; stores first value to %q
  %q1 = getelementptr i64, ptr %q, i64 1
  store i64 %x1, ptr %q1                  ; stores second value to %q
  ret void
}
```



#### Fixing Real Bugs






- Since we are loading from and storing to contiguous regions, the accesses can be vectorized.
- We found that the SLP Vectorizer was aggressively vectorizing regions without properly accounting for the cost of spilling vector registers across calls.
- Previously, the SLP vectorizer only analyzed the entry and baz blocks, ignoring foo and bar entirely.



```
declare void @g()

define void @f(i1 %c, ptr %p, ptr %q) {
entry:
  %x0 = load i64, ptr %p
  %p1 = getelementptr i64, ptr %p, i64 1
  %x1 = load i64, ptr %p1
  br i1 %c, label %foo, label %bar
foo:                                      ; not being analyzed
  call void @g()
  br label %baz
bar:                                      ; not being analyzed
  call void @g()
  br label %baz
baz:
  store i64 %x0, ptr %q
  %q1 = getelementptr i64, ptr %q, i64 1
  store i64 %x1, ptr %q1
  ret void
}
```
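The effect of skipping `foo` and `bar` can be illustrated with a toy cost comparison. This is a sketch under stated assumptions, not the SLP vectorizer's actual cost model: the cost constants, the `CFG` encoding, and the helper names are all hypothetical.

```python
# Toy CFG for @f: block -> (successors, number of calls in the block).
CFG = {
    "entry": (["foo", "bar"], 0),
    "foo": (["baz"], 1),
    "bar": (["baz"], 1),
    "baz": ([], 0),
}

SCALAR_COST = 4          # hypothetical: 2 scalar loads + 2 scalar stores
VECTOR_COST = 2          # hypothetical: 1 vector load + 1 vector store
SPILL_COST_PER_CALL = 4  # hypothetical: vector spill + reload around a call


def reachable(cfg, start):
    """Blocks reachable from start by following successor edges."""
    seen, stack = set(), [start]
    while stack:
        for succ in cfg[stack.pop()][0]:
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def blocks_between(cfg, src, dst):
    """Blocks that lie on some path from src to dst, excluding dst."""
    return {b for b in reachable(cfg, src)
            if b != dst and dst in reachable(cfg, b)}


def vector_cost(cfg, def_block, use_block, walk_all_blocks):
    """Vector cost plus spill cost for calls in the blocks considered."""
    blocks = {def_block, use_block}
    if walk_all_blocks:
        blocks |= blocks_between(cfg, def_block, use_block)
    calls = sum(cfg[b][1] for b in blocks)
    return VECTOR_COST + SPILL_COST_PER_CALL * calls


old = vector_cost(CFG, "entry", "baz", walk_all_blocks=False)
new = vector_cost(CFG, "entry", "baz", walk_all_blocks=True)
print(old < SCALAR_COST, new < SCALAR_COST)  # prints "True False"
```

Looking only at `entry` and `baz`, vectorization appears profitable; once the calls in `foo` and `bar` are counted, the spill cost outweighs the savings. Summing calls over all intermediate blocks is a deliberately pessimistic choice for this sketch; a real cost model could weigh paths differently.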








- To address the issue, we first proposed a patch that modified the SLP vectorizer to walk through all basic blocks when analyzing cost.
- Promising results: execution time dropped by 9.92% in 544.nab\_r.
- But with a major drawback: compilation time increased by 6.9% in 502.gcc\_r.
- Following discussions with the community, Alexey Bataev (SLP Vectorizer code owner) proposed and landed a refined solution.

#### RVA22U64\_V SPEC exec time, O3+LTO, SLP fix












## IPRA (Inter-Procedural Register Allocation) Support




- Function calls often spill more registers than necessary (a spill stores a register to the stack) → wasted cycles on every call.
- IPRA: caller/callee register use is tracked across functions, eliminating unnecessary save/restore sequences.

#### **What IPRA Brings**



Before:

```
foo:
    addi    sp, sp, -32
    sd      ra, 24(sp)
    sd      s0, 16(sp)      # saved even if not really live
    sd      s1, 8(sp)       # saved even if not really live
    ...
    ld      s1, 8(sp)       # restored even if not really live
    ld      s0, 16(sp)      # restored even if not really live
    ld      ra, 24(sp)
    addi    sp, sp, 32
    ret
```

After:

```
foo:
    addi    sp, sp, -8      # only what is truly needed
    sd      ra, 0(sp)
    ...
    ld      ra, 0(sp)
    addi    sp, sp, 8
    ret
```
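The idea behind the savings can be sketched as a set computation: a register only needs to be preserved around a call if it is live across the call and the callee (or anything it calls) may overwrite it. The code below is a simplified illustration, not LLVM's implementation; the function names, register sets, and call graph are hypothetical.

```python
# Hypothetical per-function data: registers each function writes directly,
# and which functions it calls.
WRITES = {"leaf": {"t0", "a0"}, "mid": {"t1"}, "top": {"s0", "s1"}}
CALLS = {"leaf": [], "mid": ["leaf"], "top": ["mid"]}

# Subset of caller-saved registers, enough for this example.
CALLER_SAVED = {"t0", "t1", "t2", "a0", "a1"}


def clobbered(fn, seen=None):
    """Registers that fn, or any function it transitively calls, may overwrite."""
    seen = set() if seen is None else seen
    if fn in seen:          # guard against recursion in the call graph
        return set()
    seen.add(fn)
    regs = set(WRITES[fn])
    for callee in CALLS[fn]:
        regs |= clobbered(callee, seen)
    return regs


def regs_to_save(live_across_call, callee, use_ipra):
    """Registers the caller must preserve around a call to callee."""
    if use_ipra:
        return live_across_call & clobbered(callee)
    # Without inter-procedural information, assume the worst case.
    return live_across_call & CALLER_SAVED


live = {"t0", "t1", "t2"}
print(sorted(regs_to_save(live, "mid", use_ipra=False)))  # prints ['t0', 't1', 't2']
print(sorted(regs_to_save(live, "mid", use_ipra=True)))   # prints ['t0', 't1']
```

Here `t2` is live across the call but never touched by `mid` or `leaf`, so with inter-procedural information its save/restore pair can be dropped.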

#### Impact of IPRA






- Reduction in register pressure, shorter prologue/epilogue code.
- SPEC benchmarks showed measurable improvements (small but consistent).
- Unfortunately, IPRA cannot be enabled by default because of a bug (described in issue 119556); the bug does not affect the SPEC benchmarks, however.

#### **RVA22U64 SPEC exec time, O3+LTO+IPRA**










#### RVA22U64\_V SPEC exec time, O3+LTO+IPRA












## Conclusions

#### **Faster RISC-V Today**






- Our contributions:
  - Scheduling: largest wins, especially for scalar-heavy code.
  - Vectorization: enabled smarter spilling cost calculations.
  - IPRA: smaller but consistent improvements across workloads.
- Almost all changes are upstream, benefiting everyone. Still under review:
  - https://github.com/llvm/llvm-project/pull/150618
  - https://github.com/llvm/llvm-project/pull/150644
  - https://github.com/llvm/llvm-project/pull/152557
  - https://github.com/llvm/llvm-project/pull/152737
  - https://github.com/llvm/llvm-project/pull/152738

#### What We Learned Along the Way






- Scheduling is critical for performance.
  - No scheduling model → LLVM pessimizes the final code.
  - We should likely adopt some scheduling model as default, like other targets do.
  - Should we make the X60 scheduling model default for in-order RISC-V processors?
- Many contributions don't have immediate benchmark impact.
- Vectorization needs careful tuning to avoid regressions.



## Did we close the performance gap between LLVM and the GCC compiler?






- Note that this is not a direct apples-to-apples comparison.
- The code was compiled with the same RISC-V extensions enabled.
- But GCC doesn't have X60-specific scheduling latencies, while LLVM does.
- Still useful to show relative progress and identify where LLVM has caught up.

#### RVA22U64 SPEC execution time, GCC vs LLVM





#### RVA22U64\_V SPEC execution time, GCC vs LLVM





#### Thanks!



- This work at Igalia was made possible thanks to support from RISE, under Project RP009.
- Thanks to my Igalia colleagues for discussions and feedback.
- And to the LLVM RISC-V community for reviews and for getting patches upstream quickly.

#### RISC-V @ Igalia



- Accidental Dataflow Analysis: Extending the RISC-V VL Optimizer @ 2025 EuroLLVM by Luke Lau: <a href="https://www.youtube.com/watch?v=bkOwPr36SrQ">https://www.youtube.com/watch?v=bkOwPr36SrQ</a>
- Improvements to RISC-V Vector code generation in LLVM @ 2025 RISC-V Summit Europe by Luke Lau and Alex Bradbury: <a href="https://www.youtube.com/watch?v=0NjugW7FF48">https://www.youtube.com/watch?v=0NjugW7FF48</a>
- RISC-V nightly performance testing of top-of-tree GCC and clang: <a href="https://cc-perf.igalia.com/">https://cc-perf.igalia.com/</a>

























































































