Note: This is an LLM-translated version from Simplified Chinese. Please be aware of potential translation errors.

Introduction

Modern compilers can infer variable values in program code to eliminate certain computation instructions and branches, thereby reducing runtime overhead of compiled artifacts. With these optimizations, programmers can typically obtain well-optimized compiled output while maintaining code readability, without manually specializing each variable.

Compilers can infer a variable’s value or range through branch conditions or assertions. Here’s a simple example:

if (a > 10) {
  if (a > 5) return 1;
  return 0;
}
return 1;

Clearly, when the program’s control flow enters the true branch of the first if statement, a > 10 holds, which means a > 5 must also hold, so the nested if statement can be eliminated and that branch reduced to return 1. Since the fall-through path also returns 1, the whole fragment can be optimized to a single return 1.

Similarly, programmers can pass information to the compiler through assume statements:

assume(a > 10);
if (a > 5) return 1;
return 0;

In this example, the compiler can learn from the assume statement that a>10, and thus determine that a>5 holds. Therefore, this program can also be optimized to simply return 1.
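
In real C or C++ code, such an assumption can be expressed with a compiler builtin. Below is a minimal sketch using Clang’s __builtin_assume (C++23 also offers the standard [[assume(expr)]] attribute); the function name is just for illustration:

int f(int a) {
  __builtin_assume(a > 10); // tells Clang that a > 10 always holds here
  if (a > 5)                // the compiler can prove this is always true...
    return 1;
  return 0;                 // ...and fold the whole function to "return 1"
}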

An intuitive assumption would be that the more information a compiler has, the more efficient the compiled output it can produce. However, researchers have discovered that this isn’t always the case (Refined Input, Degraded Output: The Counterintuitive World of Compiler Behavior, PLDI 2024).

Due to unexpected interactions between compiler components and phase ordering issues, sometimes more information leads to worse optimization.

This blog post will start with a simple issue in LLVM to explore the implementation of IPSCCP (Interprocedural Sparse Conditional Constant Propagation) in LLVM, attempting to understand how modern compilers utilize inferable constant information in code, and the considerations and trade-offs made in practical engineering.

Background

The two examples in the introduction presented the compiler’s conditional constant propagation optimization in pseudocode. To explore LLVM’s specific optimization implementation more deeply, we need to shift our focus to its intermediate representation: LLVM IR. Here’s an example of LLVM IR:

; Code Snippet #1
@a1 = external global i32

define i32 @src(i32 %L) {
BB:
  store i32 %L, ptr @a1, align 4
  %cmp = icmp eq i32 %L, 10
  call void @llvm.assume(i1 %cmp)
  ret i32 %L
}

In the above example, the function src receives parameter %L and performs the following two operations:

  1. Writes %L to the memory location pointed to by @a1
  2. Returns the value %L

Additionally, the compiler is provided with extra information: %L is first compared for equality with the constant 10 (icmp eq operation), and then the code informs the compiler through the @llvm.assume function that this comparison result is always true. Clearly, we can infer that the value of %L is the constant 10. Therefore, %L in both the store statement and the ret statement should be replaced with this constant.

However, the compiler’s output doesn’t meet our expectations:

; clang-trunk -O3 generates:
define noundef i32 @src(i32 %L) local_unnamed_addr #0 {
BB:
  store i32 %L, ptr @a1, align 4 ; %L is not substituted here!
  %cmp = icmp eq i32 %L, 10
  tail call void @llvm.assume(i1 %cmp)
  ret i32 10 ; %L is substituted as expected.
}

The compiler successfully replaced %L in the ret statement with the constant 10, but %L in the store statement was not replaced. To further investigate whether LLVM can infer the value of %L before the assume statement, we can try rewriting the IR above:

; Code Snippet #2
@a1 = external global i32
define i32 @src(i32 %L) {
BB:
  %H = add i32 %L, 1
  store i32 %H, ptr @a1, align 4
  %cmp = icmp eq i32 %L, 10
  call void @llvm.assume(i1 %cmp)
  ret i32 %L
}

In this new example, we use an add instruction to add 1 to %L before performing the store operation. Surprisingly, the compiler performs exactly the optimization we expected on this code:

define noundef i32 @src(i32 %L) local_unnamed_addr #0 {
BB:
  store i32 11, ptr @a1, align 4 ; %H is substituted by 11!
  %cmp = icmp eq i32 %L, 10
  tail call void @llvm.assume(i1 %cmp)
  ret i32 10 ; %L is substituted as expected.
}

In the code above, the stored operand %H is correctly replaced with the constant 11. We therefore have good reason to believe that the compiler’s behavior on the initial code snippet is a missed-optimization bug.

The above examples are simplified forms of Issue #134540 and Issue #134992 from the LLVM repository, with the original code found in libstdc++ (see the original issue descriptions). Resolving this issue could potentially improve the execution efficiency of compiled artifacts and introduce new optimization opportunities for various foundational system software.

Problem Analysis

In the previous section, we presented a simple example where LLVM’s optimizer didn’t behave as expected. By running LLVM’s opt tool with the -debug option, we can observe the IR transformations performed by each optimization pass. The optimizations involving %L that we care about are concentrated in two passes: IPSCCP (Interprocedural Sparse Conditional Constant Propagation) and InstCombine.

  1. IPSCCP Pass

    According to the official LLVM documentation, this pass is a variant of SCCP (Sparse Conditional Constant Propagation) extended with interprocedural information propagation. SCCP implements two main optimizations: (1) proving that a variable holds a constant value and replacing it with that constant; (2) proving that a conditional branch is unconditional. Both transformations simplify the IR so that the compiled artifact performs less computation and executes fewer branches at runtime.

  2. InstCombine Pass

    One of this pass’s functions is to examine the inferred value ranges (and known bits) of instruction operands and attempt to simplify the related instructions using that information. For example, for the instruction add %x, 1, if we can infer that the lowest bit of %x is 0, the instruction can be simplified to the cheaper or %x, 1, since adding 1 cannot generate a carry (a short sketch of this idea follows below). Likewise, if %x is inferred to be a constant, the add instruction is eliminated outright and its result replaced with that constant.
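
To make the bit-tracking idea in item 2 concrete, here is a small self-contained sketch. It is not LLVM’s actual KnownBits class, just an illustration of why add %x, 1 can be rewritten as or %x, 1 once the lowest bit of %x is known to be 0:

#include <cassert>
#include <cstdint>

// Toy "known bits": a bit set in `zero` is proven to be 0, a bit set in `one`
// is proven to be 1 (LLVM's real KnownBits class tracks the same two masks).
struct KnownBitsSketch {
  uint32_t zero = 0;
  uint32_t one = 0;
  bool lowBitKnownZero() const { return zero & 1u; }
};

// If the low bit of x is known to be zero, adding 1 cannot produce a carry,
// so "x + 1" and "x | 1" compute the same value.
uint32_t addOneUsingOr(uint32_t x, const KnownBitsSketch &known) {
  assert(known.lowBitKnownZero());
  return x | 1u;
}

int main() {
  KnownBitsSketch k;
  k.zero = 1u; // suppose the analysis proved the lowest bit of x is 0
  for (uint32_t x = 0; x < 100; x += 2)
    assert(addOneUsingOr(x, k) == x + 1);
  return 0;
}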

Taking code snippet #2 from the previous section as input to the optimizer and executing optimizations in the default order of Source->IPSCCP->InstCombine, we can obtain the following step-by-step results (some intermediate optimization steps are omitted):

  1. Source->IPSCCP Pass
define i32 @src(i32 %L) {
BB:
  %H = add i32 %L, 1
  store i32 %H, ptr @a1, align 4
  %cmp = icmp eq i32 %L, 10
  call void @llvm.assume(i1 %cmp)
  ret i32 10
}
  2. Source->IPSCCP Pass->InstCombine Pass
define i32 @src(i32 %L) {
BB:
  store i32 11, ptr @a1, align 4
  %cmp = icmp eq i32 %L, 10
  tail call void @llvm.assume(i1 %cmp)
  ret i32 10
}

As we can see, the IPSCCP Pass replaced %L in the ret instruction but left the operand %L of the add instruction untouched. Folding the add away and substituting the constant 11 for %H in the store is completed later by the InstCombine Pass.

Further analysis of the InstCombine Pass shows that it uses the assumptions gathered so far to simplify the operands of binary arithmetic operations (add, sub, and so on) and a few other kinds of instructions, which is why code snippet #2 is optimized correctly. However, for the store statement in code snippet #1 (as well as instructions such as zext that don’t appear in these examples), it never checks whether an operand is provably constant, so the operand %L of the store in code snippet #1 is left unreplaced. In theory, adding the following constant check to the visitStoreInst function of the InstCombine Pass could solve the problem in code snippet #1:

Instruction *InstCombinerImpl::visitStoreInst(StoreInst &SI) {
  // Existing code... (Val refers to the value operand of the store,
  // defined in the existing code)
  if (!isa<Constant>(Val)) {
    // Ask ValueTracking what is known about every bit of the stored value,
    // taking applicable llvm.assume information into account.
    KnownBits Known(Val->getType()->getScalarSizeInBits());
    computeKnownBits(Val, Known, 0, &SI);
    if (Known.isConstant()) {
      // Every bit is known: materialize the constant and store that instead.
      Constant *C = ConstantInt::get(Val->getType(), Known.getConstant());
      return replaceOperand(SI, 0, C);
    }
  }
  // Existing code...
}

However, the InstCombine Pass is not a “fallback” for the IPSCCP Pass. Checking every instruction operand for constancy is not its responsibility, and because this mechanism tracks every bit of a value at fine granularity, using it for blanket constant substitution could introduce non-negligible compile-time overhead. Such a modification would be very unlikely to be accepted upstream. Therefore, we need to turn our attention to the root cause of the problem: the IPSCCP Pass.

Design, Implementation, and Trade-offs of IPSCCP

This blog post won’t exhaustively introduce the architecture of IPSCCP. Instead, using the issues above as an entry point, it focuses on the parts of IPSCCP that handle branch conditions and assumptions, explaining how programmer-provided assume information guides optimization and how the problem described in the previous sections arises. To motivate the variable renaming and the “Rename Stack” that IPSCCP introduces, this section first presents the concept of predicate scope and then works toward the root cause of the missed optimization.

Predicate Scope

Formally, a predicate’s scope consists of the program points dominated by the predicate: for an assume, the instructions after it in its basic block together with the blocks that block dominates; for a branch condition, the points dominated by the corresponding successor edge. More intuitively, we may only treat an assume’s condition, or a branch condition (or its negation), as true at points the program can only reach after “executing” the assume statement or entering the corresponding branch of the conditional.

Let’s explain the concept of predicate scope through an intuitive example:

; Code Snippet #3
define void @src(i32 %L) {
BB:
  %cond = icmp ult i32 %L, 5
  br i1 %cond, label %BB1, label %BB2
BB1:
  %cmp1 = icmp ult i32 %L, 10
  call void @use(i1 %cmp1)
  ret void
BB2:
  %cmp2 = icmp ult i32 %L, 10
  call void @use(i1 %cmp2)
  ret void
}

The above is a simple two-way branch. In basic block BB, the code first checks whether %L is less than 5; if so, it jumps to basic block BB1, otherwise to BB2. The code in BB1 and BB2 is identical: both check whether %L is less than 10 and pass the result to the external function @use. Clearly, if %L < 5 holds, then %L < 10 must also hold, so %cmp1 in BB1 can be folded directly to true, while %cmp2 in BB2 must be preserved.
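
For reference, a rough source-level counterpart of code snippet #3 might look like the following (use is a hypothetical external function, like @use in the IR):

extern void use(bool); // hypothetical external function, like @use in the IR

void src(unsigned L) {
  if (L < 5) {
    use(L < 10); // here L is in [0, 5): always true, can be folded to use(true)
  } else {
    use(L < 10); // here L is in [5, 2^32): the comparison must be kept
  }
}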

In this example, LLVM can obtain the correct optimization result:

define void @src(i32 %L) {
BB:
  %cond = icmp ult i32 %L, 5
  br i1 %cond, label %BB1, label %BB2
BB1: ; preds = %BB
  call void @use(i1 true) ; substituted by "true"!
  ret void
BB2: ; preds = %BB
  %cmp2 = icmp ult i32 %L, 10
  call void @use(i1 %cmp2)
  ret void
}

Therefore, for the same parameter %L, the inferred range may differ across basic blocks (in the example above, the unsigned range in BB1 is [0, 5), while in BB2 it is [5, 2^32)). This makes it harder to maintain a value range per variable and to replace, in a targeted way, only those uses whose value can actually be inferred. To solve this problem more elegantly, IPSCCP introduces a variable renaming mechanism.

Variable Renaming

For each predicate scope, IPSCCP generates a copy of the variable and uses that copy to replace the variable in all users inside the scope, that is, the users for which the same value range holds. Let’s illustrate this with code snippet #3:

; Code Snippet #3.1
define void @src(i32 %L) {
BB:
  %cond = icmp ult i32 %L, 5
  %L.0 = call i32 @llvm.ssa.copy.i32(i32 %L)
  %L.1 = call i32 @llvm.ssa.copy.i32(i32 %L)
  br i1 %cond, label %BB1, label %BB2
BB1:
  %cmp1 = icmp ult i32 %L.0, 10
  call void @use(i1 %cmp1)
  ret void
BB2:
  %cmp2 = icmp ult i32 %L.1, 10
  call void @use(i1 %cmp2)
  ret void
}

Code snippet #3.1 shows the code after renaming %L. The @llvm.ssa.copy intrinsic generates two copies of %L, %L.0 and %L.1, which replace the original uses in basic blocks BB1 and BB2 respectively. We can now easily attach a value range to each copy (the comparisons are unsigned, so the ranges below are unsigned ranges):

Variable Name   Value Range
%L.0            [0, 5)
%L.1            [5, 2^32)

As we can see, variable renaming provides convenience for analyzing variable value ranges and substituting variables in basic blocks. For nested code such as multi-level conditional statements, a variable may be renamed multiple times and may generate several copies. IPSCCP uses a stack to maintain variable value ranges. Each element on the stack represents a renaming of the variable, and each renaming brings a new constraint to the variable (imagine nested if conditional statements on a variable). Therefore, we only need to merge all constraints on the stack using an AND operation to obtain the current variable’s value range. When we leave the scope of a certain constraint, we simply pop that element from the stack.
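
To make the “AND of all constraints on the stack” idea concrete, here is a minimal self-contained sketch (these are not IPSCCP’s actual data structures): each stack entry models one predicate’s unsigned range constraint, and the variable’s current range is the intersection of everything still on the stack.

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct URange {   // half-open unsigned range [lo, hi)
  uint64_t lo, hi;
};

struct RenameStackSketch {
  std::vector<URange> constraints;

  void push(URange r) { constraints.push_back(r); } // enter a predicate's scope
  void pop() { constraints.pop_back(); }            // leave that scope

  URange current() const {                          // intersect all constraints
    URange r{0, uint64_t(1) << 32};                 // full unsigned i32 range
    for (const URange &c : constraints) {
      r.lo = std::max(r.lo, c.lo);
      r.hi = std::min(r.hi, c.hi);
    }
    return r;
  }
};

int main() {
  RenameStackSketch s;
  s.push({0, 5});                 // e.g. an outer "if (L < 5)"
  s.push({2, uint64_t(1) << 32}); // e.g. a nested "if (L >= 2)"
  URange r = s.current();
  assert(r.lo == 2 && r.hi == 5); // inside both conditions, L is in [2, 5)
  s.pop();                        // leaving the inner condition's scope
  r = s.current();
  assert(r.lo == 0 && r.hi == 5); // back to [0, 5)
  return 0;
}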

Rename Stack and Operand Substitution Implementation

To efficiently implement variable renaming and substitution of corresponding user operands (without scanning the entire code), IPSCCP divides code analysis into three main phases (some iteration details are omitted):

  1. Build Phase: Collect all predicate information associated with variables (including Branch statements, Switch statements, and Assume statements);
  2. Rename Phase: For each variable with optimization opportunities, traverse its uses while simultaneously completing variable renaming and substitution of all user operands;
  3. Solve Phase: Merge possible value ranges of all variables and simplify operands (such as replacing them with constants) using the collected information.

Among these, both the Build Phase and Solve Phase are relatively standardized and not closely related to the issue mentioned at the beginning of this article. Therefore, we will mainly introduce the implementation of the Rename Phase. To better demonstrate how the Rename Stack works, we’ll introduce the final code snippet of this article.

; Code Snippet #4
define void @basic(i32 %v) {
  %a1 = icmp ult i32 %v, 10
  call void @llvm.assume(i1 %a1)
  %a2 = icmp ugt i32 %v, 5
  call void @llvm.assume(i1 %a2)
  %c1 = icmp ult i32 %v, 10
  call void @use(i1 %c1)
  ret void
}

In this snippet, we restrict the range of variable %v to (5, 10) through two assume statements, after which variable %c1 can be inferred as true. Below, this article will step by step introduce how IPSCCP completes the optimization of this code.

Build Phase

In this phase, IPSCCP traverses all assume statements to find variables that need renaming. For each variable, it finds all use points and sorts them according to the DFS order of basic blocks. Note that assume statements are also treated as indirect use points (stored in the PredicateInfo field in the ValueDFS data structure). For code snippet #4, the sorted user sequence produced is as follows:

OrderedUses:
[0] Use: %a1
[1] Predicate: %a1, OriginalOp: %v, RenamedOp: null
[2] Use: %a2
[3] Predicate: %a2, OriginalOp: %v, RenamedOp: null
[4] Use: %c1

RenameStack:
(empty)
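
To make this ordering concrete, here is a small self-contained sketch (not LLVM’s actual ValueDFS machinery). Real uses of %v and the assume “pseudo-uses” are collected into one list and sorted by program position; for a single basic block a plain instruction index is enough, while the real implementation sorts by the DFS numbers of basic blocks first:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct OrderedUseSketch {
  int instIndex;     // position in the instruction stream of the block
  std::string name;  // the using instruction
  bool isPredicate;  // true if this entry models an assume's predicate
};

int main() {
  // The uses of %v from code snippet #4, deliberately listed out of order.
  std::vector<OrderedUseSketch> uses = {
      {5, "%c1", false},        // %c1 = icmp ult i32 %v, 10
      {2, "assume(%a1)", true}, // predicate pseudo-use of %v
      {1, "%a1", false},        // %a1 = icmp ult i32 %v, 10
      {4, "assume(%a2)", true}, // predicate pseudo-use of %v
      {3, "%a2", false},        // %a2 = icmp ugt i32 %v, 5
  };
  std::sort(uses.begin(), uses.end(),
            [](const OrderedUseSketch &a, const OrderedUseSketch &b) {
              return a.instIndex < b.instIndex;
            });
  // Prints the same order as the OrderedUses dump above:
  // %a1, predicate of %a1, %a2, predicate of %a2, %c1.
  for (const auto &u : uses)
    std::cout << u.name << (u.isPredicate ? " (predicate)" : "") << "\n";
  return 0;
}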

Rename Phase

After constructing the user sequence, IPSCCP enters the Rename Phase, which traverses this sequence in order while maintaining the rename stack (RenameStack) for the current variable.

Step 1
OrderedUses:
* [0] Use: %a1
[1] Predicate: %a1, OriginalOp: %v, RenamedOp: null
[2] Use: %a2
[3] Predicate: %a2, OriginalOp: %v, RenamedOp: null
[4] Use: %c1

RenameStack:
(empty)

IPSCCP first visits use %a1; since the RenameStack is empty, there is nothing to rename and the use is skipped.

Step 2
OrderedUses:
[0] Use: %a1
* [1] Predicate: %a1, OriginalOp: %v, RenamedOp: null
[2] Use: %a2
[3] Predicate: %a2, OriginalOp: %v, RenamedOp: null
[4] Use: %c1

RenameStack:
-> [0]: Predicate: %a1, OriginalOp: %v, RenamedOp: null

IPSCCP visits the next item, the Predicate entry for %a1, and pushes it onto the stack (no copy is materialized yet).

Step 3
OrderedUses:
[0] Use: %a1
[1] Predicate: %a1, OriginalOp: %v, RenamedOp: %v
* [2] Use: %a2
[3] Predicate: %a2, OriginalOp: %v, RenamedOp: null
[4] Use: %c1

RenameStack:
[0]: Predicate: %a1, OriginalOp: %v, RenamedOp: %v

IPSCCP visits use %a2. The rename stack is now non-empty, and the predicate on top of it has not been materialized yet (materialization is done lazily, which avoids paying the renaming cost for predicates whose scope contains no uses). It is materialized now: a copy %1 of %v is created, the stack entry’s RenamedOp is set to the original variable %v (the operand of %a1’s icmp, which is not rewritten), and the operand of use %a2 that matches OriginalOp %v is replaced with the renamed variable %1.

At this point, the corresponding IR instruction sequence transforms to:

; Code Snippet #4.1
define void @basic(i32 %v) {
  %a1 = icmp ult i32 %v, 10
  call void @llvm.assume(i1 %a1)
  -> %1 = call i32 @llvm.ssa.copy.i32(i32 %v)
  %a2 = icmp ugt i32 %1, 5
  call void @llvm.assume(i1 %a2)
  %c1 = icmp ult i32 %v, 10
  call void @use(i1 %c1)
  ret void
}

Note that only the operand of the %a2 instruction is replaced here; the operand %v of the %a1 instruction is not, which matches the fact that the RenamedOp of Predicate %a1 on the RenameStack is %v. RenamedOp records which operand of that Predicate corresponds to the variable we’re interested in. In the Solve Phase, RenamedOp lets us easily locate that variable in each Predicate, regardless of whether it has been renamed.
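
As a reading aid for the dumps in the following steps, here is a rough sketch (a hypothetical type, not PredicateInfo’s real data structure) of what one RenameStack entry carries. The “Def” shown in later dumps is the ssa.copy instruction materialized for that entry:

struct RenameStackEntrySketch {
  const char *Predicate;  // the comparison feeding the assume, e.g. "%a1"
  const char *OriginalOp; // the variable being tracked, e.g. "%v"
  const char *RenamedOp;  // which operand of Predicate refers to that variable
                          // after any renaming: "%v", "%1", ...
  const char *Def;        // the materialized ssa.copy ("%1", "%2"), or null
                          // while materialization is still deferred
};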

Step 4
OrderedUses:
[0] Use: %a1
[1] Predicate: %a1, OriginalOp: %v, RenamedOp: %v
[2] Use: %a2
* [3] Predicate: %a2, OriginalOp: %v, RenamedOp: null
[4] Use: %c1

RenameStack:
[0]: Predicate: %a1, OriginalOp: %v, RenamedOp: %v (Def: %1)
-> [1]: Predicate: %a2, OriginalOp: %v, RenamedOp: null

As in Step 2, IPSCCP visits the Predicate entry (this time for %a2) and pushes it onto the stack.

Step 5
OrderedUses:
[0] Use: %a1
[1] Predicate: %a1, OriginalOp: %v, RenamedOp: %v
[2] Use: %a2
[3] Predicate: %a2, OriginalOp: %v, RenamedOp: null
* [4] Use: %c1

RenameStack:
[0]: Predicate: %a1, OriginalOp: %v, RenamedOp: %v (Def: %1)
[1]: Predicate: %a2, OriginalOp: %v, RenamedOp: %1 (Def: %2)

IPSCCP visits use %c1. The rename stack is again non-empty. Starting from the earliest entry that has not been materialized yet, IPSCCP materializes every item on the RenameStack, inserting the ssa.copy calls right after their corresponding assume statements.

; Code Snippet #4.2
define void @basic(i32 %v) {
  %a1 = icmp ult i32 %v, 10
  call void @llvm.assume(i1 %a1)
  %1 = call i32 @llvm.ssa.copy.i32(i32 %v)
  %a2 = icmp ugt i32 %1, 5
  call void @llvm.assume(i1 %a2)
  -> %2 = call i32 @llvm.ssa.copy.i32(i32 %1)
  %c1 = icmp ult i32 %2, 10
  call void @use(i1 %c1)
  ret void
}

Again, only the operand of the %c1 instruction is replaced here. In addition, the RenamedOp of Predicate %a2 in the RenameStack is set to %1, which matches the operand substitution that actually took place in that icmp instruction.

Summary

Through the rename stack, IPSCCP’s Rename Phase elegantly keeps track of which operand of each Predicate’s icmp (or other comparison instruction) corresponds to the variable we care about. Renaming information propagates up the stack: when a Predicate’s operand has been renamed because of a Predicate deeper in the stack, the information on the stack tells it exactly which value it was renamed to.

Solve Phase

After collecting Predicate information for all variables, IPSCCP iteratively finds the value range of each variable in each basic block and attempts to perform constant substitution. This step is unrelated to the missed-optimization bug mentioned earlier in this article, so it is not within the scope of this article’s discussion.

Trade-offs

Let’s return to the first code snippet:

; Code Snippet #1
@a1 = external global i32

define i32 @src(i32 %L) {
BB:
  store i32 %L, ptr @a1, align 4
  %cmp = icmp eq i32 %L, 10
  call void @llvm.assume(i1 %cmp)
  ret i32 %L
}

By analyzing the concrete implementation of IPSCCP, we can finally see how the missed optimization arises: the Rename Phase visits a variable’s uses in program order, so when it reaches the store the assume’s predicate has not been pushed yet, the RenameStack is still empty, and the use of %L in the store is never renamed.

In fact, this bug is more of an engineering trade-off: if we wanted an assume to affect instructions that appear before it in the same basic block, the RenameStack and the way renaming information propagates would have to be redesigned. Hoisting the variable renaming and its ssa.copy instruction above the assume and the icmp (or other condition-computing instruction) that feeds it would mean the operands of those condition-computing instructions are themselves rewritten, and we could then no longer use RenamedOp to simply track which operand of the condition corresponds to the variable we’re interested in.

Of course, we might introduce new mechanisms, such as guaranteeing that an original variable is renamed at most once per basic block, or turning RenamedOp into a set, to work around this problem, but that would inevitably add complexity to the algorithm’s design, and whether such a change would ultimately be accepted by the community remains uncertain.

Conclusion

Modern operating systems, compilers, and other foundational system software are growing more complex at an alarming rate. In the example discussed here, LLVM’s original implementation is elegant, simple, and handles the vast majority of cases correctly. In certain cases, however, the designers’ assumption (that instructions appearing earlier in a basic block don’t need the results of later assume statements) gets in the way of optimization, and lifting that limitation would break the intuitive structure of the original algorithm and could quietly create pitfalls for the compiler and its future maintainers.

When better performance means more complex and harder-to-maintain designs, how should we make trade-offs?