Deconstructing the Avoidance of Uninitialized Reads of Auto Variables

Document #: P2754R0
Date: 2023-01-24
Project: Programming Language C++
Audience: isocpp-ext
Reply-to: Jake Fevold

Abstract

Inadvertently reading uninitialized values of automatic variables on the program stack is a common source of security holes. An initial suggestion was proposed in [P2723R0] as “an opening bid” to get the discussion started. As a result, a variety of thoughtful proposals have been offered regarding how the C++ language might be altered to minimize or mitigate such occurrences. These suggestions have varying tradeoffs, especially with respect to safety versus correctness and the extent to which initial intent is preserved. We attempt to survey the various alternatives that have been proposed on the WG21 Reflector and objectively contrast their respective properties in the hope of facilitating a productive exchange of ideas and opinions at the February 2023 Standards meeting and in the future.

1 Introduction

This paper is meant to drive fruitful discussion of the topics described in JF Bastien’s paper, “Zero-initialize objects of automatic storage duration” [P2723R0]. The stated goal of Bastien’s paper is to eliminate, where possible, security problems stemming from an indeterminate value held by variables of nonclass type of automatic storage duration. Nonclass types are arithmetic type, pointer-to-object type, pointer-to-function type, pointer-to-data-member type, pointer-to-member-function type, enumeration type, std::nullptr_t (since C++11), and POD (before C++11), as well as array (of any of these types).

A similar problem exists with uninitialized dynamic memory (such as that returned from malloc, operator new, and similar allocation utilities), which is addressed neither in Bastien’s paper nor here.

2 Open Questions of Scope

Considering the various suggested mitigation strategies for security holes due to typical nonclass variables raised several other questions regarding scope. That is, should a proposed solution to address scalar and pointer types also apply to:

These considerations of scope seem mostly orthogonal to choosing a basic direction and are not discussed further here.

3 Properties of a Solution

As part of curating the information provided by proposals and discussions on the WG21 Reflector, several recurring questions were used to nail down the salient properties and features of a given proposal. A well-considered, though non-exhaustive, curation of such questions is presented here.

4 Relevant Code Examples

The examples in this section serve to elucidate various common security or correctness defects that a viable proposal might address.

4.1 Reading an Uninitialized Automatic Variable Directly

void f1()
{
    int p;
    int q = p + 1;  // UB
}

The C++ code snippet above is clearly and unconditionally incorrect today.

4.2 Reading an Uninitialized Automatic Variable Conditionally

extern bool b;  // assumed to be defined elsewhere
void f2()
{
    int y;
    int z = b ? y + 1 : 0;
}

If b is never true, then this code is technically correct today, but the conditional then serves no purpose, and int z = 0; would be an unconditional improvement.

4.3 Passing an Uninitialized Automatic Variable to Another Function by Value

void g3(int);
void f3()
{
    int x;
    g3(x);  // likely a bug
}

The C++ code snippet above is likely a bug. If the author is certain g3 doesn’t use the value of the argument, then a literal would suffice.

4.4 Passing a Potentially Uninitialized Automatic Variable to Another Function by Value

extern bool c;  // assumed to be defined elsewhere
void g4(int);
void f4()
{
    int s;
    if (c) s = 0;
    g4(s);  // likely a bug
}

If g4 uses the value of the parameter only when c is true, then this example code is correct today. Because g4 doesn’t take the value of c as a parameter, the example code likely has a bug.

4.5 Passing an Uninitialized Automatic Variable to Another Function by Reference or Pointer

void g5(int*);
void f5()
{
    int t;
    g5(&t);  // possibly a bug
}

If t is strictly an output of g5, this example code is correct today. If g5 compares the address of t but does not dereference the pointer, this example code is correct today. Compilers cannot currently reason about the contract of g5 when the definition of g5 is in a different translation unit.
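The first case above, in which t is strictly an output, can be sketched as follows. The body of g5 is a hypothetical definition chosen for illustration (the original only declares it), and f5 is assumed here to return t so that the result is observable.

```cpp
#include <cassert>

// Hypothetical definition: g5 treats its argument purely as an output
// and never reads the pointee before writing to it.
void g5(int* out)
{
    *out = 42;
}

// Assumption: f5 returns the value so the effect can be observed
// (the original f5 returns void).
int f5()
{
    int t;      // intentionally left uninitialized
    g5(&t);     // t is written before any read
    return t;   // well defined: reads the value stored by g5
}

// Small self-check; calling it from main runs the example.
void check_f5()
{
    assert(f5() == 42);
}
```

Nothing in the declaration void g5(int*) distinguishes this g5 from one that reads *out first, which is why a compiler that sees only the declaration cannot judge the call.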

4.6 Common Idioms Delaying Initialization for Performance

#include <cstdio>
#include <memory_resource>
#include <vector>

void f6()
{
    char buffer[1000];
    std::pmr::monotonic_buffer_resource a(buffer, sizeof buffer);
    std::pmr::vector<char> v(&a);

    char buffer2[1000];
    std::snprintf(buffer2, sizeof buffer2, "cstring");
}

Idioms like these are safe and efficient and are not a common source of security concerns.
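A runnable sketch of the second idiom, in which a stack buffer is deliberately left uninitialized because a formatting call writes it before anything reads it. The function name format_id and the "id-%d" format are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Formats an integer into a stack buffer and returns the result.
std::string format_id(int id)
{
    char buffer[32];  // deliberately uninitialized: snprintf writes it first
    int n = std::snprintf(buffer, sizeof buffer, "id-%d", id);
    if (n < 0) {
        return std::string();  // encoding error: nothing usable was written
    }
    std::size_t len = static_cast<std::size_t>(n);
    if (len >= sizeof buffer) {
        len = sizeof buffer - 1;  // output was truncated to fit the buffer
    }
    return std::string(buffer, len);  // reads only the bytes snprintf wrote
}

// Small self-check; calling it from main runs the example.
void check_format_id()
{
    assert(format_id(7) == "id-7");
}
```

Only the bytes that snprintf reported writing are ever read, so zero-filling the whole buffer up front would add work without changing the result.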

4.7 Initialization for Template Types Leaves Room for Improvement

#include <iostream>

template <typename T>
void f7()
{
    T t;
    std::cout << t;
}

For class types, t is initialized and the C++ code snippet above is correct today. For primitive types, t is uninitialized and the behavior of f7 is undefined.
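For comparison, generic code can already avoid the problem today by value-initializing instead of default-initializing. The following is a minimal sketch; make_default is a hypothetical helper and not part of any proposal discussed here.

```cpp
#include <cassert>
#include <string>

// Value-initialization (`T t{}`) runs the default constructor for class
// types that have one and zero-initializes scalar types.
template <typename T>
T make_default()
{
    T t{};
    return t;
}

// Small self-check; calling it from main runs the example.
void check_make_default()
{
    assert(make_default<int>() == 0);
    assert(make_default<double>() == 0.0);
    assert(make_default<int*>() == nullptr);
    assert(make_default<std::string>().empty());
}
```

The difference between T t; and T t{}; is a single pair of braces, which is easy to omit by accident in a template.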

5 Suggested Solutions

Each solution is evaluated for viability, backward compatibility, and expressability.

Viability is an evaluation of whether a solution is logically consistent, both internally and with respect to the existing C++ Standard. Viability is summarized for each solution as either viable, nonviable, or unclear. Viable means that a given solution is consistent. Nonviable means that a given solution is inconsistent either with itself or with other foundational rules or definitions in the Standard. Unclear means that the available information is insufficient for making a sound determination.

Backward compatibility is an evaluation of whether all existing, compiling code would continue to compile and behave as it does now if a given solution were adopted. Note that if buggy code continues to compile and behave identically, then the root security problem is unaddressed. Backward compatibility is summarized as either compatible, correct-code compatible, incompatible, or unclear. Compatible means that, if a given solution were adopted, all code that previously compiled continues to compile, with behavior differences only in the case of previously undefined behavior (UB). Correct-code compatible means that all previously correct code compiles, but some or all code that had UB would not compile. Incompatible means that some previously correct code would not compile. Unclear means that the available information is insufficient for making a sound determination.

Expressability is an evaluation of whether previously existing code would maintain its current meaning if a given solution were adopted. Currently, an uninitialized automatic nonclass variable declaration could be either an inadvertent, logical error (e.g., the original author meant to initialize but didn’t), or an intentional, delayed initialization. Expressability is summarized as either better, unchanged, worse, or unclear. Better means that previously existing code must be updated to make explicit the intent to delay initialization or correct the logical error. Unchanged means that previously existing code would be no more or less ambiguous. Worse means that previously existing code would be more ambiguous because logical error and intentionally delayed initialization are no longer the only two possibilities. Unclear means that the available information is insufficient for making a sound determination.

Note that some solutions list additional concerns that are not generally applicable to other solutions.

5.1 Always Zero-Initialize

All uninitialized automatic-storage-duration nonclass variables are initialized to a specific value. Numerical types would be initialized to zero. The value for pointer types is an open question. Major compilers offer an option to zero-initialize already.
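A sketch of how a guaranteed zero can turn a forgotten initializer into a silent logic error. The function lowest_value is hypothetical; the explicit = 0 stands in for what this option would make a bare int lowest; mean, and the author is assumed to have intended to start from the first element.

```cpp
#include <cassert>
#include <vector>

// Intended to return the smallest element of a non-empty vector.
int lowest_value(const std::vector<int>& values)
{
    int lowest = 0;  // models an implicitly zero-initialized `int lowest;`
    for (int v : values) {
        if (v < lowest) {
            lowest = v;
        }
    }
    return lowest;
}

// Small self-check; calling it from main runs the example.
void check_lowest_value()
{
    assert(lowest_value({-2, 4}) == -2);  // happens to be right
    assert(lowest_value({3, 5, 7}) == 0); // deterministic, but not an element
}
```

For {3, 5, 7} the function returns 0, a well-defined but wrong answer, and the source no longer shows whether the zero was intended.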

5.2 Zero-Initialize or Diagnose

Code having an unconditional read of an indeterminate value is diagnosed (i.e., rejected); for code with a potential read of an indeterminate value, the implementation must zero-initialize the variable and accept the code as well formed.

5.3 Force Initialization in Source

All uninitialized automatic-storage-duration nonclass variables are ill formed. Delayed initialization would require use of std::optional or a similar mechanism.
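A minimal sketch of delayed initialization expressed with std::optional (C++17). The function pick and its fallback value of -1 are hypothetical and serve only to show the shape such code might take.

```cpp
#include <cassert>
#include <optional>

// The value is known on only one branch, so std::optional carries the
// "not yet initialized" state explicitly.
int pick(bool have_input, int input)
{
    std::optional<int> value;   // empty, but never indeterminate
    if (have_input) {
        value = input * 2;      // the delayed initialization
    }
    return value.value_or(-1);  // an empty optional yields the fallback
}

// Small self-check; calling it from main runs the example.
void check_pick()
{
    assert(pick(true, 21) == 42);
    assert(pick(false, 21) == -1);
}
```

The optional stores an extra engaged flag and checks it on access, which is the cost of making the uninitialized state observable.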

5.4 Force Initialization or Explicit Delayed Initialization Intent in Source

All uninitialized automatic-storage-duration nonclass variables are ill formed unless specifically annotated. A suitable syntax must be chosen to specify when leaving a variable uninitialized is deemed necessary.

5.5 Initialize to Implementation-Defined Value, But Read Before Write Is Still Undefined Behavior

All uninitialized automatic-storage-duration nonclass variables are initialized to an implementation-defined value, but reading that value is still UB. One could, in development and testing, inject values that are likely to cause noticeable failures (e.g., signaling NaN, unaligned pointer, and so on) and, in production, inject best-guess values, such as zero for integers.
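The choice of injected values can be modeled in ordinary code. The sketch below is only an illustration under stated assumptions: BuildMode, fill_double, and fill_int are hypothetical names, the specific testing values are arbitrary, and a real implementation would select them through a compiler option, not in user code.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical build-mode switch standing in for a compiler option.
enum class BuildMode { Testing, Production };

// Hypothetical fill value for a double that has no initializer.
inline double fill_double(BuildMode mode)
{
    return mode == BuildMode::Testing
               ? std::numeric_limits<double>::signaling_NaN()  // meant to be noticed
               : 0.0;                                          // best guess
}

// Hypothetical fill value for an int that has no initializer.
inline int fill_int(BuildMode mode)
{
    return mode == BuildMode::Testing
               ? std::numeric_limits<int>::min()  // unlikely to look valid
               : 0;                               // best guess
}

// Small self-check; calling it from main runs the example.
void check_fill_values()
{
    assert(std::isnan(fill_double(BuildMode::Testing)));
    assert(fill_double(BuildMode::Production) == 0.0);
    assert(fill_int(BuildMode::Production) == 0);
}
```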

5.6 Initialize to Implementation-Defined Value, But Read Before Write Is Erroneous Behavior

All uninitialized automatic-storage-duration nonclass variables are initialized to an implementation-defined value, yet reading that value is always wrong (like UB) but still defined (unlike UB). The original term proposed for this defined but undesirable behavior is erroneous behavior (EB). Any program that contains EB is incorrect, but the behavior is implementation defined. Different projects could elect to treat EB as UB for performance, could use hostile default values for testing and development, or could use somewhat safe default values for production.

5.7 Value-Initialize Only

Remove default initialization entirely and have only value initialization. The state of initialization in C++ is already very complex, and the cost of this complexity is dubious for the level of utility it affords. The entire initialization system could be pared down to one single form of initialization that provides values in all cases. This more fundamental change addresses uninitialized-variable problems, as well as other known issues with initialization.

6 Conclusion

Inadvertently reading an uninitialized nonclass variable on the program stack is a known source of difficult-to-diagnose bugs. Moreover, these defects can lead to security holes, which compounds the problem. One possible solution, proposed in [P2723R0], is to simply zero-initialize every automatic variable. Based on that initial suggestion, several other solutions have been proposed. We began this paper by enumerating some related issues involving solution scope concerning dynamic memory, arrays, unions, and padding. After identifying several useful diagnostic questions to elicit important distinguishing properties, we then proceeded to elucidate the various manifestations of this correctness and security problem with several small code examples. Finally, we identified seven different solution approaches and evaluated them against three separate criteria (viability, backward compatibility, and expressability), the results of which are summarized below.

Section  Proposed Solution             Viability  Backward Compatibility   Expressability
5.1      Always Zero-Initialize        Viable     Compatible               Worse
5.2      Zero-Initialize or Diagnose   Unclear    Correct-Code Compatible  Unchanged
5.3      Force-Initialize in Source    Viable     Incompatible             Better
5.4      Force-Initialize or Annotate  Viable     Incompatible             Better
5.5      Default Value, Still UB       Nonviable  Compatible               Unchanged
5.6      Default Value, Erroneous      Viable     Compatible               Unchanged
5.7      Value-Initialize Only         Unclear    Unclear                  Unclear

Based on this analysis, we conclude that the baseline approach [section 5.1] of zero-initializing everything, similar to how static nonclass data is initialized, would be effective at plugging all such security holes but would give a defined meaning to what is currently UB, which in turn would make diagnosing such inadvertent mistakes more difficult moving forward.

Combining zero initialization with compile-time failure [section 5.2] has the serious drawback of some compilers accepting code that other compilers reject. Forced initialization, without [section 5.3] or with [section 5.4] annotation for intentional delayed initialization, imposes an enormous effort to modify existing code, even existing correct code. This change would encourage the cavalier use of scripts to explicitly initialize (or annotate) all previously uninitialized variables, thereby losing the intent of the original author. This strategy is likely to be met with resistance by maintainers of many existing code bases. Requiring a defined meaning and behavior for UB [section 5.5] is nonviable, and recommending such behavior, though perfectly reasonable, simply cannot be enforced on an otherwise compliant implementation.

The EB approach [section 5.6] affords almost all the advantages of the others with few drawbacks. This strategy will, however, require some thought to introduce a new kind of behavior, EB, to the C++ abstract machine. Importantly and unlike many of the other proposals, defining uninitialized memory reads as EB provides no hindrance to longer-term solutions, such as the option of eliminating default initialization entirely [section 5.7].