[x86/SLH] Add the design document for Speculative Load Hardening,

author Chandler Carruth <chandlerc@gmail.com>

Wed, 18 Jul 2018 14:05:14 +0000 (14:05 +0000)

committer Chandler Carruth <chandlerc@gmail.com>

Wed, 18 Jul 2018 14:05:14 +0000 (14:05 +0000)
diff --git a/docs/SpeculativeLoadHardening.md b/docs/SpeculativeLoadHardening.md

new file mode 100644 (file)

index 0000000..bf5c7d3
--- /dev/null
+++ b/docs/SpeculativeLoadHardening.md
@@ -0,0 +1,1099 @@
+# Speculative Load Hardening
+
+### A Spectre Variant #1 Mitigation Technique
+
+Author: Chandler Carruth - [chandlerc@google.com](mailto:chandlerc@google.com)
+
+## Problem Statement
+
+Recently, Google Project Zero and other researchers have found information leak
+vulnerabilities by exploiting speculative execution in modern CPUs. These
+exploits are currently broken down into three variants:
+* GPZ Variant #1 (a.k.a. Spectre Variant #1): Bounds check (or predicate) bypass
+* GPZ Variant #2 (a.k.a. Spectre Variant #2): Branch target injection
+* GPZ Variant #3 (a.k.a. Meltdown): Rogue data cache load
+
+For more details, see the Google Project Zero blog post and the Spectre research
+paper:
+* https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
+* https://spectreattack.com/spectre.pdf
+
+The core problem of GPZ Variant #1 is that speculative execution uses branch
+prediction to select the path of instructions speculatively executed. This path
+is speculatively executed with the available data, and may load from memory and
+leak the loaded values through various side channels that survive even when the
+speculative execution is unwound due to being incorrect. Mispredicted paths can
+cause code to be executed with data inputs that never occur in correct
+executions, making checks against malicious inputs ineffective and allowing
+attackers to use malicious data inputs to leak secret data. Here is an example,
+extracted and simplified from the Project Zero paper:
+```
+struct array {
+  unsigned long length;
+  unsigned char data[];
+};
+struct array *arr1 = ...; // small array
+struct array *arr2 = ...; // array of size 0x400
+unsigned long untrusted_offset_from_caller = ...;
+if (untrusted_offset_from_caller < arr1->length) {
+  unsigned char value = arr1->data[untrusted_offset_from_caller];
+  unsigned long index2 = ((value&1)*0x100)+0x200;
+  unsigned char value2 = arr2->data[index2];
+}
+```
+
+The key to the attack is to call this with an `untrusted_offset_from_caller`
+that is far out of bounds while the branch predictor predicts that it will be
+in bounds. In that case, the body of the `if` will be executed speculatively,
+and may read secret data into `value` and leak it via a cache-timing side
+channel when a dependent access is made to populate `value2`.
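To make the leak concrete, the index arithmetic from the example can be checked in isolation. The following is only an illustrative sketch: the names `probe_index` and `cache_line_of`, and the 64-byte cache line size, are assumptions introduced here and are not part of the original example.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the index computation from the example: a single secret bit selects
// between two offsets that are 0x100 bytes apart.
constexpr std::uint64_t probe_index(std::uint8_t value) {
  return ((value & 1u) * 0x100u) + 0x200u;
}

// Assumed cache line size. 64 bytes is common on x86 but is not guaranteed.
constexpr std::uint64_t kAssumedCacheLineBytes = 64;

constexpr std::uint64_t cache_line_of(std::uint64_t offset) {
  return offset / kAssumedCacheLineBytes;
}

// Hypothetical self-check showing the two possible probe locations.
inline void check_probe_index() {
  assert(probe_index(0x00) == 0x200);
  assert(probe_index(0x01) == 0x300);
  // The two offsets touch different cache lines, so the line that ends up
  // cached depends on the low bit of the loaded value.
  assert(cache_line_of(probe_index(0x00)) != cache_line_of(probe_index(0x01)));
}
```

Because the two candidate offsets land on different cache lines, timing a later access to each line of `arr2->data` can reveal which one was touched speculatively, and therefore the low bit of `value`.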
+
+## High Level Mitigation Approach
+
+While several approaches are being actively pursued to mitigate specific
+branches and/or loads inside especially risky software (most notably various OS
+kernels), these approaches require manual and/or static analysis aided auditing
+of code and explicit source changes to apply the mitigation. They are unlikely
+to scale well to large applications. We are proposing a comprehensive
+mitigation approach that would apply automatically across an entire program
+rather than through manual changes to the code. While this is likely to have a
+high performance cost, some applications may be in a good position to take this
+performance / security tradeoff.
+
+The specific technique we propose is to cause loads to be checked using
+branchless code to ensure that they are executing along a valid control flow
+path. Consider the following C-pseudo-code representing the core idea of a
+predicate guarding potentially invalid loads:
+```
+void leak(int data);
+void example(int* pointer1, int* pointer2) {
+  if (condition) {
+    // ... lots of code ...
+    leak(*pointer1);
+  } else {
+    // ... more code ...
+    leak(*pointer2);
+  }
+}
+```
+
+This would get transformed into something resembling the following:
+```
+uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
+uintptr_t all_zeros_mask = 0;
+void leak(int data);
+void example(int* pointer1, int* pointer2) {
+  uintptr_t predicate_state = all_ones_mask;
+  if (condition) {
+    // Assuming ?: is implemented using branchless logic...
+    predicate_state = !condition ? all_zeros_mask : predicate_state;
+    // ... lots of code ...
+    //
+    // Harden the pointer so it can't be loaded
+    pointer1 &= predicate_state;
+    leak(*pointer1);
+  } else {
+    predicate_state = condition ? all_zeros_mask : predicate_state;
+    // ... more code ...
+    //
+    // Alternative: Harden the loaded value
+    int value2 = *pointer2 & predicate_state;
+    leak(value2);
+  }
+}
+```
+
+The result should be that if the `if (condition) {` branch is mis-predicted,
+there is a *data* dependency on the condition used to zero out any pointers
+prior to loading through them or to zero out all of the loaded bits. Even
+though this code pattern may still execute speculatively, *invalid* speculative
+executions are prevented from leaking secret data from memory (but note that
+this data might still be loaded in safe ways, and some regions of memory are
+required to not hold secrets, see below for detailed limitations). This
+approach only requires the underlying hardware have a way to implement a
+branchless and unpredicted conditional update of a register's value. All modern
+architectures have support for this, and in fact such support is necessary to
+correctly implement constant time cryptographic primitives.
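The predicate-state update and the two hardening styles can be modeled with plain mask arithmetic. This is a minimal sketch under stated assumptions: the helper names (`update_predicate_state`, `harden_loaded_value`, `harden_address`) are hypothetical, and C++ source cannot guarantee that a compiler keeps the code branchless, so a real mitigation has to be emitted by the compiler backend rather than written this way.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uintptr_t kAllOnesMask = ~std::uintptr_t{0};
constexpr std::uintptr_t kAllZerosMask = 0;

// Equivalent to `on_valid_path ? state : kAllZerosMask`, written as mask
// arithmetic: 0 - 1 wraps to all-ones and 0 - 0 stays zero.
constexpr std::uintptr_t update_predicate_state(std::uintptr_t state,
                                                bool on_valid_path) {
  return state &
         (std::uintptr_t{0} - static_cast<std::uintptr_t>(on_valid_path));
}

// Hardening the loaded value: every loaded bit is forced to zero once the
// predicate state has been cleared.
constexpr std::uint32_t harden_loaded_value(std::uint32_t loaded,
                                            std::uintptr_t state) {
  return loaded & static_cast<std::uint32_t>(state);
}

// Hardening the address: the pointer bits are destroyed before the load.
constexpr std::uintptr_t harden_address(std::uintptr_t address,
                                        std::uintptr_t state) {
  return address & state;
}

// Hypothetical walk through one valid edge followed by one invalid edge.
inline void check_predicate_state() {
  std::uintptr_t state = kAllOnesMask;
  state = update_predicate_state(state, true);   // Correctly predicted edge.
  assert(harden_loaded_value(0x1234u, state) == 0x1234u);
  state = update_predicate_state(state, false);  // Mispredicted edge.
  assert(state == kAllZerosMask);
  state = update_predicate_state(state, true);   // State stays cleared.
  assert(harden_loaded_value(0x1234u, state) == 0u);
  assert(harden_address(0x7000u, state) == 0u);
}
```

Note that the state is sticky: once one edge clears it, later valid edges cannot restore it, which is what lets the state accumulate across nested control flow.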
+
+Crucial properties of this approach:
+* It is not preventing any particular side-channel from working. This is
+  important as there are an unknown number of potential side channels and we
+  expect to continue discovering more. Instead, it prevents the observation of
+  secret data in the first place.
+* It accumulates the predicate state, protecting even in the face of nested
+  *correctly* predicted control flows.
+* It passes this predicate state across function boundaries to provide
+  [interprocedural protection](#interprocedural-checking).
+* When hardening the address of a load, it uses a *destructive* or
+  *non-reversible* modification of the address to prevent an attacker from
+  reversing the check using attacker-controlled inputs.
+* It does not completely block speculative execution, and merely prevents
+  *mis*-speculated paths from leaking secrets from memory (and stalls
+  speculation until this can be determined).
+* It is completely general and makes no fundamental assumptions about the
+  underlying architecture other than the ability to do branchless conditional
+  data updates and a lack of value prediction.
+* It does not require programmers to identify all possible secret data using
+  static source code annotations or code vulnerable to a variant #1 style
+  attack.
+
+Limitations of this approach:
+* It requires re-compiling source code to insert hardening instruction
+  sequences. Only software compiled in this mode is protected.
+* The performance is heavily dependent on a particular architecture's
+  implementation strategy. We outline a potential x86 implementation below and
+  characterize its performance.
+* It does not defend against secret data already loaded from memory and
+  residing in registers or leaked through other side-channels in
+  non-speculative execution. Code dealing with this, e.g. cryptographic
+  routines, already uses constant-time algorithms and code to prevent
+  side-channels. Such code should also scrub registers of secret data following
+  [these
+  guidelines](https://github.com/HACS-workshop/spectre-mitigations/blob/master/crypto_guidelines.md).
+* To achieve reasonable performance, many loads may not be checked, such as
+  those with compile-time fixed addresses. This primarily consists of accesses
+  at compile-time constant offsets of global and local variables. Code which
+  needs this protection and intentionally stores secret data must ensure the
+  memory regions used for secret data are necessarily dynamic mappings or heap
+  allocations. This is an area which can be tuned to provide more comprehensive
+  protection at the cost of performance.
+* [Hardened loads](#hardening-the-address-of-the-load) may still load data from
+  _valid_ addresses if not _attacker-controlled_ addresses. To prevent these
+  from reading secret data, the low 2gb of the address space and 2gb above and
+  below any executable pages should be protected.
+
+Credit:
+* The core idea of tracing misspeculation through data and marking pointers to
+  block misspeculated loads was developed as part of a HACS 2018 discussion
+  between Chandler Carruth, Paul Kocher, Thomas Pornin, and several other
+  individuals.
+* The core idea of masking out loaded bits was part of the original mitigation
+  suggested by Jann Horn when these attacks were reported.
+
+
+### Indirect Branches, Calls, and Returns
+
+It is possible to attack control flow other than conditional branches with
+variant #1 style mispredictions.
+* A prediction towards a hot call target of a virtual method can lead to it
+  being speculatively executed when an unexpected type is used (often called
+  "type confusion").
+* A hot case may be speculatively executed due to prediction instead of the
+  correct case for a switch statement implemented as a jump table.
+* A hot common return address may be predicted incorrectly when returning from
+  a function.
+
+These code patterns are also vulnerable to Spectre variant #2, and as such are
+best mitigated with a
+[retpoline](https://support.google.com/faqs/answer/7625886) on x86 platforms.
+When a mitigation technique like retpoline is used, speculation simply cannot
+proceed through an indirect control flow edge (or it cannot be mispredicted in
+the case of a filled RSB) and so it is also protected from variant #1 style
+attacks. However, some architectures, micro-architectures, or vendors do not
+employ the retpoline mitigation, and on future x86 hardware (both Intel and
+AMD) it is expected to become unnecessary due to hardware-based mitigation.
+
+When not using a retpoline, these edges will need independent protection from
+variant #1 style attacks. The analogous approach to that used for conditional
+control flow should work:
+```
+uintptr_t all_ones_mask = std::numeric_limits<uintptr_t>::max();
+uintptr_t all_zeros_mask = 0;
+void leak(int data);
+void example(int* pointer1, int* pointer2) {
+  uintptr_t predicate_state = all_ones_mask;
+  switch (condition) {
+  case 0:
+    // Assuming ?: is implemented using branchless logic...
+    predicate_state = (condition != 0) ? all_zeros_mask : predicate_state;
+    // ... lots of code ...
+    //
+    // Harden the pointer so it can't be loaded
+    pointer1 &= predicate_state;
+    leak(*pointer1);
+    break;
+
+  case 1:
+    predicate_state = (condition != 1) ? all_zeros_mask : predicate_state;
+    // ... more code ...
+    //
+    // Alternative: Harden the loaded value
+    int value2 = *pointer2 & predicate_state;
+    leak(value2);
+    break;
+
+    // ...
+  }
+}
+```
+
+The core idea remains the same: validate the control flow using data-flow and
+use that validation to check that loads cannot leak information along
+misspeculated paths. Typically this involves passing the desired target of such
+control flow across the edge and checking that it is correct afterwards. Note
+that while it is tempting to think that this mitigates variant #2 attacks, it
+does not. Those attacks go to arbitrary gadgets that don't include the checks.
+
+
+### Variant #1.1 and #1.2 attacks: "Bounds Check Bypass Store"
+
+Beyond the core variant #1 attack, there are techniques to extend this attack.
+The primary technique is known as "Bounds Check Bypass Store" and is discussed
+in this research paper: https://people.csail.mit.edu/vlk/spectre11.pdf
+
+We will analyze these two variants independently. First, variant #1.1 works by
+speculatively storing over the return address after a bounds check bypass. This
+speculative store then ends up being used by the CPU during speculative
+execution of the return, potentially directing speculative execution to
+arbitrary gadgets in the binary. Let's look at an example.
+```
+unsigned char local_buffer[4];
+unsigned char *untrusted_data_from_caller = ...;
+unsigned long untrusted_size_from_caller = ...;
+if (untrusted_size_from_caller < sizeof(local_buffer)) {
+  // Speculative execution enters here with a too-large size.
+  memcpy(local_buffer, untrusted_data_from_caller,
+         untrusted_size_from_caller);
+  // The stack has now been smashed, writing an attacker-controlled
+  // address over the return address.
+  minor_processing(local_buffer);
+  return;
+  // Control will speculate to the attacker-written address.
+}
+```
+
+However, this can be mitigated by hardening the load of the return address just
+like any other load. This is sometimes complicated because x86 for example
+*implicitly* loads the return address off the stack. However, the
+implementation technique below is specifically designed to mitigate this
+implicit load by using the stack pointer to communicate misspeculation between
+functions. This additionally causes a misspeculation to have an invalid stack
+pointer and never be able to read the speculatively stored return address. See
+the detailed discussion below.
+
+For variant #1.2, the attacker speculatively stores into the vtable or jump
+table used to implement an indirect call or indirect jump. Because this is
+speculative, this will often be possible even when these are stored in
+read-only pages. For example:
+```
+class FancyObject : public BaseObject {
+public:
+  void DoSomething() override;
+};
+void f(unsigned long attacker_offset, unsigned long attacker_data) {
+  FancyObject object = getMyObject();
+  unsigned long *arr[4] = getFourDataPointers();
+  if (attacker_offset < 4) {
+    // We have bypassed the bounds check speculatively.
+    unsigned long *data = arr[attacker_offset];
+    // Now we have computed a pointer inside of `object`, the vptr.
+    *data = attacker_data;
+    // The vptr points to the virtual table and we speculatively clobber that.
+    g(object); // Hand the object to some other routine.
+  }
+}
+// In another file, we call a method on the object.
+void g(BaseObject &object) {
+  object.DoSomething();
+  // This speculatively calls the address stored over the vtable.
+}
+```
+
+Mitigating this requires hardening loads from these locations, or mitigating
+the indirect call or indirect jump. Any of these are sufficient to block the
+call or jump from using a speculatively stored value that has been read back.
+
+For both of these, using retpolines would be equally sufficient. One possible
+hybrid approach is to use retpolines for indirect call and jump, while relying
+on SLH to mitigate returns.
+
+Another approach that is sufficient for both of these is to harden all of the
+speculative stores. However, as most stores aren't interesting and don't
+inherently leak data, this is expected to be prohibitively expensive given the
+attack it is defending against.
+
+
+## Implementation Details
+
+There are a number of complex details impacting the implementation of this
+technique, both on a particular architecture and within a particular compiler.
+We discuss proposed implementation techniques for the x86 architecture and the
+LLVM compiler. These are primarily to serve as an example, as other
+implementation techniques are very possible.
+
+
+### x86 Implementation Details
+
+On the x86 platform we break down the implementation into three core
+components: accumulating the predicate state through the control flow graph,
+checking the loads, and checking control transfers between procedures.
+
+
+#### Accumulating Predicate State
+
+Consider baseline x86 instructions like the following, which test three
+conditions and, if all pass, load data from memory and potentially leak it
+through some side channel:
+```
+# %bb.0:                                # %entry
+        pushq   %rax
+        testl   %edi, %edi
+        jne     .LBB0_4
+# %bb.1:                                # %then1
+        testl   %esi, %esi
+        jne     .LBB0_4
+# %bb.2:                                # %then2
+        testl   %edx, %edx
+        je      .LBB0_3
+.LBB0_4:                                # %exit
+        popq    %rax
+        retq
+.LBB0_3:                                # %danger
+        movl    (%rcx), %edi
+        callq   leak
+        popq    %rax
+        retq
+```
+
+When we go to speculatively execute the load, we want to know whether any of
+the dynamically executed predicates have been misspeculated. To track that,
+along each conditional edge, we need to track the data which would allow that
+edge to be taken. On x86, this data is stored in the flags register used by the
+conditional jump instruction. Along both edges after this fork in control flow,
+the flags register remains alive and contains data that we can use to build up
+our accumulated predicate state. We accumulate it using the x86 conditional
+move instruction which also reads the flags register where the state resides.
+These conditional move instructions are known to not be predicted on any x86
+processors, making them immune to misprediction that could reintroduce the
+vulnerability. When we insert the conditional moves, the code ends up looking
+like the following:
+```
+# %bb.0:                                # %entry
+        pushq   %rax
+        xorl    %eax, %eax              # Zero out initial predicate state.
+        movq    $-1, %r8                # Put all-ones mask into a register.
+        testl   %edi, %edi
+        jne     .LBB0_1
+# %bb.2:                                # %then1
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        testl   %esi, %esi
+        jne     .LBB0_1
+# %bb.3:                                # %then2
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        testl   %edx, %edx
+        je      .LBB0_4
+.LBB0_1:
+        cmoveq  %r8, %rax               # Conditionally update predicate state.
+        popq    %rax
+        retq
+.LBB0_4:                                # %danger
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        ...
+```
+
+Here we create the "empty" or "correct execution" predicate state by zeroing
+`%rax`, and we create a constant "incorrect execution" predicate value by
+putting `-1` into `%r8`. Then, along each edge coming out of a conditional
+branch we do a conditional move that in a correct execution will be a no-op,
+but if misspeculated, will replace the `%rax` with the value of `%r8`.
+Misspeculating any one of the three predicates will cause `%rax` to hold the
+"incorrect execution" value from `%r8` as we preserve incoming values when
+execution is correct rather than overwriting it.
+
+We now have a value in `%rax` in each basic block that indicates if at some
+point previously a predicate was mispredicted. And we have arranged for that
+value to be particularly effective when used below to harden loads.
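The effect of the `cmovCC` chain can be modeled in a few lines. This sketch only models the resulting values, not the branchless encoding (the `?:` here may well compile to a branch), and the names `accumulate` and `state_at_danger` are hypothetical. Each `bool` stands for "the flags agree with the edge that was actually taken".

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kCorrect = 0;                   // Zeroed %rax.
constexpr std::uint64_t kPoisoned = ~std::uint64_t{0};  // The -1 held in %r8.

// Models one `cmovCC %r8, %rax` at the start of a block: a no-op when the
// edge was taken correctly, otherwise the state is overwritten with -1.
constexpr std::uint64_t accumulate(std::uint64_t state, bool edge_was_valid) {
  return edge_was_valid ? state : kPoisoned;
}

// Models the state reaching the %danger block after the three predicates.
inline std::uint64_t state_at_danger(bool first_valid, bool second_valid,
                                     bool third_valid) {
  std::uint64_t state = kCorrect;
  state = accumulate(state, first_valid);
  state = accumulate(state, second_valid);
  state = accumulate(state, third_valid);
  return state;
}

// Hypothetical self-check: any single misprediction poisons the state.
inline void check_accumulation() {
  assert(state_at_danger(true, true, true) == kCorrect);
  assert(state_at_danger(false, true, true) == kPoisoned);
  assert(state_at_danger(true, false, true) == kPoisoned);
  assert(state_at_danger(true, true, false) == kPoisoned);
}
```

Because an already poisoned state is passed through unchanged on later valid edges, a misprediction early in the chain is still visible at the load.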
+
+
+##### Indirect Call, Branch, and Return Predicates
+
+(Not yet implemented.)
+
+There is no analogous flag to use when tracing indirect calls, branches, and
+returns. The predicate state must be accumulated through some other means.
+Fundamentally, this is the reverse of the problem posed in CFI: we need to
+check where we came from rather than where we are going. For function-local
+jump tables, this is easily arranged by testing the input to the jump table
+within each destination:
+```
+        pushq   %rax
+        xorl    %eax, %eax              # Zero out initial predicate state.
+        movq    $-1, %r8                # Put all-ones mask into a register.
+        jmpq    *.LJTI0_0(,%rdi,8)      # Indirect jump through table.
+.LBB0_2:                                # %sw.bb
+        cmpq    $0, %rdi                # Validate index used for jump table.
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        ...
+        jmp     _Z4leaki                # TAILCALL
+
+.LBB0_3:                                # %sw.bb1
+        cmpq    $1, %rdi                # Validate index used for jump table.
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        ...
+        jmp     _Z4leaki                # TAILCALL
+
+.LBB0_5:                                # %sw.bb10
+        cmpq    $2, %rdi                # Validate index used for jump table.
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        ...
+        jmp     _Z4leaki                # TAILCALL
+        ...
+
+        .section        .rodata,"a",@progbits
+        .p2align        3
+.LJTI0_0:
+        .quad   .LBB0_2
+        .quad   .LBB0_3
+        .quad   .LBB0_5
+        ...
+```
+
+Returns have a simple mitigation technique on x86-64 (or other ABIs which have
+what is called a "red zone" region beyond the end of the stack). This region is
+guaranteed to be preserved across interrupts and context switches, making the
+return address used in returning to the current code remain on the stack and
+valid to read. We can emit code in the caller to verify that a return edge was
+not mispredicted:
+```
+        callq   other_function
+return_addr:
+        testq   -8(%rsp), return_addr   # Validate return address.
+        cmovneq %r8, %rax               # Update predicate state.
+```
+
+For an ABI without a "red zone" (and thus unable to read the return address
+from the stack), mitigating returns faces similar problems to calls below.
+
+Indirect calls (and returns in the absence of a red zone ABI) pose the most
+significant challenge to propagate. The simplest technique would be to define a
+new ABI such that the intended call target is passed into the called function
+and checked in the entry. Unfortunately, new ABIs are quite expensive to deploy
+in C and C++. While the target function could be passed in TLS, we would still
+require complex logic to handle a mixture of functions compiled with and
+without this extra logic (essentially, making the ABI backwards compatible).
+Currently, we suggest using retpolines here and will continue to investigate
+ways of mitigating this.
+
+
+##### Optimizations, Alternatives, and Tradeoffs
+
+Merely accumulating predicate state involves significant cost. There are
+several key optimizations we employ to minimize this and various alternatives
+that present different tradeoffs in the generated code.
+
+First, we work to reduce the number of instructions used to track the state:
+* Rather than inserting a `cmovCC` instruction along every conditional edge in
+  the original program, we track each set of condition flags we need to capture
+  prior to entering each basic block and reuse a common `cmovCC` sequence for
+  those.
+  * We could further reuse suffixes when there are multiple `cmovCC`
+    instructions required to capture the set of flags. Currently this is
+    believed to not be worth the cost as paired flags are relatively rare and
+    suffixes of them are exceedingly rare.
+* A common pattern in x86 is to have multiple conditional jump instructions
+  that use the same flags but handle different conditions. Naively, we could
+  consider each fallthrough between them an "edge" but this causes a much more
+  complex control flow graph. Instead, we accumulate the set of conditions
+  necessary for fallthrough and use a sequence of `cmovCC` instructions in a
+  single fallthrough edge to track it.
+
+Second, we trade register pressure for simpler `cmovCC` instructions by
+allocating a register for the "bad" state. We could read that value from memory
+as part of the conditional move instruction, however, this creates more
+micro-ops and requires the load-store unit to be involved. Currently, we place
+the value into a virtual register and allow the register allocator to decide
+when the register pressure is sufficient to make it worth spilling to memory
+and reloading.
+
+
+#### Hardening Loads
+
+Once we have the predicate accumulated into a special value for correct vs.
+misspeculated, we need to apply this to loads in a way that ensures they do not
+leak secret data. There are two primary techniques for this: we can either
+harden the loaded value to prevent observation, or we can harden the address
+itself to prevent the load from occurring. These have significantly different
+performance tradeoffs.
+
+
+##### Hardening loaded values
+
+The most appealing way to harden loads is to mask out all of the bits loaded.
+The key requirement is that for each bit loaded, along the misspeculated path
+that bit is always fixed at either 0 or 1 regardless of the value of the bit
+loaded. The most obvious implementation uses either an `and` instruction with
+an all-zero mask along misspeculated paths and an all-one mask along correct
+paths, or an `or` instruction with an all-one mask along misspeculated paths
+and an all-zero mask along correct paths. Other options become less appealing
+such as multiplying by zero, or multiple shift instructions. For reasons we
+elaborate on below, we end up suggesting you use `or` with an all-ones mask,
+making the x86 instruction sequence look like the following:
+```
+        ...
+
+.LBB0_4:                                # %danger
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        movl    (%rsi), %edi            # Load potentially secret data from %rsi.
+        orl     %eax, %edi
+```
+
+Other useful patterns may be to fold the load into the `or` instruction itself
+at the cost of a register-to-register copy.
+
+There are some challenges with deploying this approach:
+1. Many loads on x86 are folded into other instructions. Separating them would
+   add very significant and costly register pressure with prohibitive
+   performance cost.
+1. Loads may not target a general purpose register requiring extra instructions
+   to map the state value into the correct register class, and potentially more
+   expensive instructions to mask the value in some way.
+1. The flags register on x86 is very likely to be live, and challenging to
+   preserve cheaply.
+1. There are many more values loaded than pointers & indices used for loads. As
+   a consequence, hardening the result of a load requires substantially more
+   instructions than hardening the address of the load (see below).
+
+Despite these challenges, hardening the result of the load critically allows
+the load to proceed and thus has dramatically less impact on the total
+speculative / out-of-order potential of the execution. There are also several
+interesting techniques to try and mitigate these challenges and make hardening
+the results of loads viable in at least some cases. However, we generally
+expect to fall back from hardening the loaded value to the next approach of
+hardening the address itself whenever hardening the value is unprofitable.
+
+
+###### Loads folded into data-invariant operations can be hardened after the operation
+
+The first key to making this feasible is to recognize that many operations on
+x86 are "data-invariant". That is, they have no (known) observable behavior
+differences due to the particular input data. These instructions are often used
+when implementing cryptographic primitives dealing with private key data
+because they are not believed to provide any side-channels. Similarly, we can
+defer hardening until after them as they will not in-and-of-themselves
+introduce a speculative execution side-channel. This results in code sequences
+that look like:
+```
+        ...
+
+.LBB0_4:                                # %danger
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        addl    (%rsi), %edi            # Load and accumulate without leaking.
+        orl     %eax, %edi
+```
+
+While an addition happens to the loaded (potentially secret) value, that
+doesn't leak any data and we then immediately harden it.
+
+
+###### Hardening of loaded values deferred down the data-invariant expression graph
+
+We can generalize the previous idea and sink the hardening down the expression
+graph across as many data-invariant operations as desirable. This can use very
+conservative rules for whether something is data-invariant. The primary goal
+should be to handle multiple loads with a single hardening instruction:
+```
+        ...
+
+.LBB0_4:                                # %danger
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        addl    (%rsi), %edi            # Load and accumulate without leaking.
+        addl    4(%rsi), %edi           # Continue without leaking.
+        addl    8(%rsi), %edi
+        orl     %eax, %edi              # Mask out bits from all three loads.
+```
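The deferred hardening in the sequence above can be modeled as follows. This is a sketch of the data flow only, assuming the predicate state is `0` on correct paths and all-ones on misspeculated paths; `harden_loaded` and `sum_three_hardened` are hypothetical names, and unsigned 32-bit arithmetic stands in for the `addl` instructions.

```cpp
#include <cassert>
#include <cstdint>

// Models `orl %eax, %edi`: a no-op when the state is zero, and every bit is
// forced to one when the state is all-ones.
constexpr std::uint32_t harden_loaded(std::uint32_t value,
                                      std::uint64_t state) {
  return value | static_cast<std::uint32_t>(state);
}

// Models three folded loads accumulated with data-invariant additions and
// hardened once at the end.
inline std::uint32_t sum_three_hardened(const std::uint32_t* data,
                                        std::uint32_t accumulator,
                                        std::uint64_t state) {
  accumulator += data[0];
  accumulator += data[1];
  accumulator += data[2];
  return harden_loaded(accumulator, state);
}

// Hypothetical self-check: under misspeculation the result no longer depends
// on any of the loaded values.
inline void check_deferred_hardening() {
  const std::uint32_t first[3] = {1, 2, 3};
  const std::uint32_t second[3] = {40, 50, 60};
  assert(sum_three_hardened(first, 10, 0) == 16u);
  assert(sum_three_hardened(first, 10, ~std::uint64_t{0}) == 0xFFFFFFFFu);
  assert(sum_three_hardened(second, 10, ~std::uint64_t{0}) == 0xFFFFFFFFu);
}
```

A single `or` covers all three loads because the additions themselves are assumed not to leak and the final mask fixes every bit of the combined result.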
+
+
+###### Preserving the flags while hardening loaded values on Haswell, Zen, and newer processors
+
+Sadly, there are no useful instructions on x86 that apply a mask to all 64 bits
+without touching the flags register. However, we can harden loaded values that
+are narrower than a word (fewer than 32-bits on 32-bit systems and fewer than
+64-bits on 64-bit systems) by zero-extending the value to the full word size
+and then shifting right by at least the number of original bits using the BMI2
+`shrx` instruction:
+```
+        ...
+
+.LBB0_4:                                # %danger
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        addl    (%rsi), %edi            # Load and accumulate 32 bits of data.
+        shrxq   %rax, %rdi, %rdi        # Shift out all 32 bits loaded.
+```
+
+Because on x86 the zero-extend is free, this can efficiently harden the loaded
+value.
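The shift-based hardening can be modeled as below. This sketch assumes the same state encoding as the assembly (`0` when correct, all-ones when misspeculated) and relies on `shrx` masking its count to the low 6 bits for 64-bit operands; `harden_narrow_value` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Models `shrxq %rax, %rdi, %rdi` applied to a zero-extended 32-bit value.
// A state of zero shifts by 0 and leaves the value intact. A state of
// all-ones shifts by 63, which clears all 32 loaded bits.
constexpr std::uint64_t harden_narrow_value(std::uint32_t loaded,
                                            std::uint64_t state) {
  return static_cast<std::uint64_t>(loaded) >> (state & 63);
}

// Hypothetical self-check for both states.
inline void check_shift_hardening() {
  assert(harden_narrow_value(0xDEADBEEFu, 0) == 0xDEADBEEFu);
  assert(harden_narrow_value(0xDEADBEEFu, ~std::uint64_t{0}) == 0u);
  assert(harden_narrow_value(0xFFFFFFFFu, ~std::uint64_t{0}) == 0u);
}
```

The same shift applied to a full 64-bit value would leave the top bit behind in bit 0, which is why the technique is limited to values narrower than a word.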
+
+
+##### Hardening the address of the load
+
+When hardening the loaded value is inapplicable, most often because the
+instruction directly leaks information (like `cmp` or `jmpq`), we switch to
+hardening the _address_ of the load instead of the loaded value. This avoids
+increasing register pressure by unfolding the load or paying some other high
+cost.
+
+To understand how this works in practice, we need to examine the exact
+semantics of the x86 addressing mode which, in its fully general form, looks
+like `offset(%base,%index,scale)`. Here `%base` and `%index` are 64-bit
+registers that can potentially be any value, and may be attacker controlled,
+and `scale` and `offset` are fixed immediate values. `scale` must be `1`, `2`,
+`4`, or `8`, and `offset` can be any 32-bit sign extended value. The exact
+computation performed to find the address is then: `%base + (scale * %index) +
+offset` under 64-bit 2's complement modular arithmetic.
+
+One issue with this approach is that, after hardening, the `%base + (scale *
+%index)` subexpression will compute a value near zero (`-1 + (scale * -1)`) and
+then a large, positive `offset` will index into memory within the first two
+gigabytes of address space. While these offsets are not attacker controlled,
+the attacker could choose to attack a load which happens to have the desired
+offset and then successfully read memory in that region. This significantly
+raises the burden on the attacker and limits the scope of attack but does not
+eliminate it. To fully close the attack we must work with the operating system
+to preclude mapping memory in the low two gigabytes of address space.
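The address arithmetic described above can be checked directly. This is a sketch under stated assumptions: it models hardening as forcing both `%base` and `%index` to all-ones, and the helpers `effective_address`, `hardened_address`, and `in_low_2gb` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Models the address computation base + scale * index + offset under 64-bit
// modular arithmetic, with the 32-bit offset sign-extended.
constexpr std::uint64_t effective_address(std::uint64_t base,
                                          std::uint64_t index,
                                          std::uint64_t scale,
                                          std::int32_t offset) {
  return base + scale * index +
         static_cast<std::uint64_t>(static_cast<std::int64_t>(offset));
}

constexpr std::uint64_t kPoisoned = ~std::uint64_t{0};  // -1 in every bit.

// Address computed once both registers have been forced to all-ones.
constexpr std::uint64_t hardened_address(std::uint64_t scale,
                                         std::int32_t offset) {
  return effective_address(kPoisoned, kPoisoned, scale, offset);
}

constexpr bool in_low_2gb(std::uint64_t address) {
  return address < (std::uint64_t{1} << 31);
}

// Hypothetical self-check of the two outcomes described in the text.
inline void check_hardened_addresses() {
  // An ordinary, unhardened access.
  assert(effective_address(0x1000, 2, 8, 16) == 0x1020);
  // With no offset the result is -1 + 8 * -1 = -9, a small negative address.
  assert(hardened_address(8, 0) == static_cast<std::uint64_t>(-9));
  assert(!in_low_2gb(hardened_address(8, 0)));
  // A large positive offset wraps around into the low two gigabytes.
  assert(hardened_address(1, 0x7fffffff) == 0x7ffffffd);
  assert(in_low_2gb(hardened_address(1, 0x7fffffff)));
}
```

Since `offset` is a fixed 32-bit immediate, the wrapped result can never exceed the low two gigabytes, which is why keeping that region unmapped is sufficient for this case.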
+
+
+###### 64-bit load checking instructions
+
+We can use the following instruction sequences to check loads. We set up `%r8`
+in these examples to hold the special value of `-1` which will be `cmov`ed over
+`%rax` in misspeculated paths.
+
+Single register addressing mode:
+```
+        ...
+
+.LBB0_4:                                # %danger
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        orq     %rax, %rsi              # Mask the pointer if misspeculating.
+        movl    (%rsi), %edi
+```
+
+Two register addressing mode:
+```
+        ...
+
+.LBB0_4:                                # %danger
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        orq     %rax, %rsi              # Mask the pointer if misspeculating.
+        orq     %rax, %rcx              # Mask the index if misspeculating.
+        movl    (%rsi,%rcx), %edi
+```
+
+This will result in a negative address near zero or in `offset` wrapping the
+address space back to a small positive address. Small, negative addresses will
+fault in user-mode for most operating systems, but targets which need the high
+address space to be user accessible may need to adjust the exact sequence used
+above. Additionally, the low addresses will need to be marked unreadable by the
+OS to fully harden the load.
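
The effect of the `orq` masking on the computed address can be checked with a small model. The C sketch below is illustrative only: `effective_address`, `harden`, and `check_or_masking` are hypothetical helper names, the pointer value is made up, and unsigned 64-bit arithmetic stands in for the hardware's modular address computation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of %base + (scale * %index) + offset, modulo 2^64. */
static uint64_t effective_address(uint64_t base, uint64_t index,
                                  uint64_t scale, int32_t offset) {
    /* The 32-bit offset is sign extended to 64 bits before it is added. */
    return base + scale * index + (uint64_t)(int64_t)offset;
}

/* Hypothetical model of `orq %rax, %reg`; state is 0 or all ones (-1). */
static uint64_t harden(uint64_t reg, uint64_t state) {
    return reg | state;
}

static void check_or_masking(void) {
    const uint64_t base = UINT64_C(0x00007f0012340000); /* made-up pointer */
    const uint64_t index = 5;
    const uint64_t ok = 0, poisoned = UINT64_MAX;

    /* Correct execution: OR with 0 leaves the address untouched. */
    assert(effective_address(harden(base, ok), harden(index, ok), 8, 16) ==
           base + 8 * index + 16);

    /* Misspeculation, zero offset: -1 + (8 * -1) = -9, just below zero. */
    assert(effective_address(harden(base, poisoned), harden(index, poisoned),
                             8, 0) == (uint64_t)0 - 9);

    /* Misspeculation, largest offset: wraps to a low address under 2gb. */
    uint64_t wrapped = effective_address(harden(base, poisoned),
                                         harden(index, poisoned), 8, INT32_MAX);
    assert(wrapped == (uint64_t)INT32_MAX - 9);
    assert(wrapped < (UINT64_C(1) << 31));
}
```

Calling `check_or_masking()` from a `main` function exercises both outcomes described above: a small negative address, or a wrap into the low two gigabytes when the offset is large and positive.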
+
+
+###### RIP-relative addressing is even easier to break
+
+There is a common addressing mode idiom that is substantially harder to check:
+addressing relative to the instruction pointer. We cannot change the value of
+the instruction pointer register and so we have the harder problem of forcing
+`%base + scale * %index + offset` to be an invalid address, by *only* changing
+`%index`. The only advantage we have is that the attacker also cannot modify
+`%base`. If we use the fast instruction sequence above, but only apply it to
+the index, we will always access `%rip + (scale * -1) + offset`. If the
+attacker can find a load whose address, computed this way, happens to point to
+secret data, then they can reach it. However, the loader and base libraries can
+also simply refuse to map the heap, data segments, or stack within 2gb of any
+of the text in the program, much like they can reserve the low 2gb of address
+space.
+
+
+###### The flag registers again make everything hard
+
+Unfortunately, the technique of using `orq` instructions has a serious flaw on
+x86. The very thing that makes it easy to accumulate state, the flag registers
+containing predicates, causes serious problems here because they may be alive
+and used by the loading instruction or subsequent instructions. On x86, the
+`orq` instruction **sets** the flags and will override anything already there.
+This makes inserting them into the instruction stream very hazardous.
+Unfortunately, unlike when hardening the loaded value, we have no fallback here
+and so we must have a fully general approach available.
+
+The first thing we must do when generating these sequences is try to analyze
+the surrounding code to prove that the flags are not in fact alive or being
+used. Typically, they have been set by some other instruction which just
+happens to set the flags register (much like ours!) with no actual dependency.
+In those cases, it is safe to directly insert these instructions.
+Alternatively we may be able to move them earlier to avoid clobbering the used
+value.
+
+However, this may ultimately be impossible. In that case, we need to preserve
+the flags around these instructions:
+```
+        ...
+
+.LBB0_4:                                # %danger
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        pushfq
+        orq     %rax, %rcx              # Mask the pointer if misspeculating.
+        orq     %rax, %rdx              # Mask the index if misspeculating.
+        popfq
+        movl    (%rcx,%rdx), %edi
+```
+
+Using the `pushf` and `popf` instructions saves the flags register around our
+inserted code, but comes at a high cost. First, we must store the flags to the
+stack and reload them. Second, this causes the stack pointer to be adjusted
+dynamically, requiring a frame pointer be used for referring to temporaries
+spilled to the stack, etc.
+
+On newer x86 processors we can use the `lahf` and `sahf` instructions to save
+all of the flags besides the overflow flag in a register rather than on the
+stack. We can then use `seto` and `add` to save and restore the overflow flag
+in a register. Combined, this will save and restore flags in the same manner as
+above but using two registers rather than the stack. That is still very
+expensive, though slightly less expensive than `pushf` and `popf` in most cases.
+
+
+###### A flag-less alternative on Haswell, Zen and newer processors
+
+Starting with the BMI2 x86 instruction set extensions available on Haswell and
+Zen processors, there is an instruction for shifting that does not set any
+flags: `shrx`. We can use this and the `lea` instruction to implement analogous
+code sequences to the above ones. However, these are still very marginally
+slower, as there are fewer ports able to dispatch shift instructions in most
+modern x86 processors than there are for `or` instructions.
+
+Fast, single register addressing mode:
+```
+        ...
+
+.LBB0_4:                                # %danger
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        shrxq   %rax, %rsi, %rsi        # Shift away bits if misspeculating.
+        movl    (%rsi), %edi
+```
+
+This will collapse the register to zero or one, and force everything but the
+offset in the addressing mode to be less than or equal to 9. This means the
+full address can only be guaranteed to be less than `(1 << 31) + 9`. The OS may
+wish to protect an extra page of the low address space to account for this.
+
+##### Optimizations
+
+A very large portion of the cost for this approach comes from checking loads in
+this way, so it is important to work to optimize this. However, beyond making
+the instruction sequences to *apply* the checks efficient (for example by
+avoiding `pushfq` and `popfq` sequences), the only significant optimization is
+to check fewer loads without introducing a vulnerability. We apply several
+techniques to accomplish that.
+
+
+###### Don't check loads from compile-time constant stack offsets
+
+We implement this optimization on x86 by skipping the checking of loads which
+use a fixed frame pointer offset.
+
+The result of this optimization is that patterns like reloading a spilled
+register or accessing a global field don't get checked. This is a very
+significant performance win.
+
+
+###### Don't check dependent loads
+
+A core part of why this mitigation strategy works is that it establishes a
+data-flow check on the loaded address. However, this means that if the address
+itself was already loaded using a checked load, there is no need to check a
+dependent load provided it is within the same basic block as the checked load,
+and therefore has no additional predicates guarding it. Consider code like the
+following:
+```
+        ...
+
+.LBB0_4:                                # %danger
+        movq    (%rcx), %rdi
+        movl    (%rdi), %edx
+```
+
+This will get transformed into:
+```
+        ...
+
+.LBB0_4:                                # %danger
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        orq     %rax, %rcx              # Mask the pointer if misspeculating.
+        movq    (%rcx), %rdi            # Hardened load.
+        movl    (%rdi), %edx            # Unhardened load due to dependent addr.
+```
+
+This doesn't check the load through `%rdi` as that pointer is dependent on a
+checked load already.
+
+
+###### Protect large, load-heavy blocks with a single lfence
+
+It may be worth using a single `lfence` instruction at the start of a block
+which begins with a (very) large number of loads that require independent
+protection *and* which require hardening the address of the load. However, this
+is unlikely to be profitable in practice. The latency hit of the hardening
+would need to exceed that of an `lfence` when *correctly* speculatively
+executed. But in that case, the `lfence` cost is a complete loss of speculative
+execution (at a minimum). So far, the evidence we have of the performance cost
+of using `lfence` indicates few if any hot code patterns where this trade off
+would make sense.
+
+
+###### Tempting optimizations that break the security model
+
+Several optimizations were considered which didn't pan out due to failure to
+uphold the security model. One in particular is worth discussing as many others
+will reduce to it.
+
+We wondered whether only the *first* load in a basic block could be checked. If
+the check works as intended, it forms an invalid pointer that doesn't even
+virtual-address translate in the hardware. It should fault very early on in its
+processing. Maybe that would stop things in time for the misspeculated path to
+fail to leak any secrets. This doesn't end up working because the processor is
+fundamentally out-of-order, even in its speculative domain. As a consequence,
+the attacker could cause the initial address computation itself to stall and
+allow an arbitrary number of unrelated loads (including attacked loads of
+secret data) to pass through.
+
+
+#### Interprocedural Checking
+
+Modern x86 processors may speculate into called functions and out of functions
+to their return address. As a consequence, we need a way to check loads that
+occur after a misspeculated predicate but where the load and the misspeculated
+predicate are in different functions. In essence, we need some interprocedural
+generalization of the predicate state tracking. A primary challenge to passing
+the predicate state between functions is that we would like to avoid requiring
+a change to the ABI or calling convention, which keeps this mitigation more
+deployable. We would further like code mitigated in this way to mix easily
+with code that is not mitigated in this way, without completely losing the
+value of the mitigation.
+
+
+##### Embed the predicate state into the high bit(s) of the stack pointer
+
+We can use the same technique that allows hardening pointers to pass the
+predicate state into and out of functions. The stack pointer is trivially
+passed between functions and we can test for it having the high bits set to
+detect when it has been marked due to misspeculation. The callsite instruction
+sequence looks like (assuming a misspeculated state value of `-1`):
+```
+        ...
+
+.LBB0_4:                                # %danger
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        shlq    $47, %rax
+        orq     %rax, %rsp
+        callq   other_function
+        movq    %rsp, %rax
+        sarq    $63, %rax               # Sign extend the high bit to all bits.
+```
+
+This first puts the predicate state into the high bits of `%rsp` before calling
+the function and then reads it back out of high bits of `%rsp` afterward. When
+correctly executing (speculatively or not), these are all no-ops. When
+misspeculating, the stack pointer will end up negative. We arrange for it to
+remain a canonical address, but otherwise leave the low bits alone to allow
+stack adjustments to proceed normally without disrupting this. Within the
+called function, we can extract this predicate state and then reset it on
+return:
+```
+other_function:
+        # prolog
+        movq    %rsp, %rax
+        sarq    $63, %rax               # Sign extend the high bit to all bits.
+        # ...
+
+.LBB0_N:
+        cmovneq %r8, %rax               # Conditionally update predicate state.
+        shlq    $47, %rax
+        orq     %rax, %rsp
+        retq
+```
+
+This approach is effective when all code is mitigated in this fashion, and can
+even survive very limited reaches into unmitigated code (the state will
+round-trip in and back out of an unmitigated function, it just won't be
+updated). But it does have some limitations. There is a cost to merging the
+state into `%rsp` and it doesn't insulate mitigated code from misspeculation in
+an unmitigated caller.
+
+There is also an advantage to using this form of interprocedural mitigation: by
+forming these invalid stack pointer addresses we can prevent speculative
+returns from successfully reading speculatively written values to the actual
+stack. This works first by forming a data-dependency between computing the
+address of the return address on the stack and our predicate state. And even
+when satisfied, if a misprediction causes the state to be poisoned the
+resulting stack pointer will be invalid.
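
The round trip of the predicate state through `%rsp` can be modeled in a few lines. This C sketch is illustrative only: the function names and the stack pointer value are hypothetical, and it assumes 48-bit virtual addresses, where an address is canonical when bits 47 through 63 are all equal.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of `shlq $47, %rax; orq %rax, %rsp`. */
static uint64_t merge_state_into_sp(uint64_t sp, uint64_t state) {
    return sp | (state << 47);
}

/* Hypothetical model of `movq %rsp, %rax; sarq $63, %rax`. Written as a
   subtraction so it does not depend on right shifts of negative values. */
static uint64_t extract_state_from_sp(uint64_t sp) {
    return (uint64_t)0 - (sp >> 63);
}

static void check_sp_round_trip(void) {
    const uint64_t sp = UINT64_C(0x00007ffd12345678); /* made-up stack pointer */
    const uint64_t low_mask = (UINT64_C(1) << 47) - 1;
    const uint64_t ok = 0, poisoned = UINT64_MAX;

    /* Correct execution: both steps are no-ops. */
    assert(merge_state_into_sp(sp, ok) == sp);
    assert(extract_state_from_sp(merge_state_into_sp(sp, ok)) == 0);

    /* Misspeculation: bits 47..63 are all set, so the address stays
       canonical, and the low bits are left alone. */
    uint64_t marked = merge_state_into_sp(sp, poisoned);
    assert((marked >> 47) == 0x1ffff);
    assert((marked & low_mask) == (sp & low_mask));
    assert(extract_state_from_sp(marked) == poisoned);

    /* Ordinary stack adjustments do not disturb the marker. */
    assert(extract_state_from_sp(marked - 64) == poisoned);
}
```

Shifting by 47 rather than 63 is what keeps the poisoned pointer canonical in this model: every bit from 47 upward is set, so the value looks like a high-half address instead of a non-canonical one.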
+
+
+##### Rewrite API of internal functions to directly propagate predicate state
+
+(Not yet implemented.)
+
+We have the option with internal functions to directly adjust their API to
+accept the predicate as an argument and return it. This is likely to be
+marginally cheaper than embedding into `%rsp` for entering functions.
+
+
+##### Use `lfence` to guard function transitions
+
+An `lfence` instruction can be used to prevent subsequent loads from
+speculatively executing until all prior mispredicted predicates have resolved.
+We can use this broader barrier to keep speculative loads from executing across
+function boundaries. We emit it in the entry block to handle calls, and prior
+to each return. This approach also has the advantage of providing the strongest
+degree of mitigation when mixed with unmitigated code by halting all
+misspeculation entering a function which is mitigated, regardless of what
+occurred in the caller. However, such a mixture is inherently more risky.
+Whether this kind of mixture is a sufficient mitigation requires careful
+analysis.
+
+Unfortunately, experimental results indicate that the performance overhead of
+this approach is very high for certain patterns of code. A classic example is
+any form of recursive evaluation engine. The hot, rapid call and return
+sequences exhibit dramatic performance loss when mitigated with `lfence`. This
+component alone can regress performance by 2x or more, making it an unpleasant
+tradeoff even when only used in a mixture of code.
+
+
+##### Use an internal TLS location to pass predicate state
+
+We can define a special thread-local value to hold the predicate state between
+functions. This avoids direct ABI implications by using a side channel between
+callers and callees to communicate the predicate state. It also allows implicit
+zero-initialization of the state, which allows non-checked code to be the first
+code executed.
+
+However, this requires a load from TLS in the entry block, a store to TLS
+before every call and every ret, and a load from TLS after every call. As a
+consequence it is expected to be substantially more expensive even than using
+`%rsp` and potentially `lfence` within the function entry block.
+
+
+##### Define a new ABI and/or calling convention
+
+We could define a new ABI and/or calling convention to explicitly pass the
+predicate state in and out of functions. This may be interesting if none of the
+alternatives have adequate performance, but it makes deployment and adoption
+dramatically more complex, and potentially infeasible.
+
+
+## High-Level Alternative Mitigation Strategies
+
+There are completely different alternative approaches to mitigating variant 1
+attacks. [Most](https://lwn.net/Articles/743265/)
+[discussion](https://lwn.net/Articles/744287/) so far focuses on mitigating
+specific known attackable components in the Linux kernel (or other kernels) by
+manually rewriting the code to contain an instruction sequence that is not
+vulnerable. For x86 systems this is done by either injecting an `lfence`
+instruction along the code path which would leak data if executed speculatively
+or by rewriting memory accesses to have branch-less masking to a known safe
+region. On Intel systems, `lfence` [will prevent the speculative load of secret
+data](https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/Intel-Analysis-of-Speculative-Execution-Side-Channels.pdf).
+On AMD systems `lfence` is currently a no-op, but can be made
+dispatch-serializing by setting an MSR, and thus preclude misspeculation of the
+code path ([mitigation G-2 +
+V1-1](https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf)).
+
+However, this relies on finding and enumerating all possible points in code
+which could be attacked to leak information. While in some cases static
+analysis is effective at doing this at scale, in many cases it still relies on
+human judgement to evaluate whether code might be vulnerable. Especially for
+software systems which receive less detailed scrutiny but remain sensitive to
+these attacks, this seems like an impractical security model. We need an
+automatic and systematic mitigation strategy.
+
+
+### Automatic `lfence` on Conditional Edges
+
+A natural way to scale up the existing hand-coded mitigations is simply to
+inject an `lfence` instruction into both the target and fallthrough
+destinations of every conditional branch. This ensures that no predicate or
+bounds check can be bypassed speculatively. However, the performance overhead
+of this approach is, simply put, catastrophic. Yet it remains the only truly
+"secure by default" approach known prior to this effort and serves as the
+baseline for performance.
+
+One attempt to address the performance overhead of this and make it more
+realistic to deploy is [MSVC's /Qspectre
+switch](https://blogs.msdn.microsoft.com/vcblog/2018/01/15/spectre-mitigations-in-msvc/).
+Their technique is to use static analysis within the compiler to only insert
+`lfence` instructions into conditional edges at risk of attack. However,
+[initial](https://arstechnica.com/gadgets/2018/02/microsofts-compiler-level-spectre-fix-shows-how-hard-this-problem-will-be-to-solve/)
+[analysis](https://www.paulkocher.com/doc/MicrosoftCompilerSpectreMitigation.html)
+has shown that this approach is incomplete and only catches a small and limited
+subset of attackable patterns which happen to resemble very closely the initial
+proofs of concept. As such, while its performance is acceptable, it does not
+appear to be an adequate systematic mitigation.
+
+
+## Performance Overhead
+
+The performance overhead of this style of comprehensive mitigation is very
+high. However, it compares very favorably with previously recommended
+approaches such as the `lfence` instruction. Just as users can restrict the
+scope of `lfence` to control its performance impact, this mitigation technique
+could be restricted in scope as well.
+
+However, it is important to understand what it would cost to get a fully
+mitigated baseline. Here we assume targeting a Haswell (or newer) processor and
+using all of the tricks to improve performance (which leaves the low 2gb of
+address space, and the +/- 2gb surrounding any PC in the program,
+unprotected). We ran both
+Google's microbenchmark suite and a large highly-tuned server built using
+ThinLTO and PGO. All were built with `-march=haswell` to give access to BMI2
+instructions, and benchmarks were run on large Haswell servers. We collected
+data both with an `lfence`-based mitigation and load hardening as presented
+here. The summary is that mitigating with load hardening is 1.77x faster than
+mitigating with `lfence`, and the overhead of load hardening compared to a
+normal program is likely between a 10% overhead and a 50% overhead with most
+large applications seeing a 30% overhead or less.
+
+| Benchmark                              | `lfence` | Load Hardening | Mitigated Speedup |
+| -------------------------------------- | -------: | -------------: | ----------------: |
+| Google microbenchmark suite            |   -74.8% |         -36.4% |          **2.5x** |
+| Large server QPS (using ThinLTO & PGO) |   -62%   |         -29%   |          **1.8x** |
+
+Below is a visualization of the microbenchmark suite results which helps show
+the distribution of results that is somewhat lost in the summary. The y-axis is
+a log-scale speedup ratio of load hardening relative to `lfence` (up -> faster
+-> better). Each box-and-whiskers represents one microbenchmark which may have
+many different metrics measured. The red line marks the median, the box marks
+the first and third quartiles, and the whiskers mark the min and max.
+
+![Microbenchmark result visualization](speculative_load_hardening_microbenchmarks.png)
+
+We don't yet have benchmark data on SPEC or the LLVM test suite, but we can
+work on getting that. Still, the above should give a pretty clear
+characterization of the performance, and specific benchmarks are unlikely to
+reveal especially interesting properties.
+
+
+### Future Work: Fine Grained Control and API-Integration
+
+The performance overhead of this technique is likely to be very significant and
+something users wish to control or reduce. There are interesting options here
+that impact the implementation strategy used.
+
+One particularly appealing option is to allow both opt-in and opt-out of this
+mitigation at reasonably fine granularity such as on a per-function basis,
+including intelligent handling of inlining decisions -- protected code can be
+prevented from inlining into unprotected code, and unprotected code will become
+protected when inlined into protected code. For systems where only a limited
+set of code is reachable by externally controlled inputs, it may be possible to
+limit the scope of mitigation through such mechanisms without compromising the
+application's overall security. The performance impact may also be focused in a
+few key functions that can be hand-mitigated in ways that have lower
+performance overhead while the remainder of the application receives automatic
+protection.
+
+For both limiting the scope of mitigation and manually mitigating hot functions,
+there needs to be some support for mixing mitigated and unmitigated code
+without completely defeating the mitigation. For the first use case, it would
+be particularly desirable that mitigated code remains safe when being called
+during misspeculation from unmitigated code.
+
+For the second use case, it may be important to connect the automatic
+mitigation technique to explicit mitigation APIs such as what is described in
+http://wg21.link/p0928 (or any other eventual API) so that there is a clean way
+to switch from automatic to manual mitigation without immediately exposing a
+hole. However, the design for how to do this is hard to come up with until the
+APIs are better established. We will revisit this as those APIs mature.
diff --git a/docs/speculative_load_hardening_microbenchmarks.png b/docs/speculative_load_hardening_microbenchmarks.png

new file mode 100644 (file)

index 0000000..b6f7d05

Binary files /dev/null and b/docs/speculative_load_hardening_microbenchmarks.png differ