Benchmarking and Analysing NIST PQC Lattice-Based Signature Scheme Standards on ARM Cortex M7

James Howe
Senior Research Scientist
CONTENTS

01 Introduction and Motivation
02 Benchmark and Profile Results
03 Constant-Time Issues
01 INTRODUCTION AND MOTIVATION
What are the PQC standards we have?

CRYSTALS-Kyber is the only KEM and CRYSTALS-Dilithium is the primary signature.

“The security of Kyber has been thoroughly analyzed [...] based on a strong framework of results in lattice-based cryptography. Kyber has excellent performance overall in software, hardware and many hybrid settings.”

“Dilithium is a signature scheme with high efficiency, relatively simple implementation, a strong theoretical security basis, and an encouraging cryptanalytic history.”
What are the PQC standards we have?

We also have two other PQ signatures, Falcon and SPHINCS+. This talk focuses on Falcon:

- **Falcon**, also from lattices, with a different performance profile.
  - More complex implementation; emulates or uses an FPU.
  - Offers significantly smaller signature sizes and fast verification.

*Falcon* was chosen for standardization because NIST has confidence in its security (under the assumption that it is correctly implemented) and because its small bandwidth may be necessary in certain applications.

“NIST understands that some applications will not work as they are currently designed if the signature and the data being signed cannot fit in a single internet packet.”

“For this reason, NIST decided to standardize FALCON as well. Given FALCON’s overall better performance when signature generation does not need to be performed on constrained devices, many applications may prefer to use FALCON over Dilithium, even in cases in which Dilithium’s signature size would not be a barrier to implementation.”

Figure 7. Signature Benchmarks on ARM Cortex-M4 processor
Current State on ARM Cortex M4

Without double-precision hardware, Falcon emulates floating-point arithmetic.

Thus we get performance profiles like this on the Cortex M4.

But can we close the gap using a similar device with a full FPU?

We wanted to challenge the belief that Falcon signing is much slower than Dilithium.

This is an important design decision in, e.g., RISC-V CPU and SoC implementations.

Also, does using the FPU raise questions about constant-time execution?
What’s the big deal?
Constant-time and Correctness

01
Emulated floating-point implementation can be done

02
Only using integer operations with `uint32_t` and `uint64_t` types

03
This is constant-time, provided that the underlying platform offers constant-time opcodes for:

• Multiplication of two 32-bit unsigned integers into a 64-bit result.
• Left-shift or right-shift of a 32-bit unsigned integer by a potentially secret shift count in the 0...31 range.
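
The two primitives listed above can be sketched in C as follows. This is a minimal sketch, not Falcon's actual emulation code: the helper names `ct_mul32x32` and `ct_shr32` are hypothetical, and whether the compiled instructions really run in constant time depends on the target core and compiler, so the generated assembly should be inspected.

```c
#include <stdint.h>

/* Hypothetical helper: 32x32 -> 64-bit unsigned multiplication.
 * Assumption: on the target this compiles to a single multiply
 * instruction whose latency does not depend on the operands. */
static inline uint64_t ct_mul32x32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* Hypothetical helper: right shift by a possibly secret count.
 * The count is masked into 0..31 instead of being branched on,
 * so control flow never depends on it. */
static inline uint32_t ct_shr32(uint32_t x, uint32_t n)
{
    return x >> (n & 31u);
}
```

Neither helper branches or indexes memory based on its inputs; any remaining timing variation would come from the hardware itself.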
Why the ARM Cortex M7?

- NIST selected the Cortex M4 as the benchmark MCU, and the Cortex M7 is a very similar core.
- Both have the ARMv7-M architecture.
- The Cortex M7 has all ISA features available in the Cortex M4.
- The M7 has a 6-stage pipeline (vs 3), better memory features, and branch prediction.
- **The M7 has a 64-bit (double-precision) FPU; the M4 has a 32-bit (single-precision) FPU.**
- Falcon requires 53-bit floating-point precision.
- Using floating point is rare in cryptography → side channels?
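
The 53-bit requirement is what rules out a single-precision FPU. A minimal sketch, assuming IEEE 754 `float` and `double` (true on the Cortex M4 and M7); the function names are hypothetical:

```c
#include <assert.h>
#include <float.h>

/* Significand width in bits, including the implicit leading bit. */
int significand_bits_float(void)
{
    assert(FLT_RADIX == 2);   /* the bit counts below assume binary floats */
    return FLT_MANT_DIG;      /* 24 for IEEE 754 binary32 */
}

int significand_bits_double(void)
{
    assert(FLT_RADIX == 2);
    return DBL_MANT_DIG;      /* 53 for IEEE 754 binary64 */
}

/* Returns 1 if a format with `bits` significand bits meets the
 * 53-bit precision requirement stated above. */
int meets_falcon_precision(int bits)
{
    return bits >= 53;
}
```

Only `double` meets the requirement, so a core with a single-precision-only FPU has to fall back to emulation.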

© SB Technology, Inc.
02 BENCHMARKING AND PROFILING
Benchmarking Premise

- We benchmarked both Dilithium and Falcon on ARM Cortex M7.
- Both used open-source implementations, i.e., pqm4.
- Benchmarks took averages over 1000 runs.
- All results henceforth are clock cycles; for timings, see the paper.
- We mainly use the STM32F767ZI NUCLEO-144 development board.
- Using a recent GNU Arm Embedded Toolchain (GCC version 10.2.1 20201103) with the flags:

```
-O2 -mcpu=cortex-m7 -march=armv7e-m+fpv5+fp.dp
```
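
One way to confirm that a build really targets the double-precision FPU is to check the ACLE macro `__ARM_FP`, in which bit 3 (`0x08`) indicates hardware double-precision support. The following is a sketch; the function name `have_hw_double` is hypothetical and not part of pqm4.

```c
/* Returns 1 if the compiler is generating code for a hardware
 * double-precision FPU, 0 otherwise. */
int have_hw_double(void)
{
#if defined(__ARM_FP) && (__ARM_FP & 0x08)
    return 1;
#else
    /* Soft-float build, single-precision-only FPU, or non-Arm target. */
    return 0;
#endif
}
```

If this returns 0 on the target, double-precision operations are handled by software routines and the benchmark would measure emulation rather than the FPU.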
Dilithium Benchmarking (M4 vs M7)

Overall, the performance of Dilithium wasn’t interesting.

Improvements range between 1.09–1.19x

Essentially accounts for the slightly better MCU: Cortex M7 vs the Cortex M4.

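Table 1: Benchmarking results of Dilithium on the ARM Cortex M7 using the STM32F767ZI NUCLEO-144 development board. Cycle counts are in KCycles; the last column gives the average runtime in milliseconds. Rows labelled "M7 vs M4" give the speedup over the Cortex M4.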

<table>
<thead>
<tr>
<th>Parameter Set</th>
<th>Operation</th>
<th>Min</th>
<th>Avg</th>
<th>Max</th>
<th>SDev/SErr</th>
<th>Avg (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dilithium-2</td>
<td>Key Gen</td>
<td>1,390</td>
<td>1,437</td>
<td>1,479</td>
<td>81/3</td>
<td>6.7</td>
</tr>
<tr>
<td>M7 vs M4</td>
<td>Key Gen</td>
<td>1.13x</td>
<td><strong>1.10x</strong></td>
<td>1.06x</td>
<td>-/-</td>
<td><strong>1.40x</strong></td>
</tr>
<tr>
<td>Dilithium-2</td>
<td>Sign</td>
<td>1,835</td>
<td>3,658</td>
<td>16,440</td>
<td>604/17</td>
<td>16.9</td>
</tr>
<tr>
<td>M7 vs M4</td>
<td>Sign</td>
<td>1.19x</td>
<td><strong>1.09x</strong></td>
<td>0.64x</td>
<td>-/-</td>
<td><strong>1.40x</strong></td>
</tr>
<tr>
<td>Dilithium-2</td>
<td>Verify</td>
<td>1,428</td>
<td>1,429</td>
<td>1,432</td>
<td>27.8/0.9</td>
<td>6.6</td>
</tr>
<tr>
<td>M7 vs M4</td>
<td>Verify</td>
<td>1.12x</td>
<td><strong>1.12x</strong></td>
<td>1.12x</td>
<td>-/-</td>
<td><strong>1.42x</strong></td>
</tr>
<tr>
<td>Dilithium-3</td>
<td>Key Gen</td>
<td>2,563</td>
<td>2,566</td>
<td>2,569</td>
<td>37.6/1.2</td>
<td>11.9</td>
</tr>
<tr>
<td>M7 vs M4</td>
<td>Key Gen</td>
<td>1.12x</td>
<td><strong>1.13x</strong></td>
<td>1.12x</td>
<td>-/-</td>
<td><strong>1.44x</strong></td>
</tr>
<tr>
<td>Dilithium-3</td>
<td>Sign</td>
<td>2,981</td>
<td>6,009</td>
<td>26,208</td>
<td>65/9</td>
<td>20.7</td>
</tr>
<tr>
<td>M7 vs M4</td>
<td>Sign</td>
<td>1.12x</td>
<td><strong>1.19x</strong></td>
<td>0.78x</td>
<td>-/-</td>
<td><strong>2.06x</strong></td>
</tr>
<tr>
<td>Dilithium-3</td>
<td>Verify</td>
<td>2,452</td>
<td>2,453</td>
<td>2,456</td>
<td>26.5/0.8</td>
<td>11.4</td>
</tr>
<tr>
<td>M7 vs M4</td>
<td>Verify</td>
<td>1.12x</td>
<td><strong>1.12x</strong></td>
<td>1.11x</td>
<td>-/-</td>
<td><strong>1.43x</strong></td>
</tr>
<tr>
<td>Dilithium-5</td>
<td>Key Gen</td>
<td>4,312</td>
<td>4,368</td>
<td>4,436</td>
<td>54.4/1.7</td>
<td>20.2</td>
</tr>
<tr>
<td>Dilithium-5</td>
<td>Sign</td>
<td>5,020</td>
<td>8,157</td>
<td>35,653</td>
<td>99k/3k</td>
<td>37.8</td>
</tr>
<tr>
<td>Dilithium-5</td>
<td>Verify</td>
<td>4,282</td>
<td>4,287</td>
<td>4,292</td>
<td>46.5/1.5</td>
<td>19.8</td>
</tr>
</tbody>
</table>
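
The millisecond column follows from the cycle counts once the core clock is fixed. A minimal sketch, assuming a 216 MHz clock: this is an assumption, since the slides do not state the clock, but 216 MHz is the maximum for the STM32F767ZI and reproduces the 6.7 ms reported for Dilithium-2 key generation. The function name is hypothetical.

```c
#include <assert.h>

/* Convert a cycle count in KCycles to milliseconds.
 * 1 KCycle = 1e3 cycles and 1 MHz = 1e3 cycles per millisecond,
 * so the two factors of 1e3 cancel. */
double kcycles_to_ms(double kcycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return kcycles / clock_mhz;
}
```

For example, `kcycles_to_ms(1437.0, 216.0)` gives roughly 6.65 ms, which rounds to the 6.7 ms in the first row.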
Benchmarking Results (FPU vs EMU on M7)

Falcon sees a drastic speedup, as expected.

Improvements range between >6–8x overall.

Key generation is least impacted, with a >1.5x speedup overall.

Signing times show the largest improvements:
- Sign dynamic: >6x speedup, close to Dilithium performance.
- Sign tree: >4.5x speedup, comfortably faster than Dilithium.

Verify is not impacted, as it doesn’t require floats.
Benchmarking (Dilithium vs Falcon)

Comparing Dilithium and Falcon now shows a very different performance profile.

Falcon-512 is now slightly faster than Dilithium-2, for both signing and signing+verify runtimes.

Falcon-1024 is also slightly faster than Dilithium-5 for signing, and much faster when verify is included.
Profiling Falcon (M4 vs M7)

Performance improvements inside Falcon:

For key generation:
- iFFT/FFT multiplication 16x improved.
- Going from 10M to 0.5M cycles.

For both signing modes:
- Fast Fourier sampling >5x improved.
- Going from 16M to <3M cycles.

Verify times were unchanged.

Expand private key improved 12x, going from 11M to <1M cycles.
03 CONSTANT OR ISOCHRONOUS RUNTIME
Constant-Time Validation

Floating-point arithmetic is rare in cryptography! Thus we thought it was worth looking at...

We used inline assembly to:

- Minimize unwanted compiler optimizations, declaring clobbered registers where necessary.
- Minimize the effect of surrounding instructions on the operations of interest, which occurred when we tried using C.
- Ensure that all execution is from cache.

This example is for double-precision multiplication, i.e., `vmul.f64`; the same harness is repeated for each instruction.

We tested 4 STM32 development boards.

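```asm
__asm__ volatile (
    "vldr d5, %2\n"          // load first random operand
    "vldr d6, %3\n"          // load second random operand
    "dmb\n"                  // complete outstanding memory accesses
    "isb\n"                  // flush the pipeline before timing
    "ldr r1, %1\n"           // read the DWT cycle counter (start)
    "vmul.f64 d4, d5, d6\n"  // instruction under test, repeated
    "vmul.f64 d4, d5, d6\n"
    "vmul.f64 d4, d5, d6\n"
    "vmul.f64 d4, d5, d6\n"
    "vmul.f64 d4, d5, d6\n"
    "vmul.f64 d4, d5, d6\n"
    "vmul.f64 d4, d5, d6\n"
    "vmul.f64 d4, d5, d6\n"
    "vmul.f64 d4, d5, d6\n"
    "vmul.f64 d4, d5, d6\n"
    "vmul.f64 d4, d5, d6\n"
    "ldr r2, %1\n"           // read the DWT cycle counter (end)
    "subs %0, r2, r1\n"      // cycles = end - start
    : "=r"(cycles)
    : "m"(DWT->CYCCNT), "m"(r1), "m"(r2)  // here r1, r2 are the C variables holding the doubles
    : "r1", "r2", "d4", "d5", "d6", "cc"  // "cc" because subs updates the flags
);
```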
Constant-Time Validation

The assembly code uses **two random inputs** for each function.

We found **timing issues** in all double-precision FPU instructions across all 4 STM32 boards.

For addition (`vadd.f64`), runtimes averaged **16 clock cycles, with a standard deviation of 4.1**.

If we generated random values in the same range, such that they had the same exponents, the **runtimes were constant and consistent at 10 clock cycles**.

Moreover, when we mixed randomness from two fixed exponent ranges we observed **constant and consistent runtimes of 19 clock cycles**.
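
Inputs with a fixed exponent, as used in the experiments above, could be produced along the following lines. This is a sketch under assumptions, not the harness from the paper: `double_with_exponent` is a hypothetical helper, `rnd` stands for 64 random bits from whatever RNG the board provides, and IEEE 754 binary64 layout is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build a positive, normal double with the given unbiased exponent
 * and a random 52-bit mantissa taken from the low bits of `rnd`.
 * The result lies in [2^exponent, 2^(exponent+1)). */
double double_with_exponent(uint64_t rnd, int exponent)
{
    assert(exponent >= -1022 && exponent <= 1023);  /* normal range only */
    uint64_t biased = (uint64_t)(exponent + 1023);  /* binary64 exponent bias */
    uint64_t bits = (biased << 52) | (rnd & 0x000FFFFFFFFFFFFFULL);
    double d;
    memcpy(&d, &bits, sizeof d);  /* well-defined way to reinterpret the bits */
    return d;
}
```

Calling this twice with the same `exponent` gives operand pairs in the same range; drawing the two operands with different exponents gives the mixed case.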
Constant-Time Validation

We also tested the ARM Cortex A53, as a previous paper uses the Raspberry Pi 3.

We found an issue when casting from `double` to `int64_t`, an operation that rounds towards zero.

There is no native instruction to do this on ARMv7.

The software routine used instead can be non-constant-time.

In LLVM, it is not constant-time, and it leaks the sign.

We reported this to the Falcon team and proposed the following fix.

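```c
#include <stdint.h>

/* Converts a double to int64_t, rounding towards zero, without
 * branching on the sign. */
int64_t cast(double a) {
    union {
        double d;
        uint64_t u;
        int64_t i;
    } x;
    uint64_t mask;
    uint32_t high, low;
    x.d = a;
    /* All ones if a is negative, zero otherwise
     * (assumes an arithmetic right shift on signed values). */
    mask = x.i >> 63;
    /* Clear the sign bit so that x.d holds |a|. */
    x.u &= 0x7fffffffffffffff;
    /* High 32 bits: |a| / 0x1p32f. */
    high = x.d / 4294967296.f;
    /* Low 32 bits: |a| - high * 0x1p32f. */
    low = x.d - (double)high * 4294967296.f;
    x.u = (((uint64_t)high << 32) | low);
    /* Select x.u when a >= 0 and -x.u when a < 0. */
    return (x.u & ((uint64_t)-1 - mask))
        | ((-x.u) & mask);
}
```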
Takeaways

1. Falcon is super fast on the Cortex M7.
2. Unknown if timing issues can be exploited.
3. Users should consider this thoroughly for each use case.

For example,

Cloudflare currently recommend using Falcon in offline situations.