Filename: 261-aez-crypto.txt
Title: AEZ for relay cryptography
Author: Nick Mathewson
Created: 28 Oct 2015
Status: Obsolete

0. History

   I wrote the first draft of this around October.  This draft takes a
   more concrete approach to the open questions from last time around.

1. Summary and preliminaries

   This proposal describes an improved algorithm for circuit
   encryption, based on the wide-block SPRP AEZ. I also describe the
   attendant bookkeeping, including CREATE cells, and several
   variants of the proposal.

   For more information about AEZ, see
           http://web.cs.ucdavis.edu/~rogaway/aez/

   For motivations, see proposal 202.

2. Specifications

2.1. New CREATE cell types.

   We add a new CREATE cell type that behaves as an ntor cell but which
   specifies that the circuit will be created to use this mode of
   encryption.

   [TODO: Can/should we make this unobservable?]

   The ntor handshake is performed as usual, but a different PROTOID is
   used:
        "ntor-curve25519-sha256-aez-1"

   To derive keys under this handshake, we use SHAKE256 to derive the
   following output:

     struct shake_output {
         u8 aez_key[48];
         u8 chain_key[32];
         u8 chain_val_forward[16];
         u8 chain_val_backward[16];
     };

   The first two two fields are constant for the lifetime of the
   circuit.

2.2. New relay cell payload

   We specify the following relay cell payload format, to be used when
   the exit node circuit hop was created with the CREATE format in 2.1
   above:

     struct relay_cell_payload {
        u32 zero_1;
        u16 zero_2;
        u16 stream_id;
        u16 length IN [0..498];
        u8 command;
        u8 data[498]; // payload_len - 11
     };

   Note that the payload length is unchanged.  The fields are now
   rearranged to be aligned.  The 'recognized' and 'length' fields are
   replaced with zero_1, zero_2, and the high 7 bits of length, for a
   minimum of 55 bits of unambigious verification.  (Additional
   verification can be done by checking the other fields for
   correctness; AEZ users can exploit plaintext redundancy for
   additional cryptographic checking.)

   When encrypting a cell for a hop that was created using one of these
   circuits, clients and relays encrypt them using the AEZ algorithm
   with the following parameters:

       Let Chain denote chain_val_forward if this is a forward cell
          or chain_val_backward otherwise.

       tau = 0

       # We set tau=0 because want no per-hop ciphertext expansion.  Instead
       # we use redundancy in the plaintext to authenticate the data.

       Nonce =
         struct {
           u64 cell_number;
           u8 is_forward;
           u8 is_early;
         }

       # The cell number is the number of relay cells that have
       # traveled in this direction on this circuit before this cell.
       # ie, it's zero for the first cell, two for the second, etc.
       #
       # is_forward is 1 for outbound cells, 0 for inbound cells.
       # is_early is 1 for cells packaged as RELAY_EARLY, 0 for
       #   cells packaged as RELAY.
       #
       # Technically these two values would be more at home in AD
       # than in Nonce; but AEZ doesn't actually distinguish N and AD
       # internally.

       Define CELL_CHAIN_BYTES = 32

       AD = [ XOR(prev_plaintext[:CELL_CHAIN_BYTES],
                  prev_ciphertext[:CELL_CHAIN_BYTES]),
              Chain ]

       # Using the previous cell's plaintext/ciphertext as additional data
       # guarantees that any corrupt ciphertext received will corrupt the
       # plaintext, which will corrupt all future plaintexts.

       Set Chain = AES(chain_key, Chain) xor Chain.

       # This 'chain' construction is meant to provide forward
       # secrecy.  Each chain value is replaced after each cell with a
       # (hopefully!) hard-to-reverse construction.

   This instantiates a wide-block cipher, tweaked based on the cell
   index and direction.  It authenticates part of the previous cell's
   plaintext, thereby ensuring that if the previous cell was corrupted,
   this cell will be unrecoverable.

3. Design considerations

3.1. Wide-block pros and cons?

   See proposal 202, section 4.

3.2. Given wide-block, why AEZ?

   It's a reasonably fast probably secure wide-block cipher.  In
   particular, it's performance-competitive with AES_CTR, and far better
   than what we're doing now.  See performance appendix.

   It seems secure-ish too.  Several cryptographers I know seem to
   think it's likely secure enough, and almost surely at least as
   good as AES.

   [There are many other competing wide-block SPRP constructions if
   you like.  Many require blocks be an integer number of blocks, or
   aren't tweakable.  Some are slow.  Do you know a good one?]

3.3. Why _not_ AEZ?

   There are also some reasons to consider avoiding AEZ, even if we do
   decide to use a wide-block cipher.

   FIRST it is complicated to implement.  As the specification says,
   "The easiness claim for AEZ is with respect to ease and versatility
   of use, not implementation."

   SECOND, it's still more complicated to implement well (fast,
   side-channel-free) on systems without AES acceleration.  We'll need
   to pull the round functions out of fast assembly AES, which is
   everybody's favorite hobby.

   THIRD, it's really horrible to try to do it in hardware.

   FOURTH, it is comparatively new.  Although several cryptographers
   like it, and it is closely related to a system with a security proof,
   you never know.

   FIFTH, something better may come along.


4. Alternative designs

4.1. Two keys.

   We could have a separate AEZ key for forward and backward encryption.
   This would use more space, however.

4.2. Authenticating things differently

   In computing the AD, we could replace xor with concat.

   In computing the AD, we could replace CELL_CHAIN_BYTES with 16, or
   509.

   (Another thing we might dislike about the current proposal is
   that it appears to requires us to remember 32 bytes of plaintext
   until we get another cell.  But that part is fixable: note that
   in the structure of AEZ, the AD is processed in the AEZ-hash()
   function, and then no longer used.  We can compute the AEZ-hash()
   to be used for the next cell after each cell is en/de crypted.)

4.3. Other hashes.

   We could update the ntor definition used in this to use a better hash
   than SHA256 inside.

4.4. Less predictable plaintext.

   A positively silly option would be to reserve the last X bytes of
   each relay cell's plaintext for random bytes, if they are not used
   for payload.  This might help a little, in a really doofy way.


A.1. Performance notes: memory requirements

  Let's ignore Tor overhead here, but not openssl overhead.

  IN THE CURRENT PROTOCOL, the total memory required at each relay is: 2
  sha1 states, 2 aes states.

  Each sha1 state uses 96 bytes.  Each aes state uses 244 bytes.  (Plus
  32 bytes counter-mode overhead.)  This works out to 704 bytes at each
  hop.

  IN THE PROTOCOL ABOVE, using an optimized AEZ implementation, we'll
  need 128 bytes for the expanded AEZ key schedule.  We'll need another
  244 bytes for the AES key schedule for the chain key.  And there's 32
  bytes of chaining values.  This gives us 404 bytes at each hop, for a
  savings of 42%.

  If we used separate AES and AEZ keys in each direction, we would be
  looking at 776 bytes, for a space increase of 10%.

A.2. Performance notes: CPU requirements on AESNI hosts

  The cell_ops benchmark in bench.c purports to tell us how long it
  takes to encrypt a tor cell.  But it wasn't really telling the truth,
  since it only did one SHA1 operation every 2^16 cells, when
  entries and exits really do one SHA1 operation every end-to-end cell.

  I expanded it to consider the slow (SHA1) case as well.  I ran this on
  my friendly OSX laptop (2.9 GHz Intel Core i5) with AESNI support:

                  Inbound cells: 169.72 ns per cell.
                 Outbound cells: 175.74 ns per cell.
       Inbound cells, slow case: 934.42 ns per cell.
      Outbound cells, slow case: 928.23 ns per cell.


  Note that So for an n-hop circuit, each cell does the slow case and
  (n-1) fast cases at the entry; the slow case at the exit, and the fast
  case at each middle relay.

  So for 3 hop circuits, the total current load on the network is
  roughly 425 ns per hop, concentrated at the exit.

  Then I started messing around with AEZ benchmarks, using the
  aesni-optimized version of AEZ on the AEZ website.  (Further
  optimizations are probably possible.)  For the AES256, I
  used the usual aesni-based aes construction.

  Assuming xor is free in comparison to other operations, and
  CELL_CHAIN_BYTES=32, I get roughly 270 ns per cell for the entire
  operation.

  If we were to pick CELL_CHAIN_BYTES=509, we'd be looking at around 303
  ns per cell.

  If we were to pick CELL_CHAIN_BYTES=509 and replace XOR with CONCAT,
  it would be around 355 ns per cell.

  If we were to keep CELL_CHAIN_BYTES=32, and remove the
  AES256-chaining, I see values around 260 ns per cell.

  (This is all very spotty measurements, with some xors left off, and
  not much effort done at optimization beyond what the default
  optimized AEZ does today.)

A.3. Performance notes: what if we don't have AESNI?

  Here I'll test on a host with sse2 and ssse3, but no aesni instructions.
  From Tor's benchmarks I see:

               Inbound cells: 1218.96 ns per cell.
              Outbound cells: 1230.12 ns per cell.
    Inbound cells, slow case: 2099.97 ns per cell.
   Outbound cells, slow case: 2107.45 ns per cell.

  For a 3-hop circuit, that gives on average 1520 ns per cell.

  [XXXX Do a real benchmark with a fast AEZ backend. First, write one.
   Preliminary results are a bit disappointing, though, so I'll
   need to invetigate alternatives as well.]