Filename: 302-padding-machines-for-onion-clients.txt
Title: Hiding onion service clients using padding
Author: George Kadianakis, Mike Perry
Created: Thursday 16 May 2019
Status: Closed
Implemented-In: 0.4.1.1-alpha
NOTE: Please look at section 3 of padding-spec.txt now, not this document.
0. Overview
Tor clients use "circuits" to do anonymous communications. There are various
types of circuits. Some of them are for navigating the normal Internet,
others are for fetching Tor directory information, others are for connecting
to onion services, while others are simply for measurements and testing.
It's currently possible for MITM type of adversaries (like tor-network-level
and local-area-network adversaries) to distinguish Tor circuit types from
each other using a wide array of metadata and distinguishers.
In this proposal, we study various techniques that can be used to
distinguish client-side onion service circuits and provide WTF-PAD circuit
padding machines (using prop#254) to hide them against certain adversaries.
1. Motivation
We are writing this proposal for various reasons:
1) We believe that in an ideal setting MITM adversaries should not be able
to distinguish circuit types by inspecting traffic. Tor traffic should
look amorphous to an outside observer to maximize uncertainty and
anonymity properties.
Client-side onion service circuits are an easy target for this proposal,
because we believe we can improve their privacy with low bandwidth
overhead.
2) We want to start experimenting with the WTF-PAD subsystem of Tor, and
this use-case provides us with a good testbed.
3) We hope that by actually starting to use the WTF-PAD subsystem of Tor, we
will encourage more researchers to start experimenting with it.
2. Scope of the proposal [SCOPE]
Given the above, this proposal sets forth to use the WTF-PAD system to hide
client-side onion service circuits against the classifiers of paper by Kwon
et al. above.
By client-side onion service circuits we refer to these two types of circuits:
- Client-side introduction circuits: Circuit from client to the introduction point
- Client-side rendezvous circuits: Circuit from client to the rendezvous point
Service-side onion service circuits are not in scope for this proposal, and
this is because hiding those would require more bandwidth and also more
advanced WTF-PAD features.
Furthermore, this proposal only aims to cloak the naive distinguishing
features mentioned in the [KNOWN_DISTINGUISHERS] section, and can by no
means guarantee that client-side onion service circuits are totally
indistinguishable by other means.
The machines specified in this proposal are meant to be lightweight and
created for a specific purpose. This means that they can be easily extended
with additional states to do more advanced hiding.
3. Known distinguishers against onion service circuits [KNOWN_DISTINGUISHERS]
Over the past years it's been assumed that motivated adversaries can
distinguish onion-service traffic from normal Tor traffic given their
special characteristics.
As far as we know, there has been relatively little research-level work done
to this direction. The main article published in this area is the USENIX
paper "Circuit Fingerprinting Attacks: Passive Deanonymization of Tor Hidden
Services" by Kwon et al. [0]
The above paper deals with onion service circuits in sections 3.2 and 5.1.
It uses the following three "naive" circuit features to distinguish circuits:
1) Circuit construction sequence
2) Number of incoming and outgoing cells
3) Duration of Activity ("DoA")
All onion service circuits have particularly loud signatures to the above
characteristics, but WTF-PAD (prop#254) gives us tools to effectively
silence those signatures to the point where the paper's classifiers won't
work.
4. Hiding circuit features using WTF-PAD
According to section [KNOWN_DISTINGUISHERS] there are three circuit features
we are attempting to hide. Here is how we plan to do this using the WTF-PAD
system:
1) Circuit construction sequence
The USENIX paper uses the directions of the first 10 cells sent in a
circuit to fingerprint them. Client-side onion service circuits have
unique circuit construction sequences and hence they can be fingeprinted
using just the first 10 cells.
We use WTF-PAD to destroy this feature of onion service circuits by
carefully sending padding cells (relay DROP cells) during circuit
construction and making them look exactly like most general tor circuits
up till the end of the circuit construction sequence.
2) Number of incoming and outgoing cells
The USENIX paper uses the amount of incoming and outgoing cells to
distinguish circuit types. For example, client-side introduction circuits
have the same amount of incoming and outgoing cells, whereas client-side
rendezvous circuits have more incoming than outgoing cells.
We use WTF-PAD to destroy this feature by changing the number of cells
sent in introduction circuits. We leave rendezvous circuits as is, since
the actual rendezvous traffic flow usually resembles well normal Tor
circuits.
3) Duration of Activity ("DoA")
The USENIX paper uses the period of time during which circuits send and
receive cells to distinguish circuit types. For example, client-side
introduction circuits are really short lived, wheras service-side
introduction circuits are very long lived. OTOH, rendezvous circuits have
the same median lifetime as general Tor circuits which is 10 minutes.
We use WTF-PAD to destroy this feature of client-side introduction
circuits by setting a special WTF-PAD option, which keeps the circuits
open for 10 minutes completely mimicking the DoA of general Tor circuits.
4.1. A dive into general circuit construction sequences [CIRCCONSTRUCTION]
In this section we give an overview of how circuit construction looks like
to a network or guard-level adversary. We use this knowledge to make the
right padding machines that can make intro and rend circuits look like these
general circuits.
In particular, most general Tor circuits used to surf the web or download
directory information, start with the following 6-cell relay cell sequence (cells
surrounded in [brackets] are outgoing, the others are incoming):
[EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [BEGIN] -> CONNECTED
When this is done, the client has established a 3-hop circuit and also
opened a stream to the other end. Usually after this comes a series of DATA
cell that either fetches pages, establishes an SSL connection or fetches
directory information:
[DATA] -> [DATA] -> DATA -> DATA
The above stream of 10 relay cells defines the grand majority of general
circuits that come out of Tor browser during our testing, and it's what we
are gonna use to make introduction and rednezvous circuits blend in.
Please note that in this section we only investigate relay cells and not
connection-level cells like CREATE/CREATED or AUTHENTICATE/etc. that are
used during the link-layer handshake. The rationale is that connection-level
cells depend on the type of guard used and are not an effective fingerprint
for a network/guard-level adversary.
5. WTF-PAD machines
For the purposes of this proposal we will make use of four WTF-PAD machines
as follows:
- Client-side introduction circuit hiding machine (origin-side)
- Client-side introduction circuit hiding machine (relay-side)
- Client-side rendezvous circuit hiding machine (origin-side)
- Client-side rendezvous circuit hiding machine (relay-side)
In the following sections we will analyze these machines.
5.1. Client-side introduction circuit hiding machines [INTRO_CIRC_HIDING]
These two machines are meant to hide client-side introduction circuits. The
origin-side machine sits on the client and sends padding towards the
introduction circuit, whereas the relay-side machine sits on the middle-hop
(second hop of the circuit) and sends padding towards the client. The
padding from the origin-side machine terminates at the middle-hop and does
not get forwarded to the actual introduction point.
Both of these machines only get activated for introduction circuits, and
only after an INTRODUCE1 cell has been sent out.
This means that before the machine gets activated our cell flow looks like this:
[EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [INTRODUCE1]
Comparing the above with section [CIRCCONSTRUCTION], we see that the above
cell sequence matches the one from general circuits up to the first 7 cells.
However, in normal introduction circuits this is followed by an
INTRODUCE_ACK and then the circuit gets teared down, which does not match
the sequence from [CIRCCONSTRUCTION].
Hence when our machine is used, after sending an [INTRODUCE1] cell, we also
send a [PADDING_NEGOTIATE] cell, which gets answered by a PADDING_NEGOTIATED
cell and an INTRODUCE_ACKED cell. This makes us match the [CIRCCONSTRUCTION]
sequence up to the first 10 cells.
After that, we continue sending padding from the relay-side machine so as to
fake a directory download, or an SSL connection setup. We also want to
continue sending padding so that the connection stays up longer to destroy
the "Duration of Activity" fingerprint.
To calculate the padding overhead, we see that the origin-side machine just
sends a single [PADDING_NEGOATIATE] cell, wheras the origin-side machine
sends a PADDING_NEGOTIATED cell and between 7 to 10 DROP cells. This means
that the average overhead of this machine is 11 padding cells.
In terms of WTF-PAD terminology, these machines have three states (START,
OBF, END). They move from the START to OBF state when the first
non-padding cell is received on the circuit, and they stay in the OBF
state until all the padding gets depleted. The OBF state is controlled by
a histogram which specifies the parameters described in the paragraphs
above. After all the padding finishes, it moves to END state.
We also set a special WTF-PAD flag which keeps the circuit open even after
the introduction is performed. In particular, with this feature the circuit
will stay alive for the same durations as normal web circuits before they
expire (usually 10 minutes).
5.2. Client-side rendezvous circuit hiding machines
The rendezvous circuit machines apply on client-side rendezvous circuits and
only after the rendezvous point has been established (REND_ESTABLISHED has
been received). Up to that point, the following cell sequence has been
observed on the circuit:
[EXTEND2] -> EXTENDED2 -> [EXTEND2] -> EXTENDED2 -> [ESTABLISH_REND] -> REND_ESTABLISHED
which matches the general circuit construction sequence [CIRCCONSTRUCTION]
up to the first 6 cells. However after that, normal rendezvous circuits
receive a RENDEZVOUS2 cell followed by a [BEGIN] and a CONNECTED, which does
not fit the circuit construction sequence we are trying to imitate.
Hence our machine gets activated right after REND_ESTABLISHED is received,
and continues by sending a [PADDING_NEGOTIATE] and a [DROP] cell, before
receiving a PADDING_NEGOTIATED and a DROP cell, effectively blending into
the general circuit construction sequence on the first 10 cells.
After that our machine gets deactivated, and we let the actual rendezvous
circuit shape the traffic flow. Since rendezvous circuits usually immitate
general circuits (their purpose is to surf the web), we can expect that they
will look alike.
In terms of overhead, this machine is quite light. Both sides send 2 padding
cells, for a total of 4 padding cells.
6. Overhead analysis
Given the parameters above, intro circuit machines have an overhead of 11
padding cells, and rendezvous circuit machines have an overhead of 4
cpadding ells. . This means that for every intro and rendezvous circuit
there will be an overhead of 15 padding cells in average, which is about
7.5kb.
In the PrivCount paper [1] we learn that the Tor network sees about 12
million successful descriptor fetches per day. We can use this figure to
assume that the Tor network also sees about 12 million intro and rendezvous
circuits per day. Given the 7.5kb overhead of each of these circuits, we get
that our padding machines infer an additional 94GB overhead per day on the
network, which is about 3.9GB per hour.
XXX Isn't this kinda intense????? Using the graphs from metrics we see that
the Tor network has total capacity of 300 Gbit/s which is about 135000GB per
hour, so 3.9GB per hour is not that much, but still...
7. Discussion
7.1. Alternative approaches
These machines try to hide onion service client-side circuits by obfuscating
their looks. This is a reasonable approach, but if the resulting circuits
look unlike any other Tor circuits, they would still be fingerprintable just
by that fact.
Another approach we could take is make normal client circuits look like
onion service circuits, or just make normal clients establish fake onion
service circuits periodically. The hope here is that the adversary won't be
able to distinguish fake onion service circuits from real ones. This
approach has not been taken yet, mainly because it requires additional
WTF-PAD features and poses greater overhead risks.
7.2. Future work
As discussed in [SCOPE], this proposal only aims to hide some very specific
features of client-side onion service circuits. There is lots of work to be
done here to see what other features can be used to distinguish such
circuits, and also what other classifiers can be built using deep learning
and whatnot.
A. Acknowledgements
This research was supported by NSF grants CNS-1526306 and CNS-1619454.
---
[0]: https://www.usenix.org/node/190967
https://blog.torproject.org/technical-summary-usenix-fingerprinting-paper
[1]: "Understanding Tor Usage with Privacy-Preserving Measurement"
by Akshaya Mani, T Wilson-Brown, Rob Jansen, Aaron Johnson, and Micah Sherr
In Proceedings of the Internet Measurement Conference 2018 (IMC 2018).