15:58:34 <meskio> #startmeeting tor anti-censorship meeting
15:58:34 <MeetBot> Meeting started Thu Sep  8 15:58:34 2022 UTC.  The chair is meskio. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:58:34 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
15:58:39 <meskio> Hello everybody!!
15:58:44 <cohosh> hi
15:58:52 <meskio> here is our meeting pad: https://pad.riseup.net/p/tor-anti-censorship-keep
15:59:01 <meskio> feel free to add what you've been working on and put items on the agenda
15:59:53 <shelikhoo> hi~
16:00:00 <meskio> First of all an announcement: there will not be a meeting next week, some of us will be at the Tor meeting, talking in person :)
16:00:09 <meskio> next meeting will be Sept 22
16:01:27 <meskio> shelikhoo: I kept the point about webtunnel, is there anything to talk about that? or should we move on?
16:02:17 <shelikhoo> the next step that needs to be discussed is how to get rdsys to distribute connection info
16:02:33 <shelikhoo> I would like to discuss this part
16:02:40 <meskio> +1
16:04:49 <meskio> can we just use goptlib to add the needed params to the bridge descriptors so rdsys discovers them?
16:05:12 <anadahz> (hi)
16:06:09 <shelikhoo> yes, and is this info sent to rdsys only?
16:07:12 <meskio> AFAIK it gets sent to the bridge authority and the bridge authority sends it to rdsys, metrics.tpo will clean it up before publishing it
16:08:44 <shelikhoo> yes, just to confirm this info is not going to be something one can get from tor in any other way...
16:09:59 <meskio> I think you need to use SmethodArgs from goptlib, but you can test it by creating a bridge and looking into what gets published in metrics.tpo and what you have in polyanthum
16:10:35 <shelikhoo> yes, or just try to get info from existing obfs4 bridge...
16:10:49 <meskio> +1
16:12:00 <meskio> do we have a plan here?
16:12:41 <shelikhoo> The issue is that tor is quite complex, and I fear I could miss something
16:13:06 <cohosh> you're worried about the parameters getting leaked?
16:13:34 <shelikhoo> yes, I am unsure who will be able to receive these parameters
16:13:57 <shelikhoo> since I am not familiar with the design of other parts of tor
16:13:58 <dcf1> use SmethodArgs, that's what obfs4proxy does to distribute its secret credentials
16:14:03 <cohosh> and the consequence of them getting leaked is the bridge could be discovered and blocked?
16:14:09 <shelikhoo> yes
16:14:18 <cohosh> okay fwiw obfs4 has the same threat model
16:14:25 <shelikhoo> it will contain the domain name
16:14:32 <shelikhoo> which can be blocked
16:14:41 <shelikhoo> okay, that is reassuring
16:15:17 <shelikhoo> I think in this case, the client will pass this info via cmd, which will then be sent to tor
16:15:21 <dcf1> metrics removes such extra parameters before publishing:
16:15:23 <dcf1> https://metrics.torproject.org/bridge-descriptors.html#transport
16:15:53 <dcf1> shelikhoo: using SOCKS args (i.e. key=val in a bridge line) is generally better and more flexible than cmd line arguments
16:16:18 <dcf1> oh sorry, you're talking about server side, never mind
16:16:26 <shelikhoo> dcf1: these are for the server... yes
16:16:43 <dcf1> You can use ServerTransportOptions in torrc then
16:17:16 <dcf1> Then get them from Bindaddr.Args
16:17:32 <dcf1> TOR_PT_SERVER_TRANSPORT_OPTIONS is how it works internally if you want to read about it
16:17:45 <shelikhoo> yes! I think this is better than cmd
16:17:58 <shelikhoo> let's do it this way
16:18:01 <meskio> that is nice :)
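[Editor's note: a minimal sketch of the TOR_PT_SERVER_TRANSPORT_OPTIONS format discussed above, per pt-spec.txt: semicolon-separated `<transport>:<key>=<value>` entries, with literal `:`, `;`, `=`, and `\` backslash-escaped. This is an illustration only; a real PT would get the parsed options from a library like goptlib (`Bindaddr.Args` after `pt.ServerSetup`) rather than parsing the variable itself.]

```python
# Sketch: parse TOR_PT_SERVER_TRANSPORT_OPTIONS (see pt-spec.txt).
# Format: semicolon-separated "<transport>:<key>=<value>" entries;
# ':', ';', '=', '\' inside values are backslash-escaped.
# Assumes well-formed input; real PTs should use goptlib instead.

def _split(s, sep, maxsplit=-1):
    """Split s on unescaped sep, preserving backslash escapes in the parts."""
    parts, cur, i = [], "", 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            cur += s[i:i + 2]       # keep the escape pair intact for now
            i += 2
        elif s[i] == sep and maxsplit != 0:
            parts.append(cur)
            cur = ""
            if maxsplit > 0:
                maxsplit -= 1
            i += 1
        else:
            cur += s[i]
            i += 1
    parts.append(cur)
    return parts

def _unescape(s):
    """Resolve backslash escapes to their literal characters."""
    out, i = "", 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out += s[i + 1]
            i += 2
        else:
            out += s[i]
            i += 1
    return out

def parse_server_transport_options(env_value):
    """Return {transport: {key: value}} from the env var's value."""
    opts = {}
    for entry in _split(env_value, ";"):
        transport, kv = _split(entry, ":", 1)
        key, value = _split(kv, "=", 1)
        opts.setdefault(_unescape(transport), {})[_unescape(key)] = _unescape(value)
    return opts
```

For example, `ServerTransportOptions scramblesuit key=banana` in torrc reaches the PT process as `scramblesuit:key=banana` in this variable.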
16:18:31 <dcf1> One limitation is that you cannot run multiple instances of the same transport with different options... but that's a limitation of torrc syntax
16:19:27 <shelikhoo> this should be fine for us... I think most users will only run one webtunnel managed by one tor
16:19:50 <meskio> you can always have several tor processes if you really have that usecase...
16:19:59 <meskio> with different torrc files
16:20:30 <dcf1> meskio: you have to run separate instances in practice anyway, for accurate metrics
16:20:40 <meskio> yep
16:21:37 <meskio> should we move on to the next topic?
16:22:06 <shelikhoo> nothing more from me on this topic
16:22:16 <meskio> "Proposal for outreachy"
16:22:41 <meskio> I have submitted a proposal to do some work around other distributors for gettor at Outreachy
16:22:43 <meskio> https://gitlab.torproject.org/tpo/team/-/issues/67#note_2834285
16:23:06 <meskio> outreachy is an internship program focused on underrepresented people
16:23:25 <meskio> so we might have an intern Dec-Feb
16:23:47 <meskio> I will mentor them, but I'll be happy to get any help there
16:23:59 <cohosh> nice :)
16:24:09 <meskio> or if someone wants to co-mentor this is possible officially and I'll be happy to share the load
16:24:49 <meskio> we might see in the coming months contributions arriving to rdsys from candidates, I'll need to create more newcomers tickets for that...
16:26:02 <meskio> I think we can move to the next topic
16:26:24 <shelikhoo> yes
16:26:30 <meskio> "A new format for placeholder addresses in PT bridge lines"
16:26:37 <dcf1> okay, quick note about placeholder addresses
16:27:12 <dcf1> we have been using placeholder addresses with incrementing port numbers :1, :2, :3, etc., because tor requires all PT bridges to have different IP:port, or it gets them confused
16:27:56 <dcf1> this causes a problem when tor is configured with ReachableAddresses or FascistFirewall, because tor thinks the PT is going to try to actually make a TCP connection to those placeholder addresses
16:28:22 <dcf1> and it says "port 1 is not one of the ports permitted by FascistFirewall, therefore I will not attempt this bridge connection"
16:28:35 <dcf1> imo it's a bug in core tor, but it's WONTFIX for years now
16:29:15 <meskio> it might not get fixed until arti
16:29:23 <dcf1> so the proposal is to move this counter into the IP address and always use port 80 for placeholder addresses, to make them more likely to work with ReachableAddresses and FascistFirewall
16:30:49 <dcf1> Way back in flash proxy, which also needed a placeholder address, I tried to make the placeholder look as different from an actual usable IP address as possible, to reduce confusion
16:31:20 <dcf1> since then, the placeholders have been slowly morphing to look more and more like real IP addresses :)
16:31:38 <dcf1> the progression was something like:
16:31:57 <dcf1> 0.0.0.0:0 <- no good, tor uses the all-zero address as a sentinel internally
16:32:09 <dcf1> 0.0.0.1:1 <- no good, 0.0.0.X is used by SOCKS
16:32:35 <dcf1> 0.0.1.0:1 <- we used this for a while, but it ran into problems with 0/8 being considered "internal" by tor
16:32:59 <dcf1> 192.0.2.1:1 <- what we use now, using a special non-routable IP range reserved for documentation
16:33:13 <dcf1> 192.0.2.1:80 <- new proposal
16:34:20 <dcf1> there may not be much to discuss, if it seems okay I'm planning to open a merge request in tor-browser-build
16:34:54 <shelikhoo> is it possible for us to use an IPv6 address in this role?
16:35:10 <shelikhoo> in this way, we could just randomly generate an address
16:35:39 <dcf1> IPv6 is an interesting idea, I hadn't thought of that. It must be possible, because there is an IPv6 default obfs4 bridge
16:35:40 <shelikhoo> without the need to make sure different bridges have different "address"
16:36:24 <dcf1> I am not sure randomly generated is a good idea, though. One reason for using "artificial looking" placeholder addresses is to reduce the risk in case tor tries to do anything with the address, for example connect to it
16:37:05 <shelikhoo> there are some reserved IP block in IPv6 we could use as a prefix
16:37:12 <dcf1> tor did, in fact, have a bug, where if you configured a "Bridge snowflake" line, but did not configure a "ClientTransportPlugin snowflake" line, it would initiate a direct TCP connection to the placeholder IP address!
16:37:55 <dcf1> I'm not sure if that bug has been fixed, but in that case, the decision to use a non-routable address like 0.0.3.0 was a good one, it prevented tor from making random outgoing TCP connections in the case of a misconfiguration
16:38:30 <shelikhoo> Yes, there should be similar addresses in IPv6 as well
16:38:35 <dcf1> shelikhoo: but yes, you are right, IPv6 could give some more flexibility
16:38:39 <shelikhoo> yes
16:39:19 <dcf1> okay well I will propose a merge request with the 192.0.2.(16(n−1)+t):80 format, it's good enough for our needs for now
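[Editor's note: a sketch of the proposed placeholder scheme. The reading of n as a 1-based bridge instance and t as a 1-based transport index is an assumption from the formula in the discussion; 192.0.2.0/24 is TEST-NET-1, reserved for documentation and never routed.]

```python
# Sketch of the proposed placeholder-address scheme:
# 192.0.2.(16*(n-1) + t):80, always port 80 so tor configs restricted by
# ReachableAddresses/FascistFirewall to common ports still accept it.
# n = bridge instance (1-based), t = transport index (1-based): this
# interpretation is an assumption for illustration.

def placeholder_address(n, t):
    host = 16 * (n - 1) + t
    if not 1 <= host <= 254:          # stay inside one /24
        raise ValueError("placeholder outside 192.0.2.0/24")
    return "192.0.2.%d:80" % host
```

With 16 transport slots per bridge, the /24 holds 16 bridge instances, which matches the "16 bridges, should be fine for a while" remark below.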
16:39:43 <shelikhoo> yes...
16:39:55 <meskio> yes, I think it's fine for now, but we should keep an eye on it because there is not a huge space for snowflake bridges, and maybe one day we want to revisit it
16:39:58 <dcf1> an IPv6 placeholder would need more testing, for example I would be wary of tor internally doing something like, "no IPv6 interfaces detected, therefore I will not use this bridge that has an IPv6 address"
16:40:46 <meskio> (16 bridges, should be fine for a while)
16:41:01 <shelikhoo> yes... I hope in arti the anti-censorship part can be taken into consideration in the initial design
16:41:08 <dcf1> I did notice in a ticket about PT support in arti there is a line about "transports that don't use bridge addresses", so maybe this is something that will be less of a problem in the future
16:41:44 <meskio> AFAIK the arti team knows that this kind of bridges exist
16:42:25 <meskio> anything else on this topic?
16:42:34 <dcf1> that's all from me
16:42:37 <shelikhoo> nothing from me
16:42:49 <meskio> any other topics before the reading group?
16:43:18 <dcf1> https://gitlab.torproject.org/tpo/core/arti/-/blob/main/doc/BridgeIssues.md is the doc I was thinking of
16:44:17 <dcf1> "Problem 6: Existing bridge-line format"
16:44:32 <dcf1> "Make addresses optional"
16:44:56 <meskio> nice :)
16:45:11 <dcf1> Okay I'll kick off the reading group
16:45:16 <meskio> thanks
16:45:30 <dcf1> Our paper is "An Empirical Analysis of Plugin-Based Tor Traffic over SSH Tunnel" https://ieeexplore.ieee.org/document/9020938
16:46:47 <dcf1> It is a bit difficult to tease out exactly what this paper is about and what experiments were done, it's sort of scattered
16:47:31 <dcf1> Briefly, it's about looking at various properties of Tor pluggable transports (FTE, meek, obfs3, ScrambleSuit, obfs4), when those transports are configured to use an SSH upstream proxy before reaching their bridge
16:47:40 <dcf1> The topology is like:
16:48:12 <dcf1> client ---> SSH proxy ---> Tor guard ---> Tor middle ---> Tor exit ---> dest
16:48:22 <dcf1> They do 3 different experiments:
16:48:53 <dcf1> 1. distinguish different PTs (inside SSH) from each other, and from "normal" SSH
16:49:21 <dcf1> 2. distinguish different application protocols inside SSH-tunneled obfs4 (for this experiment they use only obfs4)
16:50:24 <dcf1> 3. do ML-based traffic correlation on the PT traffic -- for this experiment only, they do not look at the SSH link, but rather the links between the SSH proxy and Tor guard, and between the Tor exit and dest
16:51:01 <dcf1> as far as I could tell, experiment 3 does not use the SSH part of the topology at all, so I don't see why it matters, though I suppose having passed through an SSH tunnel could have some effect on traffic features
16:51:51 <dcf1> they do show a graph (Fig. 2) that shows SSH-tunneled obfs4 being differently shaped from plain obfs4 (though this graph is suspect for reasons I will get into)
16:52:44 <dcf1> BTW see Table I (page 4) for a breakdown of the experiments. First 6 rows are experiment #1 as I have called it, next 5 rows are experiment #2, last 3 rows are experiment #3.
16:53:24 <dcf1> The most interesting thing to me, I think, is the feature selection in section III-B
16:54:00 <dcf1> they borrowed features from the paper "Deciphering malware's use of TLS (without decryption)", which is a pretty well-known and foundational work on encrypted traffic analysis (it was a precursor to a product called ETA by cisco)
16:54:23 <dcf1> I gave a rump session talk on this line of research at PETS 2017: https://www.bamsoftware.com/talks/pets-2017-menace/index.html
16:54:33 * eta by cisco
16:55:31 <dcf1> There is a probability distribution over all byte values (256 features), plus binned packet sizes and interarrival times, 456 features in total
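[Editor's note: a sketch of the feature vector described above: a 256-bin distribution over payload byte values plus binned packet sizes and interarrival times, 456 features in total. The split into 100 size bins and 100 time bins, and the bin ranges, are assumptions for illustration; the paper follows the features of "Deciphering malware's use of TLS (without decryption)".]

```python
# Sketch of a 456-feature flow vector: 256 byte-value features plus
# 100 packet-size bins and 100 interarrival-time bins (bin counts and
# ranges are illustrative assumptions, not the paper's exact values).

def byte_distribution(payload):
    """Normalized histogram over the 256 possible byte values."""
    counts = [0] * 256
    for b in payload:
        counts[b] += 1
    total = len(payload) or 1
    return [c / total for c in counts]

def binned(values, lo, hi, nbins):
    """Normalized histogram of values over nbins equal-width bins."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        i = min(int((v - lo) / width), nbins - 1)
        counts[max(i, 0)] += 1
    total = len(values) or 1
    return [c / total for c in counts]

def flow_features(payload, pkt_sizes, interarrivals):
    """256 byte features + 100 size bins + 100 time bins = 456."""
    return (byte_distribution(payload)
            + binned(pkt_sizes, 0, 1500, 100)      # packet sizes (bytes)
            + binned(interarrivals, 0, 1.0, 100))  # interarrival times (s)
```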
16:56:04 <dcf1> They also use the cisco tool Joy to separate pcaps into upstream and downstream flows: https://github.com/cisco/joy
16:57:24 <dcf1> For experiment #3, the correlation experiment, they don't use the 256 byte distribution features, because there they are looking at obfs4, not SSH, and the uniform distribution doesn't provide any useful information
16:58:16 <dcf1> They run all these features through some Scikit-learn classifiers, and get high accuracy, as usual with papers of this kind
16:58:41 <dcf1> (though what they call low false positive rates are pretty high: 1-3%!)
16:59:12 <cohosh> (especially considering a low base rate)
16:59:29 <dcf1> I have some critical comments about the paper, but I will let anyone else chime in
16:59:41 <meskio> I find it weird they pick SSH proxying, is proxying tor over ssh actually common?
17:00:07 <dcf1> yes that stood out to me too
17:00:37 <dcf1> the introduction (and throughout) shows a fairly poor understanding of pluggable transports and why people might use them
17:00:46 <meskio> as you say I was not expecting that using ssh or not make any difference to categorize tor traffic
17:00:56 <cohosh> every once in a while we get users asking us how to do something like this and we try to dissuade them from it
17:00:56 <dcf1> ""Considering it is not safe to totally trust one single node, more users rely on fronting proxies to forward their traffic to the entry node of Tor."
17:01:00 <shelikhoo> I think there is one issue that is common to both obfs4 and VMess: neither hides the original traffic's general shape
17:01:12 <meskio> they might be searching for an excuse to write something "new" on the field...
17:01:25 <dcf1> Well, this is totally a supported configuration in Tor Browser. It's what is supposed to happen when you set a proxy in the connection settings dialog.
17:01:50 <dcf1> tor sets the environment variable TOR_PT_PROXY, and the PT client is supposed to understand that (send back PROXY OK) and act on it.
17:01:55 <shelikhoo> like downloading a 4 mb file from HTTP 1.0 server will always result in a connection in the same shape
17:02:15 <dcf1> I find the justification weak, but there is some intuition for it in section II-C:
17:02:52 <shelikhoo> that after proxy handshake, the client send a small request, then the server send a 4 MB chunk back
17:03:04 <dcf1> by looking only at the SSH part of the link (except, confusingly, in experiment #3, but ignore that), they remove some features that could be potentially useful, like the TLS ciphersuites of meek, or the entropy profile of FTE
17:03:42 <dcf1> so in a sense, they are making the challenge *harder* for themselves, by forcing themselves to only look at flow features like size and timing (and byte distribution, but ignore that)
17:04:14 <dcf1> shelikhoo: yes, and they kind of admit as much somewhere, let me find it...
17:04:29 <dcf1> "the effect of application identification is better than plugin identification"
17:05:17 <dcf1> which is basically saying that the application features overshadow the PT features
17:05:29 <dcf1> (also disappointing that they don't mention different iat-mode for obfs4)
17:05:57 <dcf1> I can kind of imagine that an SSH tunnel could affect packet sizes, in this sense:
17:06:20 <dcf1> obfs4 is going to try to send packets that are MTU-sized, when possible (at least in iat-mode=0)
17:06:55 <dcf1> the overhead added by the SSH proxy will cause those packets to be re-segmented when they are sent back out
17:07:18 <dcf1> I was thinking of this when looking at Fig. 2, which made me think, why are there so many data points > 1500 in Fig. 2?
17:07:32 <cohosh> going back to what you said about table I, that means they only did open world experiments for the 3rd experiment with the "campus net" background traffic, but all other experiments were closed world and only classified traffic that was tor traffic?
17:07:36 <dcf1> I find it suspicious, does anyone have a good explanation?
17:08:17 <dcf1> They even mention an MTU of 1500 bytes in III-B; what conditions were they running in that they could measure *average* packet sizes greater than 1500?
17:08:42 <dcf1> That kind of thing can happen on loopback localhost connections, but it's also possible there's an error in their experiment.
17:09:23 <cohosh> o.O that is weird
17:09:36 <anadahz> Does iat-mode > 0 actually make any difference to the detection of obfs4?
17:09:56 <dcf1> cohosh: that, I'm not sure about. They don't really say anything about what their traffic mix was, how they define "normal" SSH, they don't report classification rates for "normal" SSH.
17:10:32 <dcf1> anadahz: iat-mode=1 and iat-mode=2 do affect the packet size distribution a lot, it causes obfs4 to send more sub-MTU packets than normal, at least
17:11:07 <dcf1> anadahz: see https://www.bamsoftware.com/talks/pets-2017-menace/index.html "this is your Tor on obfs4 with timing obfuscation" and "this is your Tor on obfs4 with aggressive timing obfuscation"
17:12:01 <dcf1> obfs4's timing obfuscation clearly leaves a lot of features (like the big gap after the handshake), but I'll bet a classifier trained on one would not work on another.
17:12:25 <anadahz> thx dcf1
17:12:42 <dcf1> I guess what I'm getting at is unfortunately there's not a lot for us to take from this paper
17:13:07 <dcf1> other than a general notion that there are researchers doing this kind of thing and this is their level of awareness
17:13:46 <dcf1> the experiments are not well defined, the evaluation is unconvincing, one gets the feeling it would fall apart if not done by the authors themselves
17:14:37 <cohosh> agreed
17:14:49 <meskio> yes, which is good news for us :)
17:14:59 <dcf1> one interesting aspect is that the paper is written from a fairly sympathetic point of view towards tor and pluggable transports
17:15:01 <shelikhoo> taken together with other patents filed against V2Ray, I think we need to consider hiding the application connection shape in the future when designing anti-censorship protocols
17:15:18 <dcf1> it is written more like a defense paper than an attack paper, even though it is about adversarial detection
17:15:41 <cohosh> even though they didn't go into detail about this "campus net" background traffic, it sounds somewhat similar to what large university based research groups are trying to do to analyze this type of traffic analysis
17:15:57 <cohosh> CU Boulder has done this, for example
17:15:58 <meskio> but in a defense paper I would expect some recommendations on changes to make to PTs to fix them...
17:16:19 <dcf1> yes, just because there are 1000s of weak ML classification papers doesn't mean there are not some good ones
17:16:51 <dcf1> Also I appreciated the paragraph about safety and privacy in section IV-A, which I think is actually on the mark.
17:16:54 <shelikhoo> there is an already deployed ssh -D connection blocking system that allows shell and sftp but blocks socket forwarding over the ssh tunnel
17:17:13 <dcf1> "Our research does not raise any privacy issues during data collection. All traffic captured in the experiments is generated by ourselves, and the self-built Obfs4 bridge is set to be unpublished in the configuration file called torrc."
17:17:25 <dcf1> "In addition, considering the shortage of available Tor bridges, we only request to the Tor project one time for each kind of plugins."
17:17:50 <dcf1> shelikhoo: that's the other big cloud hanging over the communication model in this paper
17:18:26 <dcf1> besides "how common is it to use an SSH proxy with tor in this way?", it's "well, now SSH *is* your pluggable transport"
17:18:49 <dcf1> "why are you trying to distinguish different PTs inside the SSH tunnel, why do you care?"
17:20:12 <meskio> fair questions
17:20:24 <dcf1> that's all I have about this paper, anything else?
17:21:01 <meskio> not from my side
17:21:11 <shelikhoo> It would be interesting to see if there is anything to detect the type of application over kcp
17:21:46 <dcf1> I hadn't read this one in advance before suggesting it, but I can probably find more papers of this kind that may have more to teach us
17:22:17 <meskio> yes, we can give others a try, I hope we can find papers that are a bit less thin
17:22:23 <shelikhoo> right now most stream-in-stream proxies don't hide the traffic shape and reveal the inner application type
17:23:01 <dcf1> shelikhoo: yes, that's true. A few years ago there was not really evidence that this kind of detection was happening, now there is starting to be a little evidence.
17:23:05 <anadahz> Though in IV-A "We send emails to bridges@torproject.org to acquire available Tor bridges when collecting the traffic of different plugins"
17:23:08 <dcf1> still not a lot, but it's better to be ahead of these things.
17:23:16 <shelikhoo> KCP or mux may be able to make it more difficult to tell what is inside
17:23:21 <anadahz> So they did use other bridges instead of their own.
17:24:11 <dcf1> anadahz: it's not totally clear, but I interpret it to mean they used their own obfs4 bridge (maybe only for experiment #3); for the other transports they used BridgeDB bridges, but only 1 of each transport
17:24:31 <shelikhoo> I think we should invest in creating a proxy protocol that doesn't reveal it is a proxy, and doesn't reveal the tunneled application type
17:24:32 <dcf1> to me there's no ethics or privacy problem with doing that
17:24:55 <shelikhoo> that's all from me
17:25:07 <anadahz> dcf1: yes, it makes sense
17:25:10 <dcf1> shelikhoo: have you seen https://people.torproject.org/~dcf/obfs4-timing/ ?
17:25:37 <dcf1> it shows how to get almost arbitrary shaping by using a "pull" data model internally, rather than a "push" model
17:25:41 <dcf1> https://lists.torproject.org/pipermail/tor-dev/2017-June/012310.html
17:26:14 <dcf1> For timing+size obfuscation, I think it's the right way to structure an application internally
17:26:37 <dcf1> similar shaping is possible with ShadowSocks AEAD, using zero-byte payload packets for padding when needed
17:26:38 <meskio> the problem with those mechanisms is that they increase the latency, isn't it?
17:26:50 <meskio> in places like china that might make it even less usable
17:26:54 <dcf1> meskio: not necessarily, it depends on the shaping model
17:27:14 <dcf1> if your traffic model has large gaps without sending, then yes, increased latency is unavoidable
17:27:32 <shelikhoo> I have not read it yet... will read it after the meeting
17:27:35 * cohosh gotta go
17:27:47 <dcf1> if your traffic model is constant bitrate 1 Mbps, then you are sending mostly padding, and filling in with actual data when available, in that case it doesn't slow down the useful payload
17:27:52 <cohosh> thanks for the paper suggestion, dcf1!
17:28:02 <dcf1> bye cohosh
17:28:10 <meskio> ciao cohosh
17:28:15 <shelikhoo> bye cohosh~
17:28:34 <meskio> yes, but then you might care about mobile connections where you pay per MB
17:28:48 <meskio> it's a hard balance, but very interesting to investigate
17:29:03 <shelikhoo> there are a lot of issues when it comes to padding and traffic shaping
17:29:07 <dcf1> meskio: yes, I am not saying that constant bit rate is a good model for circumvention either, I'm saying everything depends on what model you choose
17:29:18 <shelikhoo> but it should be investigated
17:29:24 <dcf1> *but* that it's possible to conform to any given model, once you have chosen a model
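[Editor's note: a minimal sketch of the "pull" shaping model discussed above: a fixed send schedule pulls from a queue of application data and pads every cell to a constant size, so the wire pattern is constant-bitrate regardless of payload. The 2-byte length framing and cell size are assumptions for illustration, not any deployed protocol's format.]

```python
# Sketch of pull-model constant-bitrate shaping: a timer tick calls
# next_cell(); queued data is framed and padded to CELL bytes, or a
# pure-padding cell is sent if nothing is queued. Framing (2-byte
# big-endian length prefix) and CELL=512 are illustrative assumptions.
# (A real implementation would requeue data that doesn't fit one cell.)
import queue

CELL = 512  # constant wire size per scheduled send

def next_cell(q):
    """Called once per timer tick: pull pending data, pad to CELL bytes."""
    try:
        data = q.get_nowait()[:CELL - 2]
    except queue.Empty:
        data = b""                       # nothing queued: pure padding
    header = len(data).to_bytes(2, "big")
    return header + data + b"\x00" * (CELL - 2 - len(data))
```

Because the schedule, not the application, decides when bytes leave, useful payload rides along without extra delay whenever a cell is due anyway, which is the point made above about constant-bitrate models.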
17:29:25 <anadahz> bye cohosh
17:29:34 <meskio> :)
17:29:57 <meskio> should we wrap it up?
17:30:01 <dcf1> yes
17:30:12 <meskio> #endmeeting