15:58:24 <meskio> #startmeeting tor anti-censorship meeting
15:58:24 <MeetBot> Meeting started Thu Mar  9 15:58:24 2023 UTC.  The chair is meskio. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:58:24 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
15:58:29 <meskio> hello everybody!!
15:58:33 <meskio> here is our meeting pad: https://pad.riseup.net/p/tor-anti-censorship-keep
15:58:36 <meskio> feel free to add what you've been working on and put items on the agenda
15:58:37 <shelikhoo> Hi!
15:58:50 <onyinyang[m]> hihi!
15:58:56 <hackerncoder> Hello!
15:58:58 <itchyonion> hello!
15:59:23 <meskio> is there anything to discuss on the valdikss point from last week? can I remove it from the agenda?
16:00:11 <shelikhoo> okay, there is one thing I would like to let everyone know is that:
16:00:11 <shelikhoo> I have already brought up the "what is the status of snowflake-02 bridge inclusion in orbot?" issue at the last S96 meeting, and received no reply
16:00:27 <shelikhoo> over
16:00:56 <meskio> I guess silence is a yes, I'll remove it
16:01:26 <meskio> shelikhoo: I think I have an email from someone from orbot asking me about it, but I haven't replied to it yet
16:01:30 * meskio goes to look
16:02:17 <meskio> I have an email from fabiola asking me about it, not sure what they need to do
16:02:27 <meskio> I'll reply to her today
16:03:49 <meskio> it looks like she doesn't know what we are asking her, I'll tell her and check what their status is
16:05:20 <meskio> I'll let you know when I hear something from them
16:05:27 <meskio> anything else on this topic?
16:05:43 <shelikhoow> Nothing from me...
16:06:03 <meskio> cool, lets move on
16:06:17 <meskio> creating repos at gitlab.tpo
16:06:44 <dcf1> Sorry to admit it, but figuring this out has been an obstacle in my moving goptlib to gitlab.
16:06:47 <meskio> dcf1: I think you can go ahead and create the repos, we just need to agree that it is the right place, but worst case scenario we move it around
16:07:02 <meskio> I can also create the repo if needed myself
16:07:12 <dcf1> Ok, I will give it a try and ask if there is trouble.
16:07:19 <meskio> :)
16:07:29 <meskio> just pass by #tor-anticensorship if you need help there
16:07:43 * meskio goes to check dcf1 permissions in anti-censorship project
16:08:24 <meskio> dcf1: you do have the same kind of permissions I have, so you should be able to do it
16:08:34 <dcf1> okay, thank you
16:08:38 <meskio> thank you for taking care of that migration
16:09:39 <meskio> anything else for discussion?
16:10:32 <meskio> I see we have the 'ampcache snowflake fallback' in actions, but I'm not sure anybody is actually working on this
16:11:17 <shelikhoow> Resynchronization with Upstreamed Remove HelloVerify countermeasure (https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40258#note_2883726)
16:11:27 <shelikhoow> there is one last-call item from me
16:12:04 <meskio> ahh, pretty cool that this got accepted upstream
16:12:17 <dcf1> good work on that, yes
16:12:20 <shelikhoow> currently we are looking to resynchronize with the upstreamed version of the snowflake remove hello verify countermeasure
16:12:25 <shelikhoow> yes! thanks!
16:12:34 <shelikhoow> and that would mean updating the dependency
16:12:35 <shelikhoow> and
16:12:57 <shelikhoo> stop supporting one of the versions of the go toolchain we are testing in CI
16:13:04 <shelikhoo> do we want to proceed with that
16:13:10 <shelikhoo> or wait a while
16:13:12 <shelikhoo> ?
16:13:25 <meskio> what go version is that?
16:13:49 <dcf1> is it the oldest version currently?
16:14:10 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/merge_requests/131#note_2869188
16:14:19 <shelikhoo> go1.15
16:14:37 <shelikhoo> "The only problem I'm having with this is that it no longer builds with go1.15 due to the x/crypto dependency update." from cohosh
16:15:03 <meskio> that is the version of go in debian stable, but 2.19 is coming with the next stable coming out in a few months
16:15:09 <meskio> and 2.19 is in debian backports
16:15:16 <dcf1> https://go.dev/doc/devel/release#go1.15 released 2020-08-11
16:15:26 <meskio> (I tend to use debian stable as a reference of what is the oldest version people still use)
16:16:37 <meskio> s/2./1./
16:17:48 <meskio> do we know what is the minimum go version that works?
16:17:55 <shelikhoo> yes, but the thing is, do we want to wait a while, or drop support for go1.15 now?
16:18:42 <dcf1> for me the biggest consideration is if we are causing pain for standalone proxy operators
16:20:20 <meskio> the standalone proxy will come with the next debian stable
16:20:40 <meskio> but for bullseye operators it might become harder to build
16:20:57 <meskio> not sure how many people actually build it themselves instead of using the docker images
16:21:28 <dcf1> I have no clue
16:22:24 <meskio> we could provide binaries for the standalone proxy
16:22:29 <meskio> (statically compiled)
16:23:24 <meskio> I mean, we can move forward to deprecate go1.15 and if people complain start to produce binaries
16:24:32 <shelikhoo> I think what is actually going to happen is that the debian packager comes to complain
16:24:45 <shelikhoo> or stops updating the debian package
16:24:53 <shelikhoo> or something that relies on debian
16:25:12 <shelikhoo> users will just build from source with most recent toolchain
16:25:15 <meskio> the maintainer of the debian package is me, I will not complain :)
16:25:20 <shelikhoo> hahaha
16:25:56 <shelikhoo> then there should be nothing preventing us from moving forward
16:26:11 <shelikhoo> and if something goes wrong we can fix it then
16:26:21 <meskio> +1
16:26:32 <shelikhoo> since I don't think there will be any irreversible damage by doing this
16:26:37 <shelikhoo> yes
16:26:53 <shelikhoo> okay nothing more from me about this topic
16:27:06 <meskio> cool
16:27:14 <meskio> anything else before we go to the reading group?
16:28:15 <meskio> for the reading group we have "Detecting Tor Bridge from Sampled Traffic in Backbone Networks"
16:28:32 <meskio> https://www.ndss-symposium.org/wp-content/uploads/madweb2021_23011_paper.pdf
16:28:43 <meskio> I didn't manage to read it this time :(
16:28:53 <meskio> but I'm still interested and will read it in the next few days
16:29:03 <meskio> dcf1: I see you added a summary to the pad
16:29:12 <dcf1> Maybe I will copy it here for the record.
16:29:16 <dcf1> This paper is about detecting Tor-in-obfs4 when you only have a traffic sample; e.g., you only get to look at every 100th packet that passes through a router that handles both obfs4 and non-obfs4 flows. Traffic sampling means you cannot use features like "look at the first n packets of a flow" or "compare the timing of two consecutive packets". Instead, you can only look at aggregate statistical
16:29:22 <dcf1> features and have to be memory-efficient.
16:29:25 <dcf1> The system collects 12 statistics (Table III in the appendix) and stores them in a data structure called a nest count Bloom filter (NCBF), which essentially is just a composition of 12 counting Bloom filters (https://en.wikipedia.org/wiki/Counting_Bloom_filter). The statistics are things like "number of non-empty upstream packets" (C₂) and "number of downstream packets with payload length between 62
16:29:31 <dcf1> and 465" (C₁₁). From these 12 statistics, they derive 14 features (mostly ratios of statistics) and feed them to a random forest classifier.
16:29:34 <dcf1> For evaluation they use a 15-minute sample of backbone traffic provided by a third party, MAWI (https://mawi.wide.ad.jp/mawi/ditl/ditl2019-G/201904090000.html) and insert their own self-collected obfs4 traffic into it. They say the detection has few false negatives (finds almost all obfs4 bridges), but too many false positives to be usable directly for blocking decisions; they mention the need for
16:29:40 <dcf1> "secondary testing" of suspected bridges.
16:30:01 <dcf1> I think it is a good idea to start reading this paper from the appendix, where they show the features. I don't know why they hid them there.
16:30:56 <dcf1> When I first saw this paper's title I was excited; then I skimmed it once and was less excited; then when I read it in detail I found that I was actually interested again.
16:31:06 <onyinyang[m]> I was wondering the same thing: re: hiding the features
16:31:53 <meskio> :)
16:31:57 <dcf1> Traffic sampling is a pretty interesting constraint; a lot of the traditional website fingerprinting features are not available. And it's something that is likely to be more practical.
16:32:34 <dcf1> (Though in the video Q&A, the presenter says that this paper is not ready for production and seemed self-conscious about the quality of the code.)
16:32:51 <shelikhoo> my first comment is that they really cared a lot about efficiency, not just accuracy. It is like it is aimed to be actually deployed instead of just for research
16:33:17 <dcf1> One thing that had me confused for a long time is that this paper talks a lot about eigenvalues and eigenvectors, but I think I have that figured out. It is just a mistranslation.
16:33:44 <shelikhoo> I have seen a lot of other, more accurate research that requires deep learning and achieves better results in a lab environment
16:33:50 <dcf1> A Chinese speaker can check me, but I believe 特征向量 can mean both eigenvector (linear algebra) and feature vector (machine learning). They have just used the wrong one.
16:35:02 <onyinyang[m]> shelikhoo: yes, that was my read too, focused on being practical for deployment
16:35:02 <shelikhoo> dcf1: yes, the same chinese word is used to represent two meanings
16:35:12 <onyinyang[m]> I was wondering what differences were introduced between the Tor traffic they sampled (July 2020) and now/when the paper was published? They mentioned that packed padding that was added may have obscured their results, but I don't know how much the browser update (or PT update?) would contribute to this
16:35:24 <onyinyang[m]> *packet padding
16:35:50 <onyinyang[m]> *and may obscure future results
16:36:20 <dcf1> onyinyang[m]: I'm not sure myself what might have changed with respect to the statistics they mention.
16:36:38 <itchyonion> dcf you are correct on the Chinese word
16:37:00 <dcf1> "We collected traffic mainly around July 2020, and once the obfuscation protocol changed within six months, we needed to re-screen the features for our method."
16:37:46 <dcf1> It sounds like they had some difficulty with it and the features are fairly fragile.
16:37:58 <onyinyang[m]> Yeah
16:38:05 <itchyonion> "Once the obfuscation protocol has updated for obfuscating packet lengths, such as adding padding in the information control part, shielding ultra-small data packets, our method needs to be updated accordingly."
16:38:12 <shelikhoo> in addition to that, I was thinking about the features they detected, are they detecting tor in a tunnel or just obfs4
16:38:26 <itchyonion> although that might be a suggestion, not what has happened
16:38:46 <dcf1> shelikhoo: correct, I think it depends highly on it being Tor inside the tunnel.
16:38:51 <meskio> has obfs4 actually been changed in 2020?
16:39:02 <dcf1> Since they are not using entropy or any other features that might be characteristic of obfs4 itself.
16:39:41 <meskio> ahh, so the changes are not in obfs4, but in Tor
16:39:48 <dcf1> meskio: I would guess that rather something in Tor or Tor Browser changed. There was an anti-website fingerprinting mitigation (randomized pipelining) implemented at some point in the browser, not sure when that was.
16:40:00 <meskio> makes sense
16:40:19 <shelikhoo> dcf1: yes, and deepcorr will basically force tor to change how it does padding or something like that, I think this research isn't that worrying on its own
16:40:46 <shelikhoo> the issue is more about protocol design that we need to pad or otherwise hide traffic shape information
16:41:17 <dcf1> Interesting to me was the size thresholds they use for partitioning packet sizes: 0, 1–61, 62–465, 466–1050, 1050–∞
16:41:22 <shelikhoo> because the attacker is not only looking at the content of traffic, but also the shape of it
16:41:44 <dcf1> I didn't check to see if those thresholds come from other references [6] [7] [9] they cite in the appendix.
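A minimal Python sketch of the size partition dcf1 lists above (0, 1–61, 62–465, 466–1050, 1050–∞). The names `size_bin` and `bin_histogram` are hypothetical, and the choice to put a payload of exactly 1050 bytes into the 466–1050 bin is an assumption, since the log gives 1050 as an endpoint of both of the last two ranges.

```python
from bisect import bisect_left

# Inclusive upper bounds of the first four bins; anything larger falls in the last bin.
# Assumption: a payload of exactly 1050 bytes belongs to the 466-1050 bin.
UPPER_BOUNDS = [0, 61, 465, 1050]
BIN_LABELS = ["0", "1-61", "62-465", "466-1050", ">1050"]


def size_bin(payload_len):
    """Return the index (0..4) of the bin that a payload length falls into."""
    if payload_len < 0:
        raise ValueError("payload length cannot be negative")
    return bisect_left(UPPER_BOUNDS, payload_len)


def bin_histogram(payload_lengths):
    """Count how many payload lengths fall into each of the five bins."""
    counts = [0] * len(BIN_LABELS)
    for length in payload_lengths:
        counts[size_bin(length)] += 1
    return counts
```

Per-bin counts like these, kept separately for upstream and downstream packets, are the kind of statistic from which ratio features could then be derived.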
16:42:13 <shelikhoo> combine this with metered networks, there is a lot of work needed to find a balance and trade-off
16:42:38 <onyinyang[m]> dcf1: I didn't check that either. I just assumed they must make sense for some reason ^_^;
16:42:55 <dcf1> Something I have been disappointed to see in these papers is that they never mention that obfs4's (and ScrambleSuit's) built-in size and timing obfuscation features are keyed to the server identity.
16:43:17 <dcf1> I.e., they are designed so that if you train a classifier against one server or set of servers, it doesn't necessarily transfer to other servers.
16:43:52 <dcf1> It is kind of moot for obfs4, because most deployed obfs4 bridges have iat-mode=0 which disables size and timing obfuscation, but you would think researchers would at least mention it, if they had done their homework.
16:44:12 <meskio> maybe they didn't
16:44:20 <shelikhoo> yeah, and the pattern to attack V2Ray never mentioned there is more than one transport in it
16:44:38 <shelikhoo> so I think it is like they are just looking to get funding and award
16:45:14 <dcf1> shelikhoo: quite possible. One of the coauthors, Cheng Guang, is very senior, runs a big lab, and has a lot of grants.
16:45:35 <shelikhoo> yeah, and the patent to attack V2Ray never mentioned there is more than one transport in it
16:45:59 <dcf1> By the way, I spent a good amount of time trying to understand the nest count Bloom filter (NCBF) that is central to this paper, and at the end I was unimpressed.
16:46:44 <dcf1> The name should not have "nested" in it. "Nested" made me think there was a recursive or hierarchical structure. But it is essentially just an array of counting Bloom filters.
16:46:58 <dcf1> 12 of them, to be exact, corresponding to the 12 statistics they track.
16:47:18 <shelikhoo> I think it is just a way to make things more efficient, but not really useful for us when it comes to designing a better protocol
16:47:27 <shelikhoo> the things in the appendix are more interesting
16:47:36 <itchyonion> "the traffic will be sampled and only a portion of the packets can be obtained from each flow." Then they compute the percentage of certain packets. I'm surprised this works as well as they state in the paper because there should be huge variance just by the way they sample it
16:47:39 <dcf1> The statement "each counting block is one CBF" I'm pretty sure is just false. Each counting block is a 12-element array of uint8, it is not a Bloom filter on its own.
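A rough Python sketch of the structure as dcf1 describes it: an array of counting blocks, each holding 12 uint8 counters, one per statistic. This is not the paper's code. The class name, the salted SHA-256 hashing, the saturation at 255, and the min-over-blocks query are all assumptions borrowed from ordinary counting Bloom filters.

```python
import hashlib

NUM_STATS = 12  # one counter per tracked statistic


class CountingBlockArray:
    """Hypothetical sketch: num_blocks counting blocks of 12 saturating uint8 counters."""

    def __init__(self, num_blocks=1024, num_hashes=3):
        self.num_blocks = num_blocks
        self.num_hashes = num_hashes
        self.blocks = [bytearray(NUM_STATS) for _ in range(num_blocks)]

    def _indices(self, flow_key):
        # Derive up to num_hashes distinct block indices from salted SHA-256 digests.
        data = repr(flow_key).encode()
        return {
            int.from_bytes(hashlib.sha256(bytes([i]) + data).digest()[:8], "big")
            % self.num_blocks
            for i in range(self.num_hashes)
        }

    def add(self, flow_key, stat):
        """Increment statistic `stat` (0..11) in every block the flow key hashes to."""
        for idx in self._indices(flow_key):
            block = self.blocks[idx]
            if block[stat] < 255:  # saturate instead of wrapping around
                block[stat] += 1

    def estimate(self, flow_key, stat):
        """Upper-bound estimate (below saturation): the minimum over the key's blocks."""
        return min(self.blocks[idx][stat] for idx in self._indices(flow_key))
```

Nothing in this structure stores the flow keys themselves, so enumerating the flows it has seen would need a separate data structure.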
16:49:19 <dcf1> One thing to their credit is they do some analysis in III-C "Sampling theory" and V-A "Parameter adjustment" to estimate how large the Bloom filters should be, though I did not read these sections closely enough to spot any possible errors.
16:51:31 <dcf1> If you remember "Security Notions for Fully Encrypted Protocols" from FOCI (https://www.petsymposium.org/foci/2023/foci-2023-0004.php), one of their requirements is that wire packets be arbitrarily sizeable.
16:52:13 <dcf1> Reading this kind of research from the "opposition" is interesting for answering the question of "how then should the packets be shaped?"
16:53:13 <dcf1> Oh, a couple of questions I wrote down:
16:53:53 <dcf1> 1. Maybe it would be better to key the hash table on 3-tuples (proto, server IP, server port) rather than 5-tuples (proto, server IP, server port, client IP, client port).
16:54:30 <dcf1> After all, they are trying to detect obfs4 bridges (servers), not obfs4 clients. Currently, their NCBF is counting statistics for server–client pairs, not just servers.
16:54:55 <dcf1> It may also help with the data sparsity problem, if the same bridge is used by many clients, their statistics would get accumulated.
16:55:15 <onyinyang[m]> that's a good point
16:55:38 <dcf1> 2. Even a 0.01% ratio of obfs4 to non-obfs4 (Table III) on page 8 seems too high to be realistic.
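A small Python sketch of the keying choice raised in question 1 above. The packet dictionaries and their field names are hypothetical, and the sketch assumes each packet already says which endpoint is the server.

```python
from collections import Counter


def five_tuple(pkt):
    """Key per server-client pair."""
    return (pkt["proto"], pkt["server_ip"], pkt["server_port"],
            pkt["client_ip"], pkt["client_port"])


def three_tuple(pkt):
    """Key per server only, so all clients of one bridge share a key."""
    return (pkt["proto"], pkt["server_ip"], pkt["server_port"])


def packets_per_key(packets, key_fn):
    """Count sampled packets under the chosen flow key."""
    return Counter(key_fn(pkt) for pkt in packets)
```

With three clients talking to the same bridge, `five_tuple` yields three sparse entries while `three_tuple` yields one entry that accumulates all of their packets.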
16:57:14 <dcf1> itchyonion: about sampling, they say that they use "systematic sampling", which means strictly looking at every Nth packet, without randomization (IV-B).
16:57:44 <dcf1> The presentation video also emphasizes this point, which makes me wonder if it's just an implementation detail or if it affects the results.
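For illustration, systematic sampling as described above (strictly every Nth packet, no randomization) can be sketched in Python as follows. The function name is hypothetical, and starting at the Nth packet rather than the first is an assumption.

```python
from itertools import islice


def systematic_sample(packets, n):
    """Return an iterator over every n-th item (the n-th, 2n-th, ...), with no randomization."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return islice(packets, n - 1, None, n)
```

For example, `list(systematic_sample(range(1, 11), 3))` gives `[3, 6, 9]`.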
16:58:33 <shelikhoo> that's a lot of false positives, but as shown in the case of the blocking of fully encrypted protocols, I think they are beginning to allow collateral damage to happen... so protocols still need to be designed with this kind of attack in mind...
16:59:22 <dcf1> Any last thoughts on what we can learn from this paper?
16:59:30 <shelikhoo> nothing more from me
16:59:53 <itchyonion> Interesting thought about only collecting packets based on (proto, server IP, server port). Reminds me of an industrial protocol called OPC with a dynamic port mechanism: the client has to ask the server what port to use
17:00:45 <dcf1> itchyonion: oh yes, and that reminds me, they don't talk about it, but it seems to me that they would have to somehow store all those tuples in a side data structure, because the tuples are not recoverable directly from the contents of the NCBF.
17:01:19 <dcf1> If you want to look at more papers with this flavor, you can skim Cheng Guang's bibliographies:
17:01:22 <dcf1> https://dblp.org/pid/99/4812-1.html
17:01:25 <dcf1> https://xueshu.baidu.com/scholarID/CN-BN74JDQJ
17:02:07 <dcf1> Packet size seems to be an interest, e.g. "Length Matters: Fast Internet Encrypted Traffic Service Classification based on Multi-PDU Lengths"
17:02:27 <dcf1> That is all from me.
17:04:05 <meskio> I guess we can close the meeting here
17:04:13 <meskio> #endmeeting