15:58:24 #startmeeting tor anti-censorship meeting
15:58:24 Meeting started Thu Mar 9 15:58:24 2023 UTC. The chair is meskio. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:58:24 Useful Commands: #action #agreed #help #info #idea #link #topic.
15:58:29 hello everybody!!
15:58:33 here is our meeting pad: https://pad.riseup.net/p/tor-anti-censorship-keep
15:58:36 feel free to add what you've been working on and put items on the agenda
15:58:37 Hi!
15:58:50 hihi!
15:58:56 Hello!
15:58:58 hello!
15:59:23 is there anything to discuss on the valdikss point from last week? can I remove it from the agenda?
16:00:11 okay, there is one thing I would like to let everyone know:
16:00:11 I already brought up the "what is the status of snowflake-02 bridge inclusion in orbot?" issue at the last S96 meeting, and received no reply
16:00:27 over
16:00:56 I guess silence is a yes, I'll remove it
16:01:26 shelikhoo: I think I have an email from someone from orbot asking me about it, but I haven't replied to it yet
16:01:30 * meskio goes to look
16:02:17 I have an email from fabiola asking me about it, not sure what they need to do
16:02:27 I'll reply to her today
16:03:49 it looks like she doesn't know what we are asking of her, I'll tell her and check what their status is
16:05:20 I'll let you know when I hear something from them
16:05:27 anything else on this topic?
16:05:43 Nothing from me...
16:06:03 cool, let's move on
16:06:17 creating repos at gitlab.tpo
16:06:44 Sorry to admit it, but figuring this out has been an obstacle in my moving goptlib to gitlab.
16:06:47 dcf1: I think you can go ahead and create the repos, we just need to agree that it is the right place, but worst case scenario we move it around
16:07:02 I can also create the repo myself if needed
16:07:12 Ok, I will give it a try and ask if there is trouble.
16:07:19 :)
16:07:29 just pass by #tor-anticensorship if you need help there
16:07:43 * meskio goes to check dcf1's permissions in the anti-censorship project
16:08:24 dcf1: you have the same kind of permissions I have, so you should be able to do it
16:08:34 okay, thank you
16:08:38 thank you for taking care of that migration
16:09:39 anything else for discussion?
16:10:32 I see we have the 'ampcache snowflake fallback' in actions, but I'm not sure anybody is actually working on this
16:11:17 Resynchronization with Upstreamed Remove HelloVerify countermeasure (https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40258#note_2883726)
16:11:27 there is one last-call item from me
16:12:04 ahh pretty cool that this got accepted upstream
16:12:17 good work on that, yes
16:12:20 currently we are looking to resynchronize with the upstreamed change of the snowflake remove hello verify countermeasure
16:12:25 yes! thanks!
16:12:34 and that would mean updating the dependency
16:12:35 and
16:12:57 dropping support for one of the versions of the go toolchain we are testing in CI
16:13:04 do we want to proceed with that
16:13:10 or wait a while
16:13:12 ?
16:13:25 what go version is that?
16:13:49 is it the oldest version currently?
16:14:10 https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/merge_requests/131#note_2869188
16:14:19 go1.15
16:14:37 "The only problem I'm having with this is that it no longer builds with go1.15 due to the x/crypto dependency update." from cohosh
16:15:03 that is the version of go in debian stable, but 2.19 is coming with the next stable, which comes out in a few months
16:15:09 and 2.19 is in debian backports
16:15:16 https://go.dev/doc/devel/release#go1.15 released 2020-08-11
16:15:26 (I tend to use debian stable as a reference for the oldest version people still use)
16:16:37 s/2./1./
16:17:48 do we know what is the minimum go version that works?
16:17:55 yes, but the thing is, do we want to wait a while, or drop support for go1.15 now?
16:18:42 for me the biggest consideration is whether we are causing pain for standalone proxy operators
16:20:20 the standalone proxy will come with the next debian stable
16:20:40 but for bullseye operators it might become harder to build
16:20:57 not sure how many people actually build it themselves instead of using the docker images
16:21:28 I have no clue
16:22:24 we could provide binaries for the standalone proxy
16:22:29 (statically compiled)
16:23:24 I mean, we can move forward to deprecate go1.15 and if people complain start to produce binaries
16:24:32 I think what is actually going to happen is that the debian packager comes to complain
16:24:45 or stops updating the debian package
16:24:53 or something that relies on debian
16:25:12 users will just build from source with the most recent toolchain
16:25:15 the maintainer of the debian package is me, I will not complain :)
16:25:20 hahaha
16:25:56 then there should be nothing preventing us from moving forward
16:26:11 and if something goes wrong we can fix it then
16:26:21 +1
16:26:32 since I don't think there will be any irreversible damage by doing this
16:26:37 yes
16:26:53 okay nothing more from me about this topic
16:27:06 cool
16:27:14 anything else before we go to the reading group?
16:28:15 for the reading group we have "Detecting Tor Bridge from Sampled Traffic in Backbone Networks"
16:28:32 https://www.ndss-symposium.org/wp-content/uploads/madweb2021_23011_paper.pdf
16:28:43 I didn't manage to read it this time :(
16:28:53 but I'm still interested and will read it in the next days
16:29:03 dcf1: I see you added a summary to the pad
16:29:12 Maybe I will copy it here for the record.
16:29:16 This paper is about detecting Tor-in-obfs4 when you only have a traffic sample; e.g., you only get to look at every 100th packet that passes through a router that handles both obfs4 and non-obfs4 flows. Traffic sampling means you cannot use features like "look at the first n packets of a flow" or "compare the timing of two consecutive packets". Instead, you can only look at aggregate statistical
16:29:22 features and have to be memory-efficient.
16:29:25 The system collects 12 statistics (Table III in the appendix) and stores them in a data structure called a nest count Bloom filter (NCBF), which essentially is just a composition of 12 counting Bloom filters (https://en.wikipedia.org/wiki/Counting_Bloom_filter). The statistics are things like "number of non-empty upstream packets" (C₂) and "number of downstream packets with payload length between 62
16:29:31 and 465" (C₁₁). From these 12 statistics, they derive 14 features (mostly ratios of statistics) and feed them to a random forest classifier.
16:29:34 For evaluation they use a 15-minute sample of backbone traffic provided by a third party, MAWI (https://mawi.wide.ad.jp/mawi/ditl/ditl2019-G/201904090000.html) and insert their own self-collected obfs4 traffic into it. They say the detection has few false negatives (finds almost all obfs4 bridges), but too many false positives to be usable directly for blocking decisions; they mention the need for
16:29:40 "secondary testing" of suspected bridges.
16:30:01 I think it is a good idea to start reading this paper from the appendix, where they show the features. I don't know why they hid them there.
16:30:56 When I first saw this paper's title I was excited; then I skimmed it once and was less excited; then when I read it in detail I found that I was actually interested again.
16:31:06 I was wondering the same thing: re: hiding the features
16:31:53 :)
16:31:57 Traffic sampling is a pretty interesting constraint; a lot of the traditional website fingerprinting features are not available. And it's something that is likely to be more practical.
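To make the data structure in the summary above concrete, here is a minimal sketch of "an array of cells, each holding 12 counters, indexed by hashing the flow key". It is an illustration only, not the paper's code: the class name `CounterArraySketch`, the cell count, the number of hash functions, the use of SHA-256 and the saturating 8-bit counters are all assumptions. The 0-based indexes 1 and 10 stand in for C₂ and C₁₁.

```python
import hashlib
from typing import List, Tuple

NUM_STATS = 12      # one counter per statistic, as in the summary above
COUNTER_MAX = 255   # assumption: saturating 8-bit counters


class CounterArraySketch:
    """Hypothetical stand-in for the NCBF: cells of 12 counters each."""

    def __init__(self, num_cells: int = 1024, num_hashes: int = 3) -> None:
        self.num_cells = num_cells
        self.num_hashes = num_hashes
        self.cells: List[List[int]] = [[0] * NUM_STATS for _ in range(num_cells)]

    def _indexes(self, flow_key: Tuple) -> List[int]:
        # Derive num_hashes cell indexes by salting a single hash function.
        out = []
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}|{flow_key!r}".encode()).digest()
            out.append(int.from_bytes(digest[:8], "big") % self.num_cells)
        return out

    def add(self, flow_key: Tuple, stat: int) -> None:
        """Increment statistic `stat` (0-based) in every cell the key hashes to."""
        for i in set(self._indexes(flow_key)):  # set(): never bump a cell twice
            if self.cells[i][stat] < COUNTER_MAX:
                self.cells[i][stat] += 1

    def estimate(self, flow_key: Tuple) -> List[int]:
        """Per-statistic minimum over the hashed cells.

        Other flows colliding in a cell can only inflate a counter, so the
        minimum is the tightest available estimate (until counters saturate).
        """
        idx = self._indexes(flow_key)
        return [min(self.cells[i][s] for i in idx) for s in range(NUM_STATS)]
```

A caller would invoke `add(key, 1)` for each sampled non-empty upstream packet and `add(key, 10)` for each sampled downstream packet in the 62 to 465 byte range, then compute ratio features from `estimate(key)`. The structure stores no keys, so the caller has to remember which keys to query.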
16:32:34 (Though in the video Q&A, the presenter says that this paper is not ready for production and seemed self-conscious about the quality of the code.)
16:32:51 my first comment is that they really cared a lot about efficiency, not just accuracy. It is like it is aimed at actually being deployed instead of just being research
16:33:17 One thing that had me confused for a long time is that this paper talks a lot about eigenvalues and eigenvectors, but I think I have that figured out. It is just a mistranslation.
16:33:44 I have seen a lot of other, more accurate research that requires deep learning and achieves better results in a lab environment
16:33:50 A Chinese speaker can check me, but I believe 特征向量 can mean both eigenvector (linear algebra) and feature vector (machine learning). They have just used the wrong one.
16:35:02 shelikhoo: yes, that was my read too, focused on being practical for deployment
16:35:02 dcf1: yes, the same Chinese word is used to represent the two meanings
16:35:12 I was wondering what differences were introduced between the Tor traffic they sampled (July 2020) and now/when the paper was published? They mentioned that packed padding that was added may have obscured their results, but I don't know how much the browser update (or PT update?) would contribute to this
16:35:24 *packet padding
16:35:50 *and may obscure future results
16:36:20 onyinyang[m]: I'm not sure myself what might have changed with respect to the statistics they mention.
16:36:38 dcf you are correct on the Chinese word
16:37:00 "We collected traffic mainly around July 2020, and once the obfuscation protocol changed within six months, we needed to re-screen the features for our method."
16:37:46 It sounds like they had some difficulty with it and the features are fairly fragile.
16:37:58 Yeah
16:38:05 "Once the obfuscation protocol has updated for obfuscating packet lengths, such as adding padding in the information control part, shielding ultra-small data packets, our method needs to be updated accordingly."
16:38:12 in addition to that, I was thinking about the features they detected: are they detecting tor in a tunnel or just obfs4
16:38:26 although that might be a suggestion, not what has happened
16:38:46 shelikhoo: correct, I think it depends highly on it being Tor inside the tunnel.
16:38:51 has obfs4 actually been changed in 2020?
16:39:02 Since they are not using entropy or any other features that might be characteristic of obfs4 itself.
16:39:41 ahh, so the changes are not in obfs4, but in Tor
16:39:48 meskio: I would guess that rather something in Tor or Tor Browser changed. There was an anti-website fingerprinting mitigation (randomized pipelining) implemented at some point in the browser, not sure when that was.
16:40:00 makes sense
16:40:19 dcf1: yes, and deepcorr will basically force tor to change how it does padding or something like that; I think this research isn't that worrying on its own
16:40:46 the issue is more about protocol design: we need to pad or otherwise hide traffic shape information
16:41:17 Interesting to me was the size thresholds they use for partitioning packet sizes: 0, 1–61, 62–465, 466–1050, 1050–∞
16:41:22 because the attacker is not only looking at the content of traffic, but also the shape of it
16:41:44 I didn't check to see if those thresholds come from other references [6] [7] [9] they cite in the appendix.
16:42:13 combine this with metered networks, and there is a lot of work needed to find a balance and trade-off
16:42:38 dcf1: I didn't check that either. I just assumed they must make sense for some reason ^_^;
16:42:55 Something I have been disappointed to see in these papers is that they never mention that obfs4's (and ScrambleSuit's) built-in size and timing obfuscation features are keyed to the server identity.
16:43:17 I.e., they are designed so that if you train a classifier against one server or set of servers, it doesn't necessarily transfer to other servers.
16:43:52 It is kind of moot for obfs4, because most deployed obfs4 bridges have iat-mode=0 which disables size and timing obfuscation, but you would think researchers would at least mention it, if they had done their homework.
16:44:12 maybe they didn't
16:44:20 yeah, and the pattern to attack V2Ray never mentioned there is more than one transport in it
16:44:38 so I think it is like they are just looking to get funding and awards
16:45:14 shelikhoo: quite possible. One of the coauthors, Cheng Guang, is very senior, runs a big lab, and has a lot of grants.
16:45:35 yeah, and the patent to attack V2Ray never mentioned there is more than one transport in it
16:45:59 By the way, I spent a good amount of time trying to understand the nest count Bloom filter (NCBF) that is central to this paper, and at the end I was unimpressed.
16:46:44 The name should not have "nested" in it. "Nested" made me think there was a recursive or hierarchical structure. But it is essentially just an array of counting Bloom filters.
16:46:58 12 of them, to be exact, corresponding to the 12 statistics they track.
16:47:18 I think it is just a way to make things more efficient, but not really useful for us when it comes to designing a better protocol
16:47:27 the things in the appendix are more interesting
16:47:36 "the traffic will be sampled and only a portion of the packets can be obtained from each flow." Then they compute the percentage of certain packets. I'm surprised this works as well as they state in the paper because there should be huge variance just by the way they sample it
16:47:39 The statement "each counting block is one CBF" I'm pretty sure is just false. Each counting block is a 12-element array of uint8, it is not a Bloom filter on its own.
16:49:19 One thing to their credit is they do some analysis in III-C "Sampling theory" and V-A "Parameter adjustment" to estimate how large the Bloom filters should be, though I did not read these sections closely enough to spot any possible errors.
16:51:31 If you remember "Security Notions for Fully Encrypted Protocols" from FOCI (https://www.petsymposium.org/foci/2023/foci-2023-0004.php), one of their requirements is that wire packets be arbitrarily sizeable.
16:52:13 Reading this kind of research from the "opposition" is interesting for answering the question of "how then should the packets be shaped?"
16:53:13 Oh, a couple of questions I wrote down:
16:53:53 1. Maybe it would be better to key the hash table on 3-tuples (proto, server IP, server port) rather than 5-tuples (proto, server IP, server port, client IP, client port).
16:54:30 After all, they are trying to detect obfs4 bridges (servers), not obfs4 clients. Currently, their NCBF is counting statistics for server–client pairs, not just servers.
16:54:55 It may also help with the data sparsity problem: if the same bridge is used by many clients, their statistics would get accumulated.
16:55:15 that's a good point
16:55:38 2. Even a 0.01% ratio of obfs4 to non-obfs4 (Table III) on page 8 seems too high to be realistic.
16:57:14 itchyonion: about sampling, they say that they use "systematic sampling", which means strictly looking at every Nth packet, without randomization (IV-B).
16:57:44 The presentation video also emphasizes this point, which makes me wonder if it's just an implementation detail or if it affects the results.
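The two points just raised, systematic sampling and keying on 3-tuples instead of 5-tuples, can be illustrated with a small sketch. This is not the paper's code: the `Packet` type and every function name here are hypothetical, and a plain `Counter` stands in for the NCBF.

```python
from collections import Counter
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Packet(NamedTuple):
    proto: str
    server_ip: str
    server_port: int
    client_ip: str
    client_port: int
    payload_len: int


def systematic_sample(packets: Iterable[Packet], n: int) -> List[Packet]:
    """Keep every n-th packet (the n-th, 2n-th, ...), with no randomization."""
    return [p for i, p in enumerate(packets, start=1) if i % n == 0]


def five_tuple(p: Packet) -> Tuple:
    """Key identifying one server-client pair."""
    return (p.proto, p.server_ip, p.server_port, p.client_ip, p.client_port)


def three_tuple(p: Packet) -> Tuple:
    """Key identifying the server only, so all of its clients share one entry."""
    return (p.proto, p.server_ip, p.server_port)


def count_by(packets: Iterable[Packet], key: Callable[[Packet], Tuple]) -> Counter:
    """Count sampled packets per key."""
    return Counter(key(p) for p in packets)
```

With several clients talking to one bridge, `count_by(sample, three_tuple)` puts every sampled packet into a single entry, while `count_by(sample, five_tuple)` spreads the same packets across one entry per client. Because the sampling is strictly periodic, which clients show up in the sample depends on how their packets interleave.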
16:58:33 that's a lot of false positives, but as shown in the case of the blocking of fully encrypted protocols, I think they are beginning to allow collateral damage to happen... so protocols still need to be designed with this kind of attack in mind...
16:59:22 Any last thoughts on what we can learn from this paper?
16:59:30 nothing more from me
16:59:53 Interesting thought about only collecting packets based on (proto, server IP, server port). Reminds me of an industrial protocol called OPC with a dynamic port mechanism: the client has to ask the server what port to use
17:00:45 itchyonion: oh yes, and that reminds me, they don't talk about it, but it seems to me that they would have to somehow store all those tuples in a side data structure, because the tuples are not recoverable directly from the contents of the NCBF.
17:01:19 If you want to look at more papers with this flavor, you can skim Cheng Guang's bibliographies:
17:01:22 https://dblp.org/pid/99/4812-1.html
17:01:25 https://xueshu.baidu.com/scholarID/CN-BN74JDQJ
17:02:07 Packet size seems to be an interest, e.g. "Length Matters: Fast Internet Encrypted Traffic Service Classification based on Multi-PDU Lengths"
17:02:27 That is all from me.
17:04:05 I guess we can close the meeting here
17:04:13 #endmeeting