15:57:35 <shelikhoo> #startmeeting tor anti-censorship meeting
15:57:35 <shelikhoo> here is our meeting pad: https://pad.riseup.net/p/tor-anti-censorship-keep
15:57:35 <shelikhoo> feel free to add what you've been working on and put items on the agenda
15:57:35 <MeetBot> Meeting started Thu Mar 16 15:57:35 2023 UTC. The chair is shelikhoo. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:57:35 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
15:57:42 <shelikhoo> Hi~ Hi~
15:58:05 <meskio> hello
15:58:24 <itchyonion> hello
15:58:37 <onyinyang[m]> hihi o/
15:59:07 <gaba> hi
16:02:58 <shelikhoo> okay, I didn't see a lot of changes to the agenda, so I will start the meeting now
16:03:15 <shelikhoo> the first discussion topic is
16:03:16 <shelikhoo> Analysis of speed deficiency of Snowflake in China, 2023 Q1 https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40251
16:03:40 <shelikhoo> The packet loss issue is actually quite common for users in China
16:04:13 <shelikhoo> we have done similar analysis, like
16:04:14 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/team/-/issues/65#note_2840280
16:04:22 <shelikhoo> in the past
16:04:35 <shelikhoo> and now it is snowflake's turn to deal with this issue
16:05:40 <shelikhoo> I have written a rather long comment on possible steps to take to deal with the issue: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40251#note_2886528
16:06:26 <shelikhoo> but please do let me know now if you are thinking about a method to fix this problem that is not mentioned in this issue
16:06:34 <meskio> from reading your options, plan 2 sounds like the low-hanging fruit, but maybe that is because I don't really know what there is to tune there
16:06:53 <meskio> 3.1 looks like a no-go with the current overloaded server
16:07:00 <meskio> 3.2 looks hacky
16:07:09 <meskio> I'm not sure 1.1 will solve anything
16:07:16 <meskio> and 1.2 sounds hard to do
16:07:43 <meskio> but I have to admit I'm not sure I have the whole picture in my head
16:08:16 <shelikhoo> yes, the issue with plan 2 is that the performance improvement from tuning the KCP options is actually capped
16:08:49 <shelikhoo> so we can try it, but it is hard to say if it will be enough to fix our problem
16:09:31 <shelikhoo> (I am unable to predict if it will work sufficiently well)
16:09:42 <meskio> yes, if that fails I think I would go for 1.2, but maybe dcf1 has a more informed opinion
16:10:25 <meskio> and maybe we need to leave time for people to read your comment...
16:10:35 <dcf1> The hypothesis is that the slowness is due to packet loss on the client--proxy link?
16:10:49 <dcf1> I am looking at the analysis scripts at https://gist.github.com/xiaokangwang/14ac48ef9fc2ce8dd04f92ed9c0928de
16:11:16 <itchyonion> do we know if changing to a different proxy actually changes the packet loss rate? IIRC the GFW used to do something like targeting a suspicious client within China with any overseas connection. In that case matching to a different proxy might not work
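[Editor's note: "plan 2" above refers to tuning the KCP options used inside the Snowflake tunnel. As a minimal sketch only, assuming the github.com/xtaci/kcp-go/v5 API that snowflake uses, and with illustrative rather than recommended values, such tuning could look roughly like this:]

    package main

    import (
        "log"

        "github.com/xtaci/kcp-go/v5"
    )

    // tune applies a more aggressive KCP configuration to a session.
    // The values are illustrative only, not a recommendation.
    func tune(sess *kcp.UDPSession) {
        // "Fast mode": no send delay, 10 ms internal update interval,
        // fast retransmit after 2 duplicate ACKs, congestion control off.
        sess.SetNoDelay(1, 10, 2, 1)
        // Larger send/receive windows to keep more data in flight on lossy links.
        sess.SetWindowSize(1024, 1024)
    }

    func main() {
        // Hypothetical address; dataShards/parityShards = 0 leaves
        // kcp-go's built-in Reed-Solomon FEC disabled.
        sess, err := kcp.DialWithOptions("127.0.0.1:12345", nil, 0, 0)
        if err != nil {
            log.Fatal(err)
        }
        defer sess.Close()
        tune(sess)
    }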
16:11:18 <shelikhoo> yes, as we can see in the results, the packet loss can be as high as 20%
16:11:20 <dcf1> It looks like it is calculating the loss rate by subtracting the maximum DTLS sequence number seen from the total number of DTLS packets seen
16:11:34 <dcf1> `tail -n 1` = maximum, `wc -l` = total number
16:11:35 <shelikhoo> yes, that is how I was doing it
16:12:47 <dcf1> Have you done any manual spot-checks of the pcap files to see if the automated statistics are right?
16:13:16 <dcf1> I ask just because, offhand, I wouldn't swear that doing it this way is justified; there may be some assumption that is not actually satisfied.
16:13:32 <shelikhoo> yes, it is a way for me to automate the manual process of looking at it in Wireshark
16:14:03 <dcf1> I was just looking at https://www.rfc-editor.org/rfc/rfc9147 to see how sequence numbers are used; on first glance it seems okay, but these kinds of things can sometimes have surprises.
16:14:05 <shelikhoo> so, I have at least manually inspected one of them
16:14:47 <dcf1> For example, it would be good to know the nature of the packet loss: is it random per packet, or bursty? I would want to know: what is the mean and st.dev of consecutive losses?
16:15:14 <dcf1> I guess I should rephrase that as "the mean and standard deviation of the length of bursts of losses"
16:15:36 <shelikhoo> I will need another experiment to answer this question...
16:15:55 <dcf1> shelikhoo: no, I don't think so
16:16:07 <shelikhoo> yes?
16:16:12 <dcf1> Just put the sequence of sequence numbers in a list, and find the first differences
16:16:44 <dcf1> You expect [1, 1, 1, 1, ...] during times of no loss, or larger numbers where there is a run of lost sequence numbers
16:17:22 <dcf1> Take the first differences, and subtract 1 from every element, I guess, to get the lengths of runs of losses.
16:17:55 <dcf1> I'm asking this because it would be good to really understand the nature of the problem, and be really sure the right thing is being measured, before embarking on a plan to fix it.
16:18:57 <dcf1> The first-order qualitative question is: are there frequent drops of 1 packet, or infrequent losses of 100 packets in a row, something like that?
16:20:49 <dcf1> Probably from the pcap you can even recover the length in seconds of each period of packet loss.
16:21:51 <shelikhoo> yes... I didn't keep the original pcap file, so I will need to find a suitable capture... which means I won't be able to answer these questions live
16:22:09 <dcf1> Aha, I understand now, sorry.
16:23:29 <dcf1> My only other observation is that I don't see how forward error correction could help, but maybe I don't understand quite how it works.
16:23:57 <shelikhoo> yes, anyway, I will bring more data, and a comment about how forward error correction would work, to the meeting next time...
16:24:03 <dcf1> Because the client--proxy links and proxy--server links are already wrapped in reliable transport protocols, we would be adding FEC inside an already reliable channel.
16:24:44 <dcf1> And the question of how bursty packet losses are could also inform an estimate of how much FEC might be expected to help things.
16:24:45 <shelikhoo> client - proxy uses a reliable channel..
16:24:57 <shelikhoo> client - proxy uses an UNreliable channel
16:25:21 <dcf1> client--proxy is SCTP in DTLS; SCTP does retransmission etc.
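[Editor's note: a minimal sketch of the analysis dcf1 describes, not part of the linked analysis scripts. Given the DTLS sequence numbers extracted from a pcap (assumed deduplicated and in order), the loss rate and the lengths of loss bursts can be derived from first differences, roughly like this:]

    package main

    import "fmt"

    // lossStats returns the overall loss rate and the lengths of runs of
    // consecutive lost packets, computed from first differences of the
    // observed sequence numbers.
    func lossStats(seqs []uint64) (lossRate float64, bursts []int) {
        if len(seqs) < 2 {
            return 0, nil
        }
        for i := 1; i < len(seqs); i++ {
            gap := int(seqs[i]-seqs[i-1]) - 1 // 0 when nothing was lost
            if gap > 0 {
                bursts = append(bursts, gap)
            }
        }
        span := seqs[len(seqs)-1] - seqs[0] + 1
        lossRate = 1 - float64(len(seqs))/float64(span)
        return lossRate, bursts
    }

    func main() {
        // Hypothetical sequence numbers with two loss bursts (2 and 1 packets).
        seqs := []uint64{0, 1, 2, 5, 6, 8, 9}
        rate, bursts := lossStats(seqs)
        fmt.Printf("loss rate %.2f, burst lengths %v\n", rate, bursts)
        // The mean and standard deviation dcf1 asks about follow from bursts.
    }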
16:25:29 <shelikhoo> webrtc (SCTP to be specific) can be set to not retransmit packets
16:25:34 <shelikhoo> and deliver packets out of order
16:25:43 <shelikhoo> to the application
16:26:14 <dcf1> I did not know about that. Do we use it that way in Snowflake?
16:27:28 <shelikhoo> it is called "unreliable mode" https://pkg.go.dev/github.com/pion/webrtc/v3#DataChannel.MaxRetransmits
16:29:28 <dcf1> Ok. I might have missed an issue, is that mode used in Snowflake? Or could it be?
16:30:27 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/blob/main/proxy/lib/snowflake.go#L504
16:30:54 <shelikhoo> https://pkg.go.dev/github.com/pion/webrtc/v3#DataChannelInit
16:31:18 <shelikhoo> Maybe not...? but we can enable "unreliable mode"
16:31:49 <dcf1> Okay, I guess I did not miss anything, it looks like we do not use this mode currently.
16:32:16 <dcf1> Anyway, the nature of the packet losses is good to understand before starting a project to enable FEC, I think.
16:32:56 <shelikhoo> yes, I will investigate more, and we can discuss it again once we have gained more understanding of this issue
16:33:05 <dcf1> Oh, another good thing to check would be if there are any duplicate DTLS sequence numbers in the pcap, because that might happen if more than one proxy is attempted during the bootstrap.
16:33:27 <shelikhoo> yes!
16:33:43 <shelikhoo> and I will keep the packet capture this time...
16:34:25 <dcf1> And you have a good point, RFC 8831 on WebRTC data channels does make some mention of "partially reliable message transport", maybe this is something to investigate.
16:34:29 <dcf1> https://www.rfc-editor.org/rfc/rfc8831#name-sctp-over-dtls-over-udp-con
16:34:42 <shelikhoo> Yes...
16:34:42 <shelikhoo> anything more on this topic?
16:35:27 <dcf1> I apologize for not making it more clear, but I am impressed by your work on this, shelikhoo
16:36:02 <meskio> +1
16:36:07 <itchyonion> +1
16:36:11 <onyinyang[m]> +1
16:36:20 <shelikhoo> yes! thanks!!!!!
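[Editor's note: a minimal sketch of the "unreliable mode" discussed above, assuming the pion/webrtc v3 API linked by shelikhoo; this is not how Snowflake currently configures its data channels. Setting Ordered to false and MaxRetransmits to 0 asks SCTP to deliver out of order and not retransmit:]

    package main

    import (
        "log"

        "github.com/pion/webrtc/v3"
    )

    func main() {
        pc, err := webrtc.NewPeerConnection(webrtc.Configuration{})
        if err != nil {
            log.Fatal(err)
        }
        ordered := false
        maxRetransmits := uint16(0) // drop lost messages instead of retransmitting
        dc, err := pc.CreateDataChannel("unreliable-example", &webrtc.DataChannelInit{
            Ordered:        &ordered,
            MaxRetransmits: &maxRetransmits,
        })
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("created data channel %q", dc.Label())
    }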
16:36:28 <shelikhoo> snowflake-server buffer reuse bug postmortem
16:36:28 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40260
16:36:34 <dcf1> Also there may be a change since this week's QueuePacketConn fix; it had an effect on bandwidth: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40262#note_2886925
16:37:04 <dcf1> I want to have a short discussion about the important bug that was fixed this week
16:37:18 <dcf1> https://forum.torproject.net/t/security-advisory-cross-user-tls-traffic-mixing-in-snowflake-server-until-2023-03-13/6915
16:37:41 <dcf1> See the outline in the pad
16:38:09 <dcf1> I feel ownership of this bug, because it was my initial analysis and my commit
16:38:15 <dcf1> but there is a larger point
16:38:37 <dcf1> which is that high-performing teams take incidents like this as an opportunity to improve their processes
16:38:56 <dcf1> it is not about assigning blame, it is about looking for places in the processes where there were missing safeguards
16:39:27 <dcf1> the consequences in this case were minor, but that makes it a good opportunity for practice, with low stakes
16:40:00 <meskio> amazing work there, dcf1 and cohosh, replicating that issue
16:40:26 <dcf1> The summary is that there was a bug with improperly reused buffers in snowflake-server that could cause one client to get fragments of another client's Tor TLS stream
16:41:03 <dcf1> It was done early in October 2022, during the great increase of users from Iran, when there was a panic to suddenly increase performance
16:41:30 <shelikhoo> yes! especially since this issue is not really easy to replicate, because of the nature of memory corruption...
16:41:35 <itchyonion> I believe I was the reviewer for this ticket. I could definitely do a better job and spend more time on it. I've not done a lot of work on the SF servers so far. Is this change something I can run and test locally before the MR? (although I think I'd need more than one client to reproduce the error, if I understand it correctly)
16:41:47 <dcf1> Paradoxically, this well-meaning fix appears to have instead *decreased* performance, at least for large numbers of users, because the corrupted KCP packets would cause retransmits inside the tunnel, and the corrupted TLS streams would cause disconnections.
16:42:26 <dcf1> The question is, what was missing that let this bug get committed, and persist for such a long time (5 months)? What could be added to make something like this easier to detect in the future?
16:43:17 <dcf1> Thanks itchyonion, I guess the larger question is: what do we need so that reviewers can feel confident in making their approval decisions?
16:44:00 <dcf1> This merge request probably should have included a test to check the assumption that the buffers were not really reused, even if it was made during a time of high pressure.
16:44:19 <gaba> would better test coverage help with this kind of problem?
16:44:29 <dcf1> This week, during the fixes, we added such a test. itchyonion, so maybe we can say that missing test cases are a place to improve?
16:44:37 <itchyonion> hazea41 was able to discover this from only looking at client-side logs. I think having another log level, something like debug/trace, would make things easier.
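[Editor's note: a minimal, generic Go sketch of the bug class discussed above, not the actual snowflake-server code: a read buffer that is reused before its previous contents have been consumed lets one client's data leak into another's, and the fix is to hand downstream consumers their own copy:]

    package main

    import "fmt"

    func main() {
        queue := make(chan []byte, 2)
        buf := make([]byte, 8)

        // Buggy pattern: enqueue a slice that aliases the reused buffer.
        copy(buf, "client A")
        queue <- buf                // still points at buf's backing array
        copy(buf, "client B")       // overwrites the data queued for "client A"
        fmt.Printf("%s\n", <-queue) // prints "client B": cross-client mixing

        // Fix: copy the data before enqueueing it.
        copy(buf, "client A")
        p := make([]byte, len(buf))
        copy(p, buf)
        queue <- p
        copy(buf, "client B")
        fmt.Printf("%s\n", <-queue) // prints "client A"
    }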
16:45:05 <shelikhoo> I think one of the things we can do is make sure any anomaly gets a warning printed to the verbose log
16:45:31 <dcf1> Yes, while debugging this issue, I added some code to log KCP's own internal error counters
16:45:36 <dcf1> https://gitlab.torproject.org/dcf/snowflake/-/commit/9f43843b59b9753686be836f2c55f209ba29c1e9
16:46:03 <dcf1> After the fix, it made the "KCPInErrors" counter go to zero.
16:46:08 <dcf1> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40262#note_2886032
16:46:32 <dcf1> So as shelikhoo suggests, it seems like we should log whenever this counter is non-zero.
16:47:01 <shelikhoo> yes, more testing won't be enough when we are ignoring the errors
16:47:02 <dcf1> That way we get reports from people on the Tor forum, which will happen sooner than a dedicated security researcher's investigation.
16:47:44 <dcf1> The other aspect to think about is what other problems might go undetected because of holes in processes
16:48:10 <dcf1> not just this bug in particular, but other similar bugs. are there easy places where we can make small changes to increase the detection of errors?
16:49:01 <shelikhoo> and right now, I think there isn't any integration testing in CI, where all the pieces of snowflake run together in a production-like setting
16:49:15 <shelikhoo> we only have unit testing, which is often not enough
16:49:17 <dcf1> shelikhoo: that's a good point
16:49:43 <shelikhoo> https://github.com/xiaokangwang/snowflake-mu-docker/blob/master/docker-compose.yaml
16:49:51 <dcf1> We can end this topic here, I want to leave time for the last discussion item.
16:50:07 <shelikhoo> I have something like this for my distributed snowflake server testing
16:50:17 <shelikhoo> but it is not really run in the CI
16:50:20 <shelikhoo> over
16:51:06 <shelikhoo> okay, the last topic
16:51:06 <shelikhoo> Docker Registry is removing the obfs4 and snowflake images: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/89#note_2886686
16:51:15 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/team/-/issues/121
16:51:32 <shelikhoo> we are getting evicted from Docker Hub!
16:51:36 <shelikhoo> X~X
16:51:48 <onyinyang[m]> -1 docker
16:52:17 <shelikhoo> we could try to apply for some open source exception, which we may qualify for
16:52:18 <meskio> yes, we have applied to their free service for open source projects
16:52:34 <shelikhoo> but we also need to try to look for alternatives
16:52:55 <shelikhoo> we can wait for Tor's GitLab to support a container registry
16:53:11 <shelikhoo> or host it on GitHub's container registry
16:53:37 <shelikhoo> over for now
16:53:45 <meskio> I would prefer to use our own, so we don't depend on GitHub
16:53:53 <meskio> we don't have much presence on GitHub anyway
16:53:59 <meskio> but let's see what TPA can do
16:55:00 <shelikhoo> yes... let's see what happens, there is nothing we can do right now as we are waiting for replies
16:55:29 <meskio> if we get rejected from Docker we'll have 30 days to deal with it
16:55:40 <shelikhoo> unless we want to host a container registry on our own...
16:55:55 <meskio> I'll keep you posted on what they say, but I expect it will take them time to review the tons of applications they are currently getting
16:56:33 <meskio> hosting it on our own, you mean TPA doing it? or ACT doing it?
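[Editor's note: returning to the postmortem topic above, a minimal sketch of the idea of surfacing kcp-go's internal error counters so that a non-zero KCPInErrors is visible in the log rather than silently ignored. This is an assumption-laden illustration, not dcf1's actual commit:]

    package main

    import (
        "log"
        "time"

        "github.com/xtaci/kcp-go/v5"
    )

    // logKCPErrors periodically reads kcp-go's global SNMP counters and
    // warns when KCP has discarded incoming packets due to input errors.
    func logKCPErrors(interval time.Duration) {
        for range time.Tick(interval) {
            snmp := kcp.DefaultSnmp.Copy()
            if snmp.KCPInErrors > 0 {
                log.Printf("warning: KCPInErrors=%d", snmp.KCPInErrors)
            }
        }
    }

    func main() {
        go logKCPErrors(time.Minute)
        select {} // stand-in for the server's main loop
    }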
16:56:38 <shelikhoo> okay, next action topic: move the ampcache snowflake fallback forward
16:56:41 <shelikhoo> ACT would do it
16:56:49 <shelikhoo> if TPA won't finish it in time
16:56:59 <meskio> ahh, I didn't think about that, I was hoping that would be TPA doing it, but we can always do that
16:57:08 <meskio> I do host registries for other projects, it's not much work
16:57:13 <shelikhoo> yes, let's think about that later..
16:57:17 <meskio> +12
16:57:26 <shelikhoo> okay, next action topic: move the ampcache snowflake fallback forward
16:57:30 <meskio> oops, but yeah, 12
16:57:58 <shelikhoo> anything about this ampcache fallback other than doing it?
16:58:04 <shelikhoo> announcement: Sponsor 28 ended
16:58:07 <shelikhoo> yeah!
16:58:16 <gaba> \o/
16:58:26 <shelikhoo> anything more we wish to discuss in this meeting?
16:58:28 <itchyonion> not much to add. I should have more time to focus on the Tor side of things from now on
16:59:06 <shelikhoo> sorry for the meeting going overtime...
16:59:07 <shelikhoo> #endmeeting