15:57:35 <shelikhoo> #startmeeting tor anti-censorship meeting
15:57:35 <shelikhoo> here is our meeting pad: https://pad.riseup.net/p/tor-anti-censorship-keep
15:57:35 <shelikhoo> feel free to add what you've been working on and put items on the agenda
15:57:35 <MeetBot> Meeting started Thu Mar 16 15:57:35 2023 UTC.  The chair is shelikhoo. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:57:35 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
15:57:42 <shelikhoo> Hi~ Hi~
15:58:05 <meskio> hello
15:58:24 <itchyonion> hello
15:58:37 <onyinyang[m]> hihi o/
15:59:07 <gaba> hi
16:02:58 <shelikhoo> okay, I didn't see a lot of changes of agenda, so i will start the meeting now
16:03:15 <shelikhoo> the first discussion topic is
16:03:16 <shelikhoo> Analysis of speed deficiency of Snowflake in China, 2023 Q1 https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40251
16:03:40 <shelikhoo> The packet loss issue is actually quite common for users in China
16:04:13 <shelikhoo> we have done similar analysis like
16:04:14 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/team/-/issues/65#note_2840280
16:04:22 <shelikhoo> in the past
16:04:35 <shelikhoo> and now it will be snowflake's turn to deal with this issue
16:05:40 <shelikhoo> I have wrote a rather long comment on possible step to take to deal with issue: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40251#note_2886528
16:06:26 <shelikhoo> but please do let me know now if you are thinking about a method to fix this problem that is not mentioned in this issue
16:06:34 <meskio> from reading your options plan 2 sounds like the low hanging fruit, but maybe it is because I don't really know what it is to tune there
16:06:53 <meskio> 3.1 looks like a no go with the current overloaded server
16:07:00 <meskio> 3.2 looks hacky
16:07:09 <meskio> I'm not sure 1.1 will solve anything
16:07:16 <meskio> and 1.2 sounds hard to do
16:07:43 <meskio> but I have to admit I'm not sure I have all the picture in my head
16:08:16 <shelikhoo> yes, the issue with plan 2 is that the performance improvement with tuning kcp options are actually capped
16:08:49 <shelikhoo> so we can try it, but it is hard to say if it will be enough to fix our problem
16:09:31 <shelikhoo> (I am unable to predict if it will work sufficiently well)
16:09:42 <meskio> yes, if that fails I think I would go for 1.2, but maybe dcf1 have a more informed opinion
16:10:25 <meskio> and maybe we need to leave time for people to read your comment...
16:10:35 <dcf1> The hypothesis is that the slowness is due to packet loss on the client--proxy link?
16:10:49 <dcf1> I am looking at the analysis scripts at https://gist.github.com/xiaokangwang/14ac48ef9fc2ce8dd04f92ed9c0928de
16:11:16 <itchyonion> do we know if changing to a different proxy actually packet loss rate? IIRC GFW used to do something like targeting a suspicious client within China with any overseas connection. In that case matching to a different proxy might not work
16:11:18 <shelikhoo> yes, as we can seen in the result, the packet loss can be as high as 20%
16:11:20 <dcf1> It looks like it is calcualting the loss rate by subtracting the maximum DTLS sequence number seen from the total number of DTLS packets seen
16:11:34 <dcf1> `tail -n 1` = maximum, `wc -l` = total number
16:11:35 <shelikhoo> yes, it is how I was doing it
16:12:47 <dcf1> Have you done any manual spot-checks of the pcap files to see if the automated statistics are right?
16:13:16 <dcf1> I ask, just because offhand I wouldn't swear that doing it this way is justified, there may be some assumption that is not actually satisfied.
16:13:32 <shelikhoo> yes, it is a way for me to automate the manually process of looking it at wireshark
16:14:03 <dcf1> I was just looking at https://www.rfc-editor.org/rfc/rfc9147 to see how sequence numbers are used; on first glance it seems okay, but these kinds of things can sometimes have surprises.
16:14:05 <shelikhoo> so, i have at least manually inspected one of them
16:14:47 <dcf1> For example, it would be good to know the nature of packet loss: is it random per packet, or bursty? I would want to know: what is the mean an st.dev of consecutive losses?
16:15:14 <dcf1> I guess i should rephrase as "The mean and standard deviation of the length of bursts of losses"
16:15:36 <shelikhoo> I will need another experiment to answer this question...
16:15:55 <dcf1> shelikhoo: no, I dont think so
16:16:07 <shelikhoo> yes?
16:16:12 <dcf1> Just put the sequence of sequence numbers in a list, and find the first differences
16:16:44 <dcf1> You expect [1, 1, 1, 1, ...] during times of no loss, or larger numbers where there is a run of lost sequence numbers
16:17:22 <dcf1> Take the first differences, and subtract 1 from every element, I guess, to get the lengths of runs of losses.
16:17:55 <dcf1> I'm asking this because it would be good to really understand the nature of the problem, and be really sure the right thing is being measured, before embarking on a plan to fix it.
16:18:57 <dcf1> The first-order qualitative question is, are there frequent drops of 1 packet, or infrequent losses of 100 packets in a row, something like that?
16:20:49 <dcf1> Probably from the pcap you can even recover the length in seconds of each period of packet loss.
16:21:51 <shelikhoo> yes... I didn't keep the original pcap file, so I will need to find a suitable capture... which means I won't be able to answer these questions live
16:22:09 <dcf1> Aha, I understand now, sorry.
16:23:29 <dcf1> My only other observation is that I don't see how forward error correction could help, but maybe I don't understand quite how it works.
16:23:57 <shelikhoo> yes, anyway, I will bring more data, and a comment about how forward error correction would work to the meeting next time...
16:24:03 <dcf1> Because the the client--proxy links and proxy--server links are already wrapped in reliable transport protocols, we would be adding FEC inside an already reliable channel.
16:24:44 <dcf1> And the question of how bursty packet losses are could also inform an estimate of how much FEC might be expected to help things.
16:24:45 <shelikhoo> client - proxy uses an reliable channel..
16:24:57 <shelikhoo> client - proxy uses an UNreliable channel
16:25:21 <dcf1> client--proxy is SCTP in DTLS, SCTP does retransmission etc.
16:25:29 <shelikhoo> webrtc(SCTP to be specific) can be set to not retransmit packet
16:25:34 <shelikhoo> and deliver packet out of order
16:25:43 <shelikhoo> to application
16:26:14 <dcf1> I did not know about that. Do we use it that way in Snowflake?
16:27:28 <shelikhoo> it is called "unreliable mode" https://pkg.go.dev/github.com/pion/webrtc/v3#DataChannel.MaxRetransmits
16:29:28 <dcf1> Ok. I might have missed an issue, is that mode used in Snowflake? Or could it be?
16:30:27 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/blob/main/proxy/lib/snowflake.go#L504
16:30:54 <shelikhoo> https://pkg.go.dev/github.com/pion/webrtc/v3#DataChannelInit
16:31:18 <shelikhoo> Maybe no...? but we can enable "unreliable mode"
16:31:49 <dcf1> Okay, I guess I did not miss anything, it looks like we do not use this mode currently.
16:32:16 <dcf1> Anyway, the nature of packet losses is good to understand before starting a project to enable FEC, I think.
16:32:56 <shelikhoo> yes, I will have more investigation, and we can discuss it again once we have gained more understanding about this issue
16:33:05 <dcf1> Oh, another good thing to check would be if there are any duplicate DTLS sequence numbers in the pcap, because that might happen if more than one proxy is attempted during the bootstrap.
16:33:27 <shelikhoo> yes!
16:33:43 <shelikhoo> and I will keep the packet capture this time...
16:34:25 <dcf1> And you have a good point, RTC 8831 on WebRTC data channels does make some mention of "partially reliable message transport", maybe this is something to investigate.
16:34:29 <dcf1> https://www.rfc-editor.org/rfc/rfc8831#name-sctp-over-dtls-over-udp-con
16:34:42 <shelikhoo> Yes...
16:34:42 <shelikhoo> anything more on this topic?
16:35:27 <dcf1> I apologize for not making it more clear, but I am impressed by your work on this, shelikhoo
16:36:02 <meskio> +1
16:36:07 <itchyonion> +1
16:36:11 <onyinyang[m]> +1
16:36:20 <shelikhoo> yes! thanks!!!!!
16:36:28 <shelikhoo> snowflake-server buffer reuse bug postmortem
16:36:28 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40260
16:36:34 <dcf1> Also there may be a change since this week's QueuePacketConn fix, it had an effect on bandwidth: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40262#note_2886925
16:37:04 <dcf1> I want to have a short discussion about the important bug that was fixed this week
16:37:18 <dcf1> https://forum.torproject.net/t/security-advisory-cross-user-tls-traffic-mixing-in-snowflake-server-until-2023-03-13/6915
16:37:41 <dcf1> See the outline in the pad
16:38:09 <dcf1> I feel ownership of this bug, because it was my initial analysis and my commit
16:38:15 <dcf1> but there is a larger point
16:38:37 <dcf1> which is that high-performing teams take incidents like this as an opportunity to improve their processes
16:38:56 <dcf1> it is not about assigning blame, it is about looking for places in the processes where there were missing safeguards
16:39:27 <dcf1> the consequences in this case were minor, but that makes it a good opportunity for practice, with low stakes
16:40:00 <meskio> amazing work there dcf1 and cohosh to replicate that issue
16:40:26 <dcf1> The summary is that there was a bug with improperly reused buffers in snowflake-server that could cause one clietn to get fragments of another client's Tor TLS stream
16:41:03 <dcf1> It was done early in October 2022, during the great increase of users from Iran, when there was a panic to suddenly increase performance
16:41:30 <shelikhoo> yes! especially when this issue is not really easy to replicate because of the nature of memory corruption...
16:41:35 <itchyonion> I believe I was the reviewer for this ticket. I could definitely do a better job and spend more time on it. I've not done a lot of work on SF servers so far. Is this change something I can run and test locally before the MR? (although I think I'd need more than one client to reproduce the error if I understand it correctly)
16:41:47 <dcf1> Paradoxically this well-meaning fix appears to have instead *decreased* performance, at least for large numbers of users, because the corrupted KCP packets would cause retransmits in side the tunnel, and the corrupted TLS streams would cause disconnection.
16:42:26 <dcf1> The question is, was was missing that let this bug get committed, and to persist for such a long time (5 months)? What could be added to make something like this easier to detect in the future?
16:43:17 <dcf1> Thanks itchyonion, I guess the larger question is: what do we need so that reviewers can feel confident in making their approval decisions?
16:44:00 <dcf1> This merge request probably should have included a test to check the assumption that the buffers were not really reused, even if it was made during a time of high pressure.
16:44:19 <gaba> would a better test coverage help with this kind of problems?
16:44:29 <dcf1> This week, during the fixes, we added such a test. itchyonion, so maybe we can say that missing test cases is a place to improve?
16:44:37 <itchyonion> hazea41 was able to discover this from only looking at client side logs. I think having another log level, something like debug/trace would make things easier.
16:45:05 <shelikhoo> I think one of the thing we can do is that make sure any anomaly get a warning print to verbose log
16:45:31 <dcf1> Yes, while debugging this issue, I added some code to log KCP's own internal error counters
16:45:36 <dcf1> https://gitlab.torproject.org/dcf/snowflake/-/commit/9f43843b59b9753686be836f2c55f209ba29c1e9
16:46:03 <dcf1> After the fix, it made the "KCPInErrors" counter go to zero.
16:46:08 <dcf1> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40262#note_2886032
16:46:32 <dcf1> So as shelikhoo suggests, seems like we should log whenever this counter is non-zero.
16:47:01 <shelikhoo> yes, more testing won't be enough when we are ignoring the errors
16:47:02 <dcf1> That way we get reports from people on the Tor forum, which will happen sooner than a dedicated security researcher's investigation.
16:47:44 <dcf1> The other aspect to think about is what other problems might go undetected because of holes in processes
16:48:10 <dcf1> not just this bug in particular, but other similar bugs. are there easy places where we can make small changes to increase the detection of errors?
16:49:01 <shelikhoo> and right now, I think there isn't a integration testing at CI, where all piece of snowflake is all running in production like setting
16:49:15 <shelikhoo> we only have unit testing, which is often not enough
16:49:17 <dcf1> shelikhoo: that's a good point
16:49:43 <shelikhoo> https://github.com/xiaokangwang/snowflake-mu-docker/blob/master/docker-compose.yaml
16:49:51 <dcf1> We can end this topic here, I want to leave time for the last discussion item.
16:50:07 <shelikhoo> I have something like this for my distributed snowflake server testing
16:50:17 <shelikhoo> but it is not really run in the ci
16:50:20 <shelikhoo> over
16:51:06 <shelikhoo> okay the last topic
16:51:06 <shelikhoo> Docker Registry is removing obfs4, snowflake image: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/89#note_2886686
16:51:15 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/team/-/issues/121
16:51:32 <shelikhoo> we are getting evicted from docker hub!
16:51:36 <shelikhoo> X~X
16:51:48 <onyinyang[m]> -1 docker
16:52:17 <shelikhoo> we could try to apply for some open source exception, which we may qualify
16:52:18 <meskio> yes, we have applied to their free service for open source projects
16:52:34 <shelikhoo> but we also need to try to look for alternatives
16:52:55 <shelikhoo> we can wait Tor's Gitlab to support container registry
16:53:11 <shelikhoo> or host it on Github's container registry
16:53:37 <shelikhoo> over for now
16:53:45 <meskio> I would prefer to use our own, so we don't depend on github
16:53:53 <meskio> we don't have much presence in github anyway
16:53:59 <meskio> but let's see what TPA can do
16:55:00 <shelikhoo> yes... let's see what happens, there is nothing we could do right now as we are waiting for replies
16:55:29 <meskio> if we get rejected from docker we'll have 30 days to deal with it
16:55:40 <shelikhoo> unless we wants to host a container registry on our own...
16:55:55 <meskio> I'll keep you posted on what they say, but I expect it will take them time to review the tons of applications they are currently getting
16:56:33 <meskio> hosting it our own you mean TPA doing it? or ACT doing it?
16:56:38 <shelikhoo> okay next action topic: move the ampcache snowflake fallback forward
16:56:41 <shelikhoo> ACT do it
16:56:49 <shelikhoo> if TPA won't finish it in time
16:56:59 <meskio> ahh, I didn't think about that, I was hoping that will be TPA doing it, but we can always do that
16:57:08 <meskio> I do host registries for other projects, is not much work
16:57:13 <shelikhoo> yes, let's think about that later..
16:57:17 <meskio> +12
16:57:26 <shelikhoo> okay next action topic: move the ampcache snowflake fallback forward
16:57:30 <meskio> ups, but yeah 12
16:57:58 <shelikhoo> anything about this ampcache other than do it?
16:58:04 <shelikhoo> announcement: Sponsor 28 ended
16:58:07 <shelikhoo> yeah!
16:58:16 <gaba> \o/
16:58:26 <shelikhoo> anything more we wish to discuss in this meeting?
16:58:28 <itchyonion> not much to add. I should have more time to focus on Tor side of things from now on
16:59:06 <shelikhoo> sorry for meeting overtime...
16:59:07 <shelikhoo> #endmeeting