15:59:40 <cohosh> #startmeeting tor anti-censorship meeting
15:59:40 <MeetBot> Meeting started Thu Jan 6 15:59:40 2022 UTC. The chair is cohosh. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:59:40 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
15:59:46 <cohosh> hey everyone
15:59:52 <shelikhoo> Hi~
15:59:56 <cohosh> happy new year :)
16:00:07 <shelikhoo> Happy New Year~
16:00:10 <cohosh> here is our meeting pad: https://pad.riseup.net/p/tor-anti-censorship-keep
16:01:43 <ggus> hello o/
16:01:53 <cohosh> looks like there is a lot on the agenda for today
16:02:36 <cohosh> dcf1: you have already written a bunch there, do you want to lead the discussion?
16:02:59 <dcf1> I have a lot to say about tor scaling on the snowflake bridge, but it's all written in the pad, and we can discuss more in email
16:03:15 <dcf1> I'll give everyone a sec to read what's in the pad while I write a few sentences here
16:04:00 <dcf1> I think the snowflake bridge has reached its capacity (you can feel it being slower the last week or so), and I believe the specific bottleneck is the tor process
16:04:05 <dcf1> https://lists.torproject.org/pipermail/tor-relays/2021-December/020156.html
16:04:53 <dcf1> with arma's help I worked on a way to run multiple instances of tor with the same fingerprint, which permits tor to scale, and I'm currently running such a bridge (with obfs4proxy)
16:05:16 <cohosh> wow, nice work investigating all of this
16:05:45 <dcf1> we could probably do something similar with the snowflake bridge, which might get us past the most immediate scaling concerns, but it's a non-trivial configuration change
16:06:23 <cohosh> i have a followup question on whether we should spend our time making a single bridge scale better vs. switching our efforts towards figuring out how to run more than one snowflake bridge
16:06:53 <dcf1> My reason for thinking tor is the bottleneck is that the tor process is constantly at 100% CPU, while the snowflake-server process ranges from say 300% to 500%. I think that adding more CPUs will not help, because tor will still only use 1 of them.
16:07:17 <dcf1> There is a ticket about multiple bridges: https://bugs.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/28651
16:08:01 <dcf1> One problem, as I understand it, is that the proposed ideas require some changes in the way tor handles bridge lines, so it would not be immediately deployable
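(The single-core saturation dcf1 describes can be checked with standard procps tools; a minimal sketch, assuming the process names used above. In top, 100% means one full core, so a multi-threaded snowflake-server can read 300-500% while the single-threaded tor process saturates near 100%.)

    # Snapshot of per-process CPU usage for tor and snowflake-server.
    top -b -n 1 -p "$(pgrep -d, -x 'tor|snowflake-server')"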
16:08:13 <anadahz> o/
16:08:31 <cohosh> dcf1: that's the proposed ideas for multiple bridges you mean?
16:08:43 <dcf1> another aspect is that multiple instances of tor do not necessarily need to run on the same host; we could have snowflake-server on one IP address load-balancing to various instances of tor (having the same fingerprint) on other IP addresses
16:09:23 <dcf1> the main difficulty with having multiple bridges is that the tor client's bridge line contains a fingerprint, and tor will refuse to connect if the bridge's fingerprint is not as expected
16:10:32 <dcf1> so some ideas are: 1) let the tor client recognize any of a small set of bridge fingerprints (not currently supported), 2) use multiple bridge lines with different fingerprints (not ideal because tor tries to use all at once, not one at a time), 3) remove the fingerprint from the bridge line (but then the first hop is unauthenticated)
16:11:08 <dcf1> I don't mean to shut the door on that, though, it's possible there is some other solution we've not thought of
16:11:56 <dcf1> So the hack that Roger and I were discussing is to actually run multiple bridges (even on the same host), but give them all the same fingerprint, so that the bridge lines of snowflake clients will continue to work
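(For context, the fingerprint dcf1 refers to is the 40-hex-digit identity field a client pins in its bridge line. A minimal torrc sketch with a placeholder address, fingerprint, and plugin path, not the real bridge's values:)

    # Client-side bridge line (placeholder values). tor checks the
    # bridge's identity key against the 40-hex-digit fingerprint and
    # refuses to use the bridge on a mismatch, which is why every
    # load-balanced instance must present the same fingerprint.
    UseBridges 1
    ClientTransportPlugin snowflake exec /usr/local/bin/snowflake-client
    Bridge snowflake 192.0.2.3:1 0123456789ABCDEF0123456789ABCDEF01234567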
16:14:13 <shelikhoo> I think this load-balancing change does not require any client-side modification, and will give us more time to find a more permanent solution to the scaling issue.
16:14:40 <cohosh> so you can have the tor instances on multiple hosts all with the same fingerprint?
16:14:58 <dcf1> oh another idea is 4) have the client or the proxy inform the broker or some kind of meta-bridge of what bridge fingerprint it wants to use, so that the meta-bridge can route their traffic to the correct instance of tor. This would still need a way of us shipping multiple fingerprints and having the client choose one randomly.
16:15:25 <dcf1> cohosh: yes, I have tested them running on the same host, but I believe it would work even if they are on different hosts.
16:15:52 <cohosh> okay cool, that sounds like a good short-term solution
16:16:02 <dcf1> However I do not see a need to run the instances of tor on different hosts at this point; the point of doing this is to scale beyond 1 CPU, so we could run like 4 instances on the same host
16:16:12 <dcf1> The steps of expansion I see are:
16:16:37 <dcf1> 1 snowflake-server + 1 tor on the same host (what we have now)
16:16:51 <dcf1> -> 1 snowflake-server + N tor on the same host
16:17:03 <dcf1> -> 1 snowflake-server on one host + N tor on a different host
16:17:18 <dcf1> -> M snowflake-server on different hosts + N tor on different hosts
16:17:37 * cohosh nods
16:18:07 <dcf1> The last step, if we ever need to go that far, is a tricky one because it will require synchronizing KCP state etc. across snowflake-servers, but I don't think we're at that point yet
16:18:51 <cohosh> just to clarify, the last step you have in mind also involves the same fingerprint/bridge line for each snowflake-server instance?
16:19:10 <cohosh> or is it the snowflake#28651 step?
16:19:12 <dcf1> Hmm, I hadn't thought about that, I guess it could work either way
16:19:43 <dcf1> Well I think all these changes fall under the scope of #28651
16:19:51 <cohosh> this is getting perhaps too far ahead for today, but there's a meta question in here as to whether we want to control the load balancing, or have the client decide which bridge/instances to use
16:20:03 <cohosh> dcf1: fair enough, that's a good place to discuss these things
16:20:29 <dcf1> So what I propose is:
16:21:09 <dcf1> First I want to try the experiment that arma suggested that I don't actually understand yet https://lists.torproject.org/pipermail/tor-relays/2022-January/020196.html
16:21:55 <dcf1> Then we choose a day to devote a few hours with someone else on the team to setting up a staging bridge where we can do this configuration and document the steps
16:22:33 <dcf1> and perhaps set up an alternate broker to try it ourselves
16:23:05 <dcf1> I still feel the idea is too experimental (and non-atomic) to just deploy it on the production bridge
16:23:19 <cohosh> yeah, especially since we have so many users
16:23:57 <cohosh> cool, i'm down to help out pretty much any day
16:24:43 <shelikhoo> Currently, I already have a private snowflake broker for testing the blocking in Russia. I think I can try to extend it with this multiple-tor setup to see how it works out.
16:25:45 <dcf1> let's pencil in Tuesday 2022-01-11 and I'll follow up. I'll try to have a freshly installed host to start with by then.
16:26:04 <cohosh> nice
16:26:08 <shelikhoo> Yes
16:26:13 <dcf1> shelikhoo: that's great, I tried to put complete instructions at https://lists.torproject.org/pipermail/tor-relays/2022-January/020183.html
16:26:56 <shelikhoo> Thanks! I will share if I make any progress on this.
16:27:39 <shelikhoo> I will try to create a systemd-based deployment, will this work in the production environment?
16:27:49 <dcf1> A couple amendments I would make (we can do these when we make proper documentation): remove "PublishServerDescriptor 0" and replace it with "BridgeDistribution none", and change the transport name from extor_static_cookie to snowflake in the ServerTransportPlugin line (doesn't really matter, just looks weird in metrics relay search)
16:28:20 <dcf1> shelikhoo: yes, systemd will work on our production, and that would be helpful because I don't know how to write systemd units
16:28:39 <shelikhoo> dcf1: Yes!
16:29:00 <cohosh> :D
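(To make the two amendments concrete, a sketch of the kind of per-instance torrc the linked instructions set up. Paths, ports, and the plugin command line here are hypothetical; only the two amended directives come from dcf1's comment above.)

    # Sketch of a torrc for one of N tor instances behind snowflake-server.
    # The key material under each instance's DataDirectory/keys is copied
    # from the first instance, so all N instances present the same
    # bridge fingerprint.
    BridgeRelay 1
    # dcf1's amendment: use this instead of "PublishServerDescriptor 0"
    BridgeDistribution none
    DataDirectory /var/lib/tor-instances/o1
    ORPort 127.0.0.1:auto
    ExtORPort auto
    # dcf1's amendment: name the transport "snowflake" rather than
    # "extor_static_cookie" so it reads sensibly in metrics relay search.
    # The "..." stands for the plugin's arguments, elided in this sketch.
    ServerTransportPlugin snowflake exec /usr/local/bin/extor-static-cookie ...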
16:29:28 <dcf1> Also the mailing list post thinks some of the commands are email addresses; "tor at o1" should be "tor@o1"
16:30:10 <dcf1> There's also a copy of the thread at https://forum.torproject.net/t/tor-relays-how-to-reduce-tor-cpu-load-on-a-single-bridge/1483 (which unfortunately breaks the ascii diagrams)
16:31:06 <cohosh> really amazing work dcf1
16:31:17 <dcf1> I guess now that I know tor-relays gets mirrored to the forum I should write my emails there in markdown
16:31:31 <ggus> dcf1: i can edit and fix the diagrams :)
16:31:43 <dcf1> thanks ggus
16:31:43 <cohosh> yeah i think there is a push to move to the forum
16:32:01 <ggus> or, if you create an account on the forum, you can edit your own emails.
16:32:12 <dcf1> ok, good to know
16:32:51 <cohosh> should we move on to prioritizing other near-term snowflake tasks?
16:33:05 <dcf1> sure
16:33:28 <dcf1> I listed 2 ultra-easy tasks, I think they are both one-line changes
16:33:32 <cohosh> yeah
16:34:11 <dcf1> If someone has not had the experience of redeploying snowflake-server yet, it could be a good way to practice
16:34:42 <cohosh> :)
16:34:55 <cohosh> shelikhoo: i think we need to make an account for you but i can do that today
16:35:37 <shelikhoo> Yes, I will write the pull request for that first.
16:36:01 <shelikhoo> And when the account arrives, it should be ready to go.
16:36:12 <cohosh> cool, thanks!
16:37:00 <cohosh> for client-side blocking resistance you listed snowflake#40054
16:37:30 <dcf1> Yes, I think this is one we may want to get out ahead of, as it's a likely next blocking feature
16:37:39 <cohosh> there's also still an outstanding dtls issue: https://github.com/pion/dtls/issues/408
16:38:00 <dcf1> maxbee is working on the uTLS integration.
16:38:17 <cohosh> i started working on the dtls fix and will probably finish that up in the next few days
16:38:22 <dcf1> I have a recent branch for uTLS in dnstt which I'll link on the ticket; it's slightly refined relative to the meek code
16:38:36 <cohosh> yeah, we might not want to assume they can finish it soon, since i know they have been busy lately
16:41:03 <dcf1> ok
16:41:12 <cohosh> dcf1: do you have any insight as to whether the snowflake outage last week was due to load issues?
16:41:29 <dcf1> no, I was wondering about it too
16:41:33 <cohosh> hm okay
16:41:58 <cohosh> i'm sad to report i didn't do much investigation before just restarting everything
16:42:04 <cohosh> so i guess we'll just have to keep an eye out
16:42:09 <dcf1> no worries, that was the right call
16:43:57 <cohosh> there's also the existing probetest outage
16:44:11 <cohosh> how do we feel about deploying some profiling code?
16:44:18 <cohosh> we've done it in the past iirc
16:44:41 <dcf1> that's fine with me, especially if we are restarting it anyway
16:44:51 <cohosh> cool, i might just do that this week
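(The "profiling code" is not specified in the log; for a Go process like probetest, one common approach, shown here as a sketch only with a hypothetical port, is the standard net/http/pprof handler:)

    // Sketch: expose Go's built-in profiler on localhost. Importing
    // net/http/pprof registers the /debug/pprof/* handlers on the
    // default mux as a side effect.
    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers
    )

    func main() {
        go func() {
            // e.g. go tool pprof http://127.0.0.1:6060/debug/pprof/profile
            log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
        }()
        // ... the program's normal work would continue here ...
        select {}
    }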
16:45:18 <dcf1> I saw some reports online (may have been the same person) saying that their snowflake browser addon was not getting any traffic
16:45:41 <dcf1> do you think that a probetest outage + specific NAT configuration could be the cause? or is that not possible
16:46:09 <cohosh> that would surprise me since a failed probetest should still make the proxy poll as though it had a restricted NAT
16:46:23 <cohosh> and i think those are still being used
16:46:41 <dcf1> okay, yeah
16:46:47 <shelikhoo> Do we have any information about the root cause of these outages? Is it possible for us to load-balance the traffic to the probetest with failover, so that we can comfortably examine the failed state of it?
16:47:20 <dcf1> no, we do not know the cause at this point. a symptom is that the process starts using all of 1 CPU
16:47:41 <shelikhoo> The service can remain online to users as long as there are enough instances of probetest running
16:47:48 <dcf1> as it is doing currently: %CPU 100.3 %MEM 4.6 TIME+ 10442:39 COMMAND probetest
16:48:02 <cohosh> i think it's the probetest locking up somewhere
16:48:12 <cohosh> but not sure
16:48:27 <dcf1> I think that a non-running probetest is effectively equivalent to a locked-up probetest, so there's no harm in experimenting while it's in a failed state, I think
16:48:27 <cohosh> because when it fails it appears to fail for everyone at once
16:49:21 <shelikhoo> at least we should restart it with SIGQUIT, so that it generates a stack dump
16:49:29 <shelikhoo> https://pkg.go.dev/os/signal
16:49:50 <dcf1> good idea
16:50:18 <cohosh> +1
16:50:40 <dcf1> you can also attach gdb to the live process with "gdb -p" and get a stacktrace of all threads with "thread apply all bt" or similar https://sourceware.org/gdb/current/onlinedocs/gdb/Threads.html
16:50:53 <dcf1> then detach and the program still runs
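(Background on shelikhoo's suggestion: a Go program that receives an unhandled SIGQUIT dumps all goroutine stacks and exits, which is what makes "restart it with SIGQUIT" useful. A hypothetical sketch, not probetest's actual code, of the os/signal variant that dumps stacks without exiting:)

    // Sketch: dump all goroutine stacks on SIGQUIT but keep running.
    // Catching the signal with signal.Notify suppresses the Go
    // runtime's default dump-and-exit behavior.
    package main

    import (
        "os"
        "os/signal"
        "runtime"
        "syscall"
    )

    func main() {
        sig := make(chan os.Signal, 1)
        signal.Notify(sig, syscall.SIGQUIT)
        go func() {
            buf := make([]byte, 1<<20) // stacks are truncated past 1 MiB
            for range sig {
                n := runtime.Stack(buf, true) // true = all goroutines
                os.Stderr.Write(buf[:n])
            }
        }()
        // ... the program's normal work would continue here ...
        select {}
    }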
16:53:42 <cohosh> nice
16:53:49 <cohosh> shelikhoo: i think the last item on the agenda is yours?
16:54:56 <shelikhoo> Yes, there are some reports that Turkmenistan has blocked the fronting domain we are currently using
16:55:03 <shelikhoo> we could consider replacing it
16:55:34 <shelikhoo> experimenting while it's in a failed state results in longer downtime for users unless there is failover with a load balancer
16:56:21 <shelikhoo> so without a load balancer, it might be better to just restart it quickly and get a stack dump in the process
16:56:38 <dcf1> yes, that sounds fine
16:57:38 <cohosh> yeah, also now that we can provide the front domain as SOCKS options, it can be configured in the bridge line
16:58:02 <shelikhoo> i have investigated and found some fronting domains we might be able to use
16:58:13 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40068#note_2767823
16:58:15 <cohosh> nice!
16:58:59 <cohosh> i suppose the main risk is that a new one we choose happens to be blocked in a place where the existing one isn't
17:01:05 <cohosh> shelikhoo: perhaps a good way to start is to provide ggus with a bridge line that has a different domain that can be shared with users in TM
17:01:12 <anadahz> Is there a way to have a pool of fronting domains that the snowflake config may use after a fronting domain is unreachable x times?
17:02:55 <cohosh> anadahz: that is not currently implemented but could potentially be
17:03:30 <shelikhoo> cohosh: Yes, I will work on this task.
17:03:50 <cohosh> if we relied on that feature the downside is it would add to the already large startup latency of snowflake
17:04:35 <dcf1> P.S., also, AMP cache rendezvous currently works in Turkmenistan. We just don't have a good way in Tor Browser to configure it.
17:05:21 <cohosh> dcf1: by good way do you mean that users need to add the AMP CACHE bridge line manually?
17:05:36 <cohosh> wow too much caps lol
17:05:50 <dcf1> yes, exactly. I posted instructions (https://ntc.party/t/dpi/1518/3) but did not get a response from the user in TM
17:05:55 * cohosh nods
17:06:22 <dcf1> I thought that ValdikSS had tested it with an actual host in TM too, though I cannot find the post. I could be wrong about that.
17:06:33 <cohosh> there is a pending tor browser feature to have slightly better country-specific defaults, but until then it's an everybody-or-nobody change
17:07:14 <dcf1> (Or, we could do like we used to have with meek-google, meek-amazon, meek-azure, ... i.e. expose the possible configurations as if they were different pluggable transports)
17:07:48 <dcf1> E.g. snowflake-domainfront, snowflake-ampcache. It's kind of a user-hostile way of doing it though.
17:08:54 <cohosh> what happens if we set up torrc to use both at once since they have the same fingerprint?
17:09:07 <cohosh> because with obfs4 bridges we just configure all default bridges simultaneously
17:09:43 <dcf1> I think it tries to connect to all of them simultaneously, but then only uses one of them (not sure if it keeps the connection idle on the others)
17:10:21 <dcf1> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/28651#note_2592569
17:10:38 <dcf1> "So here, the snowflake-client would simultaneously send out three registration messages (over domain fronting or something else)."
17:10:39 <shelikhoo> if we can connect to more than one simultaneously, we won't have an issue running more than one snowflake
17:10:46 <cohosh> yeah that's true for obfs4 with different fingerprints
17:11:05 <dcf1> oh, I hadn't caught the subtlety about different fingerprints. with the same FP, I don't know.
17:11:16 <cohosh> ok, maybe worth checking
17:12:23 <cohosh> alright, we're a bit over the hour mark, anything else for today?
17:12:34 <shelikhoo> none from me
17:12:42 <dcf1> I'm finished
17:13:10 <cohosh> cool, thanks everyone! and also wow on how much usage snowflake is getting
17:13:32 <anadahz> ty all!
17:14:12 <cohosh> #endmeeting
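(As a footnote to cohosh's torrc question at 17:08:54: a sketch of what "both at once" could look like. The address, fingerprint, and parameter values are placeholders modeled on snowflake-client's url=/front=/ampcache= SOCKS options, not a tested configuration.)

    # Sketch: two bridge lines sharing one placeholder fingerprint,
    # differing only in rendezvous method (domain-fronted broker vs.
    # AMP cache). All values here are illustrative, not authoritative.
    UseBridges 1
    Bridge snowflake 192.0.2.3:1 0123456789ABCDEF0123456789ABCDEF01234567 url=https://snowflake-broker.torproject.net.global.prod.fastly.net/ front=cdn.sstatic.net
    Bridge snowflake 192.0.2.3:1 0123456789ABCDEF0123456789ABCDEF01234567 url=https://snowflake-broker.torproject.net/ ampcache=https://cdn.ampproject.org/ front=www.google.com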