15:59:40 <cohosh> #startmeeting tor anti-censorship meeting
15:59:40 <MeetBot> Meeting started Thu Jan  6 15:59:40 2022 UTC.  The chair is cohosh. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:59:40 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
15:59:46 <cohosh> hey everyone
15:59:52 <shelikhoo> Hi~
15:59:56 <cohosh> happy new year :)
16:00:07 <shelikhoo> Happy New Year~
16:00:10 <cohosh> here is our meeting pad: https://pad.riseup.net/p/tor-anti-censorship-keep
16:01:43 <ggus> hello o/
16:01:53 <cohosh> looks like there is a lot on the agenda for today
16:02:36 <cohosh> dcf1: you have already written a bunch there, do you want to lead the discussion?
16:02:59 <dcf1> I have a lot to say about tor scaling on the snowflake bridge, but it's all written in the pad, and we can discuss in email more
16:03:15 <dcf1> I'll give everyone a sec to read what's in the pad while I write a few sentences here
16:04:00 <dcf1> I think the snowflake bridge has reached its capacity (you can feel it being slower the last week or so), and I believe the specific bottleneck is the tor process
16:04:05 <dcf1> https://lists.torproject.org/pipermail/tor-relays/2021-December/020156.html
16:04:53 <dcf1> with arma's help I worked on a way to run multiple instances of tor with the same fingerprint, which permits tor to scale, and I'm currently running such a bridge (with obfs4proxy)
16:05:16 <cohosh> wow, nice work investigating all of this
16:05:45 <dcf1> we could probably do something similar with the snowflake bridge, which might get us past the most immediate scaling concerns, but it's a non-trivial configuration change
16:06:23 <cohosh> i have a followup question on whether we should spend our time making a single bridge scale better vs. switching our efforts towards figuring out how to run more than one snowflake bridge
16:06:53 <dcf1> My reason for thinking tor is the bottleneck is that the tor process is constantly at 100% CPU, while the snowflake-server process ranges from say 300% to 500%. I think that adding more CPUs will not help, because tor will still only use 1 of them.
16:07:17 <dcf1> There is a ticket about multiple bridges: https://bugs.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/28651
16:08:01 <dcf1> One problem, as I understand it, is that the proposed ideas require some changes in the way tor handles bridge lines, so it would not be immediately deployable
16:08:13 <anadahz> o/
16:08:31 <cohosh> dcf1: that's the proposed ideas for multiple bridges you mean?
16:08:43 <dcf1> another aspect is that multiple instances of tor do not necessarily need to run on the same host; we could have snowflake-server on one IP address load-balancing to various instances of tor (having the same fingerprint) on other IP addresses
16:09:23 <dcf1> the main difficulty with having multiple bridges is that the tor client's bridge line contains a fingerprint, and will refuse to connect if the bridge's fingerprint is not as expected
16:10:32 <dcf1> so some ideas are: 1) let the tor client recognize any of a small set of bridge fingerprints (not currently supported), 2) use multiple bridge lines with different fingerprints (not ideal because tor tries to use all at once, not one at a time), 3) remove the fingerprint from the bridge line (but then the first hop is unauthenticated)
16:11:08 <dcf1> I don't mean to shut the door on that, though, it's possible there is some other solution we've not thought of
16:11:56 <dcf1> So the hack that Roger and I were discussing is to actually run multiple bridges (even on the same host), but give them all the same fingerprint, so that the bridge lines of snowflake clients will continue to work
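A hypothetical per-instance torrc for the setup dcf1 describes (every path, port, and option value below is invented for illustration; the authoritative steps are in the tor-relays post linked later in the meeting):

```
# Instance "o1" of N; the other instances differ only in
# DataDirectory and port numbers. Copying the keys/ subdirectory
# of one DataDirectory into the others is what makes all
# instances present the same fingerprint.
DataDirectory /var/lib/tor-instances/o1
SocksPort 0
ORPort 127.0.0.1:auto
AssumeReachable 1
BridgeRelay 1
BridgeDistribution none
ExtORPort 127.0.0.1:10001
```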
16:14:13 <shelikhoo> I think this load balancing change does not require any client side modification, and will give us more time to find a more permanent solution to the scaling issue.
16:14:40 <cohosh> so you can have the tor instances on multiple hosts all with the same fingerprint?
16:14:58 <dcf1> oh another idea is 4) have the client or the proxy inform the broker or some kind of meta-bridge of what bridge fingerprint it wants to use, so that the meta-bridge can route their traffic to the correct instance of tor. This would still need a way of us shipping multiple fingerprints and having the client choose one randomly.
16:15:25 <dcf1> cohosh: yes, I have tested them running on the same host, but I believe it would work even if they are on different hosts.
16:15:52 <cohosh> okay cool, that sounds like a good short term solution
16:16:02 <dcf1> However I do not see a need to run the instances of tor on different hosts at this point; the point of doing this is to scale beyond 1 CPU, so we could run like 4 instances on the same host
16:16:12 <dcf1> The steps of expansion I see are:
16:16:37 <dcf1> 1 snowflake-server + 1 tor on the same host (what we have now)
16:16:51 <dcf1> -> 1 snowflake-server + N tor on the same host
16:17:03 <dcf1> -> 1 snowflake-server on one host + N tor on a different host
16:17:18 <dcf1> -> M snowflake-server on different hosts + N tor on different hosts
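One way the "1 snowflake-server + N tor on the same host" step could be wired is a TCP load balancer between snowflake-server and the tor instances. A hypothetical haproxy fragment (all addresses and ports are placeholders, not the production layout):

```
# snowflake-server forwards decoded traffic to 127.0.0.1:10000;
# haproxy spreads it across four tor instances that share one
# bridge fingerprint.
frontend snowflake-extor
    mode tcp
    bind 127.0.0.1:10000
    default_backend tor-instances

backend tor-instances
    mode tcp
    server o1 127.0.0.1:10001
    server o2 127.0.0.1:10002
    server o3 127.0.0.1:10003
    server o4 127.0.0.1:10004
```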
16:17:37 * cohosh nods
16:18:07 <dcf1> The last step, if we ever need to go that far, is a tricky one because it will require synchronizing KCP state etc. across snowflake-servers, but I don't think we're at that point yet
16:18:51 <cohosh> just to clarify, the last step you have in mind also involves the same fingerprint/bridge line for each snowflake-server instance?
16:19:10 <cohosh> or is it the snowflake#28651 step?
16:19:12 <dcf1> Hmm, I hadn't thought about that, I guess it could work either way
16:19:43 <dcf1> Well I think all these changes fall under the scope of #28651
16:19:51 <cohosh> this is getting perhaps too far ahead for today but there's a meta question in here as to whether we want to control the load balancing, or have the client decide which bridge/instances to use
16:20:03 <cohosh> dcf1: fair enough, that's a good place to discuss these things
16:20:29 <dcf1> So what I propose is
16:21:09 <dcf1> First I want to try the experiment that arma suggested that I don't actually understand yet https://lists.torproject.org/pipermail/tor-relays/2022-January/020196.html
16:21:55 <dcf1> Then we choose a day to devote a few hours with someone else on the team to setting up a staging bridge where we can do this configuration and document the steps
16:22:33 <dcf1> and perhaps set up an alternate broker to try it ourselves
16:23:05 <dcf1> I still feel the idea is too experimental (and non-atomic) to just deploy it on the production bridge
16:23:19 <cohosh> yeah, especially since we have so many users
16:23:57 <cohosh> cool, i'm down to help out pretty much any day
16:24:43 <shelikhoo> Currently, I already have a private snowflake broker for testing Russia's blocking. I think I can try to expand it with this multiple-tor setup to see how it works out.
16:25:45 <dcf1> let's pencil in Tuesday 2022-01-11 and I'll follow up. I'll try to have a freshly installed host to start with by then.
16:26:04 <cohosh> nice
16:26:08 <shelikhoo> Yes
16:26:13 <dcf1> shelikhoo: that's great, I tried to put complete instructions at https://lists.torproject.org/pipermail/tor-relays/2022-January/020183.html
16:26:56 <shelikhoo> Thanks! I will share if I make any progress on this.
16:27:39 <shelikhoo> I will try to create a systemd based deployment, will this work on the production environment?
16:27:49 <dcf1> A couple amendments I would make (we can do these when we make proper documentation): remove "PublishServerDescriptor 0" and replace it with "BridgeDistribution none", and change the transport name from extor_static_cookie to snowflake in the ServerTransportPlugin line (doesn't really matter, just looks weird in metrics relay search)
16:28:20 <dcf1> shelikhoo: yes, systemd will work on our production, and that would be helpful because I don't know how to write systemd units
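A sketch of what the systemd deployment shelikhoo mentions could look like, as an instance template (the unit name, user, and file paths are assumptions, not an agreed-on layout):

```
# Hypothetical /etc/systemd/system/tor-instance@.service;
# "systemctl start tor-instance@o1" starts instance o1.
[Unit]
Description=tor bridge instance %i
After=network.target

[Service]
User=debian-tor
ExecStart=/usr/bin/tor -f /etc/tor/instances/%i/torrc
Restart=on-failure

[Install]
WantedBy=multi-user.target
```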
16:29:00 <cohosh> :D
16:29:15 <shelikhoo> dcf1: Yes!
16:29:28 <dcf1> Also the mailing list post thinks some of the commands are email addresses; "tor at o1" should be "tor@o1"
16:30:10 <dcf1> There's also a copy of the thread at https://forum.torproject.net/t/tor-relays-how-to-reduce-tor-cpu-load-on-a-single-bridge/1483 (unfortunately breaks the ascii diagrams)
16:31:06 <cohosh> really amazing work dcf1
16:31:17 <dcf1> I guess now that I know tor-relays gets mirrored to the forum I should write my emails there in markdown
16:31:31 <ggus> dcf1: i can edit and fix the diagrams :)
16:31:43 <dcf1> thanks ggus
16:31:43 <cohosh> yeah i think there is a push to move to the forum
16:32:01 <ggus> or, if you create an account on the forum, you can edit your own emails.
16:32:12 <dcf1> ok, good to know
16:32:51 <cohosh> should we move on to prioritizing other near-term snowflake tasks?
16:33:05 <dcf1> sure
16:33:28 <dcf1> I listed 2 ultra-easy tasks, I think they are both one-line changes
16:33:32 <cohosh> yeah
16:34:11 <dcf1> If someone has not had the experience of redeploying snowflake-server yet, it could be a good way to practice
16:34:42 <cohosh> :)
16:34:55 <cohosh> shelikhoo: i think we need to make an account for you but i can do that today
16:35:37 <shelikhoo> Yes, I will write the pull request for that first.
16:36:01 <shelikhoo> And when the account arrives, it should be ready to go.
16:36:12 <cohosh> cool, thanks!
16:37:00 <cohosh> for client side blocking resistance you listed snowflake#40054
16:37:30 <dcf1> Yes, I think this is one we may want to get out ahead of, as it's a likely next blocking feature
16:37:39 <cohosh> there's also still an outstanding dtls issue: https://github.com/pion/dtls/issues/408
16:38:00 <dcf1> maxbee is working on the uTLS integration.
16:38:17 <cohosh> i started working on the dtls fix and will probably finish that up in the next few days
16:38:22 <dcf1> I lately have a branch for uTLS in dnstt which I'll link on the ticket, it's slightly refined relative to the meek code
16:38:36 <cohosh> yeah, we might not want to assume they can finish it soon, since i know they have been busy lately
16:41:03 <dcf1> ok
16:41:12 <cohosh> dcf1: do you have any insight as to whether the snowflake outage last week was due to load issues?
16:41:29 <dcf1> no, I was wondering about it too
16:41:33 <cohosh> hm okay
16:41:58 <cohosh> i'm sad to report i didn't do much investigation before just restarting everything
16:42:04 <cohosh> so i guess we'll just have to keep an eye out
16:42:09 <dcf1> no worries, that was the right call
16:43:57 <cohosh> there's also the existing probetest outage
16:44:11 <cohosh> how do we feel about deploying some profiling code?
16:44:18 <cohosh> we've done it in the past iirc
16:44:41 <dcf1> that's fine with me, especially if we are restarting it anyway
16:44:51 <cohosh> cool, i might just do that this week
16:45:18 <dcf1> I saw some reports online (may have been the same person) saying that their snowflake browser addon was not getting any traffic
16:45:41 <dcf1> do you think that a probetest outage + specific NAT configuration could be the cause? or is that not possible
16:46:09 <cohosh> that would surprise me since a failed probetest should still make the proxy poll as though it had a restricted NAT
16:46:23 <cohosh> and i think those are still being used
16:46:41 <dcf1> okay, yeah
16:46:47 <shelikhoo> Do we have any information about the root cause of these outages? Is it possible for us to load balance the traffic to probetest with failover, so that we can comfortably examine the failed state of it?
16:47:20 <dcf1> no, we do not know the cause at this point. a symptom is that the process starts using all of 1 CPU
16:47:41 <shelikhoo> The service can remain online to users as long as there are enough instances of probetest running
16:47:48 <dcf1> as it is doing currently: %CPU 100.3 %MEM 4.6 TIME+ 10442:39 COMMAND probetest
16:48:02 <cohosh> i think it's the probetest locking up somewhere
16:48:12 <cohosh> but not sure
16:48:27 <dcf1> I think that a non-running probetest is effectively equivalent to a locked-up probetest, so there's no harm in experimenting while it's in a failed state, I think
16:48:27 <cohosh> because when it fails it appears to fail for everyone at once
16:49:21 <shelikhoo> at least we should restart it with SIGQUIT, so that it generates a stack dump
16:49:29 <shelikhoo> https://pkg.go.dev/os/signal
16:49:50 <dcf1> good idea
16:50:18 <cohosh> +1
16:50:40 <dcf1> you can also attach gdb to the live process with "gdb -p" and get a stacktrace of all threads with "thread apply all bt" or similar https://sourceware.org/gdb/current/onlinedocs/gdb/Threads.html
16:50:53 <dcf1> then detach and the program still runs
16:53:42 <cohosh> nice
16:53:49 <cohosh> shelikhoo: i think the last item on the agenda is yours?
16:54:56 <shelikhoo> Yes, there are some reports that Turkmenistan has blocked the fronting domain we are currently using
16:55:03 <shelikhoo> we could consider replacing it
16:55:34 <shelikhoo> experimenting while it's in a failed state results in longer downtime for users unless there is failover with a load balancer
16:56:21 <shelikhoo> so without a load balancer, it might be better to just restart it quickly and get a stack dump in the process
16:56:38 <dcf1> yes, that sounds fine
16:57:38 <cohosh> yeah, also now that we can provide the front domain as SOCKS options, it can be configured in the bridge line
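A client-side bridge line with the front domain overridden could then look roughly like this (the address, fingerprint, and domains are placeholders left unfilled; `url=` and `front=` are the snowflake client options being referred to):

```
Bridge snowflake 192.0.2.3:80 <fingerprint> url=https://<broker-front-url>/ front=<alternate-front-domain>
```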
16:58:02 <shelikhoo> i have investigated and found some fronting domains we might be able to use
16:58:13 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40068#note_2767823
16:58:15 <cohosh> nice!
16:58:59 <cohosh> i suppose the main risk is that a new one we choose happens to be blocked in a place where the existing one isn't
17:01:05 <cohosh> shelikhoo: perhaps a good way to start is to provide ggus with a bridge line that has a different domain that can be shared with users in TM
17:01:12 <anadahz> Is there a way to have a pool of fronting domains that the snowflake config may use after a fronting domain is unreachable x times?
17:02:55 <cohosh> anadahz: that is not currently implemented but could potentially be
17:03:30 <shelikhoo> cohosh: Yes, I will work on this task.
17:03:50 <cohosh> if we relied on that feature the downside is it would add to the already large startup latency of snowflake
17:04:35 <dcf1> P.S., also, AMP cache rendezvous currently works in Turkmenistan. We just don't have a good way in Tor browser to configure it.
17:05:21 <cohosh> dcf1: by good way do you mean that users need to add the AMP CACHE bridge line manually?
17:05:36 <cohosh> wow too much caps lol
17:05:50 <dcf1> yes, exactly. I posted instructions (https://ntc.party/t/dpi/1518/3) but did not get a response from the user in TM
17:05:55 * cohosh nods
17:06:22 <dcf1> I thought that ValdikSS had tested it with an actual host in TM too, though I cannot find the post. I could be wrong about that.
17:06:33 <cohosh> there is a pending tor browser feature to have slightly better country specific defaults, but until then it's an everybody or nobody change
17:07:14 <dcf1> (Or, we could do like we used to have with meek-google, meek-amazon, meek-azure, ... i.e. expose the possible configurations as if they were different pluggable transports)
17:07:48 <dcf1> E.g. snowflake-domainfront, snowflake-ampcache. It's kind of a user-hostile way of doing it, though.
17:08:54 <cohosh> what happens if we set up torrc to use both at once since they have the same fingerprint?
17:09:07 <cohosh> because with obfs4 bridges we just configure all default bridges simultaneously
17:09:43 <dcf1> I think it tries to connect to all of them simultaneously, but then only uses one of them (not sure if it keeps the connection idle on the others)
17:10:21 <dcf1> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/28651#note_2592569
17:10:38 <dcf1> "So here, the snowflake-client would simultaneously send out three registration messages (over domain fronting or something else)."
17:10:39 <shelikhoo> if we can accept connecting to more than one simultaneously, we won't have an issue running more than one snowflake
17:10:46 <cohosh> yeah that's true for obfs4 for different fingerprints
17:11:05 <dcf1> oh, I hadn't caught the subtlety about different fingerprints. with the same FP, I don't know.
17:11:16 <cohosh> ok, maybe worth checking
17:12:23 <cohosh> alright, we're a bit over the hour mark, anything else for today?
17:12:34 <shelikhoo> none from me
17:12:42 <dcf1> I'm finished
17:13:10 <cohosh> cool, thanks everyone! and also wow on how much usage snowflake is getting
17:13:32 <anadahz> ty all!
17:14:12 <cohosh> #endmeeting