15:59:14 <meskio> #startmeeting tor anti-censorship meeting
15:59:14 <MeetBot> Meeting started Thu Jan 27 15:59:14 2022 UTC.  The chair is meskio. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:59:14 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
15:59:18 <meskio> hello everybody!!
15:59:23 <shelikhoo> Hi~
15:59:32 <meskio> here is our meeting pad: https://pad.riseup.net/p/tor-anti-censorship-keep
15:59:46 <meskio> please add what you've been working on
15:59:55 <meskio> and any items for the agenda you want to talk about
16:00:16 <cohosh> hi!
16:01:24 <dcf1> meskio: is there anything to say about the obfs4 upstream changes?
16:01:46 <meskio> mmm
16:01:55 <meskio> I haven't done much on that this week
16:02:11 <meskio> the Debian package is in progress, I think, but I haven't checked how it is
16:02:18 <dcf1> no problem, just wanted to see if it should be on the agenda
16:02:31 <meskio> and I don't know what the status is for TB to update it
16:02:49 <meskio> I don't have updates, but maybe I should spend a bit of time on this next week
16:03:10 <meskio> dcf1: do you want to start with the load balancing?
16:03:43 <dcf1> at the top of https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40095 you can see the progress so far
16:04:38 <dcf1> The DNS switch 2 days ago was pretty smooth, we just had to debug a problem with a file descriptor limit
16:05:01 * meskio notices that gitlab seems to be down
16:05:03 <dcf1> I posted updates from monitoring it the past 2 days at https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40095#note_2772325 and the comment following
16:05:28 <dcf1> gitlab is working for me currently
16:05:48 <meskio> weird, it's not loading here
16:05:55 <shelikhoo> Yes. Gitlab seems to be down for me.....
16:05:57 <dcf1> the 4 tor instances are combining to use more than 100% of a CPU core, which was the whole reason for doing any of this, so that's working
16:06:42 <meskio> nice, congrats
16:06:44 <dcf1> The 8 CPU cores are pretty much maxed out between snowflake-server, haproxy, the tors, and extor-static-cookie though
16:06:53 <shelikhoo> cohosh lost access to the meeting (message from signal)
16:06:57 <dcf1> At least during the day in Russia, at night it's a little more chill
16:07:31 <dcf1> I have an idea for preventing onion key rotation, which is the last piece of the load balancing setup we haven't solved yet
16:08:09 <dcf1> It's to make the onion key files read-only. arma2 looked at the code and says there's a chance it will work. https://forum.torproject.net/t/tor-relays-how-to-reduce-tor-cpu-load-on-a-single-bridge/1483/16
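(As a rough sketch of what that read-only trick could look like for a single tor instance; the instance name, data directory path, and use of chmod here are illustrative assumptions, and as dcf1 says the approach was still untested at this point:)

    # stop the instance before touching its key files (instance name is hypothetical)
    systemctl stop tor@snowflake1
    # strip write permission from the onion key files so tor cannot rewrite them on rotation
    chmod a-w /var/lib/tor-instances/snowflake1/keys/secret_onion_key \
              /var/lib/tor-instances/snowflake1/keys/secret_onion_key_ntor
    systemctl start tor@snowflake1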
16:08:52 <dcf1> one side effect is that snowflake metrics are possibly going to be messed up
16:09:07 <gaba> hey, chives is having issues and many people dropped from irc
16:09:11 <gaba> including cohosh
16:09:19 <dcf1> the ExtORPort stuff is working, and the bridge is uploading descriptors with a mix of countries
16:09:24 <dcf1> hmm bummer
16:09:33 <dcf1> I'll type and paste again if I need to
16:09:40 <gaba> there :)
16:09:43 <cecylia> (hi, got disconnected)
16:10:02 <meskio> hello back
16:10:10 <dcf1> e.g. dirreq-v3-ips ru=8304,us=728,cn=272,by=248,de=160,gb=128
16:10:46 <dcf1> I was saying the 8 CPU cores are pretty much maxed out between snowflake-server, haproxy, the tors, and extor-static-cookie during the time of the most usage, which corresponds to daytime in Russia
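(For context, the load-balancing layer dcf1 describes, with snowflake-server feeding connections through haproxy to several tor instances fronted by extor-static-cookie, corresponds roughly to an haproxy TCP section like the sketch below; the addresses, ports, and instance count are illustrative assumptions, not the bridge's real configuration:)

    listen tor-instances
        mode tcp
        # snowflake-server connects here
        bind 127.0.0.1:10000
        # one server line per tor instance (assumed to be the extor-static-cookie
        # listener sitting in front of each instance's ExtORPort)
        server tor1 127.0.0.1:10001
        server tor2 127.0.0.1:10002
        server tor3 127.0.0.1:10003
        server tor4 127.0.0.1:10004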
16:11:10 <dcf1> I need to test an idea to make the onion key files read-only, to prevent onion key rotation https://forum.torproject.net/t/tor-relays-how-to-reduce-tor-cpu-load-on-a-single-bridge/1483/16
16:11:33 <dcf1> https://metrics.torproject.org/rs.html#details/5481936581E23D2D178105D44DB6915AB06BFB7F
16:11:53 <dcf1> a side effect of running multiple tor instances with the same identity key is that metrics might get messed up
16:12:22 <dcf1> the bandwidth graph is going down on 25 and 26 January, but that's definitely not really the case
16:12:47 <meskio> sounds good, a pity not to have proper metrics, but we could collect them from another place if needed
16:13:09 <dcf1> the bridge is currently about 30 MB/s incoming, 40 MB/s outgoing, which is more than we had before
16:13:14 <cecylia> hmm, i would have expected metrics to rise too high, because clients would get double counted; why would they fall?
16:13:38 <shelikhoo> Maybe the data overwrite each other?
16:13:38 <dcf1> cecylia: hypothesis is because it's only counting metrics from 1/4 of the instances
16:13:42 <cecylia> ahh
16:13:45 <cecylia> that makes sense
16:14:05 <dcf1> taking the most recent descriptor with a certain identity fingerprint, instead of adding them together
16:14:22 <dcf1> But I am not sure exactly what's happening
16:14:28 <cecylia> is it worth talking to the network team and trying to get at least a temporary fix in place here?
16:14:33 <dcf1> Maybe it's something the network health team can help with
16:14:39 <anadahz> dcf1: Thank you for all this work!
16:14:42 * cecylia nods
16:14:46 <dcf1> tx anadahz
16:14:48 <shelikhoo> thanks~
16:15:07 <dcf1> probably today or tomorrow I'll take down the production bridge and do the load balancing installation there
16:15:10 <meskio> yep, great work
16:15:15 <dcf1> then maybe next monday try switching the DNS back
16:15:16 <cecylia> nice \o/
16:15:50 <meskio> yeah!!
16:16:00 <dcf1> (and actually there are currently 5 instances running, the 4 load balanced + the 1 production that is running concurrently, so that might also be affecting metrics)
16:16:15 <shelikhoo> Do we have a plan in case the clients get diverging onion keys?
16:16:39 <dcf1> got an alert last night saying the bridge had already used 3 TB of bandwidth, in about 1.5 days
16:16:56 <dcf1> shelikhoo: the plan is to try making the onion key files read-only to prevent the tor relay from rotating onion keys
16:17:27 <dcf1> an alternative is to apply a patch to the tor relay, but I have less preference for that as a custom-patched tor makes it more likely to miss a security upgrade
16:17:34 <shelikhoo> Yes, but what if it does not work and clients cache an incorrect onion key?
16:17:57 <dcf1> shelikhoo: well, I will test it first, and if it does not work, we will not do it.
16:18:02 <shelikhoo> Yes!
16:18:13 <shelikhoo> Sounds great
16:18:20 <dcf1> the other thing you can do is to stop tor for 30 seconds, then erase the LastRotatedOnionKey line from the state file, then restart tor, and tor will refresh the timestamp
16:18:41 <dcf1> that is effective, but it means you have to shut down the relay for ~1 minute every 3 weeks
16:19:08 <dcf1> which is not too bad on its own; the real risk is remembering to do it (or automating it reliably)
16:19:27 <meskio> cron job?
16:19:29 <shelikhoo> Let's hope read only trick works
16:19:43 <dcf1> yeah, I hope it works. but we can cope if it does not.
16:19:43 <shelikhoo> otherwise I can create something like a systemd timer
16:20:25 <shelikhoo> which is similar to cron
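(If the read-only trick does not pan out, the manual procedure dcf1 outlined could be wrapped in a small script run every ~3 weeks from cron or a systemd timer, roughly like the sketch below; the instance names and paths are hypothetical:)

    #!/bin/sh
    # Sketch of automating the workaround: stop the tor instances for ~30 seconds,
    # drop the LastRotatedOnionKey line from each state file, then restart;
    # tor writes a fresh timestamp on startup, postponing the next rotation.
    set -e
    for i in 1 2 3 4; do
        systemctl stop "tor@snowflake$i"
    done
    sleep 30
    for i in 1 2 3 4; do
        sed -i '/^LastRotatedOnionKey/d' "/var/lib/tor-instances/snowflake$i/state"
        systemctl start "tor@snowflake$i"
    done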
16:21:17 <meskio> :)
16:21:55 <meskio> should we move on to the next topic?
16:22:23 <dcf1> ok
16:22:31 <meskio> just an announcement
16:22:44 <meskio> the last few months I've been working on replacing the bridgedb backend with rdsys
16:23:00 <meskio> I plan to set up a test bridgedb with it next week
16:23:38 <meskio> only @torproject.org email addresses will be able to reach it (and the website version will be on people.tpo)
16:23:40 <gaba> \o/
16:23:48 <anadahz> yeah!
16:24:00 <meskio> I'll send an email to our team list with information if others want to try it and see if they find problems
16:24:06 <meskio> it will be distributing real bridges
16:24:15 <cohosh> :D
16:24:17 <meskio> that is why the @tpo limitation
16:24:34 <meskio> I hope to push it into production soonish in February, but let's see
16:24:54 <meskio> I guess there is not much to discuss about it, just a note that I will poke people to test it
16:25:15 <meskio> the last point on the agenda is hetzner networking issue
16:25:16 <shelikhoo> Is gettor included in this deployment?
16:25:27 <meskio> shelikhoo: not yet
16:25:39 <meskio> but gettor will be next once rdsys is in production, thanks to your work
16:25:51 <meskio> I'll talk with you when everything is ready to start deploying gettor
16:25:58 <shelikhoo> Okay, looking forward to testing it in real environments....
16:26:10 <shelikhoo> no problem! let me know when this happens
16:26:19 <meskio> sure :)
16:27:49 <shelikhoo> Yes, we could move on to the hetzner networking issue....
16:27:50 <meskio> about hetzner, I have no idea what parts of our infra are there; it seems to be back now
16:28:09 <shelikhoo> Yes, it is quite annoying
16:28:19 <shelikhoo> but not that bad
16:28:26 <cohosh> i think it's worth adding to the metrics timeline, there are some bridges there
16:28:38 <cohosh> also polyanthum (bridgedb)
16:29:06 <meskio> bridges.tpo seems to be reachable now
16:29:12 <cohosh> the outage wasn't long
16:29:22 <cohosh> but there were some routing issues the other day
16:29:35 <cohosh> so this could be a longer term event that involves intermittent disruptions
16:30:06 <anadahz> Has this also affected Snowflake users?
16:30:15 <cohosh> i don't think it would
16:30:32 <cohosh> the bridge isn't hosted on hetzner (either the production or staging server)
16:30:33 * anadahz goes to read #tor-dev backlog
16:32:46 <meskio> I hope the problem doesn't continue
16:32:58 <meskio> anything more in this topic? something else to talk about?
16:33:03 <cohosh> you can see the outages on our anti-censorship alerts mailing list
16:33:25 <meskio> yep, there have been a bunch of them lately
16:35:56 <meskio> if we don't have anything more we can finish this meeting
16:36:07 <meskio> a reminder, next week we have reading group
16:36:49 <dcf1> one note, I'm going to try to run some profiling on the staging bridge before we switch it back
16:37:02 <dcf1> in order to help with https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40086
16:37:27 <cohosh> dcf1: awesome! thanks for doing that!
16:38:46 <meskio> I'll wait a minute to close the meeting if someone has something more
16:39:59 <meskio> #endmeeting