16:00:31 <GeKo> #startmeeting network-health 08/09/2021
16:00:31 <MeetBot> Meeting started Mon Aug  9 16:00:31 2021 UTC.  The chair is GeKo. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:31 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
16:00:35 <GeKo> okay
16:00:40 <hiro> o/
16:00:46 <GeKo> let's do some network-health sync again!
16:00:48 <GeKo> hiro: hi!
16:00:54 <irl[m]> hi
16:01:00 <hiro> hello
16:01:03 <GeKo> we still have http://kfahv6wfkbezjyg4r6mlhpmieydbebr5vkok5r34ya464gqz6c44bnyd.onion/p/tor-nethealthteam-2021.1-keep
16:01:26 <GeKo> please add your updates in case you have any and did not do so yet
16:02:04 <GeKo> mark the items you want to talk about in bold
16:02:22 <GeKo> or we can add separate discussion items if needed
16:02:48 <GeKo> gaba: ggus: feel free to join in case you are around
16:04:45 <ggus> hi
16:05:00 <GeKo> i guess while other folks enter there items i can start a bit
16:05:05 <GeKo> ggus: o/
16:05:14 <GeKo> i've been on vacation for the past weeks
16:05:29 <GeKo> so this week i'll try to catch up on All The Things
16:05:50 <irl[m]> my notes are a bit rough because i completely forgot these were even a thing, i'll add it back to my calendar for next week
16:06:08 <GeKo> we'll see how much time this permits for "regular" work :)
16:06:11 <irl[m]> i think there is still an issue to discuss on the bridgestrap metrics
16:06:32 <GeKo> do we need some anti-censorship folks for that part, too?
16:06:39 <irl[m]> bridgestrap exports metrics in the correct format but when you look at them they are often a bit meaningless
16:07:00 <irl[m]> i think the issue is that the correct anti-censorship people are away
16:07:09 <GeKo> hah, i see
16:07:17 <irl[m]> so i've got the collector implementation but i don't really want to deploy it to just have it archiving nonsense
16:07:38 <irl[m]> but, my contract will end on aug 31st, so there's a hard time limit on finishing this
16:08:22 <irl[m]> i also don't want this to hold up adding overload metrics to onionoo, but having both this and overload metrics in onionoo would make a really cool story for a blog post
16:08:45 <GeKo> irl[m]: what help do you need for the bridgestrap issue?
16:09:15 <hiro> maybe if we can come up with requirements for the onionoo VMs we can speed up rolling up the overload metrics
16:09:57 <irl[m]> https://gitlab.torproject.org/tpo/anti-censorship/bridgestrap/-/issues/20 and https://gitlab.torproject.org/tpo/anti-censorship/bridgestrap/-/issues/22
16:10:30 <irl[m]> these issues need to be fixed, otherwise we're just feeding in bad data which will end up confusing bridge operators more than helping them
16:11:10 <GeKo> okay
16:11:28 <GeKo> meskio is afk i think, so hrm
16:11:47 <irl[m]> hiro7[m]: i've not actually looked at the logs to see how long the runs currently take, and i've not looked at what the impact is of adding the additional descriptors, so first we should do those checks and see if it's a real problem or just one i've imagined or remembered from an old hypervisor host or something
16:12:12 <irl[m]> i could imagine that things have changed in the time i was gone
16:12:16 <gaba> hi!
16:12:53 <GeKo> irl[m]: okay, i'll think a bit about it
16:13:03 <GeKo> i am not sure if cohosh is around this week
16:13:11 <GeKo> and whether that would be enough to unblock you
16:13:36 <gaba> cohosh can you look at it soon too
16:13:41 <gaba> otherwise meskio is back next week
16:13:53 <irl[m]> when i spoke to cohosh she filed the issue and assigned it to meskio, i don't know if it's something she can look at or if it's requiring some as yet undocumented understanding
16:14:05 <GeKo> yeah
16:15:30 <GeKo> irl[m]: where are we with getting hiro accesses to all the metrics services?
16:15:50 <GeKo> if i see my backlog correctly we did not start with that process yet?
16:16:12 <irl[m]> ah i went on vacation and didn't do it
16:16:18 <irl[m]> i'll do that after the meeting
16:16:29 <GeKo> great, thanks
16:17:59 <GeKo> okay
16:18:28 <GeKo> ggus: do you have some thoughts about how to deal with EOL relays in the network?
16:18:41 <GeKo> i want to make progress on that part in the coming weeks
16:18:47 <GeKo> and would love some input
16:19:16 <GeKo> back then i contacted a ton of operators on the old LTS to get them onto a newer tor release
16:19:23 <ggus> last time we contacted all EOL relays. do you think we should do this again?
16:19:25 <GeKo> and iirc you did a thing on twitter
16:19:41 <GeKo> yeah, i think that should be part of it
16:20:08 <ggus> lemme check that ticket, one sec
16:20:14 <irl[m]> if you're contacting EOL relays anyway, we can do more than ask them to upgrade
16:20:28 <irl[m]> ask them to use the contact info sharing thing
16:20:32 <GeKo> i like to figure out what we want to do with relays whose operators we do not reach
16:20:45 <irl[m]> help us understand the network better by getting relay operators to give us that information
16:20:56 <GeKo> yeah, maybe
16:21:05 <irl[m]> https://nusenu.github.io/ContactInfo-Information-Sharing-Specification/
16:21:30 <irl[m]> "here's a list of things that would be helpful to make your relay more useful to the tor network"
16:21:49 <GeKo> i am not convinced by nusenu's idea
16:22:00 <GeKo> but sure we could think about what we could do while we are at it
16:22:13 <ggus> mmmh, 1k relays running EOL.
16:22:25 <GeKo> yeah, it's rough
16:22:51 <GeKo> but we need to get on top of that issue sooner than later i think
16:23:39 <GeKo> ggus: i like the dir-auth link arma linked to as a starting point
16:24:30 <ggus> https://gitweb.torproject.org/torspec.git/tree/attic/authority-policy.txt#n63
16:24:33 <ggus> this one?
16:24:36 <GeKo> yes
16:25:19 <GeKo> the main tricky point as i see it is striking a good balance between kicking eol relays out due to security risks
16:25:30 * arma2 looks at backlog
16:25:41 <GeKo> and keeping them in the network for performance and diversity reasons
16:26:06 <GeKo> like, assuming those 1k relays can't upgrade/won't upgrade
16:26:16 <ggus> GeKo: yes, like if a tor version fixing a TROVE was released, we would like that relays operators upgrade the network
16:26:25 <GeKo> i suppose we don't kick all of them out because they run an EOL version
16:26:40 <GeKo> yep
16:26:55 <irl[m]> if there's a security issue then that wouldn't be a recommended version by authorities though, right?
16:27:02 <irl[m]> that would just be kicked out
16:27:42 <irl[m]> the issue is that there are EOL versions with no known security issues, but that don't have many eyes on the code actually looking for them, so it's the unknown security issues we want to protect against
16:27:45 <GeKo> we have a bunch of relays with a not-recommended version running
16:27:47 <ggus> GeKo: i think it's a good starting point. filtering EOL relays by security issues, contacting them by email, and then blocking if they don't upgrade.
16:28:05 <hiro> how often do we write these operators? or do we print something in the logs?
16:28:14 <GeKo> irl[m]: that's part of the problem
16:28:22 <GeKo> hiro: we did that once so far :)
16:28:28 <GeKo> for 0.2.9 LTS
16:28:32 <gaba> once in a century :P
16:28:38 <GeKo> because it's a ton of work
16:28:43 <irl[m]> as a relay operator i've never read the logs of my relays once they've been set up
16:28:44 <ggus> we also wrote a blog post
16:28:47 <GeKo> and then we dropped the ball
16:28:52 <GeKo> right
16:29:00 <hiro> irl[m]: I didn't want to say that but I don't read my logs either
16:29:13 <arma2> backlog looks reasonable so far. i agree with geko that the ContactInfo-Information-Sharing-Specification needs more attention, maybe by the ux team, before we try to push it on everybody.
16:29:16 <irl[m]> i do check relay search, and make sure they've got reasonable consensus weight and bandwidth looks consistent
16:29:23 <GeKo> ggus: anyway, that's for you and us to think about
16:29:29 <GeKo> nothing to nail down today
16:29:41 <GeKo> but i want to get that done soonish(tm)
16:29:54 <ggus> soonish starting at the end of this month?
16:30:13 <GeKo> earlier i hope but not later :)
16:30:36 <GeKo> not done in the sense we are finally done
16:30:56 <GeKo> but in the sense we have something in the background we can act on and iterate on
16:31:03 <ggus> i'm very busy till august 18th, after that i can think about other big tasks
16:31:19 <GeKo> okay, sounds good and works for me
16:31:52 <GeKo> i'll keep thinking meanwhile a bit more for myself and we can pick this up in earnest after august 18th
16:31:56 <GeKo> sounds good?
16:32:03 <ggus> yes! perfect!
16:32:09 <GeKo> nice
16:32:35 <GeKo> do we have anything else to discuss for today?
16:33:16 <hiro> uhm I guess the overload metrics
16:33:16 <arma2> i will catch up with geko sometime later on whatever i missed here (am focusing on an anti-censorship team thing lately)
16:33:20 <hiro> just quickly
16:34:14 <GeKo> hiro: go ahead :)
16:35:02 <hiro> yeah so in the beginning I started with the idea to make a boolean out of the two overload fields in the bandwidth document in onionoo with the idea that we could display an amber dot only next to the bandwidth graph is the realy was overlaoded
16:35:47 <hiro> but discussing with irl[m] we came out with the idea that it would make sense to have it in the details document instead, which led to the problem of processing all the extrainfo descriptor in onionoo
16:36:27 <hiro> so now the consensus is to use the overload-general line in the server descriptor as a first step and then figure out how to process the extrainfo descriptors
16:36:55 <hiro> my understanding reading the code is that onionoo save it all in memory before writing it to disk... so that might be the issue?
16:37:03 <hiro> please irl[m] let me know if I am missing something
16:37:42 <irl[m]> yes everything will be in memory before writing the json files to disk, but i don't think memory is the issue
16:38:30 <irl[m]> the problem we've had in the past is where things have taken a bit longer for some reason or another (a spike in relays, for example, meaning more descriptors need to be processed) and then onionoo hasn't finished its run before the next hour, and then the next hour runs and trashes the data
16:38:52 <hiro> ah I get it
16:38:55 <irl[m]> i don't recall if we ever added a lock to stop that happening, maybe we did, but then onionoo might end up updating every 2 hours rather than hourly
16:39:41 <irl[m]> onionoo's data is already a few hours old just with delays in the pipeline so delaying it another hour really should be avoided
16:40:07 <GeKo> yeah, sounds reasonable
16:40:11 <hiro> so running the update manually the first time is excluded
16:40:37 <GeKo> hiro: did you talk about the general handling of overload data on relay-search in some s61 meeting already?
16:40:53 <GeKo> if so, it might be worth bringing that up there
16:41:00 <hiro> not yet... s61 meeting is later
16:41:02 <GeKo> given that not all s61 people are here
16:41:05 <GeKo> yeah
16:42:21 <arma2> right, i was going to say, dgoulet and mike probably have opinions too
16:42:33 <hiro> ok so yeah I'll bring this up later
16:42:33 <GeKo> we can discuss the steps there but so far, as i commented on the ticket, starting with the overload-general one and thinking abuot the other ones later does not seem unreasonable to me
16:42:35 <arma2> but also, 'how to visualize it in atlas' is squarely a metrics thing and thus squarely a network health thing
16:43:04 <GeKo> yeah
16:43:19 <GeKo> okay, anything else for today?
16:43:24 <irl[m]> it'd be really nice to be able to have a link to a blog post or knowledgebase article
16:43:25 <arma2> i'd love to get to the point where we have a graph over time of how many relays are reporting overload. that logic might be tricky if relays that don't report it are maybe not reporting it because they don't have the detection yet. should be solvable.
16:43:35 <irl[m]> like "you're out of file descriptors here is how to sysctl"
16:44:21 <irl[m]> we do this for new relays and non-recommended version relays, but the mechanism in relay search's code is a mess
16:44:22 <irl[m]> so it's not easy to add these
16:44:37 <hiro> so maybe that can come later and at this point we can offer the knowledge that the relay is actually overloaded
16:44:52 <hiro> and that will also allow us to have a nice graph of how many relays are in that state
16:44:55 <GeKo> irl[m]: yeah, the link to X should be solvable :)
16:45:02 <GeKo> and, yes, it's a good idea
16:45:08 <irl[m]> i started a react rewrite of relay search a while ago that made this a lot easier, one day it'd be nice to finish that
16:45:12 <arma2> we *almost* have that link on https://community.torproject.org/relay/setup/post-install/ or the like
16:45:24 <arma2> but the relay section of community.tpo is full of tiny pages, all nearly alike but not quite
16:45:32 <arma2> so there is some work to be done there too :)
16:46:01 <GeKo> alright. some food for thought :)
16:46:19 <GeKo> thanks for the meeting and have a nice week, everonye o/
16:46:20 <irl[m]> https://irl.sdf.org/irs/#/
16:46:23 <GeKo> #endmeetimg
16:46:29 <GeKo> #endmeeting