16:00:31 #startmeeting network-health 08/09/2021
16:00:31 Meeting started Mon Aug 9 16:00:31 2021 UTC. The chair is GeKo. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:31 Useful Commands: #action #agreed #help #info #idea #link #topic.
16:00:35 okay
16:00:40 o/
16:00:46 let's do some network-health sync again!
16:00:48 hiro: hi!
16:00:54 hi
16:01:00 hello
16:01:03 we still have http://kfahv6wfkbezjyg4r6mlhpmieydbebr5vkok5r34ya464gqz6c44bnyd.onion/p/tor-nethealthteam-2021.1-keep
16:01:26 please add your updates in case you have any and did not do so yet
16:02:04 mark the items you want to talk about in bold
16:02:22 or we can add separate discussion items if needed
16:02:48 gaba: ggus: feel free to join in case you are around
16:04:45 hi
16:05:00 i guess while other folks enter their items i can start a bit
16:05:05 ggus: o/
16:05:14 i've been on vacation for the past weeks
16:05:29 so this week i'll try to catch up on All The Things
16:05:50 my notes are a bit rough because i completely forgot these were even a thing, i'll add it back to my calendar for next week
16:06:08 we'll see how much time this permits for "regular" work :)
16:06:11 i think there is still an issue to discuss on the bridgestrap metrics
16:06:32 do we need some anti-censorship folks for that part, too?
16:06:39 bridgestrap exports metrics in the correct format but when you look at them they are often a bit meaningless
16:07:00 i think the issue is that the correct anti-censorship people are away
16:07:09 hah, i see
16:07:17 so i've got the collector implementation but i don't really want to deploy it to just have it archiving nonsense
16:07:38 but, my contract will end on aug 31st, so there's a hard time limit on finishing this
16:08:22 i also don't want this to hold up adding overload metrics to onionoo, but having both this and overload metrics in onionoo would make a really cool story for a blog post
16:08:45 irl[m]: what help do you need for the bridgestrap issue?
16:09:15 maybe if we can come up with requirements for the onionoo VMs we can speed up rolling out the overload metrics
16:09:57 https://gitlab.torproject.org/tpo/anti-censorship/bridgestrap/-/issues/20 and https://gitlab.torproject.org/tpo/anti-censorship/bridgestrap/-/issues/22
16:10:30 these issues need to be fixed, otherwise we're just feeding in bad data which will end up confusing bridge operators more than helping them
16:11:10 okay
16:11:28 meskio is afk i think, so hrm
16:11:47 hiro7[m]: i've not actually looked at the logs to see how long the runs currently take, and i've not looked at what the impact is of adding the additional descriptors, so first we should do those checks and see if it's a real problem or just one i've imagined or remembered from an old hypervisor host or something
16:12:12 i could imagine that things have changed in the time i was gone
16:12:16 hi!
16:12:53 irl[m]: okay, i'll think a bit about it
16:13:03 i am not sure if cohosh is around this week
16:13:11 and whether that would be enough to unblock you
16:13:36 cohosh: can you look at it soon too?
16:13:41 otherwise meskio is back next week
16:13:53 when i spoke to cohosh she filed the issue and assigned it to meskio, i don't know if it's something she can look at or if it's requiring some as yet undocumented understanding
16:14:05 yeah
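A minimal sketch of the run-length check irl[m] proposes at 16:11:47, measuring how long each onionoo run takes from its logs. The "Starting update"/"Finished update" markers and the timestamp format here are placeholders, not onionoo's actual log format, which would need to be confirmed first:

    import re
    import sys
    from datetime import datetime

    # Placeholder log markers and timestamp format; adjust to whatever the
    # real onionoo logs actually print.
    TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    START_RE = re.compile(TS + r".*Starting update")
    END_RE = re.compile(TS + r".*Finished update")

    def run_durations(path):
        """Yield the duration of each completed run found in the log."""
        started = None
        with open(path) as log:
            for line in log:
                if (m := START_RE.search(line)):
                    started = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
                elif (m := END_RE.search(line)) and started:
                    ended = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
                    yield ended - started
                    started = None

    if __name__ == "__main__":
        for duration in run_durations(sys.argv[1]):
            print(duration)

Durations creeping toward the hour mark would confirm the overlapping-run concern raised later in the meeting.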
16:15:30 irl[m]: where are we with getting hiro access to all the metrics services?
16:15:50 if i see my backlog correctly we did not start with that process yet?
16:16:12 ah i went on vacation and didn't do it
16:16:18 i'll do that after the meeting
16:16:29 great, thanks
16:17:59 okay
16:18:28 ggus: do you have some thoughts about how to deal with EOL relays in the network?
16:18:41 i want to make progress on that part in the coming weeks
16:18:47 and would love some input
16:19:16 back then i contacted a ton of operators on the old LTS to get them onto a newer tor release
16:19:23 last time we contacted all EOL relays. do you think we should do this again?
16:19:25 and iirc you did a thing on twitter
16:19:41 yeah, i think that should be part of it
16:20:08 lemme check that ticket, one sec
16:20:14 if you're contacting EOL relays anyway, we can do more than ask them to upgrade
16:20:28 ask them to use the contact info sharing thing
16:20:32 i'd like to figure out what we want to do with relays whose operators we do not reach
16:20:45 help us understand the network better by getting relay operators to give us that information
16:20:56 yeah, maybe
16:21:05 https://nusenu.github.io/ContactInfo-Information-Sharing-Specification/
16:21:30 "here's a list of things that would be helpful to make your relay more useful to the tor network"
16:21:49 i am not convinced by nusenu's idea
16:22:00 but sure we could think about what we could do while we are at it
16:22:13 mmmh, 1k relays running EOL.
16:22:25 yeah, it's rough
16:22:51 but we need to get on top of that issue sooner rather than later i think
16:23:39 ggus: i like the dir-auth link arma linked to as a starting point
16:24:30 https://gitweb.torproject.org/torspec.git/tree/attic/authority-policy.txt#n63
16:24:33 this one?
16:24:36 yes
16:25:19 the main tricky point as i see it is striking a good balance between kicking eol relays out due to security risks
16:25:30 * arma2 looks at backlog
16:25:41 and keeping them in the network for performance and diversity reasons
16:26:06 like, assuming those 1k relays can't upgrade/won't upgrade
16:26:16 GeKo: yes, like if a tor version fixing a TROVE was released, we would like relay operators to upgrade
16:26:25 i suppose we don't kick all of them out because they run an EOL version
16:26:40 yep
16:26:55 if there's a security issue then that wouldn't be a recommended version by authorities though, right?
16:27:02 that would just be kicked out
16:27:42 the issue is that there are EOL versions with no known security issues, but that don't have many eyes on the code actually looking for them, so it's the unknown security issues we want to protect against
16:27:45 we have a bunch of relays with a not-recommended version running
16:27:47 GeKo: i think it's a good starting point. filtering EOL relays by security issues, contacting them by email, and then blocking if they don't upgrade.
16:28:05 how often do we write these operators? or do we print something in the logs?
16:28:14 irl[m]: that's part of the problem
16:28:22 hiro: we did that once so far :)
16:28:28 for 0.2.9 LTS
16:28:32 once in a century :P
16:28:38 because it's a ton of work
16:28:43 as a relay operator i've never read the logs of my relays once they've been set up
16:28:44 we also wrote a blog post
16:28:47 and then we dropped the ball
16:28:52 right
16:29:00 irl[m]: I didn't want to say that but I don't read my logs either
16:29:13 backlog looks reasonable so far. i agree with geko that the ContactInfo-Information-Sharing-Specification needs more attention, maybe by the ux team, before we try to push it on everybody.
16:29:16 i do check relay search, and make sure they've got reasonable consensus weight and bandwidth looks consistent
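For the outreach step discussed above, a sketch of how the affected operators could be enumerated via the public Onionoo API. The details endpoint and the version, recommended_version, and contact fields are documented Onionoo features, but "not recommended" is a superset of "EOL", so the result would still need filtering against the actual EOL version list:

    import json
    import urllib.request

    # Only request the fields we need from the details documents.
    URL = ("https://onionoo.torproject.org/details?running=true"
           "&fields=nickname,fingerprint,version,recommended_version,contact")

    with urllib.request.urlopen(URL) as resp:
        details = json.load(resp)

    for relay in details["relays"]:
        # recommended_version is False for relays running a version the
        # directory authorities do not recommend.
        if relay.get("recommended_version") is False:
            print(relay["fingerprint"],
                  relay.get("version", "?"),
                  relay.get("contact", "(no contact info)"))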
16:29:23 ggus: anyway, that's for you and us to think about
16:29:29 nothing to nail down today
16:29:41 but i want to get that done soonish(tm)
16:29:54 soonish starting at the end of this month?
16:30:13 earlier i hope but not later :)
16:30:36 not done in the sense we are finally done
16:30:56 but in the sense we have something in the background we can act on and iterate on
16:31:03 i'm very busy till august 18th, after that i can think about other big tasks
16:31:19 okay, sounds good and works for me
16:31:52 i'll keep thinking meanwhile a bit more for myself and we can pick this up in earnest after august 18th
16:31:56 sounds good?
16:32:03 yes! perfect!
16:32:09 nice
16:32:35 do we have anything else to discuss for today?
16:33:16 uhm I guess the overload metrics
16:33:16 i will catch up with geko sometime later on whatever i missed here (am focusing on an anti-censorship team thing lately)
16:33:20 just quickly
16:34:14 hiro: go ahead :)
16:35:02 yeah so in the beginning I started with the idea to make a boolean out of the two overload fields in the bandwidth document in onionoo, with the idea that we could display an amber dot next to the bandwidth graph only if the relay was overloaded
16:35:47 but discussing with irl[m] we came up with the idea that it would make sense to have it in the details document instead, which led to the problem of processing all the extrainfo descriptors in onionoo
16:36:27 so now the consensus is to use the overload-general line in the server descriptor as a first step and then figure out how to process the extrainfo descriptors
16:36:55 my understanding reading the code is that onionoo saves it all in memory before writing it to disk... so that might be the issue?
16:37:03 please irl[m] let me know if I am missing something
16:37:42 yes everything will be in memory before writing the json files to disk, but i don't think memory is the issue
16:38:30 the problem we've had in the past is where things have taken a bit longer for some reason or another (a spike in relays, for example, meaning more descriptors need to be processed) and then onionoo hasn't finished its run before the next hour, and then the next hour runs and trashes the data
16:38:52 ah I get it
16:38:55 i don't recall if we ever added a lock to stop that happening, maybe we did, but then onionoo might end up updating every 2 hours rather than hourly
16:39:41 onionoo's data is already a few hours old just with delays in the pipeline, so delaying it another hour really should be avoided
16:40:07 yeah, sounds reasonable
16:40:11 so running the update manually the first time is excluded
16:40:37 hiro: did you talk about the general handling of overload data on relay-search in some s61 meeting already?
16:40:53 if not, it might be worth bringing that up there
16:41:00 not yet... s61 meeting is later
16:41:02 given that not all s61 people are here
16:41:05 yeah
16:42:21 right, i was going to say, dgoulet and mike probably have opinions too
16:42:33 ok so yeah I'll bring this up later
16:42:33 we can discuss the steps there but so far, as i commented on the ticket, starting with the overload-general one and thinking about the other ones later does not seem unreasonable to me
16:42:35 but also, 'how to visualize it in atlas' is squarely a metrics thing and thus squarely a network health thing
16:43:04 yeah
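To make the "overload-general first" step concrete: the line lives in the relay's server descriptor, so a first look at the data can be had by scanning a batch of server descriptors, e.g. concatenated from a CollecTor archive. A sketch, assuming the "overload-general <version> <YYYY-MM-DD HH:MM>" line format from the overload reporting spec:

    import sys

    def count_overloaded(path):
        """Count descriptors and overload-general lines in a file of
        concatenated server descriptors."""
        total = overloaded = 0
        with open(path) as f:
            for line in f:
                if line.startswith("router "):  # first line of each descriptor
                    total += 1
                elif line.startswith("overload-general "):
                    overloaded += 1
        return total, overloaded

    if __name__ == "__main__":
        total, overloaded = count_overloaded(sys.argv[1])
        print(f"{overloaded}/{total} descriptors report overload-general")

Run over successive archives, this is also the raw material for the graph-over-time idea that comes up next.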
16:43:19 okay, anything else for today?
16:43:24 it'd be really nice to be able to have a link to a blog post or knowledgebase article
16:43:25 i'd love to get to the point where we have a graph over time of how many relays are reporting overload. that logic might be tricky if relays that don't report it are maybe not reporting it because they don't have the detection yet. should be solvable.
16:43:35 like "you're out of file descriptors, here is how to sysctl"
16:44:21 we do this for new relays and non-recommended version relays, but the mechanism in relay search's code is a mess
16:44:22 so it's not easy to add these
16:44:37 so maybe that can come later, and at this point we can offer the knowledge that the relay is actually overloaded
16:44:52 and that will also allow us to have a nice graph of how many relays are in that state
16:44:55 irl[m]: yeah, the link to X should be solvable :)
16:45:02 and, yes, it's a good idea
16:45:08 i started a react rewrite of relay search a while ago that made this a lot easier, one day it'd be nice to finish that
16:45:12 we *almost* have that link on https://community.torproject.org/relay/setup/post-install/ or the like
16:45:24 but the relay section of community.tpo is full of tiny pages, all nearly alike but not quite
16:45:32 so there is some work to be done there too :)
16:46:01 alright. some food for thought :)
16:46:19 thanks for the meeting and have a nice week, everyone o/
16:46:20 https://irl.sdf.org/irs/#/
16:46:23 #endmeetimg
16:46:29 #endmeeting
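As a footnote to the "out of file descriptors" pointer at 16:43:35: the knowledgebase article could start from a check like the one below and then walk the operator through raising the limit (ulimit -n, LimitNOFILE= in a systemd unit, or system-wide sysctl settings, depending on the platform). A minimal sketch:

    import resource

    # Read this process's open-file limits; a relay process that keeps
    # hitting the soft limit is a candidate for the overload warning
    # discussed above.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open file descriptor limit: soft={soft}, hard={hard}")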