13:01:21 #startmeeting network-health 2025-05-05
13:01:21 Meeting started Mon May 5 13:01:21 2025 UTC. The chair is hiro. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:01:21 Useful Commands: #action #agreed #help #info #idea #link #topic.
13:01:30 does it work?
13:01:31 ah yeah
13:01:40 just a little slower over matrix I suppose
13:01:48 works
13:01:57 here is the pad: https://pad.riseup.net/p/tor-nethealthteam-2025-keep
13:02:29 I am trying to clean up a little bit of backlog this week after the issues with collector
13:03:09 does anyone have any topic for today?
13:03:34 sarthikg told me they have had something unexpected come up today
13:03:42 so they might not be around for the sync
13:04:00 hiro: the outages and metrics issues
13:04:52 yeah so I think collector should be ok... metricsdb-01 is still affected as far as I can see... onionoo I think has recovered
13:05:30 so collector is now using at least 23GiB of memory all the time
13:05:49 which is nuts
13:06:15 and it's still slowly growing over time
13:06:29 apart from spiking starting from that high plateau
13:06:37 I want to check how that is going to be when it is finished with the tarballs
13:07:29 but creating the tarballs never used that much memory before
13:07:56 yeah but we stopped (I did) the service a lot of times last week
13:08:06 so it might have been a bit behind with that
13:08:20 something I see now is that it has a lot of threads running
13:09:05 and I want to understand if that is because of it being behind or an issue with how I modified the scheduler... which I didn't, to be honest, but 🤷
13:09:23 fwiw the dir-auths have been fine since 2025-04-29 20:00:00 (give or take)
13:09:41 so there is no pressure from that side since then
13:10:00 I was having issues with collector also wed and thu
13:10:46 another thing worth doing in this context is writing down a timeline of that incident and checking what we tried and what happened
13:10:56 otherwise I feel we risk just running in circles
13:11:23 i recall we tried something in nov 2024 until we resorted to just bumping memory
13:11:38 but was that the same? did it help? etc.
13:11:52 yeah I changed the garbage collector algorithm a few times
13:12:02 because there are different flags with each version of java
13:12:11 and when java updates that changes how the memory is managed
13:12:39 the problem is often with the index and the tarballs
13:12:44 yeah. i think we should collect those things, otherwise we start from scratch again the next time something like this happens
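[For reference: the garbage-collector changes discussed above are JVM start-up flags, and the available options differ between Java releases. The exact flags used on the collector host are not recorded in this log, so the following is only a sketch of what a Shenandoah-based setup could look like, with an assumed jar name and placeholder heap and delay values:

    java -Xms8g -Xmx24g \
         -XX:+UseShenandoahGC \
         -XX:ShenandoahGCHeuristics=compact \
         -XX:ShenandoahUncommitDelay=5000 \
         -jar collector.jar

The "compact" heuristic and the uncommit delay are documented Shenandoah options intended to hand unused heap back to the OS sooner, at some throughput cost; on JDKs older than 15 Shenandoah was still experimental and additionally required -XX:+UnlockExperimentalVMOptions. Whether any of this matches what was actually deployed is not captured in the log.]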
13:13:00 and we never understand what worked and what didn't
13:13:35 well last time I had tried to use a different data structure for the index and for traversing it
13:13:50 like mapdb but then I ditched the patch
13:14:02 yup
13:14:02 because it was spiking the cpu like crazy and it wasn't making things better
13:14:13 so after a week I asked anarcat to pls bump memory
13:15:20 this time I tried a few different things and I think they are making things a bit better to be honest
13:15:39 one is streaming the writing of the index and the other one is making sure the threads empty the memory
13:15:55 I think the number of relays we have also impacts things
13:16:06 and the way the service is designed is inefficient
13:16:18 because it is designed for a different network
13:16:23 yeah, i know
13:16:30 also because there are better tools nowadays
13:17:04 so if we could use object storage that would help, also because with the min.io api we wouldn't have to create the index, we would get that for "free"
13:17:26 yeah, i know
13:17:31 but we aren't there yet
13:17:40 my point was just to document what we tried and how it went
13:18:32 e.g. on collector02 there are big memory spikes starting 2025-04-29 around 12:00 which went away 2025-05-01 around 10:00:00
13:18:32 yeah I guess it depends if we end up merging that patch or not... maybe it makes sense to try and deploy the old version and see if it was the dirauths being late
13:21:14 yeah so the latest version of this patch has been running since apr 29 21:40 utc
13:22:37 but on the first I tried again to tune the Shenandoah garbage collector algorithm with some new flags
13:22:47 so maybe that helped
13:23:39 apart from the collector issues we seem to have issues with other services, like the onionperf boxes, which we missed
13:23:51 i wonder how we can resolve those things
13:24:04 should we try to spread the load more to maintain those things?
13:24:25 i'd be happy to keep an eye on the onionperf things and fix things as they occur
13:24:45 I think they are all running though, aren't they? I see the measurements
13:24:56 do you mean the issue with the updates?
13:25:23 no, like https://gitlab.torproject.org/tpo/network-health/metrics/onionperf/-/issues/40077
13:26:11 why is the performance graph ok then... hold on
13:26:36 https://metrics.torproject.org/torperf.html
13:27:15 it's not okay
13:27:44 op-us8a doesn't even show up there as it's not been working for a long time
13:27:54 the public measurements that is
13:28:40 ah sorry, forgot we have another one...
13:29:15 I think yeah we can sync on maintaining those
13:29:37 the alerts also have a lot of false positives since the new system and should be fixed
13:29:49 yeah
13:29:55 so that doesn't help either
13:29:55 and we have things like https://gitlab.torproject.org/tpo/network-health/metrics/website/-/issues/40125
13:30:21 https://gitlab.torproject.org/tpo/network-health/metrics/website/-/issues/40123
13:30:31 https://gitlab.torproject.org/tpo/network-health/metrics/website/-/issues/40126
13:30:44 just to name some other tickets
13:31:10 which seem to support some kind of load sharing imo
13:32:23 well debugging the metrics db is a bit more complicated to share... I had missed these while fixing collector
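[A rough illustration of the streaming approach mentioned at 13:15:39, i.e. writing the index out incrementally instead of building the full structure in memory and serializing it at the end. This is a minimal sketch, not collector's actual code: it assumes the Jackson streaming API and made-up field names ("files", "path", "size"); the real index format and JSON library may differ.

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonGenerator;

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Sketch only: emit one index entry at a time so memory use stays flat
    // regardless of how many descriptor files there are.
    public class StreamingIndexWriter {

        public static void writeIndex(Path indexFile, List<Path> descriptorFiles) throws IOException {
            JsonFactory factory = new JsonFactory();
            try (OutputStream out = Files.newOutputStream(indexFile);
                 JsonGenerator gen = factory.createGenerator(out)) {
                gen.writeStartObject();
                gen.writeArrayFieldStart("files");          // field names are placeholders
                for (Path p : descriptorFiles) {
                    gen.writeStartObject();
                    gen.writeStringField("path", p.toString());
                    gen.writeNumberField("size", Files.size(p));
                    gen.writeEndObject();
                    gen.flush();                            // entry goes to disk, nothing accumulates
                }
                gen.writeEndArray();
                gen.writeEndObject();
            }
        }
    }

As noted at 13:17:04, with object storage this whole step could go away, since a bucket listing API already provides the file inventory that the index duplicates.]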
13:32:30 I wonder why we aren't producing those
13:33:43 ok
13:33:47 found the issue
13:33:58 let me fix that when the meeting is over
13:34:27 I think we are in a situation where metrics services have become somewhat fragile
13:35:54 and we might have to stop and think about how we can better maintain everything while we transition to new services
13:36:41 yeah i think so
13:37:09 or we could just say x is tier 1 prio, y is tier 2 prio and z is just best effort, and communicate that clearly
13:37:22 while focusing on the transition
13:37:35 that is an idea
13:37:51 or both
13:38:48 I did operate with the assumption that we could rely on metrics.tpo and collector for a while longer though
13:39:09 but with every system update (plus the current load fluctuations) we have issues
13:39:33 I'd be unhappy to have those as best effort to be honest
13:39:47 and then there are things happening on the network...
13:39:50 because we wouldn't have valuable tools like relay-search and the network-wide stats working properly
13:40:04 yeah, i agree
13:41:39 nothing else from my side
13:41:41 oh
13:41:50 so yeah let's fix these I guess
13:41:58 hiro: no need for you to do any p112 planning work
13:42:11 i am supposed to do that if necessary, i think
13:42:31 ok that sounds good
13:43:54 juga: do you have any topic to discuss?
13:44:12 hiro: i'm fine
13:44:22 continuing with sbws and chunks...
13:44:24 me and GeKo (IRC) have been discussing for a while now hehe
13:44:41 np, it was interesting
13:45:23 ok so if everyone is fine we could end the meeting
13:45:48 juga: once the infra is back (which seems to be the case) i plan to focus on sbws stuff
13:45:53 and p183 in general
13:46:00 GeKo: ok
13:46:13 i'm right with maatuska's sbws...
13:46:25 i'd really like to make progress on that part again...
13:46:32 yeah
13:46:44 i guess i'll do the analysis for mike first and then check the dashboards
13:46:55 sounds good
13:47:11 if there is anything more high prio for me, let me know
13:47:15 i'm still finishing with the chunk and production sbws
13:47:24 ack
13:47:24 ok, i will
13:49:56 ooook!
13:50:04 I'll end the meeting then
13:50:41 #endmeeting