13:01:21 <hiro> #startmeeting network-health 2025-05-05
13:01:21 <MeetBot> Meeting started Mon May  5 13:01:21 2025 UTC.  The chair is hiro. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:01:21 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
13:01:30 <hiro> does it work?
13:01:31 <hiro> ah yeah
13:01:40 <hiro> just a little slower over matrix I suppose
13:01:48 <GeKo> works
13:01:57 <hiro> here is the pad: https://pad.riseup.net/p/tor-nethealthteam-2025-keep
13:02:29 <hiro> I am trying to clean up a little bit of backlog this week after the issues with collector
13:03:09 <hiro> does anyone have any topic for today?
13:03:34 <hiro> sarthikg told me they have had something unexpected come up today
13:03:42 <hiro> so might not be around for the sync
13:04:00 <GeKo> hiro: the outages and metrics issues
13:04:52 <hiro> yeah so I think collector should be ok... metricsdb-01 is still affected as far as I can see... onionoo I think has recovered
13:05:30 <GeKo> so collector is now using at least 23GiB of memory all the time
13:05:49 <GeKo> which is nuts
13:06:15 <GeKo> and it's still slowly growing over time
13:06:29 <GeKo> apart from spiking starting from that high plateau
13:06:37 <hiro> I want to check how that is going to be when it is finished with the tarballs
13:07:29 <GeKo> but creating the tarballs never used that much memory before
13:07:56 <hiro> yeah but we stopped (I did) the service a lot of times last week
13:08:06 <hiro> so it might have been a bit behind with that
13:08:20 <hiro> something I see now is that it has a lot of threads running
13:09:05 <hiro> and I want to understand if that is because of it being behind or if that's an issue with how I modified the scheduler... which I didn't, to be honest, but 🤷
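A minimal sketch of how those threads could be checked, using only the standard java.lang.management API (this is generic JMX code, not part of CollecTor; on the real host it would run in-process or over a remote JMX connection):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // A backlog tends to show up as many worker threads doing real work; a
        // scheduler problem tends to show up as an ever-growing pile of idle threads.
        System.out.println("live threads: " + threads.getThreadCount()
                + " (peak: " + threads.getPeakThreadCount() + ")");
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            System.out.println(info.getThreadName() + " -> " + info.getThreadState());
        }
    }
}
```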
13:09:23 <GeKo> fwiw the dir-auths have been fine since 2025-04-29 20:00:00 (give or take)
13:09:41 <GeKo> so there is no pressure since then from that side
13:10:00 <hiro> I was having issues with collector also wed and thu
13:10:46 <GeKo> another thing worth doing in this context is writing down a timeline of that incident and checking what we tried and what happened
13:10:56 <GeKo> i feel otherwise we risk just running in circles
13:11:23 <GeKo> i recall we tried something in nov 2024 until we resorted to just bumping memory
13:11:38 <GeKo> but was that the same? did it help? etc.
13:11:52 <hiro> yeah I changed the garbage collector algorithm a few times
13:12:02 <hiro> because there are different flags with each version of java
13:12:11 <hiro> and when java updates that changes how the memory is managed
13:12:39 <hiro> the problem is often with the index and the tarballs
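Since the available flags and the default collector change between Java releases, each entry in the incident timeline could also record which GC the JVM was actually running with at the time. A hedged sketch using only the standard management beans (not CollecTor code):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcInventorySketch {
    public static void main(String[] args) {
        // Log the Java version and the collectors in use, so later we can tell which
        // GC configuration each memory observation was made under.
        System.out.println("java.version = " + System.getProperty("java.version"));
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " collections=" + gc.getCollectionCount()
                    + " total_pause_ms=" + gc.getCollectionTime());
        }
    }
}
```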
13:12:44 <GeKo> yeah. i think we should collect those things otherwise we start from scratch again the next time something like this happens
13:13:00 <GeKo> and we never understand what worked and what didn't
13:13:35 <hiro> well last time I had tried to use a different data structure for the index and traversing it
13:13:50 <hiro> like mapdb but then I ditched the patch
13:14:02 <GeKo> yup
13:14:02 <hiro> because it was spiking in cpu like crazy and it wasn't making things better
13:14:13 <hiro> so after a week I asked anarcat to pls bump memory
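For the record GeKo is asking for, a MapDB-backed index along those lines would roughly look like the hypothetical sketch below (not the actual ditched patch; file, map and key names are made up): the index lives in a memory-mapped file instead of on the heap, which relieves heap pressure at the cost of serialization and I/O work on every traversal, which is consistent with the CPU spikes described.

```java
import java.util.concurrent.ConcurrentNavigableMap;

import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

public class MapDbIndexSketch {
    public static void main(String[] args) {
        // Keep the index in a memory-mapped file rather than on the JVM heap.
        DB db = DBMaker.fileDB("collector-index.db")
                .fileMmapEnable()
                .make();
        ConcurrentNavigableMap<String, Long> index = db
                .treeMap("descriptors", Serializer.STRING, Serializer.LONG)
                .createOrOpen();
        index.put("recent/relay-descriptors/consensuses/2025-04-29-12-00-00-consensus", 123456L);
        // Traversal becomes a range scan over the sorted on-disk keys.
        index.subMap("recent/", "recent/\uffff").forEach((path, size) ->
                System.out.println(path + " " + size));
        db.close();
    }
}
```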
13:15:20 <hiro> this time I tried a few different things and I think they are making things a bit better to be honest
13:15:39 <hiro> one is streaming the index writing and the other one is making sure the threads empty the memory
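The streaming part is roughly the idea below (a generic illustration using Jackson's streaming API, not the actual patch; the field names are only an approximation of CollecTor's index.json): each entry is written and flushed as it is produced, so the full index never has to be materialized on the heap.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;

public class StreamingIndexWriterSketch {
    public static void main(String[] args) throws IOException {
        Path out = Path.of("index.json");
        JsonFactory factory = new JsonFactory();
        try (OutputStream os = Files.newOutputStream(out);
             JsonGenerator gen = factory.createGenerator(os)) {
            gen.writeStartObject();
            gen.writeStringField("index_created", "2025-05-05 13:00");
            gen.writeArrayFieldStart("files");
            // In the real service this loop would walk the archive directories.
            for (String path : new String[] {"recent/exit-lists/2025-05-05-12-02-00"}) {
                gen.writeStartObject();
                gen.writeStringField("path", path);
                gen.writeEndObject();
                gen.flush(); // this entry is on disk; nothing accumulates in memory
            }
            gen.writeEndArray();
            gen.writeEndObject();
        }
    }
}
```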
13:15:55 <hiro> I think the number of relays we have also impacts things
13:16:06 <hiro> and the way the service is designed is inefficient
13:16:18 <hiro> because it is designed for a different network
13:16:23 <GeKo> yeah, i know
13:16:30 <hiro> also because there are better tools nowadays
13:17:04 <hiro> so if we could use object storage that would help, also because with the min.io api we wouldn't have to create the index, we would get that for "free"
13:17:26 <GeKo> yeah, i know
13:17:31 <hiro> but we aren't there yet
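A hedged sketch of that object-storage idea with the MinIO Java SDK (the endpoint, credentials and bucket name are made up): the listing is served by the store itself, so a separately built in-memory index would not be needed.

```java
import io.minio.ListObjectsArgs;
import io.minio.MinioClient;
import io.minio.Result;
import io.minio.messages.Item;

public class ObjectStoreListingSketch {
    public static void main(String[] args) throws Exception {
        MinioClient client = MinioClient.builder()
                .endpoint("https://objects.example.torproject.org") // hypothetical endpoint
                .credentials("ACCESS_KEY", "SECRET_KEY")
                .build();
        // The bucket listing replaces index generation: paths and sizes come from the store.
        Iterable<Result<Item>> results = client.listObjects(
                ListObjectsArgs.builder()
                        .bucket("collector")   // hypothetical bucket
                        .prefix("recent/")
                        .recursive(true)
                        .build());
        for (Result<Item> result : results) {
            Item item = result.get();
            System.out.println(item.objectName() + " " + item.size());
        }
    }
}
```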
13:17:40 <GeKo> my point was just to document what we tried and how it went
13:18:32 <GeKo> e.g. on collector02 there were big memory spikes starting 2025-04-29 around 12:00 which went away 2025-05-01 around 10:00
13:18:32 <hiro> yeah I guess it depends if we end up merging that patch or not... maybe it makes sense to try and deploy the old version and see if it was the dirauths being late
13:21:14 <hiro> yeah so the latest version of this patch has been running since apr 29 21:40 utc
13:22:37 <hiro> but on the 1st I tried again to tune the Shenandoah garbage collector with some new flags
13:22:47 <hiro> so maybe that helped
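The exact flags are not in this log; purely as an illustration of the kind of knobs involved (flag names are from the OpenJDK Shenandoah documentation, the values are made up), the JVM options might look like the comment below, and whether they took effect can be checked at runtime:

```java
import java.lang.management.ManagementFactory;

import com.sun.management.HotSpotDiagnosticMXBean;

// Illustrative Shenandoah options (values made up, not the ones used on the collector host):
//   -XX:+UseShenandoahGC
//   -XX:ShenandoahGCHeuristics=compact         favour footprint over throughput
//   -XX:ShenandoahUncommitDelay=60000          return idle heap to the OS after 60s
//   -XX:ShenandoahGuaranteedGCInterval=300000  force a cycle at least every 5 minutes
public class ShenandoahFlagCheck {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        for (String flag : new String[] {"UseShenandoahGC", "ShenandoahGCHeuristics",
                "ShenandoahUncommitDelay", "ShenandoahGuaranteedGCInterval"}) {
            // getVMOption throws IllegalArgumentException if this JVM build lacks the flag.
            System.out.println(flag + " = " + diag.getVMOption(flag).getValue());
        }
    }
}
```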
13:23:39 <GeKo> apart from the collector issues we seem to have issues with other services like onionperf boxes which we miss
13:23:51 <GeKo> i wonder how we can resolve those things
13:24:04 <GeKo> should we try to spread the load more to maintain those things?
13:24:25 <GeKo> i'd be happy to keep an eye on the onionperf things and fix things as they occur
13:24:45 <hiro> I think they are all running though aren't they? I see the measurements
13:24:56 <hiro> do you mean the issue with the updates?
13:25:23 <GeKo> no like https://gitlab.torproject.org/tpo/network-health/metrics/onionperf/-/issues/40077
13:26:11 <hiro> why is the performance graph ok then... hold on
13:26:36 <hiro> https://metrics.torproject.org/torperf.html
13:27:15 <GeKo> it's not okay
13:27:44 <GeKo> op-us8a doesn't even show up there as it's not been working for a long time
13:27:54 <GeKo> the public measurements that is
13:28:40 <hiro> ah sorry forgot we have another one...
13:29:15 <hiro> I think yeah we can sync on maintaining those
13:29:37 <hiro> the alerts also have a lot of false positives since the new system and should be fixed
13:29:49 <GeKo> yeah
13:29:55 <hiro> so that doesn't help either
13:29:55 <GeKo> and we have things like https://gitlab.torproject.org/tpo/network-health/metrics/website/-/issues/40125
13:30:21 <GeKo> https://gitlab.torproject.org/tpo/network-health/metrics/website/-/issues/40123
13:30:31 <GeKo> https://gitlab.torproject.org/tpo/network-health/metrics/website/-/issues/40126
13:30:44 <GeKo> just to name some other tickets
13:31:10 <GeKo> which seem to support some kind of load sharing imo
13:32:23 <hiro> well debugging the metrics db is a bit more complicated to share... I had missed these while fixing collector
13:32:30 <hiro> I wonder why we aren't producing those
13:33:43 <hiro> ok
13:33:47 <hiro> found the issue
13:33:58 <hiro> let me fix that when the meeting is over
13:34:27 <hiro> I think we are in a situation where metrics services have become somewhat fragile
13:35:54 <hiro> and we might have to stop and think about how we can better maintain everything while we transition to new services
13:36:41 <GeKo> yeah i think so
13:37:09 <GeKo> or we could just say x is tier 1 prio, y is tier 2 prio and z is just best effort and communicate that clearly
13:37:22 <GeKo> while focusing on the transition
13:37:35 <hiro> that is an idea
13:37:51 <GeKo> or both
13:38:48 <hiro> I did operate with the assumption that we could rely on metrics.tpo and collector for a while longer though
13:39:09 <hiro> but with every system update (plus the current load fluctuations) we have issues
13:39:33 <hiro> I'd be unhappy to have those as best effort to be honest
13:39:47 <GeKo> and then there are things happening on the network...
13:39:50 <hiro> because we wouldn't have valuable tools like relay-search and the network-wide stats working properly
13:40:04 <GeKo> yeah, i agree
13:41:39 <GeKo> nothing else from my side
13:41:41 <GeKo> oh
13:41:50 <hiro> so yeah lets fix these I guess
13:41:58 <GeKo> hiro: no need for you to do any p112 planning work
13:42:11 <GeKo> i am supposed to do that if necessary, i think
13:42:31 <hiro> ok that sounds good
13:43:54 <hiro> juga: do you have any topic to discuss?
13:44:12 <juga> hiro: i'm fine
13:44:22 <juga> continuing with sbws and chunks...
13:44:24 <hiro> me and GeKo (IRC) have been discussing for a while now hehe
13:44:41 <juga> np, it was interesting
13:45:23 <hiro> ok so if everyone is fine we could end the meeting
13:45:48 <GeKo> juga: once the infra is back (which seems to be the case) i plan to focus on sbws stuff
13:45:53 <GeKo> and p183 in general
13:46:00 <juga> GeKo: ok
13:46:13 <juga> i'm right with maatuska's sbws...
13:46:25 <GeKo> i'd really like to make progress on that part again...
13:46:32 <juga> yeah
13:46:44 <GeKo> i guess i do the analysis for mike first and then check the dashboards
13:46:55 <juga> sounds good
13:47:11 <GeKo> if there is anything more high prio for me, let me know
13:47:15 <juga> i'm still finishing with the chunk and production sbws
13:47:24 <GeKo> ack
13:47:24 <juga> ok, i will
13:49:56 <hiro> ooook!
13:50:04 <hiro> I'll end the meeting then
13:50:41 <hiro> #endmeeting