15:58:54 <GeKo> #startmeeting network-health 12/19/2022
15:58:54 <MeetBot> Meeting started Mon Dec 19 15:58:54 2022 UTC.  The chair is GeKo. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:58:54 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
15:59:00 <GeKo> okay, let's get started
15:59:18 <GeKo> on to the last team sync this year :)
15:59:32 <GeKo> the pad is, as usual, at: http://kfahv6wfkbezjyg4r6mlhpmieydbebr5vkok5r34ya464gqz6c44bnyd.onion/p/tor-nethealthteam-2021.1-keep
16:00:06 <GeKo> juga: hiro: ^
16:00:18 <GeKo> i wonder whether we have a ggus here for s112
16:00:26 <hiro> o/
16:01:25 <juga> o/
16:02:39 <GeKo> alright
16:03:01 <GeKo> hiro: how is it going the the infra issues we have all over the metrics place?
16:03:05 <GeKo> *with the
16:03:29 <hiro> so we had an issue with onionoo on friday
16:03:34 <hiro> someone was doing a lot of queries
16:03:43 <hiro> anarcat identified 3 or 4 azure ips
16:03:53 <hiro> and shut those down
16:04:04 <hiro> and it has been ok since then.
16:04:11 <GeKo> i guess that's still kind of ongoing given our alerts today in the morning?
16:04:16 <GeKo> ah, okay
16:04:41 <hiro> then there is an issue I think with getting bridges from polyanthum
16:05:01 <hiro> I got a collector alert I think on saturday and it was linked with an EOF error on one of the archives
16:05:18 <hiro> onionoo is running correctly but sometimes the bridge list is delayed
16:05:24 <hiro> I have an issue for the faulty archive
16:06:24 <GeKo> yeah
16:06:28 <hiro> it's very difficult to dig up issues because we just know some of the index is delayed
16:06:43 <hiro> I think we should make some time in the new year to have better log handling or something
16:07:15 <hiro> maybe a log browser like that loki service for prometheus
16:07:32 <hiro> https://grafana.com/oss/loki/
16:07:36 <hiro> we have a ticket open for this
16:07:52 <GeKo> sounds good
16:07:58 <hiro> I think we need a place that when we see an alert we can explore what is happening in all our services because things are very interconnected
16:08:06 <GeKo> monitoring-and-alerting#4
16:08:18 * GeKo nods
16:08:19 <hiro> yep
16:08:42 <hiro> also the nl7 onionperf instance is back
16:08:49 <hiro> greenhost told me they powercycled it
16:08:51 <GeKo> i hope things are settle down a bit over the holidays :)
16:08:57 <GeKo> nice \o/
16:09:05 <hiro> I tried a few times too over sunday morning but it wasn't coming back
16:09:07 <GeKo> *settling
16:10:41 <GeKo> hiro: do we have to kick the onionperf instance additionally?
16:10:48 <GeKo> it's not showing up in https://grafana2.torproject.org/d/TmYimmx7k/onionperf-clients-status?orgId=1 anymore
16:10:53 <hiro> uhm
16:11:00 <hiro> I logged into the machine and it was running
16:11:04 <hiro> I'll check again
16:11:33 <GeKo> k
16:12:27 <hiro> maybe it's prometheus that needs to be kicked
16:12:31 <hiro> on the machine I mean
16:12:51 <GeKo> yep
16:13:37 <GeKo> hiro: https://gitlab.torproject.org/tpo/network-health/metrics/website/-/issues/40078 contains a bunch of requests
16:13:52 <hiro> yeah I have those for this week
16:13:54 <GeKo> i am not sure whether we should do all of that now, like in this week
16:14:04 <hiro> was talking to richard last week
16:14:09 <GeKo> because given our lack of time
16:14:10 <hiro> yeah not sure we can
16:14:28 <GeKo> but maaaybe it is smart to touch the website things while we are at it
16:14:32 <hiro> but maybe let's see how much we can fix... especially on the app stats for the locale
16:14:39 <GeKo> right
16:15:03 <GeKo> having some kind of priority and making sure we are addressing the most important items seems like a good way to go
16:15:22 <GeKo> i mean there have been tickets around for enhancements for... quite some time :)
16:15:34 <hiro> I would like to finish having the data imported on metrics-psqlts-01 first and then do the rest
16:15:52 <GeKo> and it's not obvious why they should be suddenly such a high prio to squeeze them into the remaining days
16:15:57 <GeKo> +1
16:16:29 <GeKo> alright
16:16:33 <hiro> uhm I wanted to have a look at the locale thing first to be honest because I'd like to have a look at the webstats queries since we have issues with those
16:16:56 <hiro> if it's quick to fix the issue with the apps then also good
16:16:58 <GeKo> yeah, locale sounds like the most important of those items imo
16:17:23 <GeKo> right
16:18:02 <GeKo> okay, nothing marked in bold, do we still have stuff to discuss?
16:18:24 * juga is fine
16:18:50 * hiro is groot
16:19:00 <GeKo> just a small s112 update from my side:
16:19:21 <GeKo> i sent the criteria for O2.1 to bekeela so she is getting them to drl for some feedback
16:19:26 <GeKo> let's see how that goes
16:19:54 <juga> nice
16:20:10 <GeKo> and then i spent a lot of my time on analyzing an fd overload we saw in november
16:20:23 <GeKo> that's for the dos part of O3
16:21:04 <GeKo> might still need tomorrow to answer the most interesting questions
16:21:17 <GeKo> but then i should be able to focus on other stuff :)
16:21:29 <GeKo> seems we have no ggus here for today
16:21:43 <GeKo> i guess we can just call it then and get back to work
16:21:56 <GeKo> note: the next sync will be 1/9/2023
16:22:02 <juga> ack
16:22:14 <GeKo> a nice week everyone ü/
16:22:16 <GeKo> hah
16:22:18 <GeKo> o/
16:22:22 <juga> o/
16:22:22 <GeKo> #endmeeting