15:58:54 #startmeeting network-health 12/19/2022
15:58:54 Meeting started Mon Dec 19 15:58:54 2022 UTC. The chair is GeKo. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:58:54 Useful Commands: #action #agreed #help #info #idea #link #topic.
15:59:00 okay, let's get started
15:59:18 on to the last team sync this year :)
15:59:32 the pad is, as usual, at: http://kfahv6wfkbezjyg4r6mlhpmieydbebr5vkok5r34ya464gqz6c44bnyd.onion/p/tor-nethealthteam-2021.1-keep
16:00:06 juga: hiro: ^
16:00:18 i wonder whether we have a ggus here for s112
16:00:26 o/
16:01:25 o/
16:02:39 alright
16:03:01 hiro: how is it going the the infra issues we have all over the metrics place?
16:03:05 *with the
16:03:29 so we had an issue with onionoo on friday
16:03:34 someone was doing a lot of queries
16:03:43 anarcat identified 3 or 4 azure ips
16:03:53 and shut those down
16:04:04 and it has been ok since then.
16:04:11 i guess that's still kind of ongoing given our alerts today in the morning?
16:04:16 ah, okay
16:04:41 then there is an issue I think with getting bridges from polyanthum
16:05:01 I got a collector alert I think on saturday and it was linked with an EOF error on one of the archives
16:05:18 onionoo is running correctly but sometimes the bridge list is delayed
16:05:24 I have an issue for the faulty archive
16:06:24 yeah
16:06:28 it's very difficult to dig up issues because we just know some of the index is delayed
16:06:43 I think we should make some time in the new year to have better log handling or something
16:07:15 maybe a log browser like that loki service for prometheus
16:07:32 https://grafana.com/oss/loki/
16:07:36 we have a ticket open for this
16:07:52 sounds good
16:07:58 I think we need a place that when we see an alert we can explore what is happening in all our services because things are very interconnected
16:08:06 monitoring-and-alerting#4
16:08:18 * GeKo nods
16:08:19 yep
16:08:42 also the nl7 onionperf instance is back
16:08:49 greenhost told me they powercycled it
16:08:51 i hope things are settle down a bit over the holidays :)
16:08:57 nice \o/
16:09:05 I tried a few times too over sunday morning but it wasn't coming back
16:09:07 *settling
16:10:41 hiro: do we have to kick the onionperf instance additionally?
16:10:48 it's not showing up in https://grafana2.torproject.org/d/TmYimmx7k/onionperf-clients-status?orgId=1 anymore
16:10:53 uhm
16:11:00 I logged into the machine and it was running
16:11:04 I'll check again
16:11:33 k
16:12:27 maybe it's prometheus that needs to be kicked
16:12:31 on the machine I mean
16:12:51 yep
16:13:37 hiro: https://gitlab.torproject.org/tpo/network-health/metrics/website/-/issues/40078 contains a bunch of requests
16:13:52 yeah I have those for this week
16:13:54 i am not sure whether we should do all of that now, like in this week
16:14:04 was talking to richard last week
16:14:09 because given our lack of time
16:14:10 yeah not sure we can
16:14:28 but maaaybe it is smart to touch the website things while we are at it
16:14:32 but maybe let's see how much we can fix... especially on the app stats for the locale
16:14:39 right
16:15:03 having some kind of priority and making sure we are addressing the most important items seems like a good way to go
16:15:22 i mean there have been tickets around for enhancements for... quite some time :)
16:15:34 I would like to finish having the data imported on metrics-psqlts-01 first and then do the rest
16:15:52 and it's not obvious why they should be suddenly such a high prio to squeeze them into the remaining days
16:15:57 +1
16:16:29 alright
16:16:33 uhm I wanted to have a look at the locale thing first to be honest because I'd like to have a look at the webstats queries since we have issues with those
16:16:56 if it's quick to fix the issue with the apps then also good
16:16:58 yeah, locale sounds like the most important of those items imo
16:17:23 right
16:18:02 okay, nothing marked in bold, do we still have stuff to discuss?
16:18:24 * juga is fine
16:18:50 * hiro is groot
16:19:00 just a small s112 update from my side:
16:19:21 i sent the criteria for O2.1 to bekeela so she is getting them to drl for some feedback
16:19:26 let's see how that goes
16:19:54 nice
16:20:10 and then i spent a lot of my time on analyzing an fd overload we saw in november
16:20:23 that's for the dos part of O3
16:21:04 might still need tomorrow to answer the most interesting questions
16:21:17 but then i should be able to focus on other stuff :)
16:21:29 seems we have no ggus here for today
16:21:43 i guess we can just call it then and get back to work
16:21:56 note: the next sync will be 1/9/2023
16:22:02 ack
16:22:14 a nice week everyone ü/
16:22:16 hah
16:22:18 o/
16:22:22 o/
16:22:22 #endmeeting