17:00:13 #startmeeting snapshot 20241118
17:00:13 Meeting started Mon Nov 18 17:00:13 2024 UTC. The chair is weasel. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:13 Useful Commands: #action #agreed #help #info #idea #link #topic.
17:00:24 who's around?
17:00:25 o/
17:00:28 *wave*
17:00:31 hey
17:00:32 o/
17:00:45 Hey!
17:00:50 wow. full room, thanks everyone!
17:01:05 ln5 asked me to chair, so I'll do what I can do
17:01:19 #topic agenda
17:02:06 I propose the following agenda items: * status update, * roles and responsibilities (re email), * any other business, * close
17:02:12 rate-limit and caching of s.d.o
17:02:16 anything we should put on the agenda at this time?
17:02:18 noted.
17:02:28 re-import of ports
17:02:34 roles and responsibilities
17:02:37 axhn: that was in status update :)
17:02:40 ln5: already on the list
17:02:45 oh, tnx
17:02:49 varnish internal redirects to farm
17:02:50 weasel: fine for me :)
17:03:04 (i'm less brainy with this fever running. thanks for chairing.)
17:03:29 * status update,
17:03:29 * roles and responsibilities (re email),
17:03:29 * rate-limit and caching of s.d.o
17:03:29 * varnish internal redirects
17:03:29 * any other business,
17:03:30 * close
17:03:36 #topic status updates
17:03:56 axhn: let's start with the -ports re-import, shall we? you have the floor
17:04:02 Oh well
17:04:31 Status is, I finally found some time, and learned I have to re-compute all the files' sha1sums
17:04:53 Doing this might take years; I found some shortcuts, but it will still take weeks.
17:05:20 weasel will eventually get some files, you showed a blueprint in June or so
17:05:32 What do you do with this information? Could you read every file once, compute and figure out if it needs to be sent off and send off the bytes? :)
17:05:37 how come it'd take so long? any clever insight we might arrive at?
17:06:10 It's zfs snapshots on spinning rust, so most of the time it's waiting for the disks.
17:06:36 Gimme 16 Tbyte NVMe, but transferring the data takes some time as well
17:06:53 if you computed the checksum of a file on one snapshot, can you easily tell if it's the same file in the next snapshot? mtime/ctime/filesize/inode?
17:07:06 Yeah, that's the shortcut
17:07:22 that's what a snapshot import run does as well, iirc
17:08:18 There are issues where the hashsum in a Packages does not match a .deb found on the disk. Will possibly ignore the first; finding the cause for this might be difficult.
17:08:36 Anyway, not much more to say.
17:08:56 that.. seems weird, but possible.
17:08:59 thanks for the update.
17:09:20 yw
17:09:26 another server with the specs of -mlm-01 is in the pipe, no ETA yet
17:09:34 great, thanks.
17:09:44 what else is new or had action times recently, and doesn't have its own point later on?
17:10:02 if nothing comes up, we can re-raise that question at Any-Other-Business.
17:10:11 going once,
17:10:16 twice,
17:10:19 small point
17:10:20 #topic roles and responsibilities (re email),
17:10:27 lyknode: go
17:10:27 later then ;)
17:10:50 #topic status updates, redux
17:11:16 Just to say I've merged a couple of MRs from fmoessbauer, mainly regarding max-age of redirects
17:11:28 That would improve caching on the client side
17:11:38 Thanks for the review and quick integration!
17:11:50 sounds good. seems like we can get into details at rate-limit and caching?
17:12:15 I think we could go a step further and maybe set a high max-age for all replies?
17:12:21 yeah sure.
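A minimal sketch (Python, not the actual import code) of the shortcut discussed in the -ports status update above: only rehash a file when its mtime/ctime/size/inode differs from the previous snapshot, otherwise reuse the stored sha1. All names and the cache format are illustrative assumptions.

    import hashlib
    import os

    def file_key(path):
        st = os.stat(path)
        # cheap "has it changed?" fingerprint: inode, size, mtime, ctime
        return (st.st_ino, st.st_size, int(st.st_mtime), int(st.st_ctime))

    def sha1_of(path, bufsize=1024 * 1024):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def hash_tree(root, previous):
        """previous maps relative path -> (key, sha1) from the last snapshot run."""
        current = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, root)
                key = file_key(path)
                old = previous.get(rel)
                if old is not None and old[0] == key:
                    current[rel] = (key, old[1])         # unchanged: reuse old digest
                else:
                    current[rel] = (key, sha1_of(path))  # new or changed: rehash
        return current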
17:12:29 #topic roles and responsibilities, rly now
17:12:36 ln5: do you want to summarize?
17:12:41 let me try
17:13:11 "DSA wires up the web setup to serve things and the remainder is on the service owner" is the current description of how the responsibilities fall between DSA and snapshot service owners
17:13:24 taken from an email from today (thanks pkern[m])
17:13:33 is there more, from history?
17:13:41 (Not speaking in an official capacity, inferring more from other services and history.)
17:14:03 Question is also: Who is on the hook for keeping the serving lights on? DSA does some monitoring, is DSA on the hook to recover things? Probably?
17:14:10 Is there an "org-chart" or similar showing the infrastructure and responsible teams / persons?
17:14:16 ln5: it follows from the "only DSA has uid0" stuff, because they, then we, didn't want people to randomly change stuff
17:14:27 bc i've had trouble understanding what i can and should do, as well as who i should ask for help, and such
17:14:59 so dsa did all the things that required root, and gave some role account to the people operating a given service.
17:15:09 weasel: yes, but things are intertwined and DSA might need help with some things here
17:15:27 that worked ok-ish 10 years ago, but with something as complex as snapshot it doesn't really work all that well and it also doesn't scale
17:15:35 but that's where things come from
17:15:46 , like the web app in recent rate-limiting things
17:16:26 so will it work now? with me not being DSA while weasel was
17:16:30 i mean is
17:16:30 DSA is kinda limited on attention and time (and spoons). Keeping the lights on.
17:16:42 Looking at it from the outside. 🙃
17:16:49 the lights are flickering
17:17:05 can we do something about it, like run more containers? as has been suggested
17:17:27 moving more stuff into containers is one way to work around requiring DSA for all things
17:17:41 but you mentioned at some point that containers come with their own issues
17:17:46 I think eventually we need a good way of running containers, including knowing what went into them and regular rebuilds. (Lest they get stale and DSA sad.)
17:17:56 right
17:18:10 containers done right would auto-build regularly and be orchestrated in a sane manner
17:18:20 But the other question is if they'd be directly serving traffic or if there's something in front. That said, the current setup with haproxy/varnish/apache2 is a tad too complex for my taste. :>
17:18:48 but for us to know if it's even worth doing it, we'd need a use-case where it actually works
17:18:51 We already struggle with some services getting no attention. Snapshot is the counterexample though, which is nice!
17:19:11 weasel: use case outside snapshot?
17:19:14 mainly due to new momentum. it was not getting much attention for a long time, which is why it's in this state
17:19:28 ln5: no, snapshot working with containers is a use-case in my book.
17:19:37 but the caching stuff could be in a container as well
17:19:41 the DB also, maybe.
17:19:52 that's what i've been doing on -dev-01
17:19:55 \o/
17:19:58 It's also about having a clear picture of the whole chain from DNS to the flask app
17:20:06 db, apache, snapshot in separate containers
17:20:09 but no caching yet
17:20:19 fmoessbauer: Admittedly that's something Someone(TM) could Just(TM) document. :>
17:20:32 if prod had the caching in a container, would that help you?
17:20:36 and if yes, why don't we? :)
17:21:10 if it would give me read access to logs and write access to config and HUP powers, it would have helped
17:21:23 well, i should not see user data. i think.
17:21:30 I think it would these things
17:21:42 i mean, i think DSA thinks i should not see user data
17:21:55 i'd love to not see it! to be clear :)
17:22:00 we have given the service owner read access to the weblogs. granted, we have anonymized them for years, but you can already do netstat/ss
17:22:53 does service owner of s.d.o start at the apache server? Or does it include the varnish as well?
17:23:07 undefined
17:23:10 and haproxy?
17:23:17 currently neither
17:23:21 def not netfilter iiuc
17:23:21 certainly the service owner can't mess with varnish, haproxy and apache without DSA
17:23:26 We could redact the logs to 127.0.0.1 as well, that'd be easy. But it also makes DSA's life hard. So we are not happy with the current anonymization we have in some places either.
17:23:37 weasel: Well, apache has the reload sudoers thing in theory.
17:23:52 right
17:24:15 i think this might have worked better with a service owner who was also DSA
17:24:22 I personally don't have any issues with moving more services into service-owner controlled containers
17:24:29 does this mean we should design in another way?
17:24:32 ln5: That is true in many places, unfortunately.
17:24:38 ah, ic
17:24:49 it was also hard to find other people who cared about snapshot for ages :)
17:24:52 I could probably take an AI to export an anonymized log into a different directory using varnishncsa.
17:24:58 If that helps.
17:25:08 But I'm not sure if that's actually a solution to the problem you are having.
17:25:17 what is an AI?
17:25:19 These logs are pretty structured. Having histograms would already help a lot
17:25:34 if the problem is no logs, we should give you read access. but I don't have the impression that's the whole of it
17:25:47 that's only part of it, true
17:26:01 need to be able to try something, look at logs, back out, and so on
17:26:15 containers?
17:26:19 but some essential part IMHO, that's how we can understand traffic for snapshot
17:26:35 It's also about statistics on how much is cached where - iow how much traffic hits the backend.
17:26:47 yes
17:27:11 so i think pkern[m] is on to something in recent email to list, with "more metrics needed"
17:27:39 that has also been a truism for the last 15 years, ever since munin stopped scaling when we hit 15 hosts :)
17:27:46 is there a prom server and grafana foo around already?
17:28:01 not yet. don't hold your breath
17:28:04 ack
17:28:18 should i try moving such things forward?
17:28:19 fmoessbauer: It's not clear to me if "how much is cached where" is necessarily relevant. What's relevant is to serve whatever gets through to the webapp and provide hints to the infrastructure. I imagine (but that's conjecture) that there's a small set of frequently requested stuff and everything else is one-offs.
17:28:36 ln5: I'm sure help would be appreciated. but DSA cycles are sparse.
17:28:53 weasel: can it be done outside of DSA?
17:28:56 @pkern[m] -> caching topic. But IMHO that is quite relevant to find the bottlenecks
17:29:06 ln5: unclear
17:29:37 so what we're seeing here wrt roles and responsibilities and power is not unique to snapshot?
17:29:44 just doublechecking
17:29:56 Correct.
17:30:06 snapshot is more special than most because you can't really run it with a single role account.
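Since the discussion above notes that the (anonymized) logs are well structured and that histograms over them would already help, here is a minimal Python sketch of that idea. The log path and the combined-log format are assumptions, and /file/<sha1> requests are bucketed together so the output stays readable.

    import re
    from collections import Counter

    # matches the request and status fields of an Apache/varnishncsa combined-format line
    LOG_LINE = re.compile(r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3})')

    def histogram(logfile):
        by_status = Counter()
        by_path = Counter()
        with open(logfile) as f:
            for line in f:
                m = LOG_LINE.search(line)
                if not m:
                    continue
                by_status[m.group("status")] += 1
                path = m.group("path")
                if path.startswith("/file/"):
                    path = "/file/<sha1>"   # collapse the sha1-addressed file hits
                by_path[path] += 1
        return by_status, by_path

    if __name__ == "__main__":
        # assumed location of an anonymized access log; adjust to the real setup
        status, paths = histogram("/var/log/apache2/snapshot-access.log")
        print(status.most_common())
        print(paths.most_common(20))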
17:30:07 (compared to mirrors, ftpmasters and archive for instance)
17:30:07 Although snapshot could potentially lead the way.
17:30:14 and there are no cycles for solving it, within DSA?
17:30:16 but similar issues to a lesser degree are in many other places
17:30:34 ic
17:31:06 so, for snapshot and this here now:
17:31:07 so it'd be biting off quite a bit to try to use snapshot for making change?
17:31:21 I have the advantage of having a few more spoons and time available right now, but who knows for how long. So I've been a bit jumpy with things that kept alerting.
17:31:42 moving more system services into containers (like varnish) would help, right?
17:31:49 so if you want to do that, please do.
17:31:52 pkern[m]: yes, thanks for all the time spent on snapshot!
17:32:57 (I think haproxy is in the loop for ssl termination)
17:33:29 weasel: ok, i will set up on dev-01 and try to bog it down a little
17:33:59 any prior art within debian for running containers for real loads?
17:34:10 no, you're the first :)
17:34:19 lyknode: mirrors is a subset of dsa these days fwiw. it doesn't /need/ to be, but that's how it's ended up for a while now
17:34:19 any existing machinery for autobuilding images and so on?
17:34:39 weasel: oh, fun (looking for the right smiley, failing)
17:34:57 reproducible containers built against s.d.o. Circle closes :D
17:34:58 buildd is also a subset of DSA these days.
17:35:15 ack
17:35:18 fmoessbauer: :)
17:35:38 ln5: we played at that in the early 2010s, and then we ran out of tuits
17:35:57 i will give it a shot, pestering many of you :)
17:36:26 please lmk when you're fed up and i'll stop highlighting you
17:36:37 the more you can do without requiring DSA, the better.
17:36:39 I'll let you know
17:36:45 should we move on?
17:37:02 fine by me
17:37:04 #topic rate-limit and caching of s.d.o
17:37:22 lyknode, fmoessbauer, *: you have updates/questions/discussion points?
17:37:39 We have many use-cases where we use s.d.o to build reproducible images and containers. Main problem is currently reliability and caching.
17:38:07 We could easily put a transparent proxy in-between s.d.o and our network, but for that we need precise and caching headers.
17:38:44 that and the fact that we have too much traffic for what snapshot can handle.
17:39:01 I don't think we actually have too much traffic, with Tencent blocked. 🙃
17:39:07 files I think have a TTL of a day or a week?
17:39:14 But ICBW.
17:39:21 The redirects between /archive and /file are also problematic, as the redirect itself is expensive for sdo but not well cached.
17:39:39 @weasel This was reduced to 10m by @pkern[m] to take load off varnish
17:40:00 cache-control: max-age=86400, public
17:40:07 for redirects
17:40:09 bc varnish is expelling data while serving
17:40:09 for files: max-age=31536000,
17:40:20 I wouldn't mind caching redirects for much longer if you provide the VCL.
17:40:22 max-age 1 day is what I'm seeing at the client
17:40:40 But whatever was there before did not work in that setup.
17:40:52 that should provide "precise and caching headers" as requested by fmoessbauer, wouldn't it?
17:40:52 @weasel 1d: that's my change from last week.
17:40:53 pkern[m]: that can be done in the webapp
17:41:25 lyknode: In this Rube Goldberg machine the ttl in varnish is explicitly set for some reason.
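The "that can be done in the webapp" remark above refers to setting the Cache-Control headers from the application itself. Below is a hypothetical Flask-style sketch of that idea, reusing the values quoted in the discussion (86400 for the 302 redirects, 31536000 for the sha1-addressed files). It is not the real snapshot webapp code, and lookup_sha1 is a stand-in for the database lookup.

    from flask import Flask, redirect

    app = Flask(__name__)

    REDIRECT_MAX_AGE = 86400    # 1 day for the 302s, as quoted above; the file
                                # responses themselves carry max-age=31536000 + ETag

    def lookup_sha1(archive, timestamp, filename):
        # stand-in for the real database lookup (the expensive part of a request)
        raise NotImplementedError

    @app.route("/archive/<archive>/<timestamp>/<path:filename>")
    def archive_redirect(archive, timestamp, filename):
        sha1 = lookup_sha1(archive, timestamp, filename)
        resp = redirect(f"/file/{sha1}", code=302)
        # expensive to compute, cheap to cache: mark the redirect itself cacheable
        resp.headers["Cache-Control"] = f"max-age={REDIRECT_MAX_AGE}, public"
        return resp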
17:41:28 here's what i have jotted down about varnish:
17:41:34 - [ ] varnish settings, keep and explain or ditch:
17:41:34 - [ ] "weasel's rule"
17:41:34 - [ ] 1w TTL
17:41:34 - [ ] max 1M
17:41:37 - [ ] no keepalive (set resp.http.connection = "close")
17:41:40 fmoessbauer: what was it before?
17:41:48 @weasel 10m
17:41:53 I think the TTL only refers to the varnish cache, not the http cache headers.
17:42:26 fmoessbauer: I thought that for files it was much much more than that when I built it
17:42:45 So the varnish cache could stay low, but we can increase the max-age from the webapp on all replies to allow clients to cache themselves
17:42:47 Correct. The TTL there is only varnish cache internal.
17:42:51 @weasel for files it is super-long +ETag, but that does not help much as the expensive redirect was barely cached
17:43:11 My hope is that varnish can already reduce the load on the DB by caching nearly all of the 302 redirects.
17:43:42 Given that most files are frequently accessed - I don't know if that is the case, though.
17:43:44 right, varnish shouldn't cache the files, but it should cache the redirect. and even the redirect can be cached for a longish time
17:43:54 The redirect has "cache-control: max-age=86400, public"
17:44:08 and hopefully the cache survives reboots
17:44:16 ln5: It doesn't even survive restarts.
17:44:20 the cache is in memory only
17:44:21 right
17:44:23 There's no backend in open source varnish that does.
17:44:28 ugh
17:44:41 is it the right tool?
17:45:02 the question is also: is there a lot of caching possible?
17:45:17 https://salsa.debian.org/snapshot-team/snapshot/-/merge_requests/23#note_547040
17:45:17 Right now I'd make the argument that Fastly will now offload some of this. We can argue if we need to run our own attempts at caching. :>
17:45:27 my guess is that the redirect from timestamp/path to the sha1-indexed file can be cached
17:45:47 So the issue I was facing was that Varnish didn't find space in the cache, presumably because it also cached files, which it shouldn't.
17:45:50 lyknode: and that that's relevant for most requests
17:45:51 @lyknode the stats are from the apache, so we don't see if varnish caught most of it.
17:46:02 If we tell it to only cache redirects, then caching for a day should be fine. (Gut feeling.)
17:46:24 fmoessbauer: ah! true that.
17:46:27 it'd be great to have some data
17:46:44 basically, varnish should cache everything based on the cache-control headers from the backend, except file data for static things on disk (the files/ tree)
17:46:56 @pkern[m] IMHO we should ONLY cache the redirects as these are expensive to compute but cheap to cache.
17:47:02 (I'm stating desired behaviour, not what I think it does currently)
17:48:15 I suspect that the "varnish internal redirects" topic also folds into this one, right?
17:48:24 or is that a dedicated issue?
17:48:31 Dedicated, I think.
17:48:35 ok
17:48:43 Yes, but that's more complicated as then varnish also needs to cache the big data :/
17:48:53 fmoessbauer: Why?
17:48:47 I can't do a full page screenshot or print of the Fastly interface, sigh. The graphs only work when they are on screen.
17:49:14 Because it internally restarts the request?
17:49:23 That depends on the internals of varnish - which I don't know.
17:49:26 we're running out of time soonish
17:49:40 But then everything needs to run through varnish. Needs to be investigated
17:50:23 Anyways, I would like to give the internal redirect thingy a try. Then we know more.
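One low-effort way to get at the "some data" asked for above is to look at the 302 from the client side without following it and see which cache headers come back. A small sketch follows; the URL is only an example (the timestamp and package version are made up), and which diagnostic headers appear (Age, X-Cache, ...) depends on what sits in front of the backend.

    import requests

    # example URL, not guaranteed to exist; any /archive/... path that 302s to /file/<sha1> will do
    url = ("https://snapshot.debian.org/archive/debian/"
           "20241101T000000Z/pool/main/h/hello/hello_2.10-3_amd64.deb")

    r = requests.get(url, allow_redirects=False, timeout=30)
    print(r.status_code, r.headers.get("Location"))
    for name in ("Cache-Control", "Age", "X-Cache", "X-Varnish", "Retry-After"):
        if name in r.headers:
            print(f"{name}: {r.headers[name]}")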
17:50:42 attempt at a summary: varnish doesn't always do the thing we'd consider smart here. we should play with its config.
17:51:11 Or have something else cache stuff. :/
17:51:21 agree. And we need to set the app cache headers much longer.
17:51:22 is that a (significantly simplified) version that is correct?
17:51:40 #agreed varnish doesn't always do the thing we'd consider smart here
17:51:48 i have opinions. should i pursue them or is CDN a given?
17:51:59 and that we could bump some other max-age in the webapp.
17:52:03 #action try to update the varnish config, evaluate alternatives
17:52:14 #action evaluate backend caching header times
17:52:34 #topic varnish internal redirects
17:52:40 + report correct retry-after headers
17:52:54 anything on internal stuff real quick?
17:53:19 Would cut number of http requests in half. Nothing else ;)
17:53:36 ok, great. thanks :)
17:53:43 #info Would cut number of http requests in half. Nothing else
17:53:44 I put what fmoessbauer had onto snapshot-lw07, I think. When I merged the change I hit Bad Request immediately, but couldn't investigate.
17:53:55 Assuming that's the change we're talking about.
17:54:15 @pkern[m]. Yes. Possible reason and solution stated in the mail.
17:54:21 #topic any other business,
17:54:26 fmoessbauer: Right now I can't reproduce on lw07.
17:54:44 Oh no... a heisenbug.
17:54:47 anything else?
17:54:58 yeah, but maybe for next time
17:55:14 we still need to address the rate-limiting
17:55:30 #topic rate-limiting
17:55:36 3 minutes
17:55:58 (although, looking at mlm-01, the load has significantly dropped)
17:56:11 my MR to let apt respect the retry-after headers was accepted today. THANKS! https://salsa.debian.org/apt-team/apt/-/merge_requests/383
17:56:23 yay
17:56:42 so, rate-limit feedback next time?
17:56:49 Anyways, we also need to put meaningful values in the headers
17:57:00 pkern[m]: ln5: with fastly and the big spammer AS block, are we looking better now?
17:57:20 I backed some rate limiting out of Varnish today. I confirmed that the netfilter stuff we have no longer rate limits. And everything is behind Fastly.
17:57:43 My feeling is that right now we're good, but it'd probably be good for someone to double check that we don't do too many 503s or something.
17:57:54 ok, I'd like to close this. we can continue some discussions after the meeting, but we are already 8 minutes over
17:57:57 #topic close
17:58:02 next meeting 2024-12-16T1700Z?
17:58:04 from before:
17:58:04 #action ln5 to evaluate caching layer in containers
17:58:05 We still had spikes: https://munin.debian.org/debian.org/snapshot-mlm-01.debian.org/apache_servers.html
17:58:08 proposed next meeting: december 16, same time
17:58:09 But I think that's fine.
17:58:32 weasel: good date and time
17:58:35 #agreed next meeting: 2024-12-16 1700Z
17:58:35 https://volatile.noreply.org/2024-11-18-lyA0tkkunHc/34945F9C-6A1E-4837-8C85-01DD9B40FCF6.ics
17:58:45 I guess we need to monitor. And people need to speak up if there's something to fix. :)
17:58:56 #endmeeting
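For context on the Retry-After discussion above (the apt merge request adds header support on the apt side, and the backend still needs to send meaningful values): a client-side retry loop that honours Retry-After on 503 looks roughly like the sketch below. It is purely illustrative, neither snapshot nor apt code.

    import time
    import requests

    def fetch_with_retry(url, attempts=5, default_delay=10):
        for _ in range(attempts):
            r = requests.get(url, timeout=30)
            if r.status_code != 503:
                r.raise_for_status()
                return r
            # Retry-After may be absent or an HTTP date; fall back to a default delay
            try:
                delay = int(r.headers.get("Retry-After", default_delay))
            except ValueError:
                delay = default_delay
            time.sleep(delay)
        raise RuntimeError(f"gave up on {url} after {attempts} attempts")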