17:00:13 <weasel> #startmeeting snapshot 20241118
17:00:13 <MeetBot> Meeting started Mon Nov 18 17:00:13 2024 UTC.  The chair is weasel. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:13 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:00:24 <weasel> who's around?
17:00:25 <ln5> o/
17:00:28 <axhn> *wave*
17:00:31 <lyknode> hey
17:00:32 <pkern[m]> o/
17:00:45 <fmoessbauer> Hey!
17:00:50 <weasel> wow.  full room, thanks everyone!
17:01:05 <weasel> ln5 asked me to chair, so I'll do what I can do
17:01:19 <weasel> #topic agenda
17:02:06 <weasel> I propose the following agenda items:  * status update, * roles and responsibilities (re email), * any other business, * close
17:02:12 <fmoessbauer> rate-limit and caching of s.d.o
17:02:16 <weasel> anything we should put on the agenda at this time?
17:02:18 <weasel> noted.
17:02:28 <axhn> re-import of ports
17:02:34 <ln5> roles and responsibilities
17:02:37 <weasel> axhn: that was in status update :)
17:02:40 <weasel> ln5: already on the list
17:02:45 <ln5> oh, tnx
17:02:49 <fmoessbauer> varnish internal redirects to farm
17:02:50 <axhn> weasel: fine for me :)
17:03:04 <ln5> (i'm less brainy with this fever running. thanks for chairing.)
17:03:29 <weasel> * status update,
17:03:29 <weasel> * roles and responsibilities (re email),
17:03:29 <weasel> * rate-limit and caching of s.d.o
17:03:29 <weasel> * varnish internal redirects
17:03:29 <weasel> * any other business,
17:03:30 <weasel> * close
17:03:36 <weasel> #topic status updates
17:03:56 <weasel> axhn: let's start with the -ports re-import, shall we?  you have the floor
17:04:02 <axhn> Oh well
17:04:31 <axhn> Status is, I finally found some time, and learned I have to re-compute all the files' sha1sums
17:04:53 <axhn> Doing this might take years, I found some shortcuts, will still take weeks.
17:05:20 <axhn> weasel will eventually get some files, you showed a blueprint in June or so
17:05:32 <pkern[m]> What do you do with this information? Could you read every file once, compute and figure out if it needs to be sent off and send off the bytes? :)
17:05:37 <weasel> how come it'd take so long?  any clever insight we might arrive at?
17:06:10 <axhn> It's zfs snapshots on spinning rust, so most of the time it's waiting for the disks.
17:06:36 <axhn> Gimme 16 Tbyte NVMe, but transferring the data takes some time as well
17:06:53 <weasel> if you computed the checksum of a file on one snapshot, can you easily tell if it's the same file in the next snapshot?  mtime/ctime/filesize/inode?
17:07:06 <axhn> Yeah, that's the shortcut
17:07:22 <weasel> that's what a snapshot import run does as well, iirc
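The mtime/size/inode shortcut discussed above could be sketched like this (a hypothetical illustration of the idea, not axhn's actual script or the snapshot importer):

```python
import hashlib
import os

def needs_rehash(path, cache):
    """Return True if `path` must be re-hashed. `cache` maps path ->
    (st_ino, st_size, st_mtime_ns, sha1) from the previous snapshot's run;
    an unchanged inode/size/mtime triple lets us skip reading the file."""
    st = os.stat(path)
    entry = cache.get(path)
    return entry is None or entry[:3] != (st.st_ino, st.st_size, st.st_mtime_ns)

def sha1_of(path, cache):
    """Return the sha1 of `path`, reusing the cached digest when possible."""
    if not needs_rehash(path, cache):
        return cache[path][3]
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    st = os.stat(path)
    cache[path] = (st.st_ino, st.st_size, st.st_mtime_ns, h.hexdigest())
    return cache[path][3]
```

On spinning rust this avoids re-reading file contents for the vast majority of files that are unchanged between adjacent zfs snapshots; only the stat metadata is touched.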
17:08:18 <axhn> There are issues where the hashsum in a Packages file does not match a .deb found on the disk. I will possibly ignore the former; finding the cause for this might be difficult.
17:08:36 <axhn> Anyway, not much more to say.
17:08:56 <weasel> that.. seems weird, but possible.
17:08:59 <weasel> thanks for the update.
17:09:20 <axhn> yw
17:09:26 <ln5> another server with the specs of -mlm-01 is in the pipe, no ETA yet
17:09:34 <weasel> great, thanks.
17:09:44 <weasel> what else is new or had action items recently, and doesn't have its own point later on?
17:10:02 <weasel> if nothing comes up, we can re-raise that question at Any-Other-Business.
17:10:11 <weasel> going once,
17:10:16 <weasel> twice,
17:10:19 <lyknode> small point
17:10:20 <weasel> #topic roles and responsibilities (re email),
17:10:27 <weasel> lyknode: go
17:10:27 <lyknode> later then ;)
17:10:50 <weasel> #topic status updates, redux
17:11:16 <lyknode> Just to say I've merged a couple of MRs from fmoessbauer, mainly regarding max-age of redirects
17:11:28 <lyknode> That would improve caching on the client side
17:11:38 <fmoessbauer> Thanks for the review and quick integration!
17:11:50 <weasel> sounds good.  seems like we can get into details at rate-limit and caching?
17:12:15 <lyknode> I think we could go a step further and maybe set a high max-age for all replies?
17:12:21 <lyknode> yeah sure.
17:12:29 <weasel> #topic roles and responsibilities, rly now
17:12:36 <weasel> ln5: do you want to summarize?
17:12:41 <ln5> let me try
17:13:11 <ln5> "DSA wires up the web setup to serve things and the remainder is on the service owner" is the current description of how the responsibilities fall between DSA and snapshot service owners
17:13:24 <ln5> taken from email from today (thanks pkern[m])
17:13:33 <ln5> is there more, from history?
17:13:41 <pkern[m]> (Not speaking in an official capacity, inferring more from other services and history.)
17:14:03 <pkern[m]> Question is also: Who is on the hook for keeping serving lights on? DSA does some monitoring, is DSA on the hook to recover things? Probably?
17:14:10 <fmoessbauer> Is there an "org-chart" or similar showing the infrastructure and responsible teams / persons?
17:14:16 <weasel> ln5: it follows from the "only DSA has uid0" stuff, because they, then we, didn't want people to randomly change stuff
17:14:27 <ln5> bc i've had trouble understanding what i can and should do, as well as who i should ask for help, and such
17:14:59 <weasel> so dsa did all the things that required root, and gave some role account to the people operating a given service.
17:15:09 <ln5> weasel: yes, but things are intertwined and DSA might need help with some things here
17:15:27 <weasel> that worked ok-ish 10 years ago, and with something as complex as snapshot it doesn't really work all that well and it also doesn't scale
17:15:35 <weasel> but that's where things come from
17:15:46 <ln5> , like the web app in recent rate-limiting things
17:16:26 <ln5> so will it work now? with me not being DSA while weasel was
17:16:30 <ln5> i mean is
17:16:30 <pkern[m]> DSA is kinda limited on attention and time (and spoons). Keeping the lights on.
17:16:42 <pkern[m]> Looking at it from the outside. 🙃
17:16:49 <weasel> the lights are flickering
17:17:05 <ln5> can we do something about it, like run more containers? as been suggested
17:17:27 <weasel> moving more stuff into containers is one way to work around requiring DSA for all things
17:17:41 <weasel> but you mentioned at some point that containers come with their own issues
17:17:46 <pkern[m]> I think eventually we need a good way of running containers, including knowing what went into them and regular rebuilds. (Lest they get stale and DSA sad.)
17:17:56 <weasel> right
17:18:10 <weasel> containers done right would auto-build regularly and be orchestrated in a sane manner
17:18:20 <pkern[m]> But the other question is if they'd be directly serving traffic or if there's something in front. That said, the current setup with haproxy/varnish/apache2 is a tad too complex for my taste. :>
17:18:48 <weasel> but for us to know if it's even worth doing it, we'd need a use-case where it actually works
17:18:51 <pkern[m]> We already struggle with some services getting no attention. Snapshot is the counterexample though, which is nice!
17:19:11 <ln5> weasel: use case outside snapshot?
17:19:14 <weasel> mainly due to new momentum.  it was not getting much attention for a long time, which is why it's in this state
17:19:28 <weasel> ln5: no, snapshot working with containers is a use-case in my book.
17:19:37 <weasel> but the caching stuff could be in a container as well
17:19:41 <weasel> the DB also, maybe.
17:19:52 <ln5> that's what i've been doing on -dev-01
17:19:55 <weasel> \o/
17:19:58 <fmoessbauer> It's also about having a clear picture of the whole chain from DNS to the flask app
17:20:06 <ln5> db, apache, snapshot in separate containers
17:20:09 <ln5> but no caching yet
17:20:19 <pkern[m]> fmoessbauer: Admittedly that's something Someone(TM) could Just(TM) document. :>
17:20:32 <weasel> if prod had the caching in a container, would that help you?
17:20:36 <weasel> and if yes, why don't we? :)
17:21:10 <ln5> if it would give me read access to logs and write access to config and HUP powers, it would have helped
17:21:23 <ln5> well, i should not see user data. i think.
17:21:30 <weasel> I think it would give you these things
17:21:42 <ln5> i mean, i think DSA thinks i should not see user data
17:21:55 <ln5> i'd love to not see it! to be clear :)
17:22:00 <weasel> we have given service owner read to the weblogs.  granted, we have anonymized them for years, but you can already do netstat/ss
17:22:53 <fmoessbauer> does service owner of s.d.o start at the apache server? Or does it include the varnish as well?
17:23:07 <weasel> undefined
17:23:10 <ln5> and haproxy?
17:23:17 <lyknode> currently neither
17:23:21 <ln5> def not netfilter iiuc
17:23:21 <weasel> certainly the service owner can't mess with varnish, haproxy and apache without DSA
17:23:26 <pkern[m]> We could redact the logs to 127.0.0.1 as well, that'd be easy. But it also makes DSA's life hard. So we are not happy with the current anonymization we have in some places either.
17:23:37 <pkern[m]> weasel: Well, apache has the reload sudoers thing in theory.
17:23:52 <weasel> right
17:24:15 <ln5> i think this might have worked better with a service owner who was also DSA
17:24:22 <weasel> I personally don't have any issues with moving more services into service-owner controlled containers
17:24:29 <ln5> does this mean we should design in another way?
17:24:32 <pkern[m]> ln5: That is true in many places, unfortunately.
17:24:38 <ln5> ah, ic
17:24:49 <weasel> it was also hard to find other people who cared about snapshot for ages :)
17:24:52 <pkern[m]> I could probably take an AI to export an anonymized log into a different directory using varnishncsa.
17:24:58 <pkern[m]> If that helps.
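The varnishncsa export pkern mentions could look roughly like this (a deployment sketch only; the output path is an assumption, and the format string simply replaces the client address with `-` while keeping the rest of the NCSA combined format):

```shell
# Write an anonymized access log alongside the normal one.
# The leading "-" stands in for %h (the client address).
varnishncsa -D -w /var/log/varnish/anon.log \
    -F '- %l %u %t "%r" %s %b "%{Referer}i" "%{User-agent}i"'
```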
17:25:08 <pkern[m]> But I'm not sure if that's actually a solution to the problem you are having.
17:25:17 <ln5> what is an AI?
17:25:19 <fmoessbauer> These logs are pretty structured. Having histograms would already help a lot
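As a first cut at such histograms, something like this could be run over an (anonymized) access log (a hypothetical helper, not existing snapshot tooling; it assumes NCSA combined-format lines, where the status code follows the quoted request):

```python
import collections
import re

# Status code sits right after the closing quote of the request:
#   ... "GET /archive/... HTTP/1.1" 302 0 ...
STATUS_RE = re.compile(r'" (\d{3}) ')

def status_histogram(lines):
    """Count HTTP status codes in access-log lines: a cheap first look
    at how much traffic reaches a tier and what it answers with."""
    hist = collections.Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            hist[m.group(1)] += 1
    return hist
```

Running the same script against both the varnish and the apache logs would show how much the cache actually absorbs.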
17:25:34 <weasel> if the problem is no logs, we should give you read access.  but I don't have the impression that's the whole of it
17:25:47 <ln5> that's only part of it, true
17:26:01 <ln5> need to be able to try something, look at logs, back out, and so on
17:26:15 <weasel> containers?
17:26:19 <lyknode> but it's an essential part IMHO; that's how we can understand traffic for snapshot
17:26:35 <fmoessbauer> It's also about statistics on how much is cached where - iow how much traffic hits the backend.
17:26:47 <ln5> yes
17:27:11 <ln5> so i think pkern[m] is on to something in recent email to list, with "more metrics needed"
17:27:39 <weasel> that has also been a truism for the last 15 years, ever since munin stopped scaling when we hit 15 hosts :)
17:27:46 <ln5> is there a prom server and grafana foo around already?
17:28:01 <weasel> not yet.  don't hold your breath
17:28:04 <ln5> ack
17:28:18 <ln5> should i try moving such things forward?
17:28:19 <pkern[m]> fmoessbauer: It's not clear to me if "how much is cached where" is necessarily relevant. What's relevant is to serve whatever gets through to the webapp and provide hints to the infrastructure. I imagine (but that's conjecture) that there's a small set of frequently requested stuff and everything else are one-offs.
17:28:36 <weasel> ln5: I'm sure help would be appreciated.  but DSA cycles are sparse.
17:28:53 <ln5> weasel: can it be done outside of DSA?
17:28:56 <fmoessbauer> @pkern[m] -> caching topic. But IMHO that is quite relevant to find the bottlenecks
17:29:06 <weasel> ln5: unclear
17:29:37 <ln5> so what we're seeing here wrt roles and responsibilities and power is not unique to snapshot?
17:29:44 <ln5> just doublechecking
17:29:56 <pkern[m]> Correct.
17:30:06 <weasel> snapshot is more special than most because you can't really run it with a single role account.
17:30:07 <lyknode> (compared to mirrors, ftpmasters and archive for instance)
17:30:07 <pkern[m]> Although snapshot could potentially lead the way.
17:30:14 <ln5> and there are no cycles for solving it, within DSA?
17:30:16 <weasel> but similar issues to a lesser degree are in many other places
17:30:34 <ln5> ic
17:31:06 <weasel> so, for snapshot and this here now:
17:31:07 <ln5> so it'd be biting off quite a bit to try to use snapshot for making change?
17:31:21 <pkern[m]> I have the advantage of having a few more spoons and time available right now, but who knows for how long. So I've been a bit jumpy with things that kept alerting.
17:31:42 <weasel> moving more system services into containers (like varnish) would help, right?
17:31:49 <weasel> so if you want to do that, please do.
17:31:52 <ln5> pkern[m]: yes, thanks for all the time spent on snapshot!
17:32:57 <weasel> (I think haproxy is in the loop for ssl termination)
17:33:29 <ln5> weasel: ok, i will set up on dev-01 and try to bog it down a little
17:33:59 <ln5> any prior art within debian for running containers for real loads?
17:34:10 <weasel> no, you're the first :)
17:34:19 <adsb> lyknode: mirrors is a subset of dsa these days fwiw. it doesn't /need/ to be, but that's how it's ended up for a while now
17:34:19 <ln5> any existing machinery for autobuilding images and so on?
17:34:39 <ln5> weasel: oh, fun (looking for the right smiley, failing)
17:34:57 <fmoessbauer> reproducible containers built against s.d.o. Circle closes :D
17:34:58 <pkern[m]> buildd is also a subset of DSA these days.
17:35:15 <lyknode> ack
17:35:18 <ln5> fmoessbauer: :)
17:35:38 <weasel> ln5: we played at that in the early 2010s, and then we ran out of tuits
17:35:57 <ln5> i will give it a shot, pestering many of you :)
17:36:26 <ln5> please lmk when you're fed up and i'll stop highlighting you
17:36:37 <weasel> the more you can do without requiring DSA, the better.
17:36:39 <weasel> I'll let you know
17:36:45 <weasel> should we move on?
17:37:02 <ln5> fine by me
17:37:04 <weasel> #topic rate-limit and caching of s.d.o
17:37:22 <weasel> lyknode, fmoessbauer, *: you have updates/questions/discussion points?
17:37:39 <fmoessbauer> We have many use-cases where we use s.d.o to build reproducible images and containers. Main problem is currently reliability and caching.
17:38:07 <fmoessbauer> We could easily put a transparent proxy in-between s.d.o and our network, but for that we need precise caching headers.
17:38:44 <lyknode> that and the fact that we have too much traffic for what snapshot can handle.
17:39:01 <pkern[m]> I don't think we actually have too much traffic, with Tencent blocked. 🙃
17:39:07 <weasel> files I think have a TTL of a day or a week?
17:39:14 <pkern[m]> But ICBW.
17:39:21 <fmoessbauer> The redirects between /archive and /file are also problematic, as the redirect itself is expensive for sdo but not well cached.
17:39:39 <fmoessbauer> @weasel This was reduced to 10m by @pkern[m] to put load off varnish
17:40:00 <weasel> cache-control: max-age=86400, public
17:40:07 <lyknode> for redirects
17:40:09 <ln5> bc varnish is expelling data while serving
17:40:09 <lyknode> for files: max-age=31536000,
17:40:20 <pkern[m]> I wouldn't mind to cache redirects for much longer if you provide the VCL.
17:40:22 <weasel> max-age 1 day is what I'm seeing at the client
17:40:40 <pkern[m]> But whatever was there before did not work in that setup.
17:40:52 <weasel> that should provide "precise caching headers" as requested by fmoessbauer, wouldn't it?
17:40:52 <fmoessbauer> @weasel 1d: that's my change from last week.
17:40:53 <lyknode> pkern[m]: that can be done in the webapp
17:41:25 <pkern[m]> lyknode: In this Rube Goldberg machine the ttl in varnish is explicitly set for some reason.
17:41:28 <ln5> here's what i have jotted down about varnish:
17:41:34 <ln5> - [ ] varnish settings, keep and explain or ditch:
17:41:34 <ln5> - [ ] "weasel's rule"
17:41:34 <ln5> - [ ] 1w TTL
17:41:34 <ln5> - [ ] max 1M
17:41:37 <ln5> - [ ] no keepalive (set resp.http.connection = "close")
17:41:40 <weasel> fmoessbauer: what was it before?
17:41:48 <fmoessbauer> @weasel 10m
17:41:53 <lyknode> I think the TTL only refer to varnish cache, not the http cache headers.
17:42:26 <weasel> fmoessbauer: I thought that for files it was much much more than that when I built it
17:42:45 <lyknode> So the varnish cache could stay low, but we can increase the max-age from the webapp on all replies to allow clients to cache themselves
17:42:47 <pkern[m]> Correct. The TTL there is only varnish cache internal.
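Raising the client-facing max-age from the webapp could be as simple as centralizing the policies under discussion (a sketch using the values mentioned in this meeting; not the actual snapshot webapp code, and the `dynamic` tier is an assumption):

```python
# Cache-Control policy per response kind, as discussed:
#  - /file/<sha1> responses are content-addressed and never change,
#  - /archive -> /file 302 redirects are expensive to compute but stable,
#  - other dynamic pages get a modest lifetime.
def cache_control(kind):
    policies = {
        "file": "max-age=31536000, immutable, public",
        "redirect": "max-age=86400, public",
        "dynamic": "max-age=600, public",
    }
    return policies[kind]
```

The varnish-internal TTL can then stay low independently of what clients and intermediate proxies are told to cache.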
17:42:51 <fmoessbauer> @weasel for files it is super-long +ETag, but that does not help much as the expensive redirect was barely cached
17:43:11 <fmoessbauer> My hope is that varnish can already reduce the load on the DB by caching nearly all of the 302 redirects.
17:43:42 <fmoessbauer> Given that most files are frequently accessed - I don't know if that is the case, though.
17:43:44 <weasel> right, varnish shouldn't cache the files, but it should cache the redirect.  and even the redirect can be cached for a longish time
17:43:54 <pkern[m]> The redirect has "cache-control: max-age=86400, public"
17:44:08 <ln5> and hopefully the cache survives reboots
17:44:16 <pkern[m]> ln5: It doesn't even survive restarts.
17:44:20 <weasel> the cache is in memory only
17:44:21 <ln5> right
17:44:23 <pkern[m]> There's no backend in open source varnish that does.
17:44:28 <ln5> ugh
17:44:41 <ln5> is it the right tool?
17:45:02 <lyknode> the question is also: is there a lot of caching possible?
17:45:17 <lyknode> https://salsa.debian.org/snapshot-team/snapshot/-/merge_requests/23#note_547040
17:45:17 <pkern[m]> Right now I'd make the argument that Fastly will now offload some of this. We can argue if we need to run our own attempts at caching. :>
17:45:27 <weasel> my guess is that the redirect from timestamp/path to the sha1-indexed file can be cached
17:45:47 <pkern[m]> So the issue I was facing was that Varnish didn't find space in the cache, presumably because it also cached files, which it shouldn't.
17:45:50 <weasel> lyknode: and that that's relevant for most requests
17:45:51 <fmoessbauer> @lyknode the stats are from the apache, so we don't see if varnish caught most of it.
17:46:02 <pkern[m]> If we tell it to only cache redirects, then caching for a day should be fine. (Gut feeling.)
17:46:24 <lyknode> fmoessbauer: ah! true that.
17:46:27 <ln5> it'd be great with some data
17:46:44 <weasel> basically, varnish should cache everything based on the cache-control headers from the backend, except file data for static things on disk (the files/ tree)
17:46:56 <fmoessbauer> @pkern[m] IMHO we should ONLY cache the redirects as these are expensive to compute but cheap to cache.
17:47:02 <weasel> (I'm stating desired behaviour, not what I think it does currently)
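The desired behaviour weasel describes might look roughly like this in VCL (an untested sketch assuming VCL 4 and that the static tree lives under `/file/`; not the current production config):

```vcl
sub vcl_backend_response {
    if (beresp.status == 302) {
        # The /archive -> /file redirects are expensive to compute
        # but cheap to cache: keep them for a day.
        set beresp.ttl = 1d;
    } elsif (bereq.url ~ "^/file/") {
        # Static file bodies come straight off disk; don't let them
        # evict the redirects from the cache.
        set beresp.uncacheable = true;
        return (deliver);
    }
}
```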
17:48:15 <weasel> I suspect that the "varnish internal redirects" topic also folds into this one, right?
17:48:24 <weasel> or is that a dedicated issue?
17:48:31 <pkern[m]> Dedicated, I think.
17:48:35 <weasel> ok
17:48:43 <fmoessbauer> Yes, but that's more complicated as then varnish also needs to cache the big data :/
17:48:47 <pkern[m]> I can't do a full page screenshot or print of the Fastly interface, sigh. The graphs only work when they are on screen.
17:48:53 <pkern[m]> fmoessbauer: Why?
17:49:14 <pkern[m]> Because it internally restarts the request?
17:49:23 <fmoessbauer> That depends on the internals of varnish - which I don't know.
17:49:26 <weasel> we're running out of time soonish
17:49:40 <fmoessbauer> But then everything needs to run through varnish. Needs to be investigated
17:50:23 <fmoessbauer> Anyways, I would like to give the internal redirect thingy a try. Then we know more.
17:50:42 <weasel> attempt at a summary:  varnish doesn't always do the thing we'd consider smart here.  we should play with its config.
17:51:11 <pkern[m]> Or have something else cache stuff. :/
17:51:21 <fmoessbauer> agree. And we need to set the app cache headers much longer.
17:51:22 <weasel> is that a (significantly simplified) version that is correct?
17:51:40 <weasel> #agreed varnish doesn't always do the thing we'd consider smart here
17:51:48 <ln5> i have opinions. should i pursue them or is CDN a given?
17:51:59 <lyknode> and that we could bump some other max-age in the webapp.
17:52:03 <weasel> #action try to update the varnish config, evaluate alternatives
17:52:14 <weasel> #action evaluate backend caching header times
17:52:34 <weasel> #topic varnish internal redirects
17:52:40 <fmoessbauer> + report correct retry-after headers
17:52:54 <weasel> anything on internal stuff real quick?
17:53:19 <fmoessbauer> Would cut number of http requests in half. Nothing else ;)
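The internal-redirect idea could be sketched in VCL like this (a hypothetical sketch, not the MR under discussion; it assumes the webapp's 302 Location points back at the same host's `/file/` tree and guards against restart loops):

```vcl
sub vcl_deliver {
    # Follow the webapp's 302 inside varnish instead of returning it,
    # so the client makes one request where it previously made two.
    if (resp.status == 302 && req.restarts == 0 &&
        resp.http.Location ~ "/file/") {
        set req.url = regsub(resp.http.Location, "^https?://[^/]+", "");
        return (restart);
    }
}
```

The open question raised here is whether this forces the file bodies themselves through varnish, which the caching policy above tries to avoid.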
17:53:36 <weasel> ok, great.  thanks :)
17:53:43 <weasel> #info Would cut number of http requests in half. Nothing else
17:53:44 <pkern[m]> I put what fmoessbauer had onto snapshot-lw07, I think. When I merged the change I hit Bad Request immediately, but couldn't investigate.
17:53:55 <pkern[m]> Assuming that's the change we're talking about.
17:54:15 <fmoessbauer> @pkern[m]. Yes. Possible reason and solution stated in the mail.
17:54:21 <weasel> #topic any other business,
17:54:26 <pkern[m]> fmoessbauer: Right now I can't reproduce on lw07.
17:54:44 <fmoessbauer> Oh no... a heisenbug.
17:54:47 <weasel> anything else?
17:54:58 <lyknode> yeah, but maybe for next time
17:55:14 <lyknode> we still need to address the rate-limiting
17:55:30 <weasel> #topic rate-limiting
17:55:36 <weasel> 3 minutes
17:55:58 <lyknode> (although, looking at mlm-01, the load has significantly dropped)
17:56:11 <fmoessbauer> my MR to let apt respect the retry-after headers was accepted today. THANKS!  https://salsa.debian.org/apt-team/apt/-/merge_requests/383
17:56:23 <weasel> yay
17:56:42 <weasel> so, rate-limit feedback next time?
17:56:49 <fmoessbauer> Anyways, we also need to put meaningful values in the headers
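A meaningful Retry-After could be derived from the rate limiter's window rather than hard-coded (a hypothetical sketch; `window_start` and the 60-second fixed window are assumptions, not how the current limiter works):

```python
import time

def retry_after(window_start, window_seconds=60):
    """Seconds until the current fixed rate-limit window resets.
    apt now honours Retry-After, so this tells it to back off just
    long enough instead of some arbitrary constant."""
    remaining = window_start + window_seconds - time.time()
    # Round up, and never advertise 0 or a negative delay.
    return max(1, int(remaining + 0.999))
```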
17:57:00 <lyknode> pkern[m]: ln5: with fastly and the big spammer AS block, are we looking better now?
17:57:20 <pkern[m]> I backed out some rate limiting out of Varnish today. I confirmed that the netfilter stuff we have no longer rate limits. And everything is behind Fastly.
17:57:43 <pkern[m]> My feeling is that right now we're good, but it'd probably be good for someone to double check that we don't do too many 503s or something.
17:57:54 <weasel> ok, I'd like to close this.  we can continue some discussions after the meeting, but we already are 8 minutes over
17:57:57 <weasel> #topic close
17:58:02 <ln5> next meeting 2024-12-16T1700Z?
17:58:04 <weasel> from before:
17:58:04 <weasel> #action ln5 to evaluate caching layer in containers
17:58:05 <pkern[m]> We still had spikes: https://munin.debian.org/debian.org/snapshot-mlm-01.debian.org/apache_servers.html
17:58:08 <weasel> proposed next meeting: december 16, same time
17:58:09 <pkern[m]> But I think that's fine.
17:58:32 <ln5> weasel: good date and time
17:58:35 <weasel> #agreed next meeting: 2024-12-16 1700Z
17:58:35 <weasel> https://volatile.noreply.org/2024-11-18-lyA0tkkunHc/34945F9C-6A1E-4837-8C85-01DD9B40FCF6.ics
17:58:45 <pkern[m]> I guess we need to monitor. And people need to speak up if there's something to fix. :)
17:58:56 <weasel> #endmeeting