16:58:42 #startmeeting Network team meeting, 22nd August 2022
16:58:42 Meeting started Mon Aug 22 16:58:42 2022 UTC. The chair is ahf. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:58:42 Useful Commands: #action #agreed #help #info #idea #link #topic.
16:58:44 yoyo
16:58:52 o/
16:58:55 pad is at https://pad.riseup.net/p/tor-netteam-2022.1-keep
16:58:58 o/
16:59:05 o/
16:59:33 o/
16:59:42 o/
16:59:47 o/
16:59:49 hi
16:59:59 GeKo is the c-c-c-c-combo breaker :p
17:00:06 lol
17:00:16 o/
17:00:18 okay, let's get started. dgoulet is out right now i think and nickm's still off
17:00:22 eta: you here?
17:00:27 o/
17:00:30 thanks for the ping :p
17:00:35 np!
17:00:42 * eta was busy doing a travel expenses request :p
17:00:43 let's take a look at our boards at https://gitlab.torproject.org/groups/tpo/core/-/boards
17:00:59 eta: that sounds almost as exciting as this meeting is going to be!
17:01:05 wheeee yay
17:01:57 i don't see anything off in our board. i may have some questions for you eta on onionmasq one of these days this week
17:02:02 but nothing terrible there
17:02:13 ahf: yeah I just moved a ticket to needs review :p
17:02:28 looks like the next thing I need to do is possibly dormant mode / happy eyeballs support
17:02:42 maybe dormant mode first given it has roadmap on it
17:03:03 did anyone else end up taking a bit of that in the end?
17:03:05 ya, i saw that one - i think for onionmasq we can start diving into some of the DNS stuff next whilst the android gang plays around with things on device
17:03:25 i think dormant mode may be higher priority here
17:03:45 ok, is anybody blocked on anything on their side of the board?
17:03:51 eta: Of dormant mode ?
17:03:56 Diziet: yeah
17:04:31 arti#71
17:05:06 We have much of that I think, but maybe it isn't wired into everything.
17:05:16 dormant mode is going to interact with so many components i think
17:05:17 I have a vague memory of having this conversation before and someone saying "oh but it doesn't do XYZ"
17:05:24 at least it was like that in C
17:05:27 But apparently that isn't in the ticket?
17:05:29 in the C implementation
17:05:55 the ticket doesn't have any measurement part to it, no? as in, checking that we actually don't do any work when we are not supposed to do work?
17:06:03 i think it just lists the bigger picture
17:06:24 I guess the work item there is "make an arti go dormant and check it doesn't do anything"
17:06:33 ya, or "do as little as possible" :-)
17:06:42 and none of the things we think it shouldn't be doing
17:06:42 Since apparently we don't know what it is we are missing...
17:06:52 in C tor we did a ton of profiling while this work happened
17:07:03 i think there was an entire winter where we just did profile-driven development around 2018/2019
17:07:12 It shouldn't be too hard to see if it's making syscalls etc.
17:07:14 ya
17:07:28 ok very good
17:07:47 let's skip release status as david is out. i have been going over the docs and such for it and i believe i can roll releases while he is gone
17:07:49 and nick can co-sign them
17:07:54 eta: is this the only ticket for this? Is there another one that's assigned to you or something?
17:08:08 2/3 of us are needed for these releases
17:08:29 Diziet: yeah, arti#90 is what I'm looking at
17:08:43 I'll take that to mean nobody has anything interesting to say to me about it, and I'll just go look at that this week then :p
17:08:51 eta: ahhhhh
17:09:21 GeKo: i cannot remember if i asked you about this, but core/tor#40635 isn't of very high urgency, right? but it is something we should fix in an upcoming release
17:09:36 eta: Right, that was what I was missing (and, now, I see it's linked from #71 too, so sorry)
17:09:43 i put it on icebox, but i do not remember having the conversation with you
17:10:48 ahf: i did not ask
17:10:59 it's me who is supposed to ask i think :o
17:10:59 but i would love to see a fix
17:11:10 ok, let's mark it higher than Icebox then
17:11:19 i think it would make our life easier for bad-relay work
17:11:23 in a bunch of cases
17:11:36 so, no show-stopper but...
17:11:37 when you put it in For Network Health Team, i am supposed to catch that and hear how urgent it is, and i don't think i did that
17:11:40 ya
17:11:52 oki, i will bump it up higher there - it sounds like a bug that we should be able to detect in tests too
17:12:04 yeah
17:12:05 i think that was everything we had from external teams here
17:12:19 i don't see any announcements or discussion items
17:12:26 mikeperry: you wanna dive into s61?
17:12:31 kk
17:13:26 so I ran a link-sharing sim starting thursday. it ran through the weekend (they take about 50-60% longer than other sims on the large runner)
17:13:34 https://gitlab.torproject.org/jnewsome/sponsor-61-sims/-/jobs/167572/artifacts/file/public/de/tornet.plot.pages.pdf
17:13:40 https://gitlab.torproject.org/jnewsome/sponsor-61-sims/-/jobs/167572/artifacts/file/public/hk/tornet.plot.pages.pdf
17:14:06 the main thing I noticed is that error rates for Exit streams were much, much higher than before
17:14:12 there's also more perf variance
17:14:15 what is "link sharing"? how much we co-utilize links in inter-relay comms?
17:14:42 ahf: it's meant to simulate how multiple relay instances share the same upstream, as they do in the live network
17:15:20 ah, so the number of relays that for example share the same uplink (like N relays in some german hetzner datacenter)?
17:15:22 yeah; we don't really have ground truth here. the current model is probably an over-estimate of how much is happening
17:15:25 the error rate you refer to is in the last graph?
17:16:09 micah: after the "exit transfer goodput" graphs
17:16:14 interesting work
17:16:17 oh I see there are a couple there
17:16:20 "exit transfer error rate"
17:17:13 the times are much higher too
17:17:26 the error rate for exits in the sims is much higher than the live network (blue) and the previous sim with congestion control and a "nonflooding" network model, period
17:18:37 there were also two circuits with runaway queue sizes in the first run, but none in the next two runs. the next two runs behaved similarly to previous RFC3742 sims wrt queue sizes
17:19:20 my guess is that the runaway queues are caused by a bad RTT estimate on those two circuits. but not sure.
17:19:58 ahf: I could use dgoulet's log patches to append_cell_to_circuit_queue(), which I believe are on your relays now
17:20:23 like you want to apply the patches for this work?
17:20:29 so having access to that would be good. I can add that beefed-up log to a sim run to check the queue issue in more detail if it reappears
17:20:56 could you send me an ssh key to add on the relays?
i also still need to add geko to it so he has access, but i did get keys from him
17:21:12 just ping me when you are ready with that
17:21:19 ya, i will get to it today
17:21:29 and i can try to take a look at the logs etc.
17:21:36 ack
17:21:46 ahf: ok yeah I have a few keys that are yubikeys.. the ones on gitlab. will get them to you later
17:21:59 sounds good
17:23:42 juga,geko: how is the sbws bwscanner_cc=2 going? should I take a look at that yet?
17:24:04 (also do we have a ticket for the xon/xoff -> bwfiles change and spec update?)
17:24:10 mikeperry: i think it is going fine, no need to take a look yet
17:24:41 mikeperry: i just created that issue
17:24:56 kk. no rush. this will be useful but again it's not our brightest fire, still :)
17:25:08 sbws#40144
17:25:13 ok
17:26:24 GeKo,ahf: how are overload and ram usage looking so far?
17:26:53 I saw the #tor-dev note that exits are still only at ~58% upgraded to 0.4.7.9/0.4.7.10
17:27:29 still quite high. on akka today we have seen a range of 5% to 30% dropped ntor cells
17:27:42 so the party is still on, so to speak
17:28:16 i have not looked at memory usage there. both nodes seem to be using around 5GB of resident memory and have plenty to spare
17:28:23 dgoulet also saw some cloudflare onion services still hitting the new queue limits on Friday; this probably means they also have not upgraded to 0.4.7.10
17:28:33 yes
17:28:48 does anyone know who runs the cloudflare onion service deployment? do we have any contacts there?
17:29:06 both akka and ukko are running the same tor version now though
17:29:34 i don't know if we have any contact there. it used to be an intern who built it, then someone took it over but she moved to work for apple at some point and i don't know who is running it now
17:29:43 but it's the crypto group who is responsible for it i think
17:29:48 mikeperry: i could ping nick from cloudflare
17:29:49 (longclaw ram is still 96% utilized, which is uncommonly high)
17:29:59 heh uh oh. they could stay on 0.4.7.8 forever, heh
17:30:12 so we could write to Nick Sullivan
17:30:14 although it's been a while since we had some email convo
17:30:15 (oh, but it had not updated to 4.10 yet)
17:30:27 ahf: yeah, that's the nick i meant
17:30:36 GeKo: maybe a good idea to hear from him, yeah
17:30:53 +1
17:30:55 i think the tor stuff is still on his team's plate
17:31:02 yeah
17:31:19 micah: hrm that is super weird. I actually think that might be a different kind of attack, but who knows
17:32:18 ahf: mikeperry: if you want to write that mail to him please do, otherwise i can get to that tomorrow
17:32:54 I don't have his email. but I can be on Cc to answer any questions
17:33:26 GeKo: you don't have a relationship with him already from the nethealth side of things?
17:33:44 just tor browser and cloudflare onions
17:33:59 so, no
17:34:32 ok - i think it is fine if you write it - i have met him once and did the haproxy stuff for his intern back then, but that is about it
17:34:43 i had to look his name up on twitter :o
17:35:41 GeKo: any other new network-health things you have noticed or want to discuss?
17:36:36 at some point if these exits ever upgrade, we can try lowering that queue limit further.
according to the sims, we can safely go down from 4k to 2k, but we need onions and exits to upgrade for that one
17:36:50 nothing from my side for today
17:37:04 this should reduce memory pressure at guards, but won't change that ntor overload situation
17:37:44 what is the minimum version people need to allow for that queue limit reduction?
17:38:01 0.4.7.9 or 0.4.7.10
17:38:12 0.4.7.9 has the fix, but has a busted geoip file
17:38:45 the fix is https://gitlab.torproject.org/tpo/core/tor/-/issues/40642
17:39:00 but there is also a fix to DESTROY cell handling for all relays that will also reduce queue pressure
17:39:17 thx
17:40:06 anything else we have for today?
17:40:08 that DESTROY fix is https://gitlab.torproject.org/tpo/core/tor/-/issues/40623, and I believe it was backported
17:40:49 err at least I hope it was backported
17:40:57 I am still not sure how to check that in gitlab
17:41:20 hm, i don't think gitlab shows that - only the git branches reveal that
17:42:45 aha yes I see it in maint-0.4.5
17:43:01 looks like the DESTROY fix should be in 0.4.5.14
17:43:13 yeah, but not in release-0.4.5, so it looks like it comes in the next 0.4.5 release
17:43:40 oh ok. so yeah i guess remembering to do that falls on you ahf
17:44:05 that one does matter for this memory situation. I guess you can still check with dgoulet before he leaves
17:44:36 which commit are you looking at in -maint ?
17:44:59 there are three
17:45:01 6fcae8e0d080d7d0875eab4a0118e8fdaf5e832c
17:45:11 dc13936f20e6263a099f40d32a274847e8384f96
17:45:18 8d8afc4efa538682ef2b80f6664456b34b84e519
17:45:21 i think they are up-to-date with each other, so i think they did come out with a release around 12/8 2022
17:45:21 in maint-0.4.5
17:46:04 ya, they are in the release branch too
17:46:11 so i think they did arrive with the recent release
17:46:28 ok great
17:46:47 yeah, for 0.4.5.14
17:47:02 https://gitlab.torproject.org/tpo/core/tor/-/commits/tor-0.4.5.14/
17:47:20 ok that was good
17:47:33 anything else we need to do now?
17:48:17 I think that's it for s61. I will follow up with keys
17:48:22 sounds good
17:48:26 thanks all o/
17:48:29 #endmeeting
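
Note on the dormant-mode check discussed around 17:06-17:07 ("make an arti go dormant and check it doesn't do anything"): one rough way to see whether a supposedly dormant arti process is still doing work is to attach strace and count its syscalls for a while. A minimal sketch, assuming the pid of a running arti client is known (<arti-pid> is a placeholder, and attaching may require root or ptrace permission):

    # Count syscalls for 60 seconds while the client is meant to be dormant;
    # a truly idle process should show little beyond occasional timer/epoll activity.
    timeout 60 strace -c -f -p <arti-pid>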
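
Note on the overload discussion around 17:26-17:28 (5% to 30% dropped ntor cells on akka): C tor reports circuit-handshake drops in its notice-level heartbeat, so figures like these can be spot-checked from a relay's own logs. A minimal sketch, assuming the relay writes notices to a log file (the path here is an assumption and depends on the torrc):

    # Show the most recent handshake stats from the heartbeat, e.g.
    # "Circuit handshake stats since last time: X/Y TAP, Z/W NTor."
    grep "Circuit handshake stats" /var/log/tor/notices.log | tail -n 5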
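
Note on the backport question around 17:40-17:47: whether the DESTROY-fix commits reached a maintenance or release branch is easiest to confirm with git itself rather than the GitLab UI. A minimal sketch, assuming a local clone of tor.git with the upstream branches and tags fetched:

    # Which remote branches already contain the first of the three commits?
    git branch -r --contains 6fcae8e0d080d7d0875eab4a0118e8fdaf5e832c
    # Was it included in a tagged release? (this should list tor-0.4.5.14 if so)
    git tag --contains 6fcae8e0d080d7d0875eab4a0118e8fdaf5e832c

The same two commands can be repeated for the other two commits mentioned at 17:45 to cover all three.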