16:58:42 #startmeeting Network team meeting, 22nd August 2022
16:58:42 Meeting started Mon Aug 22 16:58:42 2022 UTC. The chair is ahf. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:58:42 Useful Commands: #action #agreed #help #info #idea #link #topic.
16:58:44 yoyo
16:58:52 o/
16:58:55 pad is at https://pad.riseup.net/p/tor-netteam-2022.1-keep
16:58:58 o/
16:59:05 o/
16:59:33 o/
16:59:42 o/
16:59:47 o/
16:59:49 hi
16:59:59 GeKo is the c-c-c-c-combo breaker :p
17:00:06 lol
17:00:16 o/
17:00:18 okay, let's get started. dgoulet is out right now i think and nickm's still off
17:00:22 eta: you here?
17:00:27 o/
17:00:30 thanks for the ping :p
17:00:35 np!
17:00:42 * eta was busy doing a travel expenses request :p
17:00:43 let's take a look at our boards at https://gitlab.torproject.org/groups/tpo/core/-/boards
17:00:59 eta: that sounds almost as exciting as this meeting is going to be!
17:01:05 wheeee yay
17:01:57 i don't see anything off in our board. i may have some questions for you eta on onionmasq one of these days this week
17:02:02 but nothing terrible there
17:02:13 ahf: yeah I just moved a ticket to needs review :p
17:02:28 looks like the next thing I need to do is possibly dormant mode / happy eyeballs support
17:02:42 maybe dormant mode first given it has roadmap on it
17:03:03 did anyone else end up taking a bit of that in the end?
17:03:05 ya, i saw that one - i think for onionmasq we can start diving into some of the DNS stuff next whilst the android gang plays around with things on device
17:03:25 i think dormant mode may be higher priority here
17:03:45 ok, is anybody blocked on anything on their side of the board?
17:03:51 eta: Of dormant mode ?
17:03:56 Diziet: yeah
17:04:31 arti#71
17:05:06 We have much of that I think, but maybe it isn't wired into everything.
17:05:16 dormant mode is going to interact with so many components i think
17:05:17 I have a vague memory of having this conversation before and someone saying "oh but it doesn't do XYZ"
17:05:24 at least it was like that in C
17:05:27 But apparently that isn't in the ticket?
17:05:29 in the C implementation
17:05:55 the ticket doesn't have any measurement part to it, no? as in, checking that we actually don't do any work when we are not supposed to do work?
17:06:03 i think it just lists the bigger picture
17:06:24 I guess the work item there is "make an arti go dormant and check it doesn't do anything"
17:06:33 ya, or "do as little as possible" :-)
17:06:42 and none of the things we think it shouldn't be doing
17:06:42 Since apparently we don't know what it is we are missing...
17:06:52 in C tor we did a ton of profiling while this work happened
17:07:03 i think there was an entire winter where we just did profile-driven development around 2018/2019
17:07:12 It shouldn't be too hard to see if it's making syscalls etc.
17:07:14 ya
17:07:28 ok very good
17:07:47 let's skip release status as david is out. i have been going over the docs and such for it and i believe i can roll releases while he is gone
17:07:49 and nick can co-sign them
17:07:54 eta: is this the only ticket for this? Is there another one that's assigned to you or something?
17:08:08 2/3 of us are needed for these releases
17:08:29 Diziet: yeah, arti#90 is what I'm looking at
17:08:43 I'll take that to mean nobody has anything interesting to say to me about it, and I'll just go look at that this week then :p
17:08:51 eta: ahhhhh
17:09:21 GeKo: i cannot remember if i asked you about this, but core/tor#40635 isn't of very high urgency, right? but it is something we should fix in an upcoming release
17:09:36 eta: Right, that was what I was missing (and, now, I see it's linked from #71 too, so sorry)
17:09:43 i put it on icebox, but i do not remember having the conversation with you
17:10:48 ahf: i did not ask
17:10:59 it's me who is supposed to ask i think :o
17:10:59 but i would love to see a fix
17:11:10 ok, let's mark it higher than Icebox then
17:11:19 i think it would make our life easier for bad-relay work
17:11:23 in a bunch of cases
17:11:36 so, no show-stopper but...
17:11:37 when you put it in For Network Health Team, i am supposed to catch that and hear how urgent it is, and i don't think i did that
17:11:40 ya
17:11:52 oki, i will bump it up higher there - it sounds like a bug that we should be able to detect in tests too
17:12:04 yeah
17:12:05 i think that was everything we had from external teams here
17:12:19 i don't see any announcements or discussion items
17:12:26 mikeperry: you wanna dive into s61?
17:12:31 kk
17:13:26 so I ran a link-sharing sim starting thursday. it ran through the weekend (they take about 50-60% longer than other sims on the large runner)
17:13:34 https://gitlab.torproject.org/jnewsome/sponsor-61-sims/-/jobs/167572/artifacts/file/public/de/tornet.plot.pages.pdf
17:13:40 https://gitlab.torproject.org/jnewsome/sponsor-61-sims/-/jobs/167572/artifacts/file/public/hk/tornet.plot.pages.pdf
17:14:06 the main thing I noticed is that error rates for Exit streams were much, much higher than before
17:14:12 there's also more perf variance
17:14:15 what is "link sharing"? how much we co-utilize links in inter-relay comms?
17:14:42 ahf: it's meant to simulate how multiple relay instances share the same upstream, as they do in the live network
17:15:20 ah, so the number of relays that for example share the same uplink (like N relays in some german hetzner datacenter)?
17:15:22 yeah; we don't really have ground truth here. the current model is probably an over-estimate of how much is happening
17:15:25 the error rate you refer to is in the last graph?
17:16:09 micah: after the "exit transfer goodput" graphs
17:16:14 interesting work
17:16:17 oh I see there are a couple there
17:16:20 "exit transfer error rate"
17:17:13 the times are much higher too
17:17:26 the error rate for exits in the sims is much higher than the live network (blue) and the previous sim with congestion control and a "nonflooding" network model, period
17:18:37 there were also two circuits with runaway queue sizes in the first run, but none in the next two runs. the next two runs behaved similarly to previous RFC3742 sims wrt queue sizes
17:19:20 my guess is that the runaway queues are caused by a bad RTT estimate on those two circuits. but not sure.
17:19:58 ahf: I could use dgoulet's log patches to append_cell_to_circuit_queue(), which I believe are on your relays now
17:20:23 like you want to apply the patches for this work?
17:20:29 so having access to that would be good. I can add that beefed-up log to a sim run to check the queue issue in more detail if it reappears
17:20:56 could you send me an ssh key to add on the relays?
i also still need to add geko to it so he has access, but i did get keys from him
17:21:12 just ping me when you are ready with that
17:21:19 ya, i will get to it today
17:21:29 and i can try to take a look at the logs etc.
17:21:36 ack
17:21:46 ahf: ok yeah I have a few keys that are yubikeys.. the ones on gitlab. will get them to you later
17:21:59 sounds good
17:23:42 juga,geko: how is the sbws bwscanner_cc=2 going? should I take a look at that yet?
17:24:04 (also do we have a ticket for the xon/xoff -> bwfiles change and spec update?)
17:24:10 mikeperry: i think it is going fine, no need to take a look yet
17:24:41 mikeperry: i just created that issue
17:24:56 kk. no rush. this will be useful but again it's not our brightest fire, still :)
17:25:08 sbws#40144
17:25:13 ok
17:26:24 GeKo,ahf: how are overload and ram usage looking so far?
17:26:53 I saw the #tor-dev note that exits are still only at ~58% upgraded to 0.4.7.9/0.4.7.10
17:27:29 still quite high. on akka today we have seen a range of 5% to 30% dropped ntor cells
17:27:42 so the party is still on, so to speak
17:28:16 i have not looked at memory usage there. both nodes seem to be using around 5GB of resident memory and have plenty to spare
17:28:23 dgoulet also saw some cloudflare onion services still hitting the new queue limits on Friday; this probably means they also have not upgraded to 0.4.7.10
17:28:33 yes
17:28:48 does anyone know who runs the cloudflare onion service deployment? do we have any contacts there?
17:29:06 both akka and ukko are running the same tor version now though
17:29:34 i don't know if we have any contact there. it used to be an intern who built it, then someone took it over but she moved to work for apple at some point and i don't know who is running it now
17:29:43 but it's the crypto group who is responsible for it i think
17:29:48 mikeperry: i could ping nick from cloudflare
17:29:49 (longclaw ram is still 96% utilized, which is uncommonly high)
17:29:59 heh uh oh. they could stay on 0.4.7.8 forever, heh
17:30:12 so we could write to Nick Sullivan
17:30:14 although it's been a while since we had some email convo
17:30:15 (oh, but it had not updated to 4.10 yet)
17:30:27 ahf: yeah, that's the nick i meant
17:30:36 GeKo: maybe a good idea to hear from him, yeah
17:30:53 +1
17:30:55 i think the tor stuff is still on his team's plate
17:31:02 yeah
17:31:19 micah: hrm that is super weird. I actually think that might be a different kind of attack, but who knows
17:32:18 ahf: mikeperry: if you want to write that mail to him please do, otherwise i can get to that tomorrow
17:32:54 I don't have his email. but I can be on Cc to answer any questions
17:33:26 GeKo: you don't have a relationship with him already from the nethealth side of things?
17:33:44 just tor browser and cloudflare onions
17:33:59 so, no
17:34:32 ok - i think it is fine if you write it - i have met him once and did the haproxy stuff for his intern back then, but that is about it
17:34:43 i had to look his name up on twitter :o
17:35:41 GeKo: any other new network-health things you have noticed or want to discuss?
17:36:36 at some point if these exits ever upgrade, we can try lowering that queue limit further.
according to the sims, we can safely go down from 4k to 2k, but we need onions and exits to upgrade for that one
17:36:50 nothing from my side for today
17:37:04 this should reduce memory pressure at guards, but won't change that ntor overload situation
17:37:44 what is the minimum version people need to allow for that queue limit reduction?
17:38:01 0.4.7.9 or 0.4.7.10
17:38:12 0.4.7.9 has the fix, but has a busted geoip file
17:38:45 the fix is https://gitlab.torproject.org/tpo/core/tor/-/issues/40642
17:39:00 but there is also a fix to DESTROY cell handling for all relays that will also reduce queue pressure
17:39:17 thx
17:40:06 anything else we have for today?
17:40:08 that DESTROY fix is https://gitlab.torproject.org/tpo/core/tor/-/issues/40623, and I believe it was backported
17:40:49 err at least I hope it was backported
17:40:57 I am still not sure how to check that in gitlab
17:41:20 hm, i don't think gitlab shows that - only the git branches reveal that
17:42:45 aha yes I see it in maint-0.4.5
17:43:01 looks like the DESTROY fix should be in 0.4.5.14
17:43:13 yeah, but not in release-0.4.5, so it looks like it comes in the next 0.4.5 release
17:43:40 oh ok. so yeah i guess remembering to do that falls on you ahf
17:44:05 that one does matter for this memory situation. I guess you can still check with dgoulet before he leaves
17:44:36 which commit are you looking at in -maint ?
17:44:59 there are three
17:45:01 6fcae8e0d080d7d0875eab4a0118e8fdaf5e832c
17:45:11 dc13936f20e6263a099f40d32a274847e8384f96
17:45:18 8d8afc4efa538682ef2b80f6664456b34b84e519
17:45:21 i think they are up-to-date with each other, so i think they did come out with a release around 12/8 2022
17:45:21 in maint-0.4.5
17:46:04 ya, they are in the release branch too
17:46:11 so i think they did arrive with the recent release
17:46:28 ok great
17:46:47 yeah, for 0.4.5.14
17:47:02 https://gitlab.torproject.org/tpo/core/tor/-/commits/tor-0.4.5.14/
17:47:20 ok that was good
17:47:33 anything else we need to do now?
17:48:17 I think that's it for s61. I will follow up with keys
17:48:22 sounds good
17:48:26 thanks all o/
17:48:29 #endmeeting
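
Note on the dormant-mode check discussed around 17:06-17:07 ("make an arti go dormant and check it doesn't do anything"): one rough way to see whether a supposedly dormant arti process is still doing work is to attach strace and count its syscalls for a while. A minimal sketch, assuming the pid of a running arti client is known (<arti-pid> is a placeholder, and attaching may require root or ptrace permission):

    # Count syscalls for 60 seconds while the client is meant to be dormant;
    # a truly idle process should show little beyond occasional timer/epoll activity.
    timeout 60 strace -c -f -p <arti-pid>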
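
Note on the overload discussion around 17:26-17:28 (5% to 30% dropped ntor cells on akka): C tor reports circuit-handshake drops in its notice-level heartbeat, so figures like these can be spot-checked from a relay's own logs. A minimal sketch, assuming the relay writes notices to a log file (the path here is an assumption and depends on the torrc):

    # Show the most recent handshake stats from the heartbeat, e.g.
    # "Circuit handshake stats since last time: X/Y TAP, Z/W NTor."
    grep "Circuit handshake stats" /var/log/tor/notices.log | tail -n 5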
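
Note on the backport question around 17:40-17:47: whether the DESTROY-fix commits reached a maintenance or release branch is easiest to confirm with git itself rather than the GitLab UI. A minimal sketch, assuming a local clone of tor.git with the upstream branches and tags fetched:

    # Which remote branches already contain the first of the three commits?
    git branch -r --contains 6fcae8e0d080d7d0875eab4a0118e8fdaf5e832c
    # Was it included in a tagged release? (this should list tor-0.4.5.14 if so)
    git tag --contains 6fcae8e0d080d7d0875eab4a0118e8fdaf5e832c

The same two commands can be repeated for the other two commits mentioned at 17:45 to cover all three.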