16:00:26 #startmeeting network-health 08/30/2021
16:00:26 Meeting started Mon Aug 30 16:00:26 2021 UTC. The chair is GeKo. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:26 Useful Commands: #action #agreed #help #info #idea #link #topic.
16:00:31 hello!
16:00:36 hello
16:00:44 last network-health meeting in august 2021 :)
16:00:48 let's see
16:00:54 http://kfahv6wfkbezjyg4r6mlhpmieydbebr5vkok5r34ya464gqz6c44bnyd.onion/p/tor-nethealthteam-2021.1-keep
16:00:57 is our pad
16:01:10 please add your items on it
16:01:15 if you have not already
16:01:57 as always for things you want to bring up or talk about, mark them as bold
16:02:05 or put them into the discussion section
16:06:24 okay, let's get started
16:07:07 i don't see anything marked as bold yet
16:07:10 good
16:07:13 I'm around if we want to talk about the grafana dashboard
16:07:52 ggus: i've done a first round of drafting some process for dealing with EOL relays/bridges
16:08:11 https://pad.riseup.net/p/NzO5KK6H2_tp_bSJ7xdI
16:08:33 i gonna stare a bit more at it this week and move it as a draft maybe onto the wiki
16:08:37 *into
16:09:01 there are still a bunch of XXX we should think about
16:09:20 but either way, if you have comments/ideas those would be much appreciated
16:09:33 meskio: sounds good
16:09:44 hiro: do we want to start talking about that point first?
16:09:48 * ggus loading the pad
16:09:57 GeKo: ok sounds good
16:10:12 o/
16:10:25 so the dashboard was a first step to see what data we can get out of onionoo that we need and how grafana can help us
16:10:38 IMO it's a bit basic and slow to load but it's a start
16:10:56 depending on what ggus and cohosh and meskio need we can do different things
16:11:20 it looks pretty nice and I think it can be very useful
16:11:27 one thing I was talking with GeKo about was that if we start loading data about relays/bridges in a postgres db we can have timeseries "for free" with grafana queries
16:11:59 and on grafana you can do alerts on timeseries like from the ui which could be interesting for spotting patterns
16:12:34 but we are not there yet and maybe knowing what you all need can help me understand how to prioritize things
16:12:52 hiro: that would be super useful! one thing that i would like to work on for a future sponsor is relay operator *retention*. in order to do that, we need to know which relays (and their contactinfo) are leaving the network.
16:13:36 hiro: how do you tell if a bridge is "overloaded"? i think this will be really useful to know whether we need more default bridges for example
16:13:55 it's in the descriptors
16:13:55 GeKo: wrt EOL relays/bridges, i can take a look next week. this week is very packed.
16:14:20 no worries
16:14:38 cohosh: https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n637
16:14:42 ah ty!
16:14:46 Are 'bridges per distributor' the number of bridges available on each mechanism? (the graph now says 'No data', but it was working last time I looked at it)
16:15:01 also: https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n1353
16:15:13 cohosh: prop#328
16:16:36 meskio: yes sometimes the dashboard needs to be reloaded
16:16:53 ggus: how far in the future is "future" for the future sponsor?
16:16:58 and yes that's the idea, I just grouped per mechanism
16:17:04 that's nothing we committed yet, right?
16:17:29 GeKo: that's right. we don't have yet any work committed
16:17:40 okay.
16:17:54 so we could think about when writing the proposal to get money for that part, too
16:17:55 but to evaluate how much work it will be, we need to start collecting this.
16:18:26 that part = getting the grafana dashboard to do what you need
16:18:46 hiro: nice, that looks pretty cool, do you also have data on the number of distributed bridges per distributor? that might be nice to see how much each distributor is used
16:19:37 meskio: that data is from bridgedb?
16:20:14 we have bridge users by transport
16:20:15 yes
16:20:27 you would have to use data from the pool assignments
16:20:45 https://metrics.torproject.org/collector.html#bridge-pool-assignments
16:21:08 exactly, AFAIK metrics.tpo doesn't let you see how many bridges are requested by email or how many by moat
16:21:21 https://metrics.torproject.org/bridgedb-distributor.html
16:21:26 we've done this manually, the same as how we look at default bridge usage metrics
16:21:46 i didn't understand. what's missing from that graph? ^
16:21:49 but it would be cool to have a more accessible and faster way to see recent changes
16:21:52 ggus: thanks for proving me wrong :)
16:22:14 hahah
16:22:34 ggus: oh yeah, i was thinking by country XD
16:22:34 * meskio needs to explore more metrics.tpo ...
16:22:51 but maybe by country is too fine-grained for grafana?
16:23:26 in grafana you can set up selectors, so you could have a country selector, but AFAIK will be for the whole dashboard
16:23:43 yeah sorry, i guess this is getting too off topic from network health
16:23:56 :)
16:23:58 cohosh meskio, in grafana you can do different kind of grouping...
the problem in this case is where we get the data
16:24:29 because I started just filtering from onionoo basically I get a json with all the data from there and then I do some manipulation on grafana directly
16:24:33 if this data was in a db we could do more
16:24:52 one first step maybe is to connect grafana to the db we are suing on the metrics website
16:25:05 that's used to generate all the graphs that we have on metrics.tpo
16:25:21 s/suing/using
16:26:15 we could think about that
16:26:39 just to play around and see what is possible already
16:27:08 which probably not many people know :)
16:27:12 regarding measuring operators churn ... we should record when we stop seeing a relay for a while
16:27:21 I think we have that data on exonerator
16:27:44 but only the ip
16:27:56 hiro: yeah, and we want contact_info
16:27:57 I haven't looked at it just yet to be honest
16:29:19 maybe development wise is not a lot of work but we should start thinking on the amount of data we store and for how much time
16:29:31 if we had snapshots of all the relays at a given time we could find out
16:29:41 ggus: you could file a ticket in the network-health/team project about some way to get at that info
16:29:44 question is how many snapshots we need and how many we can keep
16:29:56 GeKo: yes
16:29:58 hiro: well, we do. we have hourly onionoo json files :)
16:30:27 we also do in collector
16:31:37 we can also get the contact info manually from polyanthum
16:31:44 but to have that into grafana we should have it in a db somewhere
16:31:48 hiro: GeKo: we could take a look at TorBSD stats and think about a network diversity dashboard - https://torbsd.github.io/oostats.html
16:31:49 maybe we don't need grafana
16:32:03 just a report of what ops are leaving
16:32:35 ggus: yeah
16:32:45 that's definitely something i want :)
16:32:54 ggus yeah that we can do
16:33:17 maybe even already
16:33:34 i'll file a ticket to track that effort after the meeting is done
16:33:41 ack!
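The "report of what ops are leaving" idea from the exchange above could look roughly like the sketch below. It is only an illustration, not something the team has built: it assumes two saved Onionoo details documents (JSON objects with a "relays" list whose entries carry "fingerprint", "nickname" and an optional "contact" field), and the function names and the report shape are hypothetical.

```python
import json
from collections import defaultdict


def relays_by_fingerprint(details_doc):
    """Index the "relays" list of a details document by fingerprint."""
    return {relay["fingerprint"]: relay for relay in details_doc.get("relays", [])}


def departed_relays(older_doc, newer_doc):
    """Return relays listed in the older snapshot but absent from the newer one."""
    old = relays_by_fingerprint(older_doc)
    new = relays_by_fingerprint(newer_doc)
    return [old[fp] for fp in sorted(old.keys() - new.keys())]


def report_by_contact(relays):
    """Group relay nicknames by contact string (hypothetical report format)."""
    grouped = defaultdict(list)
    for relay in relays:
        contact = relay.get("contact") or "<no contact>"
        grouped[contact].append(relay.get("nickname", "Unnamed"))
    return dict(grouped)


if __name__ == "__main__":
    # Tiny inline snapshots standing in for two saved details documents.
    older = {"relays": [
        {"fingerprint": "A" * 40, "nickname": "alpha", "contact": "op@example.org"},
        {"fingerprint": "B" * 40, "nickname": "beta"},
    ]}
    newer = {"relays": [
        {"fingerprint": "B" * 40, "nickname": "beta"},
    ]}
    print(json.dumps(report_by_contact(departed_relays(older, newer)), indent=2))
```

A relay missing from a single snapshot may only be down temporarily, so a real report would probably require absence across several consecutive snapshots before counting an operator as gone.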
16:34:11 so, yes, i guess we should keep thinking about potential use-cases for that dashboard
16:34:24 with an eye towards where the data should come from/is coming from
16:35:02 but, yes, this is for all teams and we need that input to know where we should prioritize our time/resources
16:35:36 i wonder whether we should collect somewhere all the ideas that popped up now
16:35:49 is there a ticket for this?
16:35:53 and might pop up once we start thinking more about possible use-cases
16:35:57 one ticket to rule them all!
16:35:58 that should be a good place to collect use cases
16:36:01 there was the ticket about tpi infrastructure
16:36:03 :)
16:36:13 but it was only about the overload case
16:36:28 i think there is no place for that yet
16:36:36 and no ticket :)
16:37:27 but i guess i can file one here, too
16:37:35 https://gitlab.torproject.org/tpo/network-health/team/-/issues/34
16:37:54 yeah, that's specific for the s61 use case
16:38:23 * gaba 's browser is so f* slow today...
16:38:58 geko: you are creating a ticket for it then
16:39:07 yes
16:39:10 ty
16:39:17 anything else for the grafana discussion item?
16:39:18 gaba: my tb is very slow today too.
16:39:48 the grafana board has been loading for the last 5 min... with no graphs
16:39:55 if not then let's go to the other item i put on the list
16:40:27 GeKo: hiro: https://gitlab.torproject.org/tpo/network-health/team/-/issues/113
16:40:46 there was some discussion last week going on about how to present the overloaded info to relay operators and how to offer them help
16:40:59 is everyone good with that now?
16:41:05 and we have a plan moving forward?
16:41:26 or do we need more discussion by all stakeholders?
16:41:30 I think - for me - what is left is the support article
16:41:42 right now as i see it we go with the support article
16:41:54 and point relay operators on relay search to that one
16:42:02 in case their relay is overloaded
16:42:13 after adding to the support portal and to metrics, we could send an email to tor-relays@
16:42:30 that, too, good idea
16:42:48 as I commented on https://gitlab.torproject.org/tpo/web/support/-/merge_requests/43 maybe we want to add some more pointers to help people figure out what is wrong before dumping their data to some email address
16:43:24 otherwise as GeKo said people might just send the data
16:43:35 yeah, i am fine with that
16:44:00 really, reaching out to dgoulet or me should be the last resort
16:44:13 and one could see that case as a failure on our side :)
16:44:48 in that we did not good enough help to relays operators figuring out what's up with their system before they reached that step
16:44:54 *give
16:45:50 I think some examples of what needs tuning on sysctl or how to understand what's overloaded
16:46:22 if you or dgoulet already have typical scenarios we could add these
16:46:30 who is writing the support article?
16:46:34 is that you, ggus?
16:46:38 i'm not familiar with this new overloaded info. it will check every consensus, how will this work?
16:46:50 I made a draft during the weekend
16:46:50 GeKo: hiro is writing
16:47:10 hrm, there is no dgoulet here
16:47:45 hiro: that's your !43?
16:48:14 yes
16:48:50 okay. i'll try helping with that although i don't have much experience with overloaded relays
16:49:08 i guess we should flag dgoulet there, too, for input
16:49:09 yes me neither
16:49:22 here we go
16:49:53 dgoulet: we try to get that support article for the overload indicator done
16:50:06 https://gitlab.torproject.org/tpo/web/support/-/merge_requests/43 has the draft
16:50:07 ah nice yes!
16:50:11 anything I can help with?
16:50:20 i guess getting input from you would be good
16:50:29 np
16:50:50 I'll get your feedback today for sure
16:51:01 i think we can use that mr for ideas we think we should add
16:51:08 and how to phrase things
16:51:28 ok!
16:51:35 in particular examples of what to check for and what to do in case X would be good
16:52:05 dgoulet: i think if folks reach the step where they feel they need to contact us then we have not done the best job with the support article
16:52:24 at least that's the kind of polar star i have here for guidance :)
16:52:42 i mean if that happens from time to time, well, that's fine
16:52:45 agree
16:52:59 but we should really try to give relay ops the means to solve the problems on their own
16:53:09 yep I have found some ops publish some of their tunings around but we don't have much pointers for people about where to start looking if there is an issue
16:53:22 just so the potential exposure of that metrics port data is minimized
16:53:42 okay
16:53:55 it's good to see that we are all on the same page here, though
16:54:13 do we have anything else for today's meeting?
16:54:53 hearing nothing
16:55:03 thanks everyone and have a nice week o/
16:55:05 #endmeeting