16:00:26 <GeKo> #startmeeting network-health 08/30/2021
16:00:26 <MeetBot> Meeting started Mon Aug 30 16:00:26 2021 UTC.  The chair is GeKo. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:26 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
16:00:31 <GeKo> hello!
16:00:36 <hiro> hello
16:00:44 <GeKo> last network-health meeting in august 2021 :)
16:00:48 <GeKo> let's see
16:00:54 <GeKo> http://kfahv6wfkbezjyg4r6mlhpmieydbebr5vkok5r34ya464gqz6c44bnyd.onion/p/tor-nethealthteam-2021.1-keep
16:00:57 <GeKo> is our pad
16:01:10 <GeKo> please add your items on it
16:01:15 <GeKo> if you have not already
16:01:57 <GeKo> as always for things you want to bring up or talk about, mark them as bold
16:02:05 <GeKo> or put them into the discussion section
16:06:24 <GeKo> okay, let's get started
16:07:07 <GeKo> i don't see anything marked as bold yet
16:07:10 <GeKo> good
16:07:13 <meskio> I'm around if we want to talk about the grafana dashboard
16:07:52 <GeKo> ggus: i've done a first round of drafting some process for dealing with EOL relays/bridges
16:08:11 <GeKo> https://pad.riseup.net/p/NzO5KK6H2_tp_bSJ7xdI
16:08:33 <GeKo> i'm gonna stare a bit more at it this week and move it as a draft maybe into the wiki
16:09:01 <GeKo> there are still a bunch of XXX we should think about
16:09:20 <GeKo> but either way, if you have comments/ideas those would be much appreciated
16:09:33 <GeKo> meskio: sounds good
16:09:44 <GeKo> hiro: do we want to start talking about that point first?
16:09:48 * ggus loading the pad
16:09:57 <hiro> GeKo: ok sounds good
16:10:12 <cohosh> o/
16:10:25 <hiro> so the dashboard was a first step to see what data we can get out of onionoo that we need and how grafana can help us
16:10:38 <hiro> IMO it's a bit basic and slow to load but it's a start
16:10:56 <hiro> depending on what ggus and cohosh and meskio need we can do different things
16:11:20 <meskio> it looks pretty nice and I think it can be very useful
16:11:27 <hiro> one thing I was talking with GeKo about was that if we start loading data about relays/bridges in a postgres db we can have timeseries "for free" with grafana queries
16:11:59 <hiro> and on grafana you can do alerts on timeseries like from the ui which could be interesting for spotting patterns
16:12:34 <hiro> but we are not there yet and maybe knowing what you all need can help me understand how to prioritize things
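The idea of storing relay/bridge data in a database so that time series fall out of plain queries could look roughly like the sketch below. Everything here is an assumption for illustration: the `relay_status` table, its columns, and the sample rows are hypothetical, and `sqlite3` stands in for the postgres db mentioned above so the example runs without a server.

```python
import sqlite3

# Hypothetical schema: one row per relay per hourly snapshot.
SCHEMA = """
CREATE TABLE relay_status (
    snapshot_time TEXT NOT NULL,     -- UTC, e.g. '2021-08-30 16:00:00'
    fingerprint   TEXT NOT NULL,
    running       INTEGER NOT NULL,  -- 1 if the relay was running
    overloaded    INTEGER NOT NULL,  -- 1 if it reported overload
    PRIMARY KEY (snapshot_time, fingerprint)
)
"""


def overloaded_per_snapshot(conn):
    """Return (snapshot_time, overloaded_count) rows ordered by time."""
    return conn.execute(
        "SELECT snapshot_time, SUM(overloaded) FROM relay_status "
        "WHERE running = 1 GROUP BY snapshot_time ORDER BY snapshot_time"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Made-up rows; real data would come from onionoo or collector.
conn.executemany(
    "INSERT INTO relay_status VALUES (?, ?, ?, ?)",
    [
        ("2021-08-30 15:00:00", "AAAA", 1, 0),
        ("2021-08-30 15:00:00", "BBBB", 1, 1),
        ("2021-08-30 16:00:00", "AAAA", 1, 1),
        ("2021-08-30 16:00:00", "BBBB", 1, 1),
    ],
)
series = overloaded_per_snapshot(conn)
```

A query of this shape is the kind of thing a grafana panel or alert could be pointed at once the data lives in a real database.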
16:12:52 <ggus> hiro: that would be super useful! one thing that i would like to work on for a future sponsor is relay operator *retention*. in order to do that, we need to know which relays (and their contact info) are leaving the network.
16:13:36 <cohosh> hiro: how do you tell if a bridge is "overloaded"? i think this will be really useful to know whether we need more default bridges for example
16:13:55 <hiro> it's in the descriptors
16:13:55 <ggus> GeKo: wrt EOL relays/bridges, i can take a look next week. this week is very packed.
16:14:20 <GeKo> no worries
16:14:38 <hiro> cohosh: https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n637
16:14:42 <cohosh> ah ty!
16:14:46 <meskio> Are 'bridges per distributor' the number of bridges available on each mechanism? (the graph now says 'No data', but it was working last time I looked at it)
16:15:01 <hiro> also: https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n1353
16:15:13 <GeKo> cohosh: prop#328
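As a rough illustration of "it's in the descriptors": the sketch below looks for an `overload-general` line in descriptor text. It assumes the line has the form `overload-general <version> YYYY-MM-DD HH:MM:SS`, which is my reading of prop#328 and dir-spec and should be checked against the spec links above. The helper name and the sample descriptor fragment are made up.

```python
from datetime import datetime, timezone


def parse_overload_general(descriptor_text):
    """Return (version, timestamp) from an overload-general line, or None.

    Assumed line format: overload-general <version> YYYY-MM-DD HH:MM:SS
    """
    for line in descriptor_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "overload-general":
            version = int(parts[1])
            stamp = datetime.strptime(
                parts[2] + " " + parts[3], "%Y-%m-%d %H:%M:%S"
            ).replace(tzinfo=timezone.utc)
            return version, stamp
    return None


# Made-up descriptor fragment for illustration only.
sample = (
    "router example 192.0.2.1 9001 0 0\n"
    "overload-general 1 2021-08-30 16:00:00\n"
)
result = parse_overload_general(sample)
```

A descriptor without such a line yields `None`, which a dashboard could treat as "no overload reported".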
16:16:36 <hiro> meskio: yes sometimes the dashboard needs to be reloaded
16:16:53 <GeKo> ggus: how far in the future is "future" for the future sponsor?
16:16:58 <hiro> and yes that's the idea, I just grouped per mechanism
16:17:04 <GeKo> that's nothing we committed yet, right?
16:17:29 <ggus> GeKo: that's right. we don't have yet any work committed
16:17:40 <GeKo> okay.
16:17:54 <GeKo> so we could think about when writing the proposal to get money for that part, too
16:17:55 <ggus> but to evaluate how much work it will be, we need to start collecting this.
16:18:26 <GeKo> that part = getting the grafana dashboard to do what you need
16:18:46 <meskio> hiro: nice, that looks pretty cool, do you also have data on the number of distributed bridges per distributor? that might be nice to see how much each distributor is used
16:19:37 <hiro> meskio: that data is from bridgedb?
16:20:14 <hiro> we have bridge users by transport
16:20:15 <meskio> yes
16:20:27 <cohosh> you would have to use data from the pool assignments
16:20:45 <cohosh> https://metrics.torproject.org/collector.html#bridge-pool-assignments
16:21:08 <meskio> exactly, AFAIK metrics.tpo doesn't let you see how many bridges are requested by email or how many by moat
16:21:21 <ggus> https://metrics.torproject.org/bridgedb-distributor.html
16:21:26 <cohosh> we've done this manually, the same as how we look at default bridge usage metrics
16:21:46 <ggus> i didnt understand. what's missing from that graph? ^
16:21:49 <cohosh> but it would be cool to have a more accessible and faster way to see recent changes
16:21:52 <meskio> ggus: thanks for proving me wrong :)
16:22:14 <ggus> hahah
16:22:34 <cohosh> ggus: oh yeah, i was thinking by country XD
16:22:34 * meskio needs to explore more metrics.tpo ...
16:22:51 <cohosh> but maybe by country is too fine-grained for grafana?
16:23:26 <meskio> in grafana you can set up selectors, so you could have a country selector, but AFAIK it will apply to the whole dashboard
16:23:43 <cohosh> yeah sorry, i guess this is getting too off topic from network health
16:23:56 <GeKo> :)
16:23:58 <hiro> cohosh meskio, in grafana you can do different kinds of grouping... the problem in this case is where we get the data
16:24:29 <hiro> because I started just filtering from onionoo basically I get a json with all the data from there and then I do some manipulation on grafana directly
16:24:33 <hiro> if this data was in a db we could do more
16:24:52 <hiro> one first step may be to connect grafana to the db we are using on the metrics website
16:25:05 <hiro> that's used to generate all the graphs that we have on metrics.tpo
16:26:15 <GeKo> we could think about that
16:26:39 <GeKo> just to play around and see what is possible already
16:27:08 <GeKo> which probably not many people know :)
16:27:12 <hiro> regarding measuring operator churn ... we should record when we stop seeing a relay for a while
16:27:21 <hiro> I think we have that data on exonerator
16:27:44 <hiro> but only the ip
16:27:56 <ggus> hiro: yeah, and we want contact_info
16:27:57 <hiro> I haven't looked at it just yet to be honest
16:29:19 <hiro> maybe development-wise it is not a lot of work but we should start thinking about the amount of data we store and for how long
16:29:31 <hiro> if we had snapshots of all the relays at a given time we could find out
16:29:41 <GeKo> ggus: you could file a ticket in the network-health/team project about some way to get at that info
16:29:44 <hiro> question is how many snapshots we need and how many we can keep
16:29:56 <ggus> GeKo: yes
16:29:58 <GeKo> hiro: well, we do. we have hourly onionoo json files :)
16:30:27 <hiro> we also do in collector
16:31:37 <cohosh> we can also get the contact info manually from polyanthum
16:31:44 <hiro> but to have that into grafana we should have it in a db somewhere
16:31:48 <ggus> hiro: GeKo: we could take a look at TorBSD stats and think about a network diversity dashboard - https://torbsd.github.io/oostats.html
16:31:49 <hiro> maybe we don't need grafana
16:32:03 <hiro> just a report of what ops are leaving
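A report of departing operators could start from a diff of two snapshots, along the lines of the sketch below. It assumes each snapshot is a parsed onionoo details document whose `relays` entries carry `fingerprint`, `nickname`, `contact` and `running` fields; the function name and sample data are hypothetical. A real report would presumably require a relay to be gone across many snapshots ("for a while") before counting it as having left.

```python
def departed_relays(earlier, later):
    """List relays running in `earlier` that are not running in `later`.

    Returns (fingerprint, nickname, contact) tuples sorted by fingerprint.
    """
    still_running = {
        relay["fingerprint"]
        for relay in later.get("relays", [])
        if relay.get("running", False)
    }
    gone = [
        relay
        for relay in earlier.get("relays", [])
        if relay.get("running", False)
        and relay["fingerprint"] not in still_running
    ]
    return [
        (relay["fingerprint"], relay.get("nickname", ""), relay.get("contact", ""))
        for relay in sorted(gone, key=lambda r: r["fingerprint"])
    ]


# Made-up snapshots for illustration only.
snapshot_old = {"relays": [
    {"fingerprint": "AAAA", "nickname": "alpha", "contact": "op-a@example.org", "running": True},
    {"fingerprint": "BBBB", "nickname": "beta", "contact": "op-b@example.org", "running": True},
    {"fingerprint": "CCCC", "nickname": "gamma", "running": True},
]}
snapshot_new = {"relays": [
    {"fingerprint": "AAAA", "nickname": "alpha", "contact": "op-a@example.org", "running": True},
    {"fingerprint": "BBBB", "nickname": "beta", "contact": "op-b@example.org", "running": False},
]}
report = departed_relays(snapshot_old, snapshot_new)
```

Here `beta` counts as departed because it stopped running, and `gamma` because it vanished from the later snapshot; `gamma` has no contact info, so its contact field comes back empty.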
16:32:35 <GeKo> ggus: yeah
16:32:45 <GeKo> that's definitely something i want :)
16:32:54 <hiro> ggus yeah that we can do
16:33:17 <hiro> maybe even already
16:33:34 <GeKo> i'll file a ticket to track that effort after the meeting is done
16:33:41 <ggus> ack!
16:34:11 <GeKo> so, yes, i guess we should keep thinking about potential use-cases for that dashboard
16:34:24 <GeKo> with an eye towards where the data should come from/is coming from
16:35:02 <GeKo> but, yes, this is for all teams and we need that input to know where we should prioritize our time/resources
16:35:36 <GeKo> i wonder whether we should collect somewhere all the ideas that popped up now
16:35:49 <gaba> is there a ticket for this?
16:35:53 <GeKo> and might pop up once we start thinking more about possible use-cases
16:35:57 <ggus> one ticket to rule them all!
16:35:58 <gaba> that should be a good place to collect use cases
16:36:01 <hiro> there was the ticket about tpi infrastructure
16:36:03 <gaba> :)
16:36:13 <hiro> but it was only about the overload case
16:36:28 <GeKo> i think there is no place for that yet
16:36:36 <GeKo> and no ticket :)
16:37:27 <GeKo> but i guess i can file one here, too
16:37:35 <hiro> https://gitlab.torproject.org/tpo/network-health/team/-/issues/34
16:37:54 <GeKo> yeah, that's specific for the s61 use case
16:38:23 * gaba 's browser is so f* slow today...
16:38:58 <gaba> geko: you are creating a ticket for it then
16:39:07 <GeKo> yes
16:39:10 <gaba> ty
16:39:17 <GeKo> anything else for the grafana discussion item?
16:39:18 <ggus> gaba: my tb is very slow today too.
16:39:48 <gaba> the grafana board has been loading for the last 5 min... with no graphs
16:39:55 <GeKo> if not then let's go to the other item i put on the list
16:40:27 <ggus> GeKo: hiro: https://gitlab.torproject.org/tpo/network-health/team/-/issues/113
16:40:46 <GeKo> there was some discussion last week going on about how to present the overloaded info to relay operators and how to offer them help
16:40:59 <GeKo> is everyone good with that now?
16:41:05 <GeKo> and we have a plan moving forward?
16:41:26 <GeKo> or do we need more discussion by all stakeholders?
16:41:30 <hiro> I think - for me - what is left is the support article
16:41:42 <GeKo> right now as i see it we go with the support article
16:41:54 <GeKo> and point relay operators on relay search to that one
16:42:02 <GeKo> in case their relay is overloaded
16:42:13 <ggus> after adding to the support portal and to metrics, we could send an email to tor-relays@
16:42:30 <GeKo> that, too, good idea
16:42:48 <hiro> as I commented on https://gitlab.torproject.org/tpo/web/support/-/merge_requests/43 maybe we want to add some more pointers to help people figure out what is wrong before dumping their data to some email address
16:43:24 <hiro> otherwise as GeKo said people might just send the data
16:43:35 <GeKo> yeah, i am fine with that
16:44:00 <GeKo> really, reaching out to dgoulet or me should be the last resort
16:44:13 <GeKo> and one could see that case as a failure on our side :)
16:44:48 <GeKo> in that we did not give good enough help to relay operators figuring out what's up with their system before they reached that step
16:45:50 <hiro> I think some examples of what needs tuning on sysctl or how to understand what's overloaded would help
16:46:22 <hiro> if you or dgoulet already have typical scenarios we could add these
16:46:30 <GeKo> who is writing the support article?
16:46:34 <GeKo> is that you, ggus?
16:46:38 <ggus> i'm not familiar with this new overloaded info. will it check every consensus? how will this work?
16:46:50 <hiro> I made a draft during the weekend
16:46:50 <ggus> GeKo: hiro is writing
16:47:10 <GeKo> hrm, there is no dgoulet here
16:47:45 <GeKo> hiro: that's your !43?
16:48:14 <hiro> yes
16:48:50 <GeKo> okay. i'll try helping with that although i don't have much experience with overloaded relays
16:49:08 <GeKo> i guess we should flag dgoulet there, too, for input
16:49:09 <hiro> yes me neither
16:49:22 <GeKo> here we go
16:49:53 <GeKo> dgoulet: we try to get that support article for the overload indicator done
16:50:06 <GeKo> https://gitlab.torproject.org/tpo/web/support/-/merge_requests/43 has the draft
16:50:07 <dgoulet> ah nice yes!
16:50:11 <dgoulet> anything I can help with?
16:50:20 <GeKo> i guess getting input from you would be good
16:50:29 <dgoulet> np
16:50:50 <dgoulet> I'll get you feedback today for sure
16:51:01 <GeKo> i think we can use that mr for ideas we think we should add
16:51:08 <GeKo> and how to phrase things
16:51:28 <ggus> ok!
16:51:35 <GeKo> in particular examples of what to check for and what to do in case X would be good
16:52:05 <GeKo> dgoulet: i think if folks reach the step where they feel they need to contact us then we have not done the best job with the support article
16:52:24 <GeKo> at least that's the kind of polar star i have here for guidance :)
16:52:42 <GeKo> i mean if that happens from time to time, well, that's fine
16:52:45 <dgoulet> agree
16:52:59 <GeKo> but we should really try to give relay ops the means to solve the problems on their own
16:53:09 <hiro> yep I have found some ops publish some of their tunings around but we don't have many pointers for people about where to start looking if there is an issue
16:53:22 <GeKo> just so the potential exposure of that metrics port data is minimized
16:53:42 <GeKo> okay
16:53:55 <GeKo> it's good to see that we are all on the same page here, though
16:54:13 <GeKo> do we have anything else for today's meeting?
16:54:53 <GeKo> hearing nothing
16:55:03 <GeKo> thanks everyone and have a nice week o/
16:55:05 <GeKo> #endmeeting