17:59:46 <phw> #startmeeting anti-censorship team meeting
17:59:46 <MeetBot> Meeting started Thu Dec 12 17:59:46 2019 UTC.  The chair is phw. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:59:46 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:59:52 <cohosh> hello
17:59:53 <dcf1> aloha
18:00:06 <phw> here's our meeting pad: https://pad.riseup.net/p/tor-censorship-2019-keep
18:00:39 <phw> looks like a sparse agenda :)
18:01:02 <dcf1> agendum
18:01:40 <GeKo> hah
18:01:45 <phw> true
18:02:09 <phw> regarding #31422: i wanted to ask y'all if you can think of some bridgedb-internal metrics that are worth keeping track of
18:02:31 <phw> i wonder if i asked this before? it feels like i did
18:03:23 <dcf1> One thing I can think of as vaguely interesting: the number of users a single bridge has been given to over time.
18:03:39 <dcf1> i.e., how long to give to 10 users, how long to give to 100, how long to give to 1000, etc.
18:04:07 <cohosh> once we have the ability for bridgedb to test if bridges are down, it would be nice to know how reliable our bridges are (how much uptime they have or how many are currently working)
18:04:24 <phw> right, these are two great suggestions
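[note: a rough sketch of the metric dcf1 suggests, for illustration only. It assumes a time-sorted list of BridgeDB hand-out records (fingerprint, unix timestamp, hashed client id); the real BridgeDB log format may differ.]

    from collections import defaultdict

    THRESHOLDS = (10, 100, 1000)

    def time_to_n_users(assignments):
        """assignments: time-sorted iterable of (fingerprint, unix_time, client_id)."""
        seen = defaultdict(set)       # fingerprint -> distinct client ids seen so far
        first = {}                    # fingerprint -> time of first hand-out
        reached = defaultdict(dict)   # fingerprint -> {threshold: seconds to reach it}
        for fpr, ts, client in assignments:
            first.setdefault(fpr, ts)
            seen[fpr].add(client)
            for n in THRESHOLDS:
                if len(seen[fpr]) == n and n not in reached[fpr]:
                    reached[fpr][n] = ts - first[fpr]
        return reached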
18:04:46 <gaba> o/
18:05:03 <hiro> something that would be interesting is the number of connections over time...
18:05:19 <hiro> to see if bridges become unusable after a while
18:05:25 <hiro> and how quickly
18:05:44 <hiro> even if the bridge is running, maybe a censorship system knows about it and blocks it
18:06:17 <phw> hiro: do you mean connections from users to the bridge?
18:06:41 <hiro> uhm I think it might be trickier than just seeing if users are connecting to it
18:07:05 <hiro> I am thinking maybe people will try to use the bridge and then disconnect from it if they cannot use tor normally
18:07:08 <phw> (trying to understand what connections we're talking about)
18:07:49 <hiro> so we would need some estimation of how usable the bridge is
18:08:10 <hiro> I guess that would be successful tor connections...
18:08:15 <hiro> not just connections to the bridge
18:08:35 <hiro> but maybe I am not thinking about something
18:08:45 <phw> hiro: in #31874, we're writing a tool that can help bridgedb test if a bridge works; that is, if you can actually bootstrap a tor connection over it
18:09:28 <phw> the idea is that bridgedb shouldn't hand out bridges that don't work; not because of censorship but because they're broken
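[note: a minimal sketch of that kind of reachability check, assuming stem and obfs4proxy are installed; the bridge line, binary path, and data directory are placeholders, and this is not the actual #31874 tool.]

    import stem.process

    BRIDGE_LINE = 'obfs4 192.0.2.1:443 FINGERPRINT cert=... iat-mode=0'  # placeholder bridge line

    def bridge_bootstraps(bridge_line, timeout=90):
        """Return True if tor fully bootstraps over the given bridge line."""
        try:
            tor = stem.process.launch_tor_with_config(
                config={
                    'UseBridges': '1',
                    'Bridge': bridge_line,
                    'ClientTransportPlugin': 'obfs4 exec /usr/bin/obfs4proxy',  # placeholder path
                    'SocksPort': 'auto',
                    'DataDirectory': '/tmp/bridge-test',  # placeholder
                },
                completion_percent=100,   # wait for "Bootstrapped 100%"
                timeout=timeout,
                take_ownership=True,
            )
        except OSError:
            return False   # tor failed to launch or bootstrap timed out
        tor.kill()
        return True

    if __name__ == '__main__':
        print(bridge_bootstraps(BRIDGE_LINE))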
18:09:56 <hiro> uhm now that I think about it, if the bridge is blocked people will not be able to connect to it
18:11:00 <hiro> so what I am saying is a way to measure if the bridge is reachable from certain locations... I am not sure there is a way to measure that consistently
18:11:25 <phw> we also plan to work with ooni to build a feedback loop between bridgedb and ooni: bridgedb gives a bridge to ooni, ooni tests it and hands the results back to bridgedb, bridgedb then doesn't hand it out to people who wouldn't be able to connect to it
18:11:50 <phw> hiro: right, that's what ooni will help us do
18:11:51 <dcf1> hiro: you can partly recover that from bridge-extra-info descriptors, let me find a link real quick.
18:12:00 <hiro> yeah I guess that's how you could measure that...
18:12:38 <hiro> ^ w/ ooni that is
18:13:23 <dcf1> https://trac.torproject.org/projects/tor/ticket/30636#comment:36 is a recent example. See "dirreq-v3-ips ir=32,by=8,de=8,mx=8,ru=8,ua=8,us=8", which tells you what countries a bridge is receiving connections from.
18:13:51 <dcf1> if that's what you mean
18:15:06 <hiro> dcf1 yes that's about it
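[note: a quick sketch of reading those per-country counts out of a bridge extra-info descriptor line like the one dcf1 quotes; the sample line is the one from the ticket comment.]

    sample = 'dirreq-v3-ips ir=32,by=8,de=8,mx=8,ru=8,ua=8,us=8'

    def parse_dirreq_ips(line):
        """Map country code -> rounded count of unique requester IPs."""
        _, _, fields = line.partition(' ')
        return {cc: int(num) for cc, _, num in (f.partition('=') for f in fields.split(','))}

    print(parse_dirreq_ips(sample))
    # {'ir': 32, 'by': 8, 'de': 8, 'mx': 8, 'ru': 8, 'ua': 8, 'us': 8}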
18:15:45 <phw> ok, thanks friends, these are helpful suggestions. i'll add them to the ticket after our meeting
18:16:17 <phw> next are our monitoring needs (#32679)
18:16:25 <dcf1> hiro: but one thing you can't get afaik is a way to link those users to where they got the bridge from, i.e., a test of how well the BridgeDB->user->use pipeline works.
18:18:12 <hiro> thanks dcf1
18:18:39 <dcf1> What's gman999 currently monitoring? Default bridges and...?
18:18:56 <hiro> if bridgedb is up or not
18:19:01 <dcf1> I assume this is how phw is so on the ball contacting operators when a bridge goes down.
18:19:36 <phw> dcf1: default bridges, and various hosts (snowflake broker, snowflake www, gettor smtp, gettor www, bridgedb smtp, bridgedb www)
18:19:55 <phw> the host monitoring is basically broken though because sysmon doesn't actually understand http(s) and cannot follow redirects
18:20:23 <phw> it works like a charm for basic tcp tests, though, which is good enough for default bridges
18:20:26 <hiro> phw we can add all this to nagios
18:20:42 <phw> hiro: nagios doesn't allow me and cohosh to add/update tests, no?
18:20:51 <dcf1> re anarcat's suggestion about prometheus, the prometheus-node-exporter is running on the broker, but we never reached a conclusion in #29863 about whether to expose it.
18:21:04 <hiro> yes this is what I wanted to discuss
18:21:16 <phw> i'm also fine with prometheus but isn't the idea of this "blackbox exporter" to have it run on the monitoring targets?
18:21:36 <dcf1> I think prometheus also requires us to ping someone to add a new monitoring target? Because the central prometheus information gatherer needs to know which hosts to collect from.
18:21:39 <hiro> prometheus scrapes endpoints
18:21:46 <hiro> yes
18:21:54 <cohosh> right i think we were okay with exporting some stats
18:22:33 <dcf1> I set up prometheus-node-exporter on the new snowflake broker, with the same config as the old one, but didn't expose it through the firewall because the old one wasn't.
18:22:38 <cohosh> and we got approval for a small machine to do the scraping for basically tp.net domains
18:22:55 <phw> mh, i don't want to ask default bridge operators to set up yet another thing on their bridge
18:23:34 <dcf1> phw: I think that's an important consideration.
18:23:39 <cohosh> and the idea was that all "third-party" infrastructure would be scraped from that machine
18:23:43 <hiro> and to answer your question phw: I can add/update tests in nagios for tpo infra. in most cases nagios reads a status file, pings some host, or does any other operation that nagios can do quickly. if you have a script writing a status file or performing the check, I can add that to nagios very quickly
18:24:47 <phw> hiro: the two main problems i want fixed are: 1) the anti-censorship team should be able to add/update monitoring targets to minimise friction and 2) i don't want to run extra software on monitoring targets
18:25:07 <phw> these are my $0.02; what do you think, cohosh and dcf1?
18:25:31 <cohosh> those sound like good constraints to me
18:25:50 <dcf1> I'm basically in agreement, though personally I don't feel the need to add targets directly myself.
18:26:00 <phw> this is why i like the monit service, which i have been running on my laptop. i just add a quick rule that fetches a page over https and raises an alert if the page doesn't contain a string.
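[note: a sketch of the same kind of check phw describes in monit, written in Python for illustration; the URL and the expected string are placeholders.]

    import urllib.request

    def page_contains(url, needle, timeout=30):
        """Fetch a page over HTTPS (redirects are followed) and check for an expected string."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return needle in resp.read().decode('utf-8', errors='replace')
        except OSError:
            return False

    if not page_contains('https://bridges.torproject.org/', 'BridgeDB'):  # placeholders
        print('ALERT: page is down or missing expected content')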
18:26:15 <hiro> can you add your constraints in the ticket then?
18:26:49 <anarcat> keep in mind the blackbox exporter
18:26:56 <anarcat> it doesn't require anything on the monitored ends
18:26:57 <phw> they already are, but i suppose i can add two explicit bullet points if that helps
18:27:00 <anarcat> it's just like monit, more or less
18:27:03 <hiro> in the case of prometheus the blackbox exporter doesn't require targets to run anything
18:27:07 <dcf1> I think what hiro is suggesting is that there be a place where anti-censorship devs can write, and then have it be automatically picked up by nagios?
18:27:07 <anarcat> ha
18:27:12 <anarcat> hiro: you have this well in hand :)
18:27:15 <anarcat> sorry for barging in
18:27:19 <hiro> haha anarcat I was about to say the same
18:27:25 <hiro> we said it together :)
18:27:42 <hiro> dcf1 yes
18:28:01 <phw> hiro: alrighty, that's good news.
18:28:06 <hiro> like the bridgedb check... I wrote the script that creates a status file
18:28:15 <dcf1> Okay "blackbox exporter" is something I didn't know.
18:28:16 <hiro> but on nagios it's 3 quick lines to add
18:28:27 <hiro> this is for TPO infra
18:28:36 <hiro> like bridgedb, gettor, and some other stuff that we run
18:28:51 <hiro> for bridges and other external things we can use the blackbox exporter
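[note: a tiny sketch of the status-file pattern hiro describes: a check script writes a status line that nagios then reads. The path and the check itself are placeholders, not the real bridgedb check.]

    import socket
    import time

    STATUS_FILE = '/srv/monitoring/bridgedb.status'   # placeholder path

    def tcp_reachable(host, port, timeout=10):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    status = 'OK' if tcp_reachable('bridges.torproject.org', 443) else 'CRITICAL'
    with open(STATUS_FILE, 'w') as f:
        f.write('%s %d\n' % (status, int(time.time())))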
18:29:20 <hiro> #link https://trac.torproject.org/projects/tor/ticket/32679
18:29:23 <hiro> sorry
18:29:24 <hiro> not this
18:29:33 <hiro> qubeslife^TM
18:29:48 <hiro> #link https://github.com/prometheus/blackbox_exporter#prometheus-configuration
18:29:59 <hiro> (copying between different VMs :) )
18:30:01 <phw> hiro: ok, i understand. and if i remember correctly, prometheus has some email alert system that we could use?
18:30:25 <anarcat> it does not
18:30:32 <anarcat> at least not yet
18:30:37 <hiro> yes we aren't using that yet
18:30:42 <anarcat> there's a thing called alertmanager that we could setup, but we haven't deployed anything there yet
18:30:47 <anarcat> but we could!
18:30:47 <phw> i'd like us to get a notification once any monitoring target has issues
18:30:58 <hiro> the thing is anarcat this is a different prometheus instance
18:31:07 <hiro> and I could set it up so we can test it
18:31:13 <anarcat> yeah that would be great, actually
18:32:29 <phw> ok, thanks hiro
18:32:44 <cohosh> hiro: thanks!
18:33:26 <hiro> ok so.. one thing I would need is a list of targets we should have on prometheus to start testing
18:33:29 <phw> so, to sum up: we should use prometheus; we can use "exporters" on internal infrastructure and the "blackbox exporter" on external infrastructure, which we don't control; we will give the alertmanager a shot to receive notifications
18:33:37 <phw> did i sum up correctly?
18:33:40 <hiro> uhm
18:33:49 <hiro> I was suggesting to use nagios for internal infra
18:34:05 <hiro> since we also have the alerts set up there
18:34:17 <hiro> and the blackbox exporter for external things
18:34:40 <phw> hmm, ok. can our team also get alerts if any internal services go offline?
18:34:47 <hiro> yes
18:35:12 <anarcat> phw: that sounds great :)
18:35:21 <anarcat> and yes, whatever hiro says :)
18:35:53 <anarcat> we're not ready yet to ditch nagios internally, but it seems like a good idea to experiment with prometheus as a replacement externally, because we're hesitant about monitoring external stuff with the internal nagios server
18:35:56 <phw> so for nagios, we would ask TPA to handle monitoring targets. i'm fine with that because these don't change a lot
18:36:07 <hiro> yes
18:36:16 <anarcat> if we use the blackbox exporter, we (tpa) could delegate its configuration to other people (say the anticensorship team)
18:36:21 <anarcat> right
18:36:36 <phw> for external targets, i'd like to be able to do this myself. is this an option in prometheus?
18:36:42 <phw> oh, gotcha
18:38:25 <phw> this sounds good. i'll update the ticket with a summary of our discussion. thanks hiro and anarcat
18:38:32 <hiro> thanks phw
18:38:41 <anarcat> awesome
18:38:54 <anarcat> phw: you'd follow https://github.com/prometheus/blackbox_exporter#prometheus-configuration :)
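[note: a hedged sketch of what the setup anarcat links boils down to: Prometheus periodically asks the blackbox exporter's /probe endpoint to test an external target and scrapes the resulting metrics. The exporter address, module name, and target below are placeholders; the module would have to be defined in the exporter's blackbox.yml.]

    import urllib.parse
    import urllib.request

    EXPORTER = 'http://localhost:9115'   # blackbox_exporter's default port
    params = urllib.parse.urlencode({
        'module': 'http_2xx',                             # placeholder module name
        'target': 'https://bridges.torproject.org/',      # placeholder external target
    })
    with urllib.request.urlopen('%s/probe?%s' % (EXPORTER, params), timeout=30) as resp:
        metrics = resp.read().decode()

    # "probe_success 1" in the output means the target passed the check.
    print('probe_success 1' in metrics)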
18:39:12 <phw> oh, one last question: is there some kind of "dashboard" in prometheus and nagios that gives us an overview of what monitoring targets are currently offline, and maybe even a history of outages?
18:39:34 <hiro> in prometheus there is grafana
18:39:44 * phw likes diagrams
18:39:49 <anarcat> yeah, we'd build a dashboard
18:39:58 <anarcat> in fact i would start with a dashboard, and then work on the alert manager stuff
18:40:29 <hiro> #link https://grafana.com/
18:40:34 <phw> ok, perfect. let's move to prometheus/nagios then, and make our sysmon instance monitor prometheus/nagios ;)
18:41:37 <phw> shall we move on to our next discussion item? it's about log preservation from our old snowflake broker
18:42:40 <phw> cohosh: ^
18:42:45 <dcf1> cohosh: yeah I just want to check if there's anything special you need from the old broker before I shut it down.
18:43:02 <cohosh> dcf1: do we have copies of the old broker logs on the new machine?
18:43:03 <dcf1> If there's anything important I can do it today before you go on vacation.
18:43:13 <cohosh> metrics should be taken care of
18:43:21 <dcf1> I was planning to copy /var/log/snowflake-broker and /home/snowflake-broker.
18:43:32 <dcf1> Maybe /home/snowflake-broker is unnecessary, that's where metrics.log lives.
18:43:33 <cohosh> cool that sounds good to me :)
18:43:39 <dcf1> okay thanks.
18:43:58 <cohosh> thanks for taking care of that
18:44:49 <phw> ok, let's do reviews!
18:45:04 <phw> looks like we only have #32712, for cohosh?
18:45:19 <phw> oh, i had a review, but i already assigned cohosh :>
18:45:26 <phw> #30716, fwiw
18:45:52 <cohosh> phw: i just looked at that this morning, lmk if you want something else from me
18:46:07 <cohosh> #32712 could be for phw or hiro, it's about gettor
18:46:09 <phw> cohosh: will do, thanks!
18:46:15 <phw> i can do #32712
18:46:28 <cohosh> thanks
18:46:47 <cohosh> dcf1: i'll be away for a few weeks, but let me know if there's some way i can be helpful for snowflake + TurboTunnel
18:47:15 <cohosh> i am very excited about seeing this happen
18:47:24 * phw too
18:47:33 <hiro> phw I think we need more storage on gettor
18:48:12 <hiro> ?
18:49:17 <phw> hiro: cohosh is in a better position to answer this, i think
18:50:04 <cohosh> more storage would help, but this script should make it so we're not downloading all binaries at once before uploading them
18:50:22 <cohosh> hiro and i talked about this a bit in #tor-project the other day
18:51:11 <cohosh> it's probably a good idea to make the storage at least 15-20GB if that's not difficult, just because the gitlab repo is getting to be about that big
18:51:27 <cohosh> (although we'll need a solution for that as well)
18:54:09 <phw> anything else on anyone's mind?
18:54:14 <cohosh> nope
18:54:24 <phw> #endmeeting