17:59:46 <phw> #startmeeting anti-censorship team meeting
17:59:46 <MeetBot> Meeting started Thu Dec 12 17:59:46 2019 UTC. The chair is phw. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:59:46 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:59:52 <cohosh> hello
17:59:53 <dcf1> aloha
18:00:06 <phw> here's our meeting pad: https://pad.riseup.net/p/tor-censorship-2019-keep
18:00:39 <phw> looks like a sparse agenda :)
18:01:02 <dcf1> agendum
18:01:40 <GeKo> hah
18:01:45 <phw> true
18:02:09 <phw> regarding #31422: i wanted to ask y'all if you can think of some bridgedb-internal metrics that are worth keeping track of
18:02:31 <phw> i wonder if i asked this before? it feels like i did
18:03:23 <dcf1> One thing I can think of as vaguely interesting: the number of users a single bridge has been given to over time.
18:03:39 <dcf1> i.e., how long to give to 10 users, how long to give to 100, how long to give to 1000, etc.
18:04:07 <cohosh> once we have the ability for bridgedb to test if bridges are down, it would be nice to know how reliable our bridges are (how much uptime they have, or how many are currently working)
18:04:24 <phw> right, these are two great suggestions
18:04:46 <gaba> o/
18:05:03 <hiro> something that would be interesting is the number of connections over time...
18:05:19 <hiro> to see if bridges become unusable after a while
18:05:25 <hiro> and how quickly
18:05:44 <hiro> even if the bridge is running, maybe censorship systems know about it and block it
18:06:17 <phw> hiro: do you mean connections from users to the bridge?
18:06:41 <hiro> uhm I think it might be trickier than just seeing if users are connecting to it
18:07:05 <hiro> I am thinking maybe people will try to use the bridge and then disconnect from it if they cannot use tor normally
18:07:08 <phw> (trying to understand what connections we're talking about)
18:07:49 <hiro> so we would need some estimation of how usable the bridge is
18:08:10 <hiro> I guess that would be successful tor connections...
18:08:15 <hiro> not just connections to the bridge
18:08:35 <hiro> but maybe I am missing something
18:08:45 <phw> hiro: in #31874, we're writing a tool that can help bridgedb test if a bridge works; that is, if you can actually bootstrap a tor connection over it
18:09:28 <phw> the idea is that bridgedb shouldn't hand out bridges that don't work; not because of censorship but because they're broken
18:09:56 <hiro> uhm now that I think about it, if the bridge is blocked people will not be able to connect to it
18:11:00 <hiro> so what I am suggesting is a way to measure if the bridge is reachable from certain locations... I am not sure there is a way to measure that consistently
18:11:25 <phw> we also plan to work with ooni to build a feedback loop between bridgedb and ooni: bridgedb gives a bridge to ooni, ooni tests it and hands the results back to bridgedb, and bridgedb then doesn't hand it out to people who wouldn't be able to connect to it
18:11:50 <phw> hiro: right, that's what ooni will help us do
18:11:51 <dcf1> hiro: you can partly recover that from bridge-extra-info descriptors, let me find a link real quick.
18:12:00 <hiro> yeah I guess that's how you could measure that...
18:12:38 <hiro> ^ w/ ooni that is
18:13:23 <dcf1> https://trac.torproject.org/projects/tor/ticket/30636#comment:36 is a recent example. See "dirreq-v3-ips ir=32,by=8,de=8,mx=8,ru=8,ua=8,us=8", which tells you what countries a bridge is receiving connections from.
18:13:51 <dcf1> if that's what you mean
18:15:06 <hiro> dcf1: yes that's about it
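For context on the line dcf1 quotes: a bridge's extra-info descriptor reports per-country unique-address counts, which tor bins to multiples of 8 for privacy. Below is a minimal Python sketch of pulling those counts out of descriptor text; the function name and file handling are illustrative, not part of any existing tool.

    # Sketch: extract per-country counts from a bridge extra-info descriptor,
    # e.g. the "dirreq-v3-ips" line quoted above. Counts are rounded to
    # multiples of 8 by tor itself before publication.

    def parse_dirreq_v3_ips(descriptor_text):
        """Return a dict like {"ir": 32, "by": 8, ...}, or {} if the line is absent."""
        for line in descriptor_text.splitlines():
            if line.startswith("dirreq-v3-ips "):
                pairs = line.split(" ", 1)[1].strip()
                if not pairs:
                    return {}
                return {
                    cc: int(count)
                    for cc, count in (item.split("=") for item in pairs.split(","))
                }
        return {}

    # Example using the line from the meeting:
    example = "dirreq-v3-ips ir=32,by=8,de=8,mx=8,ru=8,ua=8,us=8"
    print(parse_dirreq_v3_ips(example))
    # {'ir': 32, 'by': 8, 'de': 8, 'mx': 8, 'ru': 8, 'ua': 8, 'us': 8}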
18:15:45 <phw> ok, thanks friends, these are helpful suggestions. i'll add them to the ticket after our meeting
18:16:17 <phw> next are our monitoring needs (#32679)
18:16:25 <dcf1> hiro: but one thing you can't get, afaik, is a linking of where those users got the bridge from, i.e., a test of how well the BridgeDB->user->use pipeline works.
18:18:12 <hiro> thanks dcf1
18:18:39 <dcf1> What's gman999 currently monitoring? Default bridges and...?
18:18:56 <hiro> if bridgedb is up or not
18:19:01 <dcf1> I assume this is how phw is so on the ball contacting operators when a bridge goes down.
18:19:36 <phw> dcf1: default bridges, and various hosts (snowflake broker, snowflake www, gettor smtp, gettor www, bridgedb smtp, bridgedb www)
18:19:55 <phw> the host monitoring is basically broken, though, because sysmon doesn't actually understand http(s) and cannot follow redirects
18:20:23 <phw> it works like a charm for basic tcp tests, though, which is good enough for default bridges
18:20:26 <hiro> phw: we can add all this to nagios
18:20:42 <phw> hiro: nagios doesn't allow me and cohosh to add/update tests, no?
18:20:51 <dcf1> re anarcat's suggestion about prometheus: the prometheus-node-exporter is running on the broker, but we never reached a conclusion in #29863 about whether to expose it.
18:21:04 <hiro> yes, this is what I wanted to discuss
18:21:16 <phw> i'm also fine with prometheus, but isn't the idea of this "blackbox exporter" to have it run on the monitoring targets?
18:21:36 <dcf1> I think prometheus also requires us to ping someone to add a new monitoring target? Because the central prometheus information gatherer needs to know which hosts to collect from.
18:21:39 <hiro> prometheus scrapes endpoints
18:21:46 <hiro> yes
18:21:54 <cohosh> right, i think we were okay with exporting some stats
18:22:33 <dcf1> I set up prometheus-node-exporter on the new snowflake broker, with the same config as the old one, but didn't expose it through the firewall because the old one wasn't.
18:22:38 <cohosh> and we got approval for a small machine to do the scraping for basically tp.net domains
18:22:55 <phw> mh, i don't want to ask default bridge operators to set up yet another thing on their bridge
18:23:34 <dcf1> phw: I think that's an important consideration.
18:23:39 <cohosh> and the idea was that all "third-party" infrastructure would be scraped from that machine
18:23:43 <hiro> and to answer your question phw: I can add/update the tests in nagios for tpo infra. in most cases nagios reads a status file, pings some host, or does any other operation that nagios can do quickly. if you have a script writing a status file or performing the check, I can add that to nagios very quickly
18:24:47 <phw> hiro: the two main problems i want fixed are: 1) the anti-censorship team should be able to add/update monitoring targets to minimise friction and 2) i don't want to run extra software on monitoring targets
18:25:07 <phw> these are my $0.02; what do you think, cohosh and dcf1?
18:25:31 <cohosh> those sound like good constraints to me
18:25:50 <dcf1> I'm basically in agreement, though personally I don't feel the need to add targets directly myself.
18:26:00 <phw> this is why i like the monit service, which i have been running on my laptop. i just add a quick rule that fetches a page over https and raises an alert if the page doesn't contain a string.
18:26:15 <hiro> can you add your constraints in the ticket then?
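The kind of rule phw describes for monit is simple to express in code. Here is a minimal sketch using only the Python standard library; the target URL, expected string, and timeout are illustrative placeholders, not actual configuration from the meeting.

    # Sketch of a monit-style check: fetch a page over https and alert
    # if it doesn't contain an expected string. Unlike sysmon, this
    # follows redirects (urllib does so by default).
    import sys
    import urllib.request

    TARGET_URL = "https://bridges.torproject.org/"  # placeholder target
    EXPECTED = "BridgeDB"                           # placeholder string the page should contain

    def check(url, expected, timeout=15):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except Exception as exc:
            return False, "fetch failed: {}".format(exc)
        if expected not in body:
            return False, "expected string {!r} not found in page".format(expected)
        return True, "ok"

    if __name__ == "__main__":
        ok, message = check(TARGET_URL, EXPECTED)
        print("{}: {}".format("OK" if ok else "ALERT", message))
        sys.exit(0 if ok else 1)  # non-zero exit lets a scheduler or wrapper raise the alert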
18:26:49 <anarcat> keep in mind the blackbox exporter
18:26:56 <anarcat> it doesn't require anything on the monitored ends
18:26:57 <phw> they already are, but i suppose i can add two explicit bullet points if that helps
18:27:00 <anarcat> it's just like monit, more or less
18:27:03 <hiro> in the case of prometheus, the blackbox exporter doesn't require targets to run anything
18:27:07 <dcf1> I think what hiro is suggesting is that there be a place where anti-censorship devs can write, and then have it be automatically picked up by nagios?
18:27:07 <anarcat> ha
18:27:12 <anarcat> hiro: you have this well in hand :)
18:27:15 <anarcat> sorry for barging in
18:27:19 <hiro> haha anarcat I was about to say the same
18:27:25 <hiro> we said it together :)
18:27:42 <hiro> dcf1: yes
18:28:01 <phw> hiro: alrighty, that's good news.
18:28:06 <hiro> like the bridgedb check... I wrote the script that creates a status file
18:28:15 <dcf1> Okay, "blackbox exporter" is something I didn't know about.
18:28:16 <hiro> but on nagios it's 3 quick lines to add
18:28:27 <hiro> this is for TPO infra
18:28:36 <hiro> like bridgedb, gettor, and some other stuff that we run
18:28:51 <hiro> for bridges and other external things we can use the blackbox exporter
18:29:20 <hiro> #link https://trac.torproject.org/projects/tor/ticket/32679
18:29:23 <hiro> sorry
18:29:24 <hiro> not this
18:29:33 <hiro> qubeslife^TM
18:29:48 <hiro> #link https://github.com/prometheus/blackbox_exporter#prometheus-configuration
18:29:59 <hiro> (copying between different VMs :) )
18:30:01 <phw> hiro: ok, i understand. and if i remember correctly, prometheus has some email alert system that we could use?
18:30:25 <anarcat> it does not
18:30:32 <anarcat> at least not yet
18:30:37 <hiro> yes, we aren't using that yet
18:30:42 <anarcat> there's a thing called alertmanager that we could set up, but we haven't deployed anything there yet
18:30:47 <anarcat> but we could!
18:30:47 <phw> i'd like us to get a notification once any monitoring target has issues
18:30:58 <hiro> the thing is, anarcat, this is a different prometheus instance
18:31:07 <hiro> and I could set it up so we can test it
18:31:13 <anarcat> yeah, that would be great actually
18:32:29 <phw> ok, thanks hiro
18:32:44 <cohosh> hiro: thanks!
18:33:26 <hiro> ok so.. one thing I would need is a list of targets we should have on prometheus to start testing
18:33:29 <phw> so, to sum up: we should use prometheus; we can use "exporters" on internal infrastructure and the "blackbox exporter" on external infrastructure, which we don't control; we will give the alertmanager a shot to receive notifications
18:33:37 <phw> did i sum up correctly?
18:33:40 <hiro> uhm
18:33:49 <hiro> I was suggesting to use nagios for internal infra
18:34:05 <hiro> since we also have the alerts set up there
18:34:17 <hiro> and the blackbox exporter for external things
18:34:40 <phw> hmm, ok. can our team also get alerts if any internal services go offline?
18:34:47 <hiro> yes
18:35:12 <anarcat> phw: that sounds great :)
18:35:21 <anarcat> and yes, whatever hiro says :)
18:35:53 <anarcat> we're not ready yet to ditch nagios internally, but it seems like a good idea to experiment with prometheus as a replacement externally, because we're hesitant to monitor external stuff with the internal nagios server
18:35:56 <phw> so for nagios, we would ask TPA to handle monitoring targets. i'm fine with that because these don't change a lot
18:36:07 <hiro> yes
18:36:16 <anarcat> if we use the blackbox exporter, we (tpa) could delegate its configuration to other people (say the anti-censorship team)
18:36:21 <anarcat> right
18:36:36 <phw> for external targets, i'd like to be able to do this myself. is this an option in prometheus?
18:36:42 <phw> oh, gotcha
18:38:25 <phw> this sounds good. i'll update the ticket with a summary of our discussion. thanks hiro and anarcat
18:38:32 <hiro> thanks phw
18:38:41 <anarcat> awesome
18:38:54 <anarcat> phw: you'd follow https://github.com/prometheus/blackbox_exporter#prometheus-configuration :)
18:39:12 <phw> oh, one last question: is there some kind of "dashboard" in prometheus and nagios that gives us an overview of what monitoring targets are currently offline, and maybe even a history of outages?
18:39:34 <hiro> in prometheus there is grafana
18:39:44 * phw likes diagrams
18:39:49 <anarcat> yeah, we'd build a dashboard
18:39:58 <anarcat> in fact i would start with a dashboard, and then work on the alertmanager stuff
18:40:29 <hiro> #link https://grafana.com/
18:40:34 <phw> ok, perfect. let's move to prometheus/nagios then, and make our sysmon instance monitor prometheus/nagios ;)
18:41:37 <phw> shall we move on to our next discussion item? it's about log preservation from our old snowflake broker
18:42:40 <phw> cohosh: ^
18:42:45 <dcf1> cohosh: yeah, I just want to check if there's anything special you need from the old broker before I shut it down.
18:43:02 <cohosh> dcf1: do we have copies of the old broker logs on the new machine?
18:43:03 <dcf1> If there's anything important I can do it today before you go on vacation.
18:43:13 <cohosh> metrics should be taken care of
18:43:21 <dcf1> I was planning to copy /var/log/snowflake-broker and /home/snowflake-broker.
18:43:32 <dcf1> Maybe /home/snowflake-broker is unnecessary, that's where metrics.log lives.
18:43:33 <cohosh> cool, that sounds good to me :)
18:43:39 <dcf1> okay thanks.
18:43:58 <cohosh> thanks for taking care of that
18:44:49 <phw> ok, let's do reviews!
18:45:04 <phw> looks like we only have #32712, for cohosh?
18:45:19 <phw> oh, i had a review, but i already assigned cohosh :>
18:45:26 <phw> #30716, fwiw
18:45:52 <cohosh> phw: i just looked at that this morning, lmk if you want something else from me
18:46:07 <cohosh> #32712 could be for phw or hiro, it's about gettor
18:46:09 <phw> cohosh: will do, thanks!
18:46:15 <phw> i can do #32712
18:46:28 <cohosh> thanks
18:46:47 <cohosh> dcf1: i'll be away for a few weeks, but let me know if there's some way i can be helpful for snowflake + TurboTunnel
18:47:15 <cohosh> i am very excited about seeing this happen
18:47:24 * phw too
18:47:33 <hiro> phw: I think we need more storage on gettor
18:48:12 <hiro> ?
18:49:17 <phw> hiro: cohosh is in a better position to answer this, i think
18:50:04 <cohosh> more storage would help, but this script should make it so we're not downloading all binaries at once before uploading them
18:50:22 <cohosh> hiro and i talked about this a bit in #tor-project the other day
18:51:11 <cohosh> it's probably a good idea to make the storage at least 15-20GB if that's not difficult, just because the gitlab repo is getting to be about that big
18:51:27 <cohosh> (although we'll need a solution for that as well)
18:54:09 <phw> anything else on anyone's mind?
18:54:14 <cohosh> nope
18:54:24 <phw> #endmeeting