17:01:03 #startmeeting OONI dev gathering 2016-06-06
17:01:03 Meeting started Mon Jun 6 17:01:03 2016 UTC. The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:01:03 Useful Commands: #action #agreed #help #info #idea #link #topic.
17:01:07 here we go
17:01:14 * landers yo
17:01:36 hello everyone
17:02:37 sbs told me that he couldn't make it to this meeting, but would like to report back that he plans on releasing a beta for measurement-kit in the next days
17:02:42 heya
17:03:10 I would say we can start by going through the items we have in the agenda
17:03:36 (https://pad.riseup.net/p/ooni-irc-pad)
17:03:37 #topic Possible ways of implementing graceful restart of the ooni-backend
17:03:56 this is related to the ticket in here: https://github.com/TheTorProject/ooni-backend/issues/86
17:04:59 the ticket seems to offer a path forward :)
17:05:07 basically the issue is that the current situation is that if you kill a backend and restart it any report that is currently pending will be dropped since it uses an in memory data structure to keep track of the open tickets
17:05:13 err open reports
17:05:37 what's a typical time duration from a client opening a report to closing a report?
17:05:38 how long are reports 'pending'?
17:05:43 yeah there are actually multiple ways forward
17:06:11 that's a fine question. They can go from the order of seconds to an order of multiple hours
17:06:17 it depends on how big the input list is
17:06:37 apache does this nicely by asking the old process to close when done, and closing all the listening sockets, then starts the new process
17:06:53 old clients are served by the old process, new ones by the new process
17:06:58 but nothing just gets killed
17:07:03 hellais: is it because the upload goes in tandem with the probing? multiple hours is a long time :o
17:07:27 willscott: yes that is the reason.
17:07:46 we could certainly aim for graceful take-overs, i'm tempted to say that if the backend crashes the client needs to restart its upload
17:09:30 irl: I think this sort of logic may be more complicated to implement in our case, because it's not just a matter of looking at the open sockets, but it also requires application specific logic.
17:09:52 I guess one problem is that we are using a stateless protocol in a stateful manner
17:11:08 so state needs to be preserved across reloads, but is currently only in memory?
17:11:11 that is it's not sufficient to say don't accept any more incoming connections, but it's more of saying ONLY accept incoming connections that are updates for reports that have already been created and redirect all other report opening operations to another process.
17:11:24 irl: yes
17:11:55 ah ok, is there any redirect mechanism currently supported?
17:13:23 irl: yeah in a way the bouncer is this redirect mechanism. That is the address that the bouncer points to can be adjusted at runtime to be something different, that basically leads to new report creations being done on another collector
17:14:13 so it's like a load balancer pool, we can drop one of the oonibs out of the pool, wait x time and reload it, add it back to the pool
17:14:15 I discuss how this can be used to support graceful reload in here: https://github.com/TheTorProject/ooni-backend/issues/87 (red-black deployment strategy), but it's a bit convoluted
17:15:11 it's a many step process but i think it's automatable enough
17:15:20 so the graceful reload stuff would still not keep us from losing things on a crash, right?
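
The pending-report loss described at 17:05:07 comes from the collector tracking open reports only in memory. Below is a minimal sketch of one possible mitigation, persisting open-report metadata to disk so a restarted collector can still recognise pending reports; the paths and field names are hypothetical, not the actual ooni-backend layout.

    # Sketch only: persist open-report state to disk instead of keeping it
    # solely in memory, so it survives a collector restart. The directory
    # and metadata format are illustrative assumptions.
    import json
    import os
    import time

    PENDING_DIR = "/var/lib/oonib/pending_reports"  # hypothetical location

    def mark_report_open(report_id, metadata):
        # Record the open report on disk at creation time.
        if not os.path.isdir(PENDING_DIR):
            os.makedirs(PENDING_DIR)
        entry = dict(metadata, created=time.time())
        with open(os.path.join(PENDING_DIR, report_id + ".json"), "w") as f:
            json.dump(entry, f)

    def load_pending_reports():
        # On startup, rebuild the in-memory table from whatever survived.
        pending = {}
        if not os.path.isdir(PENDING_DIR):
            return pending
        for name in os.listdir(PENDING_DIR):
            if name.endswith(".json"):
                with open(os.path.join(PENDING_DIR, name)) as f:
                    pending[name[:-len(".json")]] = json.load(f)
        return pending
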
17:15:21 irl: yeah, that is one way to go about it.
17:15:22 and it's pretty common for things like web app deployments
17:16:01 landers: no, but we'll just write code that doesn't crash (:
17:16:27 I think adding real load balancing in ooni-backend will help alleviate these issues
17:16:36 yeah I think this sort of thing should be something that is quite standard
17:16:53 onionbalance is reaching maturity
17:17:05 and standard load balancing has been mature a while
17:17:12 there may be existing solutions we can drop in
17:17:29 this way we can have a fleet of collectors and test helpers
17:17:45 we can also monitor them and automatically drop out any that are unresponsive
17:18:01 anadahz: I think that our load-balancing needs are quite particular though since we need to take into account state. That is report upload operations are transactions and I don't think load balancers for HTTP usually take into account this use-case
17:18:02 to prevent new reports going to backends that are down
17:18:40 hellais: we can do some form of deterministic routing to the backend
17:18:49 Twisted supports HAProxy, onionbalance is good but not when you also have non-HS test helpers i.e. cloudfronted HTTPS test helpers
17:18:49 based on the 5-tuple or some http header
17:19:15 hellais: load balancers will have requests from the same client go to the same backend
17:19:18 based on cookies or the like
17:19:37 willscott: not in the case of Tor though
17:20:01 the actual deployment will need more exploration of the tools out there, but i think the strategy in #87 gets a +1 from me
17:20:04 basing it on HTTP headers and other things would require changing the client protocol that would lead to losing reports from legacy clients.
17:20:21 eventually we should maybe consider having the upload be a single request?
17:20:47 willscott: yes I think that is probably the right way to go about it in the future.
17:21:26 since already in MK the report upload is done at the end of a test run and we plan to do something similar in ooniprobe as well as part of the reducing fingerprintability effort
17:22:16 great. added a note to that effect in the ticket
17:22:21 hellais: in any case we'll need to implement load balancing in ooni-backend we can't just have a number of probes submitting to just one server!
17:23:25 anadahz: I think load-balancing is the sort of thing that you implement once you start noticing that you can't process the load anymore and to my understanding we are not currently over capacity.
17:23:44 also in order to have load-balancing we would have to have 2 more things:
17:23:53 1) Keep track of the state of reports inside of something like a database
17:24:33 2) Instead of writing the temporary reports to the filesystem it's probably ideal to push them to some message queue that gets consumed by the pipeline
17:24:45 those seem much longer term
17:24:51 exactly
17:24:54 well we are experiencing issues when we upgrade the backend and we also need to implement a way to reload TLS certificates in the backend. Both of these wouldn't be a problem if we had load balancing
17:25:21 that is why I think we should come up with something that is the most effective short-term fix with the least amount of architectural changes required
17:25:46 and what do you think that is?
17:25:57 anadahz: yes, but load-balancing as I explained above and in the tickets creates another set of issues that require 1. and 2.
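
A minimal sketch of the deterministic-routing idea raised at 17:18:40, here keyed on the report ID rather than the 5-tuple or an HTTP header; the collector addresses and the choice of routing key are made-up assumptions for illustration.

    # Sketch only: route every request for a given report to the same
    # collector by hashing a stable key. Addresses are hypothetical.
    import hashlib

    COLLECTORS = [
        "httpo://collector-a.example.onion",
        "httpo://collector-b.example.onion",
    ]

    def collector_for(report_id):
        digest = hashlib.sha256(report_id.encode("utf-8")).hexdigest()
        return COLLECTORS[int(digest, 16) % len(COLLECTORS)]
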
17:27:01 hellais: but perhaps it's easier to implement these 2 rather than implement a way to auto-update the backend and reread the renewed TLS certificates
17:27:32 and in the future implement these 2 things
17:27:39 willscott: in my opinion the quickest way of doing this would be checking the existence of a report based on filenames on disk (if you see in the raw_reports directory some file named like the report the person wants to update you consider it to exist and if you don't have it in memory you set the stale time to the maximum)
17:28:08 and then have a deployment strategy like the one outlined in the "red-black deployment strategy" in #87
17:29:58 where we have two separate processes for the bouncer and the collector. When we need to update the collector we spin up a new collector pointing to the same shared directory. Once it comes up we make it point to the new one and we monitor that no reports are in progress. When all reports are finished we can kill the old collector and consider roll-over complete.
17:30:26 i guess there's a longer conversation to be had about whether we think that will be more work or getting clients to batch to a single report and losing old clients
17:30:56 As an extra step if the bouncer also needs to be updated another step needs to happen after the roll-over of the collector that is similar, where you take up the new bouncer and once it comes up you remap the ports of the old one to the new one
17:31:05 how often are we restarting the bouncer?
17:31:13 (or losing reports)
17:31:27 that seems like the question that tells us how high we should prioritize this work
17:31:47 we have done it once a month in the past
17:32:16 but when we will switch to having HTTPS collectors we will have to do it again and possibly every 60 days to renew the SSL certificate with Let's Encrypt
17:32:33 that can happen at the same time as the once/month restart
17:32:58 i'd personally be in favor of having clients make a single http upload
17:33:07 how quickly do ooniprobes upgrade (if a new version did single-req upload)?
17:33:13 anadahz: those two things are actually not very simple to implement and it's something I would like to do after having spent a fair amount of time thinking about it, because many things can go wrong.
17:33:28 willscott: every time that we update ooni-backend
17:33:47 anadahz: which sounds like ~once/month, yeah?
17:33:49 willscott: and drop support for legacy clients?
17:34:12 or let's say "not care about losing reports from older clients"
17:34:13 hellais: or continue to lose reports from them once/month?
17:34:14 just maybe lose half-completed reports from legacy clients
17:34:21 willscott: sometimes we have more than 1 release per month
17:34:47 do we have a sense of the distribution of versions / how many clients stay up to date with current releases?
17:35:03 especially as we switch to web-connectivity and stuff, we're already going to be less excited about the old reports anyway
17:35:47 willscott: well the raspberry pi image automatically updates ooniprobe every 7 days. debian packages usually come out within 1 week of when we release on pypi
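
A minimal sketch of the filename-based existence check described at 17:27:39: if a file named after the report already sits in the raw_reports directory, the report is considered to exist even when it is missing from memory, and it gets the maximum stale time. The directory path and stale-time value are assumptions.

    # Sketch only: fall back to the filesystem when the in-memory table has
    # no entry for a report update. Constants are illustrative assumptions.
    import os

    RAW_REPORTS_DIR = "/data/raw_reports"  # hypothetical path
    MAX_STALE_TIME = 8 * 3600              # hypothetical ceiling, in seconds

    def report_exists(report_id, in_memory_reports):
        return report_id in in_memory_reports or \
            os.path.exists(os.path.join(RAW_REPORTS_DIR, report_id))

    def stale_time_for(report_id, in_memory_reports):
        if report_id in in_memory_reports:
            return in_memory_reports[report_id]["stale_time"]
        if os.path.exists(os.path.join(RAW_REPORTS_DIR, report_id)):
            # Lost the in-memory entry (e.g. across a restart): be generous.
            return MAX_STALE_TIME
        return None  # unknown report, reject the update
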
17:35:54 I can look at the stats for the last update
17:35:57 unless we have a really pressing need to support old versions, it seems like a lot of work we're proposing for unclear gain
17:36:09 willscott: ubuntu has rather old binaries across all the distributions, but these will be picking up
17:36:24 we're working towards backports for debian stable and these are pending
17:36:49 within ~1 week of new releases i am hoping to have debian/ubuntu up to date also in the stable backports
17:36:58 awesome!
17:37:21 i think we also are in the lucky position of having mostly technical probe operators who will be able to keep the clients updated if they care
17:38:22 hellais is going to look at stats. should we move on to the next agenda topic?
17:38:39 i'm not really OK with having downtime introduced to the backend
17:39:16 anadahz: explain? what would cause us to 'have downtime introduced'
17:39:23 and we should also consider that in the future ooni will not be the only probe submitting reports
17:40:14 do we expect other probes that can't submit reports in a single http request?
17:40:16 willscott: on every backend upgrade
17:40:53 mm. we have that already though. it seems like something to mitigate, but moving to a stateless backend rather than transactional would make things way easier
17:41:37 +1 next topic
17:42:22 yeah I agree I think the way to go is make a change to the submission protocol, for sure it's going to be easier to implement and more future proof
17:42:24 apart from this we have to kill ooni-backend on every TLS renewal
17:42:52 since there is no graceful restart option supported atm
17:45:06 graceful restart is easier to implement *after* submission is stateless
17:45:09 these are the stats of version number of ooniprobe per vantage point: https://paste.debian.net/719029/
17:45:15 (in the month of June)
17:45:33 probably there will be other issues that we haven't really thought about, like reports resuming,
17:46:10 bandwidth considerations when the probes resend reports after backend termination
17:46:18 yeah I agree if the submission is stateless then graceful shutdown is super easy, it's just a matter of not accepting any more connections and waiting for the pending ones to conclude
17:46:49 hellais: that seems to suggest that all the probes are within 1 version of current, which means that within a couple months of rolling out a stateless change we can solve the problem
17:46:50 1.4.2 is April 29, that's quite recent
17:46:56 anadahz: if we make it stateless there will be no notion of "resuming". It has either succeeded or you need to retry.
17:47:11 willscott: yeah people are updating often
17:47:30 I'm actually quite surprised to not see anything from the 1.3 series so that is very good
17:47:44 hellais: are you taking into consideration the GB vantage points?
17:48:00 anadahz: no, but those are just 3-4 vantage points
17:48:44 and yeah that probe is running 1.3.2
17:48:45 hellais: landers so we decided that we go for stateless report submission?
17:48:54 it's a good thing I guess if we stop receiving reports from them though :P
17:49:28 anadahz: yes I think this is the best way forward.
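
A minimal sketch of what the agreed-on stateless submission could look like from the client side, per the discussion around 17:20:21 and 17:46:18: the whole report goes up in one request, so there is nothing to resume and a failed upload is simply retried. The endpoint path, payload shape, and use of the requests library are assumptions, not the actual protocol.

    # Sketch only: single-request report upload. Endpoint and payload are
    # hypothetical; retry policy is left to the caller.
    import requests

    COLLECTOR = "https://collector.example.net"  # hypothetical collector

    def submit_report(report):
        # One POST carrying the complete report body.
        resp = requests.post(COLLECTOR + "/report", json=report, timeout=60)
        resp.raise_for_status()  # on failure the client retries the whole upload
        return resp
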
17:50:15 moving on to the next item
17:50:32 #topic possible ways in which we can prevent submission of “malicious reports”
17:51:15 captchas
17:51:20 as part of addressing this I had created this ticket in the past: https://github.com/TheTorProject/ooni-probe/issues/438
17:51:21 https://github.com/TheTorProject/ooni-backend/issues/88 (doing the work of MightyOctopus)
17:52:05 basically the idea was to have some way of attributing authorship to reports (perhaps cryptographically signing them in some way) so that if we were to ever encounter malicious ones we could at least limit processing to those from trusted sources
17:52:35 but a malicious person wouldn't maintain the same ID over time, yeah?
17:52:44 ref: https://lists.torproject.org/pipermail/ooni-operators/2016-June/000003.html
17:52:57 i think this might reduce to A Hard Problem
17:53:23 right now pipeline ignores *all* reports from GB
17:53:25 landers: yeah, they probably wouldn't, but at least if you end up in this sort of situation you can limit processing to "known good data" that is the ones from people you know
17:53:34 oic
17:54:30 but yeah in the end this fits within the category of "A Hard Problem" so I don't think there is one clear good solution to this (otherwise people like cloudflare wouldn't be making all Tor users sad, probably :P)
17:57:28 note: I don't actually think the above operator is actually doing something malicious and we should actually be supporting their use case in the pipeline or at least advise against that sort of stuff somewhere.
17:57:45 authorship + stronger relationships with probe operators is probably good for other reasons, too, yeah?
17:58:34 landers: well yes and no, in some cases it may actually be a problem for people to have reports they submit all be linked to the same identity
17:59:28 optional authorship, then
17:59:38 the use case I think is particularly problematic is when I for example run ooniprobe on my laptop and run it from the networks I visit while travelling, now on the internet there is a public log of where I have been at what time if the relation between ID and me is revealed or deduced via correlation
17:59:46 pats on the back for regular submissions from interesting places
18:01:38 (FYI the operator that was submitting reports that way has contacted me in private and has stopped submitting them)
18:02:04 well, some kind of optional signing would be a not-so-hard first step
18:02:04 anyways is there anything more to discuss on this point?
18:02:38 hellais: ah nice, it seems the ooni-operators mailing list does work :)
18:03:03 i think we are good to end this meeting (almost) on time
18:03:40 ok sounds good, well thanks for attending and until next time!
18:03:42 #endmeeting
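
A minimal sketch of the "optional signing" idea from 18:02:04, using ed25519 via PyNaCl: an operator who opts in signs the serialized report with a long-lived key, and the pipeline can restrict processing to reports from known keys when needed. Key management and the report serialization format are assumptions, not a decided design.

    # Sketch only: optional report signing with PyNaCl (ed25519).
    from nacl.encoding import HexEncoder
    from nacl.signing import SigningKey, VerifyKey

    def sign_report(signing_key, report_bytes):
        # Attach a detached signature plus the public key to the report.
        signed = signing_key.sign(report_bytes)
        return {
            "signature": signed.signature.hex(),
            "verify_key": signing_key.verify_key.encode(HexEncoder).decode("ascii"),
        }

    def verify_report(report_bytes, signature_hex, verify_key_hex):
        # Raises nacl.exceptions.BadSignatureError if the signature is invalid.
        vk = VerifyKey(verify_key_hex, encoder=HexEncoder)
        vk.verify(report_bytes, bytes.fromhex(signature_hex))
        return True

    # Usage: operator_key = SigningKey.generate(), kept privately by the operator;
    # only the verify_key would ever be shared with the pipeline.
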