17:01:03 #startmeeting OONI dev gathering 2016-06-06
17:01:03 Meeting started Mon Jun 6 17:01:03 2016 UTC. The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:01:03 Useful Commands: #action #agreed #help #info #idea #link #topic.
17:01:07 here we go
17:01:14 * landers yo
17:01:36 hello everyone
17:02:37 sbs told me that he couldn't make it to this meeting, but would like to report back that he plans on releasing a beta for measurement-kit in the next days
17:02:42 heya
17:03:10 I would say we can start by going through the items we have in the agenda
17:03:36 (https://pad.riseup.net/p/ooni-irc-pad)
17:03:37 #topic Possible ways of implementing graceful restart of the ooni-backend
17:03:56 this is related to the ticket in here: https://github.com/TheTorProject/ooni-backend/issues/86
17:04:59 the ticket seems to offer a path forward :)
17:05:07 basically the issue is that the current situation is that if you kill a backend and restart it any report that is currently pending will be dropped since it uses an in memory data structure to keep track of the open tickets
17:05:13 err open reports
17:05:37 what's a typical time duration from a client opening a report to closing a report?
17:05:38 how long are reports 'pending'?
17:05:43 yeah there are actually multiple ways forward
17:06:11 that's a fine question. They can go from the order of seconds to an order of multiple hours
17:06:17 it depends on how big the input list is
17:06:37 apache does this nicely by asking the old process to close when done, and closing all the listening sockets, then starts the new process
17:06:53 old clients are served by the old process, new ones by the new process
17:06:58 but nothing just gets killed
17:07:03 hellais: is it because the upload goes in tandem with the probing? multiple hours is a long time :o
17:07:27 willscott: yes that is the reason.
17:07:46 we could certainly aim for graceful take-overs, i'm tempted to say that if the backend crashes the client needs to restart its upload
17:09:30 irl: I think this sort of logic may be more complicated to implement in our case, because it's not just a matter of looking at the open sockets, but it also requires application specific logic.
17:09:52 I guess one problem is that we are using a stateless protocol in a stateful manner
17:11:08 so state needs to be preserved across reloads, but is currently only in memory?
17:11:11 that is it's not sufficient to say don't accept any more incoming connections, but it's more of saying ONLY accept incoming connections that are updates for reports that have already been created and redirect all other report opening operations to another process.
17:11:24 irl: yes
17:11:55 ah ok, is there any redirect mechanism currently supported?
17:13:23 irl: yeah in a way the bouncer is this redirect mechanism. That is the address that the bouncer points to can be adjusted at runtime to be something different, that basically leads to new report creations being done on another collector
17:14:13 so it's like a load balancer pool, we can drop one of the oonibs out of the pool, wait x time and reload it, add it back to the pool
17:14:15 I discuss how this can be used to support graceful reload in here: https://github.com/TheTorProject/ooni-backend/issues/87 (red-black deployment strategy), but it's a bit convoluted
17:15:11 it's a many step process but i think it's automatable enough
17:15:20 so the graceful reload stuff would still not keep us from losing things on a crash, right?
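
The pending-report loss described at 17:05:07 comes from the collector tracking open reports only in memory. Below is a minimal sketch of one possible mitigation, persisting open-report metadata to disk so a restarted collector can still recognise pending reports; the paths and field names are hypothetical, not the actual ooni-backend layout.

    # Sketch only: persist open-report state to disk instead of keeping it
    # solely in memory, so it survives a collector restart. The directory
    # and metadata format are illustrative assumptions.
    import json
    import os
    import time

    PENDING_DIR = "/var/lib/oonib/pending_reports"  # hypothetical location

    def mark_report_open(report_id, metadata):
        # Record the open report on disk at creation time.
        if not os.path.isdir(PENDING_DIR):
            os.makedirs(PENDING_DIR)
        entry = dict(metadata, created=time.time())
        with open(os.path.join(PENDING_DIR, report_id + ".json"), "w") as f:
            json.dump(entry, f)

    def load_pending_reports():
        # On startup, rebuild the in-memory table from whatever survived.
        pending = {}
        if not os.path.isdir(PENDING_DIR):
            return pending
        for name in os.listdir(PENDING_DIR):
            if name.endswith(".json"):
                with open(os.path.join(PENDING_DIR, name)) as f:
                    pending[name[:-len(".json")]] = json.load(f)
        return pending
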
17:15:21 irl: yeah, that is one way to go about it.
17:15:22 and it's pretty common for things like web app deployments
17:16:01 landers: no, but we'll just write code that doesn't crash (:
17:16:27 I think adding real load balancing in ooni-backend will help alleviate these issues
17:16:36 yeah I think this sort of thing should be something that is quite standard
17:16:53 onionbalance is reaching maturity
17:17:05 and standard load balancing has been mature a while
17:17:12 there may be existing solutions we can drop in
17:17:29 this way we can have a fleet of collectors and test helpers
17:17:45 we can also monitor them and automatically drop out any that are unresponsive
17:18:01 anadahz: I think that our load-balancing needs are quite particular though since we need to take into account state. That is report upload operations are transactions and I don't think load balancers for HTTP usually take into account this use-case
17:18:02 to prevent new reports going to backends that are down
17:18:40 hellais: we can do some form of deterministic routing to the backend
17:18:49 Twisted supports HAProxy, onionbalance is good but not when you also have non-HS test helpers i.e. cloudfronted HTTPS test helpers
17:18:49 based on the 5-tuple or some http header
17:19:15 hellais: load balancers will have requests from the same client go to the same backend
17:19:18 based on cookies or the like
17:19:37 willscott: not in the case of Tor though
17:20:01 the actual deployment will need more exploration of the tools out there, but i think the strategy in #87 gets a +1 from me
17:20:04 basing it on HTTP headers and other things would require changing the client protocol that would lead to losing reports from legacy clients.
17:20:21 eventually we should maybe consider having the upload be a single request?
17:20:47 willscott: yes I think that is probably the right way to go about it in the future.
17:21:26 since already in MK the report upload is done at the end of a test run and we plan to do something similar in ooniprobe as well as part of the reducing fingerprintability effort
17:22:16 great. added a note to that effect in the ticket
17:22:21 hellais: in any case we'll need to implement load balancing in ooni-backend we can't just have a number of probes submitting to just one server!
17:23:25 anadahz: I think load-balancing is the sort of thing that you implement once you start noticing that you can't process the load anymore and to my understanding we are not currently over capacity.
17:23:44 also in order to have load-balancing we would have to have 2 more things:
17:23:53 1) Keep track of the state of reports inside of something like a database
17:24:33 2) Instead of writing the temporary reports to the filesystem it's probably ideal to push them to some message queue that gets consumed by the pipeline
17:24:45 those seem much longer term
17:24:51 exactly
17:24:54 well we are experiencing issues when we upgrade the backend and we also need to implement a way to reload TLS certificates in the backend. Both of these wouldn't be a problem if we had load balancing
17:25:21 that is why I think we should come up with something that is the most effective short-term fix with the least amount of architectural changes required
17:25:46 and what do you think that is?
17:25:57 anadahz: yes, but load-balancing as I explained above and in the tickets creates another set of issues that require 1. and 2.
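
A minimal sketch of the deterministic-routing idea raised at 17:18:40, here keyed on the report ID rather than the 5-tuple or an HTTP header; the collector addresses and the choice of routing key are made-up assumptions for illustration.

    # Sketch only: route every request for a given report to the same
    # collector by hashing a stable key. Addresses are hypothetical.
    import hashlib

    COLLECTORS = [
        "httpo://collector-a.example.onion",
        "httpo://collector-b.example.onion",
    ]

    def collector_for(report_id):
        digest = hashlib.sha256(report_id.encode("utf-8")).hexdigest()
        return COLLECTORS[int(digest, 16) % len(COLLECTORS)]
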
17:27:01 hellais: but perhaps it's easier to implement these 2 rather than implement a way to auto-update the backend and reread the renewed TLS certificates
17:27:32 and in the future implement these 2 things
17:27:39 willscott: in my opinion the quickest way of doing this would be checking the existence of a report based on filenames on disk (if you see in the raw_reports directory some file named like the report the person wants to update you consider it to exist and if you don't have it in memory you set the stale time to the maximum)
17:28:08 and then have a deployment strategy like the one outlined in the "red-black deployment strategy" in #87
17:29:58 where we have two separate processes for the bouncer and the collector. When we need to update the collector we spin up a new collector pointing to the same shared directory. Once it comes up we make it point to the new one and we monitor that no reports are in progress. When all reports are finished we can kill the old collector and consider roll-over complete.
17:30:26 i guess there's a longer conversation to be had about whether we think that will be more work or getting clients to batch to a single report and losing old clients
17:30:56 As an extra step if the bouncer also needs to be updated another step needs to happen after the roll-over of the collector that is similar, where you take up the new bouncer and once it comes up you remap the ports of the old one to the new one
17:31:05 how often are we restarting the bouncer?
17:31:13 (or losing reports)
17:31:27 that seems like the question that tells us how high we should prioritize this work
17:31:47 we have done it once a month in the past
17:32:16 but when we will switch to having HTTPS collectors we will have to do it again and possibly every 60 days to renew the SSL certificate with Let's Encrypt
17:32:33 that can happen at the same time as the once/month restart
17:32:58 i'd personally be in favor of having clients make a single http upload
17:33:07 how quickly do ooniprobes upgrade (if a new version did single-req upload)?
17:33:13 anadahz: those two things are actually not very simple to implement and it's something I would like to do after having spent a fair amount of time thinking about it, because many things can go wrong.
17:33:28 willscott: every time that we update ooni-backend
17:33:47 anadahz: which sounds like ~once/month, yeah?
17:33:49 willscott: and drop support for legacy clients?
17:34:12 or let's say "not care about losing reports from older clients"
17:34:13 hellais: or continue to lose reports from them once/month?
17:34:14 just maybe lose half-completed reports from legacy clients
17:34:21 willscott: sometimes we have more than 1 release per month
17:34:47 do we have a sense of the distribution of versions / how many clients stay up to date with current releases?
17:35:03 especially as we switch to web-connectivity and stuff, we're already going to be less excited about the old reports anyway
17:35:47 willscott: well the raspberry pi image automatically updates ooniprobe every 7 days. debian packages usually come out within 1 week of when we release on pypi
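
A minimal sketch of the filename-based existence check described at 17:27:39: if a file named after the report already sits in the raw_reports directory, the report is considered to exist even when it is missing from memory, and it gets the maximum stale time. The directory path and stale-time value are assumptions.

    # Sketch only: fall back to the filesystem when the in-memory table has
    # no entry for a report update. Constants are illustrative assumptions.
    import os

    RAW_REPORTS_DIR = "/data/raw_reports"  # hypothetical path
    MAX_STALE_TIME = 8 * 3600              # hypothetical ceiling, in seconds

    def report_exists(report_id, in_memory_reports):
        return report_id in in_memory_reports or \
            os.path.exists(os.path.join(RAW_REPORTS_DIR, report_id))

    def stale_time_for(report_id, in_memory_reports):
        if report_id in in_memory_reports:
            return in_memory_reports[report_id]["stale_time"]
        if os.path.exists(os.path.join(RAW_REPORTS_DIR, report_id)):
            # Lost the in-memory entry (e.g. across a restart): be generous.
            return MAX_STALE_TIME
        return None  # unknown report, reject the update
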
17:35:54 I can look at the stats for the last update
17:35:57 unless we have a really pressing need to support old versions, it seems like a lot of work we're proposing for unclear gain
17:36:09 willscott: ubuntu has rather old binaries across all the distributions, but these will be picking up
17:36:24 we're working towards backports for debian stable and these are pending
17:36:49 within ~1 week of new releases i am hoping to have debian/ubuntu up to date also in the stable backports
17:36:58 awesome!
17:37:21 i think we also are in the lucky position of having mostly technical probe operators who will be able to keep the clients updated if they care
17:38:22 hellais is going to look at stats. should we move on to the next agenda topic?
17:38:39 i'm not really OK with having downtime introduced to the backend
17:39:16 anadahz: explain? what would cause us to 'have downtime introduced'
17:39:23 and we should also consider that in the future ooni will not be the only probe submitting reports
17:40:14 do we expect other probes that can't submit reports in a single http request?
17:40:16 willscott: on every backend upgrade
17:40:53 mm. we have that already though. it seems like something to mitigate, but moving to a stateless backend rather than transactional would make things way easier
17:41:37 +1 next topic
17:42:22 yeah I agree I think the way to go is make a change to the submission protocol, for sure it's going to be easier to implement and more future proof
17:42:24 apart from this we have to kill ooni-backend on every TLS renewal
17:42:52 since there is no graceful restart option supported atm
17:45:06 graceful restart is easier to implement *after* submission is stateless
17:45:09 these are the stats of version number of ooniprobe per vantage point: https://paste.debian.net/719029/
17:45:15 (in the month of June)
17:45:33 probably there will be other issues that we haven't really thought about, like reports resuming,
17:46:10 bandwidth considerations when the probes resend reports after backend termination
17:46:18 yeah I agree if the submission is stateless then graceful shutdown is super easy, it's just a matter of not accepting any more connections and waiting for the pending ones to conclude
17:46:49 hellais: that seems to suggest that all the probes are within 1 version of current, which means that within a couple months of rolling out a stateless change we can solve the problem
17:46:50 1.4.2 is April 29, that's quite recent
17:46:56 anadahz: if we make it stateless there will be no notion of "resuming". It has either succeeded or you need to retry.
17:47:11 willscott: yeah people are updating often
17:47:30 I'm actually quite surprised to not see anything from the 1.3 series so that is very good
17:47:44 hellais: are you taking into consideration the GB vantage points?
17:48:00 anadahz: no, but those are just 3-4 vantage points
17:48:44 and yeah that probe is running 1.3.2
17:48:45 hellais: landers so we decided that we go for stateless report submission?
17:48:54 it's a good thing I guess if we stop receiving reports from them though :P
17:49:28 anadahz: yes I think this is the best way forward.
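
A minimal sketch of what the agreed-on stateless submission could look like from the client side, per the discussion around 17:20:21 and 17:46:18: the whole report goes up in one request, so there is nothing to resume and a failed upload is simply retried. The endpoint path, payload shape, and use of the requests library are assumptions, not the actual protocol.

    # Sketch only: single-request report upload. Endpoint and payload are
    # hypothetical; retry policy is left to the caller.
    import requests

    COLLECTOR = "https://collector.example.net"  # hypothetical collector

    def submit_report(report):
        # One POST carrying the complete report body.
        resp = requests.post(COLLECTOR + "/report", json=report, timeout=60)
        resp.raise_for_status()  # on failure the client retries the whole upload
        return resp
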
17:50:15 moving on to the next item
17:50:32 #topic possible ways in which we can prevent submission of “malicious reports”
17:51:15 captchas
17:51:20 as part of addressing this I had created this ticket in the past: https://github.com/TheTorProject/ooni-probe/issues/438
17:51:21 https://github.com/TheTorProject/ooni-backend/issues/88 (doing the work of MightyOctopus)
17:52:05 basically the idea was to have some way of attributing authorship to reports (perhaps cryptographically signing them in some way) so that if we were to ever encounter malicious ones we could at least limit processing to those from trusted sources
17:52:35 but a malicious person wouldn't maintain the same ID over time, yeah?
17:52:44 ref: https://lists.torproject.org/pipermail/ooni-operators/2016-June/000003.html
17:52:57 i think this might reduce to A Hard Problem
17:53:23 right now pipeline ignores *all* reports from GB
17:53:25 landers: yeah, they probably wouldn't, but at least if you end up in this sort of situation you can limit processing to "known good data" that is the ones from people you know
17:53:34 oic
17:54:30 but yeah in the end this fits within the category of "A Hard Problem" so I don't think there is one clear good solution to this (otherwise people like cloudflare wouldn't be making all Tor users sad, probably :P)
17:57:28 note: I don't actually think the above operator is actually doing something malicious and we should actually be supporting their use case in the pipeline or at least advise against that sort of stuff somewhere.
17:57:45 authorship + stronger relationships with probe operators is probably good for other reasons, too, yeah?
17:58:34 landers: well yes and no, in some cases it may actually be a problem for people to have reports they submit all be linked to the same identity
17:59:28 optional authorship, then
17:59:38 the use case I think is particularly problematic is when I for example run ooniprobe on my laptop and run it from the networks I visit while travelling, now on the internet there is a public log of where I have been at what time if the relation between ID and me is revealed or deduced via correlation
17:59:46 pats on the back for regular submissions from interesting places
18:01:38 (FYI the operator that was submitting reports that way has contacted me in private and has stopped submitting them)
18:02:04 well, some kind of optional signing would be a not-so-hard first step
18:02:04 anyways is there anything more to discuss on this point?
18:02:38 hellais: ah nice, it seems the ooni-operators mailing list does work :)
18:03:03 i think we are good to end this meeting (almost) on time
18:03:40 ok sounds good, well thanks for attending and until next time!
18:03:42 #endmeeting
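
A minimal sketch of the "optional signing" idea from 18:02:04, using ed25519 via PyNaCl: an operator who opts in signs the serialized report with a long-lived key, and the pipeline can restrict processing to reports from known keys when needed. Key management and the report serialization format are assumptions, not a decided design.

    # Sketch only: optional report signing with PyNaCl (ed25519).
    from nacl.encoding import HexEncoder
    from nacl.signing import SigningKey, VerifyKey

    def sign_report(signing_key, report_bytes):
        # Attach a detached signature plus the public key to the report.
        signed = signing_key.sign(report_bytes)
        return {
            "signature": signed.signature.hex(),
            "verify_key": signing_key.verify_key.encode(HexEncoder).decode("ascii"),
        }

    def verify_report(report_bytes, signature_hex, verify_key_hex):
        # Raises nacl.exceptions.BadSignatureError if the signature is invalid.
        vk = VerifyKey(verify_key_hex, encoder=HexEncoder)
        vk.verify(report_bytes, bytes.fromhex(signature_hex))
        return True

    # Usage: operator_key = SigningKey.generate(), kept privately by the operator;
    # only the verify_key would ever be shared with the pipeline.
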