16:59:55 <hellais> #startmeeting
16:59:55 <MeetBot> Meeting started Mon Dec  7 16:59:55 2015 UTC.  The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:59:55 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:00:01 <hellais> so who is here?
17:00:22 <anadahz> hi
17:00:44 <simonv3> I’m here
17:00:54 <alangiu> hi
17:01:37 <gxg3> hi
17:01:56 <sbs-mobile> gxg3: Hi!
17:01:59 <hodgepodge> ʕ ͡◔ ᴥ ͡◔ʔ/
17:02:22 <hellais> hodgepodge: are you trying to pwn our terminals?
17:02:32 <sbs-mobile> lol
17:02:33 <hodgepodge> I might be fuzzing.
17:02:36 <hodgepodge> pew pew.
17:03:00 <vtduncan> hi!
17:05:28 <hellais> so let's get this started
17:05:32 <hellais> what were you up to last week?
17:06:27 <sbs-mobile> more MeasurementKit development
17:06:53 <sbs-mobile> released a couple of beta versions
17:07:08 <sbs-mobile> and helped with the proposal
17:07:12 <sbs-mobile> EOF
17:08:45 <simonv3> I started working on the requirements for the ooni-api, getting broadly familiar with the project
17:09:07 <simonv3> I did some basic getting-to-know-the-code exploration and made some small changes, ready for the next steps
17:09:28 <simonv3> Also, had a pretty good convo about the requirements of the app and other things with hellais, hodgepodge, and bnvk
17:09:29 <simonv3> EOF
17:10:19 <hodgepodge> I'll go next.
17:10:24 <hellais> * Worked on deployment scripts for ooni-pipeline and ooni-api
17:10:24 <hellais> * Did some brainstorming with simonv3 and hodgepodge on the api and pipeline requirements
17:10:27 <hellais> * Implemented some minor fixes to ooni-api
17:10:30 <hellais> * Worked on submitting the OTF proposal
17:10:32 <hellais> * Started refactoring ooni-probe to support JSON upload of reports
17:10:35 <hellais> * Started refactoring the ooni-probe web GUI
17:10:38 <hellais> EOF
17:10:43 * hodgepodge looks around
17:10:44 <hodgepodge> I'll go next.
17:10:48 <hodgepodge> • Opened issue #4 (suggestions to improve the usability of ooni-api by machines, and humans): https://github.com/TheTorProject/ooni-api/issues/4 - proposes storing test results as a JSON object within the PgSQL DB used by ooni-api. Commentary would be (greatly) appreciated. In response @hellais has opened tickets in ooni-probe and ooni-pipeline to accommodate this request.
17:10:53 <hodgepodge> • On a call with bnvk/hellais/simonv3/gxg to discuss requirements for ooni-api, as well as the proposed changes to the PgSQL DB
17:10:58 <hodgepodge> • Proposed an API specification for ooni-api - proposed models, relationships, and functionality: https://github.com/TheTorProject/ooni-api/wiki/API-Design
17:11:01 <hodgepodge> • Familiarized myself with strongloop, and developed models as described in the API spec.
17:11:06 <hodgepodge> • Started working on ooni-api in a local staging environment
17:11:43 <hodgepodge> • May work with @hellais regarding the JSON format used by ooni-pipeline
17:11:46 * hodgepodge shrugs
17:11:47 <hodgepodge> EOF
17:12:54 <anadahz> resolved some lepidopter bugs, revised some PT install scripts, closed a number of tickets/issues
17:12:58 <anadahz> EOF
17:16:52 <alangiu> I worked on measurement-kit with sbs to release the first beta version, and I'm working to improve and fix some issues before the first release
17:16:54 <alangiu> EOF
17:19:47 <juga__> solved some minor issues to finish the psiphon and openvpn tests, thx anadahz, hellais and sbs for the review
17:19:51 <juga__> EOF
17:23:52 <hellais> very good
17:24:46 <hellais> are there some particular topics for discussion?
17:26:34 <hodgepodge> Are there plans of tying in metrics from services other than ooni-probe for ooni-api?
17:26:43 <hodgepodge> Might be best for OoB, but, eeeehhh.
17:26:57 <hodgepodge> e.g. measurementkit.
17:27:31 <hellais> hodgepodge: as a matter of fact I have been thinking quite a bit about this, and there is data from some other projects that we may wish to integrate into ooni-api
17:27:50 <hellais> one thing that I would like to have integrated is a subset of the data from http://scans.io/
17:28:06 <hodgepodge> Thanks, @hellais! That'll definitely impact the API design, and is great to know ahead of time.
17:28:35 <hodgepodge> Oh wow, this is pretty badass.
17:28:55 <hellais> particularly for their potential usefulness as baselines and how they can be used as inputs to the measurements
17:29:38 <hodgepodge> I was thinking of developing a tool that can aggregate links to articles that talk about censorship for a given country. That might be neat to have.
17:29:52 <hellais> that would be very cool too
17:29:56 <hodgepodge> It'd be a neat little M/L exercise.
17:30:13 <hodgepodge> That's it from me, anyone else have anything they'd like to talk about?
17:30:25 <hellais> we should also factor into the design of the API the difference between tests that are run from the network vantage point of a country and those that are run from the outside
17:30:45 <hodgepodge> Hm. So tests run w/oonib as a proxy, you mean?
17:30:56 <hodgepodge> I think that oonib works as a test helper sometimes.
17:31:01 <hellais> as an example there is a test that is VERY accurate at determining DNS blocking in China, but it doesn't necessarily have to be run from China
17:31:15 <hodgepodge> Oh neat.
17:32:19 <hellais> hodgepodge: no, I mean like this test: https://github.com/TheTorProject/ooni-probe/blob/master/ooni/nettests/experimental/dns_injection.py
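For context, a rough sketch of the idea behind the linked dns_injection test: send a DNS query towards an address that runs no resolver at all, so that any answer coming back must have been injected by an on-path middlebox. This uses dnspython rather than the Twisted-based code ooni-probe actually uses, and the resolver IP and domain below are placeholders.

```python
# Minimal sketch of DNS-injection detection, assuming dnspython is installed.
import dns.exception
import dns.message
import dns.query

TARGET_RESOLVER = "192.0.2.1"   # TEST-NET-1 address: no real resolver here
PROBED_DOMAIN = "example.com"   # domain suspected of being DNS-filtered

query = dns.message.make_query(PROBED_DOMAIN, "A")
try:
    # If this ever returns, some middlebox on the path answered for us.
    response = dns.query.udp(query, TARGET_RESOLVER, timeout=3)
    print("Answer from a non-resolver: likely DNS injection")
    print(response)
except dns.exception.Timeout:
    print("Timed out as expected: no injection observed")
```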
17:32:39 <hellais> another thing we can do, for example, is run dns_consistency measurements on open DNS resolvers
17:33:17 <hodgepodge> That would be a great idea - I'm a little surprised that certain countries would try to respond to DNS requests sent to a non-existent DNS server.
17:33:28 <hellais> though these tests would require some specific data model that takes into account the fact that they are not run from the vantage point of where the network interference is highlighted
17:34:22 <hodgepodge> Wouldn't that be captured by probe_cc/probe_ip?
17:34:53 <hellais> that's the thing, it wouldn't, that would be from wherever the test is run from, which can also be another country
17:35:14 <hellais> but what the measurement would actually be relevant to is, in the case of the dns_injection test, the target non-existent DNS resolver
17:35:23 <hellais> or in the case of the dns_consistency test it would be the CC of the input
17:35:46 <hodgepodge> Oh, gotcha. It shouldn't be too difficult to modify the test-specs to handle that.
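To make the distinction concrete, a minimal sketch of a measurement record that separates the vantage point from the place where the interference is observed. The target_cc field name is hypothetical, not part of the current test spec.

```python
# Sketch of the probe-location vs. interference-location distinction
# discussed above. Field names are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Measurement:
    test_name: str            # e.g. "dns_injection", "dns_consistency"
    probe_cc: str             # where the probe physically ran
    probe_asn: str
    target_cc: Optional[str]  # where the interference is highlighted,
                              # e.g. the CC of the input for dns_consistency

# A dns_injection run from a vantage point in NL, measuring injection
# by middleboxes on the path towards a Chinese address:
m = Measurement("dns_injection", probe_cc="NL",
                probe_asn="AS3333", target_cc="CN")
```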
17:38:18 <hellais> yeah, in general I think that in this process of making the API we will improve the test specification as well
17:38:45 <hellais> and if you do notice some inconsistencies in the test specs, please point them out as they are not set in stone
17:38:54 <hodgepodge> That's true. By the way, when is the PoC due? December 31st?
17:39:12 <hellais> hodgepodge: yeah that is the target date
17:40:00 <bnvk> hallo party people
17:40:18 <hodgepodge> Raise the roof, bnvk is in the house.
17:42:09 <bnvk> :P
17:43:26 <simonv3> ahoi bnvk
17:43:27 <hellais> is there more to be talked about?
17:43:41 <simonv3> I think we can do a better job than censys.io
17:43:51 <simonv3> (linked to from scans.io)
17:44:01 <hellais> simonv3: yeah I agree
17:44:38 <hellais> the amount of data they collect is huge, though, so I don't think we are ready just yet to support that volume
17:45:18 <hellais> also their approach is much less directed than ours; they don't have a specific goal in mind (except for heartbleed)
17:45:59 <hodgepodge> We don't really have a data mining capability at this time from what I can tell - once we've established ooni-api it should be a lot easier to support these types of things (and truly leverage Apache Spark/Hive/etc.)
17:46:02 <hellais> the DNS data is I think the one that we could more directly act upon in the immediate future
17:46:10 <hodgepodge> ^
17:46:18 <hellais> hodgepodge: yes I agree
17:46:41 <hodgepodge> @hellais do you need any help with implementing JSON within ooni-pipeline?
17:46:50 <hodgepodge> It should be trivial for you, since you wrote it, but if you need a hand, let me know.
17:47:28 <hellais> hodgepodge: Yeah it's quite trivial, but I want to take this as an opportunity to do a bit of refactoring of the pipeline to use luigi in a smarter way
17:48:24 <hellais> in particular we currently have all this indirection with pyinvoke, and I think since 2.0.0 their binary has a lot of useful features that we are missing out on by using it from the command line
17:48:29 <hellais> our way of using it is also deprecated
17:49:09 <hodgepodge> That's a good idea - it'd be nice to use the luigid scheduler too.
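For readers unfamiliar with luigi, a minimal sketch of what one discrete ETL step looks like as a luigi Task, schedulable by the luigid daemon mentioned above. Class, parameter, and path names are illustrative, not the actual ooni-pipeline code.

```python
# One hypothetical pipeline step expressed as a luigi Task.
import json
import luigi

class NormaliseReport(luigi.Task):
    report_path = luigi.Parameter()

    def output(self):
        # luigi uses the output target to decide whether the step is done
        return luigi.LocalTarget(self.report_path + ".json")

    def run(self):
        # placeholder transformation: read a raw report and
        # re-serialise it as JSON
        with open(self.report_path) as raw, self.output().open("w") as out:
            json.dump({"raw": raw.read()}, out)

if __name__ == "__main__":
    luigi.run()  # or submit to a central luigid scheduler
```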
17:49:22 <hodgepodge> During the Spark stage of luigi do you perform anomaly detection using ALL of the data?
17:49:25 <hellais> my plan is to properly document what the ETL pipeline looks like and what the various discrete steps are
17:49:34 <hellais> if you want we can then split up the work of refactoring them
17:49:46 <hellais> hodgepodge: yes, we go through all the data
17:49:49 <hodgepodge> That'd be awesome. It'll make it a lot easier for other people to contribute to pipeline dev.
17:49:52 <hodgepodge> Hm.
17:50:16 <hellais> I also want it to be possible to use the ooni-pipeline locally so that people can run it on reports they have on their own machine
17:50:33 <hodgepodge> Once ooni-api is refactored, we should be able to make that process much, much more efficient by narrowing the scope of what Spark will be looking at.
17:50:48 <hodgepodge> You should be able to set-up a pseudo-distributed cluster, I've done that in the past, and it wasn't too bad.
17:50:52 <hellais> that will make it easier to debug, and will also empower users who run their own collectors to analyse their own data even if they don't use S3
17:50:58 <hodgepodge> I might even be able to put together a vagrant box for that.
17:51:39 <hellais> also this would mean we could have continuous integration and proper testing of the pipeline code
17:51:53 <hodgepodge> #PipelineGoals
17:51:56 <hellais> I also thought a bit about putting everything in the database
17:52:02 <hodgepodge> Do it do it do it.
17:52:19 <hellais> and I came to the conclusion that it does make sense to put all measurements in the database, but to exclude the HTTP response bodies
17:52:36 <hellais> those should instead be stored as flat files containing the HTML bodies, tokenised on report ID
17:52:49 <hellais> this way we can have rapid access to the body, but without having a huge database
17:52:54 <hellais> hodgepodge: do you think this makes sense?
17:53:10 <hodgepodge> Hm. I'm a little on the fence, but then again, I'm not entirely sure what the HTTP responses would be used for.
17:53:50 <hellais> well currently we only look at the body length of the response to assess blocking, but we could, in the future, have more accurate heuristics that look for specific known block pages in them
17:54:29 <hellais> I would still keep the headers in the database as that is not much data
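A minimal sketch of the flat-file layout hellais describes: bodies stored outside the DB, addressed by report ID, while headers and the rest of the measurement stay in PostgreSQL. The storage root and naming scheme are illustrative assumptions.

```python
# Hypothetical flat-file storage of HTTP response bodies, keyed on report ID.
import os

BODIES_ROOT = "/data/bodies"  # illustrative storage root

def body_path(report_id: str, measurement_idx: int) -> str:
    # one file per response body, grouped by report
    return os.path.join(BODIES_ROOT, report_id, "%d.html" % measurement_idx)

def store_body(report_id: str, measurement_idx: int, body: bytes) -> str:
    path = body_path(report_id, measurement_idx)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    return path  # store this path in the DB row instead of the body itself
```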
17:54:36 <hodgepodge> The only issue with using S3 for storing HTTP response bodies is that you'll introduce a dependency on S3 which will be slow. I was thinking that we should store (most) things in the DB to allow for fast block page detection. Hm.
17:54:52 <hodgepodge> This will be a tough one.
17:55:38 <hellais> arg, tradeoffs
17:55:39 <hodgepodge> We don't need to put everything in PostgreSQL of course, but, I'm not sure if S3 is the right place, unless we were to cache the HTTP response bodies locally in MongoDB or Redis.
17:55:55 <hodgepodge> Since you'd have to download all of the reports to do the blockpage detection.
17:56:56 <hodgepodge> I think that for now, in order to accelerate PoC dev, we should put everything in PostgreSQL and then determine whether we run into issues.
17:57:25 <hodgepodge> It's easy to back-out, we'd just remove the HTTP response bodies from the DB, and then put them in S3, Redis, MongoDB, or, whatever.
17:57:28 <hellais> ok let's do that
17:57:42 <hodgepodge> Awesome.
17:58:15 <hellais> we should however be sure to do some benchmarks on this so that whatever strategy we decide to opt for in the future we can compare how and if it's an improvement
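A sketch of the PoC decision just agreed on: the whole measurement, HTTP response bodies included, goes into PostgreSQL as JSONB, so it can be backed out later by moving bodies elsewhere. Table, column names, and the DSN are illustrative assumptions, and psycopg2 is assumed as the driver.

```python
# Hypothetical JSONB-backed measurements table for the PoC.
import json
import psycopg2

conn = psycopg2.connect("dbname=ooni")  # illustrative DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS measurements (
            id         serial PRIMARY KEY,
            report_id  text NOT NULL,
            probe_cc   text,
            test_name  text,
            data       jsonb NOT NULL   -- full measurement, bodies included
        )
    """)
    cur.execute(
        "INSERT INTO measurements (report_id, probe_cc, test_name, data) "
        "VALUES (%s, %s, %s, %s)",
        ("2015-12-07-example", "NL", "http_requests",
         json.dumps({"body": "<html>...</html>", "headers": {}})),
    )
```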
17:58:42 <hellais> so let's do this next steps business since we are running out of time
17:59:03 <hodgepodge> Of course. I have a fair bit of DBA experience, so I should be able to help you out with that.
17:59:21 <simonv3> hellais: where do you want me to focus this week? I was just going to take the next steps on the sketches and implement those
17:59:26 <hellais> I shall work on documenting the various stages of the ETL pipeline and refactoring it to support putting all the things in postgres
18:00:29 <sbs-mobile> this week I'll probably release the first stable MeasurementKit. EOF
18:00:47 <hellais> simonv3: yes that sounds good. I would focus on implementing a mock of the full workflow with some mocked-out models, so that we can come up with the needed API endpoints and the queries required to extract the needed views from the DB
18:01:33 <hodgepodge> @simonv3 maybe look into how to perform queries in strongloop from JavaScript? I have a few models locally that I'll push to a dev. branch later tonight.
18:01:49 <hodgepodge> We'll need to modify a few of the models to support queries for each of the pages.
18:02:24 <simonv3> hodgepodge: I’d definitely like to see that
18:08:03 <hellais> simonv3: you can add to https://github.com/TheTorProject/ooni-api/blob/master/server/boot/create-fixtures.js the mocks of what you would like to have
18:09:50 <simonv3> That’s what I’ve been doing locally (don’t remember if I added those to my PR)
18:10:28 <hellais> simonv3: ah, I didn't see them in the PR
18:10:46 <simonv3> Looks like I left it out
18:11:45 <sbs-mobile> I have to go! bye!
18:13:08 * hodgepodge → lunch (・‿・)ノ゛
18:13:45 <hellais> excellent
18:13:50 <hellais> I think we are done here in that case
18:14:00 <hellais> thank you all for attending
18:14:05 <hellais> #endmeeting