16:59:55 #startmeeting
16:59:55 Meeting started Mon Dec 7 16:59:55 2015 UTC. The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:59:55 Useful Commands: #action #agreed #help #info #idea #link #topic.
17:00:01 so who is here?
17:00:22 hi
17:00:44 I'm here
17:00:54 hi
17:01:37 hi
17:01:56 gxg3: Hi!
17:01:59 ʕ ͡◔ ᴥ ͡◔ʔ/
17:02:22 hodgepodge: are you trying to pwn our terminals?
17:02:32 lol
17:02:33 I might be fuzzing.
17:02:36 pew pew.
17:03:00 hi!
17:05:28 so let's get this started
17:05:32 what were you up to last week?
17:06:27 more MeasurementKit development
17:06:53 released a couple of beta versions
17:07:08 and helped with the proposal
17:07:12 EOF
17:08:45 I started working on the requirements for ooni-api and getting broadly familiar with the project
17:09:07 I did some initial exploration to get to know the code and made some small changes in preparation for the next steps
17:09:28 Also, had a pretty good conversation about the requirements of the app and other things with hellais, hodgepodge, and bnvk
17:09:29 EOF
17:10:19 I'll go next.
17:10:24 * Worked on deployment scripts for ooni-pipeline and ooni-api
17:10:24 * Did some brainstorming with simonv3 and hodgepodge on the API and pipeline requirements
17:10:27 * Implemented some minor fixes to ooni-api
17:10:30 * Worked on submitting the OTF proposal
17:10:32 * Started refactoring ooni-probe to support JSON upload of reports
17:10:35 * Started refactoring the ooni-probe web GUI
17:10:38 EOF
17:10:43 * hodgepodge looks around
17:10:44 I'll go next.
17:10:48 • Opened issue #4 (suggestions to improve the usability of ooni-api by machines and humans): https://github.com/TheTorProject/ooni-api/issues/4 - it proposes storing test results as a JSON object within the PgSQL DB used by ooni-api. Commentary would be (greatly) appreciated. In response @hellais has opened tickets in ooni-probe and ooni-pipeline to accommodate this request.
17:10:53 • Was on a call with bnvk/hellais/simonv3/gxg3 to discuss requirements for ooni-api, as well as the proposed changes to the PgSQL DB
17:10:58 • Proposed an API specification for ooni-api - proposed models, relationships, and functionality: https://github.com/TheTorProject/ooni-api/wiki/API-Design
17:11:01 • Familiarized myself with strongloop, and developed models as described in the API spec.
17:11:06 • Started working on ooni-api in a local staging environment
17:11:43 • May work with @hellais regarding the JSON format used by ooni-pipeline
17:11:46 * hodgepodge shrugs
17:11:47 EOF
17:12:54 resolved some lepidopter bugs, revised some PT install scripts, closed a number of tickets/issues
17:12:58 EOF
17:16:52 I have worked on measurement-kit with sbs to release the first beta version and I'm working to improve and fix some issues before the first release
17:16:54 EOF
17:19:47 solved some minor issues to finish the psiphon and openvpn tests, thanks anadahz, hellais and sbs for the review
17:19:51 EOF
17:23:52 very good
17:24:46 are there any particular topics for discussion?
17:26:34 Are there plans to tie in metrics from services other than ooni-probe for ooni-api?
17:26:43 Might be best for OoB, but, eeeehhh.
17:26:57 e.g. measurementkit.
17:27:31 hodgepodge: as a matter of fact I have been thinking quite a bit about this and there is some other projects' data that we may wish to integrate into ooni-api
17:27:50 one thing that I would like to have integrated is a subset of the data from http://scans.io/
17:28:06 Thanks, @hellais!
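[Editor's note: a minimal sketch of the JSON-in-PostgreSQL storage proposed in issue #4, assuming a hypothetical `measurements` table with a JSONB column; the table and column names are illustrative and not taken from ooni-api:]

    # Sketch only: table/column names are assumptions, not from ooni-api.
    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=ooni")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS measurements (
                id SERIAL PRIMARY KEY,
                report_id TEXT NOT NULL,
                data JSONB NOT NULL
            )
        """)
        # Store one test result as a JSON document.
        cur.execute(
            "INSERT INTO measurements (report_id, data) VALUES (%s, %s)",
            ("example-report-id",
             Json({"test_name": "dns_consistency", "probe_cc": "IT"})),
        )
        # Query inside the document with PostgreSQL's ->> operator.
        cur.execute(
            "SELECT report_id FROM measurements WHERE data->>'test_name' = %s",
            ("dns_consistency",),
        )
        print(cur.fetchall())

[JSONB (PostgreSQL 9.4+) keeps the stored document queryable and indexable, e.g. via a GIN index on the JSON column, which is what makes storing whole test results as JSON practical.]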
That'll definitely impact the API design, and is great to know ahead of time.
17:28:35 Oh wow, this is pretty badass.
17:28:55 particularly for their potential usefulness as baselines and how they can be used as inputs to the measurements
17:29:38 I was thinking of developing a tool that can aggregate links to articles that talk about censorship in a given country. That might be neat to have.
17:29:52 that would be very cool too
17:29:56 It'd be a neat little ML exercise.
17:30:13 That's it from me, anyone else have anything they'd like to talk about?
17:30:25 we should also factor into the design of the API the difference between tests that are run from the network vantage point of a country and those that are run from the outside
17:30:45 Hm. So tests run with oonib as a proxy, you mean?
17:30:56 I think that oonib works as a test helper sometimes.
17:31:01 as an example there is a test that is VERY accurate at determining DNS blocking in China, but it doesn't necessarily have to be run from China
17:31:15 Oh neat.
17:32:19 hodgepodge: no, I mean like this test: https://github.com/TheTorProject/ooni-probe/blob/master/ooni/nettests/experimental/dns_injection.py
17:32:39 another thing we can do, for example, is run dns_consistency measurements on open DNS resolvers
17:33:17 That would be a great idea - I'm a little surprised that certain countries would try to respond to DNS requests sent to a non-existent DNS server.
17:33:28 though these tests would require a specific data model that takes into account the fact that they are not run from the vantage point where the network interference is observed
17:34:22 Wouldn't that be captured by probe_cc/probe_ip?
17:34:53 that's the thing, it wouldn't; those would be from wherever the test is run, which can be another country
17:35:14 but the measurement would actually be relevant to, in the case of the dns_injection test, the target non-existent DNS resolver
17:35:23 or in the case of the dns_consistency test it would be the CC of the input
17:35:46 Oh, gotcha. It shouldn't be too difficult to modify the test-specs to handle that.
17:38:18 yeah, in general I think that in the process of making the API we will improve the test specification as well
17:38:45 and if you do notice some inconsistencies in the test specs, please point them out as they are not set in stone
17:38:54 That's true. By the way, when is the PoC due? December 31st?
17:39:12 hodgepodge: yeah that is the target date
17:40:00 hallo party people
17:40:18 Raise the roof, bnvk is in the house.
17:42:09 :P
17:43:26 ahoi bnvk
17:43:27 is there more to be talked about?
17:43:41 I think we can do a better job than censys.io
17:43:51 (linked to from scans.io)
17:44:01 simonv3: yeah I agree
17:44:38 though the amount of data they collect is huge, so I don't think we are ready just yet to support that amount of data
17:45:18 also their approach is much less directed than ours, they don't have a specific goal in mind (except for heartbleed)
17:45:59 We don't really have a data mining capability at this time from what I can tell - once we've established ooni-api it should be a lot easier to support these types of things (and truly leverage Apache Spark/Hive/etc.)
17:46:02 I think the DNS data is the one we could most directly act upon in the immediate future
17:46:10 ^
17:46:18 hodgepodge: yes I agree
17:46:41 @hellais do you need any help with implementing JSON within ooni-pipeline?
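[Editor's note: a sketch of the vantage-point distinction discussed above, where probe_cc records where a test ran but not the country the result is about; the `target_cc` field is hypothetical and not part of the published test specs:]

    # Sketch only: "target_cc" is an illustrative field name, not from the
    # OONI test specifications.
    dns_injection_measurement = {
        "test_name": "dns_injection",
        "probe_cc": "NL",        # country the probe ran from
        "input": "www.example.com",
        # Country the result is actually about: responses injected on the
        # path to a non-existent resolver located in this country.
        "target_cc": "CN",
    }

    def relevant_cc(measurement):
        """Return the country a measurement speaks about, falling back to
        the probe's own country when no explicit target is recorded."""
        return measurement.get("target_cc", measurement["probe_cc"])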
17:46:50 It should be trivial for you, since you wrote it, but if you need a hand, let me know.
17:47:28 hodgepodge: Yeah it's quite trivial, but I want to take this as an opportunity to do a bit of refactoring of the pipeline to use luigi in a smarter way
17:48:24 in particular we currently have all this indirection with pyinvoke, and I think since 2.0.0 their binary has a lot of useful features that we are missing out on by using it from the command line
17:48:29 our way of using it is also deprecated
17:49:09 That's a good idea - it'd be nice to use the luigid scheduler too.
17:49:22 During the Spark stage of luigi do you perform anomaly detection using ALL of the data?
17:49:25 my plan is to properly document what the ETL pipeline looks like and what the various discrete steps are
17:49:34 if you want we can then split up the work of refactoring them
17:49:46 hodgepodge: yes, we go through all the data
17:49:49 That'd be awesome. It'll make it a lot easier for other people to contribute to pipeline dev.
17:49:52 Hm.
17:50:16 I also want it to be possible to use the ooni-pipeline locally so that people can run it on reports they have on their own machine
17:50:33 Once ooni-api is refactored, we should be able to make that process much, much more efficient by narrowing the scope of what Spark will be looking at.
17:50:48 You should be able to set up a pseudo-distributed cluster, I've done that in the past, and it wasn't too bad.
17:50:52 that will make it easier to debug, and also empower users that run their own collectors with their own data to do their own data analysis even if they don't use S3
17:50:58 I might even be able to put together a Vagrant box for that.
17:51:39 also this would mean we could have continuous integration and proper testing of the pipeline code
17:51:53 #PipelineGoals
17:51:56 I also thought a bit about putting everything in the database
17:52:02 Do it do it do it.
17:52:19 and I came to the conclusion that it does make sense to put all measurements in the database, but to exclude the HTTP response bodies
17:52:36 those should instead be stored as flat files containing the HTML, tokenised on report ID
17:52:49 this way we can have rapid access to the body, but without having a huge database
17:52:54 hodgepodge: do you think this makes sense?
17:53:10 Hm. I'm a little on the fence, but then again, I'm not entirely sure what the HTTP responses would be used for.
17:53:50 well currently we only look at the body length of the response to assess blocking, but we could, in the future, have more accurate heuristics that look for specific known block pages in them
17:54:29 I would still keep the headers in the database as that is not much data
17:54:36 The only issue with using S3 for storing HTTP response bodies is that you'll introduce a dependency on S3, which will be slow. I was thinking that we should store (most) things in the DB to allow for fast block page detection. Hm.
17:54:52 This will be a tough one.
17:55:38 argh, tradeoffs
17:55:39 We don't need to put everything in PostgreSQL of course, but I'm not sure if S3 is the right place, unless we were to cache the HTTP response bodies locally in MongoDB or Redis.
17:55:55 Since you'd have to download all of the reports to do the blockpage detection.
17:56:56 I think that for now, in order to accelerate PoC dev, we should put everything in PostgreSQL and then determine if we will have issues.
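[Editor's note: a minimal sketch of a luigi-based ETL step of the kind discussed above, runnable locally without S3 or the central scheduler; the task names and the normalization logic are illustrative, not ooni-pipeline's actual code:]

    import json
    import luigi

    class NormalizeReport(luigi.Task):
        """Read a raw report from local disk and normalize it to JSON."""
        report_path = luigi.Parameter()

        def output(self):
            return luigi.LocalTarget(self.report_path + ".json")

        def run(self):
            with open(self.report_path) as src, self.output().open("w") as dst:
                # Placeholder normalization: wrap the raw report's lines.
                json.dump({"raw_lines": src.read().splitlines()}, dst)

    class LoadReport(luigi.Task):
        """Load a normalized report; the real pipeline would write to
        PostgreSQL here instead of a local marker file."""
        report_path = luigi.Parameter()

        def requires(self):
            return NormalizeReport(report_path=self.report_path)

        def output(self):
            return luigi.LocalTarget(self.report_path + ".loaded")

        def run(self):
            with self.input().open() as f:
                report = json.load(f)
            with self.output().open("w") as marker:
                marker.write("loaded %d lines\n" % len(report["raw_lines"]))

    if __name__ == "__main__":
        # Run locally without the central scheduler:
        #   python etl_sketch.py LoadReport --report-path my.report --local-scheduler
        luigi.run()

[Because each task declares its inputs and outputs, luigi only re-runs steps whose outputs are missing, which is what makes local, incremental runs on a user's own reports feasible.]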
17:57:25 It's easy to back out of: we'd just remove the HTTP response bodies from the DB and put them in S3, Redis, MongoDB, or whatever.
17:57:28 ok let's do that
17:57:42 Awesome.
17:58:15 we should however be sure to do some benchmarks on this, so that whatever strategy we decide to opt for in the future, we can compare how and if it's an improvement
17:58:42 so let's do this next-steps business since we are running out of time
17:59:03 Of course. I have a fair bit of DBA experience, so I should be able to help you out with that.
17:59:21 hellais: where do you want me to focus this week? I was just going to take the next steps on the sketches and implement those
17:59:26 I shall work on documenting the various stages of the ETL pipeline and refactoring it to support putting all the things in postgres
18:00:29 this week I'll probably release the first stable MeasurementKit EOF
18:00:47 simonv3: yes that sounds good. I would focus on implementing a mock of the full workflow with some mocked-out models, so that we can work out the needed API endpoints and the queries required to extract the needed views from the DB
18:01:33 @simonv3 maybe look into how to perform queries in strongloop from JavaScript? I have a few models locally that I'll push to a dev branch later tonight.
18:01:49 We'll need to modify a few of the models to support queries for each of the pages.
18:02:24 hodgepodge: I'd definitely like to see that
18:08:03 simonv3: you can add the mocks of what you would like to have to https://github.com/TheTorProject/ooni-api/blob/master/server/boot/create-fixtures.js
18:09:50 That's what I've been doing locally (don't remember if I added those to my PR)
18:10:28 simonv3: ah, I didn't see them in the PR
18:10:46 Looks like I left it out
18:11:45 I have to go! bye!
18:13:08 * hodgepodge → lunch (・‿・)ノ゛
18:13:45 excellent
18:13:50 I think we are done here in that case
18:14:00 thank you all for attending
18:14:05 #endmeeting
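[Editor's note on the strongloop query question above: models defined in a LoopBack (strongloop) app are also exposed over its REST API and can be queried with a JSON filter; a sketch assuming a hypothetical `measurements` model served on localhost:3000 — none of these names are confirmed by the log:]

    # Sketch only: endpoint, model, and field names are assumptions.
    import json
    import requests

    # LoopBack-style REST filter: where clause plus a result limit.
    query_filter = {"where": {"probe_cc": "IT"}, "limit": 10}
    resp = requests.get(
        "http://localhost:3000/api/measurements",
        params={"filter": json.dumps(query_filter)},
    )
    resp.raise_for_status()
    for measurement in resp.json():
        print(measurement.get("test_name"), measurement.get("probe_cc"))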