17:00:03 #startmeeting OONI gathering 2016-07-25
17:00:03 Meeting started Mon Jul 25 17:00:03 2016 UTC. The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:03 Useful Commands: #action #agreed #help #info #idea #link #topic.
17:00:29 hello there!
17:00:44 who is in here?
17:02:01 * landers here
17:05:02 so it seems like we are agenda-less this time around. Is there something people would like to talk about?
17:06:20 * hodgepodge waves, a little late to the party
17:07:10 curious: are the MK-{android, ios} things on github complete apps, or like libraries to be incorporated in an app someday?
17:08:14 https://github.com/measurement-kit
17:08:58 landers: fine question. The apps are currently work in progress, but they are available on github as well as on the market in test-flight mode.
17:09:12 if you are interested in trying them out lorenzo can give you an invite to install the app
17:09:39 * landers should get a smart phone
17:09:45 these are the two android and ios apps: https://github.com/measurement-kit/measurement-kit-app-ios https://github.com/measurement-kit/measurement-kit-app-android
17:11:23 what's the timeline for them being out of test-flight mode? i think you mentioned they don't talk to bouncers?
17:11:35 a related question: is the plan to switch ooni to test using the mk code base?
17:15:05 landers: They should be out of test-flight around September, that is when the web_connectivity test will be released in measurement-kit and we will have switched to using the main collector of ooni.
17:16:33 willscott: yes, that is the plan. The work I am doing to have a GUI in ooniprobe is also leading to a fair amount of refactoring of the ooniprobe codebase and hopefully this will also make it easier to integrate something like measurement-kit inside of it. I wouldn't expect that to happen sooner than October/November, though.
17:19:05 hellais: are there any plans to increase the number of tests that are run on a daily basis by ooni-probes? I've noticed that the lion's share of the data that we have corresponds predominantly to the http-requests, http-invalid-request-line, and dns-consistency tests with a bit of web connectivity and mp-traceroute thrown in here and there.
17:21:32 those tests will probably always be the lion's share, because they run once per url, while other tests will only need to be run once
17:22:07 many of the service tests in ooni now are hard to run because dependencies need to be installed correctly
17:22:32 are there other tests you'd like to see run more often?
17:23:24 yeah I think the main issue has to do with the fact that the other tests are either complex to run due to dependencies or complex to run due to the fact that they require special configuration, hence they can't be easily put inside of the daily deck.
17:24:22 Around September we will also have implemented tests that verify reachability of IM apps and those will be implemented in a way that they don't require external dependencies and will hence be easier to integrate
17:24:37 Gotcha, thanks willscott. I was actually kind of curious about the Tor bridge reachability test as well as some of the experimental tests. It kind of seems like there's overlap in the tests that I was thinking of, now that I think about it.
17:24:52 (e.g. dns_consistency vs. dns_spoofing)
17:25:01 (e.g. http-host, tls-handshake)
17:28:07 hodgepodge: the bridge reachability test would indeed be something interesting to run always, but in order to see it make its way into the default deck I think we ought to have it 1) Come with a set of default bridges to test (these can be the bridges that are shipped as part of TBB, but we need to ensure they are kept up to date) 2) Better handling of cases where a particular pluggable transport is not
17:28:13 installed and gracefully handle that situation by testing only the other ones
17:29:21 regarding the tls-handshake test I think that is something that we should be working towards integrating inside of the web-connectivity test and expanding it, since we are already generating the network traffic necessary to extract the certificate information and it would make sense to have it also be included in the report.
17:31:13 the http-host test is a bit of a special test, in the sense that it's mostly useful when you know that a certain thing is being blocked and you suspect it being blocked by means of a transparent http proxy and you want to see which evasion techniques work. However if you run it against a set of sites that are not blocked you will get inconsistent results and will not be testing for anything.
17:33:51 Out of curiosity, is the web-connectivity test finished, or is it still under development?
17:34:39 it has replaced http-requests and dns-consistency
17:34:40 I'm thinking predominantly about how to perform ETL of metrics tied to that test within the scope of the pipeline, especially if the contents change dramatically.
17:35:15 hodgepodge: it's finished and is run by default by ooniprobe version > 1.5
17:35:28 Awesome, thanks!
17:35:53 well since all tests are versioned, we could in theory add some extra keys and handle the parsing of the extra keys conditionally on the version number.
17:37:41 That's true. We could probably handle missing data with that in mind as long as the test version number is always bumped. Have you and willscott looked into the stack for ooni-pipeline-ng-ng?
17:39:24 hellais has proposed airflow for top-level structure -> https://github.com/TheTorProject/ooni-pipeline/issues/32
17:40:50 Right on. I'm playing around with Airflow right now, and so far it's looking really nice. The framework provides sub-DAGs which will help with all of the test-specific, and moreover version-specific normalization that we'll be doing.
17:41:34 We pretty much have three choices of workflow managers, so, Airflow seems like a natural choice. :P
17:41:39 yeah I quite like airflow. I also think that it should work by having some sort of message queue to handle new incoming measurements from the collectors (regarding message queue I haven't looked that much into it, though kafka seems fairly widely used and has some good properties).
17:43:57 That's a good point. You could always upload to S3, and simultaneously send metrics to the pipeline.
17:45:34 I'm not really sure how you could maintain a log of all of the state transformations that an ooni-probe undergoes during sanitisation and normalisation. Any ideas?
17:46:24 With ooni-pipeline we simply store a new file that represents the work that was performed in a given phase, but that requires a ton of storage space.
17:48:55 hodgepodge: if the pipeline focuses on aggregation of results, then its partial products will be much smaller and shouldn't be a problem to store
17:49:26 for original measurements, ideally we store the originals and can re-run them if the pipeline changes and new analysis is needed
17:49:55 Oh, I was referring to the intermediate state of metrics while they're being normalized and sanitized as opposed to the product of a particular aggregation.
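The idea raised at 17:35:53, parsing extra keys conditionally on the test version number, might look roughly like the sketch below. This is only an illustration under stated assumptions: the key "tls_certificate" and the cut-off version 0.2.0 are made up for the example, and the function names are hypothetical rather than part of ooni-pipeline.

```python
# Hypothetical sketch: "tls_certificate" and the 0.2.0 cut-off are
# illustrative assumptions, not the real OONI report schema.

def parse_version(version):
    """Turn a dotted version string such as "0.1.0" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def normalise(measurement):
    """Return a normalised copy of a measurement, branching on test_version."""
    raw_version = measurement.get("test_version", "0.0.0")
    normalised = {
        "test_name": measurement["test_name"],
        "test_version": raw_version,
    }
    if parse_version(raw_version) >= (0, 2, 0):
        # Assumed: reports from 0.2.0 onwards may carry the extra key.
        normalised["tls_certificate"] = measurement.get("tls_certificate")
    else:
        # Older reports never defined the key, so mark it as missing
        # instead of trusting whatever happens to be in the input.
        normalised["tls_certificate"] = None
    return normalised
```

Comparing versions as integer tuples rather than strings keeps "1.10.0" ordered after "1.9.0", which matters if the test version is bumped on every schema change.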
17:50:52 I was thinking of storing a binary diff representative of the work performed for each file in each phase, but that might be expensive, especially if the intermediary products are compressed.
17:51:22 hodgepodge: I don't think having a lot of data stored while the pipeline is running is a big concern. As long as we clean them up after it has reached the final stage.
17:51:56 i'll need to look through the code before i have an opinion. i don't have a good sense of how many stages of intermediate sanitization/normalization are happening right now
17:52:50 it's beneficial to keep the intermediate stages only for the purpose of resuming the processing of a set of measurements if it failed before finishing it, but once it's done it shouldn't be that much of a problem.
17:53:27 If we're only storing a day's run of metrics locally, then it's certainly a non-issue.
17:53:48 Hm. That's true too, hellais. It's not very expensive to run the sanitisation / normalisation bits as far as I know.
17:56:20 yeah, they aren't that expensive. The most expensive part of it is converting the legacy YAML reports to JSON
17:58:08 Sorry, my connection dropped. Did I miss anything after my last message?
17:58:27 yeah, they aren't that expensive. The most expensive part of it is converting the legacy YAML reports to JSON
17:58:49 Thanks, willscott.
17:59:00 And yeah, that part is absolutely brutal.
18:00:42 heh, yeah
18:00:49 do we have anything else we would like to discuss?
18:04:21 if we don't have anything else I say we can end this here. Thank you all for attending! Safe hacking!
18:04:25 #endmeeting
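
The approach discussed between 17:51:22 and 17:52:50, keeping intermediate stages only so a failed run can resume and cleaning them up once the final stage is reached, could be sketched as follows. This is a minimal illustration using only the Python standard library; the function name run_pipeline, the checkpoint file naming, and the use of JSON files are assumptions made for the example, not how ooni-pipeline or Airflow actually do it.

```python
import json
import os
import shutil


def run_pipeline(measurements, stages, work_dir):
    """Run (name, function) stages in order, caching each stage's output.

    If a stage raises, the checkpoints written so far stay on disk and the
    next call skips the stages that already completed.
    """
    os.makedirs(work_dir, exist_ok=True)
    data = measurements
    for index, (name, func) in enumerate(stages):
        checkpoint = os.path.join(work_dir, "%02d-%s.json" % (index, name))
        if os.path.exists(checkpoint):
            # A previous run already finished this stage: reuse its output.
            with open(checkpoint) as handle:
                data = json.load(handle)
            continue
        data = [func(item) for item in data]
        # Write under a temporary name first so that a crash mid-write
        # never leaves a truncated checkpoint behind.
        with open(checkpoint + ".tmp", "w") as handle:
            json.dump(data, handle)
        os.replace(checkpoint + ".tmp", checkpoint)
    # Every stage succeeded, so the intermediate files can be removed.
    shutil.rmtree(work_dir)
    return data
```

With this layout the storage cost is bounded by one run's worth of intermediate files, which matches the point made at 17:53:27 about only keeping a day's run locally.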