17:01:32 <hellais> #startmeeting
17:01:32 <MeetBot> Meeting started Mon Jan 18 17:01:32 2016 UTC.  The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:01:32 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:01:37 <hellais> ok so let's start this
17:01:40 <hellais> who is here?
17:01:56 * willscott waves
17:02:21 * sbs is here!
17:02:32 <simonv3> is here
17:02:41 <hellais> excellent
17:02:57 <hellais> so what have you been up to last week?
17:03:26 <sbs> I made progress with Tor integration stuff for MeasurementKit
17:03:42 <sbs> Everything I did is now published in this repository:
17:03:44 <sbs> https://github.com/bassosimone/mkok-experimental-tor
17:04:06 <sbs> The plan is to merge the code in there inside the main MeasurementKit repository in the next month
17:04:51 <sbs> Tor integration is needed to send reports generated by MK to the OONI collector, of course
17:04:55 <sbs> EOF
17:06:25 <hellais> I have been working on 1) Implementing views for detecting blockpages in the OONI reports (found confirmed cases of censorship based on blockpage detection in 7 countries) 2) Fixed some bugs in the data pipeline 3) Setup a cronjob to mirror the data pipeline on the servers at humboldt 4) Did some postgres performance tuning on the server at humboldt (with some relatively trivial tweaks, query times for our common queries can decrease by a factor of 6)
17:07:08 <hellais> 5) Reviewed the PRs on the Censorship circumvention tools (openvpn and psiphon)
17:07:23 <hellais> EOF
17:12:07 <simonv3> I’ve been working on visualizing the sites that are blocked versus the measurements made in the country view for the ooni-api. Pretty satisfied with it for now, but it probably needs tweaking.
17:12:10 <simonv3> EOF
17:15:03 <hellais> simonv3: regarding that, it would be ideal to have some way to go back and forward between past and present measurements, because when there are many identified cases of censorship shown on the map the bar charts end up crossing over each other
17:15:30 <simonv3> oh, yeah that would make sense
17:15:42 <simonv3> do you have an example country to test with?
17:16:08 <hellais> I am also not super convinced of having the number of blocked sites/number of tested sites count be white when you don't mouse over it, because it's neither completely hidden nor completely readable
17:16:18 <hellais> simonv3: yeah a good example is Turkey
17:17:20 <simonv3> I originally had it set to transparent
17:17:30 <simonv3> but that had a really weird fading effect happening on the lettering
17:17:45 <simonv3> could make the transition faster of course, that’d be less weird
17:20:18 <hellais> simonv3: what if we make the number of blocked elements be always visible in a stronger color (black or some darker shade of gray) and have only the total number of sites come in?
17:21:18 <simonv3> hmm, the total was less of a problem because it was on top - though the blocked could be inside the bar chart, and I guess it doesn’t need the “blocked” label
17:21:51 <simonv3> maybe getting rid of the blocked label and leaving it to the side is enough
17:22:41 <simonv3> (“blocked” was constantly overlapping with other bars, which is why I faded it out in the first place)
17:23:08 <hellais> ah I see
17:23:43 <hellais> I think that if the blocked count is "blocked red" and sits to the side without the "blocked" label it's probably quite understandable
17:24:26 <hellais> it's also that for certain countries it's really hard to understand if there even is some blocking detected, because the red section of the chart is so small in comparison to the total tested sites
17:24:41 <hellais> having an always visible label would improve that aspect I believe
17:28:35 <simonv3> I think you’re right
17:29:28 <hellais> also I came to the conclusion that it's perhaps best not to give so much importance to the comparison of the body of the control and the experiment in the http requests view
17:30:15 <hellais> the elements that I think should be highlighted the most should instead be control_failure and the headers
17:31:11 <simonv3> Sounds good
17:31:26 <simonv3> Oh, I also did some work on the loader stuff - are you still getting errors with that?
17:32:02 <hellais> ah no that's fixed now, thanks!
17:32:51 * ww waves hello
17:32:53 <simonv3> yay
17:33:32 <simonv3> hellais: I can also explore to see if there’s a diffing Angular plugin or something similar
17:33:47 <simonv3> and use that for the headers / body text (but collapse the body text)
17:35:02 <hellais> simonv3: for the headers that would be neat (though it's quite trivial to do the diff of the headers directly in the controller too), but I don't think we will get good results from diffing the response body
17:35:55 <hellais> in the end I think most of the heavy lifting there should be done at the pipeline stage, doing some better categorising of the response bodies similar to what dcf did here: https://lists.torproject.org/pipermail/ooni-dev/2015-September/000339.html
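A minimal sketch of the kind of header diff being discussed, written in Python for illustration (in ooni-api itself this would live in the Angular controller); the function name and sample headers are hypothetical:

    def diff_headers(control, experiment):
        # Collect every header name seen on either side, then keep only
        # those whose values differ between control and experiment.
        keys = set(control) | set(experiment)
        return {k: (control.get(k), experiment.get(k))
                for k in keys
                if control.get(k) != experiment.get(k)}

    # Example: a Server header that differs would be flagged.
    diff_headers({"Server": "nginx"}, {"Server": "BlockPage/1.0"})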
17:36:18 <ww> is https://ooni.torproject.org/reports/ the right place to be looking for data? it seems... somewhat stale...
17:36:57 <hellais> I have been looking at the stuff in his repository and there are some pretty good classifiers for censorship pages as well as cloudfront pages
17:36:58 <simonv3> hellais ah okay
17:38:04 <hellais> ww: yes sorry, that has not been updated in a while as we have been re-engineering the data processing pipeline. If you want to have access to the most up to date reports NOW, the best thing is probably for me to give you an s3 access key.
17:38:53 <hellais> also keep in mind that reports will be in the future published using JSON instead of YAML
17:40:29 <hellais> ww: I have a question for you: would it be more convenient for you to have access to the reports as individual files (one for each run of ooni-probe), or is it better to have daily batches that include all the measurements gathered for a specific day?
17:40:47 <ww> i see. i don't necessarily need a key just now, just to know the answer for when i will be asked. unless you think it would be useful for me to mirror it
17:41:09 <ww> also don't mind yaml vs. json, they're nearly isomorphic
17:41:49 <hellais> ww: what about having single report files for every experiment vs daily batches?
17:42:45 <ww> the main reason that i can think of that would matter is size.
17:43:05 <ww> if we have a lot of reports, and they are all in one file, many json parsers are not streaming
17:43:46 <ww> so i guess it depends on how soon we can expect to have data that will not comfortably fit into memory
17:44:24 <hellais> ww: well when I say JSON what I mean is actually a JSON stream, that is a series of JSON documents separated by newlines
17:44:44 <willscott> it'll be a while before a day's worth of measurements reach that point
17:45:31 <hellais> the batch of measurements for yesterday was 1.2GB
17:45:53 <ww> that should be workable, newline separated documents (that don't contain newlines!)
17:46:31 <hellais> newlines are escaped anyways when they are inside of a JSON string
17:46:33 <ww> more convenient (and efficient if a read means an HTTP GET), i'd say, than many small documents
17:47:02 <ww> hellais: yes, so long as nobody gets the bright idea to pretty-print the json! (i have seen people do that)
17:47:22 <hellais> heh
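For reference, consuming such a newline-delimited JSON stream is straightforward; a minimal Python sketch, with a hypothetical batch file name:

    import json

    # Each line of the batch is one self-contained JSON document;
    # newlines inside strings are escaped, so iterating line by line is safe.
    with open("reports-2016-01-17.json") as batch:
        for line in batch:
            measurement = json.loads(line)
            print(measurement.get("test_name"))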
17:48:13 <hellais> yeah the reason for batching is performance: it's more convenient to download and it's also more convenient when processing, as you don't have to seek to many parts of the disk to process the reports, but instead do sequential reads of some GB
17:49:40 <hellais> I should probably post details about this potential change to the mailing list to see if there are any nays, because it could potentially impact some consumers of the data
17:50:23 <hellais> the only reason I could see this not being convenient for some is if they are interested in downloading only reports of a specific sort, and instead they have to download all the batches and extract from those the ones they are interested in
17:50:30 <hellais> (as is the case of the study done by dcf)
17:52:32 <ww> in principle some sort of select operation might be good...
17:54:41 <ww> but the indexes to support that should hang off the side of the flat datastore...
17:55:48 <hellais> ww: you mean like having some sort of mapping of offsets into the files, indicating where reports of a certain type start and end?
17:56:29 <ww> hellais actually gchq has a nice slide about that -- the `blackhole - QFD' one
17:58:26 <hellais> haha
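A rough sketch of what such a side index could look like, assuming Python and newline-delimited batches; the test_name field and "unknown" fallback are illustrative:

    import json
    from collections import defaultdict

    def build_index(path):
        # Map each test type to (start, end) byte offsets in the batch
        # file, so a consumer can seek straight to the reports it wants.
        index = defaultdict(list)
        offset = 0
        with open(path, "rb") as batch:
            for line in batch:
                report = json.loads(line)
                index[report.get("test_name", "unknown")].append(
                    (offset, offset + len(line)))
                offset += len(line)
        return index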
18:00:40 <hellais> ok, so I guess we should now swiftly move into the next steps
18:03:38 <hellais> I will be working on 1) Refactoring and cleaning up of the data pipeline and potentially re-running it on the data (When setting up the mirror I realised that a lot of reports are missing from the master node) 2) Collaborating with simonv3 on the ooni-api 3) Start designing and specifying the web_connectivity test 4) Restoring the publishing of the collected measurements EOF
18:04:55 <simonv3> I’ll be focusing on the things we discussed above, and also incorporating the rest of the test data for each country that hellais made available last week
18:06:15 <sbs> I'll continue to work on Tor integration, fixing the remaining bugs and adding more test cases EOF
18:08:23 <ww> I'll set up a probe in AS60241
18:08:33 <ww> and another at UoE
18:08:49 <hellais> excellent!
18:08:49 <ww> and wait for students EOF
18:09:11 <hellais> thank you all for attending! Have fun!
18:09:18 <hellais> #endmeeting