17:01:32 <hellais> #startmeeting
17:01:32 <MeetBot> Meeting started Mon Jan 18 17:01:32 2016 UTC. The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:01:32 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:01:37 <hellais> ok so let's start this
17:01:40 <hellais> who is here?
17:01:56 * willscott waves
17:02:21 * sbs is here!
17:02:32 <simonv3> is here
17:02:41 <hellais> excellent
17:02:57 <hellais> so what have you been up to last week?
17:03:26 <sbs> I made progress with the Tor integration work for MeasurementKit
17:03:42 <sbs> Everything I did is now published in this repository:
17:03:44 <sbs> https://github.com/bassosimone/mkok-experimental-tor
17:04:06 <sbs> The plan is to merge that code into the main MeasurementKit repository in the next month
17:04:51 <sbs> Tor integration is needed to send reports generated by MK to the OONI collector, of course
17:04:55 <sbs> EOF
17:06:25 <hellais> I have been working on 1) Implementing views for detecting blockpages in the OONI reports (found confirmed cases of censorship based on blockpage detection in 7 countries) 2) Fixed some bugs in the data pipeline 3) Set up a cronjob to mirror the data pipeline on the servers at humboldt 4) Did some postgres performance tuning on the server at humboldt (with some relatively trivial tweaks, query times for our common queries can decrease by a factor of 6x)
17:07:08 <hellais> 5) Reviewed the PRs on the censorship circumvention tools (openvpn and psiphon)
17:07:23 <hellais> EOF
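The blockpage detection hellais mentions in item 1 boils down, at its simplest, to fingerprint matching: a response body containing a string known to appear only on a country's blockpage counts as a confirmed case of censorship. A minimal sketch of that idea in Python; the fingerprint strings and country codes below are invented placeholders, not the pipeline's actual rules:

    # Hypothetical fingerprints: substrings assumed to appear only on a known
    # blockpage for that country. The real rules live in the OONI pipeline.
    BLOCKPAGE_FINGERPRINTS = {
        "TR": "This site has been blocked by court order",  # placeholder
        "XX": "Access to this page is denied by authority",  # placeholder
    }

    def is_confirmed_blockpage(country_code, response_body):
        """Return True when the body matches that country's known fingerprint."""
        fingerprint = BLOCKPAGE_FINGERPRINTS.get(country_code)
        return bool(fingerprint) and fingerprint in response_body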
17:12:07 <simonv3> I've been working on visualizing the sites that are blocked versus the measurements made in the country view for the ooni-api. Pretty satisfied with it for now, but it probably needs tweaking.
17:12:10 <simonv3> EOF
17:15:03 <hellais> simonv3: regarding that, it would be ideal to have some way to go back and forward between past and present measurements, because when there are many identified cases of censorship shown on the map the bar charts end up crossing over each other
17:15:30 <simonv3> oh, yeah that would make sense
17:15:42 <simonv3> do you have an example country to test with?
17:16:08 <hellais> I am also not super convinced by having the blocked-sites/tested-sites count be white when you don't mouse over it, because it's neither completely hidden nor completely readable
17:16:18 <hellais> simonv3: yeah a good example is Turkey
17:17:20 <simonv3> I originally had it set to transparent
17:17:30 <simonv3> but that had a really weird fading effect happening on the lettering
17:17:45 <simonv3> could make the transition faster of course, that'd be less weird
17:20:18 <hellais> simonv3: what if we make the number of blocked elements always visible in a stronger color (black or some darker shade of gray) and have only the total number of sites come in on mouse-over?
17:21:18 <simonv3> hmm, the total was less of a problem because it was on top - though the blocked count could be inside the bar chart, and I guess it doesn't need the "blocked" label
17:21:51 <simonv3> maybe getting rid of the "blocked" label and leaving the count to the side is enough
17:22:41 <simonv3> ("blocked" was constantly overlapping with other bars, which is why I faded it out in the first place)
17:23:08 <hellais> ah I see
17:23:43 <hellais> I think that if the blocked count is "blocked red" and sits to the side without the "blocked" label it's probably quite understandable
17:24:26 <hellais> it's also that for certain countries it's really hard to tell whether any blocking was detected at all, because the red section of the chart is so small in comparison to the total tested sites
17:24:41 <hellais> having an always visible label would improve that aspect, I believe
17:28:35 <simonv3> I think you're right
17:29:28 <hellais> also, I came to the conclusion that it's perhaps best not to give so much importance to the comparison of the control and experiment bodies in the http requests view
17:30:15 <hellais> the elements that should be highlighted the most are instead control_failure and the headers
17:31:11 <simonv3> Sounds good
17:31:26 <simonv3> Oh, I also did some work on the loader stuff - are you still getting errors with that?
17:32:02 <hellais> ah no, that's fixed now, thanks!
17:32:51 * ww waves hello
17:32:53 <simonv3> yay
17:33:32 <simonv3> hellais: I can also explore whether there's a diffing Angular plugin or something similar
17:33:47 <simonv3> and use that for the headers / body text (but collapse the body text)
17:35:02 <hellais> simonv3: for the headers that would be neat (though it's quite trivial to do the diff of the headers directly in the controller too), but I don't think we will get good results from diffing the response bodies
17:35:55 <hellais> in the end I think most of the heavy lifting there should be done at the pipeline stage, with better categorising of the response bodies similar to what dcf did here: https://lists.torproject.org/pipermail/ooni-dev/2015-September/000339.html
17:36:18 <ww> is https://ooni.torproject.org/reports/ the right place to be looking for data? it seems... somewhat stale...
17:36:57 <hellais> I have been looking at the stuff in his repository and there are some pretty good classifiers for censorship pages as well as cloudfront pages
17:36:58 <simonv3> hellais: ah okay
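hellais's remark that the header diff is trivial to do directly in the controller holds because the logic is just a comparison of two key-value maps. A sketch in Python of what such a diff looks like (the actual ooni-api controller is Angular/JavaScript; this only illustrates the logic):

    def diff_headers(control, experiment):
        """List the headers whose values differ between the control and the
        experiment measurement; a header missing on one side shows as None."""
        return [
            (name, control.get(name), experiment.get(name))
            for name in sorted(set(control) | set(experiment))
            if control.get(name) != experiment.get(name)
        ]

    # e.g. a filtering middlebox injecting its own Server header would yield
    # [("Server", "nginx", "FilterBox/1.0")]  (values invented for illustration)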
17:38:04 <hellais> ww: yes, sorry, that has not been updated in a while as we have been re-engineering the data processing pipeline. If you want access to the most up to date reports NOW, the best thing is probably for me to give you an s3 access key.
17:38:53 <hellais> also keep in mind that in the future reports will be published using JSON instead of YAML
17:40:29 <hellais> ww: I have a question for you, would it be most convenient for you to have access to the reports as individual files (one for each run of ooni-probe) or is it better to have daily batches that include all the measurements gathered on a specific day?
17:40:47 <ww> i see. i don't necessarily need a key just now, just to know the answer for when i will be asked. unless you think it would be useful for me to mirror it
17:41:09 <ww> also don't mind yaml vs. json, they're nearly isomorphic
17:41:49 <hellais> ww: what about having single report files for every experiment vs daily batches?
17:42:45 <ww> the main reason i can think of that it would matter is size.
17:43:05 <ww> if we have a lot of reports, and they are all in one file, many json parsers are not streaming
17:43:46 <ww> so i guess it depends on how soon we can expect to have data that will not comfortably fit into memory
17:44:24 <hellais> ww: well, when I say JSON what I mean is actually a JSON stream, that is, a series of JSON documents separated by newlines
17:44:44 <willscott> it'll be a while before a day's worth of measurements reaches that point
17:45:31 <hellais> the batch of measurements for yesterday was 1.2GB
17:45:53 <ww> that should be workable, newline separated documents (that don't contain newlines!)
17:46:31 <hellais> newlines are escaped anyway when they are inside a JSON string
17:46:33 <ww> more convenient (and efficient, if a read means an HTTP GET), i'd say, than many small documents
17:47:02 <ww> hellais: yes, so long as nobody gets the bright idea to pretty-print the json! (i have seen people do that)
17:47:22 <hellais> heh
17:48:13 <hellais> yeah, the reason for batching is performance: it's more convenient to download, and it's also more convenient when processing, as you don't have to seek to many parts of the disk to process the reports but instead do sequential reads of some GB
17:49:40 <hellais> I should probably post details about this potential change to the mailing list to see if there are any objections, because it could potentially impact some consumers of the data
17:50:23 <hellais> the only way I can see this being inconvenient for some is if they are interested in downloading only reports of a specific sort and instead have to download all the batches and extract the ones they are interested in
17:50:30 <hellais> (as is the case for the study done by dcf)
17:52:32 <ww> in principle some sort of select operation might be good...
17:54:41 <ww> but the indexes to support that should hang off the side of the flat datastore...
17:55:48 <hellais> ww: you mean like having some sort of mapping of offsets into the files that indicates where reports of a certain type start and end?
17:56:29 <ww> hellais: actually gchq has a nice slide about that -- the `blackhole - QFD' one
17:58:26 <hellais> haha
18:00:40 <hellais> ok, so I guess we should now swiftly move on to the next steps
18:03:38 <hellais> I will be working on 1) Refactoring and cleaning up the data pipeline and potentially re-running it on the data (when setting up the mirror I realised that a lot of reports are missing from the master node) 2) Collaborating with simonv3 on the ooni-api 3) Starting to design and specify the web_connectivity test 4) Restoring the publishing of the collected measurements EOF
18:04:55 <simonv3> I'll be focusing on the things we discussed above, and also incorporating the rest of the test data for each country that hellais made available last week
18:06:15 <sbs> I'll continue to work on Tor integration, fixing the bugs that remain and adding more test cases EOF
18:08:23 <ww> I'll set up a probe in AS60241
18:08:33 <ww> and another at UoE
18:08:49 <hellais> excellent!
18:08:49 <ww> and wait for students EOF
18:09:11 <hellais> thank you all for attending! Have fun!
18:09:18 <hellais> #endmeeting
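The "JSON stream" format settled on above (one JSON document per line, newlines escaped inside strings) is what makes the 1.2GB daily batches workable without a streaming JSON parser, and the offset mapping hellais sketches at 17:55 can be built in the same single pass. A minimal illustration in Python, assuming each document carries OONI's test_name field; the batch filename is invented:

    import json
    from collections import defaultdict

    def index_batch(path):
        """Stream a newline-delimited JSON batch one document at a time,
        recording the byte offset of each measurement by test type."""
        offsets = defaultdict(list)
        with open(path, "rb") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                measurement = json.loads(line.decode("utf-8"))
                offsets[measurement.get("test_name")].append(offset)
        return offsets

    def measurements_of_type(path, offsets, test_name):
        """The 'select operation' ww asks for: read only one type of report
        by seeking to the indexed offsets, leaving the batch file flat."""
        with open(path, "rb") as f:
            for offset in offsets.get(test_name, []):
                f.seek(offset)
                yield json.loads(f.readline().decode("utf-8"))

    # offsets = index_batch("2016-01-17.json")        # filename invented
    # for m in measurements_of_type("2016-01-17.json", offsets, "http_requests"):
    #     ...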