17:00:33 <hellais> #startmeeting
17:00:33 <MeetBot> Meeting started Mon Jan 25 17:00:33 2016 UTC.  The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:33 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:00:40 <hellais> so, who is here?
17:00:42 <landers> here
17:00:57 * sbs is here!
17:01:57 * willscott is here
17:01:57 <poly> hello!
17:02:04 * poly is here
17:03:46 * hodgepodge_ is here, kind of
17:04:09 * anarcat is here but has no clue what the meeting is about :)
17:05:19 <hellais> anarcat: it's basically a chance to report back on what we have been up to last week, discuss topics that you would like to bring to the attention of everybody and say what your plans are for next week.
17:05:30 <hellais> so what have you been up to this week?
17:06:36 * simonv3 here
17:07:04 <hodgepodge_> I'll go: I was working on adding parallelism to ooni-lyzer in order to take advantage of the luigi scheduler, and to experiment a little bit with distributed workloads. I've implemented a fan-out topology to fetch reports on a per-date basis, but I'm not sure how to fan back in. If anyone knows, let me know!
17:07:19 <hodgepodge_> I also worked a little more on httpt test normalisation.
17:07:20 <hodgepodge_> EOF
17:07:49 <landers> i've been travelling the past month, got home yesterday; today am paging things back into memory and catching up on the latest developments. EOF
17:10:43 <anarcat> i traveled to cuba and ran ooni-probe there and figured out i don't know what to do with the results, but apparently internet is generally not censored in fancy cuban hotels for tourists
17:10:44 <sbs> I made some changes to the code for integrating Tor into MK that I'm working on, based on feedback from hellais. I have also reviewed the new web_connectivity test spec, and really appreciated it. EOF
17:10:59 <anarcat> i am going to write a blog post about this soonish, probably today, on https://anarc.at/blog/
17:11:06 <anarcat> EOF
17:11:27 <hellais> I have been working with simonv3 on the frontend to the reports, written a specification for the new web_connectivity test, set up SSL on the postgres database at humboldt, reviewed some pending PRs, and started to figure out how to integrate the work of hodgepodge into our pipeline EOF
17:12:02 <hellais> hodgepodge: what do you mean by "fan back in"?
17:13:27 <hellais> hodgepodge: I also did some benchmarks of pickle and it seems like it's actually slower to use pickle than the ujson parser (about 2.2x slower)
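
Sketch: a minimal way to run this kind of pickle vs ujson comparison. The record below is made up, so the exact ratio will vary with real report data.

    import pickle
    import timeit

    import ujson

    # A small stand-in for a report entry; real entries are larger and nested.
    record = {"report_id": "a" * 40,
              "test_keys": {"requests": [{"url": "http://example.com/",
                                          "body": "x" * 512}]}}
    as_json = ujson.dumps(record)
    as_pickle = pickle.dumps(record)

    json_time = timeit.timeit(lambda: ujson.loads(as_json), number=100000)
    pickle_time = timeit.timeit(lambda: pickle.loads(as_pickle), number=100000)
    print("ujson: %.2fs  pickle: %.2fs  (pickle/ujson: %.1fx)"
          % (json_time, pickle_time, pickle_time / json_time))
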
17:14:09 <hodgepodge> I want to do this: Identify(date_range) <- Fetch(date) <- Normalise(date) <- Sanitise(date)
17:14:39 <hodgepodge> Aw, that's no good. I'll switch to the ujson parser, then, so it's easier to merge my work into ooni-pipeline. That's good to know, though.
17:15:23 <hodgepodge> Right now, it does this: Identify(date) <- Fetch(date) <- Normalise(date) <- Sanitise(date for date in dates)
17:15:54 <hodgepodge> Which is okay, but I'd rather not have to do n key listings instead of a single key listing for all of the date prefixes.
17:18:16 <hodgepodge> If you want to run the pipeline across all of the date prefixes, that's easily 1000+ prefixes to search. Listing and filtering all of the prefixes takes about 15 seconds for me at home.
17:21:28 <hellais> hodgepodge: I would instead have put a WrapperTask on top of it all that generates the tasks with subdependencies and do the listing inside of that. I started to outline the structure of it here: https://github.com/TheTorProject/ooni-pipeline/blob/feature/refactor-luigi/pipeline/batch/daily_workflow.py#L543
17:23:30 <hodgepodge> Would that work for me since Identify() is an ExternalTask? What would the benefit be? I've never worked with that class before
17:26:24 <hodgepodge> From the luigi docs: "Use for tasks that only wrap other tasks and that by definition are done if all their requirements exist."
17:26:39 <hellais> hodgepodge: what I suggest is to drop the identify task and do the listing of the reports on each run. At that point the fetch task would become an ExternalTask. The benefit is that of being able to run any branch of the DAG without having to build the full tree.
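
Sketch: a rough outline of the structure hellais suggests, with hypothetical task and helper names (list_report_dates is made up). The wrapper does the single listing and fans out per date; it only completes when every per-date branch is done, which is also the fan-in.

    import datetime

    import luigi

    def list_report_dates():
        # Hypothetical helper: in the real pipeline this would be the single
        # listing of all the date prefixes in the report bucket.
        return [datetime.date(2016, 1, day) for day in range(1, 4)]

    class FetchReports(luigi.ExternalTask):
        # ExternalTask: the raw reports for a date are assumed to already
        # exist (uploaded by the collectors), so there is no run() step.
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget("raw/{}/".format(self.date))

    class NormaliseReports(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return FetchReports(date=self.date)

        def output(self):
            return luigi.LocalTarget("normalised/{}.json".format(self.date))

        def run(self):
            with self.output().open("w") as out:
                out.write("{}")  # placeholder for the actual normalisation

    class DailyWorkflow(luigi.WrapperTask):
        # The wrapper generates the per-date subtrees; any branch of the DAG
        # can also be run on its own without building the full tree.
        def requires(self):
            return [NormaliseReports(date=d) for d in list_report_dates()]
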
17:27:25 <simonv3> Here’s my update real quick: I spent a couple of hours trying to fix small things in the ooni-api and building in some more visual niceties. I’m wondering what kind of information we want for the vendors - I think it’s just a fake list at the moment right? (There’s “Squid” in Saudi Arabia, I think?)
17:29:49 <hodgepodge> Gotcha. That seems kind of neat, but it doesn't address the performance issues associated with fetching keys n times as opposed to once. Do you have any ideas as to how I might be able to accomplish that? I've noticed you can use dicts for workflow definition - maybe I'd need to explore that avenue?
17:29:52 <simonv3> (EOF, I guess)
17:31:35 <hellais> simonv3: the list is not a "fake list" it's currently being generated from the actual measurements. In SA they are actually using a solution from a local company called "WireFilter"
17:32:41 <poly> As agreed a while ago, I put together an example visualization for Network Meter's output. I've put up the graphic with an explanation of how I think we'll implement data visualization inside NM. sbs has agreed to review it soon
17:33:04 <poly> Find that here: https://github.com/measurement-kit/network-meter/issues/31
17:33:32 <poly> Once this is implemented, I think we would be pretty close to the first fully functional version of network meter :)
17:35:28 <simonv3> hellais: ah okay
17:35:49 <hellais> hodgepodge: tbh I don't think it's a major issue since the bulk of the operation is in the download and parsing
17:36:35 <poly> (EOF)
17:37:19 <hellais> simonv3: I have also made a CSV with the censorship methods by hand. I am thinking of making another table for this, though currently there is no simple way of accurately determining it from the existing tests; it requires a bit of manual digging.
17:42:42 <simonv3> I’m not sure I understand
17:43:10 <simonv3> You mean it’s not easy to tell from the measurements what the censorship method was?
17:44:13 <hellais> simonv3: basically the techniques used that can be inferred from the reports are one of the following: Transparent HTTP proxy, DNS hijacking, RST packet injection
17:45:01 <simonv3> right
17:45:02 <hellais> simonv3: however, to learn that, what needs to be done is to compare the results from a dns_consistency test with those from an http_requests test and see whether or not the DNS responses have been tampered with
17:45:20 <simonv3> gotcha
17:46:06 <hellais> however, since the dns_consistency test presents a lot of false positives, I go through them manually and inspect the DNS answers to see whether the IP we got a response from is actually that of a blockpage or not
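
Sketch: a toy version of that manual check, as a hypothetical helper. The blockpage addresses below are TEST-NET placeholders, not real ones.

    # Illustrative only: real blockpage IPs would come from the manual digging.
    KNOWN_BLOCKPAGE_IPS = {"192.0.2.1", "198.51.100.7"}

    def looks_like_dns_tampering(control_answers, experiment_answers):
        control, experiment = set(control_answers), set(experiment_answers)
        if experiment <= control:
            # Same answers as the control resolver: consistent, not tampered.
            return False
        # Divergent answers are only flagged when they point at a known
        # blockpage, which is what cuts down the dns_consistency false positives.
        return bool(experiment & KNOWN_BLOCKPAGE_IPS)
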
17:49:45 <hellais> I would say we move into the next steps at this point
17:52:47 <hellais> I plan to continue working on the data processing pipeline, start updating the test specifications to reflect the normalisations that we plan on doing, create the tables for the censorship method, ideally reach a release candidate for the ooni-api EOF
17:55:55 <simonv3> release candidate sounds good
17:59:52 <sbs> I plan to continue development of Tor integration and to review this nice JSON library to see whether it makes sense to use it as MK dependency https://github.com/nlohmann/json EOF
18:00:03 <hellais> hodgepodge: we should also come up ASAP with a way of joining our efforts on the data pipeline, because it doesn't make sense for us to duplicate effort in developing two pipelines. I think that overall, for the normalisations, yours is better, but it is still missing 1) the insertion of the reports into postgres 2) the fixing of the bodies of the HTTP responses (encoding the binary ones, stripping null characters and
18:00:10 <hellais> validating that they are valid unicode)
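
Sketch: one way the body fixing described above could look (assumed behaviour, not the pipeline's actual code): keep valid UTF-8 bodies as text with null characters stripped, and base64-encode anything that is not valid unicode.

    import base64

    def normalise_body(raw_body):
        # raw_body is assumed to be a bytes value taken from an HTTP response.
        if raw_body is None:
            return None
        try:
            text = raw_body.decode("utf-8")
            return {"format": "text", "body": text.replace("\x00", "")}
        except UnicodeDecodeError:
            return {"format": "base64",
                    "body": base64.b64encode(raw_body).decode("ascii")}
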
18:01:08 <hodgepodge> Yeah - it would definitely be a good idea to merge the two pipelines sooner, rather than later. I've intentionally left out metric insertion for the time being because report normalisation isn't working yet.
18:01:46 <hodgepodge> Well, it's working, but the associated reports aren't actually normalised properly afterwards.
18:08:14 <hodgepodge> While we're integrating ooni-lyzer into ooni-pipeline we should also look into removing dead code. Right now, the pipeline is on the order of 4.5k SLoC
18:08:21 <hellais> we also need to make sure that the normalisation procedure also works well in the case where the reports are generated already normalised by ooni-probe
18:08:44 <hellais> hodgepodge: yes I began pruning a bunch of dead code in that branch
18:08:53 <hodgepodge> Wait, what? Does ooni-probe perform normalisation now?
18:09:32 <hellais> hodgepodge: no it doesn't, but I would like that by release 1.4.0, once we have established what the best data format to use is, we change how it gets generated by ooni-probe
18:09:57 <hellais> this would also entail making ooni-probe generate reports in JSON format directly so we can reduce the load on the parsing stage
18:10:31 <hodgepodge> Ah, I see...
18:12:48 <hodgepodge> It might make more sense to have test result normalisation in one place - submitting to collectors in a canonical data format is a good idea, though.
18:13:18 <hodgepodge> I'm not really sure, though, tbh.
18:13:52 <hodgepodge> Theoretically, normalisation could also happen post-submission to the collector.
18:18:41 <hellais> It's just that I don't particularly like the idea of having the report that is generated by the probe mismatch the final report that ends up being published. Also, there are some obvious absurdities in the data format generated by probes that I feel must be changed (for example having some values nested inside of a list with only 1 item, the repr of the queries, etc.)
18:20:07 <hodgepodge> Yeeeaaaaahhhhh.
18:20:15 <hodgepodge> I didn't want to say anything.
18:20:50 <hodgepodge> It would have been easy to switch the data format out way back when, but, now, not so much.
18:21:58 <hellais> heh
18:22:39 <hellais> we can anyways use the data_format_version key to learn what entries require which sort of normalisation
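
Sketch: a hypothetical dispatcher along those lines. The version strings and the normalise_yaml_era_entry name are assumptions for illustration, not taken from the pipeline.

    def normalise_yaml_era_entry(entry):
        # Placeholder for the heavier normalisation the old YAML-derived
        # entries need (un-repr'ing queries, fixing bodies, and so on).
        return entry

    def normalise_entry(entry):
        version = entry.get("data_format_version", "0.1.0")
        if version.startswith("0.1"):
            return normalise_yaml_era_entry(entry)
        # Newer entries are assumed to already be normalised by ooni-probe.
        return entry
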
18:23:54 <hellais> I think the root cause of all this has been the fact that we began doing automated batch analysis of the reports too late, and that we adopted YAML as a data format, which leads you to be lazy (basically anything can be serialised in YAML and you don't have to think too much about the consequences)
18:24:10 <hodgepodge> ^
18:24:35 <hodgepodge> That sounds accurate to me. Hence why I'm using evil() sometimes.
18:24:40 <hodgepodge> Ahem. eval()*
18:25:24 <willscott> hellais: switching to json doesn't solve that
18:26:17 <hodgepodge> He wants to use json because using yaml is slow as hell. Odds are, you can serialize buttloads of weird stuff in yaml that the json serializer would complain about.
18:26:30 <hodgepodge> Now, we have buttloads of repr() and eval()
18:27:57 <hellais> willscott: it does solve some issues though, like the fact that it leads you to consider the type of certain values a bit more, for example encoding binary data in a way that is different from how you encode strings
18:28:38 <willscott> not saying it doesn't :) just that having a pre-defined schema is a separate and also probably good thing to do
18:29:25 <hodgepodge> We should probably tweak the base data format for ooni-probe to ensure that only primitive data types and arrays, etc. can be used in JSON.
18:29:48 <hodgepodge> (I know, primitives is ambiguous, but, you all know what I mean)
18:30:24 <hodgepodge> Even though repr(mycoolfunction()) is a string.
18:30:53 <hodgepodge> Or rather, repr(mycoolfunction)
18:31:04 <hellais> hodgepodge: absolutely
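
Sketch: a small, hypothetical check of the "only JSON primitives" idea; this is not part of ooni-probe.

    JSON_PRIMITIVES = (str, int, float, bool, type(None))

    def only_json_primitives(value):
        # True when the value can be serialised by the json module without
        # falling back to repr() tricks.
        if isinstance(value, dict):
            return all(isinstance(key, str) and only_json_primitives(item)
                       for key, item in value.items())
        if isinstance(value, (list, tuple)):
            return all(only_json_primitives(item) for item in value)
        return isinstance(value, JSON_PRIMITIVES)
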
18:33:46 <hodgepodge> Is there a way to detect repr() and eval() in Python? I know you could use reflection in Java to do that.
18:34:14 <hodgepodge> I guess you could repr() the ooni-probe test and search for any of those two function calls.
18:36:35 <hellais> hodgepodge: the only place where repr is used is actually the tests based on the DNS test template, and in that case it's not too difficult to parse the needed fields without having to eval
18:36:45 <hellais> even with just regular expressions
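
Sketch: pulling the fields out of a repr'd DNS answer with a regular expression instead of eval(). The answer shape below is only an illustration; the exact format in real reports would need to be checked first.

    import re

    # Assumes answers that look roughly like
    # "<RR name=example.com type=A class=IN ttl=300s auth=False>".
    ANSWER_RE = re.compile(
        r"<RR name=(?P<name>\S+) type=(?P<type>\S+) .*?ttl=(?P<ttl>\d+)")

    def parse_answer(answer_repr):
        match = ANSWER_RE.search(answer_repr)
        if match is None:
            return None
        return {"name": match.group("name"),
                "type": match.group("type"),
                "ttl": int(match.group("ttl"))}
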
18:37:11 <hodgepodge> Yeah, you mentioned that on jabber a little while ago.
18:41:21 <hellais> anyways I think this meeting has gone a bit overtime so we should close it here
18:41:26 <hellais> thanks for attending!
18:41:27 <hellais> #endmeeting