17:01:22 #startmeeting
17:01:22 Meeting started Mon May 23 17:01:22 2016 UTC. The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:01:22 Useful Commands: #action #agreed #help #info #idea #link #topic.
17:01:25 * willscott waves
17:01:26 indeed!
17:01:31 hi landers
17:01:51 hey MeetBot
17:02:30 * sbs here
17:03:55 great, so let's get this started
17:04:25 great! what are we talking about today?
17:04:50 so I had proposed one item for today's agenda, I guess we can start with that
17:05:04 #topic providing dumps of views of the ooni measurement data
17:05:32 who would this be for?
17:06:22 I guess people that are interested in working with the OONI data, but for some reason don't want to download the full dump of the data and do minimisation of the records themselves
17:07:30 like, slice by test and country?
17:07:37 hi
17:07:41 so, I guess I have two thoughts here:
17:08:03 1. it would be great to have a way to download a "full dump" that's more obvious and easier than spidering measurements.
17:08:16 e.g. a full tarball
17:08:36 landers: I don't think the slicing is the issue; it's mainly about minimising the records and providing them in a format that works better with tools people are familiar with
17:09:07 2. there's some bar of effort that we want to maintain for people doing research, so that they're serious enough about using the data that they take the time to understand what it is
17:09:21 as an example, I got an email from a researcher who was interested in using the OONI data and told me that they couldn't load it into their "data analysis tool", and then went on to say that loading it into Excel didn't work either.
17:09:26 (related: https://lists.torproject.org/pipermail/ooni-dev/2016-January/000374.html)
17:09:45 if we can't guarantee that a view is actually what we want someone to use without context, it could easily turn into someone misrepresenting the data
17:09:49 bar of effort > excel?
17:10:40 is yaml -> csv some easy thing we could do to make this person happy?
17:10:45 i think excel users would generally be better off getting a downstream analysis ooni produces, not raw data
17:10:51 hmm, that argument about the bar of effort is something I hadn't considered
17:11:24 similarly: researchers from Amsterdam's Digital Methods Initiative have expressed interest in working with the OONI data, but the tools they said they are familiar with probably can't help much with the current data format
17:11:33 by which i mean events and classification of interference, similar to what you can see on explorer
17:11:58 willscott: what is "a downstream analysis ooni produces"?
17:12:48 finding the reports that are flagged, maybe with a couple of levels of "how sure we are this is really an anomaly", and getting date, location, type of interference
17:13:59 i guess i'm thinking of processing similar to what explorer does to generate its map
17:14:29 i guess my high-level feeling is that this would be nice, but is probably a bit lower priority than some of the other work we have
17:14:53 the current situation also has the (benefit?) that other groups tend to get in touch with OONI
17:16:37 currently there is no export functionality from the pipeline, many tests are not being listed, and even blockpage fingerprints cannot be integrated into the current pipeline due to processing difficulties
17:17:12 so having an "inaccurate" image of what is out there in terms of data is not that nice
17:18:38 however, finding some tools that process JSON files and polish our current raw format is doable; it doesn't require a long development process and gives the complete picture of what data OONI has
17:20:31 perhaps we can provide some csvs with some basic types of data, while explaining obviously that they only include small subsets of the overall data?
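The "minimisation" discussed above — turning raw JSON/YAML measurements into a flat CSV that Excel-style tools can open — could be sketched roughly as follows. This is a minimal illustration, not pipeline code; the sample records and the choice of fields (`probe_cc`, `test_name`, `test_keys.body_length_match`) are assumptions standing in for real OONI report keys.

```python
import csv
import io
import json

def flatten_measurements(json_lines, fields):
    """Flatten newline-delimited JSON measurements into CSV.

    `fields` are dot-separated paths into each record; a missing key
    becomes an empty cell rather than raising an error.
    """
    def lookup(record, path):
        node = record
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                return ""
            node = node[key]
        return node

    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(fields)          # header row
    for line in json_lines:
        record = json.loads(line)
        writer.writerow([lookup(record, f) for f in fields])
    return out.getvalue()

# Hypothetical minimal records; real OONI reports carry many more keys.
reports = [
    '{"probe_cc": "VE", "test_name": "http_requests",'
    ' "test_keys": {"body_length_match": false}}',
    '{"probe_cc": "NL", "test_name": "http_requests",'
    ' "test_keys": {"body_length_match": true}}',
]
print(flatten_measurements(
    reports, ["probe_cc", "test_name", "test_keys.body_length_match"]))
```

The same approach would work for per-test views (blockpage detection, failure codes), with a different field list per CSV.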
17:21:26 this could encourage some researchers who aren't familiar with JSON to get started with (some) OONI data and, depending on their research questions, we could potentially provide them with more/different csvs or other data sets
17:23:12 what are some classification/aggregation things we'd feel comfortable putting a stamp on and sending downstream? blockpage detection?
17:24:29 perhaps a similar question to landers' one: what are those people who contacted us interested in?
17:26:10 unclear, though you're right that depending on their questions, we would potentially create relevant data sets
17:26:42 i guess maybe the other way to think about this is: if we put in the work to create a view of the data for such a group, it would be great to build that in a reproducible / regularly published way, rather than as a one-off
17:26:55 willscott: agree!
17:27:00 agreed
17:27:04 landers: I think blockpage detection is a simple thing that we can feel confident about putting a "stamp" on.
17:27:47 Other things that I think could be useful, but are subject to misinterpretation, are the failure codes of the http_requests test
17:28:16 the middlebox stuff could also be an interesting view to present, though that is perhaps a bit more tricky.
17:29:54 hellais: though we can come up with some values for the middlebox stuff, such as fingerprints, anomalies detected (true or false), etc.
17:30:34 we can probably also create values for the reachability of tor bridges and other circumvention tools, in a similar fashion
17:32:38 I will add to the list the dns_consistency test reports that return records pointing to 127.0.0.1
17:32:52 anadahz: yeah, makes sense!
17:33:21 what about talking with these interested people, presenting the various possibilities, and finding out which ones they are most interested in for their research?
17:33:42 sbs: +1
17:33:51 sure
17:34:15 though I think the idea is also to provide certain types of data in a format that people can easily use, regardless of their questions
17:34:35 perhaps we can start off by creating some csvs for http_requests, middleboxes and reachability of services?
17:35:35 anadahz: dns responses to private networks in general
17:36:06 In a previous talk with hellais I mentioned that we should also bring back the per-country reports, and add some extra slices of per-test and per-AS reports as directory structures, in addition to our current daily raw reports
17:36:25 in looking for DNS responses to private IP space I have found some hostnames that consistently resolve to private IP space across resolvers
17:36:39 I mean, it's not necessarily an indication of censorship
17:36:53 it's still perhaps a good start
17:37:06 true, but there could also be hostnames that resolve to 127.0.0.1 consistently :p
17:37:39 but the thing would be whether that matches with the control or not
17:37:46 right?
17:39:26 andresazp: yeah, I guess looking at the control could filter those out
17:40:09 cool. this seems like a line of coding that would be nice to have on the server
17:40:19 should we also spend some time syncing on web_connectivity?
17:40:31 hellais: do you have a branch ready for others to review?
17:40:42 willscott: yes I do
17:41:01 https://github.com/TheTorProject/ooni-probe/pull/457
17:41:57 I have also set up a test bouncer for use with that branch. Here I explain the one-liner to run it: https://github.com/TheTorProject/ooni-probe/pull/457#issuecomment-219748547
17:42:52 the rate of false positives is much lower than with the http_requests test, but there are still some cases of false positives.
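The heuristic discussed above — flag DNS answers in private IP space, but use the control measurement to filter out hostnames that legitimately resolve there — could be sketched like this. A minimal illustration under assumptions: the function name and the sample answers are made up, and a real check would compare full resolver/control result sets from the reports.

```python
import ipaddress

def suspicious_answers(experiment_ips, control_ips):
    """Return A-record answers in private/loopback space that the
    control vantage point did NOT also return.

    Answers the control also sees (e.g. a hostname that genuinely
    resolves to a private address everywhere) are filtered out.
    """
    control = {ipaddress.ip_address(ip) for ip in control_ips}
    flagged = []
    for ip in experiment_ips:
        addr = ipaddress.ip_address(ip)
        if (addr.is_private or addr.is_loopback) and addr not in control:
            flagged.append(ip)
    return flagged

# Hypothetical measurement: the probe sees 127.0.0.1, the control does not.
print(suspicious_answers(["127.0.0.1", "93.184.216.34"], ["93.184.216.34"]))
# -> ['127.0.0.1']
```

As noted in the discussion, this alone is not proof of censorship, but it is a cheap server-side filter before deeper analysis.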
17:43:32 geo-localised content is the main issue here, and I don't think there is a simple way to overcome this at the end of the day
17:48:47 going back to the measurement data views
17:51:24 can we support changes to the pipeline in such a way that helps us provide any usable measurement data views without re-developing major parts of the pipeline?
17:54:10 if not, then this should happen first on the raw data reports
17:56:26 --pausing-- since we are running out of time
17:57:07 andresazp: we had some discussion about the deprecation of the dns_consistency and http_requests tests at last meeting
17:57:41 I'm sorry I couldn't attend
17:57:54 I will check it out
17:57:56 hellais: left a couple of comments on your pull request
17:58:02 but in general my two cents are:
17:58:34 please don't deprecate dns_consistency, or allow the new web_connectivity to run only DNS tests
17:58:56 only* as an attribute passed somehow
17:59:12 anadahz: we can do it quick and dirty by exporting the data from the database, but I think the proper way to do this is to start the process of reducing some load from the database layer and doing the analysis and minimisation using the JSON files as the source.
17:59:52 anadahz: so I guess the answer is: yes, some re-development of the data pipeline is going to be needed.
18:00:01 dns_consistency is incredibly useful, catches a huge amount of censorship cases, and connectivity-wise is very light
18:00:08 andresazp: log of the previous meeting http://meetbot.debian.net/ooni/2016/ooni.2016-05-16-17.01.log.html
18:01:59 andresazp: the problem with the dns_consistency test is that it leads to a lot of false positives, due to geolocation-based DNS load balancing. With the web_connectivity test you get what the dns_consistency test does + tcp_connect + http_requests
18:02:31 andresazp: is the concern that too much traffic is being generated?
18:03:05 hellais: I understand; the thing is that in low-bandwidth contexts you might want http_requests running at the same frequency
18:03:11 hellais: the "false positives" are us not doing enough analysis
18:03:15 as the other two
18:03:33 if we look at reverse PTRs, whois info, and CNAMEs we ought to be able to do a pretty good job on DNS alone
18:04:06 might not want* sorry
18:04:53 in VE we even had to schedule dns and http_requests in such a way as to minimize impact on the host household connection
18:05:16 the actual details of the schedule are in an email I sent you today
18:05:21 for the corius
18:05:43 curious*
18:06:49 willscott: we used to do PTR lookups as well in the dns_consistency test, but it turns out that PTR records are very rarely properly configured. whois info is also not always going to help you much if the IP in question is owned by some CDN. What do you mean by CNAMEs?
18:06:52 but I understand that we might have an edge case
18:07:04 compared to other deployments
18:08:10 akamai has PTRs configured, but whois won't be helpful. whois works for cloudflare; generally one of the two will
18:08:56 if you dig www.apple.com, you'll see that it indirects through "www.apple.com.edgekey.net" -> "www.apple.com.edgekey.net.globalredir.akadns.net" -> "e6858.dscc.akamaiedge.net"
18:09:29 seeing the expected CDN aliases is also a good sign
18:10:44 yes I see
18:11:22 In a similar vein, I also find MX records pretty useful: almost every DNS-hijacking "blocking implementation" disregards the MX records, although they were never instructed to do so
18:12:46 i.e. even if the website "should" be (according to a regulation) blocked, email communication should not be blocked in any way
18:13:25 or at least I haven't heard/read of blocking requests to MX records apart from SPAM reasons
18:13:30 (what's the reason for apple's cname -> cname -> a record thing?)
18:15:02 landers: it's how akamai distributes to the nearest edge server.
the first cname is set by apple to delegate to akamai. the third is akamai delegating to a specific physical place
18:15:12 dunno what the second one is :)
18:15:20 ah cool
18:16:53 andresazp: On your point about deprecating dns_consistency, I guess we can keep that one, but drop http_requests and tcp_connect. Would that work for you? I mean, would there be a reason for you to still prefer http_requests over web_connectivity?
18:20:05 reading last week's conversation I'm even more convinced of the importance of the web_connectivity test
18:20:06 we didn't use tcp_connect in VE out of ignorance on our part, but we would have loved to
18:20:07 However, for low-bandwidth applications http_requests can be too network-demanding; an option of just running the rest of the test (dns and tcp) in the same run would be ideal
18:20:40 no reason at all for http_requests over web_connectivity
18:22:15 (sidenote: I'm sorry for all the typos, dyslexia + lack of spellcheck)
18:24:30 we certainly shouldn't wait over a month to contact people we like. we're likely to lose many applicants that way
18:24:39 whoops, wrong channel
18:29:18 anything else for today, or should we end this meeting?
18:30:08 andresazp: sounds good :)
18:30:19 yes, I guess we can end this, thank you all for attending
18:30:22 #endmeeting
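[Editor's note] The "seeing the expected CDN aliases is also a good sign" heuristic from the DNS discussion above could be sketched as follows. This is an illustration only: the suffix list is a small made-up sample, not an exhaustive or authoritative set of CDN domains, and a real check would need a maintained fingerprint list.

```python
# Known CDN CNAME suffixes (illustrative sample, not exhaustive).
CDN_SUFFIXES = (
    ".akadns.net",
    ".akamaiedge.net",
    ".edgekey.net",
    ".cloudfront.net",
)

def looks_like_cdn(cname_chain):
    """True if any alias in a CNAME chain points at a known CDN,
    which makes a censorship false positive less likely."""
    return any(
        alias.rstrip(".").endswith(suffix)
        for alias in cname_chain
        for suffix in CDN_SUFFIXES
    )

# The www.apple.com chain quoted in the discussion above.
chain = [
    "www.apple.com.edgekey.net",
    "www.apple.com.edgekey.net.globalredir.akadns.net",
    "e6858.dscc.akamaiedge.net",
]
print(looks_like_cdn(chain))                      # True
print(looks_like_cdn(["blockpage.example.net"]))  # False
```

Combined with reverse PTRs and whois, this is the kind of extra analysis that could cut down the geolocation-load-balancing false positives without running full HTTP requests.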