17:01:22 #startmeeting
17:01:22 Meeting started Mon May 23 17:01:22 2016 UTC. The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:01:22 Useful Commands: #action #agreed #help #info #idea #link #topic.
17:01:25 * willscott waves
17:01:26 indeed!
17:01:31 hi landers
17:01:51 hey MeetBot
17:02:30 * sbs here
17:03:55 great, so let's get this started
17:04:25 great! what are we talking about today?
17:04:50 so I had proposed one item for today's agenda, I guess we can start with that
17:05:04 #topic providing dumps of views of the ooni measurement data
17:05:32 who would this be for?
17:06:22 I guess people that are interested in working with the OONI data, but for some reason don't want to download the full dump of the data and do minimisation of the records themselves
17:07:30 like, slice by test and country?
17:07:37 hi
17:07:41 so, I guess I have two thoughts here:
17:08:03 1. it would be great to have a way to download a "full dump" that's more obvious and easier than spidering measurements.
17:08:16 e.g. a full tarball
17:08:36 landers: I don't think the slicing is the issue; it's mainly about minimising the records and providing them in a format that works better with tools people are familiar with
17:09:07 2. there's some bar of effort that we want to maintain for people doing research, so that they're serious enough about using the data that they take the time to understand what it is
17:09:21 as an example, I got an email from a researcher who was interested in using the OONI data and told me that they couldn't load it into their "data analysis tool", and then went on to say that loading it into Excel didn't work either.
17:09:26 (related: https://lists.torproject.org/pipermail/ooni-dev/2016-January/000374.html)
17:09:45 if we can't guarantee that a view is actually what we want someone to use without context, it could easily turn into someone misrepresenting the data
17:09:49 bar of effort > excel?
17:10:40 is yaml -> csv some easy thing we could do to make this person happy?
17:10:45 i think excel users would generally be better off getting a downstream analysis ooni produces, not raw data
17:10:51 hmm, that argument about the bar of effort is something I hadn't considered
17:11:24 similarly: researchers from Amsterdam's Digital Methods Initiative have expressed interest in working with the OONI data, but the tools they said they are familiar with probably can't help much with the current data format
17:11:33 by which i mean events and classification of interference, similar to what you can see on explorer
17:11:58 willscott: what is "a downstream analysis ooni produces"?
17:12:48 finding the reports that are flagged, maybe with a couple of levels of "how sure we are this is really an anomaly", and getting date, location, type of interference
17:13:59 i guess i'm thinking of processing similar to what explorer does to generate its map
17:14:29 i guess my high-level feeling is that this would be nice, but is probably a bit lower priority than some of the other work we have
17:14:53 the current situation also has the (benefit?) that other groups tend to get in touch with OONI
17:16:37 currently there is no export functionality from the pipeline, many tests are not being listed, and even blockpage fingerprints cannot be integrated into the current pipeline due to processing difficulties
17:17:12 so having an "inaccurate" image of what is out there in terms of data is not that nice
17:18:38 however, finding some tools that process JSON files and polish our current raw format is doable; it doesn't require a long development process and gives the complete picture of what data OONI has
17:20:31 perhaps we can provide some csvs with some basic types of data, while explaining obviously that they only include small subsets of the overall data?
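The "minimisation" discussed above — turning raw JSON/YAML measurements into a flat CSV that Excel-style tools can open — could be sketched roughly as follows. This is a minimal illustration, not pipeline code; the sample records and the choice of fields (`probe_cc`, `test_name`, `test_keys.body_length_match`) are assumptions standing in for real OONI report keys.

```python
import csv
import io
import json

def flatten_measurements(json_lines, fields):
    """Flatten newline-delimited JSON measurements into CSV.

    `fields` are dot-separated paths into each record; a missing key
    becomes an empty cell rather than raising an error.
    """
    def lookup(record, path):
        node = record
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                return ""
            node = node[key]
        return node

    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(fields)          # header row
    for line in json_lines:
        record = json.loads(line)
        writer.writerow([lookup(record, f) for f in fields])
    return out.getvalue()

# Hypothetical minimal records; real OONI reports carry many more keys.
reports = [
    '{"probe_cc": "VE", "test_name": "http_requests",'
    ' "test_keys": {"body_length_match": false}}',
    '{"probe_cc": "NL", "test_name": "http_requests",'
    ' "test_keys": {"body_length_match": true}}',
]
print(flatten_measurements(
    reports, ["probe_cc", "test_name", "test_keys.body_length_match"]))
```

The same approach would work for per-test views (blockpage detection, failure codes), with a different field list per CSV.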
17:21:26 this could encourage some researchers who aren't familiar with JSON to get started with (some) OONI data and, depending on their research questions, we could potentially provide them with more/different csvs or other data sets
17:23:12 what are some classification/aggregation things we'd feel comfortable putting a stamp on and sending downstream? blockpage detection?
17:24:29 perhaps a similar question to landers' one: what are those people who contacted us interested in?
17:26:10 unclear, though you're right that depending on their questions, we would potentially create relevant data sets
17:26:42 i guess maybe the other way to think about this is: if we put in the work to create a view of the data for such a group, it would be great to build that in a reproducible / regularly published way, rather than as a one-off
17:26:55 willscott: agree!
17:27:00 agreed
17:27:04 landers: I think blockpage detection is a simple thing that we can feel confident about putting a "stamp" on.
17:27:47 Other things that I think could be useful, but are subject to misinterpretation, are the failure codes of the http_requests test
17:28:16 the middlebox stuff could also be an interesting view to present, though that is perhaps a bit more tricky.
17:29:54 hellais: though we can come up with some values for the middlebox stuff, such as fingerprints, anomalies detected (true or false), etc.
17:30:34 we can probably also create values for the reachability of tor bridges and other circumvention tools, in a similar fashion
17:32:38 I will add to the list the dns_consistency test reports that return records pointing to 127.0.0.1
17:32:52 anadahz: yeah, makes sense!
17:33:21 what about talking with these interested people, presenting the various possibilities, and finding out which ones they are most interested in for their research?
17:33:42 sbs: +1
17:33:51 sure
17:34:15 though I think the idea is also to provide certain types of data in a format that people can easily use, regardless of their questions
17:34:35 perhaps we can start off by creating some csvs for http_requests, middleboxes and reachability of services?
17:35:35 anadahz: dns responses to private networks in general
17:36:06 In a previous talk with hellais I mentioned that we should also bring back the per-country reports, and add some extra slices of per-test and per-AS reports as directory structures, in addition to our current daily raw reports
17:36:25 in looking for DNS responses to private IP space I have found some hostnames that consistently resolve to private IP space across resolvers
17:36:39 I mean, it's not necessarily an indication of censorship
17:36:53 it's still perhaps a good start
17:37:06 true, but there could also be hostnames that resolve to 127.0.0.1 consistently :p
17:37:39 but the thing would be whether that matches with the control or not
17:37:46 right?
17:39:26 andresazp: yeah, I guess looking at the control could filter those out
17:40:09 cool. this seems like a line of coding that would be nice to have on the server
17:40:19 should we also spend some time syncing on web_connectivity?
17:40:31 hellais: do you have a branch ready for others to review?
17:40:42 willscott: yes I do
17:41:01 https://github.com/TheTorProject/ooni-probe/pull/457
17:41:57 I have also set up a test bouncer for use with that branch. Here I explain the one-liner to run it: https://github.com/TheTorProject/ooni-probe/pull/457#issuecomment-219748547
17:42:52 the rate of false positives is much lower than with the http_requests test, but there are still some cases of false positives.
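The heuristic discussed above — flag DNS answers in private IP space, but use the control measurement to filter out hostnames that legitimately resolve there — could be sketched like this. A minimal illustration under assumptions: the function name and the sample answers are made up, and a real check would compare full resolver/control result sets from the reports.

```python
import ipaddress

def suspicious_answers(experiment_ips, control_ips):
    """Return A-record answers in private/loopback space that the
    control vantage point did NOT also return.

    Answers the control also sees (e.g. a hostname that genuinely
    resolves to a private address everywhere) are filtered out.
    """
    control = {ipaddress.ip_address(ip) for ip in control_ips}
    flagged = []
    for ip in experiment_ips:
        addr = ipaddress.ip_address(ip)
        if (addr.is_private or addr.is_loopback) and addr not in control:
            flagged.append(ip)
    return flagged

# Hypothetical measurement: the probe sees 127.0.0.1, the control does not.
print(suspicious_answers(["127.0.0.1", "93.184.216.34"], ["93.184.216.34"]))
# -> ['127.0.0.1']
```

As noted in the discussion, this alone is not proof of censorship, but it is a cheap server-side filter before deeper analysis.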
17:43:32 geo-localised content is the main issue here, and I don't think there is a simple way to overcome this at the end of the day
17:48:47 going back to the measurement data views
17:51:24 can we support changes to the pipeline in such a way that helps us provide any usable measurement data views without re-developing major parts of the pipeline?
17:54:10 if not, then this should happen first on the raw data reports
17:56:26 --pausing-- since we are running out of time
17:57:07 andresazp: we had some discussion about the deprecation of the dns_consistency and http_requests tests at last meeting
17:57:41 I'm sorry I couldn't attend
17:57:54 I will check it out
17:57:56 hellais: left a couple of comments on your pull request
17:58:02 but in general my two cents are:
17:58:34 please don't deprecate dns_consistency, or allow the new web_connectivity to run only DNS tests
17:58:56 only* as an attribute passed somehow
17:59:12 anadahz: we can do it quick and dirty by exporting the data from the database, but I think the proper way to do this is to start the process of reducing some load from the database layer and doing the analysis and minimisation using the JSON files as the source.
17:59:52 anadahz: so I guess the answer is: yes, some re-development of the data pipeline is going to be needed.
18:00:01 dns_consistency is incredibly useful, catches a huge amount of censorship cases, and connectivity-wise is very light
18:00:08 andresazp: log of the previous meeting http://meetbot.debian.net/ooni/2016/ooni.2016-05-16-17.01.log.html
18:01:59 andresazp: the problem with the dns_consistency test is that it leads to a lot of false positives, due to geolocation-based DNS load balancing. With the web_connectivity test you get what the dns_consistency test does + tcp_connect + http_requests
18:02:31 andresazp: is the concern that too much traffic is being generated?
18:03:05 hellais: I understand; the thing is that in low-bandwidth contexts you might want http_requests running at the same frequency
18:03:11 hellais: the "false positives" are us not doing enough analysis
18:03:15 as the other two
18:03:33 if we look at reverse PTRs, whois info, and CNAMEs we ought to be able to do a pretty good job on DNS alone
18:04:06 might not want* sorry
18:04:53 in VE we even had to schedule dns and http_requests in such a way as to minimize impact on the host household connection
18:05:16 the actual details of the schedule are in an email I sent you today
18:05:21 for the corius
18:05:43 curious*
18:06:49 willscott: we used to do PTR lookups as well in the dns_consistency test, but it turns out that PTR records are very rarely properly configured. whois info is also not always going to help you much if the IP in question is owned by some CDN. What do you mean by CNAMEs?
18:06:52 but I understand that we might have an edge case
18:07:04 compared to other deployments
18:08:10 akamai has PTRs configured, but whois won't be helpful. whois works for cloudflare; generally one of the two will
18:08:56 if you dig www.apple.com, you'll see that it indirects through "www.apple.com.edgekey.net" -> "www.apple.com.edgekey.net.globalredir.akadns.net" -> "e6858.dscc.akamaiedge.net"
18:09:29 seeing the expected CDN aliases is also a good sign
18:10:44 yes I see
18:11:22 In a similar vein, I also find MX records pretty useful: almost every DNS-hijacking "blocking implementation" disregards the MX records, although they were never instructed to do so
18:12:46 i.e. even if the website "should" be (according to a regulation) blocked, email communication should not be blocked in any way
18:13:25 or at least I haven't heard/read of blocking requests to MX records apart from SPAM reasons
18:13:30 (what's the reason for apple's cname -> cname -> a record thing?)
18:15:02 landers: it's how akamai distributes to the nearest edge server.
the first cname is set by apple to delegate to akamai. the third is akamai delegating to a specific physical place
18:15:12 dunno what the second one is :)
18:15:20 ah cool
18:16:53 andresazp: On your point about deprecating dns_consistency, I guess we can keep that one, but drop http_requests and tcp_connect. Would that work for you? I mean, would there be a reason for you to still prefer http_requests over web_connectivity?
18:20:05 reading last week's conversation I'm even more convinced of the importance of the web_connectivity test
18:20:06 we didn't use tcp_connect in VE out of ignorance on our part, but we would have loved to
18:20:07 However, for low-bandwidth applications http_requests can be too network-demanding; an option of just running the rest of the test (dns and tcp) in the same run would be ideal
18:20:40 no reason at all for http_requests over web_connectivity
18:22:15 (sidenote: I'm sorry for all the typos, dyslexia + lack of spellcheck)
18:24:30 we certainly shouldn't wait over a month to contact people we like. we're likely to lose many applicants that way
18:24:39 whoops, wrong channel
18:29:18 anything else for today, or should we end this meeting?
18:30:08 andresazp: sounds good :)
18:30:19 yes, I guess we can end this, thank you all for attending
18:30:22 #endmeeting
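[Editor's note] The "seeing the expected CDN aliases is also a good sign" heuristic from the DNS discussion above could be sketched as follows. This is an illustration only: the suffix list is a small made-up sample, not an exhaustive or authoritative set of CDN domains, and a real check would need a maintained fingerprint list.

```python
# Known CDN CNAME suffixes (illustrative sample, not exhaustive).
CDN_SUFFIXES = (
    ".akadns.net",
    ".akamaiedge.net",
    ".edgekey.net",
    ".cloudfront.net",
)

def looks_like_cdn(cname_chain):
    """True if any alias in a CNAME chain points at a known CDN,
    which makes a censorship false positive less likely."""
    return any(
        alias.rstrip(".").endswith(suffix)
        for alias in cname_chain
        for suffix in CDN_SUFFIXES
    )

# The www.apple.com chain quoted in the discussion above.
chain = [
    "www.apple.com.edgekey.net",
    "www.apple.com.edgekey.net.globalredir.akadns.net",
    "e6858.dscc.akamaiedge.net",
]
print(looks_like_cdn(chain))                      # True
print(looks_like_cdn(["blockpage.example.net"]))  # False
```

Combined with reverse PTRs and whois, this is the kind of extra analysis that could cut down the geolocation-load-balancing false positives without running full HTTP requests.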