#ooni log

17:00:10 <hellais> #startmeeting
17:00:10 <MeetBot> Meeting started Mon Dec 21 17:00:10 2015 UTC.  The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:10 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:00:11 * gxg waves
17:00:22 <hellais> gxg: super fast
17:00:27 <hellais> who else is here?
17:00:42 <vtduncan> Hi
17:00:47 <landers> here
17:02:37 * sbs here
17:04:24 <hellais> perfect, so what have you been up to last week?
17:08:30 <landers> i've been doing pipeline stuff, moving that secret branch forward, trying not to introduce too many terrible bugs ;_;
17:08:33 <landers> eof
17:08:55 <anadahz> here
17:10:57 <hellais> I have been working mainly on the ooni-api development, I implemented the tables to store the details about tests and country code information. Moreover I have been following the development on the ooni-pipeline catching bugs here and there. EOF
17:11:33 <sbs> I have released v0.1.0 of measurement-kit (the library), I have updated the prototype iOS app to use it, and I am working to finalize the prototype of the Android app. EOF
17:12:22 <simonv3> Last week I mainly did some refactoring of the ooni-api, working off of hellais’ changes, and making it easier to build specific pages for each test. I’ve also been working on the visuals of the api. EOF
17:13:44 <simonv3> I also came to the conclusion that I’m going to need to sit down and understand all the specific tests properly and come up with layouts for them all, and implement those. I’ll do that this week along with finding basic country information. Anyone have suggestions for either of those?
17:14:52 <hellais> simonv3: with maria we started writing up a bit of the copy to go along with the tests that I will send you or add directly to the views of the specific test pages. That should make it clearer exactly what each test does and perhaps also how to best present it.
17:15:40 <hellais> simonv3: regarding the country information I believe that at this stage it's probably best not to include any sort of text describing the country, as anything we scrape elsewhere will be imho heavily biased and probably not very relevant to our use case
17:16:03 <hellais> I would perhaps just limit the country pages to including some basic statistics about the country like ONI used to do:
17:16:22 <hellais> https://opennet.net/research/profiles/vietnam
17:16:31 <hellais> (I am talking about the key indicators section)
17:17:06 <simonv3> Re: test copy: That would be really useful. Re: country information: I was just thinking of doing population, internet penetration, actual facts
17:17:22 <hellais> simonv3: yes that sounds good then
17:17:39 <simonv3> I’d even be hesitant to put in stuff like “Rule of Law, and Democracy Index” unless we verify what those mean ourselves
17:18:12 <hellais> yeah I agree
17:18:36 <hellais> the Digital Opportunity Index as a pretty good methodology in my opinion
17:18:41 <hellais> *has
17:19:42 <hellais> in other news we now have most of the reports inside of the database, hurray!
17:19:55 <hellais> the sad news is that now the database is incredibly slow to run certain queries
17:20:10 <hellais> landers and I were discussing earlier some possible strategies we can adopt in order to speed things up
17:20:23 <simonv3> is that in the new staging db?
17:20:32 <simonv3> (i hadn’t updated the IP address yet)
17:21:18 <hellais> the problem lies in the fact that Amazon RDS doesn't actually give you a lot of IO speed and so certain queries can take up to several minutes
17:21:44 <hellais> indexing on certain keys does improve the situation, but the country counts are still very problematic
17:21:56 <anadahz> hellais: do you think that a separate bare metal server will give more IO speed?
17:22:06 <sbs> hellais: is it possible to create a cluster of databases?
17:22:11 <hellais> I think probably the best option would be to compute those counts with stored procedures that are invoked by a TRIGGER
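A minimal sketch of the trigger idea mentioned above, assuming a simplified schema: a summary table of per-country counts is kept up to date on every insert, so reading a count no longer scans the measurements table. The table, column, and trigger names are hypothetical, and SQLite is used only to keep the example self-contained; in PostgreSQL the trigger would call a stored function instead.

```python
import sqlite3

def make_db():
    """Create an in-memory database whose country counts are kept by a trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Hypothetical, heavily simplified measurements table.
        CREATE TABLE measurements (
            id INTEGER PRIMARY KEY,
            probe_cc TEXT NOT NULL
        );
        -- Summary table that replaces the slow COUNT(*) ... GROUP BY query.
        CREATE TABLE country_counts (
            probe_cc TEXT PRIMARY KEY,
            total INTEGER NOT NULL
        );
        CREATE TRIGGER bump_country_count
        AFTER INSERT ON measurements
        BEGIN
            -- Make sure a row exists for the country, then increment it.
            INSERT OR IGNORE INTO country_counts (probe_cc, total)
            VALUES (NEW.probe_cc, 0);
            UPDATE country_counts SET total = total + 1
            WHERE probe_cc = NEW.probe_cc;
        END;
    """)
    return conn

def country_count(conn, cc):
    """Read the precomputed count for one country code (0 if never seen)."""
    row = conn.execute(
        "SELECT total FROM country_counts WHERE probe_cc = ?", (cc,)
    ).fetchone()
    return row[0] if row else 0
```

The trade-off in this sketch is that every insert does a little extra work so that the count lookup becomes a primary-key read.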
17:22:30 <anadahz> hellais: that could potentially decrease the response time?
17:22:48 <hellais> anadahz: it depends on what disks are mounted on it. Without SSDs I don't think the performance gain would be significant
17:23:32 <hellais> sbs: yes that is also an option we were considering since postgres does support sharding, however it's unclear what would be the ideal key to use for the sharding
17:24:16 <sbs> hellais: I see. Yes, the key would indeed be critical!
17:25:11 <sbs> hellais: ideally you want the key to be randomly distributed... is there a random id for each measurement? if so, that maybe could qualify as a good key?
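One way to picture the "randomly distributed key" point, as a sketch only: hash a random per-measurement id and take it modulo the number of shards. The function names and the use of UUIDs here are assumptions for illustration, not something the pipeline is known to do.

```python
import hashlib
from collections import Counter

def shard_for(measurement_id: str, num_shards: int) -> int:
    """Map an id to a shard index in [0, num_shards) using a stable hash."""
    digest = hashlib.sha256(measurement_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards

def shard_histogram(ids, num_shards: int) -> Counter:
    """Count how many ids land on each shard, to eyeball the spread."""
    return Counter(shard_for(i, num_shards) for i in ids)
```

For example, `shard_histogram((str(uuid.uuid4()) for _ in range(4000)), 4)` (with `import uuid`) should give four counts that are each close to 1000.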
17:25:33 <hellais> I am also thinking that perhaps it could be beneficial to reduce the table to 3NF by ripping out all the common fields (test_start_time, probe_cc, probe_asn, ... etc basically what is in the report header) and putting them in another table and making a reference
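A rough sketch of that normalization, assuming only the header fields named above: the common fields move into a reports table and each measurement row keeps just a reference to it. The table layout, the `test_keys` column, and the helper names are hypothetical, and SQLite stands in for the real database so the example runs on its own.

```python
import sqlite3

SCHEMA = """
-- One row per report: the fields that used to be repeated on every measurement.
CREATE TABLE reports (
    report_id TEXT PRIMARY KEY,
    test_start_time TEXT NOT NULL,
    probe_cc TEXT NOT NULL,
    probe_asn TEXT NOT NULL
);
-- One row per measurement, pointing back at its report.
CREATE TABLE measurements (
    id INTEGER PRIMARY KEY,
    report_id TEXT NOT NULL REFERENCES reports(report_id),
    input TEXT,          -- may be NULL
    test_keys TEXT       -- hypothetical column for the test-specific payload
);
"""

def open_db():
    """Create the normalized schema in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce REFERENCES
    conn.executescript(SCHEMA)
    return conn

def measurements_per_country(conn):
    """Country counts now need a join against the reports table."""
    rows = conn.execute("""
        SELECT r.probe_cc, COUNT(*)
        FROM measurements m
        JOIN reports r ON r.report_id = m.report_id
        GROUP BY r.probe_cc
    """).fetchall()
    return dict(rows)
```

The join is the cost of this layout: the header fields are stored once per report, but per-country queries have to touch both tables.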
17:26:28 <hellais> sbs: currently there is no key for the measurement, but I think we will re-run the import and add one. We actually ran into a bug when importing the current batch: we were using the tuple (input, report_id) as the key, and that broke because of the NOT NULL constraint of the primary key
17:27:53 <sbs> hellais: I think a unique key could also be useful to identify each measurement with its own uri
17:29:11 <hellais> true
17:29:23 <sbs> hellais: doesn't normalizing the table conflict with sharding the measurements using a unique key?
17:29:42 <hellais> the main reason why we were initially hesitant is that it could change if we re-import the data into the database since it's not being applied by the probe itself
17:29:44 <sbs> hellais: I mean, how do you guarantee that you spread stuff equally then?
17:30:24 <sbs> hellais: could the key be something like prefix/uuid and then you choose a prefix to indicate that such uuid was generated ex post?
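The prefix/uuid suggestion could look roughly like this. The two prefix values and the function names are made up for illustration; nothing in the discussion fixes them.

```python
import uuid

# Hypothetical prefixes: one for ids assigned by the probe itself,
# one for ids assigned after the fact while importing into the database.
PROBE_PREFIX = "probe"
IMPORT_PREFIX = "import"

def new_measurement_key(generated_by_probe: bool) -> str:
    """Build a key of the form prefix/uuid using a random (version 4) UUID."""
    prefix = PROBE_PREFIX if generated_by_probe else IMPORT_PREFIX
    return f"{prefix}/{uuid.uuid4()}"

def was_generated_ex_post(key: str) -> bool:
    """True when the key carries the prefix used for ids assigned at import time."""
    prefix, _, _ = key.partition("/")
    return prefix == IMPORT_PREFIX
```

With a scheme like this, a re-import would still produce new uuids, but the prefix would at least make it visible that the id did not come from the probe.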
17:31:06 <hellais> the report_id is also random and the input and report_id should be unique
17:32:19 <anadahz> hellais: can you run this on the psql server? # dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=dsync
17:33:00 <hellais> anadahz: it's not a server, it's a SaaS deployment
17:34:00 <hellais> currently I am using the "cheap" solution that doesn't have guaranteed IO speed, but am also spinning up another one that has a guaranteed IO speed
17:35:27 <willscott> how many minutes & how many total records need to be scanned for the current country queries?
17:35:30 <sbs> hellais: ah, I understand: so, you have random keys available both for the whole report and for individual tests
17:37:43 <hellais> sbs: correct
17:39:41 <hellais> willscott: currently there are ~6M entries in the DB. It takes about 5 minutes to run one of those queries.
17:40:16 <sbs> hellais: speaking of generating files on disk: what do you expect ooni-probe (or measurement-kit) to generate on disk right after a test? A .json file for the header and a .json file for each test, or just a .json file for each test?
17:40:17 <hellais> (I just re-ran it and it's actually just 90 seconds)
17:40:51 <sbs> (I am talking about the new .json format as opposed to the current .yaml format)
17:43:26 <hellais> sbs: I expect the format to be used when writing to disk to remain YAML and optionally be JSON (I still see the benefit in having YAML for quickly inspecting the results manually). When JSON is used I would imagine there not to be such a thing as a report header anymore, but have the keys of the header duplicated in every JSON row
17:43:55 <hellais> in the end I think it's more reasonable to sacrifice a little bit of storage space in exchange for making it easier to process the results
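As an illustration of the headerless JSON layout described above, here is a small sketch that copies the header keys into every row and emits one JSON object per line. The function name and the sample field values are assumptions; the actual report format was not settled in this meeting.

```python
import json

def to_json_lines(header: dict, entries: list) -> str:
    """Serialize entries as one JSON object per line, each carrying the header keys."""
    lines = []
    for entry in entries:
        row = dict(header)   # copy, so the shared header is never mutated
        row.update(entry)    # entry-specific keys are added on top of the header
        lines.append(json.dumps(row, sort_keys=True))
    return "\n".join(lines)
```

For instance, calling it with a header of `{"report_id": "r1", "probe_cc": "VN"}` and two entries yields two lines that can each be parsed on their own, at the cost of repeating `report_id` and `probe_cc` in both.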
17:45:20 <sbs> hellais: ack
17:49:54 <simonv3> Hey just saw this: https://datatracker.ietf.org/doc/draft-ietf-httpbis-legally-restricted-status/?include_text=1
17:50:21 <simonv3> > An HTTP Status Code to Report Legal Obstacles
17:54:06 <anadahz> reporting back
17:54:21 <anadahz> From my side: worked on the ansible roles deployment for tor pluggable transports installation in ooniprobe,
17:54:22 <anadahz> improvements on the decks and upload reports scripts as well as many bug fixes for lepidopter. Additionally we have worked with sbs on the mini hackathon report.
17:54:27 <anadahz> EOF
17:55:33 <MightyOctopus> [lepidopter] joelanders pushed 1 new commit to master: http://git.io/vEmNj
17:55:33 <MightyOctopus> lepidopter/master 0ce1f91 Joe Landers: Merge pull request #24 from TheTorProject/hotfix/track-empty-dirs...
17:55:34 <MightyOctopus> [lepidopter] joelanders closed pull request #24: Hotfix: empty directory structure required (master...hotfix/track-empty-dirs) http://git.io/vEqS9
17:55:38 <MightyOctopus> [lepidopter] joelanders deleted hotfix/track-empty-dirs at 0d65a8d: http://git.io/vEmAf
17:56:50 <willscott> anadahz and I are giving a talk on the censorship measurement space next week: https://events.ccc.de/congress/2015/Fahrplan/events/7143.html
18:02:26 <hellais> good stuff
18:02:52 <hellais> well if there is nothing else I thank you all for attending
18:03:06 <hellais> I guess next meeting a lot of us will be at CCC, so we can probably do something there in person
18:11:36 <landers> i'll be there
18:19:39 <hellais> #endmeeting