16:00:32 <hellais> #startmeeting OONI weekly gathering 2017-01-30
16:00:32 <MeetBot> Meeting started Mon Jan 30 16:00:32 2017 UTC.  The chair is hellais. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:32 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
16:00:43 <hellais> pad with agenda: pad.riseup.net/p/ooni-irc-pad
16:00:51 <hellais> hellos!
16:01:20 * darkk throws a snowball to #ooni
16:02:19 * landers slips on ice
16:05:11 <hellais> Does anybody have something they would like to bring up for discussion this week?
16:05:15 <slacktopus> <agrabeli> hello
16:06:00 <sbs> hello!
16:06:16 <hellais> I would like to bring up the fact that the specification for the OONI Probe Orchestration System is, as far as I'm concerned, basically ready to be merged.
16:06:48 <hellais> Something new that wasn't there last week is a measurement and URL policy; I encourage people from the community to also take a look at it and see if it's something they are comfortable with
16:06:52 <sbs> as far as I am concerned, I did some review but still need to look into your latest changes
16:06:54 <hellais> It is this: https://github.com/TheTorProject/ooni-spec/blob/183e4ac7038382c64350bd3f87915d02a54a477c/opos/Measurements-and-url-policy.md
16:07:12 <hellais> The actual design document is here: https://github.com/TheTorProject/ooni-spec/blob/183e4ac7038382c64350bd3f87915d02a54a477c/opos/OONI-Probe-Orchestration-System-Design.md
16:08:16 <hellais> darkk: do you want to give us some updates on what is going on in data pipeline world?
16:11:43 <darkk> hellais: raw-reports are shrunk down to ~380 GB with end-to-end accounting for truncated records. sanitized .lz4 reports are ongoing at the moment (indexing tar.lz4 is a bit hacky),
16:12:12 <darkk> https://github.com/TheTorProject/ooni-pipeline/issues/32#issuecomment-276094341 -- some more words here
16:14:14 <hellais> darkk: you mention in there something about indexes, what are you using as keys for the index?
16:17:01 <darkk> that's not a "B-Tree database key-value index" but metainformation that allows pinpointing the measurement within the bytestream; reverse indexing will be handled by PG. Do you want to ask "what acts as the measurement permalink"?
16:17:43 <sbs> darkk: is this thing that you talk about index.json?
16:19:07 <hellais> darkk: yes I got that, what I meant is what will you use as the bit of information to find the correct offset for the measurement you want to lookup in the compressed lz4
16:19:09 <sbs> darkk: when you say "we need some "streaming" way to read and write reports" you mean that this is the reason why tar is used?
16:19:10 <darkk> sbs: where? :) there is private/canned/YYYY-MM-DD/index.json that stores metainformation about tar files (which files are compressed, what their checksums are)
16:19:54 <sbs> darkk: in the diff you shared there is mention of `index.json` and an explanation that it is not a manifest, IIRC
16:19:57 <hellais> I am assuming you are using some sort of unique ID for each measurement
16:19:58 <darkk> there is unnamed.json that stores metainformation for the tar file after autoclaving
16:21:23 <sbs> darkk: it would probably help me if you could describe (perhaps also in the documentation and/or offline later) what happens when a user requests a specific report
16:22:02 <darkk> sbs: `tar` is used to be "compatible" with people wanting to parse the same blobs as well (with as little non-standard code as possible); that's why it's used instead of plain concatenation
16:23:01 <sbs> darkk: okay, so in one use case one downloads a .tar.lz4 and decompresses it and then tar provides an index to the stored-inside files
16:23:32 <sbs> what happens instead if one requests report X to our backend?
16:23:59 <sbs> (I like the idea of providing a simple way for people to process the data using a standard tool such as tar)
16:24:04 <darkk> sbs: private/canned/YYYY-MM-DD/index.json does not know anything about measurements yet, canning just pickles raw files into a can, user does not download cans (as they are private) user downloads sanitized cans (tarballs) that passed through autoclave
16:25:31 <darkk> sbs: if someone requests a single report X (e.g. the ooni-explorer webpage for a specific measurement), then a specific LZ4 frame of a specific tar file is read and decompressed; part of the frame is raw sanitized JSON and it's served as-is
16:25:36 <sbs> darkk: mmm, okay, and the YYYY-MM-DD/index.json file is instrumental to further processing and probably does not even end up in the final tar?
16:26:57 <sbs> darkk: meaning that, for each .tar.lz, there is also an external file that gives mapping from report names to offsets in the tar, right?
16:27:11 <darkk> yep, that's an internal one; private/canned is just a "safe" where raw reports are stored untouched. There are two benefits over the current private/reports-raw: 1. fewer IOPS to read, 2. better compression than (single files|no compression now)
16:29:04 <darkk> sbs: private/canned/…/index.json does not store offsets; public/sanitized/…/yetanotherindex.json stores them (and I think I need a new name for the "sanitized-tared-compressed" directory to be less ambiguous)
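The external-index idea darkk describes above can be sketched in a few lines of Python using only the standard library. Names and the index layout here are illustrative, not the pipeline's actual index.json schema:

```python
import io
import json
import tarfile

# Write a tar archive in memory, one member per report (hypothetical names).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("report-a.json", b'{"a": 1}'), ("report-b.json", b'{"b": 2}')]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Build an external index mapping member names to byte offsets.
# TarInfo.offset_data points at the member's file data inside the stream.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    index = {m.name: {"offset": m.offset_data, "size": m.size} for m in tar.getmembers()}
index_json = json.dumps(index)  # what would be stored next to the tarball

# Random access: seek straight to the recorded offset, no full scan needed.
entry = index["report-b.json"]
buf.seek(entry["offset"])
assert buf.read(entry["size"]) == b'{"b": 2}'
```

With the real data the tarball is additionally lz4-framed, so the stored offsets would point at frame boundaries rather than raw tar offsets, but the lookup pattern is the same.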
16:30:37 <sbs> darkk: okay, cool, one more question: so, the fact that you know at which offset to jump means (I assume) that lz4 is a format where you can start decompressing from certain specific points (is it like a frame as in, say, videos)?
16:31:49 <darkk> sbs: yep, exactly, they're called `frames` as well. actually that's true for gzip too, but lz4 decompressor is just much faster
16:32:36 <sbs> darkk: ah, okay, I didn't know... that's interesting: now I understand how one can "stream" both from lz4 and after un-lz4 from the tar
16:32:41 <sbs> beautiful
16:33:14 <sbs> I'll read the diff more carefully later and let you know if I have more questions or comments
16:33:23 <sbs> thanks
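The frame-based random access darkk describes also works for gzip, as he notes. Since Python ships gzip support in the standard library, here is a minimal sketch of the same idea (lz4 frames behave analogously, just with a faster decompressor; the record contents are made up):

```python
import gzip
import zlib

# Each record is compressed as its own member/frame, and the byte offset
# of each member is recorded at write time.
records = [b'{"input": "a"}', b'{"input": "b"}', b'{"input": "c"}']
blob, offsets = b"", []
for rec in records:
    offsets.append(len(blob))
    blob += gzip.compress(rec)

# To serve record 1, seek to its offset and decompress just that member.
# wbits=MAX_WBITS|16 tells zlib to expect gzip framing; decompression
# stops at the member boundary and the rest lands in unused_data.
d = zlib.decompressobj(zlib.MAX_WBITS | 16)
assert d.decompress(blob[offsets[1]:]) == b'{"input": "b"}'
```

This is why plain concatenation of frames is streamable: a reader can start at any recorded frame boundary without decompressing everything before it.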
16:37:17 <sbs> since there are crickets, one more question:
16:37:21 <darkk> hellais: I don't actually have _uniq_ ids yet; there are several possible IDs that may be used for different purposes. permalink = ${bucket}/${report_fname}/${no_within_report}, data scrubbing = sha1(report_file), duplicate measurements = sha1(header_yaml + measurement_yaml) or sha1(line_json). Also, I'm going to fill `id` with the "duplicate measurements" sha1 truncated down to 128 bits instead of
16:37:23 <darkk> a random uuid().
16:38:14 <sbs> why do you compute both the crc32 and the sha1?
16:39:14 <darkk> sbs: _sometimes_ crc32 is faster than sha1 and it's probably good enough for data scrubbing
16:39:34 <darkk> and it's hard to get anything that is both widespread and fast
16:39:52 <darkk> s/than sha1/than DISK/
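The ID and checksum choices darkk lists can be sketched as follows. This is a hedged illustration, not the pipeline's code; the variable names and the sample measurement are made up:

```python
import hashlib
import zlib

measurement = b'{"test_name": "web_connectivity", "input": "example.org"}'

# Cheap integrity check for data scrubbing: crc32 is fast and widespread,
# often faster than the disk it is checking.
crc = zlib.crc32(measurement) & 0xFFFFFFFF

# Duplicate-measurement detection: sha1 over the serialized measurement.
dup_key = hashlib.sha1(measurement).hexdigest()

# Measurement `id`: the same sha1 truncated to 128 bits (16 bytes),
# used in place of a random uuid4 so the id is deterministic.
measurement_id = hashlib.sha1(measurement).digest()[:16].hex()

assert len(measurement_id) == 32  # 128 bits rendered as hex
```

A deterministic id has the nice property that re-processing the same raw data yields the same identifiers, which a random uuid would not.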
16:39:57 <hellais> darkk: thanks for all these updates it's very useful. I will take a look at the diff in the branch and take note of comments there.
16:40:00 <sbs> darkk: ah, okay, so it's computed at this stage because it's gonna be useful later
16:40:42 <hellais> Ideally the permalink would be something somewhat close to what is outlined inside of the measurement_id ticket.
16:41:14 <hellais> BTW here is a notebook with the code for the DNS related analysis: https://gist.github.com/hellais/5b1ed13dfb74ba03b349af8ecf8ff208
16:42:19 <darkk> hellais: there is some difference between permalink (that can be supported with redirects) and ID that's used in "pagination" and measurement_id ticket mixes them
16:43:32 <sbs> darkk: where pagination is possibility to stream reports via API by asking for the next "page"?
16:44:35 <hellais> sbs: pagination is when I am a consumer of OONI measurements (for example I am the Tor metrics team) and I need to tell the measurements API, give me all the measurements since the last time I fetched them
16:44:56 <darkk> exactly
16:45:04 <darkk> e.g. ${sha1_report}/${n}/${sha1_measurement}/${m} is a valid permalink as well (${n} and ${m} are some numbers, usually zero, used to resolve collisions according to some time-based ordering rule)
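Constructing a permalink of the shape darkk sketches would look roughly like this (the input blobs are placeholders; the layout is the one quoted above, not a finalized scheme):

```python
import hashlib

# Illustrative inputs, not real report data.
report_blob = b"raw report file bytes"
measurement_blob = b'{"input": "example.org"}'

sha1_report = hashlib.sha1(report_blob).hexdigest()
sha1_measurement = hashlib.sha1(measurement_blob).hexdigest()
n = m = 0  # collision counters, usually zero

# ${sha1_report}/${n}/${sha1_measurement}/${m}
permalink = f"{sha1_report}/{n}/{sha1_measurement}/{m}"
```

Because every component is derived from content (plus small counters), such a permalink stays stable across backend rewrites as long as a redirect table maps it to wherever the data currently lives.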
16:45:06 <sbs> hellais: got it
16:48:36 <hellais> darkk: yes I see your point. Though permalinks are more useful if they are something easy to share. That is why I was suggesting something like an N digit number.
16:51:18 <darkk> hellais: why? is it to avoid measurements.otpo/m/3i42h3s6nnfq2msvx7xzkyayscx5qbyj being wrapped with goo.gl in pdf papers? :)
16:51:52 <darkk> why does one want to type in these numbers?
16:55:21 <hellais> I guess I don't have a very compelling argument for it, but it's generally the case that a thing on the web is identified by some namespace (what the thing is) + some integer. I am thinking of tweets, reddit/facebook/google+ posts, flickr images, etc.
16:55:45 <sbs> perhaps a point that I see in _logical_ permalinks is that they are independent of the way in which we do the backend, which could be an advantage should we decide to change the way in which it works
16:56:40 <hellais> anyways I think this aspect is at the moment not that high priority and it's something we can defer once we have made more progress on the pipeline
16:57:12 <sbs> sure
16:57:16 <darkk> anyway, we'll have to re-do it at least once (if OONI outlives another iteration of ooni-pipeline) and we'll have nginx with config file dedicated just to redirects.
16:58:04 <sbs> darkk: haha, that's gonna be a huge config file :)
16:58:52 <darkk> that's why I consider "pagination" id independent of "permalink" id -- pagination id can be easily changed with /api/epoch/2 scheme, permalink may be cited and should live as soon as something better than 404 can be given as a reply to URL :)
16:59:57 <sbs> darkk: understood, well I guess I'll think a bit more about it
17:00:20 <darkk> sbs: yep, I've heard of redirect-only nginx configs that were couple of gigabytes :-) Process bootstrap was rather slow but 301s were speedy :)
17:00:55 <sbs> darkk: speed and/or SPDY I guess
17:02:50 <hellais> do we have any more items to discuss?
17:03:08 <sbs> I think we can give just a brief update on our mobile progress?
17:03:31 <darkk> I want to cite nuke (so the link is stored in log)
17:03:33 <darkk> <nuke> New android APK released : https://github.com/measurement-kit/ooniprobe-android/releases/tag/v0.1.0-rc.3
17:03:35 <darkk> <nuke> feedbacks and comments to my email lorenzo@openobservatory.org thanks
17:03:35 <hellais> sure go for it
17:04:23 <sbs> right, then
17:04:29 <sbs> <MobileUpdate>
17:05:02 <sbs> Basically, we're quite there: a couple of bug fixing is required on the MK side and final integration on the apps side
17:05:14 <sbs> The final beta should probably be blessed in 2-3 days
17:06:17 <sbs> We can probably manage to have the apps done, and hopefully also published on the stores as planned, within the first two weeks of February
17:06:19 <nuke> Hi all
17:06:20 <sbs> </>
17:06:31 <sbs> nuke: hello!
17:06:53 <nuke> There are just a few things to fix in the apps, and more more more testing to do. But basically apps are ready.
17:07:12 <nuke> I'm waiting for some fixes in mk, but we can also release and then submit a quick update
17:08:38 <nuke> Everybody with android is invited to test the apk as darkk reminded; I don't have many android testers at the moment and I own only two android phones
17:09:11 <sbs> ah, yes, important: we have tested on armeabi only with the emulator, where things work
17:09:29 <sbs> we have had reports that on an Android 4.4 device, however, the app was not working
17:10:05 <sbs> so, yes, testing on old Android versions and architectures (basically phones made targeting Android < 4 are quite likely to be armeabi) would help us
17:10:18 <darkk> IIRC, I have some 4.4 device on the shelf, but I'm unsure if the battery is still alive enough to power on
17:10:31 <darkk> What's the lowest android version you're interested in?
17:10:45 <sbs> API 9 which should be 2.3
17:12:07 <darkk> O_o?!? 2.3?! Ok, it makes perfect sense (low-cost old phones in all over the world), but I've not thought about it before. The oldest one I have is HTC Desire Z (qwerty!) flashed with 4.0.
17:13:05 <sbs> darkk: yes, the rationale is exactly as you say, to be able to work on low-cost old phones -- the oldest working thing I have is armeabi-v7a originally flashed for android 4
17:17:07 <darkk> OK, I'll keep in mind that some ancient 2.3 device may be useful for testing & debugging.
17:17:28 <sbs> darkk: thanks
17:18:26 <hellais> thanks for the updates on the mobile front, and to darkk for offering to do testing on android
17:18:58 <hellais> do we have anything else to discuss or can we end with just 18 minutes past the scheduled end time?
17:19:36 <sbs> can we end in non even hours such as 17:15 and 17:30?
17:19:39 <sbs> :-o
17:20:02 <sbs> s/even/round/perhaps
17:24:02 <darkk> #endmeeting stats https://gist.github.com/darkk/bf3b240f1534172282ef52e8eff2494f
17:24:41 <hellais> lol
17:24:41 <sbs> lol
17:24:51 <hellais> ok let's aim for 17:25 then
17:24:59 <hellais> thank you all for attending
17:25:01 <hellais> #endmeeting