14:28:58 <karsten> #startmeeting metrics team
14:28:58 <MeetBot> Meeting started Thu Jul 26 14:28:58 2018 UTC.  The chair is karsten. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:28:58 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
14:29:00 <karsten> hi!
14:29:02 <irl> hi!
14:29:29 <karsten> https://storm.torproject.org/shared/5h1Goax5eNusxjXJ_Ty5Wl7hFR1uqCReUiN8xdlBG8T <- agenda pad
14:30:08 <karsten> looks like we have just one topic for today, which is fine.
14:30:36 <karsten> * OONI Vanilla Tor Data Analysis (irl)
14:31:16 <irl> this was mostly to give an update on what i've been doing here but there are a couple of questions
14:31:38 <karsten> the plan sounds good to me.
14:31:45 <karsten> the first question is about the table?
14:31:45 <irl> instead of pre-processing the data as part of metrics-web, i've been looking at doing the preprocessing on the OONI side
14:32:15 <irl> here it is coming out of a postgres database and we can make queries there which makes it easier
14:32:40 <karsten> I think it's a fine plan to start doing this part on the ooni side.
14:32:55 <karsten> we can later change that, but we should then also think about archiving their data in collector.
14:33:13 <karsten> starting the aggregation on the ooni side doesn't make that necessary.
14:33:24 <irl> ok cool, so the csv that comes out is aggregated by country and date
14:33:36 <karsten> sure, sounds good.
14:33:46 <irl> as the data is quite sparse, i'm still undecided if i want the date to be on a daily or weekly resolution
14:34:38 <karsten> ah, hmm.
14:34:40 <irl> the probes should be performing measurements daily and so making it weekly doesn't necessarily improve the data that much
14:34:49 <irl> if there's one broken probe it could just be broken all week
14:35:41 <karsten> I just added two comments to the pad.
14:36:17 <irl> success_rate is a fraction between 0 and 1
14:36:42 <karsten> would it be easier to have a success_count there?
14:36:52 <karsten> thinking of later aggregations and missing dates.
14:37:07 <irl> and also add a failure_count?
14:37:36 <karsten> maybe. though that should be report_count - success_count.
14:37:43 <karsten> unless there's a third category.
14:38:19 <karsten> but going back to the question whether to use daily or weekly data: most data on tor metrics is daily.
14:38:20 <irl> either it works or it doesn't, there is more detail in how far it gets in the bootstrap process but i don't think we want to go into that too deeply
14:38:44 <karsten> sure. adding a failure_count would be fine then.
14:38:50 <karsten> up to you.
14:39:21 <irl> ok
14:39:28 <karsten> by the way, this is an internal data format, not one we're giving out, right?
14:39:42 <karsten> as in, we can change it any time.
14:39:54 <irl> yes, we could change it
14:40:02 <karsten> okay.
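[editor's note] The count-based format discussed above can be sketched as follows. This is a minimal sketch, not the actual OONI or metrics-web code: the record fields (`country`, `date`, `success`) and all function names are hypothetical. It illustrates why storing `report_count` and `success_count` (with `failure_count` derived as their difference) makes later re-aggregation, for example from daily to weekly, a plain sum, which a stored `success_rate` would not allow.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate_daily(measurements):
    """Count reports and successes per (country, date).

    Each measurement is assumed to be a dict with the hypothetical
    keys "country", "date" (a datetime.date) and "success" (a bool).
    """
    counts = defaultdict(lambda: {"report_count": 0, "success_count": 0})
    for m in measurements:
        row = counts[(m["country"], m["date"])]
        row["report_count"] += 1
        if m["success"]:
            row["success_count"] += 1
    # There is no third category, so failures are the remainder.
    for row in counts.values():
        row["failure_count"] = row["report_count"] - row["success_count"]
    return dict(counts)


def aggregate_weekly(daily):
    """Sum daily counts into weeks, keyed by the Monday of each week."""
    weekly = defaultdict(
        lambda: {"report_count": 0, "success_count": 0, "failure_count": 0}
    )
    for (country, day), row in daily.items():
        monday = day - timedelta(days=day.weekday())
        for key, value in row.items():
            weekly[(country, monday)][key] += value
    return dict(weekly)


def success_rate(row):
    """Fraction between 0 and 1, or None when there are no reports."""
    if row["report_count"] == 0:
        return None
    return row["success_count"] / row["report_count"]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        {"country": "DE", "date": date(2018, 7, 23), "success": True},
        {"country": "DE", "date": date(2018, 7, 23), "success": False},
        {"country": "DE", "date": date(2018, 7, 24), "success": True},
    ]
    for key, row in aggregate_weekly(aggregate_daily(sample)).items():
        print(key, row, success_rate(row))
```

Dates with no measurements simply produce no row, so a missing date is distinguishable from a date with zero successes.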
14:40:56 <irl> the last question was whether it is worth also adding a table?
14:41:11 <karsten> probably not.
14:41:27 <karsten> there will be a CSV file download.
14:41:52 <karsten> with just the data that is displayed, and with ways to configure the start/end date and the country.
14:41:53 <irl> ok then
14:42:07 <irl> so i will turn all of this into tickets going forward
14:42:12 <karsten> I'm not even sure if we still need the current tables. but that's a discussion for another time.
14:42:20 <karsten> I also added another note:
14:42:24 <karsten> - Add section to reproducible metrics page
14:42:55 <irl> yes, this sounds good, and should be easier as we're going directly from the ooni measurements now
14:43:10 <karsten> under which category does the new graph go? Performance?
14:43:30 <irl> i guess it is reliability
14:43:49 <irl> so yes, performance
14:44:46 <karsten> the reproducible metrics page should then explain how to go from ooni measurements to this graph.
14:44:49 <karsten> probably.
14:45:07 <karsten> rather than say it's all magically there in the CSV file that we get from OONI.
14:45:38 <irl> yes, i would try to describe how to get it out of https://api.ooni.io/
14:45:46 <karsten> sounds good!
14:46:30 <karsten> okay, it looks like you'll have to turn the current Performance section into a subsection and then add another one for OONI data.
14:46:37 <karsten> there might even be something commented out in the HTML.
14:46:46 <karsten> you'll figure that out.
14:47:35 <irl> ok
14:47:36 <karsten> by the way, I'm happy to help with R/ggplot2/tidyr/dplyr details, if that would keep you busy for too long.
14:48:14 <karsten> alright.
14:48:28 <irl> ok cool
14:48:31 <slacktopus> <hellais> @irl if you want to do batch import of all the metrics of a particular `test_name` (in this case `vanilla_tor`), by far the fastest way of doing it is from the daily compressed tarballs in s3 buckets
14:49:16 <slacktopus> <hellais> See: https://ooni.torproject.org/post/mining-ooni-data/#jsonl-tar-lz4
14:49:21 <karsten> like a message from above.. :)
14:49:36 <irl> hellais: we don't need the whole tests, what you have in the postgres is everything we need
14:50:02 <slacktopus> <hellais> I was speaking about the reproducible metrics aspect
14:50:05 <irl> aah ok
14:50:13 <slacktopus> <hellais> The data we publish on s3 buckets will likely be there for a long time
14:50:19 <irl> yes, i was going to reference that blog post
14:50:24 <slacktopus> <hellais> While the postgres database is subject to change (as it’s an internal database)
14:50:38 <slacktopus> <hellais> (We may even eventually replace postgres with something else in the future)
14:52:02 <irl> yes, later we would want to archive the data in collector and at that point we wouldn't use the ooni postgres anymore
14:53:55 <karsten> okay. is there anything else on that topic?
14:54:11 <slacktopus> <hellais> Right
14:54:29 <irl> i think nothing else from me
14:54:35 <slacktopus> <hellais> One thing that may interest you on this topic, is that maybe it’s useful for you folks to have some sort of synchronization primitive for knowing when the OONI pipeline has ticked
14:54:46 <slacktopus> <hellais> So that you can sync your ingestion pipeline to it
14:55:12 <irl> this sounds like a thing to do later, maybe we can discuss it in mexico?
14:55:19 <karsten> I guess we would fetch a new file from OONI once per day, for now.
14:55:23 <irl> yes
14:55:47 <karsten> one thing that's worth figuring out is how long after the end of a UTC day there will be new measurements for that day.
14:56:04 <karsten> the question is whether we need to cut off days at the end, and if so, how many.
14:56:17 <slacktopus> <hellais> Currently our pipeline ticks roughly every 40h
14:56:27 <slacktopus> <hellais> We are soon going to bring that down to ~24h
14:56:33 <karsten> there was a ticket a while ago where we went through all data sources and came up with a number of days to cut off.
14:56:41 <slacktopus> <hellais> And eventually bring that down to sub-daily
14:57:19 <slacktopus> <hellais> This is probably something to keep in mind when you present your data as you should consider stuff that is newer than the $OONI_TICK not ready to show yet
14:57:56 <irl> so we can cut off the last 2 days and then that would be fine for now
14:58:22 <karsten> try it out, I'd say. maybe we need 3, depending on when in the day things are running.
14:59:26 <karsten> still useful to see trends. just not the latest developments.
14:59:35 <karsten> but that's the case with other graphs, too.
15:00:14 <irl> ok
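[editor's note] The cut-off discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the metrics-web implementation: the function name and the `date` row field are hypothetical, and "cut off the last 2 days" is interpreted here as dropping the current UTC day plus the day before it. The number of days is a parameter because the discussion leaves open whether 2 or 3 is needed.

```python
from datetime import date, datetime, timedelta, timezone


def cut_off_recent_days(rows, today, cutoff_days=2):
    """Drop rows from the most recent `cutoff_days` UTC days.

    `rows` is assumed to be an iterable of dicts with a hypothetical
    "date" key holding a datetime.date. With cutoff_days=2, rows dated
    `today` and the day before are dropped.
    """
    last_kept = today - timedelta(days=cutoff_days)
    return [row for row in rows if row["date"] <= last_kept]


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [{"date": date(2018, 7, d)} for d in (23, 24, 25, 26)]
    today_utc = datetime.now(timezone.utc).date()
    print(cut_off_recent_days(rows, today=today_utc, cutoff_days=2))
```

Passing `today` explicitly rather than reading the clock inside the function keeps the cut-off reproducible when a graph is regenerated for a past date.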
15:00:21 <karsten> alright. moving on.
15:00:28 <karsten> * Extend Statistics page (karsten)
15:00:51 <karsten> very quickly, I started extending the Statistics page on Tor Metrics to cover all per-graph CSV files as well.
15:01:12 <karsten> before deprecating the more general CSV files that we're planning to take out in the near future.
15:01:27 <karsten> (while still using them internally, but not providing them as interface anymore.)
15:01:59 <karsten> hmm, there might even be a ticket for that.
15:02:28 <irl> this is the page at /stats.html ?
15:02:37 <karsten> yes, it is.
15:02:49 <irl> ok, this sounds good
15:03:07 <karsten> we briefly thought about putting specifications in CSV file headers,
15:03:16 <karsten> but I fear that that doesn't scale very well.
15:03:23 <irl> also, people wouldn't find them
15:03:45 <karsten> and more importantly, we wouldn't have a good place to announce backward-incompatible updates.
15:04:02 <irl> ok yes
15:04:13 <karsten> we need something like onionoo's protocol page where people can look for major protocol bumps. and this would be similar.
15:04:43 <karsten> okay, I'll update the ticket as soon as I have something.
15:04:59 <irl> cool
15:05:09 <karsten> alright. I think we ran out of topics now.
15:05:18 <karsten> unless there's anything else?
15:05:22 <irl> nothing more from me
15:05:39 <karsten> great! thanks, and ttyl. bye!
15:05:44 <irl> bye!
15:05:45 <karsten> #endmeeting