14:28:58 #startmeeting metrics team
14:28:58 Meeting started Thu Jul 26 14:28:58 2018 UTC. The chair is karsten. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:28:58 Useful Commands: #action #agreed #help #info #idea #link #topic.
14:29:00 hi!
14:29:02 hi!
14:29:29 https://storm.torproject.org/shared/5h1Goax5eNusxjXJ_Ty5Wl7hFR1uqCReUiN8xdlBG8T <- agenda pad
14:30:08 looks like we have just one topic for today, which is fine.
14:30:36 * OONI Vanilla Tor Data Analysis (irl)
14:31:16 this was mostly to give an update on what i've been doing here but there are a couple of questions
14:31:38 the plan sounds good to me.
14:31:45 the first question is about the table?
14:31:45 instead of doing the preprocessing of the data as part of metrics-web, i've been looking at doing it on the OONI side
14:32:15 there it comes out of a postgres database and we can run queries there, which makes it easier
14:32:40 I think it's a fine plan to start doing this part on the ooni side.
14:32:55 we can later change that, but we should then also think about archiving their data in collector.
14:33:13 starting the aggregation on the ooni side doesn't make that necessary.
14:33:24 ok cool, so the csv that comes out is aggregated by country and date
14:33:36 sure, sounds good.
14:33:46 as the data is quite sparse, i'm still undecided whether i want the date to be at a daily or weekly resolution
14:34:38 ah, hmm.
14:34:40 the probes should be performing measurements daily and so making it weekly doesn't necessarily improve the data that much
14:34:49 if there's one broken probe it could just be broken all week
14:35:41 I just added two comments to the pad.
14:36:17 success_rate is a fraction between 0 and 1
14:36:42 would it be easier to have a success_count there?
14:36:52 thinking of later aggregations and missing dates.
14:37:07 and also add a failure_count?
14:37:36 maybe. though that should be report_count - success_count.
14:37:43 unless there's a third category.
14:38:19 but going back to the question of whether to use daily or weekly data: most data on tor metrics is daily.
14:38:20 either it works or it doesn't, there is more detail in how far it gets in the bootstrap process but i don't think we want to go into that too deeply
14:38:44 sure. adding a failure_count would be fine then.
14:38:50 up to you.
14:39:21 ok
14:39:28 by the way, this is an internal data format, not one we're giving out, right?
14:39:42 as in, we can change it any time.
14:39:54 yes, we could change it
14:40:02 okay.
14:40:56 the last question was whether it is worth also adding a table?
14:41:11 probably not.
14:41:27 there will be a CSV file download.
14:41:52 with just the data that is displayed, and with ways to configure the start/end date and the country.
14:41:53 ok then
14:42:07 so i will turn all of this into tickets going forward
14:42:12 I'm not even sure if we still need the current tables. but that's a discussion for another time.
14:42:20 I also added another note:
14:42:24 - Add section to reproducible metrics page
14:42:55 yes, this sounds good, and should be easier as we're going directly from the ooni measurements now
14:43:10 under which category does the new graph go? Performance?
14:43:30 i guess it is reliability
14:43:49 so yes, performance
14:44:46 the reproducible metrics page should then explain how to go from ooni measurements to this graph.
14:44:49 probably.
14:45:07 rather than saying it's all magically there in the CSV file that we get from OONI.
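
A minimal sketch in R of the per-country, per-day aggregation discussed above. The input data frame `measurements` and its column names (probe_cc, measurement_start_time, a logical success flag) are assumptions for illustration, not the actual OONI schema or query:

    # Sketch only: aggregate per-measurement vanilla_tor results into the
    # per-country, per-day counts discussed above (report_count,
    # success_count, failure_count). The input data frame `measurements`
    # and its column names are assumptions, not the actual OONI schema.
    library(dplyr)

    vanilla_tor_aggregates <- measurements %>%
      mutate(date = as.Date(measurement_start_time)) %>%
      group_by(date, country = probe_cc) %>%
      summarize(report_count = n(),
                success_count = sum(success, na.rm = TRUE),
                failure_count = report_count - success_count,
                .groups = "drop")

    write.csv(vanilla_tor_aggregates, "vanilla-tor.csv", row.names = FALSE)
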
14:45:38 yes, i would try to describe how to get it out of https://api.ooni.io/
14:45:46 sounds good!
14:46:30 okay, it looks like you'll have to turn the current Performance section into a subsection and then add another one for OONI data.
14:46:37 there might even be something commented out in the HTML.
14:46:46 you'll figure that out.
14:47:35 ok
14:47:36 by the way, I'm happy to help with R/ggplot2/tidyr/dplyr details, if that would otherwise keep you busy for too long.
14:48:14 alright.
14:48:28 ok cool
14:48:31 @irl if you want to do batch import of all the metrics of a particular `test_name` (in this case `vanilla_tor`), by far the fastest way of doing it is from the daily compressed tarballs in s3 buckets
14:49:16 See: https://ooni.torproject.org/post/mining-ooni-data/#jsonl-tar-lz4
14:49:21 like a message from above.. :)
14:49:36 hellais: we don't need the whole tests, what you have in the postgres is everything we need
14:50:02 I was speaking about the reproducible metrics aspect
14:50:05 aah ok
14:50:13 The data we publish on s3 buckets will likely be there for a long time
14:50:19 yes, i was going to reference that blog post
14:50:24 While the postgres database is subject to change (as it’s an internal database)
14:50:38 (We may even eventually replace postgres with something else in the future)
14:52:02 yes, later we would want to archive the data in collector and at that point we wouldn't use the ooni postgres anymore
14:53:55 okay. is there anything else on that topic?
14:54:11 Right
14:54:29 i think nothing else from me
14:54:35 One thing that may interest you on this topic is that it may be useful for you folks to have some sort of synchronization primitive for knowing when the OONI pipeline has ticked
14:54:46 So that you can sync your ingestion pipeline to it
14:55:12 this sounds like a thing to do later, maybe we can discuss it in mexico?
14:55:19 I guess we would fetch a new file from OONI once per day, for now.
14:55:23 yes
14:55:47 one thing that's worth figuring out is how long after the end of a UTC day there will be new measurements for that day.
14:56:04 the question is whether we need to cut off days at the end, and if so, how many.
14:56:17 Currently our pipeline ticks ~every 40h
14:56:27 We are soon going to bring that down to ~24h
14:56:33 there was a ticket a while ago where we went through all data sources and came up with a number of days to cut off.
14:56:41 And eventually bring that down to sub-daily
14:57:19 This is probably something to keep in mind when you present your data, as you should consider stuff that is newer than the last $OONI_TICK not ready to show yet
14:57:56 so we can cut off the last 2 days and then that would be fine for now
14:58:22 try it out, I'd say. maybe we need 3, depending on when during the day things are running.
14:59:26 still useful to see trends. just not the latest developments.
14:59:35 but that's the case with other graphs, too.
15:00:14 ok
15:00:21 alright. moving on.
15:00:28 * Extend Statistics page (karsten)
15:00:51 very quickly, I started extending the Statistics page on Tor Metrics to cover all per-graph CSV files as well.
15:01:12 before deprecating the more general CSV files that we're planning to take out in the near future.
15:01:27 (while still using them internally, but not providing them as interface anymore.)
15:01:59 hmm, there might even be a ticket for that.
15:02:28 this is the page at /stats.html ?
15:02:37 yes, it is.
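
A minimal sketch, again in R, of the day cut-off discussed above (14:56 to 14:59): drop the most recent days before graphing, since the OONI pipeline may not have processed them yet. It reuses the hypothetical `vanilla_tor_aggregates` data frame from the earlier sketch, and the cut-off of 2 days is the value suggested above, which may need to become 3:

    # Sketch only: cut off the last N days before graphing, since data newer
    # than the last pipeline tick is not ready to show yet.
    library(dplyr)
    library(ggplot2)

    cutoff_days <- 2  # possibly 3, depending on when things run during the day

    plot_data <- vanilla_tor_aggregates %>%
      filter(date <= Sys.Date() - cutoff_days)

    ggplot(plot_data, aes(x = date, y = success_count / report_count)) +
      geom_line() +
      facet_wrap(~ country) +
      labs(x = "", y = "Fraction of successful vanilla Tor runs")
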
15:02:49 ok, this sounds good
15:03:07 we briefly thought about putting specifications in CSV file headers,
15:03:16 but I fear that that doesn't scale very well.
15:03:23 also, people wouldn't find them
15:03:45 and more importantly, we wouldn't have a good place for announcing backward-incompatible updates.
15:04:02 ok yes
15:04:13 we need something like onionoo's protocol page where people can look for major protocol bumps. and this would be similar.
15:04:43 okay, I'll update the ticket as soon as I have something.
15:04:59 cool
15:05:09 alright. I think we ran out of topics now.
15:05:18 unless there's anything else?
15:05:22 nothing more from me
15:05:39 great! thanks, and ttyl. bye!
15:05:44 bye!
15:05:45 #endmeeting