22:05:27 #startmeeting Prio and PrivCount in Tor 22:05:27 Meeting started Tue Nov 20 22:05:27 2018 UTC. The chair is teor. Information about MeetBot at http://wiki.debian.org/MeetBot. 22:05:27 Useful Commands: #action #agreed #help #info #idea #link #topic. 22:05:56 Did you want to start with your questions? 22:06:15 Yes, sounds good. 22:06:33 This is a meeting about Prio, the privacy-preserving algorithm used to collect some Firefox user data 22:06:49 and PrivCount in Tor, the privacy-preserving design for Tor relay statistics 22:06:50 I have a couple of general questions: 22:07:07 1) I'm curious what the status is of the PrivCount implementation in Tor. 22:07:30 2) I'd be interested to hear if any of the ideas from Prio would be useful for Tor. 22:08:05 3) I'm curious how much overlap there is between libprio (our Prio implementation for Mozilla) and the PrivCount implementation for Tor. 22:08:51 If there is a lot of overlap, then it might be worth thinking if libprio could use some of the code you've already written (or vice versa). 22:09:23 That's it for my questions. 22:09:28 Ok, so let's start with 1) 22:10:46 PrivCount in Tor has a specification for the low-level blinding, secret sharing, and aggregation 22:11:04 #link https://gitweb.torproject.org/torspec.git/tree/proposals/288-privcount-with-shamir.txt 22:11:44 nickm has implemented those parts of the specification in Rust 22:12:03 #link https://github.com/nmathewson/privcount_shamir 22:12:29 We're still revising the differential privacy and configuration parts of the specification 22:12:51 Great. 22:13:07 Can I ask one question about the specification? 22:13:10 #link https://trac.torproject.org/projects/tor/wiki/org/meetings/2018MexicoCity/Notes/PrivCount 22:13:22 #link https://trac.torproject.org/projects/tor/wiki/org/meetings/2018MexicoCity/Notes/PrivCountTechnical 22:14:11 We have a roadmap here, I'll update it with the last two links now: 22:14:16 #link https://trac.torproject.org/projects/tor/wiki/org/teams/NetworkTeam/PrivCountInTor 22:14:55 We deferred work on PrivCount in Tor until December/January, because another sponsor is ending soon 22:15:15 Ah, I see. 22:15:50 In lines 81-84 of the spec, it sketches a way to defend against "bogus data from corrupting the tally" 22:16:39 Is there some security/privacy analysis of this technique? 22:17:19 It's different from what we do in Prio, and I'm curious about the pros/cons of the different approaches. 22:17:38 There isn't a detailed analysis. 22:17:54 Every extra aggregation is a separate query of the data under differential privacy 22:18:33 So we would need to add extra noise for every subset that we query 22:18:46 Ah, I see. That makes sense. 22:19:36 So if you want to defend against bogus data from M malicious data collectors, how many different subsets do you take? Something like 3M? 22:20:20 Defending against data corruption requires one subset with no corrupt data 22:20:25 So it's a probabilistic thing 22:20:34 3M is about right 22:20:45 Wouldn't you need a majority of subsets to have no corrupted data? 22:21:15 It depends how obvious the corrupt data is 22:21:31 And whether we do a median, or "exclude outliers, then median" 22:22:11 Alternately, we could use a small subset (with low probability of corrupt data) to check a few large subsets (with higher probability of corrupt data) 22:23:31 But we could get much more accurate data with some non-interactive proof that the data is in a realistic range 22:24:23 Which leads nicely into 2) I'd be interested to hear if any of the ideas from Prio would be useful for Tor. 22:24:32 Just so that I understand your use case a little better: How many bad data collectors are you expecting? Something like 5% or more like 5 (total)? Or more/less? 22:24:51 Yes, I'm happy to sketch what Prio does. 22:25:34 We're not sure. I'd have to ask someone from metrics how many outliers they exclude. 22:26:08 #action Ask the metrics team: How many outliers are excluded during metrics processing? 22:26:11 Makes sense, thanks. 22:26:15 About Prio: 22:26:33 We expect there might be more if corrupt data is harder to identify. 22:27:17 The "SNIP" primitive, described in the Prio paper, is a sort of "zero-knowledge proof on secret-shared data." 22:28:06 Say that a data collector splits its private data value into shares and sends one share to each of the tally reporters. 22:28:37 The data collector can then prove to the tally reporters that they are holding shares of a value x such that x is "well formed." 22:29:01 For example, the data collector could prove to the TRs that it sent them shares of an integer x where 0 <= x < 127. 22:29:28 The TRs learn nothing about x, apart from the fact that x is well formed. 22:29:53 Additionally, the proof uses no public-key crypto and requires the servers to exchange only a constant number of bits. 22:30:04 There are some limitations though: 22:31:02 * The DC-to-TR communication scales linearly with the complexity of the circuit that tests whether x is well formed. For many natural applications (i.e., 0 <= x < 2^8) this cost is not too bad. 22:31:52 * As Aaron Johnson mentioned in his email on the tor-dev list, out-of-the-box Prio provides privacy as long as one TR is honest but it provides correctness only if _all_ TRs are honest. 22:32:57 It would be possible to strengthen the correctness requirement with a bit more work if that turns out to be useful for some applications, but libprio does not yet support this. 22:33:16 Does that make sense, more or less? 22:34:30 Yes. Range proofs would be useful to PrivCount in Tor, and we would probably be looking at ranges like 0 <= x < 2^N, 16 < N < 64. 22:34:49 But we want to deploy a minimum viable product, then iterate. 22:35:02 Right. That makes sense. 22:35:06 So we'll look at the pros and cons of all the available non-interactive proof schemes once the initial system is deployed. 22:35:57 Data volume, security, peer-review, code quality, other users of the code, and available specifications are all criteria we look at before we add a dependency. 22:36:09 (There are probably more that I've forgotten.) 22:36:54 Right -- libprio is pretty heavily tailored to the Mozilla application, so it almost certainly would not suffice for Tor without quite a bit of work. 22:37:17 Do you have a sense of how many data values the DCs will send to the TRs? Is it more like tens or thousands or more? 22:38:06 We did an analysis in the spec, I think. Let me look. 22:41:26 If there are N tally reporters, and M counters, the random blinding seeds are 32*N bytes, and the blinded, noisy values are M*N*8 bytes 22:42:02 So for a reasonable number of counters (100) and tally reporters (10), it's about 8 kilobytes 22:42:19 We only do a round once a day, so that's entirely reasonable 22:43:19 Got it, thanks. 22:43:56 Slightly larger than the current relay statistics, which are about 2 kB per relay 22:44:11 But we don't currently have 100 stats 22:45:05 We're now effectively answering 3) I'm curious how much overlap there is between libprio (our Prio implementation for Mozilla) and the PrivCount implementation for Tor. 22:45:36 Yup. 22:46:27 It seems like there's a lot of conceptual overlap, and parts of the specifications would be similar 22:46:34 (Does Prio have a formal specification?) 22:46:40 No, it does not yet. 22:47:03 I agree that there is some overlap, with maybe the two differences being: 22:47:22 1) PrivCount is using Shamir secret sharing to get security against malicious TRs, while libprio uses simple additive secret sharing. 22:47:24 2) 22:48:02 2) PrivCount has support for adding differential privacy noise on the DC side, while libprio doesn't aim to provide differential privacy (yet). 22:48:23 and 3) libprio has the proofs of correctness, which PrivCount doesn't yet implement. 22:48:35 Does that look more or less right to you? 22:49:05 We have a research implementation for noise allocation, but we haven't specified it yet. And we want to re-specify the noise distribution using integer arithmetic. 22:49:36 I see. Is there a reason that the DCs add the noise instead of having the TRs add it? 22:50:23 (Differential privacy breaks down in strange ways under floating-point inaccuracy, and it's hard to calculate the differential privacy bounds for floating-point gaussian distributions.) 22:51:05 If the DCs add the noise, it is not possible to reconstruct non-noisy totals, even if all the TRs are malicious and collaborate 22:52:24 There may have been other reasons, I'd have to ask the original researchers 22:52:27 It seems like the DCs would have to add a very large amount of noise to get meaningful privacy even if all TRs collude though. 22:53:05 Yes. Resistance to entirely malicious TRs is not part of the formal security guarantee. 22:53:17 But it's still a useful property of the implementation. 22:53:47 I see. 22:54:38 I should also mention that it's unlikely that the details in the Prio and PrivCount code are compatible. 22:54:41 PrivCount in Tor is in Rust with Rust data structures, and a specified interchange format. 22:56:05 Right, just in terms of functionality, it seems like if the PrivCount implementation supported proofs of correctness, it would subsume the functionality of libprio. 22:56:28 and affine aggregateable encodings 22:56:37 (Prio is in C and assembly, and uses msgpack for interchange (I think)) 22:57:35 I have another meeting soon, so any final questions? 22:57:43 (I can swap back and forth, but that's not ideal) 22:57:53 No -- I think that's it. Thanks for your time! 22:58:03 No worries. 22:58:10 It's very helpful to hear what else is out there. 22:58:28 To give you a likely timeframe, we should have some kind of deployment in 2019, depending on our other priotiries. 22:58:31 * priorities 22:58:57 Okay, I'll look out for that. Feel free to ping me if other Prio-related questions come up. 22:59:05 I'd say "the start of 2019", but let's be realistic. 22:59:27 OK, I'll end the meeting, and then meetbot will generate a transcript. 22:59:49 #endmeeting