22:05:27 <teor> #startmeeting Prio and PrivCount in Tor 22:05:27 <MeetBot> Meeting started Tue Nov 20 22:05:27 2018 UTC. The chair is teor. Information about MeetBot at http://wiki.debian.org/MeetBot. 22:05:27 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic. 22:05:56 <teor> Did you want to start with your questions? 22:06:15 <henrycg> Yes, sounds good. 22:06:33 <teor> This is a meeting about Prio, the privacy-preserving algorithm used to collect some Firefox user data 22:06:49 <teor> and PrivCount in Tor, the privacy-preserving design for Tor relay statistics 22:06:50 <henrycg> I have a couple of general questions: 22:07:07 <henrycg> 1) I'm curious what the status is of the PrivCount implementation in Tor. 22:07:30 <henrycg> 2) I'd be interested to hear if any of the ideas from Prio would be useful for Tor. 22:08:05 <henrycg> 3) I'm curious how much overlap there is between libprio (our Prio implementation for Mozilla) and the PrivCount implementation for Tor. 22:08:51 <henrycg> If there is a lot of overlap, then it might be worth thinking if libprio could use some of the code you've already written (or vice versa). 22:09:23 <henrycg> That's it for my questions. 22:09:28 <teor> Ok, so let's start with 1) 22:10:46 <teor> PrivCount in Tor has a specification for the low-level blinding, secret sharing, and aggregation 22:11:04 <teor> #link https://gitweb.torproject.org/torspec.git/tree/proposals/288-privcount-with-shamir.txt 22:11:44 <teor> nickm has implemented those parts of the specification in Rust 22:12:03 <teor> #link https://github.com/nmathewson/privcount_shamir 22:12:29 <teor> We're still revising the differential privacy and configuration parts of the specification 22:12:51 <henrycg> Great. 22:13:07 <henrycg> Can I ask one question about the specification? 22:13:10 <teor> #link https://trac.torproject.org/projects/tor/wiki/org/meetings/2018MexicoCity/Notes/PrivCount 22:13:22 <teor> #link https://trac.torproject.org/projects/tor/wiki/org/meetings/2018MexicoCity/Notes/PrivCountTechnical 22:14:11 <teor> We have a roadmap here, I'll update it with the last two links now: 22:14:16 <teor> #link https://trac.torproject.org/projects/tor/wiki/org/teams/NetworkTeam/PrivCountInTor 22:14:55 <teor> We deferred work on PrivCount in Tor until December/January, because another sponsor is ending soon 22:15:15 <henrycg> Ah, I see. 22:15:50 <henrycg> In lines 81-84 of the spec, it sketches a way to defend against "bogus data from corrupting the tally" 22:16:39 <henrycg> Is there some security/privacy analysis of this technique? 22:17:19 <henrycg> It's different from what we do in Prio, and I'm curious about the pros/cons of the different approaches. 22:17:38 <teor> There isn't a detailed analysis. 22:17:54 <teor> Every extra aggregation is a separate query of the data under differential privacy 22:18:33 <teor> So we would need to add extra noise for every subset that we query 22:18:46 <henrycg> Ah, I see. That makes sense. 22:19:36 <henrycg> So if you want to defend against bogus data from M malicious data collectors, how many different subsets do you take? Something like 3M? 22:20:20 <teor> Defending against data corruption requires one subset with no corrupt data 22:20:25 <teor> So it's a probabilistic thing 22:20:34 <teor> 3M is about right 22:20:45 <henrycg> Wouldn't you need a majority of subsets to have no corrupted data? 22:21:15 <teor> It depends how obvious the corrupt data is 22:21:31 <teor> And whether we do a median, or "exclude outliers, then median" 22:22:11 <teor> Alternately, we could use a small subset (with low probability of corrupt data) to check a few large subsets (with higher probability of corrupt data) 22:23:31 <teor> But we could get much more accurate data with some non-interactive proof that the data is in a realistic range 22:24:23 <teor> Which leads nicely into 2) I'd be interested to hear if any of the ideas from Prio would be useful for Tor. 22:24:32 <henrycg> Just so that I understand your use case a little better: How many bad data collectors are you expecting? Something like 5% or more like 5 (total)? Or more/less? 22:24:51 <henrycg> Yes, I'm happy to sketch what Prio does. 22:25:34 <teor> We're not sure. I'd have to ask someone from metrics how many outliers they exclude. 22:26:08 <teor> #action Ask the metrics team: How many outliers are excluded during metrics processing? 22:26:11 <henrycg> Makes sense, thanks. 22:26:15 <henrycg> About Prio: 22:26:33 <teor> We expect there might be more if corrupt data is harder to identify. 22:27:17 <henrycg> The "SNIP" primitive, described in the Prio paper, is a sort of "zero-knowledge proof on secret-shared data." 22:28:06 <henrycg> Say that a data collector splits its private data value into shares and sends one share to each of the tally reporters. 22:28:37 <henrycg> The data collector can then prove to the tally reporters that they are holding shares of a value x such that x is "well formed." 22:29:01 <henrycg> For example, the data collector could prove to the TRs that it sent them shares of an integer x where 0 <= x < 127. 22:29:28 <henrycg> The TRs learn nothing about x, apart from the fact that x is well formed. 22:29:53 <henrycg> Additionally, the proof uses no public-key crypto and requires the servers to exchange only a constant number of bits. 22:30:04 <henrycg> There are some limitations though: 22:31:02 <henrycg> * The DC-to-TR communication scales linearly with the complexity of the circuit that tests whether x is well formed. For many natural applications (i.e., 0 <= x < 2^8) this cost is not too bad. 22:31:52 <henrycg> * As Aaron Johnson mentioned in his email on the tor-dev list, out-of-the-box Prio provides privacy as long as one TR is honest but it provides correctness only if _all_ TRs are honest. 22:32:57 <henrycg> It would be possible to strengthen the correctness requirement with a bit more work if that turns out to be useful for some applications, but libprio does not yet support this. 22:33:16 <henrycg> Does that make sense, more or less? 22:34:30 <teor> Yes. Range proofs would be useful to PrivCount in Tor, and we would probably be looking at ranges like 0 <= x < 2^N, 16 < N < 64. 22:34:49 <teor> But we want to deploy a minimum viable product, then iterate. 22:35:02 <henrycg> Right. That makes sense. 22:35:06 <teor> So we'll look at the pros and cons of all the available non-interactive proof schemes once the initial system is deployed. 22:35:57 <teor> Data volume, security, peer-review, code quality, other users of the code, and available specifications are all criteria we look at before we add a dependency. 22:36:09 <teor> (There are probably more that I've forgotten.) 22:36:54 <henrycg> Right -- libprio is pretty heavily tailored to the Mozilla application, so it almost certainly would not suffice for Tor without quite a bit of work. 22:37:17 <henrycg> Do you have a sense of how many data values the DCs will send to the TRs? Is it more like tens or thousands or more? 22:38:06 <teor> We did an analysis in the spec, I think. Let me look. 22:41:26 <teor> If there are N tally reporters, and M counters, the random blinding seeds are 32*N bytes, and the blinded, noisy values are M*N*8 bytes 22:42:02 <teor> So for a reasonable number of counters (100) and tally reporters (10), it's about 8 kilobytes 22:42:19 <teor> We only do a round once a day, so that's entirely reasonable 22:43:19 <henrycg> Got it, thanks. 22:43:56 <teor> Slightly larger than the current relay statistics, which are about 2 kB per relay 22:44:11 <teor> But we don't currently have 100 stats 22:45:05 <teor> We're now effectively answering 3) I'm curious how much overlap there is between libprio (our Prio implementation for Mozilla) and the PrivCount implementation for Tor. 22:45:36 <henrycg> Yup. 22:46:27 <teor> It seems like there's a lot of conceptual overlap, and parts of the specifications would be similar 22:46:34 <teor> (Does Prio have a formal specification?) 22:46:40 <henrycg> No, it does not yet. 22:47:03 <henrycg> I agree that there is some overlap, with maybe the two differences being: 22:47:22 <henrycg> 1) PrivCount is using Shamir secret sharing to get security against malicious TRs, while libprio uses simple additive secret sharing. 22:47:24 <henrycg> 2) 22:48:02 <henrycg> 2) PrivCount has support for adding differential privacy noise on the DC side, while libprio doesn't aim to provide differential privacy (yet). 22:48:23 <henrycg> and 3) libprio has the proofs of correctness, which PrivCount doesn't yet implement. 22:48:35 <henrycg> Does that look more or less right to you? 22:49:05 <teor> We have a research implementation for noise allocation, but we haven't specified it yet. And we want to re-specify the noise distribution using integer arithmetic. 22:49:36 <henrycg> I see. Is there a reason that the DCs add the noise instead of having the TRs add it? 22:50:23 <teor> (Differential privacy breaks down in strange ways under floating-point inaccuracy, and it's hard to calculate the differential privacy bounds for floating-point gaussian distributions.) 22:51:05 <teor> If the DCs add the noise, it is not possible to reconstruct non-noisy totals, even if all the TRs are malicious and collaborate 22:52:24 <teor> There may have been other reasons, I'd have to ask the original researchers 22:52:27 <henrycg> It seems like the DCs would have to add a very large amount of noise to get meaningful privacy even if all TRs collude though. 22:53:05 <teor> Yes. Resistance to entirely malicious TRs is not part of the formal security guarantee. 22:53:17 <teor> But it's still a useful property of the implementation. 22:53:47 <henrycg> I see. 22:54:38 <teor> I should also mention that it's unlikely that the details in the Prio and PrivCount code are compatible. 22:54:41 <teor> PrivCount in Tor is in Rust with Rust data structures, and a specified interchange format. 22:56:05 <henrycg> Right, just in terms of functionality, it seems like if the PrivCount implementation supported proofs of correctness, it would subsume the functionality of libprio. 22:56:28 <teor> and affine aggregateable encodings 22:56:37 <teor> (Prio is in C and assembly, and uses msgpack for interchange (I think)) 22:57:35 <teor> I have another meeting soon, so any final questions? 22:57:43 <teor> (I can swap back and forth, but that's not ideal) 22:57:53 <henrycg> No -- I think that's it. Thanks for your time! 22:58:03 <teor> No worries. 22:58:10 <henrycg> It's very helpful to hear what else is out there. 22:58:28 <teor> To give you a likely timeframe, we should have some kind of deployment in 2019, depending on our other priotiries. 22:58:31 <teor> * priorities 22:58:57 <henrycg> Okay, I'll look out for that. Feel free to ping me if other Prio-related questions come up. 22:59:05 <teor> I'd say "the start of 2019", but let's be realistic. 22:59:27 <teor> OK, I'll end the meeting, and then meetbot will generate a transcript. 22:59:49 <teor> #endmeeting