22:05:27 <teor> #startmeeting Prio and PrivCount in Tor
22:05:27 <MeetBot> Meeting started Tue Nov 20 22:05:27 2018 UTC.  The chair is teor. Information about MeetBot at http://wiki.debian.org/MeetBot.
22:05:27 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
22:05:56 <teor> Did you want to start with your questions?
22:06:15 <henrycg> Yes, sounds good.
22:06:33 <teor> This is a meeting about Prio, the privacy-preserving algorithm used to collect some Firefox user data
22:06:49 <teor> and PrivCount in Tor, the privacy-preserving design for Tor relay statistics
22:06:50 <henrycg> I have a couple of general questions:
22:07:07 <henrycg> 1) I'm curious what the status is of the PrivCount implementation in Tor.
22:07:30 <henrycg> 2) I'd be interested to hear if any of the ideas from Prio would be useful for Tor.
22:08:05 <henrycg> 3) I'm curious how much overlap there is between libprio (our Prio implementation for Mozilla) and the PrivCount implementation for Tor.
22:08:51 <henrycg> If there is a lot of overlap, then it might be worth thinking if libprio could use some of the code you've already written (or vice versa).
22:09:23 <henrycg> That's it for my questions.
22:09:28 <teor> Ok, so let's start with 1)
22:10:46 <teor> PrivCount in Tor has a specification for the low-level blinding, secret sharing, and aggregation
22:11:04 <teor> #link https://gitweb.torproject.org/torspec.git/tree/proposals/288-privcount-with-shamir.txt
22:11:44 <teor> nickm has implemented those parts of the specification in Rust
22:12:03 <teor> #link https://github.com/nmathewson/privcount_shamir
22:12:29 <teor> We're still revising the differential privacy and configuration parts of the specification
22:12:51 <henrycg> Great.
22:13:07 <henrycg> Can I ask one question about the specification?
22:13:10 <teor> #link https://trac.torproject.org/projects/tor/wiki/org/meetings/2018MexicoCity/Notes/PrivCount
22:13:22 <teor> #link https://trac.torproject.org/projects/tor/wiki/org/meetings/2018MexicoCity/Notes/PrivCountTechnical
22:14:11 <teor> We have a roadmap here, I'll update it with the last two links now:
22:14:16 <teor> #link https://trac.torproject.org/projects/tor/wiki/org/teams/NetworkTeam/PrivCountInTor
22:14:55 <teor> We deferred work on PrivCount in Tor until December/January, because another sponsor is ending soon
22:15:15 <henrycg> Ah, I see.
22:15:50 <henrycg> In lines 81-84 of the spec, it sketches a way to defend against "bogus data from corrupting the tally"
22:16:39 <henrycg> Is there some security/privacy analysis of this technique?
22:17:19 <henrycg> It's different from what we do in Prio, and I'm curious about the pros/cons of the different approaches.
22:17:38 <teor> There isn't a detailed analysis.
22:17:54 <teor> Every extra aggregation is a separate query of the data under differential privacy
22:18:33 <teor> So we would need to add extra noise for every subset that we query
22:18:46 <henrycg> Ah, I see. That makes sense.
22:19:36 <henrycg> So if you want to defend against bogus data from M malicious data collectors, how many different subsets do you take? Something like 3M?
22:20:20 <teor> Defending against data corruption requires one subset with no corrupt data
22:20:25 <teor> So it's a probabilistic thing
22:20:34 <teor> 3M is about right
22:20:45 <henrycg> Wouldn't you need a majority of subsets to have no corrupted data?
22:21:15 <teor> It depends how obvious the corrupt data is
22:21:31 <teor> And whether we do a median, or "exclude outliers, then median"
22:22:11 <teor> Alternately, we could use a small subset (with low probability of corrupt data) to check a few large subsets (with higher probability of corrupt data)
22:23:31 <teor> But we could get much more accurate data with some non-interactive proof that the data is in a realistic range
22:24:23 <teor> Which leads nicely into 2) I'd be interested to hear if any of the ideas from Prio would be useful for Tor.
22:24:32 <henrycg> Just so that I understand your use case a little better: How many bad data collectors are you expecting? Something like 5% or more like 5 (total)? Or more/less?
22:24:51 <henrycg> Yes, I'm happy to sketch what Prio does.
22:25:34 <teor> We're not sure. I'd have to ask someone from metrics how many outliers they exclude.
22:26:08 <teor> #action Ask the metrics team: How many outliers are excluded during metrics processing?
22:26:11 <henrycg> Makes sense, thanks.
22:26:15 <henrycg> About Prio:
22:26:33 <teor> We expect there might be more if corrupt data is harder to identify.
22:27:17 <henrycg> The "SNIP" primitive, described in the Prio paper, is a sort of "zero-knowledge proof on secret-shared data."
22:28:06 <henrycg> Say that a data collector splits its private data value into shares and sends one share to each of the tally reporters.
22:28:37 <henrycg> The data collector can then prove to the tally reporters that they are holding shares of a value x such that x is "well formed."
22:29:01 <henrycg> For example, the data collector could prove to the TRs that it sent them shares of an integer x where 0 <= x < 127.
22:29:28 <henrycg> The TRs learn nothing about x, apart from the fact that x is well formed.
22:29:53 <henrycg> Additionally, the proof uses no public-key crypto and requires the servers to exchange only a constant number of bits.
22:30:04 <henrycg> There are some limitations though:
22:31:02 <henrycg> * The DC-to-TR communication scales linearly with the complexity of the circuit that tests whether x is well formed. For many natural applications (i.e., 0 <= x < 2^8) this cost is not too bad.
22:31:52 <henrycg> * As Aaron Johnson mentioned in his email on the tor-dev list, out-of-the-box Prio provides privacy as long as one TR is honest but it provides correctness only if _all_ TRs are honest.
22:32:57 <henrycg> It would be possible to strengthen the correctness requirement with a bit more work if that turns out to be useful for some applications, but libprio does not yet support this.
22:33:16 <henrycg> Does that make sense, more or less?
22:34:30 <teor> Yes. Range proofs would be useful to PrivCount in Tor, and we would probably be looking at ranges like 0 <= x < 2^N, 16 < N < 64.
22:34:49 <teor> But we want to deploy a minimum viable product, then iterate.
22:35:02 <henrycg> Right. That makes sense.
22:35:06 <teor> So we'll look at the pros and cons of all the available non-interactive proof schemes once the initial system is deployed.
22:35:57 <teor> Data volume, security, peer-review, code quality, other users of the code, and available specifications are all criteria we look at before we add a dependency.
22:36:09 <teor> (There are probably more that I've forgotten.)
22:36:54 <henrycg> Right -- libprio is pretty heavily tailored to the Mozilla application, so it almost certainly would not suffice for Tor without quite a bit of work.
22:37:17 <henrycg> Do you have a sense of how many data values the DCs will send to the TRs? Is it more like tens or thousands or more?
22:38:06 <teor> We did an analysis in the spec, I think. Let me look.
22:41:26 <teor> If there are N tally reporters, and M counters, the random blinding seeds are 32*N bytes, and the blinded, noisy values are M*N*8 bytes
22:42:02 <teor> So for a reasonable number of counters (100) and tally reporters (10), it's about 8 kilobytes
22:42:19 <teor> We only do a round once a day, so that's entirely reasonable
22:43:19 <henrycg> Got it, thanks.
22:43:56 <teor> Slightly larger than the current relay statistics, which are about 2 kB per relay
22:44:11 <teor> But we don't currently have 100 stats
22:45:05 <teor> We're now effectively answering 3) I'm curious how much overlap there is between libprio (our Prio implementation for Mozilla) and the PrivCount implementation for Tor.
22:45:36 <henrycg> Yup.
22:46:27 <teor> It seems like there's a lot of conceptual overlap, and parts of the specifications would be similar
22:46:34 <teor> (Does Prio have a formal specification?)
22:46:40 <henrycg> No, it does not yet.
22:47:03 <henrycg> I agree that there is some overlap, with maybe the two differences being:
22:47:22 <henrycg> 1) PrivCount is using Shamir secret sharing to get security against malicious TRs, while libprio uses simple additive secret sharing.
22:47:24 <henrycg> 2)
22:48:02 <henrycg> 2) PrivCount has support for adding differential privacy noise on the DC side, while libprio doesn't aim to provide differential privacy (yet).
22:48:23 <henrycg> and 3) libprio has the proofs of correctness, which PrivCount doesn't yet implement.
22:48:35 <henrycg> Does that look more or less right to you?
22:49:05 <teor> We have a research implementation for noise allocation, but we haven't specified it yet. And we want to re-specify the noise distribution using integer arithmetic.
22:49:36 <henrycg> I see. Is there a reason that the DCs add the noise instead of having the TRs add it?
22:50:23 <teor> (Differential privacy breaks down in strange ways under floating-point inaccuracy, and it's hard to calculate the differential privacy bounds for floating-point gaussian distributions.)
22:51:05 <teor> If the DCs add the noise, it is not possible to reconstruct non-noisy totals, even if all the TRs are malicious and collaborate
22:52:24 <teor> There may have been other reasons, I'd have to ask the original researchers
22:52:27 <henrycg> It seems like the DCs would have to add a very large amount of noise to get meaningful privacy even if all TRs collude though.
22:53:05 <teor> Yes. Resistance to entirely malicious TRs is not part of the formal security guarantee.
22:53:17 <teor> But it's still a useful property of the implementation.
22:53:47 <henrycg> I see.
22:54:38 <teor> I should also mention that it's unlikely that the details in the Prio and PrivCount code are compatible.
22:54:41 <teor> PrivCount in Tor is in Rust with Rust data structures, and a specified interchange format.
22:56:05 <henrycg> Right, just in terms of functionality, it seems like if the PrivCount implementation supported proofs of correctness, it would subsume the functionality of libprio.
22:56:28 <teor> and affine aggregateable encodings
22:56:37 <teor> (Prio is in C and assembly, and uses msgpack for interchange (I think))
22:57:35 <teor> I have another meeting soon, so any final questions?
22:57:43 <teor> (I can swap back and forth, but that's not ideal)
22:57:53 <henrycg> No -- I think that's it. Thanks for your time!
22:58:03 <teor> No worries.
22:58:10 <henrycg> It's very helpful to hear what else is out there.
22:58:28 <teor> To give you a likely timeframe, we should have some kind of deployment in 2019, depending on our other priotiries.
22:58:31 <teor> * priorities
22:58:57 <henrycg> Okay, I'll look out for that. Feel free to ping me if other Prio-related questions come up.
22:59:05 <teor> I'd say "the start of 2019", but let's be realistic.
22:59:27 <teor> OK, I'll end the meeting, and then meetbot will generate a transcript.
22:59:49 <teor> #endmeeting