15:29:22 #startmeeting proposals for authority hardening
15:29:22 Meeting started Fri Feb 19 15:29:22 2016 UTC. The chair is nickm. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:29:22 Useful Commands: #action #agreed #help #info #idea #link #topic.
15:30:24 So I think we're talking about proposals prop#257 and prop#258 by me and andrea respectively.
15:30:33 Proposal 257: Refactoring authorities and taking parts offline
15:30:34 Proposal 258: Denial-of-service resistance for directory authorities
15:30:50 I think isis and dgoulet and athena can make it?
15:30:55 armadev: ?
15:31:34 again, I apologize for getting the time wrong :p
15:31:35 hi hi meeting
15:31:39 hi hi
15:32:03 so, let's do prop#258 first because it's shorter?
15:32:29 anybody (other than athena) want to summarize prop#258, or should I ?
15:32:42 (I generally want non-authors to summarize, so that the author can correct if they're wrong)
15:33:29 ok, I'll try. Let's hope more people filter in too.
15:33:54 The basic idea is that if a dirauth gets too many connections from an IP, or too many begindir requests on a circuit, it should start rejecting.
15:34:09 sections 3 and onwards explain how.
15:34:34 * isis is here
15:34:40 https://gitweb.torproject.org/torspec.git/tree/proposals/257-hiding-authorities.txt
15:34:45 https://gitweb.torproject.org/torspec.git/tree/proposals/258-dirauth-dos.txt
15:34:51 isis: we're doing 258 now
15:34:56 got it
15:35:12 that is the one i haven't read yet…
15:35:29 the approach is to classify requests in several categories, and have a count per-IP or per-circuit of how many we've seen, using the EWMA algorithm so that we don't have to store a lot of data.
15:35:44 we can probably be a bit aggressive about the timeout here, even
15:35:53 since remembering old requests isn't important.
15:36:08 This is going to need some thinking for IPv6, where everybody can trivially generate a billion addresses.
15:36:12 athena: did I get it about right?
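The per-client EWMA counting scheme summarized above can be sketched in a few lines. This is a minimal Python illustration of the idea, not the proposal's implementation; the half-life and the blocking threshold are placeholder values (the real ones would come from config/consensus parameters).

```python
import math

# Placeholder tuning values -- illustrative only, not from prop#258.
HALF_LIFE = 30.0          # seconds for a count to decay by half
BLOCK_THRESHOLD = 32.0    # decayed count above which we start rejecting

class EwmaCounter:
    """Exponentially decaying request count for one source IP (or circuit)."""

    def __init__(self, now):
        self.count = 0.0
        self.last_update = now

    def _decay(self, now):
        # Apply exponential decay for the time elapsed since the last bump.
        elapsed = now - self.last_update
        self.count *= math.pow(0.5, elapsed / HALF_LIFE)
        self.last_update = now

    def bump(self, now):
        """Record one request; return True if the client should be rejected."""
        self._decay(now)
        self.count += 1.0
        return self.count > BLOCK_THRESHOLD
```

A dirauth would keep one such counter per source IP for direct dirport connections, and per circuit for begindir requests, bumping it on each arrival; old activity fades away on its own, so nothing needs to be stored about requests long past.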
15:37:28 * dgoulet here
15:37:34 yeah it's about right!
15:37:37 hi dgoulet
15:38:01 it sounds like it's not per-IP but per bucket in some hash table
15:38:26 "get a single bucket per source IP each"
15:39:21 #4581 is the ticket
15:39:40 athena: did you agree that the description is more or less right?
15:39:41 nickm: yes, that's about correct
15:39:42 all: what did I miss?
15:40:07 at what point do we increment the counter?
15:40:32 dgoulet: ah, i see. but, if we want to limit memory exhaustion attacks, don't we want multiple source IPs per bucket? particularly for IPv6?
15:40:35 and yes, for IPv6 it would benefit from some tinkering - possibly, buckets for ranges, and an IP bumps every bucket for the successive /ns containing it, and higher thresholds for larger buckets
15:41:13 nickm: at the point of an incoming connection to the dirport for direct, or at the point of receiving a begindir cell
15:41:43 we probably want that for IPv4 also unless we can live with _large_ ok-ish HT
15:41:58 well, the HT doesn't need to get too big, right?
15:42:20 like, if we make it so that we expire after ~30 seconds, we're probably in the clear and the HT doesn't store too much.
15:42:43 yes, HT entries that drop below a threshold by EWMA get purged
15:43:30 and the threshold doesn't have to be super-low, since we can accept that anybody doing less than e.g. one request per 30-60 seconds is probably not doing a DoS of this kind
15:43:31 okay, it shouldn't grow too large in that case then
15:43:55 how will the timeout be specified?
15:45:04 probably by one of those config parameters.
15:45:10 oh, i see, §4
15:45:13 yeah
15:45:16 sorry! still reading…
15:45:22 hmm, it isn't an explicit timeout in the version i wrote, it's a decay constant for the EWMA and then it drops the counters when they get below (IIRC) 0.01 * the relevant DoS blocking threshold
15:46:51 isis: how are you typing all of those §s?
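athena's range-bucket idea for IPv6 — bump a counter for every enclosing prefix, with higher thresholds for wider prefixes — might look roughly like the sketch below. The tracked prefix lengths and their thresholds are invented for illustration; a real version would also decay each bucket with the same EWMA scheme rather than counting forever.

```python
import ipaddress

# Hypothetical prefix lengths and thresholds -- not from the proposal.
# Wider prefixes cover more clients, so they get looser limits.
PREFIX_THRESHOLDS = {128: 32, 64: 256, 48: 1024, 32: 4096}

def bucket_keys(addr):
    """Return a (prefix_len, network) key for each tracked enclosing prefix."""
    ip = ipaddress.ip_address(addr)
    keys = []
    for plen in PREFIX_THRESHOLDS:
        if plen <= ip.max_prefixlen:  # skip IPv6-only lengths for IPv4 addrs
            net = ipaddress.ip_network((addr, plen), strict=False)
            keys.append((plen, net))
    return keys

def note_request(counters, addr):
    """Bump every enclosing bucket; return True if any exceeds its threshold."""
    blocked = False
    for plen, net in bucket_keys(addr):
        counters[(plen, net)] = counters.get((plen, net), 0) + 1
        if counters[(plen, net)] > PREFIX_THRESHOLDS[plen]:
            blocked = True
    return blocked
```

This addresses the memory-exhaustion concern: an attacker rotating through a billion addresses inside one /64 creates a billion cheap /128 entries at worst, but trips the shared /64 bucket long before that, at which point the whole range can be rejected.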
15:47:31 compose_key + s + o
15:48:55 compose_key is menu for me, specified in /etc/default/keyboard as `XKBOPTIONS="compose:menu,grp:sclk_toggle,grp_led:num,lv3:switch,keypad:future,caps:swapescape"`
15:50:06 so I'm re-reading the ticket since I looked at the code for that proposal a while back and one comment could be useful for this discussion:
15:50:50 about the choice of the parameters, the code uses some values but I don't see anything in the proposal explaining why "32" or a table of possible values
15:51:08 maybe it's something we have no idea so we have to try it and tweak it as we go?
15:53:03 dgoulet: they are rather arbitrary - the algorithm is really "high enough they don't trip on the dirauths under normal conditions, low enough they cut off attacker traffic"
15:53:50 i suspect the real answer to this is "add a mode that blocks nothing but records stats on 95th percentile values or whatever, and see what the dirauths get in practice"
15:55:33 hrm ok
15:56:24 do we all agree that prop#258 should be accepted?
15:57:44 yes I think it's a useful addition!
15:57:54 concur, but i wrote it
15:58:20 we should be very careful at first and monitor our dirauths quite a bit so we don't end up actually denying valid requests, but that's the tuning part
15:58:30 we can start with one single dirauth anyway
15:58:39 +1 to accept
16:00:13 #action change prop#258 status to accepted
16:00:24 more on 258, or on to prop#257?
16:00:39 can we make prop258 go in 028 ?
16:01:49 dgoulet: depends how much you care about getting good unit test coverage and how sophisticated you want to get on that load-measuring mode, i think
16:03:31 hrm we should then probably take a #yolo dirauth (winks at armadev :) and make it use that patch so we can tune it before making it stable and updating all 9 dirauths?
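The record-only tuning mode athena suggests — block nothing, record per-client counter peaks, and read off the 95th percentile under normal load — could work along these lines. The nearest-rank percentile and the 2x headroom factor are made-up illustrations, not anything specified in the proposal.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of observed per-IP counter peaks."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def suggest_threshold(observed_peaks, headroom=2.0):
    """Take the 95th percentile of peaks seen while blocking nothing,
    pad it with some headroom, and propose that as the blocking cutoff.
    The headroom factor is an arbitrary illustrative choice."""
    return percentile(observed_peaks, 95) * headroom
```

Running a dirauth in this mode for a while would replace the current arbitrary "32" with a value anchored in what honest clients actually do.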
16:05:30 well, i don't think you should deploy a patch like this on all nine dirauths at once because most of its real-world load-testing will happen on the dirauths
16:05:53 this is sounding like "let's try to get this into the first week of 0.2.9 merges" then?
16:05:59 yeah ^
16:06:00 and if it turns out to have some bug we didn't know about in advance, finding out on all nine dirauths at once will suck
16:06:07 athena: very true
16:08:01 so on to prop#257 ? :)
16:08:11 yup
16:09:28 somebody other than me summarize?
16:10:10 I can try, I've read it a while back but I skimmed it again before the meeting so I might get it right
16:11:14 go for it, sez I
16:11:57 basically we want to split the dirauth role 1) from little-t tor and 2) into modules that are basically different processes. An "Upload" component where relays send their descriptors; this goes to the voting process to create a consensus, then a module for publishing pushes it to the dircaches (which I assume is the Distribution component in the proposal)
16:12:36 I'm unsure where the bwauths are plugged into that infrastructure but I assume "Upload"
16:13:30 then there are some considerations on how communication happens, which would be TLS for relays uploading their descriptors, and then in between modules it could be something else like rsync-ssh
16:13:45 that's all I have in my head ^ --
16:14:50 sounds roughly right. There's more detail in the proposal.
16:15:16 basically, the idea is to separate the "parts that need a public IP", "parts that need to hold a key", and "parts that need lots of bandwidth" into logically distinct parts.
16:15:37 because authorities do like 5-10 separate things right now, and there's no reason that they need to be done on the same host.
16:16:24 is it the Upload module that tests the relays?
16:16:33 it doesn't have to be.
16:16:38 hrm
16:16:52 as long as that information is fed somehow into the Voting module?
16:16:57 but I bet it might be efficient if it does an initial "does it work at all" test before passing it on.
16:17:00 right
16:17:36 so here is a question
16:18:03 ah nvm
16:19:43 so, IMO this proposal is too big and vague to actually sit down and build as it stands. It needs to get its pieces prioritized and individually specified.
16:19:44 I think the proposal could gain from adding modules that interact with "Voting"; there are at least two now, the bwauth and "testing relays"
16:20:31 nickm: indeed
16:20:36 very much
16:21:03 So maybe we should talk about the basic idea, and then try to figure out how the specifics would go
16:21:58 yes
16:22:07 nickm: anything you want to start with?
16:22:47 not really. Maybe, what order would we want to implement this stuff in? Is it worth it? How hard would it be?
16:23:43 I think it's totally worth it; what I like is that the "Voting" module, which I think here is basically the dirauth creating a consensus from data fed from different places, can be hidden in some capacity
16:24:28 it would be good to be clear about which parts need to be trusted, and how much they need to be trusted.
16:24:39 yah that was my next question
16:24:40 Like, the upload module needs to be trusted not to drop descriptors in a hostile way.
16:25:29 it does yes
16:27:37 well I guess we need to break down each module into "what task it does", security considerations, and communication mode (IPC/network)
16:27:52 then we can start breaking down dirauth roles into those modules and implement
16:27:55 big task
16:28:22 as for what to start with, I would say Upload since it's the inbound "untrusted" data from the network
16:28:34 the rest we kind of control the flow
16:29:44 i presume we would not have the Voting module also serve the documents produced?
16:29:51 right
16:30:10 so we would kind of need three modules to start with: Upload, Voting, Serving
16:30:21 yeah. If we pull voting out of upload and serving...
16:30:28 where the BWAuth talks to Serving?
16:30:39 Voting
16:30:50 voting needs the info from bwauth.
16:31:01 I don't think I have a solid idea how voting gets that info
16:31:08 but then that would require that Voting has network access? which i thought we were trying to avoid?
16:31:26 well, voting has to get info somehow.
16:31:33 isis: in some way it will need network access if you want to isolate them
16:31:49 isis: both inbound data from Upload/bwauth/testing and then send them to Publishing
16:32:04 and outbound*
16:32:05 right, but i assumed that part was IPC on the same machine
16:32:23 whereas the BWAuths are (all?) on separate machines
16:32:29 I'm guessing at first we'll do that with IPC ^ and then we have the maybe luxury to move them to different machines
16:32:36 I think that IPC on the same machine might be a default implementation, but I'd like to have them on separate machines and IPs.
16:32:47 ah, i see
16:32:52 One easy way to prepare would be to separate the following:
16:32:59 IP:fingerprint for servers where you must upload descriptors.
16:33:08 so this proposal is less about taking parts "offline" and more about process isolation
16:33:17 IP:fingerprint for servers where you may download canonical consensuses that were just signed.
16:33:29 fingerprints for authorities that sign consensuses.
16:33:47 (Only authorities need to care about this) IP:fingerprint for where you should upload/download votes
16:34:07 isis: yeah that would be more accurate, dirauth roles into separate processes
16:34:20 This would only be a matter of splitting our authority list into a few separate lists.
16:35:15 yeah
16:35:38 well, the motivation for doing separate processes is to minimize the amount of code that holds trust here...
16:35:46 AND to make it easier to isolate parts by taking them offline-ish.
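nickm's list-splitting idea amounts to replacing the single authority list with role-specific lists. A sketch of what the split data might look like — every address and fingerprint below is a placeholder, and the helper at the end is only one way a client might use the signing list:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DirServer:
    address: str        # "IP:port"
    fingerprint: str    # identity key fingerprint

# The one flat authority list becomes several role-specific lists.
# All values are placeholders, not real authority data.
UPLOAD_SERVERS = [DirServer("192.0.2.10:80", "AAAA")]     # relays must upload descriptors here
CONSENSUS_SERVERS = [DirServer("192.0.2.20:80", "BBBB")]  # freshly signed consensuses served here
SIGNING_FINGERPRINTS = ["CCCC"]                           # keys whose signatures make a consensus valid
VOTE_EXCHANGE = [DirServer("192.0.2.30:80", "DDDD")]      # authority-only: upload/download votes

def consensus_is_acceptable(signer_fprs, required=SIGNING_FINGERPRINTS, quorum=1):
    """A consensus counts as valid if enough known signing keys endorsed it."""
    return sum(1 for f in signer_fprs if f in required) >= quorum
```

The point of the split is that only `SIGNING_FINGERPRINTS` carries trust in the consensus content; the upload, serving, and vote-exchange hosts can move, multiply, or be rebuilt without touching the keys clients verify against.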
16:36:05 okay, so "offline" still sort of makes sense, in that we can hide the Voting module of each DirAuth from the public network
16:36:14 like, I think it would be reasonable to have a voter in a separate VM that only interacts with the outside world through another VM that does its communications for it.
16:36:34 the voter needs some connection to the outside world, but that doesn't mean it needs to be opening connections directly.
16:36:35 maybe we should change the proposal to clarify that? when i read it, i thought "offline" meant, well, actually offline.
16:36:57 how would offline work?
16:37:08 so… basically we're going to build QubesOS into the DirAuths. awesome :)
16:37:09 #action clarify that we mean offline-ish
16:37:14 consensus created offline is superb magic :D
16:37:28 or rather, make dirauths able to exploite QubesOS
16:37:33 *exploit
16:39:39 i don't know how it would work… i guess in my head i was thinking: ORs → (network) → Upload → (ipc) → Voting → (ipc) → Serving → (network) → OPs
16:40:11 so, another way to do that is
16:40:16 and that the Voting module wouldn't have a public IP at all, or be in a VM or something
16:40:39 isis: yeah doable but that forces Upload and Voting onto the same machine
16:40:40 ORs -> network -> Upload -> rsync+ssh -> filesystem -> Voting -> filesystem -> rsync+ssh -> Serving -> OPs
16:40:49 Is that more offline or less?
16:41:43 clearly Upload and Voting have massive trust so that sounds ok to me for the case where we want them in separate data centers
16:41:44 both are good designs with slightly different threat models
16:42:15 Upload needs high availability and a fair amount of BW.
16:42:26 (i'm not trying to argue for the version of this i had in my head, just point out that i didn't understand very clearly what this would look like in implementation)
16:43:01 Voting needs far less BW, Publishing needs far more.
16:43:28 Voting doesn't need to accept or open connections from unknown parties ever
16:43:41 separate machines would make more sense, security-wise, i think. plus it would give DirAuth operators the ability to rebuild/move a machine when there's a threat or a break-in.
16:44:00 isis: yes!
16:45:22 i have a slight preference for avoiding more usage of rsync+ssh in the network… mostly just because it feels sort of incomplete to have core parts of the network rely on cronjobs being correctly configured, rather than having some solution which is built into tor itself.
16:45:48 maybe "incomplete" is not the right word…
16:45:54 feel free to s/rsync+ssh+cron/tor-rsync-ssh-cron-driver/ ;)
16:46:12 stitched together? :)
16:46:14 more like "let's stop duct-taping things together"
16:47:12 I agree everything that makes the network run should be packaged and tested and have a well-defined way that it's integrated.
16:47:21 I don't mind that way requiring ssh+rsync
16:47:37 I do agree that "and then you set up ssh+rsync and you had better do it right" should not be part of our instructions
16:47:50 also because it seems like a lot of that configuration is hidden/undocumented, and if we all got hit by a bus tomorrow, i would very much pity the poor bastards who have to figure out how to re-setup the DirAuths
16:47:59 yup
16:48:28 * nickm .oO ( one case of salmonella-tainted club mate ... )
16:48:45 * isis spits out her mate
16:48:52 wat
16:49:11 Trying to make it smaller than a bus
16:49:27 it's hypothetical
16:49:30 ah, phew
16:49:47 so what do we do with this proposal? accept it as a "meta proposal", that is, we need moar proposals breaking it down? :)
16:49:48 * isis stops pouring the mate down the kitchen sink
16:50:36 Could be.
16:50:40 i think breaking it down would make it more reasonable to implement sections of it?
16:50:53 but i am probably not the person to implement this so…
16:50:58 Even for implementing sections, we'll need to specify them all more closely
16:51:09 well, for the bridgeauth i would gladly help
16:51:22 oh we have a "Meta" status wow
16:51:39 i should write a proposal for redesigning the bridgeauth to actually do useful stuff
16:51:42 doesn't apply much to this one though
16:52:14 having the "test the bridge reachability" part as a separate machine would definitely be extremely useful
16:52:19 Meta seems useful.
16:52:23 #action call this proposal meta
16:52:48 cool
16:53:23 now let's pitch that idea to a sponsor :D
16:53:48 #action open a ticket for that dirauth addr/key splitting in 0.2.9
16:54:37 and I'm sure C (as much as I love it) is probably the best choice for those components unless we need crazy performance
16:54:57 are you missing a "not" in that sentence? :)
16:55:09 YES! haha
16:55:13 is probably ONT
16:55:14 NOT*
16:55:15 ...
16:55:32 oh! maybe a chance to write some more rust?
16:55:36 :D
16:55:44 or python
16:55:47 or go
16:55:53 or haskell
16:56:07 the voter part seems like it would be very well suited to a functional language
16:58:24 oh, or julia could be interesting because then we could use the builtin distributed computation abilities to assign remote jobs (e.g. "check reachability for X", "deduplicate these descriptors")
16:58:38 this is starting to seem more fun now
16:58:59 somebody needs to rewrite the beatles' Julia to be about the programming language
16:59:15 that somebody should probably know the Julia language better than I do.
16:59:33 #action cover the beatles' "julia" to be about the programming language
17:00:04 ok I'll go afk for food, thanks all :)
17:00:13 ok. ready to endmeeting?
17:00:15 going once
17:00:20 dgoulet: see you!
17:00:23 going twice
17:00:30 isis: o/
17:00:31 bbl
17:00:32 so mote it be
17:00:34 #endmeeting