21:04:51 #startmeeting prop#285 utf-8 in directory documents, 12 Feb 2018
21:04:51 Meeting started Mon Feb 12 21:04:51 2018 UTC. The chair is teor. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:04:51 Useful Commands: #action #agreed #help #info #idea #link #topic.
21:05:25 so, should we start with a summary?
21:05:42 that would be nice :)
21:06:03 Briefly: The proposal says that we should say that all directory documents should be UTF-8, for an appropriately strict definition of UTF-8.
21:06:25 It also proposes a migration path for us to get there without too much fingerprintability or breakage.
21:06:31 Teor: more to say?
21:08:00 that's ok for me. I read the prop
21:08:03 To avoid unicode versioning, we're going to define UTF-8 as any unicode scalar, including unallocated blocks
21:08:09 There are alternatives we could consider. We could require that they all be ASCII, or that they all be ASCII except for specifically designated lines
21:08:40 teor: oh! I thought we were specifically avoiding some ranges. Are those something other than unallocated blocks?
21:09:12 Yes
21:09:38 Unicode scalars use this definition: https://gitweb.torproject.org/torspec.git/tree/proposals/285-utf-8.txt#n70
21:09:58 What's in :"U+D800 through U+DFFF" ?
21:10:15 Looking it up, we should use the right names for these blocks
21:10:44 hi! sorry, got up to make tea and spaced on starting the meeting
21:11:02 thanks teor
21:11:32 nickm: UTF-16 surrogate code points, not legal Unicode by themselves
21:12:01 ok
21:12:39 what other open discussion topics are there on this proposal?
21:12:49 Excluding them is important for interoperability with other UTF-8 implementations, and to reject UTF-16
21:13:26 I wonder if this definition of "utf-8" is close to the one used by rust
21:13:30 doesn't basically every compiler all use the same definition of unicode at this point?
21:13:49 there was some document i was once pointed to
21:15:50 nickm: it is exactly the one used by Rust, I checked
21:16:00 here is rust's implementation: https://doc.rust-lang.org/src/core/str/mod.rs.html#1429
21:16:29 I don't see anything there that specifically prevents a BOM...
21:17:17 Ok, Rust's char is a Unicode Scalar Value, which is what we use in the proposal
21:17:29 (if BOM wasn't allowed, how would one write farsi)
21:17:54 #action use "Unicode Scalar" to describe the set of code points in section "Which UTF-8 exactly?"
21:18:19 isis: byte order is different to LTR
21:18:20 +1
21:18:42 oh, huh, didn't know that
21:19:03 For Rust: "A char actually represents a Unicode Scalar Value, as it can contain any Unicode code point except high-surrogate and low-surrogate code points."
21:20:00 #action Since we exclude 0x00 (nul), note this difference from Rust
21:20:23 #action Since we exclude byte order marks, note this difference from Rust
21:20:38 Ok, there are the two Rust differences
21:20:51 well, is excluding BoM right? We could say that they aren't excluded...
21:20:58 #action Reference Rust definition of char at https://doc.rust-lang.org/1.0.0/unicode/char/
21:22:09 what are our other points of uncertainty here?
21:23:22 I think we should allow but recommend against a BOM
21:23:23 "The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file as a transcoding artifact"
21:23:35 https://en.wikipedia.org/wiki/UTF-8#Byte_order_mark
21:23:47 Can we say that Tor implementations should never generate them?
21:23:55 must we alter our implementations to accept them?
21:24:15 We can say that they should never be generated
21:24:38 They are legal UTF-8, so rejecting them complicates the parser
21:25:34 well, accepting them in our routerparse.c code would require a change
21:25:37 But our current specs would reject them anyway, because it's not legal to start a line with 0xFEFF
21:26:30 or 0xEF, 0xBB, 0xBF
21:27:09 I don't understand. Are you saying that our current dir-spec.txt excludes them? But I thought that they didn't count as part of the document they modified...
21:27:15 So I was wrong, our existing parser rejects them just fine
21:28:27 we'd have to make any rust parser reject them too
21:28:33 and specify that implementations must reject them
21:28:35 which is fine with me
21:29:17 using a BOM would complicate Stem and metrics-lib, because they prepend text to descriptors
21:29:38 and an internal-BOM is just forbidden?
21:29:48 No, but it's useless and confusing
21:29:54 ah yeah, that would break bridgedb too
21:29:58 "Byte order has no meaning in UTF-8"
21:30:21 The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."
21:30:38 bridgedb prepends header stuffs to networkstatus-bridges, because the bridgeauth doesn't produce headers and stem expects them to be there
21:30:43 great
21:30:47 then let's forbid it
21:32:10 Are there any other things we should clarify or changes we should make? I know earlier you were pointing out that many current documents are de facto _ascii only_, and not just utf-8 only.
21:33:02 #action reference RFC 3629 section 6 to forbid the BOM https://tools.ietf.org/html/rfc3629#section-6
21:33:27 i'm a little worried about the bridge documents, since the bridge infrastructure has a lot of python and unfortunately it's all python2
21:33:44 say more?
21:34:42 I don't see what this will break; though the existing bridge infrastructure could _accept_ some stuff it shouldn't.
21:35:04 like if bridgedb were to get a utf-8 bridge descriptor, i'm pretty sure it'd poop its pants
21:35:23 So, people are already using utf-8 in contactinfo
21:35:35 utf-8 is pretty well handled by tools to handle C strings
21:36:00 oh huh, i wonder if bridgedb just has unicode() objects being handed to it by stem
21:36:07 in that case nothing would break
21:36:12 nickm: as long as we ban nul
21:36:25 Yes. Also, python handles nul fine.
21:36:36 isis: or str objects full of utf-8
21:36:45 mostly you can treat utf-8 like ascii and nothing breaks:
21:37:11 UTF-8 never uses ascii to encode anything but ascii, and has other properties that generally make it so stuff like strstr and regexes work okay
21:38:01 our current specs require that everything is ASCII, except for contact and platform, which are "arbitrary"
21:38:18 yeah, and technically speaking they can't be "arbitrary"
21:38:20 and "info" or "string"
21:38:23 since NL or NUL would break them
21:38:57 #action update dir-spec.txt to define contact as excluding NUL and NL
21:39:18 #action update dir-spec.txt to define platform as ASCII?
21:40:20 We talked on tor-dev@ about potential confusion attacks, I think I have a solution
21:40:48 as in, human-confusion?
21:41:10 It mitigates human confusion, and also helps with machine encoding issues
21:41:15 oh?
21:41:15 We recommend that any descriptor lines that can contain non-ASCII are placed last in the document
21:41:37 hm.
21:41:40 And we precede them with a line that deliberately contains unicode
21:41:44 For example:
21:42:07 utf-8 ✓
21:42:23 This is what some websites do to detect browser encoding issues
21:42:27 hmmm
21:42:38 utf-8 ☺
21:42:39 how would this help?
21:42:45 except they use ?utf8=✓
21:43:34 For humans reading the descriptor in an editor, they would know that line and those following it are unicode
21:44:15 ..mmm or they are confused ...
21:44:18 If that line looks like three garbled bytes, they would know that the contact info might also be garbled by their editor
21:44:32 so would we have to explicitly forbid non-ascii in earlier lines?
21:44:52 nickm: no, I think we should recommend, and implement in Tor, but not forbid
21:45:03 otherwise, we break backwards-compatibility
21:45:08 would reordering break parsers that currently rely on ordering?
21:45:30 oh, probably. Do we support parsers that rely on ordering?
21:45:41 okay, so if we don't forbid it, people can't rely on it...
21:45:55 Also, if they rely on ordering, they probably won't cope well with UTF-8
21:46:22 nickm: we can forbid it after all tor versions upgrade?
21:47:04 I'm skeptical that this helps with confusion attacks....
21:47:12 what if we require all keywords to be ascii?
21:47:45 confusion in the keywords is the really dangerous thing for human-reading
21:47:47 We can do that, I think it would be helpful
21:47:52 iirc some of our documents currently specify an order to lines like "line A must come after line B, and only if line A is present"
21:48:18 yes, but it's not a total order
21:48:33 Using a standard utf-8 character in descriptors also allows parsers to discover broken encodings
21:48:40 nickm: aha, i see what you mean
21:49:13 #action require directory document keywords to be ASCII to avoid confusion attacks
21:49:56 do we want to try to forbid unicode newlines? because another attack is:
21:50:13 contact fooplatform bar
21:50:27 hm.
21:50:56 at the very least, we should say that they're not a NL
21:51:01 so this is only an 'attack' to the extent that a human thinks it's one thing but a machine knows it's not. ..
21:51:11 we should indeed say that only NL counts as NL
21:51:15 yes, as long as the machine implements the spec correctly
21:51:36 #action specify that unicode newlines are not valid dir-spec NLs
21:53:03 we have 6 minutes left, any final thoughts or questions?
21:53:37 not from me
21:54:13 do you think we're good to move forward to implementation on this?
21:55:51 should we put the proposal into State: Accepted after it's edited?
21:56:02 fine with me, unless someone objects
21:56:16 yes, I think it's something we can start doing at the dirauth and relay/bridge level
21:56:38 #action email relay and bridge operators with non-UTF-8 in their contact info
21:56:54 ok, I need to run. going AFK for the evening. peace, all!
21:56:57 I think there is one, right?
21:57:58 ok, I'll close the meeting on the hour. isis, can you update tor-dev@ ?
21:59:17 Thanks everyone
21:59:18 #endmeeting
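
The checks agreed in the #action items above (strict UTF-8 restricted to Unicode scalar values, no NUL, no BOM, ASCII keywords, and only NL counting as a line break) can be sketched roughly as follows. This is an illustrative Python sketch, not code from tor, Stem, or bridgedb; the function names `decode_directory_text`, `split_lines`, and `keyword_is_ascii` are hypothetical. It assumes the BOM is rejected only as a leading signature, since the meeting left the handling of an interior U+FEFF unclear.

```python
from typing import List

BOM = "\ufeff"  # U+FEFF, encoded in UTF-8 as the bytes EF BB BF


def decode_directory_text(data: bytes) -> str:
    """Decode a directory document under the restrictions discussed above."""
    # Python's strict UTF-8 decoder already rejects encoded surrogates
    # (U+D800 through U+DFFF), overlong forms, and values above U+10FFFF,
    # raising UnicodeDecodeError (a subclass of ValueError).
    text = data.decode("utf-8", errors="strict")
    if "\x00" in text:
        raise ValueError("NUL is not allowed in directory documents")
    # Assumption: only a leading BOM ("signature") is rejected here.
    if text.startswith(BOM):
        raise ValueError("byte order mark is not allowed")
    return text


def split_lines(text: str) -> List[str]:
    """Split on NL (U+000A) only.

    str.splitlines() would also break on Unicode line separators such as
    U+2028, which is the confusion described in the log.
    """
    return text.split("\n")


def keyword_is_ascii(line: str) -> bool:
    """Check that the keyword (the first space-separated token) is ASCII."""
    keyword = line.split(" ", 1)[0]
    return keyword != "" and keyword.isascii()
```

For example, `split_lines("contact foo\u2028platform bar")` returns a single line, whereas `str.splitlines()` on the same string returns two, so a parser built on `splitlines()` would see a `platform` line that a spec-conforming parser does not.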