21:04:51 <teor> #startmeeting prop#285 utf-8 in directory documents, 12 Feb 2018
21:04:51 <MeetBot> Meeting started Mon Feb 12 21:04:51 2018 UTC. The chair is teor. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:04:51 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
21:05:25 <nickm> so, should we start with a summary?
21:05:42 <Samdney> that would be nice :)
21:06:03 <nickm> Briefly: The proposal says that we should say that all directory documents should be UTF-8, for an appropriately strict definition of UTF-8.
21:06:25 <nickm> It also proposes a migration path for us to get there without too much fingerprintability or breakage.
21:06:31 <nickm> Teor: more to say?
21:08:00 <Samdney> that's ok for me. I read the prop
21:08:03 <teor> To avoid unicode versioning, we're going to define UTF-8 as any unicode scalar, including unallocated blocks
21:08:09 <nickm> There are alternatives we could consider. We could require that they all be ASCII, or that they all be ASCII except for specifically designated lines
21:08:40 <nickm> teor: oh! I thought we were specifically avoiding some ranges. Are those something other than unallocated blocks?
21:09:12 <teor> Yes
21:09:38 <teor> Unicode scalars use this definition: https://gitweb.torproject.org/torspec.git/tree/proposals/285-utf-8.txt#n70
21:09:58 <nickm> What's in :"U+D800 through U+DFFF" ?
21:10:15 <teor> Looking it up, we should use the right names for these blocks
21:10:44 <isis> hi! sorry, got up to make tea and spaced on starting the meeting
21:11:02 <isis> thanks teor
21:11:32 <teor> nickm: UTF-16 surrogate code points, not legal Unicode by themselves
21:12:01 <nickm> ok
21:12:39 <nickm> what other open discussion topics are there on this proposal?
21:12:49 <teor> Excluding them is important for interoperability with other UTF-8 implementations, and to reject UTF-16
21:13:26 <nickm> I wonder if this definition of "utf-8" is close to the one used by rust
21:13:30 <isis> doesn't basically every compiler all use the same definition of unicode at this point?
21:13:49 <isis> there was some document i was once pointed to
21:15:50 <teor> nickm: it is exactly the one used by Rust, I checked
21:16:00 <nickm> here is rust's implementation: https://doc.rust-lang.org/src/core/str/mod.rs.html#1429
21:16:29 <nickm> I don't see anything there that specifically prevents a BOM...
21:17:17 <teor> Ok, Rust's char is a Unicode Scalar Value, which is what we use in the proposal
21:17:29 <isis> (if BOM wasn't allowed, how would one write farsi)
21:17:54 <teor> #action use "Unicode Scalar" to describe the set of code points in section "Which UTF-8 exactly?"
21:18:19 <teor> isis: byte order is different to LTR
21:18:20 <nickm> +1
21:18:42 <isis> oh, huh, didn't know that
21:19:03 <teor> For Rust: "A char actually represents a Unicode Scalar Value, as it can contain any Unicode code point except high-surrogate and low-surrogate code points."
21:20:00 <teor> #action Since we exclude 0x00 (nul), note this difference from Rust
21:20:23 <teor> #action Since we exclude byte order marks, note this difference from Rust
21:20:38 <teor> Ok, there are the two Rust differences
21:20:51 <nickm> well, is excluding BoM right? We could say that they aren't excluded...
21:20:58 <teor> #action Reference Rust definition of char at https://doc.rust-lang.org/1.0.0/unicode/char/
21:22:09 <nickm> what are our other points of uncertainty here?
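The definition discussed so far (any Unicode scalar value, so no surrogate code points, plus the extra rule that NUL is excluded) could be checked along the following lines. This is only a minimal sketch, not the proposal's reference implementation: the function names `decode_dir_text` and `is_valid_dir_text` are hypothetical, and it relies on Python 3's strict UTF-8 decoder, which rejects encoded surrogates, overlong forms, and values above U+10FFFF. The BOM question is left out here because it was still open at this point in the meeting.

```python
def decode_dir_text(data: bytes) -> str:
    """Decode bytes as strict UTF-8 and reject NUL (hypothetical helper)."""
    # Python 3's strict decoder refuses encoded surrogates (U+D800..U+DFFF),
    # overlong encodings, and code points above U+10FFFF. It does not care
    # whether a code point is allocated, so no Unicode version is involved.
    text = data.decode("utf-8", errors="strict")
    # Extra rule from the meeting: NUL is excluded, unlike in Rust's char.
    if "\x00" in text:
        raise ValueError("NUL is not allowed in directory documents")
    return text


def is_valid_dir_text(data: bytes) -> bool:
    """Return True if data passes the checks in decode_dir_text."""
    try:
        decode_dir_text(data)
    except ValueError:  # UnicodeDecodeError is a subclass of ValueError
        return False
    return True
```

For example, `is_valid_dir_text(b"\xed\xa0\x80")` is false because those bytes are the UTF-8-style encoding of the surrogate U+D800, while an unallocated code point such as U+0378 is accepted.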
21:23:22 <teor> I think we should allow but recommend against a BOM
21:23:23 <teor> "The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file as a transcoding artifact"
21:23:35 <teor> https://en.wikipedia.org/wiki/UTF-8#Byte_order_mark
21:23:47 <nickm> Can we say that Tor implementations should never generate them?
21:23:55 <nickm> must we alter our implementations to accept them?
21:24:15 <teor> We can say that they should never be generated
21:24:38 <teor> They are legal UTF-8, so rejecting them complicates the parser
21:25:34 <nickm> well, accepting them in our routerparse.c code would require a change
21:25:37 <teor> But our current specs would reject them anyway, because it's not legal to start a line with 0xFEFF
21:26:30 <teor> or 0xEF, 0xBB, 0xBF
21:27:09 <nickm> I don't understand. Are you saying that our current dir-spec.txt excludes them? But I thought that they didn't count as part of the document they modified...
21:27:15 <teor> So I was wrong, our existing parser rejects them just fine
21:28:27 <nickm> we'd have to make any rust parser reject them too
21:28:33 <nickm> and specify that implementations must reject them
21:28:35 <nickm> which is fine with me
21:29:17 <teor> using a BOM would complicate Stem and metrics-lib, because they prepend text to descriptors
21:29:38 <nickm> and an internal-BOM is just forbidden?
21:29:48 <teor> No, but it's useless and confusing
21:29:54 <isis> ah yeah, that would break bridgedb too
21:29:58 <teor> "Byte order has no meaning in UTF-8"
21:30:21 <teor> The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."
21:30:38 <isis> bridgedb prepends header stuffs to networkstatus-bridges, because the bridgeauth doesn't produce headers and stem expects them to be there
21:30:43 <nickm> great
21:30:47 <nickm> then let's forbid it
21:32:10 <nickm> Are there any other things we should clarify or changes we should make? I know earlier you were pointing out that many current documents are de facto _ascii only_, and not just utf-8 only.
21:33:02 <teor> #action reference RFC 3629 section 6 to forbid the BOM https://tools.ietf.org/html/rfc3629#section-6
21:33:27 <isis> i'm a little worried about the bridge documents, since the bridge infrastructure has a lot of python and unfortunately it's all python2
21:33:44 <nickm> say more?
21:34:42 <nickm> I don't see what this will break; though the existing bridge infrastructure could _accept_ some stuff it shouldn't.
21:35:04 <isis> like if bridgedb were to get a utf-8 bridge descriptor, i'm pretty sure it'd poop its pants
21:35:23 <nickm> So, people are already using utf-8 in contactinfo
21:35:35 <nickm> utf-8 is pretty well handled by tools to handle C strings
21:36:00 <isis> oh huh, i wonder if bridgedb just has unicode() objects being handed to it by stem
21:36:07 <isis> in that case nothing would break
21:36:12 <teor> nickm: as long as we ban nul
21:36:25 <nickm> Yes. Also, python handles nul fine.
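The decision to forbid a BOM used as a signature (RFC 3629 section 6) amounts to rejecting a document whose first three bytes are 0xEF, 0xBB, 0xBF. The sketch below illustrates that check under stated assumptions: `reject_leading_bom` is a hypothetical name, and it only looks at the start of the document, since an internal U+FEFF was described in the meeting as useless but not forbidden.

```python
import codecs

# codecs.BOM_UTF8 is the three-byte UTF-8 encoding of U+FEFF.
UTF8_BOM = codecs.BOM_UTF8


def reject_leading_bom(doc: bytes) -> bytes:
    """Return doc unchanged, or raise if it starts with a UTF-8 BOM.

    Hypothetical helper: checks only the start of the document.
    """
    if doc.startswith(UTF8_BOM):
        raise ValueError("directory documents must not start with a BOM")
    return doc
```

Note that Python's plain "utf-8" codec keeps a leading U+FEFF as an ordinary character, and the "utf-8-sig" codec silently strips it, so neither rejects a BOM on its own; an explicit check like this one is needed.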
21:36:36 <nickm> isis: or str objects full of utf-8
21:36:45 <nickm> mostly you can treat utf-8 like ascii and nothing breaks:
21:37:11 <nickm> UTF-8 never uses ascii to encode anything but ascii, and has other properties that generally make it so stuff like strstr and regexes work okay
21:38:01 <teor> our current specs require that everything is ASCII, except for contact and platform, which are "arbitrary"
21:38:18 <nickm> yeah, and technically speaking they can't be "arbitrary"
21:38:20 <teor> and "info" or "string"
21:38:23 <nickm> since NL or NUL would break them
21:38:57 <teor> #action update dir-spec.txt to define contact as excluding NUL and NL
21:39:18 <teor> #action update dir-spec.txt to define platform as ASCII?
21:40:20 <teor> We talked on tor-dev@ about potential confusion attacks, I think I have a solution
21:40:48 <nickm> as in, human-confusion?
21:41:10 <teor> It mitigates human confusion, and also helps with machine encoding issues
21:41:15 <nickm> oh?
21:41:15 <teor> We recommend that any descriptor lines that can contain non-ASCII are placed last in the document
21:41:37 <nickm> hm.
21:41:40 <teor> And we precede them with a line that deliberately contains unicode
21:41:44 <teor> For example:
21:42:07 <teor> utf-8 ✓
21:42:23 <teor> This is what some websites do to detect browser encoding issues
21:42:27 <nickm> hmmm
21:42:38 <isis> utf-8 ☺
21:42:39 <nickm> how would this help?
21:42:45 <teor> except they use ?utf8=✓
21:43:34 <teor> For humans reading the descriptor in an editor, they would know that line and those following it are unicode
21:44:15 <Samdney> ..mmm or they are confused ...
21:44:18 <teor> If that line looks like three garbled bytes, they would know that the contact info might also be garbled by their editor
21:44:32 <nickm> so would we have to explicitly forbid non-ascii in earlier lines?
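To make the marker-line idea concrete: U+2713 (✓) encodes to three non-ASCII bytes in UTF-8, so a reader that assumes the wrong encoding sees three garbled characters instead of a check mark. The sketch below is illustrative only. The line "utf-8 ✓" is teor's example from the meeting, not a specified keyword, and `marker_survives` is a hypothetical helper name.

```python
MARKER = "utf-8 \u2713"            # the example line "utf-8 ✓"
MARKER_BYTES = MARKER.encode("utf-8")


def marker_survives(raw_line: bytes, assumed_encoding: str) -> bool:
    """Return True if decoding raw_line with assumed_encoding yields MARKER."""
    try:
        return raw_line.decode(assumed_encoding) == MARKER
    except UnicodeDecodeError:
        return False


# The check mark is three bytes, all >= 0x80, so none of them can be
# mistaken for an ASCII character such as a space or a newline.
CHECK_BYTES = "\u2713".encode("utf-8")
assert len(CHECK_BYTES) == 3 and all(b >= 0x80 for b in CHECK_BYTES)

# What an editor assuming Windows-1252 would display instead of the check mark.
GARBLED = MARKER_BYTES.decode("cp1252")
```

Here `marker_survives(MARKER_BYTES, "utf-8")` is true, while `marker_survives(MARKER_BYTES, "cp1252")` is false because the line decodes to "utf-8 âœ“".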
21:44:52 <teor> nickm: no, I think we should recommend, and implement in Tor, but not forbid
21:45:03 <teor> otherwise, we break backwards-compatibility
21:45:08 <isis> would reordering break parsers that currently rely on ordering?
21:45:30 <teor> oh, probably. Do we support parsers that rely on ordering?
21:45:41 <nickm> okay, so if we don't forbid it, people can't rely on it...
21:45:55 <teor> Also, if they rely on ordering, they probably won't cope well with UTF-8
21:46:22 <teor> nickm: we can forbid it after all tor versions upgrade?
21:47:04 <nickm> I'm skeptical that this helps with confusion attacks....
21:47:12 <nickm> what if we require all keywords to be ascii?
21:47:45 <nickm> confusion in the keywords is the really dangerous thing for human-reading
21:47:47 <teor> We can do that, I think it would be helpful
21:47:52 <isis> iirc some of our documents currently specify an order to lines like "line A must come after line B, and only if line A is present"
21:48:18 <nickm> yes, but it's not a total order
21:48:33 <teor> Using a standard utf-8 character in descriptors also allows parsers to discover broken encodings
21:48:40 <isis> nickm: aha, i see what you mean
21:49:13 <teor> #action require directory document keywords to be ASCII to avoid confusion attacks
21:49:56 <teor> do we want to try to forbid unicode newlines? because another attack is:
21:50:13 <teor> contact foo<unicode newline>platform bar
21:50:27 <nickm> hm.
21:50:56 <teor> at the very least, we should say that they're not a NL
21:51:01 <nickm> so this is only an 'attack' to the extent that a human thinks it's one thing but a machine knows it's not. ..
21:51:11 <nickm> we should indeed say that only NL counts as NL
21:51:15 <teor> yes, as long as the machine implements the spec correctly
21:51:36 <teor> #action specify that unicode newlines are not valid dir-spec NLs
21:53:03 <teor> we have 6 minutes left, any final thoughts or questions?
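The two rules agreed just above (keywords must be ASCII, and only NL counts as a line break) can be illustrated with a small parser sketch. Everything here is an assumption for illustration: `split_dir_lines` and `keyword_of` are hypothetical names, and the keyword pattern (ASCII letters, digits, and hyphen) is modeled on the dir-spec keyword grammar rather than quoted from it. The point is that splitting on the single byte 0x0A keeps "contact foo<unicode newline>platform bar" as one line, whereas Python's `str.splitlines()` would break it in two.

```python
import re
from typing import List

# Assumption: keywords consist of ASCII letters, digits, and '-'.
KEYWORD_RE = re.compile(rb"[A-Za-z0-9-]+")


def split_dir_lines(doc: bytes) -> List[bytes]:
    """Split a document on NL (0x0A) only; Unicode newlines stay in the line."""
    lines = doc.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final NL
    return lines


def keyword_of(line: bytes) -> bytes:
    """Return the first space-separated token, requiring it to be ASCII."""
    keyword = line.split(b" ", 1)[0]
    if not KEYWORD_RE.fullmatch(keyword):
        raise ValueError("keyword is not ASCII: %r" % keyword)
    return keyword
```

With `doc = "contact foo\u2028platform bar\n".encode("utf-8")`, `split_dir_lines(doc)` returns a single line whose keyword is `b"contact"`, so the fake "platform" line is just part of the contact value. A keyword containing a Cyrillic look-alike letter is rejected by `keyword_of`.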
21:53:37 <nickm> not from me
21:54:13 <nickm> do you think we're good to move forward to implementation on this?
21:55:51 <isis> should we put the proposal into State: Accepted after it's edited?
21:56:02 <nickm> fine with me, unless someone objects
21:56:16 <teor> yes, I think it's something we can start doing at the dirauth and relay/bridge level
21:56:38 <teor> #action email relay and bridge operators with non-UTF-8 in their contact info
21:56:54 <nickm> ok, I need to run. going AFK for the evening. peace, all!
21:56:57 <teor> I think there is one, right?
21:57:58 <teor> ok, I'll close the meeting on the hour. isis, can you update tor-dev@ ?
21:59:17 <teor> Thanks everyone
21:59:18 <teor> #endmeeting