21:04:51 <teor> #startmeeting prop#285 utf-8 in directory documents, 12 Feb 2018
21:04:51 <MeetBot> Meeting started Mon Feb 12 21:04:51 2018 UTC.  The chair is teor. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:04:51 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
21:05:25 <nickm> so, should we start with a summary?
21:05:42 <Samdney> that would be nice :)
21:06:03 <nickm> Briefly: The proposal says that we should say that all directory documents should be UTF-8, for an appropriately strict definition of UTF-8.
21:06:25 <nickm> It also proposes a migration path for us to get there without too much fingerprintability or breakage.
21:06:31 <nickm> Teor: more to say?
21:08:00 <Samdney> that's ok for me. I read the prop
21:08:03 <teor> To avoid unicode versioning, we're going to define UTF-8 as any unicode scalar, including unallocated blocks
21:08:09 <nickm> There are alternatives we could consider.  We could require that they all be ASCII, or that they all be ASCII except for specifically designated lines
21:08:40 <nickm> teor: oh! I thought we were specifically avoiding some ranges.  Are those something other than unallocated blocks?
21:09:12 <teor> Yes
21:09:38 <teor> Unicode scalars use this definition:  https://gitweb.torproject.org/torspec.git/tree/proposals/285-utf-8.txt#n70
21:09:58 <nickm> What's in "U+D800 through U+DFFF"?
21:10:15 <teor> Looking it up, we should use the right names for these blocks
21:10:44 <isis> hi! sorry, got up to make tea and spaced on starting the meeting
21:11:02 <isis> thanks teor
21:11:32 <teor> nickm: UTF-16 surrogate code points, not legal Unicode by themselves
21:12:01 <nickm> ok
21:12:39 <nickm> what other open discussion topics are there on this proposal?
21:12:49 <teor> Excluding them is important for interoperability with other UTF-8 implementations, and to reject UTF-16
21:13:26 <nickm> I wonder if this definition of "utf-8" is close to the one used by rust
21:13:30 <isis> doesn't basically every compiler use the same definition of unicode at this point?
21:13:49 <isis> there was some document i was once pointed to
21:15:50 <teor> nickm: it is exactly the one used by Rust, I checked
21:16:00 <nickm> here is rust's implementation: https://doc.rust-lang.org/src/core/str/mod.rs.html#1429
21:16:29 <nickm> I don't see anything there that specifically prevents a BOM...
21:17:17 <teor> Ok, Rust's char is a Unicode Scalar Value, which is what we use in the proposal
21:17:29 <isis> (if BOM wasn't allowed, how would one write farsi)
21:17:54 <teor> #action use "Unicode Scalar" to describe the set of code points in section "Which UTF-8 exactly?"
21:18:19 <teor> isis: byte order is different from text direction (LTR vs RTL)
21:18:20 <nickm> +1
21:18:42 <isis> oh, huh, didn't know that
21:19:03 <teor> For Rust: "A char actually represents a Unicode Scalar Value, as it can contain any Unicode code point except high-surrogate and low-surrogate code points."
21:20:00 <teor> #action Since we exclude 0x00 (nul), note this difference from Rust
21:20:23 <teor> #action Since we exclude byte order marks, note this difference from Rust
21:20:38 <teor> Ok, there are the two Rust differences
21:20:51 <nickm> well, is excluding BoM right?  We could say that they aren't excluded...
21:20:58 <teor> #action Reference Rust definition of char at https://doc.rust-lang.org/1.0.0/unicode/char/
21:22:09 <nickm> what are our other points of uncertainty here?
21:23:22 <teor> I think we should allow but recommend against a BOM
21:23:23 <teor> "The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file as a transcoding artifact"
21:23:35 <teor> https://en.wikipedia.org/wiki/UTF-8#Byte_order_mark
21:23:47 <nickm> Can we say that Tor implementations should never generate them?
21:23:55 <nickm> must we alter our implementations to accept them?
21:24:15 <teor> We can say that they should never be generated
21:24:38 <teor> They are legal UTF-8, so rejecting them complicates the parser
21:25:34 <nickm> well, accepting them in our routerparse.c code would require a change
21:25:37 <teor> But our current specs would reject them anyway, because it's not legal to start a line with U+FEFF
21:26:30 <teor> or 0xEF, 0xBB, 0xBF
21:27:09 <nickm> I don't understand. Are you saying that our current dir-spec.txt excludes them?  But I thought that they didn't count as part of the document they modified...
21:27:15 <teor> So I was wrong, our existing parser rejects them just fine
21:28:27 <nickm> we'd have to make any rust parser reject them too
21:28:33 <nickm> and specify that implementations must reject them
21:28:35 <nickm> which is fine with me
21:29:17 <teor> using a BOM would complicate Stem and metrics-lib, because they prepend text to descriptors
21:29:38 <nickm> and an internal-BOM is just forbidden?
21:29:48 <teor> No, but it's useless and confusing
21:29:54 <isis> ah yeah, that would break bridgedb too
21:29:58 <teor> "Byte order has no meaning in UTF-8"
21:30:21 <teor> The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."
21:30:38 <isis> bridgedb prepends header stuffs to networkstatus-bridges, because the bridgeauth doesn't produce headers and stem expects them to be there
21:30:43 <nickm> great
21:30:47 <nickm> then let's forbid it
21:32:10 <nickm> Are there any other things we should clarify or changes we should make?  I know earlier you were pointing out that many current documents are de facto _ascii only_, and not just utf-8 only.
21:33:02 <teor> #action reference RFC 3629 section 6 to forbid the BOM https://tools.ietf.org/html/rfc3629#section-6
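The checks agreed so far (Unicode scalar values only, no NUL, no byte order mark) can be sketched as a small validator. This is a minimal illustration, not Tor's parser: the function name `is_valid_dir_utf8` is hypothetical, and rejecting U+FEFF anywhere in the document (rather than only as a leading signature) is an assumption about how the rule would be applied.

```python
def is_valid_dir_utf8(data: bytes) -> bool:
    """Sketch of the strict UTF-8 checks discussed above (hypothetical helper)."""
    try:
        # Python's strict decoder rejects overlong forms, encoded surrogates
        # (U+D800 through U+DFFF), and values above U+10FFFF.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # NUL and U+FEFF are legal UTF-8, but this sketch excludes them.
    # Assumption: U+FEFF is rejected anywhere, not only at the start.
    return "\x00" not in text and "\ufeff" not in text
```

A document that begins with the bytes 0xEF, 0xBB, 0xBF decodes successfully but fails the second check, which matches the decision to forbid the BOM.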
21:33:27 <isis> i'm a little worried about the bridge documents, since the bridge infrastructure has a lot of python and unfortunately it's all python2
21:33:44 <nickm> say more?
21:34:42 <nickm> I don't see what this will break; though the existing bridge infrastructure could _accept_ some stuff it shouldn't.
21:35:04 <isis> like if bridgedb were to get a utf-8 bridge descriptor, i'm pretty sure it'd poop its pants
21:35:23 <nickm> So, people are already using utf-8 in contactinfo
21:35:35 <nickm> utf-8 is pretty well handled by tools to handle C strings
21:36:00 <isis> oh huh, i wonder if bridgedb just has unicode() objects being handed to it by stem
21:36:07 <isis> in that case nothing would break
21:36:12 <teor> nickm: as long as we ban nul
21:36:25 <nickm> Yes.  Also, python handles nul fine.
21:36:36 <nickm> isis: or str objects full of utf-8
21:36:45 <nickm> mostly you can treat utf-8 like ascii and nothing breaks:
21:37:11 <nickm> UTF-8 never uses ascii to encode anything but ascii, and has other properties that generally make it so stuff like strstr and regexes work okay
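The property nickm describes can be checked directly: every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so a byte-level search for an ASCII delimiter cannot match inside a non-ASCII character. A short sketch, with a hypothetical helper name and made-up sample text:

```python
def non_ascii_bytes_are_high(text: str) -> bool:
    """True if every non-ASCII character encodes to bytes >= 0x80 only."""
    return all(
        byte >= 0x80
        for ch in text if ord(ch) >= 0x80
        for byte in ch.encode("utf-8")
    )

# Made-up contact line containing two-byte and three-byte characters.
sample = "contact Zo\u00eb \u2713 <zoe@example.com>"
encoded = sample.encode("utf-8")

# The ASCII "<" is found at a real "<", never inside a multi-byte sequence.
angle_offset = encoded.find(b"<")
```

This is why byte-oriented tools such as strstr keep working on ASCII keywords and delimiters, provided NUL is excluded.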
21:38:01 <teor> our current specs require that everything is ASCII, except for contact and platform, which are "arbitrary"
21:38:18 <nickm> yeah, and technically speaking they can't be "arbitrary"
21:38:20 <teor> and "info" or "string"
21:38:23 <nickm> since NL or NUL would break them
21:38:57 <teor> #action update dir-spec.txt to define contact as excluding NUL and NL
21:39:18 <teor> #action update dir-spec.txt to define platform as ASCII?
21:40:20 <teor> We talked on tor-dev@ about potential confusion attacks, I think I have a solution
21:40:48 <nickm> as in, human-confusion?
21:41:10 <teor> It mitigates human confusion, and also helps with machine encoding issues
21:41:15 <nickm> oh?
21:41:15 <teor> We recommend that any descriptor lines that can contain non-ASCII are placed last in the document
21:41:37 <nickm> hm.
21:41:40 <teor> And we precede them with a line that deliberately contains unicode
21:41:44 <teor> For example:
21:42:07 <teor> utf-8 ✓
21:42:23 <teor> This is what some websites do to detect browser encoding issues
21:42:27 <nickm> hmmm
21:42:38 <isis> utf-8 ☺
21:42:39 <nickm> how would this help?
21:42:45 <teor> except they use ?utf8=✓
21:43:34 <teor> For humans reading the descriptor in an editor, they would know that line and those following it are unicode
21:44:15 <Samdney> ..mmm or they are confused ...
21:44:18 <teor> If that line looks like three garbled bytes, they would know that the contact info might also be garbled by their editor
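The "three garbled bytes" can be made concrete. The sketch below encodes the suggested marker line and then decodes it with the wrong codec, as a misconfigured editor might. The marker line is only the idea floated in this meeting, and the choice of cp1252 as the wrong codec and the helper name `marker_survived` are assumptions for illustration.

```python
MARKER_LINE = "utf-8 \u2713"  # the check mark, U+2713

encoded = MARKER_LINE.encode("utf-8")
# The check mark occupies three bytes: 0xE2 0x9C 0x93.
check_bytes = encoded[len(b"utf-8 "):]

# A viewer that assumes a single-byte codec shows three unrelated characters.
garbled = encoded.decode("cp1252")

def marker_survived(line: bytes) -> bool:
    """Hypothetical parser-side check that the marker line is intact."""
    return line == MARKER_LINE.encode("utf-8")
```

A parser could use the same comparison to detect a document that was transcoded somewhere along the way.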
21:44:32 <nickm> so would we have to explicitly forbid non-ascii in earlier lines?
21:44:52 <teor> nickm: no, I think we should recommend, and implement in Tor, but not forbid
21:45:03 <teor> otherwise, we break backwards-compatibility
21:45:08 <isis> would reordering break parsers that currently rely on ordering?
21:45:30 <teor> oh, probably. Do we support parsers that rely on ordering?
21:45:41 <nickm> okay, so if we don't forbid it, people can't rely on it...
21:45:55 <teor> Also, if they rely on ordering, they probably won't cope well with UTF-8
21:46:22 <teor> nickm: we can forbid it after all tor versions upgrade?
21:47:04 <nickm> I'm skeptical that this helps with confusion attacks....
21:47:12 <nickm> what if we require all keywords to be ascii?
21:47:45 <nickm> confusion in the keywords is the really dangerous thing for human-reading
21:47:47 <teor> We can do that, I think it would be helpful
21:47:52 <isis> iirc some of our documents currently specify an order to lines like "line A must come after line B, and only if line A is present"
21:48:18 <nickm> yes, but it's not a total order
21:48:33 <teor> Using a standard utf-8 character in descriptors also allows parsers to discover broken encodings
21:48:40 <isis> nickm: aha, i see what you mean
21:49:13 <teor> #action require directory document keywords to be ASCII to avoid confusion attacks
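A sketch of the keyword check, assuming keywords consist only of ASCII letters, digits, and hyphens and are separated from their arguments by a space. The helper name is hypothetical. The second sample line uses a Cyrillic letter that looks like a Latin "p", which is the kind of human confusion discussed above.

```python
import re

# Assumption: a keyword is one or more ASCII letters, digits, or hyphens.
KEYWORD_RE = re.compile(r"[A-Za-z0-9-]+")

def keyword_is_ascii(line: str) -> bool:
    """Check only the keyword (the first space-separated token) of a line."""
    keyword = line.split(" ", 1)[0]
    return KEYWORD_RE.fullmatch(keyword) is not None

genuine = "platform Tor 0.3.2.9 on Linux"
lookalike = "\u0440latform Tor 0.3.2.9 on Linux"  # starts with Cyrillic U+0440
```

Only the keyword is restricted here; the rest of the line may still carry non-ASCII text where the spec allows it.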
21:49:56 <teor> do we want to try to forbid unicode newlines? because another attack is:
21:50:13 <teor> contact foo<unicode newline>platform bar
21:50:27 <nickm> hm.
21:50:56 <teor> at the very least, we should say that they're not a NL
21:51:01 <nickm> so this is only an 'attack' to the extent that a human thinks it's one thing but a machine knows it's not. ..
21:51:11 <nickm> we should indeed say that only NL counts as NL
21:51:15 <teor> yes, as long as the machine implements the spec correctly
21:51:36 <teor> #action specify that unicode newlines are not valid dir-spec NLs
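The difference matters in practice because some standard library helpers treat Unicode line separators as line breaks. In Python, `str.splitlines()` splits on U+2028 among others, while splitting on "\n" alone does not. A minimal sketch, with a hypothetical helper name and a made-up two-field example:

```python
def split_dir_lines(document: str) -> list:
    """Split on NL (U+000A) only; other Unicode newlines stay in the line."""
    lines = document.split("\n")
    # Drop the empty string produced by a trailing NL.
    if lines and lines[-1] == "":
        lines.pop()
    return lines

# U+2028 (LINE SEPARATOR) hidden inside a contact line.
doc = "contact foo\u2028platform bar\n"

strict_lines = split_dir_lines(doc)   # one line, separator preserved
loose_lines = doc.splitlines()        # two lines: the attack succeeds
```

A parser built on the loose behaviour would see a separate "platform bar" line that the spec says does not exist.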
21:53:03 <teor> we have 6 minutes left, any final thoughts or questions?
21:53:37 <nickm> not from me
21:54:13 <nickm> do you think we're good to move forward to implementation on this?
21:55:51 <isis> should we put the proposal into State: Accepted after it's edited?
21:56:02 <nickm> fine with me, unless someone objects
21:56:16 <teor> yes, I think it's something we can start doing at the dirauth and relay/bridge level
21:56:38 <teor> #action email relay and bridge operators with non-UTF-8 in their contact info
21:56:54 <nickm> ok, I need to run. going AFK for the evening.  peace, all!
21:56:57 <teor> I think there is one, right?
21:57:58 <teor> ok, I'll close the meeting on the hour. isis, can you update tor-dev@ ?
21:59:17 <teor> Thanks everyone
21:59:18 <teor> #endmeeting