- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Mon, 24 Oct 2011 18:31:42 +0900
- To: "public-iri@w3.org" <public-iri@w3.org>
- CC: Mark Davis <mark@macchiato.com>
The message below was posted to the respective forum on the Unicode Web site (http://www.unicode.org/forum/viewforum.php?f=34). I'm sending it here because it's highly relevant for our work. It is also relevant to IDNA and EAI. These are my comments on "Extension of UBA for improved display of URL/IRIs", available from http://www.unicode.org/review/pri185/, as modified on Sept. 22 (presumably 2011). I have commented mostly on procedural issues in the last round, but have taken a deeper look at technical and editorial issues this time, too. I wrote part of this on a very long flight, so some references are missing; if you need some additional pointers, don't hesitate to ask. Procedural Issues ================= Opening the ability to comment via the Unicode Forum is some progress on the previous way of commenting via a Web form (which was essentially a black hole for outsiders). However, it is still very much a one-way street. This makes it difficult to involve affected communities such as the IETF Working Groups (WGs, or former WG) on Internationalized Domain Names, Email Address Internationalization, and Internationalized Resource Identifiers, the relevant groups at W3C, and in many ways most important, the actually affected users that use bidi IRIs. It also makes it difficult to find out how, and more importantly, why, comments have been addressed or not. It is still a far way away from how other organizations deal with public comments. In the W3C, providing a public list of all public comments and how they are addressed is standard practice. In the IETF, most of the discussions are held on public mailing list, and many WGs are using a public tracker (http://trac.tools.ietf.org/). Preexisting Specs and Parallel Work =================================== The IRI specification (RFC 3987, http://tools.ietf.org/html/rfc3987) as well as draft updates (http://tools.ietf.org/html/draft-ietf-iri-3987bis), and the specifications for Email Address Internationalization as well as their draft updates should be referenced at the start of the document. (There are no specs for file names as far as I know.) It's a good idea to point readers to more introductory material such as Richard Ishida's "idn-and-iri", but that's not enough for what is supposed to become (part of) a spec itself. RFC 3987 contains a section about the display of bidirectional IRIs (Section 4, http://tools.ietf.org/html/rfc3987#section-4.4). This should clearly be mentioned in the document. The section was written based on the following goals/assumptions: 1) That it would be desirable that bidi IRIs were displayed the same everywhere, both in places where they are identified as such (e.g. a browser's address/location bar) and in free text where no special processing could be applied to them. 2) That it was unfeasible to change the Unicode Bidirectional Algorithm (UBA) to deal with IRIs as a special case. The first assumption is shared by the current proposal; removing the second assumption is at the base of the current proposal. Now that changing (or extending) the UBA is on the table, we have to check what needs specifying, and where. My current take is that we have the following pieces: 1) Display of bidi IRIs once identified: UBA extension, with strong input from stakeholders in affected regions and from IRI WG. 2) Identifying IRIs in contexts: This would ideally be provided by the IETF. There is Appendix C of the URI spec (http://tools.ietf.org/html/rfc3986#appendix-C), Delimiting a URI in Context, and there was at least one attempt to do something in this direction (see http://tools.ietf.org/html/draft-yoneya-iri-recognition-00), but no wider interest and no pressure for standardization (the functionality seemed to work well where needed (e.g. email programs) and minor differences in implementation seemed to hurt nobody). So there's a rather large chance that this remains for the UTC to do, although with strong input from the IETF. 3) Restrictions on strong directionality mixing for components such as domain name labels: This is done for IDNA in RFC 5893 (http://tools.ietf.org/html/rfc5893) and is being updated and adapted based on the RFC 5893 effort for IRIs in the IRI WG. Input from the bidi experts in the UTC is greatly appreciated. We should make sure that we have got something like the above "pieces of the puzzle" right before we get too much into specific technical details. Document Target =============== There is talk about this being an experimental extension. Care should be given to be extremely clear what these two words mean, in particular because I don't know any other cases where this has been done in an Unicode context. Extension seems to mean "the bidi algorithm can be used with or without this". This is desirable from an implementer's perspective, but not from a security perspective. Experimental seems to mean "we aren't really sure yet whether this will fly, and whether we got the details right". It would be very good if this could be avoided by more careful deliberations and work up-front; the consequences of late changes for both security and implementers would be really bad. If this is an extension, I'd personally prefer this to be in a separate document rather than to be part of TR #9. Other Changes to the Bidi Algorithm =================================== With the exception of minor tweaks, the bidi algorithm stayed stable since almost 15 years. But in recent years, there has been increased activity with new ideas for modification, both in the bidi algorithm itself and on higher levels (see the HTML work initiated by Aharon Lanin). It looks like these changes are being added piecemeal without yet seeing a new horizon of stability (after their IUC talk on Tuesday morning, people from Microsoft said that their parenthesis detection solution solved 13% of reported bidi problems; that means there may easily be more fixes comming). But the bidi algorithm isn't an area where constant tinkering is advisable. It would therefore be very important that all these new initiatives are carefully checked against each other, and coordinated both in timing and in substance. It may be well advisable to wait with some of them so that many changes can be made 'in bulk' (the idea of an UBA 2.0), which will also help implementers. Readability and Self-Containedness of the Document ================================================== In order to gain valuable comments not only from total insiders, the document has to be much more accessible to potential commenters. This starts with the title and the start of the introduction, which explicitly should mention email addresses and filenames, because it is otherwise ignored by people interested in these items. The number of examples is extremely low (3). There are no examples of email addresses or filenames. There are no examples of non-generic (opaque syntax) URI schemes (e.g. mailto:,...). There are way too few examples to show what happens under different combinations of RTL and LTR components. There are no examples with realistic names (e.g. existing RTL top-level domains). There is a need for these to give people an everyday feel for the issue, while there is also a need to use abstract names (abc,...) to test usability when guessing is hard. [The IRI spec, RFC 3987, has 10 examples (see http://tools.ietf.org/html/rfc3987#section-4.4) just to explain a single solution to the problem.] All examples use the "uppercase is RTL" convention, which is good for outsiders, but doesn't show the potential end result for the people really affected. Parallel examples in Arabic and Hebrew are very important. [As an RFC (all US-ASCII), the IRI spec was not able to include Arabic or Hebrew, but we made sure we provided Arabic and Hebrew equivalents for the examples (see http://www.w3.org/International/iri-edit/BidiExamples.html) and referenced them from the spec. The 11th example has been added based on feedback. These examples are generated by a Ruby script, it should not be too difficult to change the script to produce examples for this spec.] Security ======== The document correctly notes that ambiguous displays of bidi IRIs,... can cause security problems. However, the document is wrong and/or misleading in stating and/or implying that the proposal will remove ambiguity and confusion, except potentially in the very long term (10 to 20 years). The current specification for the display of bidi IRIs (RFC 3987, Section 4) uses the current bidi algorithm applied in an LTR context. In current implementations, display in an RTL context may also happen. A new specification will introduce at least a third alternative. While it may help reduce tinkering by implementers, it still creates (at least) one more alternative, and this should be very, very clearly noted in the document. The document doesn't contain a security section, but it very clearly needs one. The IETF has an RFC on how to write good security sections. Terminology =========== The document uses 'fields' for e.g. individual domain name labels and path components. In the IETF, we have used 'component' for this; please align. 'surrogates' are mentioned as terminating characters. Are these surrogate pairs (in which case, it would be better to talk about non-BMP characters, but then it's totally unclear why these would terminate IRIs). Or are these unpaired surrogate units? In that case, I do not think the document should in any way prescribe how to handle stuff that is below the level of characters as codepoints. Otherwise, we would have to talk about incomplete UTF-8 byte sequences,... BNF, Syntax Issues ================== The document uses an ad-hoc and/or undefined syntactical notation. It says "This BNF uses a Perl-style syntax". Googling for "Perl-style" and "BNF" only leads to irrelevant stuff and the document itself. Please provide the syntax in a well-defined (with reference and syntax-checker, like e.g. the IETF ABNF) meta-syntax. The meta-syntax uses so-called "smart" quotes. This has to be fixed. Some non-terminals in the syntax are not defined. An example is <scheme>. Another is <percentEncodedUTF8>. Some non-terminals use names different from those in the IRI spec although they are exactly the same. An example may be <percentEncodedUTF8>. This seems to correspond to <pct-encoded> in the IRI spec. If it doesn't, then the difference may be that it assumes an underlying UTF-8 encoding; such an assumption would be wrong, <ptc-encoded> can be used to represent raw bytes both in URIs and in IRIs. The document only deals with the so-called "generic" syntax of IRIs. It always requires a double slash and a domain name after the scheme. However, many schemes do not use the "generic" syntax. An example is the mailto scheme; mailto:user@domain.tld would not be matched by the algorithm. The document doesn't allow <iuserinfo> and <iport> components in the <iauthority> part (where it simply uses <domain>). Why were they excluded? Including additional syntax won't lead to many more false positives (because such strings look even more like IRIs than those without these components) and will avoid some false negatives. With respect to potentially syntactically significant characters (i.e. all ASCII symbols), the document uses an approach completely different from the IRI spec, which makes checking of differences nearly impossible. Substractions in character classes are particularly confusing. The use of character classes, in particular [[:L:][:N:][:M:][:S:][:Pd:][:Pc:][:Cf:]..., makes the syntax unreadable except to a very small set of regexperts, which have only a small overlap with Bidi and Uri experts. The IRI spec above ASCII excludes extremely little (just C1, the surrogate area, and non-characters, even private characters are allowed in query parts). It is unclear from the above cryptic syntax what is excluded, and why, and in asmuch as rare stuff is excluded, this doesn't really help making the extraction more precise. There should be a complete list of ASCII symbol characters with their role/function in the IRI spec and in this spec. This is the best way to check for completeness. As an example, in the current syntax, "-" and "~" don't appear anywhere. Are they supposed to be included or excluded? The IDN Label separators from IDNA 2003 are included despite the fact that they are not relevant in IDNA 2008 and they have never been allowed in IRIs. These definitely do appear in practice, but how often will they appear in IRIs involving RTL? My guess is that this chance is extremely low. If I had to cut corners, this is one instance where I'd do so; if somebody really cares about correct bidi display of an IRI with both RTL and ideographs, they should be able to use simple dots. Related, the use of UTS46 probably offers too much leeway. Some restriction, e.g. in the symbol area (and in the area of compatibility characters), could bring some benefits for detection. After all, the overlap between leftovers from IDNA 2003 vanity symbol domain names and bidi-containing domain names can be assumed to be vanishingly small. The <domain> rules allow a label separator at the end. This is technically correct, and allowed in URIs and IRIs (which don't deal at all with the internal structure of domain names, because in their place, names from other registry mechanisms could also be used). However, my guess is that a label separator at the end in vanishingly small in practice these days, and it might help excluding them for better precision. The termination criterion includes unassigned (see also below re. dynamic updates), surrogates (see also above re. terminology), private-use, and control-code (what is meant by that exactly? C0+C1, or something else?) characters. My guess is that except the control codes, this really doesn't help much. Unassigned characters are by definition not used. The explanation of the extraction/termination of IRIs is a mess. This is a place where an algorithmic description will help most. E.g. something along the lines of: For detecting all IRIs in a given text, repeatedly scan for the first place where the IRI syntax matches, and take the longest match. Remove any final characters from that longest match to obtain a matched IRI, and continue detecting from the character immediately following the longest match. (I'm not sure I got the details right (e.g. does only one dot get removed at the end, or two if there are two,...?), but that's the style I'd like to see here, because then I'd actually understand what's supposed to go on.) RTL (and other non-ASCII) scheme names/alternates are clearly not allowed at this time, and there are no plans at all to introduced them. However, it would be prudent in my opinion to a) explore how the various solutions work if ever RTL schemes are considered, and b) if possible to define the algorithm so that it continues to work even in the event that they are introduced, rather than having to go through an additional revision. The filename syntax doesn't include the very common Windows drive letter syntax. There should be a list of syntactic differences between this spec and the IRI spec, with explanations, so that readers can jugde each difference on its merit rather than have to spend their time chasing details. The spec seems to give some special status to some Latin-1 symbols (inverted exclamation mark, middle dot, inverted question mark). It is totally unclear why. The IRI spec is very clear that only ASCII symbols can take syntactic roles (there is no difference here between URIs and IRIs), and if there is some reason to include other symbol-like characters at some point in the syntax, there are clearly many many more such characters than just those in Latin-1. Dynamic Updates? ================ The use of the list of top level domains at IANA is interesting because it provides quite some help to separate IRIs from non-IRIs. However, it is unclear whether the general expectation is that software should be dynamically updated with the IANA list, or whether it's okay to have longer release cycles. ICANN is apparently increasing the number of registrations per year, and many non-ASCII TLDs still remain to be defined. This means that with longer release cycles (e.g. smaller pieces of software that don't have a built-in update mechanism) in the mix, there will always be some discrepancy. This will create a highly undesirable long delay from registration to wide usability of a new TLD. A similar issue appears with unassigned characters that are used as a termination criterion. These also will change from Unicode version to version. Orders ====== http://tools.ietf.org/agenda/79/slides/iri-0.pdf, presented (remotely) at IETF 79, contains slides 19-23. In particular, slide 23 shows four possible solutions. Solution #2 on that slide is equivalent to Option 1 in the document under review. Options 2, 3, and 4 are essentially context/content-dependent variable choices from the table on slide 23. (Similar kinds of overview tables may make this document way more easy to understand.) The paragraph mentioning "big-endian" order in Option 1 is quite irrelevant. Users who are used to some given sequence of components and want to either see that sequence preserved (keep component order strictly LTR) or converted to their preferred directionality (change to have component order strictly go RTL) don't necessarily care about ultimate logic at all. Option 1 has the disadvantage that even IRIs with RTL components only can use an LTR component order, which seems quite unnatural. At the Unicode conference, on Tuesday morning, the group from Microsoft explained that preference for component order was not uniform, and not context- or content-dependent, but depended on country: Israel strongly preferred LTR component order, while many (but not all) Arabic countries preferred RTL order. According to their words, the situation was similar to what happens in Math, but there was no 100% correlation. Regards, Martin.
Received on Monday, 24 October 2011 09:32:18 UTC