- From: Phillips, Addison <addison@lab126.com>
- Date: Mon, 24 Oct 2011 08:00:02 -0700
- To: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
FYI...

> -----Original Message-----
> From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On
> Behalf Of "Martin J. Dürst"
> Sent: Monday, October 24, 2011 2:32 AM
> To: public-iri@w3.org
> Cc: Mark Davis
> Subject: October 2011 feedback on PRI 185 (long!)
>
> The message below was posted to the respective forum on the Unicode
> Web site (http://www.unicode.org/forum/viewforum.php?f=34). I'm
> sending it here because it's highly relevant for our work. It is also relevant to
> IDNA and EAI.
>
>
>
> These are my comments on "Extension of UBA for improved display of
> URL/IRIs", available from http://www.unicode.org/review/pri185/, as
> modified on Sept. 22 (presumably 2011).
>
> I have commented mostly on procedural issues in the last round, but have
> taken a deeper look at technical and editorial issues this time, too. I wrote
> part of this on a very long flight, so some references are missing; if you need
> some additional pointers, don't hesitate to ask.
>
>
> Procedural Issues
> =================
>
> Opening the ability to comment via the Unicode Forum is some progress over
> the previous way of commenting via a Web form (which was essentially a
> black hole for outsiders). However, it is still very much a one-way street.
>
> This makes it difficult to involve affected communities such as the IETF
> Working Groups (WGs, or former WGs) on Internationalized Domain Names,
> Email Address Internationalization, and Internationalized Resource Identifiers,
> the relevant groups at W3C, and, in many ways most importantly, the users
> who are actually affected because they use bidi IRIs.
>
> It also makes it difficult to find out how, and more importantly, why,
> comments have been addressed or not. It is still a long way from how
> other organizations deal with public comments. In the W3C, providing a
> public list of all public comments and how they are addressed is standard
> practice.
> In the IETF, most of the discussions are held on public mailing lists,
> and many WGs are using a public tracker (http://trac.tools.ietf.org/).
>
>
> Preexisting Specs and Parallel Work
> ===================================
>
> The IRI specification (RFC 3987, http://tools.ietf.org/html/rfc3987) as well as
> draft updates (http://tools.ietf.org/html/draft-ietf-iri-3987bis), and the
> specifications for Email Address Internationalization as well as their draft
> updates should be referenced at the start of the document. (There are no
> specs for file names as far as I know.) It's a good idea to point readers to
> more introductory material such as Richard Ishida's "idn-and-iri", but that's
> not enough for what is supposed to become (part of) a spec itself.
>
> RFC 3987 contains a section about the display of bidirectional IRIs (Section 4,
> http://tools.ietf.org/html/rfc3987#section-4.4). This should clearly be
> mentioned in the document. The section was written based on the following
> goals/assumptions:
>
> 1) That it would be desirable that bidi IRIs were displayed the same
> everywhere, both in places where they are identified as such (e.g. a
> browser's address/location bar) and in free text where no special processing
> could be applied to them.
>
> 2) That it was unfeasible to change the Unicode Bidirectional Algorithm
> (UBA) to deal with IRIs as a special case.
>
> The first assumption is shared by the current proposal; removing the second
> assumption is at the base of the current proposal.
>
> Now that changing (or extending) the UBA is on the table, we have to check
> what needs specifying, and where. My current take is that we have the
> following pieces:
>
> 1) Display of bidi IRIs once identified: UBA extension, with strong input from
> stakeholders in affected regions and from IRI WG.
>
> 2) Identifying IRIs in context: This would ideally be provided by the IETF.
> There is Appendix C of the URI spec
> (http://tools.ietf.org/html/rfc3986#appendix-C), Delimiting a URI in Context,
> and there was at least one attempt to do something in this direction (see
> http://tools.ietf.org/html/draft-yoneya-iri-recognition-00), but no wider
> interest and no pressure for standardization (the functionality seemed to
> work well where needed (e.g. email programs) and minor differences in
> implementation seemed to hurt nobody). So there's a rather large chance
> that this remains for the UTC to do, although with strong input from the IETF.
>
> 3) Restrictions on strong directionality mixing for components such as domain
> name labels: This is done for IDNA in RFC 5893
> (http://tools.ietf.org/html/rfc5893) and is being updated and adapted based
> on the RFC 5893 effort for IRIs in the IRI WG. Input from the bidi experts in
> the UTC is greatly appreciated.
>
> We should make sure that we have got something like the above "pieces of
> the puzzle" right before we get too much into specific technical details.
>
>
> Document Target
> ===============
>
> There is talk about this being an experimental extension. Care should be
> taken to be extremely clear about what these two words mean, in particular
> because I don't know of any other cases where this has been done in a
> Unicode context.
>
> Extension seems to mean "the bidi algorithm can be used with or without
> this". This is desirable from an implementer's perspective, but not from
> a security perspective.
>
> Experimental seems to mean "we aren't really sure yet whether this will
> fly, and whether we got the details right". It would be very good if
> this could be avoided by more careful deliberations and work up-front;
> the consequences of late changes for both security and implementers
> would be really bad.
>
> If this is an extension, I'd personally prefer this to be in a separate
> document rather than to be part of TR #9.
>
>
> Other Changes to the Bidi Algorithm
> ===================================
>
> With the exception of minor tweaks, the bidi algorithm has been stable
> for almost 15 years. But in recent years, there has been increased
> activity with new ideas for modification, both in the bidi algorithm
> itself and on higher levels (see the HTML work initiated by Aharon
> Lanin). It looks like these changes are being added piecemeal without
> yet seeing a new horizon of stability (after their IUC talk on Tuesday
> morning, people from Microsoft said that their parenthesis detection
> solution solved 13% of reported bidi problems; that means there may
> easily be more fixes coming).
>
> But the bidi algorithm isn't an area where constant tinkering is
> advisable. It would therefore be very important that all these new
> initiatives are carefully checked against each other, and coordinated
> both in timing and in substance. It may well be advisable to hold back
> some of them so that many changes can be made 'in bulk' (the idea of a
> UBA 2.0), which will also help implementers.
>
>
> Readability and Self-Containedness of the Document
> ==================================================
>
> In order to gain valuable comments not only from total insiders, the
> document has to be much more accessible to potential commenters. This
> starts with the title and the start of the introduction, which
> should explicitly mention email addresses and filenames, because
> otherwise the document is ignored by people interested in these items.
>
> The number of examples is extremely low (3). There are no examples of
> email addresses or filenames. There are no examples of non-generic
> (opaque syntax) URI schemes (e.g. mailto:,...). There are way too few
> examples to show what happens under different combinations of RTL and
> LTR components. There are no examples with realistic names (e.g.
> existing RTL top-level domains).
> There is a need for these to give
> people an everyday feel for the issue, while there is also a need to use
> abstract names (abc,...) to test usability when guessing is hard.
>
> [The IRI spec, RFC 3987, has 10 examples (see
> http://tools.ietf.org/html/rfc3987#section-4.4) just to explain a single
> solution to the problem.]
>
> All examples use the "uppercase is RTL" convention, which is good for
> outsiders, but doesn't show the potential end result for the people
> really affected. Parallel examples in Arabic and Hebrew are very important.
>
> [As an RFC (all US-ASCII), the IRI spec was not able to include Arabic
> or Hebrew, but we made sure we provided Arabic and Hebrew equivalents
> for the examples (see
> http://www.w3.org/International/iri-edit/BidiExamples.html) and
> referenced them from the spec. The 11th example has been added based
> on feedback. These examples are generated by a Ruby script; it should not
> be too difficult to change the script to produce examples for this spec.]
>
>
> Security
> ========
>
> The document correctly notes that ambiguous displays of bidi IRIs,...
> can cause security problems. However, the document is wrong and/or
> misleading in stating and/or implying that the proposal will remove
> ambiguity and confusion, except potentially in the very long term (10 to
> 20 years). The current specification for the display of bidi IRIs (RFC
> 3987, Section 4) uses the current bidi algorithm applied in an LTR
> context. In current implementations, display in an RTL context may also
> happen. A new specification will introduce at least a third alternative.
> While it may help reduce tinkering by implementers, it still creates (at
> least) one more alternative, and this should be very, very clearly noted
> in the document.
>
> The document doesn't contain a security section, but it very clearly
> needs one. The IETF has an RFC on how to write good security sections.
>
>
> Terminology
> ===========
>
> The document uses 'fields' for e.g. individual domain name labels and
> path components. In the IETF, we have used 'component' for this; please
> align.
>
> 'surrogates' are mentioned as terminating characters. Are these
> surrogate pairs (in which case, it would be better to talk about non-BMP
> characters, but then it's totally unclear why these would terminate
> IRIs)? Or are these unpaired surrogate units? In that case, I do not
> think the document should in any way prescribe how to handle stuff that
> is below the level of characters as codepoints. Otherwise, we would have
> to talk about incomplete UTF-8 byte sequences,...
>
>
> BNF, Syntax Issues
> ==================
>
> The document uses an ad-hoc and/or undefined syntactical notation. It
> says "This BNF uses a Perl-style syntax". Googling for "Perl-style" and
> "BNF" only leads to irrelevant stuff and the document itself. Please
> provide the syntax in a well-defined meta-syntax (with a reference and a
> syntax checker, such as the IETF ABNF).
>
> The meta-syntax uses so-called "smart" quotes. This has to be fixed.
>
> Some non-terminals in the syntax are not defined. An example is
> <scheme>. Another is <percentEncodedUTF8>.
>
> Some non-terminals use names different from those in the IRI spec
> although they are exactly the same. An example may be
> <percentEncodedUTF8>. This seems to correspond to <pct-encoded> in the
> IRI spec. If it doesn't, then the difference may be that it assumes an
> underlying UTF-8 encoding; such an assumption would be wrong,
> <pct-encoded> can be used to represent raw bytes both in URIs and in IRIs.
>
> The document only deals with the so-called "generic" syntax of IRIs. It
> always requires a double slash and a domain name after the scheme.
> However, many schemes do not use the "generic" syntax. An example is the
> mailto scheme; mailto:user@domain.tld would not be matched by the
> algorithm.
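The mailto point above can be illustrated with a small Python sketch. The first pattern is the generic URI-reference regular expression from RFC 3986, Appendix B. The second, GENERIC_ONLY, is a hypothetical stand-in (an assumption, not the grammar from the PRI 185 document) for any syntax that always demands "//" and a host after the scheme. The function names are likewise made up for illustration.

```python
import re

# Generic URI-reference regex as given in RFC 3986, Appendix B.
RFC3986_APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# Hypothetical stand-in (assumption) for a grammar that always requires
# "//" plus a non-empty host after the scheme.
GENERIC_ONLY = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://[^/?#\s]+")


def split_reference(ref):
    """Split a reference into scheme, authority, and path (Appendix B groups)."""
    m = RFC3986_APPENDIX_B.match(ref)  # every part is optional, so this always matches
    return {"scheme": m.group(2), "authority": m.group(4), "path": m.group(5)}


def matches_generic_only(ref):
    """True if the reference has the scheme://host shape."""
    return GENERIC_ONLY.match(ref) is not None


if __name__ == "__main__":
    for ref in ("http://example.com/a", "mailto:user@domain.tld"):
        print(ref, split_reference(ref), matches_generic_only(ref))
```

Under these assumptions, mailto:user@domain.tld has a scheme and a path but no authority, so a matcher that insists on "//" rejects it even though it is a perfectly valid URI.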
>
> The document doesn't allow <iuserinfo> and <iport> components in the
> <iauthority> part (where it simply uses <domain>). Why were they
> excluded? Including additional syntax won't lead to many more false
> positives (because such strings look even more like IRIs than those
> without these components) and will avoid some false negatives.
>
> With respect to potentially syntactically significant characters (i.e.
> all ASCII symbols), the document uses an approach completely different
> from the IRI spec, which makes checking of differences nearly
> impossible. Subtractions in character classes are particularly confusing.
>
> The use of character classes, in particular
> [[:L:][:N:][:M:][:S:][:Pd:][:Pc:][:Cf:]..., makes the syntax unreadable
> except to a very small set of regexperts, which have only a small
> overlap with bidi and URI experts. The IRI spec above ASCII excludes
> extremely little (just C1, the surrogate area, and non-characters; even
> private characters are allowed in query parts). It is unclear from the
> above cryptic syntax what is excluded, and why, and inasmuch as rare
> stuff is excluded, this doesn't really help make the extraction more
> precise.
>
> There should be a complete list of ASCII symbol characters with their
> role/function in the IRI spec and in this spec. This is the best way to
> check for completeness. As an example, in the current syntax, "-" and
> "~" don't appear anywhere. Are they supposed to be included or excluded?
>
> The IDN Label separators from IDNA 2003 are included despite the fact
> that they are not relevant in IDNA 2008 and they have never been allowed
> in IRIs. These definitely do appear in practice, but how often will they
> appear in IRIs involving RTL? My guess is that this chance is extremely
> low. If I had to cut corners, this is one instance where I'd do so; if
> somebody really cares about correct bidi display of an IRI with both RTL
> and ideographs, they should be able to use simple dots.
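As a rough illustration of "use simple dots": IDNA 2003 (RFC 3490, Section 3.1) treats U+3002, U+FF0E, and U+FF61 as label separators in addition to U+002E. A minimal Python sketch of mapping them to plain dots before any IRI detection could look as follows. The function name to_simple_dots is hypothetical, and whether such a mapping belongs in input processing at all is an assumption, not something either spec prescribes.

```python
# Label separators recognized by IDNA 2003 besides U+002E FULL STOP:
# U+3002 IDEOGRAPHIC FULL STOP, U+FF0E FULLWIDTH FULL STOP,
# U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP.
IDNA2003_SEPARATORS = {0x3002: ".", 0xFF0E: ".", 0xFF61: "."}


def to_simple_dots(host):
    """Replace IDNA 2003 label separators with U+002E (hypothetical helper)."""
    return host.translate(IDNA2003_SEPARATORS)


if __name__ == "__main__":
    print(to_simple_dots("example\u3002com"))
```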
>
> Related, the use of UTS46 probably offers too much leeway. Some
> restriction, e.g. in the symbol area (and in the area of compatibility
> characters), could bring some benefits for detection. After all, the
> overlap between leftovers from IDNA 2003 vanity symbol domain names and
> bidi-containing domain names can be assumed to be vanishingly small.
>
> The <domain> rules allow a label separator at the end. This is
> technically correct, and allowed in URIs and IRIs (which don't deal at
> all with the internal structure of domain names, because in their place,
> names from other registry mechanisms could also be used). However, my
> guess is that a label separator at the end is vanishingly rare in
> practice these days, and excluding it might help precision.
>
> The termination criterion includes unassigned (see also below re.
> dynamic updates), surrogates (see also above re. terminology),
> private-use, and control-code (what is meant by that exactly? C0+C1, or
> something else?) characters. My guess is that except the control codes,
> this really doesn't help much. Unassigned characters are by definition
> not used.
>
> The explanation of the extraction/termination of IRIs is a mess. This is
> a place where an algorithmic description will help most. E.g. something
> along the lines of:
> For detecting all IRIs in a given text, repeatedly scan for the first
> place where the IRI syntax matches, and take the longest match. Remove
> any final characters from that longest match to obtain a matched IRI,
> and continue detecting from the character immediately following the
> longest match.
> (I'm not sure I got the details right (e.g. does only one dot get
> removed at the end, or two if there are two,...?), but that's the style
> I'd like to see here, because then I'd actually understand what's
> supposed to go on.)
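The scan-and-trim procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: IRI_CANDIDATE is a deliberately crude, hypothetical pattern (neither the PRI 185 syntax nor the RFC 3987 grammar), TRAILING is an assumed set of "final characters", and the sketch strips every trailing character in that set, which is one possible answer to the "one dot or two" question rather than the answer.

```python
import re

# Hypothetical, simplified candidate pattern (assumption): a scheme, "://",
# then the longest run of characters that are not whitespace, <, > or ".
IRI_CANDIDATE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*://[^\s<>\"]+")

# Assumed set of final characters that belong to the surrounding prose.
TRAILING = ".,;:!?)"


def find_iris(text):
    """Return all IRIs found by repeated leftmost-longest matching."""
    found = []
    pos = 0
    while True:
        # Scan for the first place where the pattern matches; the greedy
        # quantifier yields the longest match starting there.
        m = IRI_CANDIDATE.search(text, pos)
        if m is None:
            break
        # Remove final characters from the longest match.
        found.append(m.group(0).rstrip(TRAILING))
        # Continue right after the longest match, not after the trimmed IRI.
        pos = m.end()
    return found


if __name__ == "__main__":
    print(find_iris("See http://example.com/a.b. Also (http://example.org/x), ok."))
```

Resuming at the end of the longest match, rather than at the end of the trimmed IRI, follows the wording of the description; the removed characters are then never rescanned.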
>
> RTL (and other non-ASCII) scheme names/alternates are clearly not
> allowed at this time, and there are no plans at all to introduce them.
> However, it would be prudent in my opinion to
> a) explore how the various solutions work if ever RTL schemes are
> considered, and
> b) if possible to define the algorithm so that it continues to work even
> in the event that they are introduced, rather than having to go through
> an additional revision.
>
> The filename syntax doesn't include the very common Windows drive letter
> syntax.
>
> There should be a list of syntactic differences between this spec and
> the IRI spec, with explanations, so that readers can judge each
> difference on its merit rather than have to spend their time chasing
> details.
>
> The spec seems to give some special status to some Latin-1 symbols
> (inverted exclamation mark, middle dot, inverted question mark). It is
> totally unclear why. The IRI spec is very clear that only ASCII symbols
> can take syntactic roles (there is no difference here between URIs and
> IRIs), and if there is some reason to include other symbol-like
> characters at some point in the syntax, there are clearly many many more
> such characters than just those in Latin-1.
>
>
> Dynamic Updates?
> ================
>
> The use of the list of top level domains at IANA is interesting because
> it provides quite some help to separate IRIs from non-IRIs. However, it
> is unclear whether the general expectation is that software should be
> dynamically updated with the IANA list, or whether it's okay to have
> longer release cycles. ICANN is apparently increasing the number of
> registrations per year, and many non-ASCII TLDs still remain to be
> defined. This means that with longer release cycles (e.g. smaller pieces
> of software that don't have a built-in update mechanism) in the mix,
> there will always be some discrepancy.
> This will create a highly
> undesirable long delay from registration to wide usability of a new TLD.
>
> A similar issue appears with unassigned characters that are used as a
> termination criterion. These also will change from Unicode version to
> version.
>
>
> Orders
> ======
>
> http://tools.ietf.org/agenda/79/slides/iri-0.pdf, presented (remotely)
> at IETF 79, contains slides 19-23. In particular, slide 23 shows four
> possible solutions. Solution #2 on that slide is equivalent to Option 1
> in the document under review. Options 2, 3, and 4 are essentially
> context/content-dependent variable choices from the table on slide 23.
> (Similar kinds of overview tables may make this document much easier
> to understand.)
>
> The paragraph mentioning "big-endian" order in Option 1 is quite
> irrelevant. Users who are used to some given sequence of components and
> want to either see that sequence preserved (keep component order
> strictly LTR) or converted to their preferred directionality (change to
> have component order strictly go RTL) don't necessarily care about
> ultimate logic at all.
>
> Option 1 has the disadvantage that even IRIs that have only RTL
> components are forced into an LTR component order, which seems quite
> unnatural.
>
> At the Unicode conference, on Tuesday morning, the group from Microsoft
> explained that preference for component order was not uniform, and not
> context- or content-dependent, but depended on country: Israel strongly
> preferred LTR component order, while many (but not all) Arabic countries
> preferred RTL order. According to their words, the situation was similar
> to what happens in Math, but there was no 100% correlation.
>
>
> Regards, Martin.
Received on Monday, 24 October 2011 15:12:21 UTC