I18N comments on XPointer last call from Martin J. Duerst on 1999-12-27 (www-xml-linking-comments@w3.org from October to December 1999)

From: Martin J. Duerst <duerst@w3.org>
Date: Mon, 27 Dec 1999 12:30:03 +0900
To: www-xml-linking-comments@w3.org
Message-Id: <199912270328.MAA08707@sh.w3.mag.keio.ac.jp>
Dear Linking WG,

These are the last call comments regarding Internationalization (i18n)
from the I18N WG/IG.

I'm sending these comments directly to the public list indicated
for comments, but in case there are any topics that need further
discussion, please use crossposting between the relevant group
lists, without copying this list.


Character Sets and Escaping (2.2)
---------------------------

This is very important for i18n, and as far as we understand from the
text and from previous discussions, our groups in principle understand
each other and agree on the way to go. 

However, the current wording comes with a number of problems and
can and should be improved. This point in particular has been
discussed during the I18N WG's last face-to-face meeting.
Several rounds of discussion may be needed to get to the final shape.
The main problems currently are:

- Title: Change 'character sets' to 'character encodings'.

- Substructure: There should be subsections to organize the topic further.

- Lack of details: People familiar with the referenced specifications,
  and people checking those thoroughly, will figure things out. However,
  these are usually in the minority, and in this particular case, it's
  very easy for implementers to think that they are not really concerned.
  Giving details, e.g. on which characters in the ASCII range have to
  be escaped,..., is very important to try to assure stable implementations.
  In particular, there should be lists of the various relevant URI character
  categories (reserved, marks, delims, unwise,...) and how to treat
  them.

- Lack of examples: The distinction/interaction between the various
  escaping schemes is complex. From our experience, we know that many
  implementers get this wrong in one way or another. Adding a number
  of examples seems crucial.

  (if the above two items should become too long, moving them to
   an appendix may be considered).

- In 2.1, the sentence 'does not represent the general applicability
  of escaping' should be made more precise (e.g. 'the formal syntax
  in this specification defines the syntax of XPointers before the
  application of the various kinds of escaping described in Section
  2.2').

- I am not yet completely clear as to whether the ^ escaping mechanism
  is needed. It would be great if we could avoid it. Let me try to
  show that we don't need it:
  - XPointer is the part of an URI behind the #.
  - The # is the 'most important' character in an URI, i.e. when
    an URI is processed, first look for a #.
  - The fragment identifier in URI syntax (RFC 2396) allows a wide
    range of things, in particular:
           fragment      = *uric
           uric          = reserved | unreserved | escaped
           reserved      = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                           "$" | ","
           unreserved    = alphanum | mark
           mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
                           "(" | ")"

           escaped       = "%" hex hex
  - This includes (), for which no escaping is needed.
  - This means that %28 (for '(') and %29 (for ')') can be used to
    hide unbalanced parentheses from the scheme end detector.
  - This means that the ^ escaping mechanism is not needed anymore.
  - One escaping mechanism less is a big plus.

  The potential problems for the above argument that I currently see are:
  - The %28/%29 escapings are used or planned to be used for something else.
  - It is not possible to rely on the URI infrastructure to keep
    %28/%29 and (/) separated for fragment parts.

- 'The UTF-8 encoding is used for URI-references': There is (unfortunately!)
  no general character encoding for URI references. The spec should say
  that the UTF-8 encoding is used for XPointers, and the use of any
  other encoding with %HH in URIs is an error.

- The Note on handling illegal characters in URIs should make clear the
  following:
  - Whether it is allowed to use characters not allowed in URIs in
    some place in an XML document (or outside) depends on that particular
    case, not on whether XPointer is used or not.
  - Where XPointer is used, it is advised to allow the use of characters
    not permitted in URIs, to make the use of XPointer easier.

- The second Validity Constraint in 2.2 doesn't belong there,
  and should be moved somewhere else.

- Both Unicode and UTF-8 are mentioned. References to them must
  be added (RFC 2279 for UTF-8, reference to Unicode 3.0 for Unicode).
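
To illustrate the escaping items in this list (UTF-8 followed by %HH
escaping, and %28/%29 for hiding unbalanced parentheses), here is a
minimal Python sketch. It is not taken from the XPointer draft: the
names escape_for_uri and split_scheme are hypothetical, and the second
function assumes that the URI infrastructure keeps %28/%29 distinct
from literal ( and ), which is exactly the open question raised above.

```python
from urllib.parse import quote, unquote

# RFC 2396 character categories that may appear unescaped in a fragment.
RESERVED = ";/?:@&=+$,"
MARK = "-_.!~*'()"


def escape_for_uri(xpointer):
    """Encode as UTF-8, then %HH-escape everything outside 'uric'.

    quote() already leaves ASCII letters, digits and "_.-~" alone, so
    adding RESERVED and MARK as safe characters keeps the whole uric set.
    Delims, unwise characters, space and non-ASCII are all escaped.
    """
    return quote(xpointer, safe=RESERVED + MARK, encoding="utf-8")


def split_scheme(fragment):
    """Split 'name(data)' by counting literal parentheses only.

    Assumption: %28 and %29 are still escaped at this point, so they do
    not affect the depth count. The data is unescaped afterwards, and
    any %HH sequence that is not valid UTF-8 raises UnicodeDecodeError.
    Anything after the matching close parenthesis is ignored here.
    """
    name, sep, rest = fragment.partition("(")
    if not sep:
        raise ValueError("no scheme data found")
    depth = 1
    for i, ch in enumerate(rest):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return name, unquote(rest[:i], encoding="utf-8", errors="strict")
    raise ValueError("unbalanced parentheses")
```

For example, escape_for_uri("xpointer(//p[2])") escapes the "unwise"
characters [ and ] but leaves the parentheses alone, and
split_scheme("xpointer(string-range(//p,'a%29b'))") finds the scheme
end even though the string literal contains an unbalanced ')'.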


Points and Ranges
-----------------

- These are very valuable additions on top of XPath. In particular,
  multiple (logical) ranges are essential to address single graphical
  ranges in Bidirectional contexts. Also, multiple ranges can be very
  helpful to indicate correspondences between versions of the same
  document in different languages; this wouldn't be possible with
  single ranges.

- There should be some note explaining the potential for multiple
  selections appearing as a result of a single graphical selection
  in a bidi context (near the 'unique()' function and/or at the end
  of 3.1).

- Points and ranges, in accordance with DOM2 and with our Character
  Model (see http://www.w3.org/TR/1999/WD-charmod-19991129/#Indexing),
  use 0-based indexing. However, as far as we understand, it seems
  impossible to access points or ranges based on these indices.
  This should preferably be changed by adding appropriate functions,
  or alternatively, be pointed out in a note.

- Points and Ranges correspond to DOM's "position" and "range".
  However, DOM's position and range are based on UTF-16 units.
  XPointer on the other hand works on UCS characters, as is clear
  from http://www.w3.org/TR/xpath#strings. It should
  be made clear that this also applies to points and ranges.
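
The difference between UTF-16 units and UCS characters only shows up
for characters outside the BMP. A small Python sketch of the mismatch
follows; the helper names are hypothetical, and Python strings are used
here as a stand-in for a sequence of UCS characters.

```python
def utf16_length(text):
    """Number of UTF-16 code units needed for text (DOM-style counting)."""
    return len(text.encode("utf-16-le")) // 2


def char_index_to_utf16_offset(text, index):
    """Convert a 0-based character index into a 0-based UTF-16 offset."""
    return utf16_length(text[:index])


# U+10400 lies outside the BMP, so it takes two UTF-16 units
# (a surrogate pair) but counts as a single character.
text = "a\U00010400b"
chars = len(text)            # counted in UCS characters
units = utf16_length(text)   # counted in UTF-16 units
```

With this text, the point before "b" is at index 2 when counting
characters, but at offset 3 when counting UTF-16 units.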


Various other issues
--------------------

- Please make sure the CR/PR has a pointer to a translations page
  (see e.g. the XPath Recommendation).

- Intro, Robustness requirements: 'must attempt to be internationalized'
  sounds strange. It seems to say 'the WG has to give it a try, but if
  they don't get it, no problem'. This is obviously wrong. Please say
  something like 'This specification must be appropriately internationalized.'

- 2.1.3, Child Sequences: This is a highly unstable way of addressing
  into a document. The i18n WG/IG are in particular concerned about
  stability e.g. when translating a document. Giving such an unstable
  way of addressing such a distinguished and short syntax seems highly
  inappropriate, in particular given the fact that the *[n] notation
  does exactly the same thing, can be mixed with other ways of
  addressing, and is not much longer.
  We request that Child Sequences be removed from the spec for the
  above reasons. If this should not be possible, we alternatively
  request that a very strong warning against this way of addressing
  be added to 2.1.3.

- Whitespace handling in 'string-range': This is defined so that
  multiple spaces in the source match with multiple spaces in the
  XPointer. Insofar as this is used to deal with 'pretty printing'
  in the source or in XPointers (as opposed to catching spurious
  double spaces,...), this is inappropriate because it
  does not cover languages that are written without spaces, such as
  Thai, Chinese, and Japanese. This has to be improved.

- There should be some comment about matching and normalization in
  3.5. The best thing to say is that only codepoint-by-codepoint
  matching is done, and that both source and XPointer are assumed
  to be normalized, and that for things such as case-folding,
  appropriate functions in XPath should be used.
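
The effect of codepoint-by-codepoint matching on unnormalized text can
be sketched in a few lines of Python. The function name is hypothetical,
and the choice of NFC is an assumption made for the example only; the
comment above does not name a particular normalization form.

```python
import unicodedata


def contains_codepoints(source, pattern):
    """Plain codepoint-by-codepoint substring match, no normalization."""
    return pattern in source


composed = "caf\u00e9"      # e-acute as one precomposed character
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two strings look identical when rendered, but do not match.
raw_match = contains_codepoints(composed, decomposed)

# Assumption: both sides are normalized to NFC before matching.
normalized_match = contains_codepoints(
    unicodedata.normalize("NFC", composed),
    unicodedata.normalize("NFC", decomposed),
)
```

Here raw_match is False and normalized_match is True, which is why the
spec should state what it assumes about normalization.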


Regards,   Martin.




#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org
Received on Sunday, 26 December 1999 22:28:29 UTC