- From: Martin J. Duerst <duerst@w3.org>
- Date: Mon, 27 Dec 1999 12:30:03 +0900
- To: www-xml-linking-comments@w3.org
Dear Linking WG,
These are the last call comments regarding Internationalization (i18n)
from the I18N WG/IG.
I'm sending these comments directly to the public list indicated
for comments, but in case there are any topics that need further
discussion, please use crossposting between the relevant group
lists, without copying this list.
Character Sets and Escaping (2.2)
---------------------------------
This is very important for i18n, and as far as we understand from the
text and from previous discussions, our groups in principle understand
each other and agree on the way to go.
However, the current wording comes with a number of problems and
can and should be improved. This point in particular has been
discussed during the I18N WG's last face-to-face meeting.
Several rounds of discussion may be needed to get to the final shape.
The main problems currently are:
- Title: Change 'character sets' to 'character encodings'.
- Substructure: There should be subsections to organize the topic further.
- Lack of details: People familiar with the referenced specifications,
and people checking those thoroughly, will figure things out. However,
these are usually in the minority, and in this particular case, it's
very easy for implementers to think that they are not really concerned.
Giving details, e.g. on which characters in the ASCII range have to
be escaped, is very important to help ensure stable implementations.
In particular, there should be lists of the various relevant URI character
categories (reserved, marks, delims, unwise,...) and how to treat
them.
- Lack of examples: The distinction/interaction between the various
escaping schemes is complex. From our experience, we know that many
implementers get this wrong in one way or another. Adding a number
of examples seems crucial.
(if the above two items should become too long, moving them to
an appendix may be considered).
- In 2.1, the sentence 'does not represent the general applicability
of escaping' should be made more precise (e.g. 'the formal syntax
in this specification defines the syntax of XPointers before the
application of the various kinds of escaping described in Section
2.2').
- I am not yet completely clear as to whether the ^ escaping mechanism
is needed. It would be great if we could avoid it. Let me try to
show that we don't need it:
- XPointer is the part of a URI behind the #.
- The # is the 'most important' character in a URI, i.e. when
a URI is processed, first look for a #.
- The fragment identifier in URI syntax (RFC 2396) allows a wide
range of things, in particular:
fragment = *uric
uric = reserved | unreserved | escaped
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" |
"(" | ")"
escaped = "%" hex hex
- This includes (), for which no escaping is needed.
- This means that %28 (for '(') and %29 (for ')') can be used to
hide unbalanced parentheses from the scheme end detector.
- This means that the ^ escaping mechanism is not needed anymore.
- One escaping mechanism less is a big plus.
The potential problems for the above argument that I currently see are:
- The %28/%29 escapings are used or planned to be used for something else.
- It is not possible to rely on the URI infrastructure to keep
%28/%29 and (/) separated for fragment parts.
- 'The UTF-8 encoding is used for URI-references': There is (unfortunately!)
no general character encoding for URI references. The spec should say
that the UTF-8 encoding is used for XPointers, and the use of any
other encoding with %HH in URIs is an error.
- The Note on handling illegal characters in URIs should make clear the
following:
- Whether it is allowed to use characters not allowed in URIs in
some place in an XML document (or outside) depends on that particular
case, not on whether XPointer is used or not.
- Where XPointer is used, it is advised to allow the use of characters
not permitted in URIs, to make the use of XPointer easier.
- The second Validity Constraint in 2.2 doesn't belong there,
and should be moved somewhere else.
- Both Unicode and UTF-8 are mentioned. References to them must
be added (RFC 2279 for UTF-8, reference to Unicode 3.0 for Unicode).
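As an illustration of two of the points above (UTF-8 as the only encoding
behind %HH, and %28/%29 hiding unbalanced parentheses from the scheme end
detector), here is a minimal Python sketch. The names escape_xpointer,
unescape_xpointer and scheme_data are hypothetical, the choice of which
characters stay unescaped is an assumption made for the example, and the
scanner is a simplification that ignores the ^ mechanism entirely. None of
this is taken from the XPointer draft itself.

```python
from urllib.parse import quote, unquote

# Characters left unescaped: the RFC 2396 'reserved' and 'mark' characters,
# except that ( and ) are deliberately NOT listed, so they become %28 / %29.
# This safe set is an assumption chosen for the illustration.
SAFE = ";/?:@&=+$,-_.!~*'"


def escape_xpointer(text: str) -> str:
    """Percent-escape text, always using UTF-8 as the underlying encoding."""
    return quote(text, safe=SAFE, encoding="utf-8")


def unescape_xpointer(escaped: str) -> str:
    """Undo %HH escaping; byte sequences that are not UTF-8 raise an error."""
    return unquote(escaped, encoding="utf-8", errors="strict")


def scheme_data(fragment: str) -> str:
    """Return the text inside the outermost balanced parentheses.

    Only literal ( and ) are counted, so %28 and %29 are invisible here.
    """
    start = fragment.index("(")
    depth = 0
    for i in range(start, len(fragment)):
        if fragment[i] == "(":
            depth += 1
        elif fragment[i] == ")":
            depth -= 1
            if depth == 0:
                return fragment[start + 1:i]
    raise ValueError("unbalanced parentheses")
```

With these definitions, scheme_data('xpointer(string-range(//p,"a%29b"))')
returns the complete scheme data, which unescapes to a string containing a
lone ')'. The same fragment written with a literal ')' inside the string
ends one parenthesis early. Likewise, unescape_xpointer("%FC") (an
ISO 8859-1 style escape) raises UnicodeDecodeError instead of being
silently accepted.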
Points and Ranges
-----------------
- These are very valuable additions on top of XPath. In particular,
multiple (logical) ranges are essential to address single graphical
ranges in Bidirectional contexts. Also, multiple ranges can be very
helpful to indicate correspondences between versions of the same
document in different languages; this wouldn't be possible with
single ranges.
- There should be some note explaining the potential for multiple
selections appearing as a result of a single graphical selection
in a bidi context (near the 'unique()' function and/or at the end
of 3.1).
- Points and ranges, in accordance with DOM2 and with our Character
Model (see http://www.w3.org/TR/1999/WD-charmod-19991129/#Indexing),
use 0-based indexing. However, as far as we understand, it seems
impossible to access points or ranges based on these indices.
This should preferably be changed by adding appropriate functions,
or alternatively, be pointed out in a note.
- Points and Ranges correspond to DOM's "position" and "range".
However, DOM's position and range are based on UTF-16 units.
XPointer on the other hand works on UCS characters, as is clear
from http://www.w3.org/TR/xpath#strings. It should
be made clear that this also applies to points and ranges.
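The difference between UTF-16 units and UCS characters only shows up for
characters outside the Basic Multilingual Plane. The following Python
sketch makes it concrete; the function names are hypothetical, and it
relies on Python strings being indexed by code point, which stands in here
for the UCS character indexing of XPath.

```python
def ucs_length(text: str) -> int:
    """Number of UCS characters (code points) in text."""
    return len(text)


def utf16_length(text: str) -> int:
    """Number of UTF-16 code units in text, as DOM offsets count them."""
    return len(text.encode("utf-16-le")) // 2


def ucs_to_utf16_offset(text: str, ucs_offset: int) -> int:
    """Convert a 0-based UCS character offset into a 0-based UTF-16 offset."""
    return utf16_length(text[:ucs_offset])


# U+10400 lies outside the BMP and needs a surrogate pair in UTF-16.
sample = "a\U00010400b"
```

For this sample string, ucs_length gives 3 and utf16_length gives 4, and
the point just before "b" is at offset 2 when counting UCS characters but
at offset 3 when counting UTF-16 units.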
Various other issues
--------------------
- Please make sure the CR/PR has a pointer to a translations page.
(see e.g. the XPath Recommendation).
- Intro, Robustness requirements: 'must attempt to be internationalized'
sounds strange. It seems to say 'the WG has to give it a try, but if
they don't get it, no problem'. This is obviously wrong. Please say
something like 'This specification must be appropriately internationalized.'
- 2.1.3, Child Sequences: This is a highly unstable way of addressing
into a document. The i18n WG/IG are in particular concerned about
stability e.g. when translating a document. Giving such an unstable
way of addressing such a distinguished and short syntax seems highly
inappropriate, in particular given the fact that the *[n] notation
does exactly the same thing, can be mixed with other ways of addressing,
and is not much longer.
We request that Child Sequences be removed from the spec for the
above reasons. If this should not be possible, we alternatively
request that a very strong warning against this way of addressing
is added to 2.1.3.
- Whitespace handling in 'string-range': This is defined so that
multiple spaces in the source match with multiple spaces in the
XPointer. Insofar as this is used to deal with 'pretty printing'
in the source or in XPointers (as opposed to catching spurious
double spaces,...), this is inappropriate because it
does not cover languages that are written without spaces, such as
Thai, Chinese, and Japanese. This has to be improved.
- There should be some comment about matching and normalization in
3.5. The best thing to say is that only codepoint-by-codepoint
matching is done, and that both source and XPointer are assumed
to be normalized, and that for things such as case-folding,
appropriate functions in XPath should be used.
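To show why the normalization assumption matters for codepoint-by-codepoint
matching, here is a small Python sketch. The function codepoint_match is
hypothetical, and the use of Normalization Form C is an assumption for the
example; the comment above does not name a particular form.

```python
import unicodedata


def codepoint_match(source: str, pattern: str) -> bool:
    """Plain codepoint-by-codepoint substring match, with no normalization."""
    return pattern in source


# The same visible text, once with precomposed U+00E9 and once with
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# Without normalization the two spellings do not match.
raw_result = codepoint_match(composed, decomposed)

# After normalizing both sides to NFC (assumed form) they do.
normalized_result = codepoint_match(
    unicodedata.normalize("NFC", composed),
    unicodedata.normalize("NFC", decomposed),
)
```

Here raw_result is False and normalized_result is True, which is the
behaviour a note in 3.5 would need to make explicit.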
Regards, Martin.
#-#-# Martin J. Du"rst, World Wide Web Consortium
#-#-# mailto:duerst@w3.org http://www.w3.org
Received on Sunday, 26 December 1999 22:28:29 UTC