- From: Paul Cotton <pcotton@microsoft.com>
- Date: Tue, 4 Jul 2000 09:07:16 -0700
- To: "'www-xml-query-comments@w3.org'" <www-xml-query-comments@w3.org>
-----Original Message-----
From: Martin J. Duerst [mailto:duerst@w3.org]
Sent: Monday, June 26, 2000 3:53 AM
To: Paul Cotton; w3c-xml-query-wg@w3.org
Cc: w3c-i18n-ig@w3.org
Subject: I18N Requirements on XML Queries
Hello Paul, dear XML Query WG,
At its last f2f in Paris (the first taking 3 days instead of
only 2 days), the I18N WG finally got around to taking a careful
look at XML Query Requirements. We apologize for our delay.
Below please find a list of requirements that we have drawn up
as part of the minutes of the recent meeting.
Please note that the list below cannot be complete, in the
sense that we cannot guarantee that if all the requirements
below are addressed, XML Queries are appropriately internationalized.
Rather, the requirements below provide input for further work.
Please feel free to contact us at any time if you have any
questions or comments.
Regards, Martin.
[31]XML Query Requirements
[31] http://www.w3.org/TR/xmlquery-req
Apologies for the delay. Our comments will evolve as the spec evolves
and as our understanding of it improves; we would like to work
closely with the Query WG. Our current comments on XML Query
Requirements are:
a. The XML Query language should work equally well with any data or
query, regardless of the (human) languages or locales involved.
(This does not imply automatic translation!)
b. IURI: the requirements must address the existence of URI
references containing non-ASCII characters according to the
Character Model.
c. UCS evolution: the requirements must address the issue of UCS
evolution (in particular the addition of new characters).
d. 3.3.2: please remove "1.0" from "XML 1.0 character data".
e. 3.4.14 to 3.4.16: why "SHOULD" rather than "MUST"?
f. 3.4.17: add "locale" and "time zone" to the list ("such as...").
Clarify what "in which the query is executed" means. In
particular, the relevant information generally relates to the
user. The locale and time zone of the server are generally
irrelevant to the results of the query.
g. The semantics of queries should have a clear interpretation with
respect to locale. Make sure that all aspects of the language are
either locale-independent or that sufficient locale information is
contained in the query to make it unambiguous. A
locale-independent approach should be adopted wherever possible
(e.g. number format), localization being handled at the user end.
In some cases (such as a query of strings that collate between
"Smith" and "Thomas"), it is necessary to have certain locale
information ("in the Danish sorting order") be part of the query.
h. Ideally, it should be possible to transmit arbitrary collating
tailoring information with a query. (Cf. Unicode Technical Report
#10 for details on collating and tailoring.)
i. It may not be possible for a processor to use collating information
based upon an arbitrary tailoring or a specified locale (e.g. for
performance reasons or unavailability of the collating data for
the specified locale). In such a case, the processor must not simply
return false results: it may decline the query or return results
according to another collating sequence, together with a warning
of that fact.
j. Section 4: strengthen the statement about the relationship between
our 2 groups.
k. Section 4: change "W3C goals for international access to the Web"
to "W3C goals for i18n".
l. The data model must account for inherited attributes (such as
xml:lang).
m. Query processors need to know about the structure of xml:lang. If
a query asks for a match of the string "chat" with xml:lang="fr",
the query should match data with xml:lang="fr-BE". Note: the
language tag spec, RFC 1766, is currently being extended (3-letter
language codes being introduced). The XML 1.0 spec has been
amended to take this into account.
n. It is a goal of i18n that queries involving string matching
("select x where x='some_constant'") treat canonically equivalent
strings (in the Unicode sense) as matching. If the query and the
target are both XML, early normalization (as per the Character
Model) is assumed and binary comparison ensures that the
equivalence requirement is satisfied. However, if the target is
originally a legacy database which logically has a layer that
exports the data as XML, that XML must be exported in normalized
form. The XML Query spec must impose the normalization requirement
upon such layers.
o. Similarly, the query may come from a user-interface layer that
creates the XML query. The XML Query spec must impose the
normalization requirement upon such layers.
p. Provided that the query and the target are in normalized form C,
the output of the query must itself be in normalized form C.
q. Queries involving string matching should support various kinds of
loose matching (such as case-insensitivity, katakana-hiragana
equivalence, accent-accentless equivalence, etc.)
r. If such features as case-insensitivity are present in queries
involving string matching, these features must be properly
internationalized (e.g. case folding works for accented letters)
and language-dependence must be taken into account (e.g. Turkish
dotless-i).
s. Queries involving character counting and indexing must take into
account the Character Model. Specifically, they should follow
Layer 3 (locale-independent graphemes). Additional details can be
found in The Unicode Standard 3.0 and UTR#18. Queries involving
word counting and indexing should similarly follow the
recommendations in these references.
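
The xml:lang matching described in item m can be illustrated with a
minimal sketch. It assumes simple prefix matching on hyphen-separated
subtags, compared case-insensitively; the function name lang_matches is
hypothetical and is not taken from any XML Query draft.

```python
def lang_matches(requested: str, actual: str) -> bool:
    """Return True if the xml:lang value `actual` falls under `requested`.

    A match is either an exact (case-insensitive) match or a match on a
    whole-subtag prefix, so "fr" matches "fr-BE" but not "fra".
    """
    requested = requested.lower()
    actual = actual.lower()
    return actual == requested or actual.startswith(requested + "-")


print(lang_matches("fr", "fr-BE"))   # True: "fr-BE" is a refinement of "fr"
print(lang_matches("fr", "fra"))     # False: "fra" is a different primary tag
print(lang_matches("fr-BE", "fr"))   # False: the data is less specific than the query
```

The comparison is on whole subtags rather than raw string prefixes, so
that a query for "fr" does not accidentally match an unrelated tag that
merely begins with the same letters.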
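
The canonical-equivalence point in items n to p can be sketched in the
same way. The example below assumes Unicode Normalization Form C and
uses Python's standard unicodedata module; the helper name
canonically_equal is hypothetical. It shows why binary comparison is
only safe once both the query and the target have been normalized.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to Form C (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "caf\u00e9"    # e-acute as a single code point
decomposed = "cafe\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                   # False: binary comparison differs
print(canonically_equal(precomposed, decomposed))  # True: equal once both are in NFC
```

If every layer that produces XML (the export layer of item n and the
user-interface layer of item o) normalizes early, the plain ==
comparison on the first print line is sufficient and the helper is not
needed at query time.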
Received on Tuesday, 4 July 2000 12:07:50 UTC