FW: I18N Requirements on XML Queries

-----Original Message-----
From: Martin J. Duerst [mailto:duerst@w3.org]
Sent: Monday, June 26, 2000 3:53 AM
To: Paul Cotton; w3c-xml-query-wg@w3.org
Cc: w3c-i18n-ig@w3.org
Subject: I18N Requirements on XML Queries


Hello Paul, dear XML Query WG,

At its last f2f in Paris (the first taking 3 days instead of
only 2 days), the I18N WG finally got around to taking a careful
look at the XML Query Requirements. We apologize for our delay.

Below please find a list of requirements that we have drawn up
as part of the minutes of the recent meeting.

Please note that the list below cannot be complete, in the
sense that we cannot guarantee that if all the requirements
below are addressed, XML Queries are appropriately internationalized.
Rather, the requirements below provide input for further work.

Please feel free to contact us at any time if you have any
questions or comments.

Regards,    Martin.


     [31]XML Query Requirements

      [31] http://www.w3.org/TR/xmlquery-req

    Apologies for delay. Our comments will evolve as the spec evolves and
    as our understanding of it improves; we would like to work closely
    with the Query WG. Our current comments on XML Query Requirements are:
     a. The XML Query language should work equally well with any data or
        query, regardless of the (human) languages or locales involved.
        (This does not imply automatic translation!)
     b. IURI: the requirements must address the existence of URI
        references containing non-ASCII characters according to the
        Character Model.
     c. UCS evolution: the requirements must address the issue of UCS
        evolution (in particular the addition of new characters).
     d. 3.3.2: please remove "1.0" from "XML 1.0 character data".
     e. 3.4.14 to 3.4.16: why "SHOULD" rather than "MUST"?
     f. 3.4.17: add "locale" and "time zone" to the list ("such as...").
        Clarify what "in which the query is executed" means. In
        particular, the relevant information generally relates to the
        user. The locale and time zone of the server are generally
        irrelevant to the results of the query.
     g. The semantics of queries should have a clear interpretation with
        respect to locale. Make sure that all aspects of the language are
        either locale-independent or that sufficient locale information is
        contained in the query to make it unambiguous. A
        locale-independent approach should be adopted wherever possible
        (e.g. number format), localization being handled at the user end.
        In some cases (such as a query of strings that collate between
        "Smith" and "Thomas"), it is necessary to have certain locale
        information ("in the Danish sorting order") be part of the query.
     h. Ideally, it should be possible to transmit arbitrary collating
        tailoring information with a query. (Cf. Unicode Technical Report
        #10 for details on collating and tailoring.)
     i. It may not be possible for a processor to use collating information
        based upon an arbitrary tailoring or a specified locale (e.g. for
        performance reasons or unavailability of the collating data for
        the specified locale). In such a case, the processor must not
        simply return false results: it may decline the query or return
        results according to another collating sequence, together with a
        warning of that fact.
     j. Section 4: strengthen the statement about the relationship between
        our two groups.
     k. Section 4: change "W3C goals for international access to the Web"
        to "W3C goals for i18n".
     l. The data model must account for inherited attributes (such as
        xml:lang).
     m. Query processors need to know about the structure of xml:lang. If
        a query asks for a match of the string "chat" with xml:lang="fr",
        the query should match data with xml:lang="fr-BE". Note: the
        language tag spec, RFC 1766, is currently being extended (3-letter
        language codes being introduced). The XML 1.0 spec has been
        amended to take this into account.
     n. It is a goal of i18n that queries involving string matching
        ("select x where x='some_constant'") treat canonically equivalent
        strings (in the Unicode sense) as matching. If the query and the
        target are both XML, early normalization (as per the Character
        Model) is assumed and binary comparison ensures that the
        equivalence requirement is satisfied. However, if the target is
        originally a legacy database which logically has a layer that
        exports the data as XML, that XML must be exported in normalized
        form. The XML Query spec must impose the normalization requirement
        upon such layers.
     o. Similarly, the query may come from a user-interface layer that
        creates the XML query. The XML Query spec must impose the
        normalization requirement upon such layers.
     p. Provided that the query and the target are in Normalization Form
        C, the output of the query must itself be in Normalization Form C.
     q. Queries involving string matching should support various kinds of
        loose matching (such as case-insensitivity, katakana-hiragana
        equivalence, accent-accentless equivalence, etc.)
     r. If such features as case-insensitivity are present in queries
        involving string matching, these features must be properly
        internationalized (e.g. case folding works for accented letters)
        and language-dependence must be taken into account (e.g. Turkish
        dotless-i).
     s. Queries involving character counting and indexing must take into
        account the Character Model. Specifically, they should follow
        Layer 3 (locale-independent graphemes). Additional details can be
        found in The Unicode Standard 3.0 and UTR#18. Queries involving
        word counting and indexing should similarly follow the
        recommendations in these references.
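
    Items m and n above can be illustrated with a short sketch. The
    Python below is only an illustration under stated assumptions and is
    not part of the I18N WG minutes: the function names lang_matches and
    canonically_equal are hypothetical, the xml:lang check assumes a
    simple case-insensitive prefix match on hyphen-separated sub-tags,
    and the string comparison assumes that normalizing both sides to
    Normalization Form C is an acceptable way to test canonical
    equivalence.

```python
import unicodedata


def lang_matches(query_lang: str, data_lang: str) -> bool:
    """Return True if data_lang is query_lang or a sub-tag refinement of it.

    Assumption: tags are compared case-insensitively and a match requires
    either equality or that the data tag continues with "-" after the
    query tag (so "fr" matches "fr-BE" but not "fro").
    """
    q = query_lang.lower()
    d = data_lang.lower()
    return d == q or d.startswith(q + "-")


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to Normalization Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # Item m: a query for xml:lang="fr" should match data tagged "fr-BE".
    print(lang_matches("fr", "fr-BE"))
    print(lang_matches("fr", "fro"))

    # Item n: precomposed and decomposed spellings of the same word.
    precomposed = "caf\u00e9"      # "é" as a single code point
    decomposed = "cafe\u0301"      # "e" followed by a combining acute accent
    print(precomposed == decomposed)
    print(canonically_equal(precomposed, decomposed))
```

    The binary comparison of the two spellings fails, while the
    comparison after normalization succeeds. This is why the minutes ask
    for normalization to be imposed on any layer that exports data or
    queries as XML: once both sides are already normalized, a plain
    binary comparison is sufficient.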

Received on Tuesday, 4 July 2000 12:07:50 UTC