- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Tue, 14 Mar 2006 04:47:34 -0500
- To: Felix Sasaki <fsasaki@w3.org>
- Cc: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
- Message-ID: <20060314094734.GD20832@w3.org>
On Tue, Mar 14, 2006 at 11:44:57AM +0900, Felix Sasaki wrote:
> Hi Eric,
>
> You had recently (in Mandelieu) a question on citing Unicode. I looked
> at charmod again and have the impression that the following answers your
> concern:
>
> [[The fact that both ISO/IEC 10646 and the Unicode Standard are evolving
> (in synchrony) raises the issue of versioning: should a specification
> refer to a specific version of the standard, or should it make a generic
> reference, so that the normative reference is to the version current at
> the time of reading the specification? In general the answer is both.
>
> C063 [S] A generic reference to the Unicode Standard MUST be made if
> it is desired that characters allocated after a specification is
> published are usable with that specification. A specific reference to
> the Unicode Standard MAY be included to ensure that functionality
> depending on a particular version is available and will not change over
> time.
>
> An example would be the set of characters acceptable as Name characters
> in XML 1.0 [XML 1.0], which is an enumerated list that parsers must
> implement to validate names.]]
>
> That is: your implementation has to choose whether it wants to go the
> xml 1.0 or xml 1.1. way; in the later case, just cite Unicode as xml
> 1.1. does.

We face a similar question whenever we cite any specification. That's
why W3C offers versioned and latest URIs for specifications. We use the
versions URI [VER] as a latest-version URI and clarify that by adding
"... as it may from time to time be revised or amended" to the citation.

My issue is that the definition of _Character_string_ [STR]:
[[
...each represented by a code point in Unicode...
]]
is better suited to the XML 1.0 way, as "code point" denotes *assigned*
code points*. In SPARQL, I'm using the (slightly contradictory) text
[[
A SPARQL query string is a Unicode character string (c.f. section 6.1
String concepts of [CHARMOD]) in the language defined by the following
grammar, starting with the Query production. For compatibility with
future versions of Unicode, the characters in this string may include
unassigned Unicode codepoints (see Identifier and Pattern Syntax [UNIID]
section 4 Pattern Syntax). For productions with excluded character
classes (for example "[^<>'{}|^`]"), the characters are excluded from
the range #x00 - #xEFFFFF.
]]
to clarify that I include yet-unassigned code points in the range
#x00 - #xEFFFFF. This way, the person writing the parser knows exactly
which characters are in the language. They need to know this range to
write a complete parser, such as one written with lex and yacc.

(Aside: People writing library-dependent parsers may rely on the system
libraries to distinguish Unicode characters. In this case, the libraries
may evolve shortly behind Unicode, obviating the need for any real
definition of what's in a character string.)

Savvy people have asked whether SPARQL uses a grammar that's a moving
target [MOV]. The mechanism of an add-only character set with a
specified range is not immediately clear to people.

*This point about *assigned* code points is the crux of my argument. If
it is wrong, and "code point" includes unassigned code points, we don't
need this extra text in SPARQL.

[VER] http://www.unicode.org/unicode/standard/versions/
[STR] http://www.w3.org/TR/2005/REC-charmod-20050215/#def-character-string
[MOV] http://www.w3.org/mid/43C026C2.4090300@dajobe.org

--
cheers, -eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
        Shonan Fujisawa Campus, Keio University,
        5322 Endo, Fujisawa, Kanagawa 252-8520
        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
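
To illustrate the "add-only character set with a specified range" idea
from the message, here is a minimal Python sketch. It is only an
illustration under stated assumptions, not SPARQL's actual lexer: the
names in_language and is_unassigned are hypothetical, and the universe of
characters is assumed to be whatever Python can represent (up to
sys.maxunicode) rather than the range quoted in the draft text. The point
it shows is that a negated character class accepts a code point whether
or not the local Unicode tables have an assignment for it, so the set of
accepted characters does not move when Unicode adds characters.

```python
import re
import sys
import unicodedata

# The excluded character class used as the example in the draft text.
# A negated class is defined purely by what it excludes, so it never
# consults the Unicode character database.
NOT_EXCLUDED = re.compile(r"[^<>'{}|^`]")


def in_language(ch: str) -> bool:
    """True if the single character ch matches the negated class.

    Assumption: the universe is Python's code space, 0..sys.maxunicode.
    """
    return len(ch) == 1 and NOT_EXCLUDED.fullmatch(ch) is not None


def is_unassigned(ch: str) -> bool:
    """True if this Python build's Unicode tables report category Cn.

    The answer depends on the Unicode version bundled with the
    interpreter (unicodedata.unidata_version), which is exactly the
    "moving" part that the fixed-range wording tries to avoid.
    """
    return unicodedata.category(ch) == "Cn"


if __name__ == "__main__":
    for cp in (ord("a"), ord("<"), 0x0378, sys.maxunicode):
        ch = chr(cp)
        print(f"U+{cp:04X} in_language={in_language(ch)} "
              f"unassigned={is_unassigned(ch)}")
```

A library-dependent parser, as in the aside, would instead filter on
something like is_unassigned, and its accepted set would then change with
each Unicode release that the library picks up.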
Received on Tuesday, 14 March 2006 09:47:46 UTC