Re: On citing Unicode

On Tue, Mar 14, 2006 at 11:44:57AM +0900, Felix Sasaki wrote:
> Hi Eric,
> 
> You had recently (in Mandelieu) a question on citing Unicode. I looked
> at charmod again and have the impression that the following answers your
> concern:
> 
> [[The fact that both ISO/IEC 10646 and the Unicode Standard are evolving
> (in synchrony) raises the issue of versioning: should a specification
> refer to a specific version of the standard, or should it make a generic
> reference, so that the normative reference is to the version current at
> the time of reading the specification? In general the answer is both.
> 
> C063  [S]  A generic reference to the Unicode Standard MUST be made if
> it is desired that characters allocated after a specification is
> published are usable with that specification. A specific reference to
> the Unicode Standard MAY be included to ensure that functionality
> depending on a particular version is available and will not change over
> time.
> 
> An example would be the set of characters acceptable as Name characters
> in XML 1.0 [XML 1.0], which is an enumerated list that parsers must
> implement to validate names.]]
> 
> That is: your implementation has to choose whether it wants to go the
> xml 1.0 or the xml 1.1 way; in the latter case, just cite Unicode as xml
> 1.1 does.

We face a similar question whenever we cite any specification. That's
why W3C offers both versioned and latest-version URIs for
specifications. We use the versions URI [VER] as a latest-version URI
and clarify that by adding "... as it may from time to time be revised
or amended" to the citation.

My issue is that the definition of _Character_string_ [STR]:
[[
...each represented by a code point in Unicode...
]]
is better suited to the XML 1.0 way as "code point" denotes *assigned*
code points*. In SPARQL, I'm using the (slightly contradictory) text
[[
A SPARQL query string is a Unicode character string (c.f. section 6.1
String concepts of [CHARMOD]) in the language defined by the following
grammar, starting with the Query production. For compatibility with
future versions of Unicode, the characters in this string may include
unassigned Unicode codepoints (see Identifier and Pattern Syntax
[UNIID] section 4 Pattern Syntax). For productions with excluded
character classes (for example "[^<>'{}|^`]"), the characters are
excluded from the range #x00 - #xEFFFFF.
]]
to clarify that I include yet-unassigned code points in the range #x00
- #xEFFFFF. This way, the person writing the parser knows exactly what
characters are in the language. They need to know this range to write
a complete parser, such as one you write with lex and yacc.
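
A minimal sketch of the membership test such a parser needs, assuming
Python. The names in_class, EXCLUDED, RANGE_LO and RANGE_HI are
hypothetical, and the range is copied as-is from the draft text above.
The test depends only on the fixed range and the excluded set, never on
whether a code point is currently assigned:

```python
# Range as quoted in the draft text above; taken as given here.
RANGE_LO = 0x00
RANGE_HI = 0xEFFFFF

# Code points excluded by the example class [^<>'{}|^`].
EXCLUDED = frozenset(ord(c) for c in "<>'{}|^`")


def in_class(cp: int) -> bool:
    """Return True if code point cp matches the class [^<>'{}|^`].

    Membership is decided by the fixed range and the excluded set only,
    so the result does not change when Unicode assigns new characters.
    """
    return RANGE_LO <= cp <= RANGE_HI and cp not in EXCLUDED


if __name__ == "__main__":
    print(in_class(ord("a")))   # ordinary character inside the range
    print(in_class(ord("<")))   # explicitly excluded character
    print(in_class(0xF00000))   # outside the stated range
```

Working on integer code points rather than strings keeps the check
independent of any library's idea of which characters exist.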

(Aside: 
People writing library-dependent parsers may rely on the system
libraries to distinguish Unicode characters. In this case, the
libraries may evolve shortly behind Unicode, obviating the need for
any real definition of what's in a character string.
)
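
To illustrate the library-dependent approach from the aside, here is a
small sketch assuming Python's standard unicodedata module; is_assigned
is a hypothetical helper name. Its answer depends on the Unicode version
bundled with the library, which is exactly why such a parser's notion of
"character" moves with the library:

```python
import unicodedata


def is_assigned(ch: str) -> bool:
    """Return True unless the library's tables put ch in category Cn.

    Cn covers unassigned code points and noncharacters, as known to the
    Unicode version this library build ships with.
    """
    return unicodedata.category(ch) != "Cn"


if __name__ == "__main__":
    # Unicode version known to this library build.
    print(unicodedata.unidata_version)
    print(is_assigned("A"))       # an assigned letter
    print(is_assigned("\uffff"))  # U+FFFF, a permanent noncharacter
```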

Savvy people have asked if SPARQL uses a grammar that's a moving
target [MOV]. The mechanism of an add-only character set with a
specified range is not immediately clear to people.

*This point about *assigned* code points is the crux of my argument.
If it is wrong, and "code point" includes unassigned code points, we
don't need this extra text in SPARQL.

[VER] http://www.unicode.org/unicode/standard/versions/
[STR] http://www.w3.org/TR/2005/REC-charmod-20050215/#def-character-string
[MOV] http://www.w3.org/mid/43C026C2.4090300@dajobe.org
-- 
cheers, 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

Received on Tuesday, 14 March 2006 09:47:46 UTC