defn "string" across XML infoset/query/schema, I18N specs from Dan Connolly on 2001-01-03 (www-rdf-comments@w3.org from January to March 2001)

From: Dan Connolly <connolly@w3.org>
Date: Wed, 03 Jan 2001 11:41:54 -0600
To: www-international@w3.org
CC: www-xml-infoset-comments@w3.org, www-xml-schema-comments@w3.org, www-xml-query-comments@w3.org, www-rdf-comments@w3.org
Message-ID: <3A536462.DBDB19C@w3.org>

We were just discussing the infoset spec, and the
lack of a definition of the term "string" there.

In my head, a string is a finite sequence of
Unicode (UCS) characters. I suggested we say
that in the infoset spec. It occurred
to me that we should be consistent with the I18N
character model, and that there should
be some words there that we can cite/steal...

But I don't see any clear mathematical specification
of the term "string" in the character model spec.

4.3 String Identity Matching 
http://www.w3.org/TR/charmod/#IdentityMatching
http://www.w3.org/TR/1999/WD-charmod-19990225#IdentityMatching

Some text that looks relevant, though sorta garbled, is:

	"Level 2: Indexing based on abstract codepoints
             UCS codepoints should be chosen, in accordance
             with Production [2] of [XML 1.0], the SGML
             declaration of [HTML 4.0], and the character model
             of [RFC 2070]. This is the highest level of
             abstraction that ensures interoperability. To avoid
             problems with duplicates, it is assumed that the
             data is normalized according to Section 3.2. "
	-- http://www.w3.org/TR/1999/WD-charmod-19990225#Indexing

By "string" I mean a finite sequence of those things... the abstract
things... it should be clear that these are characters, not
(necessarily) identical to the integer codepoints to which they
correspond.
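
To make the character/codepoint distinction concrete, here is a rough
sketch in Python. It assumes Python's str type is a fair stand-in for
"finite sequence of characters"; the helper names code_points and
from_code_points are made up for illustration and come from no spec.

```python
from typing import List

def code_points(s: str) -> List[int]:
    """Return the UCS code point (an integer) for each character of s."""
    return [ord(ch) for ch in s]

def from_code_points(cps: List[int]) -> str:
    """Rebuild the character sequence from a list of code points."""
    return "".join(chr(cp) for cp in cps)

s = "na\u00efve"            # five characters; the third is U+00EF
cps = code_points(s)        # five integers, one per character
assert from_code_points(cps) == s
```

The string and the list of integers correspond one-to-one, but they are
different kinds of thing, which is the point above.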

I wonder if a formal model would clarify. I started working on one a
while back:

	http://www.w3.org/Architecture/theory/Character.lsl
	Mon, 15 Jan 1996 19:34:44 GMT

but I haven't integrated it into my somewhat more recent, but still
out-of-date, stuff:

	http://www.w3.org/XML/9711theory/XMLElement.
	http://www.w3.org/XML/9711theory/XMLElement.lsl
	http://www.w3.org/XML/9711theory/XMLElement.html
	$Id: XMLElement.lsl,v 1.9 2000/01/17 21:33:41 connolly Exp $

Meanwhile, the term "Character" is grounded in the web at:

	http://www.w3.org/XML/2000/12/infoset-20001211#Character

but in the parts of that RDF schema where one would expect
to find #String, one finds just:

	http://www.w3.org/2000/01/rdf-schema#Literal

which is not constrained to be a sequence of characters;
RDF literals can include markup etc.

Hmm... I suspect the Query data model spec has a specification
for character and string, but I haven't looked. So let's look...
http://www.w3.org/TR/query-datamodel/
http://www.w3.org/TR/2000/WD-query-datamodel-20000511/
ah... it takes its definition of string from the schema spec...
of course, I should have thought of that...

Ah yes, this text will do nicely:

[[[
3.2.1 string

        [Definition:]  The string datatype represents character
        strings in XML. The value space of string is the set of
        finite-length sequences of characters (as defined in [XML
        1.0 Recommendation (Second Edition)]) that match the
        Char production from [XML 1.0 Recommendation
        (Second Edition)]. A character is an atomic unit of
        communication; it is not further specified except to note
        that every character has a corresponding Universal Code
        Set code point ([ISO 10646], [Unicode] and [Unicode3]),
        which is an integer. 

             NOTE: As noted in Order (§2.4.1.2), the fact
             that this specification does not specify an
             order-relation for string does not preclude
             other applications from treating strings as
             being ordered. 
]]]
http://www.w3.org/TR/2000/CR-xmlschema-2-20001024/#string

Hm... I'm surprised by the restrictive clause
"that match the Char production..."; do we really
mean to exclude strings including the 0th character (U+0000)
or the 1st character (U+0001, a la CTRL-A) from XML strings?
I guess so. Well, I learn something new every day.
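
For the record, a small sketch of what that restriction amounts to,
using the Char production from XML 1.0 (Second Edition):
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
The function names is_xml_char and is_xml_string are mine, not from any
spec, and Python's str is assumed as the model of a character sequence.

```python
def is_xml_char(ch: str) -> bool:
    """True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # excludes the other C0 controls
        or 0xE000 <= cp <= 0xFFFD      # excludes surrogates, U+FFFE, U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )

def is_xml_string(s: str) -> bool:
    """True if every character of s matches Char, i.e. s is in the
    value space of the schema string datatype as quoted above."""
    return all(is_xml_char(ch) for ch in s)

print(is_xml_string("plain text\n"))   # True
print(is_xml_string("nul: \x00"))      # False
print(is_xml_string("ctrl-a: \x01"))   # False
```

So a sequence containing U+0000 or U+0001 is a perfectly good sequence
of UCS characters, but not a string in the schema sense.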

So the term "string" in the infoset spec refers to an
item in the value space of the string datatype.

Er... of course, the dependency should go the other way:
the schema spec should import its definition of "string"
from the character model spec, either directly, or
indirectly, thru the infoset spec. The infoset spec
should import its definition from the character model spec.

Hmm... I'm not sure if scheduling that dependency is
manageable, but that's how it *should* work, in theory.

Hmm... the term "string" seems to have a home in the web
in the query data model spec...
no, those hyperlinked "StringValue" terms just refer to

section 3.8 Values
http://www.w3.org/TR/2000/WD-query-datamodel-20000511/#valueNode


[Hmm... it would be great to "Webize" the notation used
in the query data model spec; see
http://www.w3.org/DesignIssues/Webize.html

I suspect the result would be what we're after for the Semantic Web...
	http://www.w3.org/DesignIssues/Semantic.html
	http://www.w3.org/DesignIssues/Logic.html
	http://www.w3.org/2000/01/sw/

But I should send that request in a separate message...
]


-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Wednesday, 3 January 2001 12:42:00 UTC