- From: Dan Connolly <connolly@w3.org>
- Date: Wed, 03 Jan 2001 11:41:54 -0600
- To: www-international@w3.org
- CC: www-xml-infoset-comments@w3.org, www-xml-schema-comments@w3.org, www-xml-query-comments@w3.org, www-rdf-comments@w3.org
We were just discussing the infoset spec, and the lack of a definition of the term "string" there. In my head, a string is a finite sequence of unicode (UCS) characters. I suggested we say that in the infoset spec.

It occurred to me that we should be consistent with the I18N character model, and that there should be some words that we can cite/steal... I don't see any clear mathematical specification of the term "string" in the spec.

4.3 String Identity Matching
http://www.w3.org/TR/charmod/#IdentityMatching
http://www.w3.org/TR/1999/WD-charmod-19990225#IdentityMatching

Some text that looks relevant, though sorta garbled is:

"Level 2: Indexing based on abstract codepoints

UCS codepoints should be chosen, in accordance with Production [2] of [XML 1.0], the SGML declaration of [HTML 4.0], and the character model of [RFC 2070]. This is the highest level of abstraction that ensures interoperability. To avoid problems with duplicates, it is assumed that the data is normalized according to Section 3.2. "

-- http://www.w3.org/TR/1999/WD-charmod-19990225#Indexing

By "string" I mean a finite sequence of those things... the abstract things... it should be clear that these are characters, not (necessarily) identical to the integer codepoints to which they correspond.

I wonder if a formal model would clarify. I started working on one a while back:

http://www.w3.org/Architecture/theory/Character.lsl
Mon, 15 Jan 1996 19:34:44 GMT

but I haven't integrated it into my somewhat more recent, but still out of date stuff:

http://www.w3.org/XML/9711theory/XMLElement.
http://www.w3.org/XML/9711theory/XMLElement.lsl
http://www.w3.org/XML/9711theory/XMLElement.html
$Id: XMLElement.lsl,v 1.9 2000/01/17 21:33:41 connolly Exp $

Meanwhile, the term "Character" is grounded in the web at:

http://www.w3.org/XML/2000/12/infoset-20001211#Character

but in the parts of that RDF schema where one would expect to find #String, one finds just:

http://www.w3.org/2000/01/rdf-schema#Literal

which is not constrained to be a sequence of characters; RDF literals can include markup etc.

Hmm... I suspect the Query data model spec has a specification for character and string, but I haven't looked. So let's look...

http://www.w3.org/TR/query-datamodel/
http://www.w3.org/TR/2000/WD-query-datamodel-20000511/

ah... it takes its definition of string from the schema spec... of course, I should have thought of that...

Ah yes, this text will do nicely:

[[[
3.2.1 string

[Definition:] The string datatype represents character strings in XML. The value space of string is the set of finite-length sequences of characters (as defined in [XML 1.0 Recommendation (Second Edition)]) that match the Char production from [XML 1.0 Recommendation (Second Edition)]. A character is an atomic unit of communication; it is not further specified except to note that every character has a corresponding Universal Code Set code point ([ISO 10646], [Unicode] and [Unicode3]), which is an integer.

NOTE: As noted in Order (§2.4.1.2), the fact that this specification does not specify an order-relation for string does not preclude other applications from treating strings as being ordered.
]]]

http://www.w3.org/TR/2000/CR-xmlschema-2-20001024/#string

Hm... I'm surprised by the restrictive clause "that match the Char production..."; do we really mean to exclude strings including the 0th character or the 1st character (ala CTRL-A) from XML strings? I guess so. Well, I learn something new every day.
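
To make the effect of that restrictive clause concrete, here is a minimal sketch in Python of the value space described in the quoted definition: finite sequences of characters, each of which matches the Char production of XML 1.0 (Second Edition). The names is_xml_char and in_string_value_space are hypothetical and appear in none of the specs; the sketch also assumes a Python str is an acceptable stand-in for a sequence of abstract characters, using ord() to get the corresponding code point.

```python
# Char production [2] from XML 1.0 (Second Edition):
#   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Each pair below is an inclusive range of code points.
_CHAR_RANGES = (
    (0x9, 0xA),           # tab and line feed
    (0xD, 0xD),           # carriage return
    (0x20, 0xD7FF),       # everything from space up to the surrogate block
    (0xE000, 0xFFFD),     # after the surrogates, excluding U+FFFE and U+FFFF
    (0x10000, 0x10FFFF),  # the supplementary planes
)


def is_xml_char(ch: str) -> bool:
    """Hypothetical helper: does this single character match Char?"""
    code_point = ord(ch)  # the integer corresponding to the character
    return any(lo <= code_point <= hi for lo, hi in _CHAR_RANGES)


def in_string_value_space(s: str) -> bool:
    """Hypothetical helper: is s a finite sequence of Char-matching characters?"""
    return all(is_xml_char(ch) for ch in s)
```

Under this reading, in_string_value_space("a\x00b") and in_string_value_space("\x01") are both False, which is the exclusion of the 0th character and CTRL-A noted above, while tab, line feed and carriage return are still allowed.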

So the term "string" in the infoset spec refers to an item in the value space of the string datatype.

Er... of course, the dependency should go the other way: the schema spec should import its definition of "string" from the character model spec, either directly, or indirectly, thru the infoset spec. The infoset spec should import its definition from the character model spec. Hmm... I'm not sure if scheduling that dependency is manageable, but that's how it *should* work, in theory.

Hmm... the term string seems to have a home in the web... no, those hyperlinked "StringValue" terms refer to section

3.8 Values
http://www.w3.org/TR/2000/WD-query-datamodel-20000511/#valueNode

[Hmm... it would be great to "Webize" the notation used in the query data model spec

http://www.w3.org/DesignIssues/Webize.html

I suspect the result would be what we're after for the Semantic Web...

http://www.w3.org/DesignIssues/Semantic.html
http://www.w3.org/DesignIssues/Logic.html
http://www.w3.org/2000/01/sw/

But I should send that request in a separate message... ]

--
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Wednesday, 3 January 2001 12:42:00 UTC