- From: Peter F. Patel-Schneider <pfps@research.bell-labs.com>
- Date: Tue, 04 Nov 2003 08:11:32 -0500 (EST)
- To: dave.beckett@bristol.ac.uk
- Cc: www-rdf-comments@w3.org
From: Dave Beckett <dave.beckett@bristol.ac.uk>
Subject: Re: character encoding in RDF
Date: Tue, 4 Nov 2003 11:08:31 +0000

> On Wed, 29 Oct 2003 08:36:01 -0500 (EST), "Peter F. Patel-Schneider" <pfps@research.bell-labs.com> wrote:
>
> > In the abstract of RDF/XML Syntax Specification Revised (W3C Working Draft
> > 10 October 2003) it is stated that the actions generate ``triples of the
> > RDF graph as defined in RDF Concepts and Abstract Syntax'' ``written using
> > the N-Triples RDF graph serializing format'' defined in RDF Test Cases.
> >
> > In RDF Test Cases, Section 3.1, it is stated that ``[a]n N-Triples document
> > is a sequence of US-ASCII characters''. In Section 3, it is further
> > specified that N-Triples documents are to be encoded as ``7-bit US-ASCII''.
> > It is further specified in Section 3.1 that the only allowable characters
> > in absoluteURIs and strings are the characters represented by code points
> > from decimal 32 to decimal 126. Characters outside of this range (and a
> > few within it) are encoded using a non-standard encoding.
> >
> > However, the strings allowed in RDF/XML documents are defined from Unicode
> > strings. This leads to a number of problems.
> >
> > Section 6.1.6 of RDF/XML Syntax Specification Revised states that ``[t]he
> > <>-quoted identifier accessor value [of a URI Reference Event] must use the
> > N-Triple escapes for URI references ...''. This statement, along with the
> > way that these events are created seems to indicate that URI references in
> > RDF/XML documents must use the N-Triple character encoding for Unicode, not
> > any of the more usual encodings, such as UTF-8.
>
> RDF/XML is defined on the syntax data model events which are created
> from Unicode strings.

Agreed but irrelevant to my point.

> The string-value accessors are for outputting the
> events as strings, as N-Triples, which does not affect how the events were
> created, from the XML input.

Again agreed, and this is relevant to my point.
Section 6 of RDF/XML Syntax Specification Revised says that the grammar
actions, ``[t]aken together [...] define a transformation from any
syntactically well-formed RDF/XML into an RDF graph represented in the
N-Triples language''. It is the ``represented in ...'' phrase that causes
the problems, because it does bring in issues related to the character
encoding in N-Triples documents. If the reference to N-Triples were
removed, then this problem would be eliminated.

> > Section 6.1.8 of RDF/XML Syntax Specification Revised states that ``[t]he
> > double-quoted literal-value accessor value [of a plain literal event] must
> > use the N-Triples escapes for strings ...''. Again, this statement, along
> > with the way that these events are created seems to indicate that URI
> > references in RDF/XML documents must use the N-Triple character encoding
> > for Unicode, not any of the more usual encodings, such as UTF-8.
>
> Again, not in creation and there are no content encoding issues
> involved. Only Unicode strings (from the XML infoset items). N-Triples
> is an output form only, in order to describe the test cases and grammar
> and not required to implement.

There are definitely content encoding issues involved. The grammar actions
are supposed to emit N-Triples, which brings content encoding issues to the
fore. If grammar actions were of the form

  Add a triple with subject ..., predicate ..., and object ... to the graph.

instead of

  ... the following statement is added to the graph:
  ... ... ... .

then there would not be any issues of content encoding.

> > Similar problems occur with Attribute Events.
> >
> > Similar problems occur with Typed Literal Events and Plain Literal Events,
> > indicating that typed literals and plain literals must be written in
> > RDF/XML documents using the N-Triple character encoding for Unicode.
>
> I don't follow how you conclude there are problems in any of these sections.
>
> Taking URI reference events as an example.
> These are constructed from
> a string value (a Unicode string) used as an RDF reference, the definition
> of which and limitations on the characters allowed are all defined in
> RDF Concepts, linked when that event is first defined.

Agreed.

> When those events are written out as N-Triples, they clearly have to
> conform to the N-Triples syntax rules, but that is solely a way to write
> the Unicode string in N-Triples, it does not limit in any way the range
> of characters in an RDF URI reference. RDF Concepts defines that, and
> RDF Concepts does not depend on N-Triples.

I agree that they have to conform to the N-Triples syntax rules, and this
is the problem that I see. The grammar actions directly place Unicode
strings, for example Unicode strings that are part of Plain Literal Events,
into the N-Triples document, without any possibility of encoding. This
means that this string must be in the form required by N-Triples, which is
the problem that I have seen.

> Similarly for the other events. The RDF Concepts terms when written in
> N-Triples do not limit the alphabets of the terms.
>
> > I suggest that the wording in question should be changed to something like:
> >
> > ... encodes the same Unicode character string as ... but using the
> > string encoding in N-Triples ...
>
> At present I think I don't understand your problem.

The problem is that there is no place in the grammar actions for the
encoding used by N-Triples. In the absence of this transformation, the
character encoding used by N-Triples is pushed back into the RDF/XML
document.

> I'm also not sure where you are proposing wording change; I can't see
> that in any of the sections you mention. Do you mean the abstract? I
> would think that isn't required to give the fine detail of the document,
> which this might be.

I meant the various bits of the document that I quoted.

> Dave

On further reflection, it would be better to change the grammar actions as
shown above.
This might be too big of a change at this stage, so I would be
satisfied with changes to the various bits of Section 6 having to do with
string-value accessors.

Peter F. Patel-Schneider
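
To make the disputed step concrete, here is a minimal sketch in Python of the transformation the message says is missing from the grammar actions: taking the Unicode string of a Plain Literal Event and writing it with N-Triples string escapes. The function names `ntriples_escape` and `plain_literal_triple` are hypothetical, and the escape rules are an assumption based on the N-Triples string syntax in RDF Test Cases (backslash escapes for a few characters, `\uHHHH` and `\UHHHHHHHH` for everything outside decimal 32 to 126). It is an illustration only, not wording from either specification.

```python
def ntriples_escape(text: str) -> str:
    """Return `text` written with N-Triples string escapes (assumed rules)."""
    # Characters that have a dedicated backslash escape.
    simple = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in simple:
            out.append(simple[ch])
        elif 0x20 <= cp <= 0x7E:
            # Printable US-ASCII is written as-is.
            out.append(ch)
        elif cp <= 0xFFFF:
            # Remaining control characters and the rest of the BMP.
            out.append("\\u%04X" % cp)
        else:
            # Code points above U+FFFF.
            out.append("\\U%08X" % cp)
    return "".join(out)


def plain_literal_triple(subject: str, predicate: str, value: str) -> str:
    """Format one triple with a plain literal object.

    Assumes `subject` and `predicate` are already US-ASCII URI references;
    escaping of URI references is not shown here.
    """
    return '<%s> <%s> "%s" .' % (subject, predicate, ntriples_escape(value))


if __name__ == "__main__":
    # The literal value is a Unicode string; the emitted line is US-ASCII.
    print(plain_literal_triple("http://example.org/s",
                               "http://example.org/p",
                               "caf\u00e9"))
```

With the escaping applied at output time in this way, the Unicode string held by the event and the US-ASCII form written to the N-Triples document remain distinct, which is the separation the suggested wording change asks the specification to state.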
Received on Tuesday, 4 November 2003 08:16:06 UTC