Re: character encoding in RDF from Peter F. Patel-Schneider on 2003-11-04 (www-rdf-comments@w3.org from October to December 2003)

From: Peter F. Patel-Schneider <pfps@research.bell-labs.com>
Date: Tue, 04 Nov 2003 08:11:32 -0500 (EST)
To: dave.beckett@bristol.ac.uk
Cc: www-rdf-comments@w3.org
Message-Id: <20031104.081132.23664164.pfps@research.bell-labs.com>
From: Dave Beckett <dave.beckett@bristol.ac.uk>
Subject: Re: character encoding in RDF
Date: Tue, 4 Nov 2003 11:08:31 +0000

> On Wed, 29 Oct 2003 08:36:01 -0500 (EST), "Peter F. Patel-Schneider" <pfps@research.bell-labs.com> wrote:
> 
> > 
> > In the abstract of RDF/XML Syntax Specification Revised (W3C Working Draft
> > 10 October 2003) it is stated that the actions generate ``triples of the
> > RDF graph as defined in RDF Concepts and Abstract Syntax'' ``written using
> > the N-Triples RDF graph serializing format'' defined in RDF Test Cases.
> > 
> > In RDF Test Cases, Section 3.1, it is stated that ``[a]n N-Triples document
> > is a sequence of US-ASCII characters''.  In Section 3, it is further
> > specified that N-Triples documents are to be encoded as ``7-bit US-ASCII''.
> > It is further specified in Section 3.1 that the only allowable characters
> > in absoluteURIs and strings are the characters represented by code points
> > from decimal 32 to decimal 126.  Characters outside of this range (and a
> > few within it) are encoded using a non-standard encoding.  
> > 
> > However, the strings allowed in RDF/XML documents are defined from Unicode
> > strings.  This leads to a number of problems.
> > 
> > Section 6.1.6 of RDF/XML Syntax Specification Revised states that ``[t]he
> > <>-quoted identifier accessor value [of a URI Reference Event] must use the
> > N-Triple escapes for URI references ...''.  This statement, along with the
> > way that these events are created, seems to indicate that URI references in
> > RDF/XML documents must use the N-Triple character encoding for Unicode, not
> > any of the more usual encodings, such as UTF-8.
> 
> RDF/XML is defined on the syntax data model events which are created
> from Unicode strings.  

Agreed but irrelevant to my point.

> The string-value accessors are for outputting the
> events as strings, as N-Triples, which does not affect how the events were
> created from the XML input.

Again agreed, and this is relevant to my point.  

Section 6 of RDF/XML Syntax Specification Revised says that the grammar
actions, ``[t]aken together [...] define a transformation from any
syntactically well-formed RDF/XML into an RDF graph represented in the
N-Triples language''.  It is the ``represented in ...'' phrase that causes
the problems, because it brings in issues related to the character
encoding in N-Triples documents.  If the reference to N-Triples were
removed, then this problem would be eliminated.

> > Section 6.1.8 of RDF/XML Syntax Specification Revised states that ``[t]he
> > double-quoted literal-value accessor value [of a plain literal event] must
> > use the N-Triples escapes for strings ...''.  Again, this statement, along
> > with the way that these events are created, seems to indicate that URI
> > references in RDF/XML documents must use the N-Triple character encoding
> > for Unicode, not any of the more usual encodings, such as UTF-8.
> 
> Again, not in creation and there are no content encoding issues
> involved. Only Unicode strings (from the XML infoset items).  N-Triples
> is an output form only, used to describe the test cases and grammar,
> and is not required of implementations.

There are definitely content encoding issues involved.  The grammar actions
are supposed to emit N-Triples, which brings content encoding issues to the
fore.  If grammar actions were of the form

	Add a triple with subject ..., predicate ..., and object ... to the
	graph. 

instead of

	... the following statement is added to the graph:
	... ... ... .

then there would not be any issues of content encoding.
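
To illustrate the first form of grammar action, here is a minimal sketch in
Python.  The names Triple and add_triple are hypothetical and appear in
neither specification, and the distinction between URI references and
literals is ignored for brevity.  The point is only that an action of this
form stores Unicode strings in the graph as they are, so no character
encoding is involved.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """An abstract triple whose terms are plain Unicode strings (simplified)."""
    subject: str
    predicate: str
    obj: str


def add_triple(graph: Set[Triple], subject: str, predicate: str, obj: str) -> None:
    """Add a triple with the given subject, predicate, and object to the graph.

    Nothing is serialized here, so no escaping or encoding takes place.
    """
    graph.add(Triple(subject, predicate, obj))


graph: Set[Triple] = set()
add_triple(
    graph,
    "http://example.org/doc",                  # hypothetical subject
    "http://purl.org/dc/elements/1.1/title",   # Dublin Core title property
    "R\u00e9sum\u00e9",                        # non-ASCII literal, kept as Unicode
)
```

Writing such a graph out as N-Triples would then be a separate step, applied
after the grammar actions rather than inside them.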

> > Similar problems occur with Attribute Events.
> > 
> > Similar problems occur with Typed Literal Events and Plain Literal Events,
> > indicating that typed literals and plain literals must be written in
> > RDF/XML documents using the N-Triple character encoding for Unicode.
> 
> I don't follow how you conclude there are problems in any of these sections.
> 
> Taking URI reference events as an example.   These are constructed from
> a string value (a Unicode string) used as an RDF reference, the definition
> of which and limitations on the characters allowed are all defined in
> RDF Concepts, linked when that event is first defined.

Agreed.

> When those events are written out as N-Triples, they clearly have to
> conform to the N-Triples syntax rules, but that is solely a way to write
> the Unicode string in N-Triples, it does not limit in any way the range
> of characters in an RDF URI reference.  RDF Concepts defines that, and
> RDF Concepts does not depend on N-Triples.

I agree that they have to conform to the N-Triples syntax rules, and this
is the problem that I see.  The grammar actions directly place Unicode
strings, for example Unicode strings that are part of Plain Literal Events,
into the N-Triples document, without any possibility of encoding.  This
means that this string must be in the form required by N-Triples, which is
the problem that I have seen.

> Similarly for the other events.  The RDF Concepts terms when written in
> N-Triples do not limit the alphabets of the terms.
> 
> > I suggest that the wording in question should be changed to something like:
> > 
> > 	... encodes the same Unicode character string as ... but using the
> > 	string encoding in N-Triples ...
> 
> At present I think I don't understand your problem.  

The problem is that there is no place in the grammar actions for the
encoding used by N-Triples.  In the absence of this transformation, the
character encoding used by N-Triples is pushed back into the RDF/XML document.
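
The missing transformation can be sketched as follows.  This is a minimal
Python sketch, assuming the string escapes of RDF Test Cases Section 3 as
read here: backslash escapes for backslash, double quote, newline, carriage
return, and tab, \uXXXX for other code points up to U+FFFF, and \UXXXXXXXX
above that.  The function name ntriples_escape is hypothetical.

```python
def ntriples_escape(text: str) -> str:
    """Escape a Unicode string for use inside an N-Triples literal.

    Sketch only: printable US-ASCII (decimal 32 to 126) passes through,
    apart from backslash and double quote; everything else is escaped.
    """
    named = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in named:
            out.append(named[ch])
        elif 0x20 <= cp <= 0x7E:
            out.append(ch)                 # printable US-ASCII, unchanged
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)     # four hex digits
        else:
            out.append("\\U%08X" % cp)     # eight hex digits
    return "".join(out)


escaped = ntriples_escape("caf\u00e9")     # 'caf' followed by \u00E9 as six ASCII characters
```

With a step like this between the events and the emitted statement, the
string in a Plain Literal Event could stay an ordinary Unicode string, and
only the N-Triples output would carry the escapes.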

> I'm also not sure where you are proposing wording change; I can't see
> that in any of the sections you mention.  Do you mean the abstract? I
> would think that isn't required to give the fine detail of the document,
> which this might be.

I meant the various bits of the document that I quoted.

> Dave

On further reflection, it would be better to change the grammar actions as
shown above.  This might be too big a change at this stage, so I would
be satisfied with changes to the various bits of Section 6 having to do
with string-value accessors.

Peter F. Patel-Schneider
Received on Tuesday, 4 November 2003 08:16:06 UTC