Re: character encoding in RDF from Peter F. Patel-Schneider on 2003-11-04 (www-rdf-comments@w3.org from October to December 2003)

From: Peter F. Patel-Schneider <pfps@research.bell-labs.com>
Date: Tue, 04 Nov 2003 10:20:00 -0500 (EST)
To: dave.beckett@bristol.ac.uk
Cc: www-rdf-comments@w3.org
Message-Id: <20031104.102000.04559863.pfps@research.bell-labs.com>
From: Dave Beckett <dave.beckett@bristol.ac.uk>
Subject: Re: character encoding in RDF
Date: Tue, 4 Nov 2003 14:36:01 +0000

> On Tue, 04 Nov 2003 08:11:32 -0500 (EST), "Peter F. Patel-Schneider" <pfps@research.bell-labs.com> wrote:
> 
> > 
> > From: Dave Beckett <dave.beckett@bristol.ac.uk>
> > Subject: Re: character encoding in RDF
> > Date: Tue, 4 Nov 2003 11:08:31 +0000
> > 
> > > On Wed, 29 Oct 2003 08:36:01 -0500 (EST), "Peter F. Patel-Schneider" <pfps@research.bell-labs.com> wrote:
> 
> <snip/>
> 
> > Section 6 of RDF/XML Syntax Specification Revised says that the grammar
> > actions, ``[t]aken together [...] define a transformation from any
> > syntactically well-formed RDF/XML into an RDF graph represented in the
> > N-Triples language''.  It is the ``represented in ...'' phrase that causes
> > the problems, because it does bring in issues related to the character
> > encoding in N-Triples documents.  If the reference to N-Triples was
> > removed, then this problem would be eliminated.
> 
> I don't see how "represented" has any overloaded meaning here. It is hard
> to produce a mapping that is machine testable without writing down, or
> representing, the output stage of the mapping in a syntax.  The rules
> for writing down the N-Triples obviously have to be followed since
> that's the form we chose, but didn't mandate you implement.

I don't follow this line of reasoning.

To me, the document is very clear in stating how the N-Triples document is
generated, and this generation has no place for the N-Triples Unicode
encoding to happen.  There are several ways of indicating that this should
happen or eliminating the issue, some of which I have outlined, but without
such I still view the mapping from RDF/XML to N-Triples as flawed.

> > > > Section 6.1.8 of RDF/XML Syntax Specification Revised states that ``[t]he
> > > > double-quoted literal-value accessor value [of a plain literal event] must
> > > > use the N-Triples escapes for strings ...''.  Again, this statement, along
> > > > with the way that these events are created seems to indicate that URI
> > > > references in RDF/XML documents must use the N-Triple character encoding
> > > > for Unicode, not any of the more usual encodings, such as UTF-8.
> > > 
> > > Again, not in creation and there are no content encoding issues
> > > involved. Only Unicode strings (from the XML infoset items).  N-Triples
> > > is an output form only, in order to describe the test cases and grammar
> > > and not required to implement.
> > 
> > There are definitely content encoding issues involved.  The grammar actions
> > are supposed to emit N-Triples, which brings content encoding issues to the
> > fore.  If grammar actions were of the form
> 
> The grammar actions emit RDF triples written down in N-Triples.
> 
> > 	Add a triple with subject ..., predicate ..., and object ... to the
> > 	graph. 
> > 
> > instead of
> > 
> > 	... the following statement is added to the graph:
> > 	... ... ... .
> 
> I don't see that as needing changing (unless replacing statement with
> triple, editorial). The context is clear and the document has already
> explained the relationship between the XML syntax, triples, the graph
> and N-Triples and made several references to it for each event
> definition.

Yes, and in each case as I read it the raw Unicode string is dumped into an
N-Triples document with no chance for using the N-Triples encoding.
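
As a rough illustration of the alternative form of grammar action quoted
above ("Add a triple with subject ..., predicate ..., and object ... to the
graph"), the graph can be modelled as a set of triples whose terms hold raw
Unicode strings, with no serialization syntax involved.  This is only a
sketch: the names uri, plain_literal, add_triple and graph, the tagged-tuple
node representation, and the example URIs are all assumptions made for
illustration and do not come from any of the RDF documents.

```python
from typing import Set, Tuple

# Hypothetical node representation: a tagged tuple per node kind.
Node = Tuple[str, ...]
Triple = Tuple[Node, Node, Node]


def uri(ref: str) -> Node:
    """An RDF URI reference, held as a plain Unicode string."""
    return ("uri", ref)


def plain_literal(value: str, language: str = "") -> Node:
    """A plain literal, held as a plain Unicode string plus a language tag."""
    return ("literal", value, language)


def add_triple(graph: Set[Triple], s: Node, p: Node, o: Node) -> None:
    """Add a triple to the abstract graph; no N-Triples syntax is involved."""
    graph.add((s, p, o))


graph: Set[Triple] = set()
add_triple(
    graph,
    uri("http://example.org/doc"),
    uri("http://purl.org/dc/elements/1.1/title"),
    plain_literal("caf\u00e9", "fr"),  # the literal keeps its raw Unicode form
)
```

In this form the question of N-Triples escapes only arises if and when the
graph is later written out as an N-Triples document.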

> > > > Similar problems occur with Attribute Events.
> > > > 
> > > > Similar problems occur with Typed Literal Events and Plain Literal Events,
> > > > indicating that typed literals and plain literals must be written in
> > > > RDF/XML documents using the N-Triple character encoding for Unicode.
> > > 
> > > I don't follow how you conclude there are problems in any of these sections.
> > > 
> > > Taking URI reference events as an example.   These are constructed from
> > > a string value (a Unicode string) used as an RDF reference, the definition
> > > of which and limitations on the characters allowed are all defined in
> > > RDF Concepts, linked when that event is first defined.
> > 
> > Agreed.
> > 
> > > When those events are written out as N-Triples, they clearly have to
> > > conform to the N-Triples syntax rules, but that is solely a way to write
> > > the Unicode string in N-Triples, it does not limit in any way the range
> > > of characters in an RDF URI reference.  RDF Concepts defines that, and
> > > RDF Concepts does not depend on N-Triples.
> > 
> > I agree that they have to conform to the N-Triples syntax rules, and this
> > is the problem that I see.  The grammar actions directly place Unicode
> > strings, for example Unicode strings that are part of Plain Literal Events,
> > into the N-Triples document, without any possibility of encoding.  This
> > means that this string must be in the form required by N-Triples, which is
> > the problem that I have seen.
> 
> The document mentions that the string-value must use the N-Triples encoding
> so this point is already covered in
>   http://www.w3.org/TR/2003/WD-rdf-syntax-grammar-20031010/#section-literal-node
> The Unicode strings (sequences of Unicode characters) are not directly
> put into the N-Triples document but using N-Triples encoding, which is
> linked directly at the URL above and all other events with string-value.

I disagree.  My reading of these sections is indeed that the 
accessor uses the N-Triples escapes, but all that this ends up doing is
pushing these escapes back to the input document, as assignment to these
accessors is done directly from accessors of the RDF/XML document events,
which are in Unicode.
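
A minimal sketch of what an explicit encoding step could look like, if one
were inserted between the Unicode value taken from the RDF/XML event and the
double-quoted string-value.  It assumes the escape table in RDF Test Cases
section 3.2 (backslash, double quote, \n, \r, \t, \uXXXX and \UXXXXXXXX
escapes, with printable ASCII passed through).  The function names
ntriples_escape and plain_literal_string_value are hypothetical and are not
taken from the specification.

```python
def ntriples_escape(text: str) -> str:
    """Encode a Unicode string using N-Triples string escapes (assumed table)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif ch == "\t":
            out.append("\\t")
        elif 0x20 <= cp <= 0x7E:
            # Printable ASCII is written as-is.
            out.append(ch)
        elif cp <= 0xFFFF:
            # Remaining control characters and the rest of the BMP.
            out.append("\\u%04X" % cp)
        else:
            # Characters outside the BMP.
            out.append("\\U%08X" % cp)
    return "".join(out)


def plain_literal_string_value(literal_value: str, language: str = "") -> str:
    """Build a double-quoted string-value with the encoding step made explicit."""
    quoted = '"' + ntriples_escape(literal_value) + '"'
    return quoted + ("@" + language if language else "")
```

With such a step, the literal-value accessor can hold the raw Unicode string
from the input document, and only the string-value is in N-Triples form.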

> The table it points to in RDF Test Cases section 3.2 was added in the
> last version after a previous comment and suggestion from you, providing
> a straightforward description of the Unicode character to N-Triples encoding.
> 
> > > Similarly for the other events.  The RDF Concepts terms when written in
> > > N-Triples do not limit the alphabets of the terms.
> > > 
> > > > I suggest that the wording in question should be changed to something like:
> > > > 
> > > > 	... encodes the same Unicode character string as ... but using the
> > > > 	string encoding in N-Triples ...
> > > 
> > > At present I think I don't understand your problem.  
> > 
> > The problem is that there is no place in the grammar actions for the
> > encoding used by N-Triples.  In the absence of this transformation, the
> > character encoding used by N-Triples is pushed back into the RDF/XML document.
> 
> I still don't see a problem.  Only if you are writing N-Triples (and
> this is optional, as section 6 introduction describes) then you need to
> consider the N-Triples encoding; otherwise you can generate the triples
> inside your application without dealing with such details.  The RDF/XML
> WD defines a mapping where the output triples are encoded in N-Triples. 
> It does not mandate that you implement the mapping to N-Triples, or use
> N-Triples encodings:
> 
>   "The model given here illustrates one way to create a representation of
>   an RDF Graph from an RDF/XML document. It does not mandate any
>   implementation method -- any other method that results in a
>   representation of the same RDF Graph may be used"
>   -- http://www.w3.org/TR/2003/WD-rdf-syntax-grammar-20031010/#section-Data-Model
> 
> but I'm sure you are familiar with that.

Well, yes, but the document actually uses throughout a method that ends up
with an N-Triples document (as evidenced by the "." at the end of the
grammar actions), not any other representation of the RDF graph.  So the
net effect of allowing any other method of generating the same RDF graph
is only to allow any other method of generating the same, wrong RDF graph.

> In terms of implementations, as far as I'm aware, this is what most of
> them do: they do not write N-Triples; the output of the mapping is
> always some other result form (typically a software object).
> 
> > > I'm also not sure where you are proposing wording change; I can't see
> > > that in any of the sections you mention.  Do you mean the abstract? I
> > > would think that isn't required to give the fine detail of the document,
> > > which this might be.
> > 
> > I meant the various bits of the document that I quoted.
> > 
> > > Dave
> > 
> > On further reflection, it would be better to change the grammar actions as
> > shown above.  This might be too big of a change at this stage, so I would
> > be satisfied with changes to the various bits of Section 6 having to do
> > with string-value accessors.
> 
> Those sections already tell you to use the N-Triples encoding.
> Take URI Reference Event, for example.  It says:
> [[
> string-value
> 
>     The value is the concatenation of "<", the value of the identifier accessor and ">"
> 
>     The <>-quoted identifier accessor value must use the N-Triples
>     escapes for URI references as described in 3.3 URI References. 
> ]] -- http://www.w3.org/TR/2003/WD-rdf-syntax-grammar-20031010/#section-identifier-node
> 
> Which tells you how to turn the identifier accessor value into an
> encoded N-Triples URI reference for output purposes.  There is no direct
> copying of Unicode strings into N-Triples without encoding. The other
> events have similar words and links.

I don't see any place for performing an encoding step here.  All I see is
the requirement that the value be in the N-Triples encoding.  Combined with
the direct assignment to such values, which to me also does not allow for an
encoding step, this results in the problem I see.
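
The two readings of the URI Reference Event string-value can be contrasted
in a short sketch.  Under the first reading the identifier is concatenated
as-is, so the result is only valid N-Triples if the identifier was already
escaped in the input; under the second an encoding step is applied first.
The function names are hypothetical, the example URI is made up, and
escape_non_ascii is a deliberate simplification that assumes the \uXXXX and
\UXXXXXXXX escapes apply to URI references and handles only non-ASCII
characters.

```python
def escape_non_ascii(text: str) -> str:
    """Simplified escaping: rewrite only non-ASCII characters (assumption)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


def string_value_direct(identifier: str) -> str:
    """Reading 1: concatenate "<", the identifier and ">" with no encoding."""
    return "<" + identifier + ">"


def string_value_with_encoding(identifier: str) -> str:
    """Reading 2: encode the identifier first, then concatenate."""
    return "<" + escape_non_ascii(identifier) + ">"


identifier = "http://example.org/caf\u00e9"  # raw Unicode, as from the infoset
direct = string_value_direct(identifier)
encoded = string_value_with_encoding(identifier)
```

Here direct still contains the raw non-ASCII character, while encoded
contains only ASCII characters.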

> At this point, the only change I see here is an editorial one to change
> 'statement' to 'triple' in the grammar action descriptions which would
> probably be more accurate.  
> 
> It could be "the following triple encoded in N-Triples is added to the
> RDF graph" but that's a mouthful and already covered by the earlier
> definition of the actions:
> 
> "The grammar action may include generating new triples to the graph, written in N-Triples format."
> -- http://www.w3.org/TR/2003/WD-rdf-syntax-grammar-20031010/#section-Infoset-Grammar-Notation

I don't see that this encoding is covered by this definition at all.

> Dave

Peter F. Patel-Schneider
Bell Labs Research
Received on Tuesday, 4 November 2003 10:20:12 UTC