Re: Comments on WD-rdf-testcases-20010912 from Dave Beckett on 2001-09-18 (www-rdf-comments@w3.org from July to September 2001)

From: Dave Beckett <dave.beckett@bristol.ac.uk>
Date: Tue, 18 Sep 2001 10:38:59 +0100
To: Bjoern Hoehrmann <derhoermi@gmx.net>, www-rdf-comments@w3.org
CC: barstow@w3.org
Message-ID: <8359.1000805939@tatooine.ilrt.bris.ac.uk>
You are mostly addressing section 3: N-Triples:
  http://www.w3.org/TR/2001/WD-rdf-testcases-20010912/#ntriples
which I edited, so I will respond.

>>>Bjoern Hoehrmann said:
> Hi,
> 
>    I wonder why the working drafts doesn't reference RFC 2396 for the
> absoluteURI syntax ...

A missing citation I guess.  The section defines a syntax creating a
graph whose meaning is defined in a (still being drafted) RDF model
theory document.  So where tokens like 'subject', 'predicate',
'object', 'uriref' etc. appear, their syntax is defined but their
meaning is left out.

> ...and instead uses a very loose syntax definition with
> incompatible character escape sequences. The [CHARMOD] requires
> specifications to specify that URIs are escaped like
> 
>   http://www.hoehrmann.invalid/~bj%C3%B6rn/
> 
> but the RDF Test Cases WD implies, one should use
> 
>   http://www.hoehrmann.invalid/~bj\uF6rn/
> 
> or 
> 
>   http://www.hoehrmann.invalid/~bj\u00F6rn/


We started with escaping rules taken from Python (which you mention
later), i.e. \-escapes for strings.

CHARMOD says, for character escaping (not URIs):

      * Specifications MUST NOT invent a new escaping mechanism if an
         appropriate one already exists.
      -- http://www.w3.org/TR/charmod/#sec-Escaping

so the \-escaping for strings seemed appropriate.

The choice for URI escaping was either to recommend a second way to
escape characters (such as %xx) or to use the same method.  For
simplicity, the same method was used, but the more familiar %xx might
be a better choice, although it would require a little more code.

Looking again, CHARMOD says, for URIs:

  A W3C specification that defines new syntax for URIs, such as a new
  kind of fragment identifier, MUST specify that characters outside
  the US-ASCII repertoire are encoded in URIs using UTF-8 and
  %HH-escaping
  -- http://www.w3.org/TR/charmod/#sec-URIs

so we have to (MUST) change our URI escaping to match that requirement.
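
As a rough illustration of what that CHARMOD rule asks for, here is a
minimal Python sketch.  The function name percent_escape_non_ascii is
my own invention for this example, and it only touches characters
outside US-ASCII; it does not attempt full RFC 2396 validation.

```python
def percent_escape_non_ascii(uri):
    """Encode each non-ASCII character as UTF-8 bytes, written as %HH."""
    out = []
    for ch in uri:
        if ord(ch) < 0x80:
            # US-ASCII characters are passed through unchanged.
            out.append(ch)
        else:
            # Non-ASCII: one %HH triplet per UTF-8 byte.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


print(percent_escape_non_ascii("http://www.hoehrmann.invalid/~bj\u00f6rn/"))
```

With the URI from your example this should give the ~bj%C3%B6rn/ form
that you quoted.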

Thanks for catching it.


> The specification should clearly state that four characters must follow
> the \u and eight characters the \U. ...

I thought that was what we wrote:
  \uxxxx      Hexadecimal digits xxxx encoding character ...
  \Uxxxxxxxx  Hexadecimal digits xxxxxxxx encoding character ...
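
To make the fixed lengths concrete, here is a small Python sketch of a
decoder that accepts exactly four hex digits after \u and exactly
eight after \U.  The name decode_unicode_escapes is hypothetical, and
the sketch assumes only these two escapes matter; the other
\-escapes of the format are ignored here.

```python
import re

# \u must be followed by exactly 4 hex digits, \U by exactly 8.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_unicode_escapes(text):
    """Replace \\uxxxx and \\Uxxxxxxxx with the characters they encode."""
    def repl(match):
        hex_digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(hex_digits, 16))
    return _ESCAPE.sub(repl, text)
```

Under this reading a short form such as \uF6 does not match and is
left untouched, rather than being decoded.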

> ... I don't see any good reason why \U
> is defined for
> 
>   [[#x10000-#xFFFFFFFF]
> 
> (note the unmatched bracket) instead of ...-#x10FFFF, Unicode doesn't
> define anything above.  ...

True - the latest version of Unicode doesn't, but we were
neutral on that by allowing the full 32-bit range.  I took the
recommendation from http://www.w3.org/TR/charmod/#unicode
and cited Unicode 3.0.

CHARMOD says:
  * The specification MUST NOT arbitrarily restrict the range of
    characters that can be used, which must cover all Unicode code
    points from 0 to 0x10FFFF inclusive.
  -- http://www.w3.org/TR/charmod/#sec-RefProcModel

which is the range that must be allowed.  It doesn't say we should
exclude code points beyond that range.  We can change it.

> The \U should IMO only require six hex digits
> instead of eight, otherwise authors have always to specify two
> superflous zero digits. I would recommend a more perlish approach for \u
> and \U in general, i.e. use \u{ <one to six hex digits> } in place of
> them.

The \-escapes come from deployed Python code reading this format.
Python happens to use fixed lengths for the escapes
  http://www.python.org/doc/current/ref/strings.html
and allows encoding Unicode characters with the full 32 bits, so we
kept that.

It is easier to have a fixed size field, since this is meant to be a
simple format; which is why we require absolute URIs, line-by-line
handling and other simple structures.  Furthermore, it should be
useful to retain the chance to expand this field later to encode the
32 bits if Unicode grew to require that (and there is plenty of
growth there).
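
The writing side of a fixed-size field is equally short.  A sketch in
Python, assuming a hypothetical helper encode_unicode_escapes that
emits only the numeric escapes (the short forms such as \t or \\ are
left out, so a backslash is written as \u005C here):

```python
def encode_unicode_escapes(text):
    """Write non-printable-ASCII characters as \\uxxxx or \\Uxxxxxxxx."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7E and ch != "\\":
            # Printable US-ASCII (except backslash) is written as-is.
            out.append(ch)
        elif cp <= 0xFFFF:
            # Always four hex digits.
            out.append("\\u{:04X}".format(cp))
        else:
            # Always eight hex digits, leading zeros included.
            out.append("\\U{:08X}".format(cp))
    return "".join(out)
```

A reader never has to look ahead for a closing delimiter: after \u it
consumes four characters, after \U eight.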

> I _really_ wonder why #20, #3C and #3E should be additionally allowed
> for absoluteURIs. They have to be URI-escaped, the WD implies I should
> use
> 
>   http://www.example.org/test\u0020case/
> 
> instead of 
> 
>   http://www.example.org/test%20case/
> 
> That's IMO pure nonsense.

It is not nonsense - it has a meaning - but as I say above, given the
requirements of CHARMOD in this area, I expect we will change to the
second example.  Those particular characters are escaped since they
were used in the syntax: 
  uriref ::= '<' absoluteURI '>'
  -- http://www.w3.org/TR/2001/WD-rdf-testcases-20010912/#uriref
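
A small Python sketch of why those three characters matter to a
simple reader.  The name parse_uriref and the regular expression are
assumptions for illustration, not the grammar in the draft: the point
is only that a raw space, '<' or '>' inside the URI would collide with
the delimiters and the whitespace that separates tokens.

```python
import re

# '<', then one or more characters that are not '<', '>' or space, then '>'.
_URIREF = re.compile(r"<([^<> ]+)>")


def parse_uriref(token):
    """Return the URI inside '<' ... '>', or raise ValueError."""
    match = _URIREF.fullmatch(token)
    if match is None:
        raise ValueError("not a uriref: {!r}".format(token))
    return match.group(1)
```

With this reader, <http://www.example.org/test%20case/> is accepted,
while the same URI written with a raw space is rejected.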


> The reference to Python string literals should be removed, I don't care
> about Python string literals and they are of no relevance here.

They were explanatory and probably should be removed, but I had to
use the same references above to explain to you the reason for
choosing this string escaping method and some other choices.

> I don't see no need for the trailing '.' character required for each
> n-triple line.

This is for compatibility with the existing N3 format
  http://www.w3.org/DesignIssues/Notation3 
which remains useful to retain.  It is possible we might want to take
on more syntax from that other format and hence be able to use its
tools.  We are unlikely to change this.

This N-Triples format is meant to be a simple, complete format for
encoding RDF graphs, compatible with existing tools, and is proving
very useful in our work.

Thanks for your feedback.

Dave
Received on Tuesday, 18 September 2001 05:39:05 UTC