N-Triples V1.6 Character Encoding and Escaping from Dave Beckett on 2001-07-25 (w3c-rdfcore-wg@w3.org from July 2001)

From: Dave Beckett <dave.beckett@bristol.ac.uk>
Date: Wed, 25 Jul 2001 12:22:57 +0100
To: w3c-rdfcore-wg@w3.org
Message-ID: <12227.996060177@tatooine.ilrt.bris.ac.uk>

N-Triples V1.6
http://www.w3.org/2001/sw/RDFCore/ntriples/


While completing the references I came across

  Character Model for the World Wide Web, W3C Working Draft, 26 January 2001
  http://www.w3.org/TR/charmod/

which is [[a common reference for interoperable text manipulation on
the World Wide Web. ]] and heading into W3C Recommendation.

The relevant sections are:

  Reference Processing Model
  http://www.w3.org/TR/charmod/#sec-RefProcModel

and if we want to match that model, we cannot define N-Triples in
terms of US-ASCII.  It recommends picking one character encoding and 
[[ If compatibility with ASCII is desired, UTF-8 (see [RFC 2279]) is
RECOMMENDED ]]

So I propose we use UTF-8.

(Aside: Python which is where this started, remains a 7-bit ASCII
format with 8-bit for characters and does not define a standard
encoding for 8-bit characters.)


In section
  Character Escaping
  http://www.w3.org/TR/charmod/#sec-Escaping

it says:

  [[Certain guidelines apply to the way specifications define
  character escapes. These guidelines MUST be followed when designing
  new W3C protocols and formats and SHOULD be followed as much as
  possible when revising existing protocols and formats.

    * Specifications MUST NOT invent a new escaping mechanism if an
      appropriate one already exists.
  ]]

Ok, we are designing a new format and n3 uses \ after Python.

  [[* There SHOULD be only one way to escape a character. [A
    well-known counter-example is that for historical reasons, both
    HTML and XML have redundant decimal (&#ddddd;) and hexadecimal
    (&#xhhhh;) escapes.]
  ]]

One way to escape a character.  Check.

   [[* Explicit end delimiters MUST be provided. Escapes such as
   \uABCD where the end delimiter is a space or any character other
   than [01-9A-F] SHOULD be avoided: it is not clear visually, and it
   can cause an editor to insert spurious line-breaks when
   word-wrapping on spaces. Forms like SPREAD's &UABCD; [SPREAD] or
   XML's &#xhhhh;, where the escape is explicitly terminated by a
   semicolon, are much better. Escaped characters SHOULD be
   acceptable wherever unescaped characters are. In particular, they
   SHOULD be acceptable in identifiers and comments.
   ]]

Oh dear; the python style things \uABCD are mentioned as should be
avoided.  This is only a recommendation though.

So I propose we provide one way to escape:
   '\u' [A-Fa-f0-9]{1,8} ';'
which generates the appropriate Unicode code point from 1-8 hex digits.

I further propose to modify the EBNF to allow whatever we choose in
URIs (http://www.w3.org/2001/sw/RDFCore/ntriples/#absoluteURI) after
the comments in section
  Character Encoding in URI References
  http://www.w3.org/TR/charmod/#sec-URIs

I think we can leave _:name
(http://www.w3.org/2001/sw/RDFCore/ntriples/#name) alone for now.


I have updated the document to record these proposals.

Dave

Received on Wednesday, 25 July 2001 07:22:58 UTC