- From: Dave Beckett <dave.beckett@bristol.ac.uk>
- Date: Wed, 25 Jul 2001 12:22:57 +0100
- To: w3c-rdfcore-wg@w3.org
N-Triples V1.6
http://www.w3.org/2001/sw/RDFCore/ntriples/
While completing the references I came across
Character Model for the World Wide Web, W3C Working Draft, 26 January 2001
http://www.w3.org/TR/charmod/
which is [[a common reference for interoperable text manipulation on
the World Wide Web. ]] and is heading towards W3C Recommendation.
The relevant sections are:
Reference Processing Model
http://www.w3.org/TR/charmod/#sec-RefProcModel
and if we want to match that model, we cannot define N-Triples in
terms of US-ASCII. It recommends picking one character encoding and says:
[[ If compatibility with ASCII is desired, UTF-8 (see [RFC 2279]) is
RECOMMENDED ]]
So I propose we use UTF-8.
(Aside: Python, which is where this started, remains a 7-bit ASCII
format that allows 8-bit characters but does not define a standard
encoding for them.)
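To illustrate why UTF-8 keeps compatibility with ASCII, here is a minimal Python sketch. The sample triple and the helper name `same_bytes_in_ascii_and_utf8` are made up for illustration and are not part of the N-Triples document.

```python
def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    """Return True if text is pure ASCII and so encodes to identical bytes in UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains a character outside US-ASCII.
        return False


# A hypothetical N-Triples line that uses only ASCII characters.
ascii_line = '<http://example.org/a> <http://example.org/b> "plain" .'

# The same kind of line with one non-ASCII character (U+00E9).
non_ascii_line = '<http://example.org/a> <http://example.org/b> "caf\u00e9" .'

# In UTF-8 the non-ASCII character becomes two bytes, each with the high bit set,
# so it can never be confused with an ASCII byte.
encoded_e_acute = "\u00e9".encode("utf-8")
```

A file written as US-ASCII is therefore already valid UTF-8, so existing ASCII-only N-Triples data would not need to change under this proposal.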
In section
Character Escaping
http://www.w3.org/TR/charmod/#sec-Escaping
it says:
[[Certain guidelines apply to the way specifications define
character escapes. These guidelines MUST be followed when designing
new W3C protocols and formats and SHOULD be followed as much as
possible when revising existing protocols and formats.
* Specifications MUST NOT invent a new escaping mechanism if an
appropriate one already exists.
]]
OK, we are designing a new format, and n3 already uses \ escapes, following Python.
[[* There SHOULD be only one way to escape a character. [A
well-known counter-example is that for historical reasons, both
HTML and XML have redundant decimal (&#ddddd;) and hexadecimal
(&#xhhhh;) escapes.]
]]
One way to escape a character. Check.
[[* Explicit end delimiters MUST be provided. Escapes such as
\uABCD where the end delimiter is a space or any character other
than [01-9A-F] SHOULD be avoided: it is not clear visually, and it
can cause an editor to insert spurious line-breaks when
word-wrapping on spaces. Forms like SPREAD's &UABCD; [SPREAD] or
XML's &#xhhhh;, where the escape is explicitly terminated by a
semicolon, are much better. Escaped characters SHOULD be
acceptable wherever unescaped characters are. In particular, they
SHOULD be acceptable in identifiers and comments.
]]
Oh dear; Python-style escapes such as \uABCD are named as something
that SHOULD be avoided. This is only a recommendation, though.
So I propose we provide one way to escape:
'\u' [A-Fa-f0-9]{1,8} ';'
which generates the appropriate Unicode code point from 1-8 hex digits.
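As a rough sketch of how the proposed escape could be processed, the following Python uses a regular expression that mirrors the production above. The names `ESCAPE`, `unescape` and `escape` are hypothetical, the rejection of values above U+10FFFF is my assumption, and the handling of a literal backslash is left out because the proposal does not cover it.

```python
import re

# Mirrors the proposed production: '\u' [A-Fa-f0-9]{1,8} ';'
ESCAPE = re.compile(r"\\u([A-Fa-f0-9]{1,8});")

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def unescape(text: str) -> str:
    """Replace each \\uH...; escape with the character it names."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        if code_point > MAX_CODE_POINT:
            # Assumption: out-of-range values are errors; the proposal is silent here.
            raise ValueError(f"not a Unicode code point: {match.group(0)}")
        return chr(code_point)

    return ESCAPE.sub(replace, text)


def escape(text: str) -> str:
    """Escape every non-ASCII character, using the shortest uppercase hex form."""
    return "".join(
        ch if ord(ch) < 0x80 else "\\u{:X};".format(ord(ch)) for ch in text
    )
```

With this sketch, `unescape(r"caf\uE9;")` yields the four-character string ending in U+00E9, and `escape` produces the reverse mapping. Because 1-8 digits are allowed, leading zeros give several spellings of the same character, so a writer would need a fixed rule (here, the shortest form) to keep to one way of escaping.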
I further propose modifying the EBNF to allow whatever we choose in
URIs (http://www.w3.org/2001/sw/RDFCore/ntriples/#absoluteURI), following
the comments in section
Character Encoding in URI References
http://www.w3.org/TR/charmod/#sec-URIs
I think we can leave _:name
(http://www.w3.org/2001/sw/RDFCore/ntriples/#name) alone for now.
I have updated the document to record these proposals.
Dave
Received on Wednesday, 25 July 2001 07:22:58 UTC