- From: <jos.deroo.jd@belgium.agfa.com>
- Date: Wed, 25 Jul 2001 18:01:32 +0100
- To: dave.beckett@bristol.ac.uk
- Cc: w3c-rdfcore-wg@w3.org
Hi Dave, Just an idea, but can't we just simply specify that characters outside the US-ASCII repertoire are always encoded using UTF-8 and %HH-escaping? That would do http://www.w3.org/TR/charmod/#sec-URIs and simplify overall implementation. Anyhow, http://www.w3.org/2001/sw/RDFCore/ntriples/ is nice! -- Jos De Roo, AGFA http://www.agfa.com/w3c/jdroo/ Dave Beckett <dave.beckett@bristol.ac.uk>@w3.org on 2001-07-25 12:22:57 PM Sent by: w3c-rdfcore-wg-request@w3.org To: w3c-rdfcore-wg@w3.org cc: Subject: N-Triples V1.6 Character Encoding and Escaping N-Triples V1.6 http://www.w3.org/2001/sw/RDFCore/ntriples/ While completing the references I came across Character Model for the World Wide Web, W3C Working Draft, 26 January 2001 http://www.w3.org/TR/charmod/ which is [[a common reference for interoperable text manipulation on the World Wide Web. ]] and heading into W3C Recommendation. The relevant sections are: Reference Processing Model http://www.w3.org/TR/charmod/#sec-RefProcModel and if we want to match that model, we cannot define N-Triples in terms of US-ASCII. It recommends picking one character encoding and [[ If compatibility with ASCII is desired, UTF-8 (see [RFC 2279]) is RECOMMENDED ]] So I propose we use UTF-8. (Aside: Python which is where this started, remains a 7-bit ASCII format with 8-bit for characters and does not define a standard encoding for 8-bit characters.) In section Character Escaping http://www.w3.org/TR/charmod/#sec-Escaping it says: [[Certain guidelines apply to the way specifications define character escapes. These guidelines MUST be followed when designing new W3C protocols and formats and SHOULD be followed as much as possible when revising existing protocols and formats. * Specifications MUST NOT invent a new escaping mechanism if an appropriate one already exists. ]] Ok, we are designing a new format and n3 uses \ after Python. [[* There SHOULD be only one way to escape a character. [A well-known counter-example is that for historical reasons, both HTML and XML have redundant decimal (&#ddddd;) and hexadecimal (&#xhhhh;) escapes.] ]] One way to escape a character. Check. [[* Explicit end delimiters MUST be provided. Escapes such as \uABCD where the end delimiter is a space or any character other than [01-9A-F] SHOULD be avoided: it is not clear visually, and it can cause an editor to insert spurious line-breaks when word-wrapping on spaces. Forms like SPREAD's &UABCD; [SPREAD] or XML's &#xhhhh;, where the escape is explicitly terminated by a semicolon, are much better. Escaped characters SHOULD be acceptable wherever unescaped characters are. In particular, they SHOULD be acceptable in identifiers and comments. ]] Oh dear; the python style things \uABCD are mentioned as should be avoided. This is only a recommendation though. So I propose we provide one way to escape: '\u' [A-Fa-f0-9]{1,8} ';' which generates the appropriate Unicode code point from 1-8 hex digits. I further propose to modify the EBNF to allow whatever we choose in URIs (http://www.w3.org/2001/sw/RDFCore/ntriples/#absoluteURI) after the comments in section Character Encoding in URI References http://www.w3.org/TR/charmod/#sec-URIs I think we can leave _:name (http://www.w3.org/2001/sw/RDFCore/ntriples/#name) alone for now. I have updated the document to record these proposals. Dave
Received on Wednesday, 25 July 2001 12:02:15 UTC