Re: N-triples (1.4) from Graham Klyne on 2001-07-18 (w3c-rdfcore-wg@w3.org from July 2001)

From: Graham Klyne <GK@NineByNine.org>
Date: Wed, 18 Jul 2001 21:49:06 +0100
To: Dave Beckett <dave.beckett@bristol.ac.uk>
Cc: RDF core WG <w3c-rdfcore-wg@w3.org>
Message-Id: <5.1.0.14.2.20010718210911.044b3840@joy.songbird.com>

Hi Dave,

At 08:26 PM 7/18/01 +0100, Dave Beckett wrote:
> > 3. Outstanding issue 1
> >
> > N-Triples is a text/plain MIME type format - consider character and
> > encoding issues with requirement to be able to express all Unicode chars.
> >
> > That much is easy, I think.  Use:
> >    Content-type: text/plain;charset=utf-8
>
>That out-of-band information cannot be picked up by the parser just
>reading the bytes.  I would prefer the format to be self-contained if
>possible, not depending on charset, so all Unicode chars can be
>handled inside US-ASCII.  Asking N-Triples parsers to have to add an
>entire UTF-8 decoding step seems rather a large step, when \u
>etc. below could do the work when required.

OK, I accept the "out of band" point.

Assuming that the only place where non-ASCII characters can appear is in
string literals, that might work.  Then I think we'd also need to
require that non-US-ASCII characters always be escaped in string
literals, so that the higher Unicode code points don't appear anywhere
in the N-triples source.

But then there's the internationalized URIs and domain name work that's 
waiting in the wings, so I don't suppose that approach would last for ever.

If you want to stick with just US-ASCII in an N-triples file then I won't 
fight it, but my own feeling is that it would be easier to just 
say:  always use UTF-8 encoding.  That seems fairly future-proof.
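
A minimal sketch of the two options, for comparison.  The function names
are hypothetical and nothing here comes from the N-Triples draft itself.
Since US-ASCII is a subset of UTF-8, a UTF-8 reader accepts every
ASCII-only file unchanged, while an ASCII-only reader rejects any
unescaped non-ASCII byte:

```python
def read_ascii_only(data):
    """Decode bytes as US-ASCII (hypothetical helper).

    Raises UnicodeDecodeError if any byte is >= 0x80, i.e. if a
    non-ASCII character was left unescaped in the file.
    """
    return data.decode("ascii")


def read_utf8(data):
    """Decode bytes as UTF-8 (hypothetical helper).

    Accepts every pure US-ASCII file as well, because ASCII bytes
    have the same meaning in UTF-8.
    """
    return data.decode("utf-8")
```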

> > 4. Outstanding issue 2
> >
> > Consider adding \#xHEX escaping to allow N-Triples to encode Unicode
> > characters in text/plain.  Or after Python: \uxxxx and \Uxxxxxxxx
> >
> > I suppose that, for completeness, we must.  Escaping is a topic that can be
> > far trickier than it seems it should be for such a "simple"
> > purpose.  Following your Python reference, I see they've tightened up the
> > Python spec.  I think these are probably the right choices for us:
> >    \uxxxx      Character with 16-bit hex value xxxx (Unicode only)
> >    \Uxxxxxxxx  Character with 32-bit hex value xxxxxxxx (Unicode only)
>
>I would add notes such that there would only be 1 way to encode any
>character, so even escaped literals could be compared as
>strings. i.e.
>   \uxxxx for (mutter some chars below #0020) & chars #007f-#ffff
>   \UXXXXXXXX for chars #00010000 to #ffffffff
> >
> > One might also consider allowing:
> >    \xhh        ASCII character with hex value hh
> > for 'hh' in the range '00' to '7F' (i.e. where Unicode code points match
> > US-ASCII).
>
>Is this one escape too many?  If we do add it, I would prevent \u
>from handling the range this covers.

In view of what you say above, I agree.  Forget I suggested that!

>How about just one escape \UXXXXXXXX for all chars not made available
>by \-escapes or used in-situ - that seems more appealing for this
>little syntax.

Well, that could work too.
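
A minimal sketch of the "only one way to encode any character" rule
with the \uxxxx and \Uxxxxxxxx pair, so that escaped literals can be
compared as plain strings.  The function name, the set of short
backslash escapes, and the choice of upper-case hex digits are all
assumptions for illustration; the thread does not settle them:

```python
def escape_ntriples(text):
    """Escape a string to US-ASCII, one canonical form per character.

    Hypothetical helper: the short escapes below and the upper-case
    hex digits are assumptions, not decisions taken in this thread.
    """
    simple = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in simple:
            out.append(simple[ch])        # short backslash escape
        elif 0x20 <= cp <= 0x7E:
            out.append(ch)                # printable US-ASCII, used as-is
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)    # other chars up to #FFFF
        else:
            out.append("\\U%08X" % cp)    # chars from #00010000 upwards
    return "".join(out)
```

Because every character has exactly one output form, two literals are
equal exactly when their escaped forms are equal.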

> > 5. eoln format
> >
> > Being a MIME text/plain format, the cr in eoln should not be optional:
> >    eoln ::= cr? lf
> > should read:
> >    eoln ::= cr lf
>
>As I commented previously, that change would make all our existing
>examples illegal and this seems awkward to impose on many people
>writing these files.  I would agree if this was a protocol such as
>HTTP, SMTP or telnet where it was never (hardly ever) typed by
>people.
>
>Most web servers will autodetect and deliver ntriples files as
>text/plain whatever line ending we use, for example your RDF Core WG
>minutes of 2001-06-15:
> 
>http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2001Jun/att-0471/01-Minutes-20010615.txt
>is served as text/plain but uses only \n as newlines which strictly
>isn't allowed.

Well, you are right.  I hadn't noticed that.  The original that I sent did 
NOT use bare LFs, so it's getting corrupted in the cycle from mailing list 
to archive and back.

The MIME spec (RFC 2046, section 4.1.1) is very clear that line breaks in a 
text/* type MUST be represented by CR LF, and that bare CR or LF are not 
allowed.

But now I see what is happening... the HTTP designers chose to override 
that aspect of MIME (RFC 2616, section 19.4.2) so now we have two different 
flavours of MIME floating around.  The email flavour requires CRLF, but the 
HTTP flavour allows CRLF, bare CR or bare LF.  Hmph!  (You may notice I 
come from an email orientation.)

I suppose, then, we must go back to allowing CRLF, LF or CR as a line 
break, to be compatible with anything that can be served via HTTP.
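
A minimal sketch of a reader that accepts CRLF, bare CR or bare LF as a
line break.  The names EOLN and split_lines are hypothetical.  CRLF is
listed first in the pattern so that it counts as one break, not two:

```python
import re

# Match CRLF before the single-character forms so that "\r\n"
# is treated as one line break rather than two.
EOLN = re.compile(r"\r\n|\r|\n")


def split_lines(text):
    """Split text into lines on CRLF, bare CR or bare LF."""
    lines = EOLN.split(text)
    # A trailing line break leaves one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```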

>So what to do?
>Invent a suggested ".nt" suffix and MIME type it text/x-ntriples?

That doesn't fix the problem -- it's still subject to the rules of text/* 
content types.  You'd have to define something like 
application/vnd.w3c.ntriples to achieve that effect.

>I don't see that as necessary - let's not worry about it.

I agree.  Because a text file can get munged from whatever form it starts
out in if it's passed about using HTTP and email, I don't think the MIME
type particularly helps us.  Oh well.

#g
--


------------
Graham Klyne
(GK@ACM.ORG)
Received on Wednesday, 18 July 2001 17:00:21 UTC