IRI Whitespace? from Nathan on 2011-04-05 (public-rdf-wg@w3.org from April 2011)

From: Nathan <nathan@webr3.org>
Date: Tue, 05 Apr 2011 01:47:59 +0100
To: RDF WG <public-rdf-wg@w3.org>
CC: RDFA Working Group <public-rdfa-wg@w3.org>
Message-ID: <4D9A66BF.3070405@webr3.org>

Hi All,

I've always heard of an "IRIs can contain whitespace" issue. So I 
thought I'd take a closer look.

From what I can tell, IRI extends the class of unreserved 
characters by adding the characters of the UCS beyond U+007F.

Here's a chart of all the whitespace characters defined in Unicode, and 
whether they need to be percent-encoded or can be included 
as is:

                    ----------------------------------------
                   |  U+0009 \t
                   |  U+000A \n
                   |  U+000B \v
     % encoded --> |  U+000C \f
                   |  U+000D \r
                   |  U+0020 SPACE
                   |  U+0085 NEL (NEXT LINE)
                    ----------------------------------------
                   |  U+00A0 NBSP (NO-BREAK SPACE)
                   |  U+1680 OGHAM SPACE MARK
                   |  U+180E MONGOLIAN VOWEL SEPARATOR
                   |  U+2000 EN QUAD
                   |  U+2001 EM QUAD
                   |  U+2002 EN SPACE
allowed in IRI --> |  U+2003 EM SPACE
                   |  U+2004 THREE-PER-EM SPACE
                   |  U+2005 FOUR-PER-EM SPACE
                   |  U+2006 SIX-PER-EM SPACE
                   |  U+2007 FIGURE SPACE
                   |  U+2008 PUNCTUATION SPACE
                   |  U+2009 THIN SPACE
                   |  U+200A HAIR SPACE
                   |  U+2028 LINE SEPARATOR
                   |  U+2029 PARAGRAPH SEPARATOR
                   |  U+202F NARROW NO-BREAK SPACE
                   |  U+205F MEDIUM MATHEMATICAL SPACE
                   |  U+3000 IDEOGRAPHIC SPACE
                    ----------------------------------------

In Turtle, SPARQL, RDFa 1.1 Core (and XML 5th edition) whitespace is 
defined as:

   U+0009 U+000A U+000D U+0020
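
As a rough cross-check of the chart against this four-character set, here is a minimal Python sketch. It assumes the `ucschar` ranges are transcribed correctly from the RFC 3987 ABNF (they start at U+00A0, which is why U+0085 lands in the percent-encoded group), and it hard-codes the whitespace list from the chart rather than querying a Unicode database. The names `UCSCHAR_RANGES`, `UNICODE_WS`, `GRAMMAR_WS` and `allowed_unencoded` are made up for illustration.

```python
# Sketch: which whitespace characters may appear unencoded in an IRI?
# Assumption: these ranges are the ucschar production from the RFC 3987 ABNF.
UCSCHAR_RANGES = [
    (0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
    (0xD0000, 0xDFFFD), (0xE1000, 0xEFFFD),
]

# Whitespace code points copied from the chart (not read from a Unicode database).
UNICODE_WS = [
    0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0x85,
    0xA0, 0x1680, 0x180E,
    *range(0x2000, 0x200B),  # U+2000 EN QUAD through U+200A HAIR SPACE
    0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
]

# Whitespace as defined by Turtle, SPARQL, RDFa 1.1 Core and XML.
GRAMMAR_WS = {0x09, 0x0A, 0x0D, 0x20}


def allowed_unencoded(cp: int) -> bool:
    """True if the code point falls inside one of the ucschar ranges."""
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


must_encode = [cp for cp in UNICODE_WS if not allowed_unencoded(cp)]
allowed_raw = [cp for cp in UNICODE_WS if allowed_unencoded(cp)]

# Grammar whitespace that could also appear literally inside an IRI.
overlap = GRAMMAR_WS & set(allowed_raw)

print("percent-encode:", [f"U+{cp:04X}" for cp in must_encode])
print("allowed as-is: ", [f"U+{cp:04X}" for cp in allowed_raw])
print("overlap with grammar whitespace:", sorted(overlap))
```

Under those assumptions the two groups match the chart, and none of the four grammar whitespace characters falls in the "allowed as-is" group.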

So where's the collision / issue? I'm a little confused now.

Best,

Nathan

Received on Tuesday, 5 April 2011 00:48:49 UTC