RE: 2 RDFa SPARQL Test Harness Issues from Seaborne, Andy on 2008-05-18 (public-rdf-in-xhtml-tf@w3.org from May 2008)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Sun, 18 May 2008 17:18:16 +0000
To: Manu Sporny <msporny@digitalbazaar.com>
CC: RDFa mailing list <public-rdf-in-xhtml-tf@w3.org>, Benjamin Nowack <bnowack@semsol.com>, Dave Beckett <dave@dajobe.org>
Message-ID: <38CBA1F6A350B044AF785E63AAC3C6776497E26019@G5W0276.americas.hpqcorp.net>

> -----Original Message-----
> From: Manu Sporny [mailto:msporny@digitalbazaar.com]
> Sent: 18 May 2008 17:11
> To: Seaborne, Andy
> Cc: RDFa mailing list; Benjamin Nowack; Dave Beckett
> Subject: Re: 2 RDFa SPARQL Test Harness Issues
>
> Seaborne, Andy wrote:
> >> We currently have two test cases that use UTF-8 characters (TC#60 and
> >> TC#108). The SPARQL.org and ARC SPARQL engines both die processing
> >> queries containing multi-byte UTF-8 characters:
> >>
> >
> > It starts with "\ufeffASK", i.e. a BOM.
> > ...
> > Remove the BOM and the bomb will not go off.
>
> *sigh* - Thanks Andy - turns out that both SPARQL queries in the RDFa
> Test Suite start off with that BOM... which is why we were seeing those
> Test Cases react in a similar manner.
>
> We could remove it - but it's valid[1][2] UTF-8, isn't it? Technically,
> we should be able to feed that to SPARQL and the engine should deal with
> it, right?

I am not an expert on Unicode, but not by my reading of the Unicode FAQ: here the BOM is in the middle of the URL string.  Hence, just placing the contents of the file, %-encoded with the BOM, into the query is not right here.

http://unicode.org/faq/utf_bom.html#28

"""
Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format ...
"""

Even treating it (specially) as a zero width non-breaking space, as mentioned in the FAQ, does not work, because a zero width non-breaking space is not whitespace (space, tab, carriage return, linefeed) in SPARQL: it neither separates tokens nor is ignored as whitespace usually is.
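
As a rough illustration of that point (a simplified sketch, not how any real SPARQL engine tokenizes): the whitespace set below follows the WS production in the SPARQL grammar, while first_token is a hypothetical helper made up for this example. U+FEFF is not in the set, so it sticks to the keyword that follows it.

```python
# Whitespace per the SPARQL grammar's WS production: space, tab, CR, LF.
SPARQL_WS = {"\x20", "\x09", "\x0d", "\x0a"}


def first_token(query: str) -> str:
    """Return the first run of non-whitespace characters (hypothetical helper)."""
    i = 0
    # Skip leading whitespace, which a parser would ignore.
    while i < len(query) and query[i] in SPARQL_WS:
        i += 1
    j = i
    # Collect characters until the next whitespace character.
    while j < len(query) and query[j] not in SPARQL_WS:
        j += 1
    return query[i:j]


bom_query = "\ufeffASK { ?s ?p ?o }"
clean_query = "  ASK { ?s ?p ?o }"

# U+FEFF is not whitespace, so it becomes part of the first token.
print(repr(first_token(bom_query)))
print(repr(first_token(clean_query)))
```

Leading spaces are skipped in the second query, but the BOM in the first one is not.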

So, to the parser it looks much like "xASK ..." for some character x, and xASK is not legal at this point.
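
On the harness side, one possible way to drop the BOM before the file contents are %-encoded into the query string might look like the sketch below. It assumes the harness has the query file as bytes; encode_query and the sample query are made up for illustration and are not the actual test harness code. Python's "utf-8-sig" codec removes one leading BOM if it is present.

```python
from urllib.parse import quote

# Hypothetical file contents: a UTF-8 BOM followed by a SPARQL ASK query.
raw = b"\xef\xbb\xbfASK { ?s ?p ?o }"


def encode_query(data: bytes, strip_bom: bool) -> str:
    """Percent-encode query bytes for use as a URL query-string value."""
    # "utf-8-sig" drops one leading BOM; plain "utf-8" keeps it as U+FEFF.
    text = data.decode("utf-8-sig" if strip_bom else "utf-8")
    return quote(text, safe="")


with_bom = encode_query(raw, strip_bom=False)
without_bom = encode_query(raw, strip_bom=True)

print(with_bom)     # begins with %EF%BB%BF, the %-encoded BOM
print(without_bom)  # begins with ASK
```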

        Andy

>
> -- manu
>
> [1] http://unicode.org/faq/utf_bom.html#29
> [2] http://www.rfc-editor.org/rfc/rfc3629.txt
>
> --
> Manu Sporny
> President/CEO - Digital Bazaar, Inc.
> blog: DB Launches Medical Record Sales Service with Shepherd Medical
> http://blog.digitalbazaar.com/2008/02/24/health2trade/

Received on Sunday, 18 May 2008 17:19:04 UTC