CSV diversion

* Andy Seaborne <andy.seaborne@epimorphics.com> [2013-05-01 13:05+0100]
> 
> 
> On 01/05/13 12:48, Eric Prud'hommeaux wrote:
> >* Andy Seaborne <andy.seaborne@epimorphics.com> [2013-05-01 10:19+0100]
> >>gedit complains about (but displays) the attachment.
> >>
> >>On 01/05/13 05:52, Eric Prud'hommeaux wrote:
> >>>I've noticed 6 vectors for creating literals with C0 codes
> >>>(including \0):
> >>>   old turtle
> >>>   APIs
> >>>   SPARQL CONSTRUCT
> >>>   SPARQL Update
> >>>   RDBs via Direct Mapping
> >>>   RDBs via R2RML
> >>>(RDB example reproducible with
> >>>   create table test(s text);
> >>>   insert into test (s) values ('a\0b');
> >>>   select s, length(s) from test;
> >>>   +------+-----------+
> >>>   | s    | length(s) |
> >>>   +------+-----------+
> >>>   |      |         1 |
> >>>   | a b  |         3 |
> >>>   +------+-----------+
> >>
> >>? where did the first row come from?
> >
> >MySQL's D-entailment. ˚͜˚
> >My first insert was '\0', but I figured that 'a\0b' would be more
> >illustrative.
> >
> >
> >>>).
> >>>
> >>>These can't be serialized in RDF/XML. Nor can the results of a query
> >>>including this data be serialized in application/sparql-results, e.g.
> >>
> >>application/sparql-results+xml
> >
> >quite right -- tx for the correction.
> >
> >
> >>There is also
> >>
> >>application/sparql-results+json
> >>text/tab-separated-values
> 
> TSV says
> http://www.iana.org/assignments/media-types/text/tab-separated-values
> 
> """
> Required Parameters: Character Set, Encoding Type
> """
> 
> 
> I avoided CSV as it is not a true representation of the data but ...
> 
> >Does text/csv permit *anything* outside of
> >%x20-21 / %x23-2B / %x2D-7E / COMMA / CR / LF / 2DQUOTE ?
> >— http://tools.ietf.org/html/rfc4180#page-4
> 
> RFC 4180 says:
> """
> Common usage of CSV is US-ASCII, but other character sets defined
>       by IANA for the "text" tree may be used in conjunction with the
>       "charset" parameter.
> """
> so UTF-8 is possible.

Given that the grammar permits only a subset of ASCII, it seems that
any ASCII-compatible encoding (JIS, UTF-8) would only express the
ASCII subset. For non-ASCII-compatible encodings (UTF-16, EBCDIC),
there'd be a point to the charset parameter, but it still wouldn't
permit any characters outside ASCII.

Or maybe the interpretation is supposed to be "if you're using a non-
ASCII encoding, make up a new production for TEXTDATA." At any rate,
the path to character range compatibility isn't clear to me.
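
To make the literal reading concrete, here is a rough sketch of what the
RFC 4180 ABNF admits in a field if it is taken at face value: TEXTDATA
plus COMMA, CR, LF and (doubled) DQUOTE inside a quoted field, which
comes to %x20-7E plus CR and LF. The function names are made up for
illustration, and this deliberately ignores the "other character sets"
sentence, since how that sentence interacts with the grammar is the open
question.

```python
import re

# Characters a field may contain under a literal reading of the RFC 4180 ABNF:
# TEXTDATA (%x20-21 / %x23-2B / %x2D-7E) plus, inside a quoted field,
# COMMA (%x2C), CR, LF, and DQUOTE (written doubled, as 2DQUOTE).
# Taken together that is %x20-7E plus CR and LF.
_RFC4180_FIELD = re.compile(r"[\x20-\x7E\r\n]*")


def expressible_in_rfc4180(value):
    """Return True if every character of value has a production in the grammar."""
    return _RFC4180_FIELD.fullmatch(value) is not None


def offending_chars(value):
    """List, as U+XXXX strings, the characters the grammar has no production for."""
    return [f"U+{ord(c):04X}" for c in value if not expressible_in_rfc4180(c)]


if __name__ == "__main__":
    for sample in ['say "hi", ok', "a\0b", "caf\u00e9", "tab\there"]:
        print(repr(sample), expressible_in_rfc4180(sample), offending_chars(sample))
```

Under that reading, 'a\0b' fails on U+0000 and anything outside ASCII
fails as well, whatever the charset parameter says. Note that TAB is
excluded too.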


> >>JSON allows \u0000 - RFC 4627 refers to Unicode 4.0
> >>
> >>
> >>>   SELECT ?icon { ?who <p> ?icon FILTER (regex(?icon, "PNG")) }
> >>>They can, however, be queried in SPARQL:
> >>>   SELECT ?who { ?who <p> ?icon FILTER (regex(?icon, "PNG")) }
> >>>(Technically, useful functions like fn:regex are based on strings, but
> >>>I don't know of implementations which enforce this.)
> >>>
> >>>In theory, existing turtle files like the attached are rendered
> >>>illegal by the post-facto declaration that they are xs:strings.
> >>>In practice, people don't enforce this (noting that these tests
> >>>existed for a while in Turtle with no one failing or crying foul.)
> >>>
> >>
> >

-- 
-ericP

Received on Wednesday, 1 May 2013 12:32:40 UTC