- From: Eric Prud'hommeaux <eric@w3.org>
- Date: Fri, 21 Apr 2006 12:51:57 -0400
- To: Dirk-Willem van Gulik <dirkx@webweaving.org>
- Cc: public-rdf-dawg@w3.org
- Message-ID: <20060421165157.GI26709@w3.org>
On Thu, Mar 09, 2006 at 01:43:49AM -0800, Dirk-Willem van Gulik wrote:
> > > .. always UTF8 ...
>
> > Unicode code points may also be expressed using an \uXXXX (U+0 to
> > U+FFFF) or \UXXXXXXXX syntax (for U+10000 onwards) where X is a
> > hexadecimal digit [0-9A-F]
>
> I assume that what is ment here is the use of 7bit safe chars to express
> unicode code points. This begs the question:
>
> -> can this be mixed with true utf8 in the same payload.
>
> -> my advise would be NOT to allow this; think cross
>    site scripting for an example of the pain you may get
>    into at some point in the future.

I think it is safer to allow than not to allow. Not allowing it would
mean striking "always UTF8" and including text to say that if one
happened to have encoded one's entire query in 7 bits, it may no
longer be in the same payload as other Unicode (including ASCII?).

There currently is no specification for 7-bit SPARQL. One could escape
a query to the point that there are no wide characters over the wire,
but it is still UTF-8. If you then send it over a 7-bit wire, we don't
specify how to do so.

I'm afraid I don't see the mechanics of how this enables cross-site
scripting (any more than any other wide-char format). Was this a hunch,
or a worked-out screw case?

> -> Is there 'escaping' for the \u and \U sequence itself ?
>
>    And if there is - can this be mixed in utf8 ? And if not
>    - how does one know for a fact what mode one is ?

I believe the current text implies one level of escaping. I believe
that, given an application/sparql-query media type
[[
ASK { ?s ?p "\u005Cu0041" }
]]
the interpretation of the object is '\\'+'u'+'0'+'0'+'4'+'1'. The query
[[
ASK { ?s ?p "\u0041" }
]]
is valid application/sparql-query, but is not the same query, and no
intermediate processor should treat it as such.

> Or on other words:
>
> -> If you really want this - better define it narrower

I think we really do want escaping. It will make life better for a lot
of folks who can't edit Chinese and Russian and ... directly in their
editor. Do you have some specific text in mind?

In general, I think escaping requires implementation precision and
that there's no way to avoid it. One could rule \\ out of quoting, but
it would not keep one from constructing things that looked like
non-equivalent application/sparql-query. For instance,
  ASK { ?query foo:serialization "ASK { ?s ?p \"\u005Cu0041" }\" }" }
or
  ASK { ?query foo:serialization "ASK { ?s ?p \u0022\u005Cu0041" }\u0022 }" }

> OR
>
> -> Drop it altogether.
>
> As to give strict parsers in hostile environments a chance.

Did ietf-types ever see this? I'm about to ask for the media type
(through other channels) and want to make sure they don't know of some
dissent that I have noticed.

Do you think this is well-enough specified that we can ask IETF for
our media-type?
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
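
Illustration of the "one level of escaping" reading discussed in this message. This is a minimal sketch in Python, not text from any specification. The function name decode_codepoint_escapes is hypothetical, and accepting lower-case hex digits is an assumption (the quoted text only lists [0-9A-F]). Each \uXXXX or \UXXXXXXXX sequence is replaced exactly once, and the result is never rescanned, so "\u005Cu0041" yields the six characters \u0041 rather than "A".

```python
import re

# Matches \uXXXX or \UXXXXXXXX. Lower-case hex digits are accepted here
# as an assumption; the quoted text only mentions [0-9A-F].
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_codepoint_escapes(text: str) -> str:
    """Replace each codepoint escape once, in a single left-to-right pass.

    re.sub never rescans the text it has just produced, so a decoded
    backslash cannot combine with the characters that follow it to form
    a second escape. Hypothetical helper, for illustration only.
    """
    def _replace(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above U+10FFFF.
        return chr(int(hex_digits, 16))

    return _ESCAPE.sub(_replace, text)


if __name__ == "__main__":
    # Raw strings, so Python itself leaves the backslashes alone.
    escaped_backslash = decode_codepoint_escapes(r"\u005Cu0041")
    plain_escape = decode_codepoint_escapes(r"\u0041")
    print(list(escaped_backslash))  # backslash, 'u', '0', '0', '4', '1'
    print(plain_escape)             # A
    print(escaped_backslash == plain_escape)  # False: not the same literal
```

Under this reading, escapes and literal UTF-8 characters can sit side by side in one payload, because the pass only touches the backslash sequences and copies everything else through unchanged.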
Received on Friday, 21 April 2006 16:52:07 UTC