Re: SPARQL Protocol and Unicode characters

Clarification and notes -- this response was not considered by the DAWG:

On Thu, Feb 03, 2005 at 04:10:58PM +0100, Arjohn Kampman wrote:
> 
> Dear all,
> 
> The SPARQL Protocol as described at [1] suggests that SPARQL queries are 
> going to be sent over the line as simple www-urlencoded strings. I would
> like to point out that we have tried this approach in Sesame and that it
> fails to handle multi-byte characters properly [2]. Main reason for this
> is that the used %xx patterns cannot encode any byte values larger than
> 255.
> 
> In Sesame, we "solved" this issue by switching to multipart/form-data
> encoded POST requests.

I presume you are using the charset parameter
[[ [2388]
   Each part of a multipart/form-data is supposed to have a content-
   type.  In the case where a field element is text, the charset
   parameter for the text indicates the character encoding used.
]]
and that the clients tend to encode the characters in charsets that
the servers tend to understand.
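
To make the charset parameter concrete, here is a rough sketch of what one
text field of a multipart/form-data body could look like. The function name
form_data_part, the field name "query" and the boundary string are all made
up for illustration; I am not claiming this is what Sesame sends.

```python
def form_data_part(name, text, charset="utf-8", boundary="example-boundary"):
    """Build a one-field multipart/form-data body (illustrative only).

    The field's Content-Type carries the charset parameter, per the
    RFC 2388 passage quoted above. All names here are hypothetical.
    """
    # Encode the field value with the declared character encoding.
    payload = text.encode(charset)
    # Part headers are plain ASCII; a blank line separates them from the value.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{name}"\r\n'
        f"Content-Type: text/plain; charset={charset}\r\n"
        "\r\n"
    ).encode("ascii")
    # The closing delimiter ends the multipart body.
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + payload + tail


body = form_data_part("query", 'SELECT ?x WHERE { ?x ?p "Zoë" }')
```

The server has to read the charset from the part header before it can turn
the octets back into characters, which is where the "tend to understand"
caveat comes in.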

I phrase it this way because I'm looking at the trade-offs between:
  - transaction-specified encoding.
  - transaction-specified encoding with mandatory support for at
    least one common encoding.
  - fixed-encoding (e.g. utf-8), the only one used by the protocol.
What encodings do your RDQL servers support?

noting related RFCs ('cause I need to write it down somewhere):

[2045] MIME Part One: Format of Internet Message Bodies:
  transfer encodings interacting with character encodings.
[2046] MIME Part Two: Media Types
  4.1.2.  Charset Parameter
  5.1.  Multipart Media Type
[2388] Returning Values from Forms: multipart/form-data
  4.5 Charset of text in form data

>                        Main drawback of this solution is that we use
> POST-requests all the time, even when GET-requests would be more
> natural.

The DAWG's Use Cases and Requirements [UC&R] has Addressable Query
Results as a design objective. This was motivated by a TAG finding [GET].
[[
"Use GET if: 
      * The interaction is more like a question (i.e., it is a safe
        operation such as a query, read operation, or lookup)."
]]

>          Another option would be to enforce an UTF-8 characters-to-
> octets mapping to the query before adding it as a parameter value.

We could also include the charset in the GET, but I'm hoping that the
simplest approach (which I take to be fixed-encoding) will suffice.
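
A minimal sketch of the fixed-encoding approach, assuming utf-8: map the
query's characters to octets first, then %xx-escape each octet. The helper
names encode_query_param and decode_query_param are mine, for illustration
only; quote and unquote come from Python's standard urllib.parse module.

```python
from urllib.parse import quote, unquote


def encode_query_param(query: str) -> str:
    """Map characters to UTF-8 octets, then percent-encode every octet."""
    # safe="" so that reserved characters such as "/" are escaped as well.
    return quote(query.encode("utf-8"), safe="")


def decode_query_param(value: str) -> str:
    """Reverse the mapping: %xx escapes to octets, octets to characters."""
    return unquote(value, encoding="utf-8", errors="strict")


# A multi-byte character simply becomes several %xx escapes.
encoded = encode_query_param("é")  # "%C3%A9"
```

Each %xx pattern still carries only one octet, but since both sides agree
on utf-8 in advance, no charset needs to travel with the request.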

> Hope you can use this feedback to improve the protocol.
> 
> Regards,
> 
> Arjohn Kampman
> 
> 
> [1] http://www.w3.org/TR/rdf-sparql-protocol/
> [2] http://www.openrdf.org/issues/secure/ViewIssue.jspa?key=SES-84
[2045] http://www.faqs.org/rfcs/rfc2045.html
[2046] http://www.faqs.org/rfcs/rfc2046.html
[2388] http://www.faqs.org/rfcs/rfc2388.html
[UC&R] http://www.w3.org/TR/2004/WD-rdf-dawg-uc-20041012/
[GET] http://www.w3.org/2001/tag/doc/whenToUseGet.html
-- 
-eric

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
                        JAPAN
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

Received on Monday, 14 March 2005 03:18:07 UTC