Re: SPARQL Protocol and Unicode characters

Clarification and notes -- this response was not considered by the DAWG:

On Thu, Feb 03, 2005 at 04:10:58PM +0100, Arjohn Kampman wrote:
> Dear all,
> The SPARQL Protocol as described at [1] suggests that SPARQL queries are 
> going to be sent over the line as simple www-urlencoded strings. I would
> like to point out that we have tried this approach in Sesame and that it
> fails to handle multi-byte characters properly [2]. Main reason for this
> is that the used %xx patterns cannot encode any byte values larger than
> 255.
> In Sesame, we "solved" this issue by switching to multipart/form-data
> encoded POST requests.

I presume you are using the charset parameter
[[ [2388]
   Each part of a multipart/form-data is supposed to have a content-
   type.  In the case where a field element is text, the charset
   parameter for the text indicates the character encoding used.
]]
and that the clients tend to encode the characters in charsets that
the servers tend to understand.

I phrase it this way because I'm looking at the trade-offs between:
  - transaction-specified encoding.
  - transaction-specified encoding with mandatory support for at
    least one common encoding.
  - fixed-encoding (e.g. UTF-8), the only one used by the protocol.
What encodings do your RDQL servers support?
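
To make the trade-off concrete, here is a minimal sketch (Python
standard library only; the sample character and helper names are my
own, not anything from the protocol draft). It shows that the same
character yields different %xx octets under different charsets, so a
receiver has to either be told the charset per transaction or rely on
a fixed one:

```python
from urllib.parse import quote, unquote

text = "é"  # U+00E9, one character chosen purely for illustration

# The %xx escapes carry octets, so the result depends on the charset
# used for the characters-to-octets mapping.
as_utf8 = quote(text, encoding="utf-8")         # two octets
as_latin1 = quote(text, encoding="iso-8859-1")  # one octet


def decode(escaped: str, charset: str) -> str:
    """Undo the %xx escaping, then map octets to characters."""
    return unquote(escaped, encoding=charset, errors="strict")


def decodes_cleanly(escaped: str, charset: str) -> bool:
    """Report whether the octets are valid in the assumed charset."""
    try:
        decode(escaped, charset)
        return True
    except UnicodeDecodeError:
        return False


# A receiver that guesses the wrong charset either gets different
# characters back or fails outright.
misread = decode(as_utf8, "iso-8859-1")
latin1_ok_as_utf8 = decodes_cleanly(as_latin1, "utf-8")
```

With a fixed encoding neither side has to guess; with a
transaction-specified one, the charset has to travel with the request.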

Noting related RFCs ('cause I need to write them down somewhere):

[2045] MIME Part One: Format of Internet Message Bodies:
  transfer encodings interacting with character encodings.
[2046] MIME Part Two: Media Types
  4.1.2.  Charset Parameter
  5.1.  Multipart Media Type
[2388] Returning Values from Forms: multipart/form-data
  4.5 Charset of text in form data

>                        Main drawback of this solution is that we use
> POST-requests all the time, even when GET-requests would be more
> natural.

The DAWG's Use Cases and Requirements [UC&R] has Addressable Query
Results as a design objective. This was motivated by a TAG finding [GET].
"Use GET if: 
      * The interaction is more like a question (i.e., it is a safe
        operation such as a query, read operation, or lookup)."

>          Another option would be to enforce an UTF-8 characters-to-
> octets mapping to the query before adding it as a parameter value.

We could also include the charset in the GET, but I'm hoping that the
simplest approach (which I take to be fixed-encoding) will suffice.
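
A minimal sketch of what the fixed-encoding option could look like on
a GET, assuming UTF-8 as the one encoding. The endpoint URL, the
"query" parameter name and the helper functions are hypothetical,
chosen only for illustration:

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical endpoint, for illustration only.
ENDPOINT = "http://example.org/sparql"


def build_get_url(query: str) -> str:
    """Client side: map characters to UTF-8 octets, then %xx-escape them."""
    return ENDPOINT + "?" + urlencode({"query": query}, encoding="utf-8")


def read_query(query_string: str) -> str:
    """Server side: undo the escaping and decode with the same fixed charset."""
    params = parse_qs(query_string, encoding="utf-8", errors="strict")
    return params["query"][0]


query = 'SELECT ?s WHERE { ?s ?p "日本語" }'
url = build_get_url(query)

# Everything after the first "?" is the query string; the "?" characters
# inside the SPARQL text were escaped as %3F by urlencode.
decoded = read_query(url.split("?", 1)[1])
```

Each multi-byte character becomes several %xx escapes, one per UTF-8
octet, so no single escape ever has to carry a value above 255 and the
URL itself stays plain ASCII.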

> Hope you can use this feedback to improve the protocol.
> Regards,
> Arjohn Kampman
> [1]
> [2]

office: +81.466.49.1170 W3C, Keio Research Institute at SFC,
                        Shonan Fujisawa Campus, Keio University,
                        5322 Endo, Fujisawa, Kanagawa 252-8520
        +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
cell:   +81.90.6533.3882

Feel free to forward this message to any list for any purpose other than
email address distribution.

Received on Monday, 14 March 2005 03:18:07 UTC