Re: [XHR2] overrideMimeType from Maciej Stachowiak on 2007-07-29 (public-webapi@w3.org from July 2007)

From: Maciej Stachowiak <mjs@apple.com>
Date: Sun, 29 Jul 2007 08:26:11 -0700
To: Jonas Sicking <jonas@sicking.cc>
Cc: Web APIs WG <public-webapi@w3.org>
Message-Id: <1FA95221-9B59-4A40-9D03-E81BBC4F0A37@apple.com>

On Jul 28, 2007, at 11:38 PM, Jonas Sicking wrote:

> Maciej Stachowiak wrote:
>> On Jul 27, 2007, at 12:09 PM, Jonas Sicking wrote:
>>>
>>> Anne van Kesteren wrote:
>>>> I've been looking at overrideMimeType implementations in Gecko  
>>>> and WebKit and it seems like they differ a bit. In Gecko it has  
>>>> to be invoked before send(), but in WebKit it would work if you  
>>>> invoke it just before getting responseXML or responseText.  
>>>> Neither implementation seems to do any input checks.
>>>> If you have any opinion on how it should be specified I suppose  
>>>> now would be the time to air your thoughts.
>>>
>>> Of course I prefer the mozilla way :)
>>>
>>> It does seem fairly complicated to allow it to be set after the  
>>> download is finished though. You do have the stream stored  
>>> in .responseBody, but at that point all encoding information has  
>>> been lost. For HTML parsing (which I hope the spec will support in  
>>> the future) there are a pile of rules used to guess the encoding,  
>>> all of which would be useful to use, but can't be used if all you  
>>> have access to is the unencoded responseBody.
>> Why would the encoding information be lost? The only sources of  
>> encoding info are the responseText itself and http headers, both of  
>> which the XMLHttpRequest needs to provide anyway.
>
> ResponseText is not the raw byte stream gotten off the wire, it is  
> already decoded into UTF-16 using whatever algorithm we define for  
> determining the encoding. HTML decoding is a lot more complicated  
> since you have to first guess an encoding, then start to parse the  
> document, but if you find a
>
> <meta http-equiv="Content-Type" content="text/html; charset=?">
>
> Where charset is different from what you guessed, you have to  
> restart from the beginning using the charset defined in the meta tag.
>
> Yes, it would definitely be possible for the implementation to keep  
> around the raw byte stream and either lazily decode responseText, or  
> keep both the UTF-16 responseText and the raw byte stream around.

A third possibility is to remember what encoding you used when  
decoding and turn the UTF-16 back into the original bytes, though I  
suppose that wouldn't work if you hit encoding errors originally.
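
To make the failure case concrete, here is a minimal Python sketch of that
round trip. The helper name roundtrips and the sample byte strings are
made up for illustration, and Python's str stands in for the UTF-16
responseText; it also assumes the decoder substitutes U+FFFD for
malformed input, which is one common policy rather than anything the
spec mandates here.

```python
def roundtrips(raw: bytes, encoding: str) -> bool:
    """Decode raw bytes leniently, re-encode with the same encoding,
    and report whether the original bytes are recovered exactly."""
    # Assumption: malformed sequences become U+FFFD instead of raising.
    text = raw.decode(encoding, errors="replace")
    try:
        restored = text.encode(encoding)
    except UnicodeEncodeError:
        # U+FFFD cannot be represented in the target encoding at all.
        return False
    return restored == raw


clean = "caf\u00e9".encode("utf-8")  # well-formed UTF-8
broken = b"caf\xe9"                  # Latin-1 bytes, malformed as UTF-8

print(roundtrips(clean, "utf-8"))        # bytes are recovered
print(roundtrips(broken, "utf-8"))       # original byte 0xE9 is lost
print(roundtrips(broken, "iso-8859-1"))  # right encoding, bytes recovered
```

Once a byte has been replaced during decoding, nothing in the decoded
string records what it was, so re-encoding can only work for input that
decoded without errors.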

> It is a bit quirky behavior though since setting overrideMimeType  
> could then change the encoding and therefore both responseXML and  
> responseText.

If XHR2 offers responseBody with a raw byte array of some kind, it  
will be required for implementations to keep the raw bytes around  
anyway.
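
As a rough sketch of what keeping the raw bytes buys you, the Python
below models a response object that decodes lazily and re-decodes after
a late override. The class LazyResponse and its members are hypothetical
names, not part of any XHR implementation or draft, and override_charset
stands in only for the charset part of overrideMimeType.

```python
from typing import Optional


class LazyResponse:
    """Hypothetical response object that retains the bytes off the wire."""

    def __init__(self, raw: bytes, header_charset: str = "utf-8") -> None:
        self.response_body = raw               # raw bytes, never discarded
        self._header_charset = header_charset  # charset from the HTTP header
        self._override: Optional[str] = None
        self._cached_text: Optional[str] = None

    def override_charset(self, charset: str) -> None:
        # May be called after the download has finished; dropping the
        # cache forces the next read of response_text to decode again.
        self._override = charset
        self._cached_text = None

    @property
    def response_text(self) -> str:
        # Decode on first access, using the override if one was set.
        if self._cached_text is None:
            charset = self._override or self._header_charset
            self._cached_text = self.response_body.decode(
                charset, errors="replace"
            )
        return self._cached_text


resp = LazyResponse("na\u00efve".encode("utf-8"), header_charset="iso-8859-1")
before = resp.response_text   # decoded with the (wrong) header charset
resp.override_charset("utf-8")
after = resp.response_text    # decoded again from the same raw bytes
print(before != after)
```

This also shows the quirk Jonas mentions: the same object reports a
different responseText before and after the override.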

Regards,
Maciej

Received on Sunday, 29 July 2007 15:26:20 UTC