- From: Arun Ranganathan <arun@mozilla.com>
- Date: Wed, 06 Jul 2011 21:44:32 -0700
- To: "Gregg Tavares (wrk)" <gman@google.com>
- CC: Web Applications Working Group WG <public-webapps@w3.org>
- Message-ID: <4E1539B0.1080500@mozilla.com>
On 6/30/11 6:01 PM, Gregg Tavares (wrk) wrote:
>
>
> On Tue, Jun 21, 2011 at 10:17 AM, Arun Ranganathan <arun@mozilla.com
> <mailto:arun@mozilla.com>> wrote:
>
>> Sorry if these have all been discussed before. I just read the
>> File API for the first time and 2 random questions popped in my
>> head.
>>
>> 1) If I'm using readAsText with a particular encoding and the
>> data in the file is not actually in that encoding such that code
>> points in the file can not be mapped to valid code points what
>> happens? Is that implementation specific or is it specified? I
>> can imagine at least 3 different behaviors.
>
> This should be specified better and isn't. I'm inclined to then
> return the file in the encoding it is in rather than force an
> encoding (in other words, ignore the encoding parameter if it is
> determined that code points can't be mapped to valid code points
> in the encoding... also note that we say to "Replace bytes or
> sequences of bytes that are not valid according to the charset with
> a single U+FFFD character [Unicode
> <http://dev.w3.org/2006/webapi/FileAPI/#Unicode>]"). Right now,
> the spec isn't specific to this scenario ("... if the user agent
> cannot decode blob using encoding, then let charset be null"
> before the algorithmic steps, which essentially forces UTF-8).
>
> Can we list your three behaviors here, just so we get them on
> record? Which behavior do you think is ideal? More importantly,
> is substituting U+FFFD and "defaulting" to UTF-8 good enough for
> your scenario above?
>
>
> The 3 off the top of my head were
>
> 1) Throw an exception. (content not valid for encoding)
> 2) Remap bad codes to some other value (sounds like that's the one above)
> 3) Remove the bad character
>
> I see you've listed a 4th, "Ignore the encoding on error, assume
> utf-8". That one seems problematic because of partial reads. If you
> are decoding as shift-jis, have returned a partial read, and then
> later hit a bad code point, the stuff you've seen previously will all
> need to change by switching to no encoding.
>
> I'd choose #2, which it sounds like is already the case according to the spec.
This is the case in the spec. currently, but:
>
> Regardless of what solution is chosen is there a way for me to know
> something was lost?
>
I don't think so, actually. And I'm not entirely sure how we can allow
for such a way, unless we throw an error or something.
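
To put the three behaviors on record in concrete form, here is a minimal sketch. It is written in Python rather than against the File API, so it only illustrates the decoding semantics and says nothing about what a user agent does. The helper names and the sample bytes are hypothetical. The last helper shows one possible way to learn that something was lost: attempt a strict decode and see whether it fails.

```python
# Hypothetical sample: "abc" followed by 0xFF, a byte that is never valid in UTF-8.
data = b"abc\xff"


def decode_strict(raw: bytes, encoding: str):
    """Behavior 1: treat invalid input as an error (None signals the failure)."""
    try:
        return raw.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return None


def decode_replace(raw: bytes, encoding: str) -> str:
    """Behavior 2: map each invalid sequence to U+FFFD."""
    return raw.decode(encoding, errors="replace")


def decode_drop(raw: bytes, encoding: str) -> str:
    """Behavior 3: silently remove invalid sequences."""
    return raw.decode(encoding, errors="ignore")


def was_lossy(raw: bytes, encoding: str) -> bool:
    """True if a strict decode fails, i.e. behaviors 2 and 3 would alter the data."""
    return decode_strict(raw, encoding) is None


replaced = decode_replace(data, "utf-8")  # "abc" plus one U+FFFD
dropped = decode_drop(data, "utf-8")      # "abc"
```

Checking the decoded text for U+FFFD alone would not be a reliable signal, since a file can legitimately contain that character; hence the separate strict pass in this sketch.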
>
>>
>> 2) If I'm reading using readAsText a multibyte encoding (utf-8,
>> shift-jis, etc..) is it implementation dependent whether or not
>> it can return partial characters when returning partial results
>> during reading? In other words, let's say the next character in
>> a file is a 3 byte code point but the reader has only read 2 of
>> those 3 bytes so far. Is it implementation dependent whether result
>> includes those 2 bytes before reading the 3rd byte or not?
>>
>
> Yes, partial results are currently implementation dependent; the
> spec. only says they SHOULD be returned. There was reluctance to
> have a MUST condition on partial file reads. I'm open to revisiting
> this decision if the justification is a really good one.
>
>
> I'm assuming by "MUST condition" you mean a UA doesn't have to support
> partial reads at all, not that how partial reads work shouldn't be
> specified.
>
> Here's an example.
>
> Assume we stick with mapping unknown characters to U+FFFD.
> Assume my stream is utf8 and in hex the bytes are:
>
> E3 83 91 E3 83 91
>
> That's 2 code points of 0x30D1. Now assume the reader has only read
> the first 5 bytes.
>
> Should the partial results be
>
> (a) filereader.result.length == 1 where the content is 0x30D1
>
> or should the partial result be
>
> (b) filereader.result.length == 2 where the content is 0x30D1, 0xFFFD
> because at that point the E3 83 at the end of the partial result is
> not a valid codepoint
>
> I think the spec should specify that if the UA supports partial reads
> it should follow example (a)
OK. I think the spec. needs more bolstering here. Thanks for your
example. This makes it clearer.
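
For the record, the difference between (a) and (b) can be reproduced outside the browser. The following is a sketch in Python, not File API code, and the variable names are made up for illustration. It assumes a decoder that buffers an incomplete trailing sequence gives behavior (a), while decoding each partial chunk as if it were the whole input gives behavior (b).

```python
import codecs

# The six bytes from the example: U+30D1 encoded twice in UTF-8.
stream = bytes.fromhex("E3 83 91 E3 83 91")
partial = stream[:5]  # only the first 5 bytes have been read so far

# (a) An incremental decoder holds back the trailing "E3 83" until more
# bytes arrive, so the partial result contains only the complete character.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
result_a = decoder.decode(partial, final=False)
rest = decoder.decode(stream[5:], final=True)  # the last byte completes the character

# (b) Decoding the partial bytes as if they were the entire input treats the
# dangling "E3 83" as invalid and substitutes a replacement character.
result_b = partial.decode("utf-8", errors="replace")
```

With (a), concatenating the partial result and the remainder gives the same text as decoding the whole stream at once; with (b), text already reported would have to change once the final byte arrives.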
-- A*
Received on Thursday, 7 July 2011 04:45:09 UTC