Re: [FileAPI] Updates to FileAPI Editor's Draft from Arun Ranganathan on 2011-07-07 (public-webapps@w3.org from July to September 2011)

From: Arun Ranganathan <arun@mozilla.com>
Date: Wed, 06 Jul 2011 21:44:32 -0700
To: "Gregg Tavares (wrk)" <gman@google.com>
CC: Web Applications Working Group WG <public-webapps@w3.org>
Message-ID: <4E1539B0.1080500@mozilla.com>
On 6/30/11 6:01 PM, Gregg Tavares (wrk) wrote:
>
>
> On Tue, Jun 21, 2011 at 10:17 AM, Arun Ranganathan <arun@mozilla.com> wrote:
>
>>     Sorry if these have all been discussed before. I just read the
>>     File API for the first time and 2 random questions popped in my
>>     head.
>>
>>     1) If I'm using readAsText with a particular encoding and the
>>     data in the file is not actually in that encoding, such that code
>>     points in the file cannot be mapped to valid code points, what
>>     happens? Is that implementation specific or is it specified? I
>>     can imagine at least 3 different behaviors.
>
>     This should be specified better and isn't.  I'm inclined to then
>     return the file in the encoding it is in rather than force an
>     encoding (in other words, ignore the encoding parameter if it is
>     determined that code points can't be mapped to valid code points
>     in the encoding... also note that we say to "Replace bytes or
>     sequences of bytes that are not valid according to the charset with
>     a single U+FFFD character [Unicode
>     <http://dev.w3.org/2006/webapi/FileAPI/#Unicode>]").  Right now,
>     the spec isn't specific to this scenario ("... if the user agent
>     cannot decode blob using encoding, then let charset be null"
>     before the algorithmic steps, which essentially forces UTF-8).
>
>     Can we list your three behaviors here, just so we get them on
>     record?  Which behavior do you think is ideal?  More importantly,
>     is substituting U+FFFD and "defaulting" to UTF-8 good enough for
>     your scenario above?
>
>
> The 3 off the top of my head were
>
> 1) Throw an exception. (content not valid for encoding)
> 2) Remap bad codes to some other value (sounds like that's the one above)
> 3) Remove the bad character
>
> I see you've listed a 4th, "Ignore the encoding on error, assume 
> utf-8". That one seems problematic because of partial reads. If you 
> are decoding as shift-jis, have returned a partial read, and then 
> later hit a bad code point, the stuff you've seen previously will all 
> need to change by switching to no encoding.
>
> I'd choose #2, which it sounds like is already the case according to the spec.

This is the case in the spec. currently, but:
>
> Regardless of what solution is chosen is there a way for me to know 
> something was lost?
>

I don't think so, actually. And I'm not entirely sure how we can allow 
for such a way, unless we throw an error or something.
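
For illustration only, the difference between behavior 1 (throw) and behavior 2 (substitute U+FFFD) can be sketched with TextDecoder. This is an assumption made for the example: TextDecoder comes from the later Encoding Standard and is not part of the File API draft discussed here, and readAsText exposes no equivalent switch. The last line shows one rough way to guess that something was lost; it cannot tell a substituted U+FFFD from one that was genuinely present in the file.

```javascript
// Sketch only: TextDecoder stands in for the decoding step of readAsText.
// It is NOT part of the File API draft under discussion.
const bytes = new Uint8Array([0x41, 0xff, 0x42]); // 'A', an invalid UTF-8 byte, 'B'

// Behavior 2: each invalid byte sequence becomes a single U+FFFD.
const replaced = new TextDecoder("utf-8").decode(bytes);

// Behavior 1: reject the input outright when it is not valid for the encoding.
let threw = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(bytes);
} catch (e) {
  threw = true; // decoding failed because of the 0xFF byte
}

// Heuristic loss check: a U+FFFD in the result MAY mean bytes were replaced,
// but the file could also have contained U+FFFD legitimately.
const maybeLossy = replaced.includes("\uFFFD");
```

A caller that needs a definite answer would have to decode in the throwing mode first and fall back to the substituting mode on failure.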
>
>>
>>     2) If I'm reading using readAsText a multibyte encoding (utf-8,
>>     shift-jis, etc..) is it implementation dependent whether or not
>>     it can return partial characters when returning partial results
>>     during reading? In other words, let's say the next character in
>>     a file is a 3 byte code point but the reader has only read 2 of
>>     those 3 bytes so far. Is it implementation dependent whether the
>>     result includes those 2 bytes before reading the 3rd byte or not?
>>
>
>     Yes, partial results are currently implementation dependent; the
>     spec. only says they SHOULD be returned.  There was reluctance to
>     have a MUST condition on partial file reads.  I'm open to revisiting
>     this decision if the justification is a really good one.
>
>
> I'm assuming by "MUST condition" you mean a UA doesn't have to support 
> partial reads at all, not that how partial reads work shouldn't be 
> specified.
>
> Here's an example.
>
> Assume we stick with unknown characters being mapped to U+FFFD.
> Assume my stream is utf8 and in hex the bytes are:
>
> E3 83 91 E3 83 91
>
> That's 2 code points of 0x30D1. Now assume the reader has only read 
> the first 5 bytes.
>
> Should the partial results be
>
> (a) filereader.result.length == 1 where the content is 0x30D1
>
>  or should the partial result be
>
> (b) filereader.result.length == 2 where the content is 0x30D1, 0xFFFD 
>  because at that point the E3 83 at the end of the partial result is 
> not a valid codepoint
>
> I think the spec should specify that if the UA supports partial reads 
> it should follow example (a).

OK.  I think the spec. needs more bolstering here.  Thanks for your 
example.  This makes it clearer.
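
As a rough illustration of outcomes (a) and (b) for the six bytes above, here is a sketch that again assumes TextDecoder from the later Encoding Standard rather than FileReader itself; the variable names are made up for the example. A decoder used in streaming mode holds back the incomplete trailing sequence, which matches (a); decoding the first five bytes as if they were the whole input yields the trailing U+FFFD of (b).

```javascript
// Sketch only: TextDecoder stands in for a UA's incremental text decoding.
// It is NOT part of the File API draft under discussion.
const bytes = new Uint8Array([0xe3, 0x83, 0x91, 0xe3, 0x83, 0x91]); // U+30D1 twice
const firstFive = bytes.subarray(0, 5); // what the reader has seen so far

// (a) Streaming decode: the incomplete "E3 83" is buffered, not emitted.
const streaming = new TextDecoder("utf-8");
const partialA = streaming.decode(firstFive, { stream: true });
// The final byte completes the buffered sequence; no stream flag means flush.
const restA = streaming.decode(bytes.subarray(5));

// (b) One-shot decode of the truncated input: the dangling "E3 83"
// is treated as invalid and replaced with U+FFFD.
const partialB = new TextDecoder("utf-8").decode(firstFive);
```

With (a), the partial result only ever grows as more bytes arrive; with (b), a character already handed to the caller would later have to change from U+FFFD to U+30D1.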

-- A*
Received on Thursday, 7 July 2011 04:45:09 UTC