Re: [FileAPI] Updates to FileAPI Editor's Draft

From: Jonas Sicking <jonas@sicking.cc>
Date: Thu, 7 Jul 2011 12:26:08 -0700
Message-ID: <CA+c2ei96mCkp4Ks1cAe2LdGXcG7i6OreaV+zxzggU6jyDkMpvQ@mail.gmail.com>
To: arun@mozilla.com
Cc: "Gregg Tavares (wrk)" <gman@google.com>, Web Applications Working Group WG <public-webapps@w3.org>

On Tue, Jun 21, 2011 at 10:17 AM, Arun Ranganathan <arun@mozilla.com> wrote:
> > Sorry if these have all been discussed before. I just read the File API
> > for the first time and 2 random questions popped in my head.
> > 1) If I'm using readAsText with a particular encoding and the data in the
> > file is not actually in that encoding such that code points in the file
> > cannot be mapped to valid code points what happens? Is that implementation
> > specific or is it specified? I can imagine at least 3 different behaviors.
>
> This should be specified better and isn't.  I'm inclined to then return the
> file in the encoding it is in rather than force an encoding (in other words,
> ignore the encoding parameter if it is determined that code points can't be
> mapped to valid code points in the encoding... also note that we say to
> "Replace bytes or sequences of bytes that are not valid according to
> the charset with a single U+FFFD character [Unicode]").  Right now, the spec
> isn't specific to this scenario ("... if the user agent cannot decode blob
> using encoding, then let charset be null" before the algorithmic steps,
> which essentially forces UTF-8).

I definitely don't think we should use any kind of charset
autodetection if people explicitly specify one. That is likely to
create more confusion and bugs than it solves.

I don't fully understand what remains undefined if we say that any
invalid character should be replaced by U+FFFD. I.e., why isn't that
enough? I don't doubt that it isn't enough, but I'd like to understand
how it falls short in order to fix it.
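
To make the U+FFFD behavior concrete, here is a minimal sketch of
"decode with the requested charset, replace what is invalid". It uses
Python's built-in decoder purely as a stand-in for a user agent's
decoder; the function name decode_with_replacement is hypothetical, and
nothing here is File API code or spec text.

```python
def decode_with_replacement(data: bytes, encoding: str) -> str:
    """Decode data with the caller's encoding, never guessing another one.

    errors="replace" makes the decoder substitute U+FFFD for byte
    sequences that are invalid in the given encoding instead of raising.
    (Assumption: this mirrors the replacement idea discussed above; it is
    not taken from the spec.)
    """
    return data.decode(encoding, errors="replace")


# 0xFF can never appear in well-formed UTF-8, so it is replaced.
text = decode_with_replacement(b"caf\xff ok", "utf-8")
print(repr(text))
```

The explicitly requested encoding is always honored; only the invalid
byte is turned into U+FFFD.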

> > 2) If I'm reading using readAsText a multibyte encoding (utf-8,
> > shift-jis, etc.) is it implementation dependent whether or not it can
> > return partial characters when returning partial results during reading?
> > In other words, let's say the next character in a file is a 3 byte code
> > point but the reader has only read 2 of those 3 bytes so far. Is it
> > implementation dependent whether result includes those 2 bytes before
> > reading the 3rd byte or not?
>
> Yes, partial results are currently implementation dependent; the spec only
> says they SHOULD be returned.  There was reluctance to have MUST condition
> on partial file reads.  I'm open to revisiting this decision if the
> justification is a really good one.

I absolutely don't think we should return partial characters. From the
page author's point of view, .result should "stream" in: once a
character has been appended to it, it should never change.
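
As a rough illustration of that "streaming" property, the sketch below
decodes a file that arrives in chunks and records what .result would
look like after each chunk. It uses Python's codecs incremental decoder
as a stand-in for a user agent; stream_decode and its snapshots list
are hypothetical names, not anything from the File API.

```python
import codecs


def stream_decode(chunks, encoding="utf-8"):
    """Return the accumulated text after each chunk, plus the final text.

    The incremental decoder buffers the bytes of an incomplete multibyte
    character internally, so nothing is appended to the result until the
    whole character has been read.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    result = ""
    snapshots = []
    for chunk in chunks:
        result += decoder.decode(chunk)  # final=False: keep partial bytes
        snapshots.append(result)
    # End of file: flush the decoder; bytes still buffered at this point
    # are passed to the error handler.
    result += decoder.decode(b"", final=True)
    snapshots.append(result)
    return snapshots


# U+20AC (euro sign) is E2 82 AC in UTF-8; split it across two chunks.
snapshots = stream_decode([b"a\xe2\x82", b"\xacb"])
print(snapshots)
```

After the first chunk only "a" is visible; the euro sign appears once
its third byte arrives, and every snapshot is a prefix of the next one.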

/ Jonas
