- From: Gregg Tavares (wrk) <gman@google.com>
- Date: Thu, 30 Jun 2011 18:01:41 -0700
- To: arun@mozilla.com
- Cc: Web Applications Working Group WG <public-webapps@w3.org>
- Message-ID: <CAKZ+BNq3CWjw-yODOZ4XzLMfueQu9qS1s5Ts70DjRVdd1=ZTPA@mail.gmail.com>
On Tue, Jun 21, 2011 at 10:17 AM, Arun Ranganathan <arun@mozilla.com> wrote:

> Sorry if these have all been discussed before. I just read the File API
> for the first time and 2 random questions popped in my head.
>
> 1) If I'm using readAsText with a particular encoding and the data in the
> file is not actually in that encoding, such that code points in the file
> cannot be mapped to valid code points, what happens? Is that
> implementation specific or is it specified? I can imagine at least 3
> different behaviors.
>
> This should be specified better and isn't. I'm inclined to then return
> the file in the encoding it is in rather than force an encoding (in other
> words, ignore the encoding parameter if it is determined that code points
> can't be mapped to valid code points in the encoding... also note that we
> say to "Replace bytes or sequences of bytes that are not valid according
> to the charset with a single U+FFFD character
> [Unicode <http://dev.w3.org/2006/webapi/FileAPI/#Unicode>]"). Right now,
> the spec isn't specific to this scenario ("... if the user agent cannot
> decode blob using encoding, then let charset be null" before the
> algorithmic steps, which essentially forces UTF-8).
>
> Can we list your three behaviors here, just so we get them on record?
> Which behavior do you think is ideal? More importantly, is substituting
> U+FFFD and "defaulting" to UTF-8 good enough for your scenario above?

The 3 off the top of my head were:

1) Throw an exception (content not valid for encoding).
2) Remap bad codes to some other value (sounds like that's the one above).
3) Remove the bad character.

I see you've listed a 4th, "ignore the encoding on error, assume utf-8".
That one seems problematic because of partial reads. If you are decoding as
shift-jis, have returned a partial read, and then later hit a bad code
point, everything returned so far would have to change when you fall back
to the other encoding.

I'd choose #2, which it sounds like is already the case according to the
spec.

Regardless of which solution is chosen, is there a way for me to know that
something was lost?
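If we do stick with #2, one rough way to detect the loss after the fact is
to scan the result for the replacement character once the read completes. A
minimal sketch, assuming U+FFFD substitution and a `blob` obtained
elsewhere:

```js
// Rough sketch, assuming behavior #2 above (invalid byte sequences are
// replaced with U+FFFD) and a `blob` that comes from elsewhere (hypothetical).
var reader = new FileReader();
reader.onloadend = function () {
  var text = reader.result || '';
  if (text.indexOf('\uFFFD') !== -1) {
    // Some bytes could not be decoded as the requested charset, so
    // information from the original file has probably been lost.
    console.log('replacement character found; data may have been lost');
  }
};
reader.readAsText(blob, 'Shift_JIS');
```

Of course that can't tell a substituted U+FFFD apart from one that was
genuinely in the file, which is exactly why an explicit "something was
lost" signal from the reader would be nicer.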
> 2) If I'm reading using readAsText a multibyte encoding (utf-8, shift-jis,
> etc.), is it implementation dependent whether or not it can return partial
> characters when returning partial results during reading? In other words,
> let's say the next character in a file is a 3 byte code point but the
> reader has only read 2 of those 3 bytes so far. Is it implementation
> dependent whether the result includes those 2 bytes before reading the
> 3rd byte or not?
>
> Yes, partial results are currently implementation dependent; the spec only
> says they SHOULD be returned. There was reluctance to have a MUST
> condition on partial file reads. I'm open to revisiting this decision if
> the justification is a really good one.

I'm assuming by "MUST condition" you mean a UA doesn't have to support
partial reads at all, not that how partial reads work shouldn't be
specified.

Here's an example. Assume we stick with unknown characters getting mapped
to U+FFFD, and assume my stream is utf-8 and the bytes, in hex, are

    E3 83 91 E3 83 91

That's 2 code points of 0x30D1. Now assume the reader has only read the
first 5 bytes. Should the partial result be

(a) filereader.result.length == 1, where the content is 0x30D1

or should it be

(b) filereader.result.length == 2, where the content is 0x30D1, 0xFFFD,
    because at that point the E3 83 at the end of the partial result is not
    a valid code point?

I think the spec should specify that if the UA supports partial reads, it
should follow example (a).

> -- A*
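To make (a) concrete, here is a rough sketch, assuming UTF-8, of the kind
of boundary check a UA could apply before exposing a partial result. The
function name and the raw byte buffer are purely illustrative; script never
sees these bytes during readAsText:

```js
// Rough sketch (names are illustrative) of how a UA could implement (a):
// before decoding a partial read as UTF-8, drop any incomplete multi-byte
// sequence at the end and hold those bytes back until the rest of the
// character arrives.
function trimIncompleteUtf8(bytes) {
  var end = bytes.length;
  var i = end;
  // Step back over up to 3 trailing continuation bytes (10xxxxxx).
  while (i > 0 && (bytes[i - 1] & 0xC0) === 0x80 && end - i < 3) {
    i--;
  }
  if (i > 0) {
    var lead = bytes[i - 1];
    var needed =
        (lead & 0xE0) === 0xC0 ? 2 :  // 2-byte sequence
        (lead & 0xF0) === 0xE0 ? 3 :  // 3-byte sequence
        (lead & 0xF8) === 0xF0 ? 4 :  // 4-byte sequence
        1;                            // ASCII or stray continuation byte
    if (needed > end - i + 1) {
      end = i - 1;  // sequence is incomplete; exclude it from this result
    }
  }
  return bytes.subarray(0, end);
}

// For the 5 bytes above (E3 83 91 E3 83) this keeps only E3 83 91, which
// decodes to a single U+30D1: result (a), length 1. The held-back E3 83
// get decoded once the final 91 arrives.
```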
Received on Friday, 1 July 2011 01:02:05 UTC