- From: Arun Ranganathan <arun@mozilla.com>
- Date: Wed, 06 Jul 2011 21:44:32 -0700
- To: "Gregg Tavares (wrk)" <gman@google.com>
- CC: Web Applications Working Group WG <public-webapps@w3.org>
- Message-ID: <4E1539B0.1080500@mozilla.com>
On 6/30/11 6:01 PM, Gregg Tavares (wrk) wrote:
>
> On Tue, Jun 21, 2011 at 10:17 AM, Arun Ranganathan <arun@mozilla.com> wrote:
>
>> Sorry if these have all been discussed before. I just read the
>> File API for the first time and 2 random questions popped in my
>> head.
>>
>> 1) If I'm using readAsText with a particular encoding and the
>> data in the file is not actually in that encoding such that code
>> points in the file can not be mapped to valid code points what
>> happens? Is that implementation specific or is it specified? I
>> can imagine at least 3 different behaviors.
>
> This should be specified better and isn't. I'm inclined to then
> return the file in the encoding it is in rather than force an
> encoding (in other words, ignore the encoding parameter if it is
> determined that code points can't be mapped to valid code points
> in the encoding... also note that we say to "Replace bytes or
> sequences of bytes that are not valid according to the charset with
> a single U+FFFD character [Unicode
> <http://dev.w3.org/2006/webapi/FileAPI/#Unicode>]"). Right now,
> the spec isn't specific to this scenario ("... if the user agent
> cannot decode blob using encoding, then let charset be null"
> before the algorithmic steps, which essentially forces UTF-8).
>
> Can we list your three behaviors here, just so we get them on
> record? Which behavior do you think is ideal? More importantly,
> is substituting U+FFFD and "defaulting" to UTF-8 good enough for
> your scenario above?
>
>
> The 3 off the top of my head were
>
> 1) Throw an exception. (content not valid for encoding)
> 2) Remap bad codes to some other value (sounds like that's the one above)
> 3) Remove the bad character
>
> I see you've listed a 4th, "Ignore the encoding on error, assume
> utf-8". That one seems problematic because of partial reads.
> If you are decoding as shift-jis, have returned a partial read, and
> then later hit a bad code point, the stuff you've seen previously will
> all need to change by switching to no encoding.
>
> I'd choose #2, which it sounds like is already the case according to
> the spec.

This is the case in the spec. currently, but:

>
> Regardless of what solution is chosen is there a way for me to know
> something was lost?
>

I don't think so, actually. And I'm not entirely sure how we can allow
for such a way, unless we throw an error or something.

>
>>
>> 2) If I'm reading using readAsText a multibyte encoding (utf-8,
>> shift-jis, etc.) is it implementation dependent whether or not
>> it can return partial characters when returning partial results
>> during reading? In other words, let's say the next character in
>> a file is a 3 byte code point but the reader has only read 2 of
>> those 3 bytes so far. Is it implementation dependent whether the
>> result includes those 2 bytes before reading the 3rd byte or not?
>>
>
> Yes, partial results are currently implementation dependent; the
> spec. only says they SHOULD be returned. There was reluctance to
> have a MUST condition on partial file reads. I'm open to revisiting
> this decision if the justification is a really good one.
>
>
> I'm assuming by "MUST condition" you mean a UA doesn't have to support
> partial reads at all, not that how partial reads work shouldn't be
> specified.
>
> Here's an example.
>
> Assume we stick with mapping unknown characters to U+FFFD.
> Assume my stream is utf8 and in hex the bytes are:
>
> E3 83 91 E3 83 91
>
> That's 2 code points of 0x30D1. Now assume the reader has only read
> the first 5 bytes.
>
> Should the partial results be
>
> (a) filereader.result.length == 1 where the content is 0x30D1
>
> or should the partial result be
>
> (b) filereader.result.length == 2 where the content is 0x30D1, 0xFFFD
> because at that point the E3 83 at the end of the partial result is
> not a valid codepoint
>
> I think the spec should specify that if the UA supports partial reads
> it should follow example (a)

OK. I think the spec. needs more bolstering here. Thanks for your
example. This makes it clearer.

-- A*
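
For reference, the two candidate results (a) and (b) from the example above can be reproduced with a short sketch. It assumes the TextDecoder interface from the WHATWG Encoding Standard, which is not part of the File API under discussion and is used here only to illustrate the behaviors; the variable names are made up for the illustration.

```typescript
// Bytes from the example: two UTF-8 encodings of U+30D1 (E3 83 91 E3 83 91).
const bytes = new Uint8Array([0xe3, 0x83, 0x91, 0xe3, 0x83, 0x91]);
// The reader has only seen the first 5 bytes so far.
const firstFive = bytes.subarray(0, 5);

// Behavior (a): a streaming decoder holds back the incomplete trailing
// sequence (E3 83) until more bytes arrive.
const streaming = new TextDecoder("utf-8");
const partialA = streaming.decode(firstFive, { stream: true });
// Feeding the final byte completes the second character.
const restA = streaming.decode(bytes.subarray(5), { stream: true });

// Behavior (b): treating the 5 bytes as the complete input replaces the
// truncated sequence with U+FFFD.
const partialB = new TextDecoder("utf-8").decode(firstFive);

// One possible way to learn that something was lost: fatal mode throws a
// TypeError instead of substituting U+FFFD.
let lossDetected = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(firstFive);
} catch (e) {
  lossDetected = e instanceof TypeError;
}

console.log(partialA.length, partialB.length, lossDetected);
```

Under these assumptions, partialA matches result (a) and partialB matches result (b); the fatal-mode decoder corresponds to behavior 1), throwing on invalid content.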
Received on Thursday, 7 July 2011 04:45:09 UTC