Re: [XHR2] responseText for text/html before the encoding has stabilized

On Fri, Sep 30, 2011 at 8:05 PM, Jonas Sicking <jonas@sicking.cc> wrote:
> Unless responseType=="" or responseType=="document" I don't think we
> should do *any* HTML or XML parsing. Even the minimal amount needed to
> do charset detection.

I'd be happy to implement it that way.

> For responseType=="text" we currently *only* look at http headers and
> if nothing is found we fall back to using UTF8. Though arguably we
> should also check for a BOM, but don't currently.

Not checking for the BOM looks like a bug to me, though not a
particularly serious one: since the default is already UTF-8, the only
benefit of checking the BOM is that people can use UTF-16. But using
UTF-16 on the wire is a bad idea anyway.

This could be fixed for consistency without too much hardship, but it
would be a rather useless use of developer time.
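
To make the responseType == "text" case concrete, here is a rough
Python sketch of the decision order discussed above: HTTP charset
first, then a BOM check, then the UTF-8 fallback. The function names
are hypothetical, and consulting the BOM only when the header gives no
charset is an assumption of this sketch, not something settled in this
thread.

```python
import codecs
import re

# BOMs this sketch recognizes, mapped to Python codec names.
# UTF-32 is deliberately not handled here.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Simplified extraction of the charset parameter from a Content-Type value.
_CHARSET_PARAM = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)


def charset_from_content_type(content_type):
    """Return a codec name from the Content-Type header, or None."""
    if not content_type:
        return None
    match = _CHARSET_PARAM.search(content_type)
    if not match:
        return None
    try:
        return codecs.lookup(match.group(1)).name
    except LookupError:
        # Unknown label: treat it as if no charset had been declared.
        return None


def decode_text_response(body, content_type=None):
    """Decode a response body without any HTML or XML parsing."""
    encoding = charset_from_content_type(content_type)
    if encoding is None:
        # Assumption: the BOM is consulted only when the header is silent.
        for bom, name in _BOMS:
            if body.startswith(bom):
                return body[len(bom):].decode(name, errors="replace")
        encoding = "utf-8"
    return body.decode(encoding, errors="replace")
```

With this ordering, a UTF-16 body with a BOM and no charset header
decodes correctly, and everything else without a header is treated as
UTF-8.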

On Fri, Sep 30, 2011 at 9:05 PM, Ian Hickson <ian@hixie.ch> wrote:
> So... the prescanning is generally considered optional

I consider that a spec bug. For the sake of well-defined behavior, I
think the spec should require buffering up to 1024 bytes in order to
look for a charset <meta> without a timeout (but buffering should stop
as soon as a charset <meta> has been seen, so that if the <meta>
appears early, there's no useless stalling until the 1024-byte
boundary).
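
As an illustration of the buffering rule proposed above, here is a
minimal Python sketch. The class name and the regular expression are
hypothetical; in particular, the pattern is a deliberately simplified
stand-in and not the prescan algorithm from the HTML spec. Only the
1024-byte limit and the "stop as soon as a charset <meta> is seen"
behavior come from the proposal itself.

```python
import re

PRESCAN_LIMIT = 1024  # byte limit proposed in the message

# Simplified pattern: matches a complete <meta ...> tag that carries a
# charset=... value. This is NOT the HTML spec's prescan algorithm.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)[^>]*>',
    re.IGNORECASE,
)


class MetaPrescanBuffer:
    """Buffers incoming bytes until a charset <meta> or the limit is hit."""

    def __init__(self):
        self._buffer = bytearray()
        self.done = False
        self.encoding = None  # stays None if no charset <meta> was found

    def feed(self, chunk):
        """Add a network chunk; return True once buffering may stop."""
        if self.done:
            return True
        self._buffer += chunk
        # Only the first PRESCAN_LIMIT bytes are ever examined.
        match = _META_CHARSET.search(bytes(self._buffer[:PRESCAN_LIMIT]))
        if match:
            # Early exit: no stalling until the 1024-byte boundary.
            self.encoding = match.group(1).decode("ascii").lower()
            self.done = True
        elif len(self._buffer) >= PRESCAN_LIMIT:
            self.done = True
        return self.done

    def end_of_stream(self):
        """The response ended before the limit; stop waiting for more."""
        self.done = True

    def buffered(self):
        """Bytes held back so far, to be handed to the real parser."""
        return bytes(self._buffer)
```

There is no timer anywhere in this sketch: buffering ends only on a
match, on reaching the limit, or at end of stream, which is what makes
the behavior well-defined.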

> (the only benefit
> really is that it avoids reloads in bad cases), and indeed implementations
> are somewhat encouraged to abort it early if the server only sent a few
> bytes (because that will shorten the time until something is displayed).

Firefox has buffered up to 1024 bytes without a timeout since Firefox
4. I have received no reports of scripts locking due to the buffering.
There have been a couple of reports of incremental display of progress
messages having become non-incremental, but those were non-fatal and
easy to fix (by declaring the encoding).

> Also, it has a number of false-positives, e.g. it doesn't ignore the
> contents of <script> elements.

I think restarts with scripts are much worse than mostly-theoretical
false positives. (If someone puts a charset <meta> inside a script,
they are doing it very wrong.)

> Do we really want to put it into the critical path in this way?

For responseType == "" and responseType == "document", I think doing
so would be less surprising than ignoring <meta>. For responseType ==
"text", responseType == "chunked-text", or any other response type
that doesn't actually involve running the full HTML parser, I'd rather
not run the <meta> prescan either.
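
Spelled out as a small Python sketch, the split by response type looks
like this. The function name is hypothetical, and treating every
unlisted response type as "no prescan" is my reading of the paragraph
above rather than anything taken from the spec.

```python
# Response types that run the full HTML parser and therefore, under the
# position described above, also run the <meta> prescan.
PRESCAN_RESPONSE_TYPES = frozenset({"", "document"})


def runs_meta_prescan(response_type):
    """Return True if the <meta> prescan should run for this responseType."""
    return response_type in PRESCAN_RESPONSE_TYPES
```

So runs_meta_prescan("document") is true, while "text" and
"chunked-text" rely on the HTTP-level charset and the UTF-8 default
only.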

> I agree that the reloading alternative is even worse.

Yes.

> What about just
> relying on the Content-Type charset="" and defaulting to UTF-8 if it isn't
> there, and not doing any in-page stuff?

That would be easy to implement, but it would be strange not to
support some ways of declaring the encoding that are considered
conforming by HTML.

> How is the encoding determined for, e.g., text/plain or text/css files
> brought down through XHR and viewed through responseText?

Per spec, @charset isn't honored for text/css, so in that sense, not
honoring <meta> would be consistent. However, I'd be hesitant to stop
honoring the XML declaration for XML, since there could well be
content depending on it. XML and CSS probably won't end up being
treated consistently with each other. But then, XHR doesn't support
parsing into a CSS OM.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Monday, 3 October 2011 13:23:24 UTC