[XHR2] responseText for text/html before the encoding has stabilized from Henri Sivonen on 2011-09-29 (public-webapps@w3.org from July to September 2011)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 29 Sep 2011 14:49:17 +0300
To: public-webapps@w3.org
Message-ID: <CAJQvAudBYab4rz4mq7Q=5v0KPOfpnpxWjT=XWhBBpW73z62G0Q@mail.gmail.com>

http://dev.w3.org/2006/webapi/XMLHttpRequest-2/#text-response-entity-body says:
"The text response entity body is a DOMString representing the
response entity body." and "If charset is null and mime is text/html
follow the rules set forth in the HTML specification to determine the
character encoding. Let charset be the determined character encoding."
Furthermore, the response entity body is defined while the state is
LOADING: "The response entity body is the fragment of the entity body
of the response received so far (LOADING) or the complete entity body
of the response (DONE)."

The spec is silent on what responseText for text/html should be if
responseText is read before it is known that "the rules set forth in
the HTML specification to determine the character encoding" will no
longer change their result. This looks like a spec bug.

There are three obvious solutions:
1) Change the encoding used for responseText as more data becomes
available so that previous responseText is not guaranteed to be a
prefix of subsequent responseText.
2) Make XHR pretend it hasn't seen any data at all before it has seen
so much that the encoding decision is final.
3) Don't use the HTML rules for responseText.

Solution #1 is what Gecko now does with XML, but fortunately XML
doesn't allow non-ASCII before the XML declaration, so you can't
detect this from outside the black box. With HTML, solution #1 would
mean handing a footgun to Web authors who might not prepare for cases
where previous responseText stops being a prefix of subsequent
responseText.
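
To make the footgun concrete, here is a minimal Python sketch of what
solution #1 could look like from the author's side. It is only a
simulation: the response_text() helper, the windows-1252 fallback and the
sample markup are assumptions made for illustration, not how any browser
is implemented.

```python
from typing import Optional

# Assumed fallback used while no encoding has been declared yet.
FALLBACK = "windows-1252"


def response_text(received: bytes, declared: Optional[str]) -> str:
    """Decode everything received so far with the encoding currently believed."""
    return received.decode(declared or FALLBACK, errors="replace")


# Hypothetical response: non-ASCII text arrives before the <meta>.
body = "<title>café</title><meta charset=utf-8><p>más</p>".encode("utf-8")
split = body.index(b"<meta")

early = response_text(body[:split], None)  # read before <meta> has arrived
late = response_text(body, "utf-8")        # read after <meta> has been seen

# The earlier value is not a prefix of the later one.
prefix_preserved = late.startswith(early)
```

An app that remembers len(early) and later slices late from that offset
would silently read from the wrong position, because the mis-decoded
prefix is longer than the correctly decoded one.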

Solution #2 could, in the worst case (assuming we aren't doing the
worst of worst cases; i.e. we aren't allowing parser restarts
arbitrarily late), stall until 1024 bytes has been seen, which risks
breaking existing comet apps if there exist comet apps that use
responseText with slowly-arriving text/html responses that don't have
a BOM, don't have an early <meta> and don't have an HTTP charset and
that require the JS part of the app to respond act on data within the
first 1024 bytes before the server sends more. (OK, it would be silly
to write comet apps with responseText using text/html as opposed to
e.g. text/plain or whatever and not put a charset declaration on the
HTTP layer, but this is the Web, so who knows if such apps exist.)
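
The stalling behavior of solution #2 can be sketched as a toy Python
model. Everything here is an assumption for illustration: the
StallingText class, the simplified <meta> regex (far cruder than the real
HTML prescan), the windows-1252 fallback, and the finish() hook for end
of stream. It only handles encoding labels that Python itself knows.

```python
import codecs
import re
from typing import Optional

SNIFF_LIMIT = 1024         # the prescan window discussed above
FALLBACK = "windows-1252"  # assumed default when nothing is declared

# Crude stand-in for the HTML prescan; requires a delimiter after the
# label so that a label cut off mid-chunk is not accepted.
META_RE = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)(?=[\"'\s/>])", re.I
)


class StallingText:
    """Toy model of solution #2: expose no text until the encoding is final."""

    def __init__(self, http_charset: Optional[str] = None) -> None:
        self.buf = b""
        self.encoding = http_charset

    def feed(self, chunk: bytes) -> None:
        self.buf += chunk
        if self.encoding is None:
            self.encoding = self._decide()

    def finish(self) -> None:
        # End of stream: nothing more can change the decision.
        if self.encoding is None:
            self.encoding = FALLBACK

    def _decide(self) -> Optional[str]:
        if self.buf.startswith(b"\xef\xbb\xbf"):
            return "utf-8-sig"
        match = META_RE.search(self.buf[:SNIFF_LIMIT])
        if match:
            return match.group(1).decode("ascii")
        if len(self.buf) >= SNIFF_LIMIT:
            return FALLBACK
        return None  # still undecided, keep stalling

    @property
    def response_text(self) -> str:
        if self.encoding is None:
            return ""  # pretend no data has been seen
        decoder = codecs.getincrementaldecoder(self.encoding)(errors="replace")
        return decoder.decode(self.buf)  # holds back an incomplete trailing sequence


comet = StallingText()
comet.feed(b"<p>event 1</p>")
stalled = comet.response_text   # still empty: far fewer than 1024 bytes seen
comet.feed(b" " * SNIFF_LIMIT)
released = comet.response_text  # text appears only once the window is filled
```

In this model a comet client waiting to act on "event 1" sees nothing
until the server happens to send enough bytes, which is exactly the risk
described above. Passing an HTTP charset to the constructor avoids the
stall entirely.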

Solution #3 would make the text/html side inconsistent with the XML
side and could lead to confusion especially in the default mode if
responseXML does honor <meta>s (within the first 1024 bytes). Solution
#3 would be easy to implement, though.

As a complication, since Saturday, Gecko supports a "moz-chunked-text"
response type which modifies the behavior of response and responseText
so that they only show a string consisting of new text since the
previous progress event. "moz-chunked-text" isn't specced anywhere (to
my knowledge), but IRC discussion with Olli indicates that it's
assumed that, even going forward, the encoding decision is made the
same way for "moz-chunked-text" and "text" response types. This
assumption obviously excludes solution #1 above, since chunks reported
before <meta> could use a different encoding compared to chunks after
<meta>, which wouldn't make sense.
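
A small Python sketch of why this wouldn't make sense for chunked
delivery: once a chunk has been handed to the page, it cannot be
re-decoded. The chunked_text() helper, the windows-1252 fallback and the
sample markup are hypothetical and only illustrate the idea; this is not
how "moz-chunked-text" is implemented.

```python
from typing import List

# Assumed fallback used before any encoding declaration has been seen.
FALLBACK = "windows-1252"


def chunked_text(chunks: List[bytes], encodings: List[str]) -> List[str]:
    """Decode each chunk with the encoding believed at the time it was reported."""
    return [c.decode(e, errors="replace") for c, e in zip(chunks, encodings)]


# Hypothetical response in which the same word appears before and after <meta>.
body = "<p>naïve</p><meta charset=utf-8><p>naïve</p>".encode("utf-8")
cut = body.index(b"<meta")
chunks = [body[:cut], body[cut:]]

# Under solution #1 the first chunk is reported with the fallback encoding
# and the second with the encoding declared by <meta>.
mixed = chunked_text(chunks, [FALLBACK, "utf-8"])
```

Here mixed[0] contains mojibake while mixed[1] is decoded correctly, so
concatenating the chunks never yields the text that a single consistent
decode of the whole body would produce.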

It's worth noting that "moz-chunked-text" turns off responseXML, so
it's not unthinkable to use non-HTML rules for "moz-chunked-text".

In IRC discussion with Olli, we gravitated towards solution #2, but we
didn't consider the comet stalling aspect in that discussion.

In any case, all this should be specced properly and it currently isn't. :-(

It seems to me that these cannot all be true:
 * responseText and responseXML use the same encoding detection rules.
 * The "text" and default modes use the same encoding detection rules.
 * "text" and "moz-chunked-text" use the same encoding detection rules.
 * "moz-chunked-text" uses the same encoding for all chunks.
 * All imaginable badly written comet apps are guaranteed to continue working.
 * responseXML considers <meta> in a deterministic way (no timer for
bailing out before 1024 bytes if the network stalls).

Which property do we give up?

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Thursday, 29 September 2011 11:49:55 UTC