[XHR2] HTML in XHR implementation feedback from Henri Sivonen on 2011-11-16 (public-webapps@w3.org from October to December 2011)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 16 Nov 2011 12:40:08 +0200
To: public-webapps WG <public-webapps@w3.org>
Message-ID: <CAJQvAufHqo3J8oqk_ipB-KucSHkophsuTpZ17yPSeHuF2nsXFw@mail.gmail.com>

I landed support for HTML parsing in XHR in Gecko today. It has not
yet propagated to the Nightly channel.

Here's how it behaves:

 * Contrary to the spec, for response types other than "" and
"document", character encoding determination for text/html happens the
same way as for unknown types.

 * For text/html responses for response type "" and "document", the
character encoding is established by taking the first match from this
list in this order:
   - HTTP charset parameter
   - BOM
   - HTML-compliant <meta> prescan up to 1024 bytes
   - UTF-8

 * In particular, the following have no effect on the character encoding:
   - <meta> discovered by the tree builder algorithm
   - The user-configurable fallback encoding
   - Locale-specific defaults
   - The encoding of the document that invoked XHR
   - Byte patterns in the response (beyond BOM and <meta>). Even the
BOMless UTF-16 detection that Firefox does when heuristic detection
has otherwise been turned off is skipped for XHR.

 * When there is no HTTP-level charset parameter, progress events are
stalled and responseText stays null until the parser has found a BOM or
a charset <meta>, or has seen 1024 bytes or the EOF without finding
either.

 * If the response is a multipart response, XHR behaves as if it
didn't support HTML parsing for the subparts of the response. (The
multipart handling infrastructure in Gecko makes assumptions that are
incorrect for the off-the-main-thread parsing infrastructure. Since
the plan is to move XML parsing off the main thread, too, we'll need
to find out whether multipart support is a worthwhile feature to keep.
If it is, we need to add some mechanisms to make multipart work when
subparts are parsed off the main thread. If not, we should drop the
feature, in my opinion.)

 * HTML parsing is supported in the synchronous mode, but I'd be quite
happy to remove that support in order to curb sync XHR proliferation.

 * I believe the implementation otherwise matches the spec, but
exposing the document via responseXML should be considered to be at
risk. See below.
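
As a rough illustration of the encoding precedence listed above for
text/html responses with response type "" or "document", here is a
minimal TypeScript sketch. All function and type names are
hypothetical, and sniffMeta is a crude regex stand-in for the real
HTML <meta> prescan, so this shows the ordering only, not Gecko's
actual implementation.

```typescript
// Hypothetical names throughout; only the precedence order comes from
// the message: HTTP charset, then BOM, then <meta> prescan, then UTF-8.
type EncodingSource = "http" | "bom" | "meta" | "default";

interface EncodingDecision {
  encoding: string;
  source: EncodingSource;
}

const PRESCAN_LIMIT = 1024; // bytes examined by the <meta> prescan

// Helper for building test input from ASCII-only strings.
function asciiBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    out[i] = text.charCodeAt(i) & 0xff;
  }
  return out;
}

// Detect a byte order mark at the start of the body.
function sniffBom(bytes: Uint8Array): string | null {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "UTF-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) return "UTF-16BE";
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) return "UTF-16LE";
  return null;
}

// ASSUMPTION: a simplified regex over the first 1024 bytes read as
// ASCII. The real prescan is a byte-level algorithm with more rules.
function sniffMeta(bytes: Uint8Array): string | null {
  const limit = Math.min(bytes.length, PRESCAN_LIMIT);
  let head = "";
  for (let i = 0; i < limit; i++) {
    head += String.fromCharCode(bytes[i]);
  }
  const match = /<meta[^>]*charset\s*=\s*["']?([a-zA-Z0-9_:.-]+)/i.exec(head);
  const label = match ? match[1] : undefined;
  return label ? label : null;
}

// Take the first match in the listed order. Nothing else (locale,
// user fallback, invoking document, byte heuristics) is consulted.
function determineHtmlEncoding(
  httpCharset: string | null,
  body: Uint8Array
): EncodingDecision {
  if (httpCharset) return { encoding: httpCharset, source: "http" };
  const bom = sniffBom(body);
  if (bom) return { encoding: bom, source: "bom" };
  const meta = sniffMeta(body);
  if (meta) return { encoding: meta, source: "meta" };
  return { encoding: "UTF-8", source: "default" };
}
```

For example, calling determineHtmlEncoding(null, body) on a body whose
only charset <meta> starts after the first 1024 bytes falls through to
the UTF-8 default, because the prescan never sees it.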

Risks:

 * Stalling progress events while waiting for <meta> could, in theory,
deadlock an existing Web app if the app does long polling with
responseType == "", gets a text/html response without a charset
declaration, the first chunk of the response is shorter than 1024
bytes, and the server won't send more until the client side informs
the server via another channel that the first chunk has been
processed.
   - If this turns out to be a Real Problem, my plan is to make
responseText show decoded text up to the first byte that isn't one of
0x09, 0x0A, 0x0C, 0x0D, 0x20 - 0x22, 0x26, 0x27, 0x2C - 0x3F, 0x41 -
0x5A, and 0x61 - 0x7A.
   - I think this risk is low.

 * responseXML now becomes non-null for HTTP error responses that have
a text/html response body. This might be a problem if Web apps that
expect to get XML responses check for HTTP errors by checking
responseXML for null. We'll see how much breakage Nightly testers
report.
   - I think this risk is high.
   - If this turns out to be a Real Problem, the solution would be to
make HTML parsing (including the <meta> prescan) available only when
responseType == "document". (Note that xhr.response maps to
responseText when responseType == "", so if responseXML is made null
when responseType == "", xhr.response wouldn't work for retrieving the
tree.) This change might even be a good idea performance-wise to avoid
adding HTML parsing overhead for legacy uses of XHR that don't set
responseType.
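
The fallback floated above for the long-polling risk (exposing decoded
text up to the first byte outside a fixed set) can be sketched as a
prefix scan. This is a minimal sketch with hypothetical function
names. Decoding the prefix as ASCII is an assumption, which holds here
only because every byte in the listed set is below 0x80.

```typescript
// True for the bytes listed in the message: 0x09, 0x0A, 0x0C, 0x0D,
// 0x20-0x22, 0x26, 0x27, 0x2C-0x3F, 0x41-0x5A and 0x61-0x7A.
function isAllowedByte(b: number): boolean {
  return (
    b === 0x09 || b === 0x0a || b === 0x0c || b === 0x0d ||
    (b >= 0x20 && b <= 0x22) ||
    b === 0x26 || b === 0x27 ||
    (b >= 0x2c && b <= 0x3f) ||
    (b >= 0x41 && b <= 0x5a) ||
    (b >= 0x61 && b <= 0x7a)
  );
}

// Number of leading bytes that are all in the allowed set.
function safePrefixLength(bytes: Uint8Array): number {
  let i = 0;
  while (i < bytes.length && isAllowedByte(bytes[i])) {
    i++;
  }
  return i;
}

// Hypothetical: the text that could be exposed while the encoding is
// still undecided. ASSUMPTION: the prefix is decoded as ASCII.
function provisionalResponseText(bytes: Uint8Array): string {
  const n = safePrefixLength(bytes);
  let text = "";
  for (let i = 0; i < n; i++) {
    text += String.fromCharCode(bytes[i]);
  }
  return text;
}
```

With this scan, a chunk such as "<p>" followed by a non-ASCII byte
would expose only "<p>" until the encoding is known.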

Spec change proposals so far:

 * I suggest making responseType modes other than "" and "document"
not consider the internal character encoding declarations in HTML (or
XML).

Spec change proposals that I'm not making yet but might make in the near future:

 * Making responseType == "" not support HTML parsing at all and
treating text/html as an unknown type for the purpose of character
encoding.

 * Making XHR not support HTML parsing in the synchronous mode.
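
To compare the behavior as landed with the two tentative proposals
above, the decision "does this request get HTML parsing?" can be
written as a small predicate. This is only a sketch: the Policy type,
its field names and the listed set of responseType values are
assumptions made for illustration, not spec text.

```typescript
// ASSUMPTION: the responseType values considered here.
type ResponseType = "" | "arraybuffer" | "blob" | "document" | "text";

// Hypothetical switches for the two not-yet-made proposals.
interface Policy {
  // If true, responseType "" no longer gets HTML parsing.
  htmlOnlyForDocument: boolean;
  // If true, synchronous requests never get HTML parsing.
  noHtmlInSyncMode: boolean;
}

// Behavior described in the message as landed in Gecko.
const AS_LANDED: Policy = { htmlOnlyForDocument: false, noHtmlInSyncMode: false };

// Behavior if both tentative proposals were adopted.
const BOTH_PROPOSALS: Policy = { htmlOnlyForDocument: true, noHtmlInSyncMode: true };

// Whether a text/html response is parsed as HTML. The <meta> prescan
// is assumed to be tied to the same decision.
function parsesHtml(
  responseType: ResponseType,
  isSync: boolean,
  policy: Policy
): boolean {
  if (isSync && policy.noHtmlInSyncMode) return false;
  if (responseType === "document") return true;
  if (responseType === "") return !policy.htmlOnlyForDocument;
  return false; // "text", "arraybuffer", "blob": never parsed as HTML
}
```

Under AS_LANDED, a legacy request that never sets responseType still
pays for HTML parsing. Under BOTH_PROPOSALS, only an asynchronous
request with responseType "document" does.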

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Wednesday, 16 November 2011 10:40:42 UTC