[Bug 8310] New: script block's source initialization: please honor the specified charset and type from bugzilla@wiggum.w3.org on 2009-11-16 (public-html-bugzilla@w3.org from November 2009)

From: <bugzilla@wiggum.w3.org>
Date: Mon, 16 Nov 2009 00:32:42 +0000
To: public-html-bugzilla@w3.org
Message-ID: <bug-8310-2486@http.www.w3.org/Bugs/Public/>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=8310

           Summary: script block's source initialization: please honor the
                    specified charset and type
           Product: HTML WG
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec bugs
        AssignedTo: dave.null@w3.org
        ReportedBy: verdy_p@wanadoo.fr
         QAContact: public-html-bugzilla@w3.org
                CC: ian@hixie.ch, mike@w3.org, public-html@w3.org


About this section:

"If the load was successful
Initialize the script block's source as follows:

If the script is from an external file
The contents of that file, interpreted as string of Unicode characters, are the
script source.

For each of the rows in the following table, starting with the first one and
going down, if the file has as many or more bytes available than the number of
bytes in the first column, and the first bytes of the file match the bytes
given in the first column, then set the script block's character encoding to
the encoding given in the cell in the second column of that row, irrespective
of any previous value:

Bytes in Hexadecimal    Encoding
FE FF                   UTF-16BE
FF FE                   UTF-16LE
EF BB BF                UTF-8
This step looks for Unicode Byte Order Marks (BOMs).

The file must then be converted to Unicode using the character encoding given
by the script block's character encoding.

If the script is inline and the script block's type is a text-based language
The value of the DOM text attribute at the time the "running a script"
algorithm was first invoked is the script source.

If the script is inline and the script block's type is an XML-based language
The child nodes of the script element at the time the "running a script"
algorithm was first invoked are the script source."
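
For illustration, the external-file branch quoted above can be sketched in
Python as follows. This is only a rough reading of the quoted text, not the
specification's own code: the function names are hypothetical, and the Python
codec names stand in for the encodings listed in the table.

```python
# Rows of the quoted table, in the order the quoted text checks them.
BOM_TABLE = [
    (b"\xFE\xFF", "utf-16-be"),
    (b"\xFF\xFE", "utf-16-le"),
    (b"\xEF\xBB\xBF", "utf-8"),
]


def sniff_script_encoding(data: bytes, current_encoding: str) -> str:
    """Apply the BOM step: a matching row overrides any previous value."""
    for bom, encoding in BOM_TABLE:
        # startswith() is False when the file is shorter than the BOM,
        # which covers the "as many or more bytes available" condition.
        if data.startswith(bom):
            current_encoding = encoding
    return current_encoding


def script_source_from_file(data: bytes, current_encoding: str) -> str:
    """Convert the file to Unicode using the resulting character encoding."""
    return data.decode(sniff_script_encoding(data, current_encoding))
```

Note that, read this way, a file declared as windows-1252 that happens to
start with the bytes FF FE is switched to UTF-16LE regardless of the
declaration.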

----

This description clearly breaks the definition of the <script> element as a way
to embed or reference any external data that is not part of the document's
content flow. Notably, it forces all scripts to be text-based, even though the
data could just as well be a binary-encoded image loaded from an external file
or URL.

My opinion is that all these steps should be taken ONLY if the script is
embedded inline in the document (not loaded separately), in which case the
detection of BOMs is clearly undesirable, or if it is loaded from an external
file or URL whose type is text-based (its computed MIME type starts with
"text/" or maps to a text-based format such as "application/xml").
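
A minimal sketch of that condition, in Python. The function names and the
contents of TEXT_BASED_EXACT are assumptions made for this example; only
"text/" and "application/xml" come from the paragraph above.

```python
# Non-"text/" types treated as text-based. Only application/xml is named in
# the report; a real list would be longer (assumption).
TEXT_BASED_EXACT = {"application/xml"}


def is_text_based(mime_type: str) -> bool:
    """True if the MIME type starts with "text/" or is a known text format."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    essence = mime_type.split(";", 1)[0].strip().lower()
    return essence.startswith("text/") or essence in TEXT_BASED_EXACT


def should_convert_to_text(is_inline: bool, mime_type: str) -> bool:
    """Run the text-conversion steps only for inline or text-based scripts."""
    return is_inline or is_text_based(mime_type)
```

Under this sketch an external "image/png" resource would be left as bytes
rather than being pushed through a character decoder.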

In other words, the specified value for the script's "type" attribute must
still be honored if it is present, as well as the specified value for the
script's "charset" attribute when it is also present.

Forcing the detection of BOMs when the specified charset does not have to be
"guessed" is a bug, notably because the script's content could be any kind of
text data that may legitimately start with the listed bytes in the specified
non-Unicode-based charset, where they legally represent one or more actual
and significant characters.
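
As a concrete case (a small Python illustration; the sample bytes are made up
for this example), the bytes FF FE are the two ordinary characters "ÿþ" in
ISO-8859-1, so a forced switch to UTF-16LE changes the text entirely:

```python
# Four bytes that are valid text in ISO-8859-1 and begin with FF FE.
data = bytes([0xFF, 0xFE, 0x3D, 0x31])

# Honoring the declared charset keeps all four characters: "ÿþ=1".
as_latin1 = data.decode("iso-8859-1")

# Treating FF FE as a BOM and decoding as UTF-16LE pairs the bytes up
# instead, yielding U+FEFF followed by a single unrelated code unit.
as_utf16 = data.decode("utf-16-le")
```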

Conversely, if the specified charset is one of the listed UTFs
(Unicode-based), the detection of BOMs may be used to check the conditions
under which the charset may be safely changed to one of the others; in other
words, BOM detection CAN be, and in fact SHOULD be, used ONLY IF:

- there is no charset specified, OR

- the specified charset is a Unicode-approved and compatible UTF, namely one
of UTF-8, UTF-16, UTF-32, and ONLY if that charset allows the presence of
BOMs, so NOT if the specified charset is UTF-16BE, UTF-16LE, UTF-32BE, or
UTF-32LE
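
The two conditions above could be sketched like this in Python. The names are
hypothetical, and the two UTF-32 rows are an assumption added for this example
(they are not in the quoted table); they are checked first because the
UTF-32LE signature begins with the UTF-16LE one.

```python
from typing import Optional

# Signatures, longest first so that FF FE 00 00 is not mistaken for UTF-16LE.
BOM_TABLE = [
    (b"\x00\x00\xFE\xFF", "utf-32-be"),  # assumption: not in the quoted table
    (b"\xFF\xFE\x00\x00", "utf-32-le"),  # assumption: not in the quoted table
    (b"\xFE\xFF", "utf-16-be"),
    (b"\xFF\xFE", "utf-16-le"),
    (b"\xEF\xBB\xBF", "utf-8"),
]

# Declared charsets for which a leading BOM is permitted.
BOM_TOLERANT = {"utf-8", "utf-16", "utf-32"}


def choose_encoding(data: bytes, declared: Optional[str]) -> Optional[str]:
    """Return the encoding to use, sniffing BOMs only when the rule allows."""
    name = declared.strip().lower() if declared is not None else None
    if name is not None and name not in BOM_TOLERANT:
        # Any other explicit charset (including UTF-16BE etc.) is honored.
        return name
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):
            return encoding
    return name  # no BOM found: keep the declaration, or None if absent
```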

The detection of BOMs is then also possible for compressed Unicode transforms
like BOCU-1 or SCSU (which MAY also be optionally supported by browsers,
independently of the separate support for transport-layer protocols that can
use deflate/gzip/compress algorithms on any document type and using any
possible charset) and possibly for some other large Asian charsets like
GB 18030, or recent versions of HKSCS or JIS X (where BOMs are also
representable because they can now fully map the UCS bijectively, and may be
used as if they were a UTF), provided that their respective encodings of the
BOM are distinct for each charset.
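
The "distinct encodings" condition can be checked mechanically. The sketch
below, in Python, encodes U+FEFF with the codecs the standard library ships
(it has none for SCSU or BOCU-1, so those are left out) and reports any
signature that is a prefix of another; the helper name is hypothetical.

```python
BOM = "\ufeff"

# Codecs available in the Python standard library; SCSU and BOCU-1 are
# omitted because no stdlib codec exists for them.
CODECS = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le", "gb18030"]

signatures = {name: BOM.encode(name) for name in CODECS}


def prefix_clashes(sigs):
    """List (a, b) pairs where a's signature is a prefix of b's signature."""
    names = sorted(sigs)
    return [
        (a, b)
        for a in names
        for b in names
        if a != b and sigs[b].startswith(sigs[a])
    ]


clashes = prefix_clashes(signatures)
```

The only clash among these is UTF-16LE against UTF-32LE, which is why a
detector has to test the longer signature first.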


Received on Monday, 16 November 2009 00:32:51 UTC