- From: <bugzilla@wiggum.w3.org>
- Date: Mon, 16 Nov 2009 00:32:42 +0000
- To: public-html-bugzilla@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=8310

Summary: script block's source initialization: please honor the specified charset and type
Product: HTML WG
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: HTML5 spec bugs
AssignedTo: dave.null@w3.org
ReportedBy: verdy_p@wanadoo.fr
QAContact: public-html-bugzilla@w3.org
CC: ian@hixie.ch, mike@w3.org, public-html@w3.org

About this section:

"If the load was successful
Initialize the script block's source as follows:

If the script is from an external file
    The contents of that file, interpreted as string of Unicode characters, are the script source.

    For each of the rows in the following table, starting with the first one and going down, if the file has as many or more bytes available than the number of bytes in the first column, and the first bytes of the file match the bytes given in the first column, then set the script block's character encoding to the encoding given in the cell in the second column of that row, irrespective of any previous value:

    Bytes in Hexadecimal    Encoding
    FE FF                   UTF-16BE
    FF FE                   UTF-16LE
    EF BB BF                UTF-8

    This step looks for Unicode Byte Order Marks (BOMs).

    The file must then be converted to Unicode using the character encoding given by the script block's character encoding.

If the script is inline and the script block's type is a text-based language
    The value of the DOM text attribute at the time the "running a script" algorithm was first invoked is the script source.

If the script is inline and the script block's type is an XML-based language
    The child nodes of the script element at the time the "running a script" algorithm was first invoked are the script source."

----

This description clearly breaks the definition of the <script> element as a way to embed or reference any external data that is not part of the document's content flow.
Notably, it forces all scripts to be text-based, even though the script could just as well be a binary-encoded image loaded from an external file or URL.

My opinion is that all these steps should be taken ONLY if the script is embedded inline in the document (not loaded separately), in which case the detection of BOMs is clearly undesirable, or if it is loaded from an external file or URL whose type is text-based (its computed MIME type starts with "text/" or maps to a text-based format such as "application/xml").

In other words, the specified value of the script's "type" attribute must still be honored if it is present, as must the specified value of the script's "charset" attribute when it is also present.

Forcing the detection of BOMs when the specified charset does not have to be "guessed" is a bug, notably because the script's content could be any kind of text data that legitimately starts with the listed bytes in the specified non-Unicode-based charset, where they legally represent one or more actual and significant characters.
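To make the objection concrete, here is a minimal Python sketch of the quoted BOM step applied to a file whose declared charset is ISO-8859-1. The function names (`sniff_bom`, `decode_as_quoted_spec`) and the sample bytes are illustrative assumptions, not anything taken from the spec or from a browser.

```python
# BOM table copied from the spec text quoted in this report.
BOM_TABLE = [
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
]


def sniff_bom(data):
    """Return the encoding named by a leading BOM, or None (hypothetical helper)."""
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):
            return encoding
    return None


def decode_as_quoted_spec(data, declared_charset):
    """Sketch of the quoted step: a BOM wins 'irrespective of any previous value'."""
    encoding = sniff_bom(data) or declared_charset
    return data.decode(encoding)


# Assumed sample: text that legitimately starts with U+00FF U+00FE,
# which ISO-8859-1 encodes as the bytes FF FE.
raw = "\u00ff\u00fe = 1".encode("iso-8859-1")

honored = raw.decode("iso-8859-1")                      # declared charset honored
overridden = decode_as_quoted_spec(raw, "iso-8859-1")   # BOM step applied first
```

In this sketch `honored` is the original text, while `overridden` is a different string, because the leading FF FE bytes are taken as a UTF-16LE BOM and the remaining bytes are paired up as UTF-16 code units.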
Conversely, if the specified charset is one of the Unicode-based UTFs, the detection of BOMs may be used to check the conditions under which the charset may be safely changed to one of the others. In other words, BOM detection CAN be, and in fact SHOULD be, used ONLY IF:

- there is no charset specified, OR
- the specified charset is a Unicode-approved and compatible UTF, i.e. one of UTF-8, UTF-16, UTF-32, and ONLY if that charset allows the presence of BOMs, so NOT if the specified charset is UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE.

The detection of BOMs is then also possible for compressed Unicode transforms like BOCU-1 or SCSU (which MAY also be optionally supported by browsers, independently of the separate support for transport-layer protocols that can use deflate/gzip/compress algorithms on any document type and with any possible charset) and possibly for some other large Asian charsets like GB-18030, or recent versions of HKSCS or JIS X (where BOMs are also representable because they can now fully map the UCS bijectively, and may be used as if they were a UTF), provided that the respective encodings of the BOM are distinct in each charset.
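The proposed rule could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `choose_encoding`, its `fallback` parameter and the lower-casing of charset labels are hypothetical, and the two UTF-32 rows are an addition that is not in the spec's quoted table (they are tested first because the UTF-32LE BOM begins with the same FF FE bytes as the UTF-16LE BOM).

```python
import codecs

# Longer BOMs first, so FF FE 00 00 is not mistaken for a UTF-16LE BOM.
# The UTF-32 rows are an assumption; the quoted table has only the last three.
BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Charset labels for which a BOM is permitted, per the list above.
BOM_ALLOWED = {"utf-8", "utf-16", "utf-32"}


def choose_encoding(data, declared_charset=None, fallback="utf-8"):
    """Sniff a BOM only when no charset is declared or the declared one allows a BOM.

    `fallback` is a hypothetical default for the no-charset, no-BOM case,
    which the report does not address.
    """
    label = declared_charset.strip().lower() if declared_charset else None
    if label is None or label in BOM_ALLOWED:
        for bom, encoding in BOM_TABLE:
            if data.startswith(bom):
                return encoding
    return label or fallback
```

With this sketch, a file declared as ISO-8859-1 or UTF-16LE keeps its declared charset even when it starts with FF FE, while an undeclared file or one declared as plain UTF-16 is sniffed.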
Received on Monday, 16 November 2009 00:32:51 UTC