Error recovery for encodings specific to the Chinese-speaking world (Big5, GB family) from Kang-Hao (Kenny) Lu on 2011-08-20 (public-html-ig-zh@w3.org from August 2011)

From: Kang-Hao (Kenny) Lu <kennyluck@w3.org>
Date: Sat, 20 Aug 2011 12:46:06 +0800
To: 中文HTML5同樂會ML <public-html-ig-zh@w3.org>
Message-ID: <4E4F3C0E.7050708@w3.org>

I raised this question on the teleconference, but nothing has come of it so far, so I am bringing it to the whole mailing list.

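The HTML5 specification is arguably a very strict specification for browsers.
Section 11.2.2.3, "Preprocessing the input stream" [1], contains this passage: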

[[ Bytes or sequences of bytes in the original byte stream that could
not be converted to Unicode code points must be converted to U+FFFD
REPLACEMENT CHARACTERs. Specifically, if the encoding is UTF-8, the
bytes must be decoded with the error handling[2] defined in this
specification. ]]
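(In other words, any byte or byte sequence in the original byte stream that
cannot be converted to a Unicode code point becomes U+FFFD REPLACEMENT
CHARACTER, and when the encoding is UTF-8 the bytes must be decoded with the
error-recovering algorithm defined in this specification [2].)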
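
To make the quoted requirement concrete, here is a minimal Python sketch of
replacing undecodable bytes with U+FFFD. It uses Python's built-in "replace"
error handler, which is only an assumption standing in for the algorithm in
[2]. The two may differ in how many U+FFFD characters they emit for one
malformed sequence. The helper name decode_with_replacement is hypothetical.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_with_replacement(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, turning undecodable input into U+FFFD instead of raising."""
    # errors="replace" is Python's handler, not the spec's own algorithm.
    return data.decode(encoding, errors="replace")


# 0xFF never occurs in well-formed UTF-8, so it cannot map to a code point.
sample = b"ab\xffcd"
text = decode_with_replacement(sample)
print(repr(text), text.count(REPLACEMENT))
```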

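The algorithm in [2] is fully defined. My question is whether the IETF
standards for Big5, GB, and so on, which this specification also references,
describe error recovery for these encodings, and whether that error recovery
matches what browsers actually implement.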
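
One way to start comparing implementations is to feed the same malformed
bytes to a decoder and record what comes out. The sketch below does that with
Python's bundled codecs. It shows only what this one library does, not what
the standards say or what any browser does. The helper name probe and the
sample bytes are hypothetical choices for illustration.

```python
ENCODINGS = ("big5", "gb2312", "gbk", "gb18030")


def probe(data: bytes, encodings=ENCODINGS) -> dict:
    """Map each encoding to (strict_ok, text decoded with errors='replace')."""
    results = {}
    for enc in encodings:
        try:
            data.decode(enc)  # errors="strict" is the default
            strict_ok = True
        except UnicodeDecodeError:
            strict_ok = False
        results[enc] = (strict_ok, data.decode(enc, errors="replace"))
    return results


# Hypothetical malformed input: a lead byte (0xA4) followed by a space,
# which is not a valid trail byte in any of the encodings listed above.
for enc, (strict_ok, replaced) in probe(b"\xa4\x20").items():
    print(enc, strict_ok, [hex(ord(ch)) for ch in replaced])
```

Running the same probe against another decoding library and comparing the
printed code points would show where the two disagree.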

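A related compatibility issue is that the encoding labels [3] recognized by
each browser are not exactly the same. At present they differ because the
underlying decoding libraries differ, but full compatibility may still be
reached someday.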
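
The label problem is easy to observe in any single decoding library. As a
rough sketch, the Python snippet below asks the standard codecs registry which
codec, if any, each label resolves to. The list of labels is an illustrative
assumption rather than something taken from [3], and the helper name resolve
is hypothetical.

```python
import codecs


def resolve(label: str):
    """Return the canonical codec name Python uses for a label, or None."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        # This library does not recognize the label at all.
        return None


# Labels picked for illustration; another library may accept a different set.
LABELS = ["big5", "csbig5", "x-x-big5", "gb2312", "gbk", "x-gbk"]
for label in LABELS:
    print(label, "->", resolve(label))
```

A label that resolves to None here may still be accepted elsewhere, which is
exactly the kind of mismatch described above.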

[1]
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream
[2]
http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#decoded-as-utf-8,-with-error-handling
[3] http://www.whatwg.org/wiki/Web_Encoding

Regards,

Kang-Hao (Kenny) Lu, W3C contact for the Chinese Interest Group
Google+: https://plus.google.com/112088462407783855918/posts
Sina Weibo: http://t.sina.com.cn/1950042164

Received on Saturday, 20 August 2011 04:46:49 UTC