
Bug handling utf-16 in w3ctestlib

From: Ms2ger <ms2ger@gmail.com>
Date: Tue, 22 Sep 2015 13:13:38 +0200
To: Peter Linss <peter@linss.com>
Cc: www-archive <www-archive@w3.org>
Message-ID: <560137E2.7040705@gmail.com>

Hi Peter,

I discovered yesterday that there's a file name conflict between
css-backgrounds-3/border-image-slice-001.xht and
css-backgrounds-3/border-image-slice-001.htm that isn't caught by the
build system.

It turned out that border-image-slice-001.htm (which is encoded in
utf-16-le) was being parsed as windows-1252, so no elements were
recognized and the file was dropped as "not a test". The file wasn't
detected as utf-16-le in HTMLSource.parse because of the encoding
handling there.
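
To illustrate why nothing was recognized, here is a minimal sketch of what
UTF-16-LE markup looks like when decoded as windows-1252. The sample markup is
made up for illustration and is not the contents of the real test file.

```python
# Illustration only: `sample` is a made-up document, not the actual test file.
sample = "<html><body><p>test</p></body></html>"

# Encode as UTF-16-LE, then decode with the wrong codec.
utf16_bytes = sample.encode("utf-16-le")
misdecoded = utf16_bytes.decode("windows-1252")

# Every ASCII character is now followed by a NUL character, so tag names
# such as "<html>" no longer appear as contiguous text.
print(repr(misdecoded[:12]))
print("<html>" in misdecoded)                       # wrong codec
print("<html>" in utf16_bytes.decode("utf-16-le"))  # right codec
```

With the interleaved NUL characters, a parser looking for element names has
nothing to match, which is consistent with the file being treated as "not a
test".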

HTMLBinaryInputStream.__init__ already calls detectEncoding(), so the
UTF-16 BOM is no longer in the stream when HTMLSource.parse calls
detectEncoding() manually. That second call therefore finds nothing to
identify the encoding and returns windows-1252. Attached is a patch that
removes the manual handling and relies on HTMLParser.parse to handle the
encoding detection itself.
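
The double-detection problem can be reproduced with a toy model. This is only
a sketch under assumptions: detect_encoding() and FALLBACK are hypothetical
stand-ins written for this example, not the library's real code, and the only
behaviour they model is that a BOM is consumed once it has been detected.

```python
import codecs
import io

FALLBACK = "windows-1252"  # assumed default when no BOM is found


def detect_encoding(stream):
    """Toy BOM sniffing: consume a UTF-16 BOM if present, else rewind."""
    start = stream.tell()
    head = stream.read(2)
    if head == codecs.BOM_UTF16_LE:
        return "utf-16-le"  # the BOM is consumed; the stream stays past it
    if head == codecs.BOM_UTF16_BE:
        return "utf-16-be"
    stream.seek(start)  # no BOM: put the bytes back and fall back
    return FALLBACK


# A made-up UTF-16-LE document with a BOM.
stream = io.BytesIO(codecs.BOM_UTF16_LE + "<p>test</p>".encode("utf-16-le"))

first = detect_encoding(stream)   # models the call made when the stream is built
second = detect_encoding(stream)  # models the later manual call
print(first, second)
```

The first call reports utf-16-le and the second reports windows-1252, because
the BOM was already consumed. Detecting the encoding exactly once, as the
patch does by leaving it to the parser, avoids the mismatch.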

Could you apply the patch to <https://hg.csswg.org/dev/w3ctestlib>? I
don't believe I have push access myself.

Thanks
Ms2ger



Received on Tuesday, 22 September 2015 11:14:08 UTC
