Re: Text::Iconv 1.4, new Validator bundle? from Martin Duerst on 2004-09-20 (public-qa-dev@w3.org from September 2004)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 20 Sep 2004 15:41:10 +0900
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: public-qa-dev@w3.org
Message-Id: <4.2.0.58.J.20040920152923.04ee2950@localhost>
At 16:35 04/09/16 +0200, Bjoern Hoehrmann wrote:
>* Martin Duerst wrote:
> >I'm working on getting rid of the dependency of Text::Iconv anyway,
> >using perl unicode stuff. I should be able to check in the code next
> >week. So I wouldn't worry too much about Text::Iconv anymore.
>
>Do you mean you are working on a general purpose module for check,
>checklink, etc. that we can plug into the new Markup Validator or

As I said earlier, I'm doing some work that may eventually end
up in a module. It's much easier to wrap code up into a module
once the interfaces are clear than just starting with a module
because it looks good to have one (which I agree it would).


>do you mean you are working on a few changes to check? In case of
>the latter, what version exactly?

I'm still trying to figure out what's the right thing to do,
0.6.0 or HEAD.


>I was under the impression that
>we agreed that using Encode and proper Perl Unicode features were
>not planned for 0.7.0 which will be the next version of the Markup
>Validator.

Who agreed? You suggested using proper Perl Unicode, didn't you?


>In that case, I would be concerned that such changes
>introduce a number of additional complexities that might be
>difficult to deal with without a test suite and such.

A lot of things would be better with a test suite. But I'm
not ready to wait for one.


>It is worth
>pointing out that switching to proper Unicode internals is by no
>means trivial, for example
>
>   % perl -MEncode -e "print decode 'utf-16be', qq(\x00\xf6)"
>   Unknown encoding 'utf-16be' at -e line 1
>
>using the Encode.pm that ships with Perl 5.8.2 even though the
>encoding would be supported if written as "UTF-16BE".

Good to know. Does this apply to all encodings, or only to
a few?
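
One way to find out on a given installation is to try the lower-cased
form of every name Encode reports. This is only a minimal sketch. It
assumes nothing beyond the core Encode module, the helper name
lowercase_failures is made up for illustration, and the result depends
on which Encode version is installed:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the canonical encoding names whose
# lower-cased spelling cannot be resolved by this Encode version.
sub lowercase_failures {
    my @failed;
    for my $name (Encode->encodings(":all")) {
        my $lower = lc $name;
        next if $lower eq $name;    # already lower case, nothing to test
        push @failed, $name
            unless defined(Encode::find_encoding($lower));
    }
    return @failed;
}

my @failed = lowercase_failures();
print @failed
    ? "not found in lower case: @failed\n"
    : "all lower-case names resolved\n";
```

Running it against each Perl version we care about would show whether
the problem is limited to a few names or is general.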


>Other things
>to consider would be semantic changes to various symbols e.g. in
>regular expressions,
>
>   #!perl -w
>   use strict;
>   use warnings;
>   use Text::Iconv;
>   use Encode;
>
>   my $t1 = qq(\x20\x28);
>   my $s1 = Text::Iconv->new("UTF-16BE" => "utf-8")->convert($t1);
>   my $s2 = Encode::decode("UTF-16BE", $t1);
>
>   print "ok1\n" if $s1 =~ /\s/;
>   print "ok2\n" if $s2 =~ /\s/;
>
>This would print "ok2" but not "ok1"; we would have to go through
>all of these

Good point. [What this is about is that \s matches more than
a few ASCII characters in the case of Unicode.]
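
The difference can be reproduced without Text::Iconv. The sketch below
assumes only the core Encode module. The UTF-8 octets produced by
encode() stand in for the byte string iconv would return, and the /a
modifier on the last match assumes Perl 5.14 or later, so it is not
available in the versions discussed in this thread:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# U+2028 LINE SEPARATOR as UTF-16BE octets, as in the quoted example.
my $utf16 = "\x20\x28";

# Character string: a single code point, U+2028.
my $chars = decode("UTF-16BE", $utf16);

# Byte string: the UTF-8 octets E2 80 A8, standing in for iconv output.
my $bytes = encode("UTF-8", $chars);

my $chars_match = ($chars =~ /\s/)  ? 1 : 0;  # Unicode rules apply
my $bytes_match = ($bytes =~ /\s/)  ? 1 : 0;  # none of the octets is whitespace
my $ascii_only  = ($chars =~ /\s/a) ? 1 : 0;  # /a limits \s to ASCII whitespace

printf "chars: %d, bytes: %d, chars with /a: %d\n",
    $chars_match, $bytes_match, $ascii_only;
# prints "chars: 1, bytes: 0, chars with /a: 0"
```

Where the old byte-oriented behavior is wanted, an explicit class such
as [ \t\r\n\f] works on any Perl version.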


>and check which behavior we desire, and have tests so
>that later changes do not introduce bugs. Iconv and Encode also do
>not support the same set of character encodings. GB18030, for example,
>is supported by the current Markup Validator but not by the Encode
>version that ships with Perl 5.8.2, so we would first need to figure
>out for which encodings we would need to drop support or find other
>replacements.

Or we would just (temporarily) drop those that are not supported.
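
Working out which ones those are could be as simple as the following
sketch. The list of charset labels is hypothetical and not taken from
the validator source, and which labels end up in which group depends on
the installed Encode version and any add-on encoding modules:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical list of charset labels the validator might accept.
my @wanted = qw(UTF-8 ISO-8859-1 UTF-16BE GB18030);

my (@supported, @dropped);
for my $label (@wanted) {
    # find_encoding() returns an encoding object, or undef if unknown.
    if (defined(Encode::find_encoding($label))) {
        push @supported, $label;
    }
    else {
        push @dropped, $label;
    }
}

print "supported: @supported\n";
print "would be dropped: @dropped\n";
```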


>Other problems might come from our dependencies, if we rely on data
>from these modules we would need to check carefully whether this
>data has the UTF-8 flag set and how they cope with data that has a
>UTF-8 flag set. They might have similar problems with \s and other
>symbols as well and thus cause undesired side effects.
>
>I really don't think we should make such changes without a proper
>automated test suite in place, and I am not sure whether switching
>to proper Unicode internals fits into 0.7.0. For 0.8.0, when we
>switch to an SGML::Parser::OpenSP infrastructure, a number of these
>problems would already be solved, and dealing with legacy workarounds
>might turn out to be difficult.
>
>Also note that the current code works with Perl 5.6.x, whereas Encode.pm
>would only work with Perl 5.7.x+. I am not sure whether we really
>agreed to shift the requirements for the 0.7.0 release. Users that
>have a problem with Text::Iconv might have even more problems with
>Perl 5.8.2+.

Not sure. Upgrading to a new perl version may be easier than
getting a specific module.


Regards,    Martin.
Received on Monday, 20 September 2004 22:35:49 UTC