Re: Text::Iconv 1.4, new Validator bundle? from Bjoern Hoehrmann on 2004-09-16 (public-qa-dev@w3.org from September 2004)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 16 Sep 2004 16:35:04 +0200
To: Martin Duerst <duerst@w3.org>
Cc: public-qa-dev@w3.org
Message-ID: <418b9bfe.462792840@smtp.bjoern.hoehrmann.de>

* Martin Duerst wrote:
>I'm working on getting rid of the dependency of Text::Iconv anyway,
>using perl unicode stuff. I should be able to check in the code next
>week. So I wouldn't worry too much about Text::Iconv anymore.

Do you mean you are working on a general-purpose module for check,
checklink, etc. that we can plug into the new Markup Validator, or
do you mean you are working on a few changes to check? In case of
the latter, what version exactly? I was under the impression that
we agreed that using Encode and proper Perl Unicode features was
not planned for 0.7.0, which will be the next version of the Markup
Validator. In that case, I would be concerned that such changes
introduce a number of additional complexities that might be
difficult to deal with without a test suite and such. It is worth
pointing out that switching to proper Unicode internals is by no
means trivial. For example,

  % perl -MEncode -e "print decode 'utf-16be', qq(\x00\xf6)"
  Unknown encoding 'utf-16be' at -e line 1

fails with the Encode.pm that ships with Perl 5.8.2, even though the
encoding would be supported if written as "UTF-16BE". Other things
to consider would be semantic changes to various symbols, e.g. in
regular expressions:

  #!perl -w
  use strict;
  use warnings;
  use Text::Iconv;
  use Encode;

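  # U+2028 LINE SEPARATOR as UTF-16BE octets
  my $t1 = qq(\x20\x28);
  # Text::Iconv returns UTF-8 octets without the UTF-8 flag
  my $s1 = Text::Iconv->new("UTF-16BE" => "utf-8")->convert($t1);
  # Encode::decode returns a character string with the UTF-8 flag set
  my $s2 = Encode::decode("UTF-16BE", $t1);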

  print "ok1\n" if $s1 =~ /\s/;
  print "ok2\n" if $s2 =~ /\s/;

This would print "ok2" but not "ok1". We would have to go through
all of these and check which behavior we desire, and have tests so
that later changes do not introduce bugs. Iconv and Encode also do
not support the same set of character encodings. GB18030, for example,
is supported by the current Markup Validator but not by the Encode
version that ships with Perl 5.8.2, so we would first need to figure
out for which encodings we would need to drop support or find other
replacements.
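
To get an overview of which encodings a given Encode installation
can handle, something along these lines might help. It is only a
sketch: supported_encoding is a made-up helper name, the list of
labels is illustrative, and retrying with the upper-cased label is
an assumption aimed at Encode releases with case-sensitive lookup.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the canonical Encode name for a charset
# label, or undef if this Encode installation does not know it.
sub supported_encoding {
    my ($label) = @_;
    # Try the label as given, then upper-cased (assumed workaround for
    # Encode versions whose lookup is case-sensitive).
    for my $candidate ($label, uc $label) {
        my $enc = Encode::find_encoding($candidate);
        return $enc->name if defined $enc;
    }
    return;
}

# Illustrative labels only; the real list would come from the validator.
for my $label (qw(utf-16be UTF-16BE iso-8859-1 GB18030 x-no-such-charset)) {
    my $name = supported_encoding($label);
    printf "%-18s %s\n", $label,
        defined $name ? "supported as $name" : "not supported";
}
```

The result depends on the installed Encode version and on any
add-on encoding modules, so the output would have to be compared
against what iconv offers on the same machine.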

Other problems might come from our dependencies. If we rely on data
from these modules, we would need to check carefully whether this
data has the UTF-8 flag set and how they cope with data that has the
UTF-8 flag set. They might have similar problems with \s and other
symbols as well and thus cause undesired side effects.
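
Checking data coming back from a dependency could look roughly like
the following. This is a sketch only; describe is a made-up helper,
and the two strings stand in for whatever a module actually returns.
utf8::is_utf8 reports whether the UTF-8 flag is set on a string.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: report how Perl sees a string, as characters
# (UTF-8 flag on) or as plain octets (flag off).
sub describe {
    my ($label, $string) = @_;
    return sprintf "%s: flag=%s length=%d matches \\s=%s",
        $label,
        utf8::is_utf8($string) ? "on" : "off",
        length $string,
        $string =~ /\s/ ? "yes" : "no";
}

my $octets = "\xE2\x80\xA8";                    # U+2028 as UTF-8 octets
my $chars  = Encode::decode("UTF-8", $octets);  # same data, decoded

print describe("octets", $octets), "\n";
print describe("chars",  $chars),  "\n";
```

The same content is three octets that do not match \s in one case
and a single character that does match \s in the other, which is
the kind of difference we would need tests for.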

I really don't think we should make such changes without a proper
automated test suite in place, and I am not sure whether switching
to proper Unicode internals fits into 0.7.0. For 0.8.0, when we
switch to an SGML::Parser::OpenSP infrastructure, a number of these
problems would already be solved, and dealing with legacy workarounds
might turn out to be difficult.

Also note that the current code works with Perl 5.6.x, whereas
Encode.pm would only work with Perl 5.7.x+. I am not sure whether we
really agreed to shift the requirements for the 0.7.0 release. Users
who have a problem with Text::Iconv might have even more problems
with Perl 5.8.2+.
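
If the Perl requirement is not raised, one conceivable approach is
to probe for Encode at run time and keep the existing code path as
a fallback. The following is a sketch under that assumption, not
something check does today; decode_if_possible is a made-up name.

```perl
use strict;
use warnings;

# Probe for Encode at run time instead of requiring it outright.
my $have_encode = eval { require Encode; 1 } ? 1 : 0;

# Hypothetical helper: decode octets to characters when Encode is
# available; otherwise return undef so the caller can fall back to
# its existing conversion code. Note that Encode::decode itself dies
# on an unknown encoding name.
sub decode_if_possible {
    my ($charset, $octets) = @_;
    return unless $have_encode;
    return Encode::decode($charset, $octets);
}

printf "Perl %vd, Encode %s\n", $^V,
    $have_encode ? "available" : "not available";
```

This keeps the code loadable on older Perls, but it also means two
code paths to test, which is another argument for having the test
suite first.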

I thus hope you are working on an external module; in that case it
would be good if you could share some details on your plan.

Received on Thursday, 16 September 2004 14:35:56 UTC