- From: Martin Duerst <duerst@w3.org>
- Date: Tue, 28 Sep 2004 14:45:05 +0900
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- Cc: public-qa-dev@w3.org
At 20:10 04/09/24 +0200, Bjoern Hoehrmann wrote:
>* Martin Duerst wrote:
> >>I was under the impression that we agreed that using Encode and
> >>proper Perl Unicode features were not planned for 0.7.0 which will
> >>be the next version of the Markup Validator.
> >
> >Who agreed? You suggested to use proper Perl Unicode, didn't you?
>
>I also suggested that we release often; the 0.6.7 release is now three
>months old and it does not seem to me that the next version will be
>released in October.

I have never seen anybody on this list suggest that releasing often
is a bad idea. But whenever we came close to releasing, everybody
seemed sceptical to me. The only way, in my experience, to release
often is to 'just do it'.

>Currently our main focus is on stabilizing the code
>in HEAD which is the result of merging the improvements in the former
>HEAD and 0.6.7, fixing all the bugs so that it has at least the level of
>quality that 0.6.7 had and then see what comes next, I would expect a
>Beta release to get broader review.

Why not do the beta release before we have 'at least the quality
of 0.6.7'?

>I see switching to Unicode internals
>now as making that more difficult.

In my checked-out version, I was able to get rid of the counting
problems when indicating where on a line an error occurred. That's
definitely a bug fix, and for some people (all those working outside
ASCII), it may be a real feature. The actual disadvantage would be
the lack of support for GB18030. The other things that you have
mentioned will eventually have to be checked very carefully, but
should be okay for most cases (and going through the code and
replacing \s and friends in regular expressions with precise []
character classes shouldn't be such a big issue). Also, my code got
a lot simpler because Encode is much better at handling decoding
errors in various ways.

> >A lot of things would be better with a test suite. But I'm
> >not ready to wait for one.
>
>You don't have to wait! You can contribute to it and make it happen
>sooner! Valuable contributions would be ideas, test documents,
>documentation, source code for a test module and/or script, reports
>on bugs in the current code, etc.
>
> >>   % perl -MEncode -e "print decode 'utf-16be', qq(\x00\xf6)"
> >>   Unknown encoding 'utf-16be' at -e line 1
> >>
> >>using the Encode.pm that ships with Perl 5.8.2 even though the
> >>encoding would be supported if written as "UTF-16BE".
> >
> >Good to know. Does this apply to all encodings, or only to
> >a few?
>
>Only to a few as far as I can tell. A list of encoding names (including
>different spellings) we currently support and which we would support
>just by using Encode and/or Encode::Alias and/or I18N::Charset would be
>very useful. Maybe that's something you can look into?

I have. I remember that UTF-16 somehow showed up in this list, and
GB18030. I don't think there were any others.

One big advantage of Encode would be that it does not depend on the
machine's iconv, as Text::Iconv does, and iconv implementations vary.
As an example, Solaris has a rather bad one out of the box.

Also, please note that the encodings we currently support are not
simply those of Iconv; Iconv would support many others. But we check
whether there is actually an IANA registration, and we use only
the MIME preferred name.

> >>and check which behavior we desire, and have tests so
> >>that later changes do not introduce bugs. Iconv and Encode also do
> >>not support the same set of character encodings, GB18030 for example
> >>is supported by the current Markup Validator but not by the Encode
> >>version that ships with Perl 5.8.2, we would first need to figure
> >>out for which encodings we would need to drop support or find other
> >>replacements.
> >
> >Or we would just (temporarily) drop those that are not supported.
>
>That's an option, too. Maybe we should discuss this in one of our
>upcoming meetings?

Please discuss. I think it should be feasible.

Regards,    Martin.
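
To make the "counting problems" mentioned above concrete, here is a minimal sketch of why an error column drifts when a line is counted in bytes rather than in decoded characters. It is written in Python purely for illustration and is not the validator's code; the function names `byte_column` and `char_column` are hypothetical.

```python
# Illustration only (hypothetical helpers, not the Markup Validator's code):
# an error column counted in bytes drifts once a line contains non-ASCII text.

def byte_column(data: bytes, marker: bytes) -> int:
    """1-based column of `marker`, counted in raw bytes."""
    return data.index(marker) + 1


def char_column(data: bytes, marker: str, encoding: str = "utf-8") -> int:
    """1-based column of `marker`, counted in characters after decoding."""
    return data.decode(encoding).index(marker) + 1


# "é" occupies two bytes in UTF-8, so the "<" that follows it sits at
# byte column 3 but character column 2.
line = "é<p>".encode("utf-8")
print(byte_column(line, b"<"))  # 3
print(char_column(line, "<"))   # 2
```

For pure ASCII input the two counts agree, which is why the problem only shows up for people working outside ASCII.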
Received on Tuesday, 28 September 2004 05:56:02 UTC