Re: [WMVS] Some initial thoughts on code M12N...

* Martin Duerst wrote:
>>Well, I need the functionality for a number of other things than the
>>Validator
>
>What other things are these?

My HTML::Tidy module needs this, as HTML Tidy has only limited, highly
experimental and off-by-default functionality in this regard, which won't
change too soon. I already mentioned the experimental AppC Validator,
and there will be a PerlSAX extension that annotates event streams with
additional information that depends on the availability of the source
code of the document in the form of a character string.

>What kind of trials and errors would this include? Are you thinking
>about some encoding detection heuristics, or something else?

I do not know yet. One example could be to choose a different encoding
if the encoding has been determined as X but the document is not legal
X, especially in case of conflicting declarations and/or specifications.
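
To make the "determined as X but not legal X" case concrete, here is a
rough sketch in Python. The function name and the fallback list are
assumptions made up for illustration, not anything that exists in check
or in the modules mentioned above. It tries the declared encoding
strictly and only then moves on to other candidates:

```python
def decode_with_fallback(data, declared, fallbacks=("utf-8", "iso-8859-1")):
    """Return (text, encoding_used).

    Hypothetical helper: try the declared encoding first and fall back
    to other candidates only if the bytes are not legal in it.
    """
    for name in (declared, *fallbacks):
        try:
            # errors="strict" makes illegal byte sequences raise
            return data.decode(name, errors="strict"), name
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: bytes are not legal in this encoding
            # LookupError: the encoding label is unknown to Python
            continue
    raise ValueError("no candidate encoding could decode the input")
```

Note that ISO-8859-1 accepts any byte sequence, so putting it last
means the loop always ends with some answer; whether that answer is
the right one is exactly the open question.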

>I agree that this should be the goal. But the wish for the
>perfect now is the enemy of the good soon (such as release often).

That depends... As I wrote, it is often much simpler to address such
issues in an external module (possibly started from scratch) than to
mess with the code deeply buried in check.

>Yes, getting more knowledge of aliases and stuff into such
>a module would probably be something to do.

Encode::Alias and I18N::Charset or modules building on top of those
might be better places though.
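
As an illustration of what alias knowledge buys you, the sketch below
uses Python's built-in codec registry as a stand-in. It is only an
analogue and does not reflect the Encode::Alias or I18N::Charset APIs;
the helper names are made up:

```python
import codecs

def canonical_name(label):
    """Map a charset label to Python's canonical codec name, or None."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None

def same_charset(a, b):
    """True if both labels resolve to the same known codec."""
    ca, cb = canonical_name(a), canonical_name(b)
    return ca is not None and ca == cb
```

With this, same_charset("latin1", "ISO-8859-1") holds, while
"iso-8859-2" and "iso-8859-1" are kept apart.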

>>Just like people tend to disagree what
>>the encoding of a document http://www.example.org/
>>
>>   ...
>>   Content-Type: text/html
>>
>>   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
>>     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
>>   <html xmlns="http://www.w3.org/1999/xhtml"><head>
>>   <meta http-equiv = "Content-Type" content =
>>     "text/html;charset=iso-8859-2" />
>>   <title></title></head><body><p>...</p></body></html>
>>
>>would be.
>
>I know some people might claim that this is iso-8859-1,
>but they definitely would be wrong. If it's not for the
>specs, then for all the implementations out there.

Other claims are

  * UTF-8
  * ISO-8859-2
  * US-ASCII
  * implementation defined
  * ...

And implementations do disagree here. The Markup Validator, for example,
would consider it ISO-8859-2, while the W3C CSS Validator would consider
it UTF-8 encoded. But implementations do not seem very relevant here. I
know some people might claim that (if the type were text/xml) it is
US-ASCII, but they definitely would be wrong. If it's not for the
specs, then for all the implementations out there...
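
One way to see how such claims can diverge is to write the competing
readings down as policies. The sketch below is only that: the policy
names are hypothetical, the rule that an explicit HTTP charset always
wins is a simplifying assumption, and none of this describes what any
particular validator actually does.

```python
def guess_encoding(policy, http_charset=None, xml_decl=None, meta_charset=None):
    """Pick an encoding under one of several hypothetical policies."""
    if http_charset:
        # Assumption: an explicit charset parameter wins everywhere
        return http_charset
    if policy == "http-text-default":
        # RFC 2616 default for text/* sent without a charset parameter
        return "iso-8859-1"
    if policy == "xml-rules":
        # XML 1.0: no encoding declaration and no BOM means UTF-8
        return xml_decl or "utf-8"
    if policy == "meta-first":
        # Trust the <meta> element; the final default is arbitrary
        return meta_charset or xml_decl or "iso-8859-1"
    raise ValueError("unknown policy: %s" % policy)
```

For the example document (no charset in the header, no XML
declaration, a meta element naming iso-8859-2) the three policies
return iso-8859-1, utf-8 and iso-8859-2 respectively.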

Received on Tuesday, 28 September 2004 11:36:17 UTC