Re: Validation of XHTML 1.0 files erroneously says Unknown Encoding?!

From: Masayasu Ishikawa <mimasa@w3.org>

> Eric Maryniak <e.maryniak@pobox.com> wrote:
> 
> > note the encoding ("UTF-8"). However, the validator still says:
> > 
> >     Character encoding: unknown
> > 
> > Is this correct?
> 
> Although it would be better to recognize the encoding declaration,
> the "correct" way to specify the character encoding is to use
> the charset parameter of the "Content-Type" HTTP response header.

I made a patch to recognize the encoding from the XML declaration. I hope
this change will be accepted into W3C's original validator.
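
As a rough standalone sketch of the idea (not the patch itself), the
encoding can be pulled out of the first line of a document as shown
below. The name xml_decl_charset is made up for illustration, and the
pattern assumes the declaration sits on a single line at the very start
of the document.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lowercased encoding named in an XML
# declaration, or undef if the line has no declaration or no encoding.
sub xml_decl_charset {
    my ($line) = @_;
    return undef unless defined $line && $line =~ /^<\?xml\s/;
    # Accept either quote style; the name follows the XML EncName shape.
    if ($line =~ /\sencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/) {
        return lc $2;
    }
    return undef;
}

my $charset = xml_decl_charset('<?xml version="1.0" encoding="UTF-8"?>');
print defined $charset ? $charset : 'unknown', "\n";   # prints "utf-8"
```

Unlike the shift() call in the patch, this only reads the line, so the
document content is left untouched.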

Takuya ASADA @ W3C/Keio

--
*** check.org	Sat Jul  1 05:33:50 2000
--- check	Tue Sep  5 18:29:44 2000
***************
*** 269,274 ****
--- 269,283 ----
  }
  
  #
+ # If we find an XML declaration with charset information, we take it into account.
+ my $line = $File->{Content}[0];
+ if (defined $line and $line =~ /<\?xml\s/) {
+   if ($line =~ /encoding\s*=[\s\"']*([^\s;\"'>?]*)/) {
+     $File->{XML_Charset} = lc $1;
+   }
+ }
+ 
+ #
  # If we find a META element with charset information, we take it into account.
  foreach my $line (@{$File->{Content}}) {
    # @@ needs to handle meta elements that span more than one line
***************
*** 284,289 ****
--- 293,300 ----
  # Figure out which charset to use for the validation.
  if ($File->{HTTP_Charset}) {
    $File->{Charset} = $File->{HTTP_Charset};
+ } elsif ($File->{XML_Charset}) {
+   $File->{Charset} = $File->{XML_Charset};
  } elsif ($File->{META_Charset}) {
    $File->{Charset} = $File->{META_Charset};
  } else {
***************
*** 433,438 ****
--- 444,459 ----
        <em><span class="warning">The character encoding specified in the HTTP
        header ("<code>$File->{HTTP_Charset}</code>") is different from the one
        specified in the META element ("<code>$File->{META_Charset}</code>").
+       I will use "<code>$File->{Charset}</code>" for this validation.</span></em>
+ EOHD
+ } elsif ($File->{HTTP_Charset} ne $File->{XML_Charset}
+     and $File->{HTTP_Charset} ne ''
+     and $File->{XML_Charset} ne ''
+     and $File->{Charset} ne 'unknown') {
+   print <<"EOHD";
+       <em><span class="warning">The character encoding specified in the HTTP
+       header ("<code>$File->{HTTP_Charset}</code>") is different from the one
+       specified in the XML declaration ("<code>$File->{XML_Charset}</code>").
        I will use "<code>$File->{Charset}</code>" for this validation.</span></em>
  EOHD
  }
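
The precedence the patch ends up with (HTTP header first, then XML
declaration, then META element) can be sketched on its own as below.
This is only an illustration: pick_charset and charset_conflict are
hypothetical names, not functions in check, and the hash keys simply
mirror the ones used in the patch.

```perl
use strict;
use warnings;

# Hypothetical helper: choose the charset with the same precedence as the
# patched code: HTTP header, then XML declaration, then META element.
sub pick_charset {
    my ($file) = @_;
    for my $key (qw(HTTP_Charset XML_Charset META_Charset)) {
        return $file->{$key} if defined $file->{$key} && $file->{$key} ne '';
    }
    return 'unknown';
}

# Hypothetical helper: true when both sources are set and disagree.
sub charset_conflict {
    my ($file, $first, $second) = @_;
    my $x = defined $file->{$first}  ? $file->{$first}  : '';
    my $y = defined $file->{$second} ? $file->{$second} : '';
    return $x ne '' && $y ne '' && $x ne $y;
}

my %file = (HTTP_Charset => 'iso-8859-1', XML_Charset => 'utf-8');
print "using ", pick_charset(\%file), "\n";
print "HTTP and XML declaration disagree\n"
    if charset_conflict(\%file, 'HTTP_Charset', 'XML_Charset');
```

With the sample values above, the HTTP header wins and the mismatch is
reported, which is the case the new warning in the last hunk covers.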

Received on Tuesday, 5 September 2000 06:03:21 UTC