8-bit chars in US-ASCII documents (was Re: Embarrassing typo!)

On Sat, 21 Apr 2001, Terje Bless wrote:

> More worrying is the fact that we don't catch ISO-8859-1 in documents
> labelled as US-ASCII (see TODO #1 <URL:http://validator.w3.org/todo.html>)
> and I don't quite know why. Do any of you (Liam, Nick? Anyone?) have any
> ideas? What does Page Valet and the WDG Validator (and A Real Validator for
> that matter) do with that doc?

The WDG HTML Validator labels US-ASCII documents as ISO-8859-1 when
passing off to lq-nsgmls, and so it considers that example document valid.
And it is valid:

  "An XML document is valid if it has an associated document type
   declaration and if the document complies with the constraints expressed
   in it." [1]

The 8-bit character is an error, but it's the same kind of error as
including <a href="foo bar"> in an HTML document.  URIs can't contain
spaces, but HTML validators don't complain, because the DTD only declares
the attribute value as CDATA.

It would be nice if HTML validators could warn of invalid URI syntax and
character coding problems, but it's not required.
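
As a rough illustration of the kind of URI warning meant here, the
following Python sketch collects href values containing characters that
may not appear literally in a URI.  It is not what any of the validators
mentioned above do; the names HrefCollector and find_bad_hrefs are made
up for the example, and the allowed character set is taken from RFC 3986,
which postdates this message.

```python
import re
from html.parser import HTMLParser

# Assumption: characters allowed literally in a URI reference per RFC 3986
# (unreserved + reserved + "%"). Anything else, e.g. a space, needs escaping.
_URI_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")


class HrefCollector(HTMLParser):
    """Hypothetical helper: remembers href values with illegal characters."""

    def __init__(self):
        super().__init__()
        self.bad = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value is not None:
                if not _URI_CHARS.fullmatch(value):
                    self.bad.append(value)


def find_bad_hrefs(html_text):
    """Return the href values that contain characters not allowed in a URI."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    return collector.bad
```

Calling find_bad_hrefs('<a href="foo bar">x</a>') would flag "foo bar",
which a validator could then report as a warning rather than an error.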

I wouldn't know how to warn with Text::Iconv, but it should be possible to
report the problem with Unicode::Map8 by subclassing and overriding the
unmapped_to8 method.  That still leaves errors in multi-byte encodings
to deal with, though...
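
To make the check concrete, here is a minimal Python sketch, not the Perl
modules named above.  It assumes the raw document bytes and the declared
charset label are already in hand; find_non_ascii and
check_declared_charset are hypothetical names.  The first function lists
8-bit bytes in a document labelled US-ASCII, and the second relies on a
strict decode, which also catches malformed multi-byte sequences.

```python
def find_non_ascii(data: bytes):
    """Return (offset, byte value) for each byte outside 7-bit US-ASCII."""
    return [(i, b) for i, b in enumerate(data) if b > 0x7F]


def check_declared_charset(data: bytes, charset: str):
    """Return None if data decodes cleanly as charset, else a warning string.

    A strict decode fails on the first invalid byte sequence, which covers
    both stray 8-bit bytes in US-ASCII and broken multi-byte sequences.
    """
    try:
        data.decode(charset, errors="strict")
    except UnicodeDecodeError as exc:
        bad = data[exc.start]
        return f"byte 0x{bad:02X} at offset {exc.start} is not valid {charset}"
    return None
```

A validator could call check_declared_charset on the bytes it fetched and
print the returned string as a warning while still parsing the document.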

[1] http://www.w3.org/TR/REC-xml#dt-valid

-- 
Liam Quinn

Received on Saturday, 21 April 2001 13:21:42 UTC