Re: Codepage

On Mon, 4 Oct 2004, Frank Ellermann wrote:

> It can handle windows-1252, therefore it could also handle 437
> or 858.

Handling a very common proprietary encoding doesn't mean you need to
handle them all.

> Supporting IANA registered charsets is not "encouraging" to
> use this stuff where it isn't needed.

It is. And "IANA registered" is fairly irrelevant. Windows-1252 was widely
used on the Web before it was registered at IANA, and so was text/css, and
text/javascript isn't registered even now. The IANA registrations are a
formality; a useful one if you ask me, but not taken so seriously by most
players in the field. And if someone or his brother registered a few
hundred encodings just because it's possible, should the validator start
supporting those encodings too?

> DOS and OS/2 systems
> with these charsets simply exist, plus applications using
> these charsets, plus text documents using these charsets,

For use on the Web or even in intranets, those files just need to be
converted.

> and authors might wish to add some "text screen shots" in a
> HTML document.

Then we should not encourage them to use e.g. box drawing characters in
them. Images work better in such cases. For an image, you can at least
specify an alt attribute (and a validator will report if you forget to
include any alt attribute); a "text screen shot", especially when
containing characters like box drawing, is just mumbo-jumbo to a screen
reader, for example.
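
To illustrate the kind of check I mean, here is a rough sketch of how a
tool could notice img elements that lack an alt attribute. It is not how
the validator does it (the validator works from the DTD); the class and
function names are made up for this example, and only Python's standard
html.parser module is assumed.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collects the position of every img start tag without an alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []  # list of (line, column) tuples

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; alt="" still counts as present.
        if tag == "img" and "alt" not in dict(attrs):
            self.missing.append(self.getpos())


def find_images_without_alt(markup: str):
    finder = MissingAltFinder()
    finder.feed(markup)
    finder.close()
    return finder.missing


if __name__ == "__main__":
    sample = '<p><img src="board.png" alt="Chess position"><img src="x.png"></p>'
    for line, column in find_images_without_alt(sample):
        print(f"img without alt at line {line}, column {column}")
```

Note that this only detects a missing attribute. Whether the alt text is
actually suitable is something no such check can tell.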

> >> Today it's either windows-1252 or Unicode for scripts
> >> roughly covered by Latin-1.
> > I wonder why you don't mention the most obvious alternative.
>
> Not sure what you're talking about,

ISO-8859-1.

>  [box drawing characters]
> > I think very few people actually use them, and hardly anyone
> > _needs_ them.
>
> They have applications using these characters in their output.

Such antique programs might be interesting for nostalgic reasons, but we
are discussing documents using markup like HTML or XML.

> If you wanted to say that nobody creates _new_ texts with these
> characters you have a point (as far as I'm concerned, but there
> were questions about 437 and 858 more than once here, so some
> users apparently still "need"/want this for whatever reasons).

Whatever the reasons are, the right answer is to convert to an
internationally standardized encoding, such as iso-8859-1 or
utf-8.
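
The conversion itself is a one-time job. As a minimal sketch, assuming
Python 3 (whose standard codecs include "cp437" and "cp858") and a
made-up function name, it could look like this:

```python
def transcode_to_utf8(raw: bytes, source_encoding: str = "cp858") -> bytes:
    """Decode bytes from a DOS code page and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


if __name__ == "__main__":
    # Three box drawing characters as stored in code page 437 or 858.
    dos_bytes = bytes([0xC9, 0xCD, 0xBB])
    print(transcode_to_utf8(dos_bytes, "cp437").decode("utf-8"))
```

Box drawing characters have no counterpart in iso-8859-1, so for data that
contains them utf-8 is the only one of the two targets that works; trying
to encode them as iso-8859-1 raises UnicodeEncodeError in Python.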

> > Depending on what you imagine as the potential use of box
> > drawing characters, they would better be replaced by the use
> > of CSS (especially border properties)
>
> Sure, for _new_ texts.  But if you want to insert some curses
> output of a chess game in your blog "as is" that's no option.

A blog is generated and maintained by software. Get or write software that
can handle the data you want to play with.

> > or images with suitable alt texts
>
> That's a possible workaround,

No, a solution. The "text screen shot" you propose is the workaround, one
that does not work around limitations but creates them.

> > Is this what you meant to present? Why?
>
> The source is pc-multilingual-850+euro, and what you saw was
> the result of applying xhtml.kex on itself.  Only relevant for
> systems where 858 is the native charset, forget it.

OK. I just suspected there might have been some point.

> I wanted to use some symbolic names defined for MathML as far
> as they could be used instead of box drawing characters and PC
> graphics.  Of course no browser supports this, or at least not
> yet.

Yes, I know that. I suspected you didn't, since you referred to such usage
on the Web as an argument in our discussion.

> For HTML 4 I'd have to learn SGML,

Not really. Even the creators of HTML 4 didn't know SGML very well, and
very few people using HTML 4 know SGML.

> For you HTML is fine, because you know
> all practically relevant SGML oddities.

It's not that hard to learn them, and the real oddities are in the
(lack of) browser support for SGML - that is, actual browser behavior, and
this is something we need to know anyway, as authors.

But do you know the practically relevant XML oddities? For example,
the principle that parsers (and hence browsers) need not read
external subsets and need not even tell they don't? That is, they may
happily ignore your attempts to include entity references from an external
file.
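
A small experiment shows the effect. This is only a sketch, assuming the
expat parser as exposed by Python's standard xml.parsers.expat module with
its default settings; the DTD file name and the function name are made up,
and the DTD is never fetched.

```python
import xml.parsers.expat

# Hypothetical document: the entity is supposed to come from an external DTD.
EXTERNAL_DOC = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note SYSTEM "hypothetical-entities.dtd">
<note>caf&eacute;</note>"""

# Same content, but the entity is declared in the internal subset.
INTERNAL_DOC = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note [ <!ENTITY eacute "&#233;"> ]>
<note>caf&eacute;</note>"""


def parse_text(document: bytes):
    """Return (character data seen, names of entities the parser skipped)."""
    chunks, skipped = [], []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    # Without this handler the skipped reference would go entirely unnoticed.
    parser.SkippedEntityHandler = lambda name, is_param: skipped.append(name)
    parser.Parse(document, True)
    return "".join(chunks), skipped


if __name__ == "__main__":
    print(parse_text(EXTERNAL_DOC))
    print(parse_text(INTERNAL_DOC))
```

As far as I can tell, the first document parses without any error and
yields just "caf", the reference being reported only through the optional
skipped-entity callback, whereas the second one yields the full word. Other
parsers may behave differently, which is exactly the problem.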

> > The XHTML 1.0 specification requires the use of one of
> > specific DOCTYPE declarations, literally.
>
> You can't add your own definitions ?

The spec says: "A Strictly Conforming XHTML Document is an XML document
that requires only the facilities described as mandatory in this
specification." The terminology is quite confused and confusing, as so
often in W3C documents when normative conformance is described.
There is no other conformance defined in the specification but
strict conformance. And this has absolutely nothing to do with the issue
of XHTML Strict vs. XHTML Transitional.

But my description was oversimplified. Unlike HTML 4.01, the XHTML 1.0
specification does not say that you must use one of the specific DOCTYPE
declarations listed. Instead it says: "The public identifier included in
the DOCTYPE declaration must reference one of the three DTDs found in DTDs
using the respective Formal Public Identifier. The system identifier may
be changed to reflect local system conventions." So technically you can
add something there. But since it is not mandatory for a parser to process
an external subset, the document would not be a "Strictly Conforming XHTML
document", i.e. not a conforming XHTML document, i.e. not an XHTML
document (though it may well be a valid XML document and might actually be
reported as "Valid XHTML 1.0!" by the validator, which is yet another
indication of the inadequacy of such wordings in the reports).

> But I'm used to update this page
> whenever the validator changes

You are joking, right? Or don't you know that a validator only performs
some trivial syntax checking, without checking _even_ the syntax except in
some respects? (And as you use XHTML, the scope of these checks is more
limited than when using HTML, simply because the metalanguage used for
XHTML is much much more limited - this is the reason for replacing
SGML by XML, remember?)

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Monday, 4 October 2004 06:51:06 UTC