RE: Are the public HTML DTDs valid XML?

Hello Vadim,

> -----Original Message-----
> On Friday 07 December 2001 00:20, Christian Wolfgang Hujer wrote:
> [...]
> |
> |   I recommend the use of ASCII only and encoding all Unicode characters
> | with a character number greater than 159 (128 to 159 are of no interest,
> | they are control characters and may not be used in XML documents anyway)
> | using their corresponding character entities, e.g. &#252; for the German u
> | Umlaut or &#260; for the Polish A with "ogonek".
> |
>
> Hello Christian!
>
> I guess you have never used Cyrillic - as your advice (quoted above) is
> absolutely useless for Cyrillic-based alphabets.
That's only partially true :)
I haven't used Cyrillic that much. I only use it, next to Klingon and
Bopomofo, in XML courses to demonstrate the power of Unicode to students.
But my advice is definitely not useless; it is just as useful for all
non-Latin alphabets.

To be precise, I didn't mention that I meant *publishing*, not *writing*. Now
I say it:
I mean the encoding for publishing, not the encoding for writing.

> You should use ISO-8859-1 or its successor, ISO-8859-15, only
> when your page uses this character range.
...and one is too lazy to use ASCII.

> For all other cases, you should use Unicode (UTF-8).
> Unicode TTF fonts are widely available nowadays, so I see no problem with
> transition to Unicode. Windows 2000 has good support for Unicode, KDE (Linux,
> UNIX, FreeBSD) supports Unicode natively and I guess MacOS X too.
> So all major platforms completed migration and supporting *legacy* techniques
> like &uuml; for Umlaut makes no sense anymore.

That's where I cannot agree.

- Does your cell phone have Unicode/UTF-8 support?
- Do Opera 5, 4, 3.6, Voyager, iBrowse, AWeb have Unicode/UTF-8 support?
- How many users do Amiga OS, Atari, BeOS, Mac OS 9 and older, some older
Linuxes, BSDs etc. have?

So
a) Legacy encodings are bad for known reasons
b) UTF-8 is still not supported enough
What's left?
Yes, ASCII.


Of course I do not suggest you *write* using ASCII; that can be annoying,
even in German, where it requires typing &auml;, &Auml;, &ouml;, &Ouml;,
&uuml;, &Uuml; and &szlig;. How annoying must it be in Chinese!
I suggest you write in whatever encoding you like.

I suggest you *publish* in ASCII because that's always supported.


A simple transformation like


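<xsl:transform
	version="1.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
	<!-- US-ASCII output: the serializer writes every non-ASCII character
	     as a numeric character reference -->
	<xsl:output
		indent="no"
		encoding="US-ASCII"
		doctype-public="-//W3C//DTD XHTML Basic 1.0//EN"
		doctype-system="http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd"
	/>

	<!-- Identity template: copies all nodes and attributes unchanged -->
	<xsl:template match="node()|@*">
		<xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
	</xsl:template>

</xsl:transform>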


will read your UTF-8, Cyrillic or whatever XHTML document and output an
ASCII document with all non-ASCII characters encoded as numeric character
references.
Note that the identity template copies comments and whitespace as they are.
If you want to get rid of comments as well, add an empty template matching
comment().

The transformation is not perfect: strip-space and preserve-space are
missing, so superfluous whitespace is kept.
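
For those without an XSLT processor at hand, the effect of an ASCII output
encoding can be imitated in a few lines of Python. This is only a sketch of
the idea, not what Saxon does internally: the function name to_ascii is my
own, and it works on a plain string rather than on a parsed document.

```python
import html


def to_ascii(text: str) -> str:
    """Replace every non-ASCII character by a decimal character reference."""
    # The "xmlcharrefreplace" error handler emits &#NNN; for each character
    # that the ASCII codec cannot represent.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    sample = "ü Ą Привет"
    published = to_ascii(sample)
    print(published)
    # A reader that resolves the references gets the original text back.
    assert html.unescape(published) == sample
```

Plain ASCII text passes through unchanged, so running this over a document
that is already ASCII does no harm.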

And last but not least ASCII is no legacy encoding, it is a subset of UTF-8.
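
The subset claim is easy to check: bytes that are valid ASCII decode to the
same text under UTF-8 (and under ISO-8859-1, for that matter). A small Python
sketch, assuming a made-up sample line; decodes_identically is a hypothetical
helper name.

```python
def decodes_identically(data: bytes,
                        encodings=("ascii", "utf-8", "iso-8859-1")) -> bool:
    """True if the bytes yield the same text under every listed encoding."""
    results = set()
    for enc in encodings:
        try:
            results.add(data.decode(enc))
        except UnicodeDecodeError:
            # At least one encoding cannot read the bytes at all.
            return False
    return len(results) == 1


if __name__ == "__main__":
    published = b"<p>Gr&#252;&#223;e, &#1055;&#1088;&#1080;&#1074;&#1077;&#1090;</p>"
    print(decodes_identically(published))
```

A document containing raw UTF-8 bytes for non-ASCII characters fails this
check, which is exactly the situation an old browser cannot cope with.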

And there's no reason to worry about the growing file size. Simply run
"gzip -9 -c file.html >file.html.gz" when publishing, and, if the server is
set up to deliver the pre-compressed file, most browsers will receive a
compressed version.
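
To get a feeling for the sizes involved, here is a rough Python sketch that
compares the raw and gzipped sizes of the same text as UTF-8 and as ASCII
with character references. The sample text is invented and very repetitive,
so real pages will compress less well; compressed_size is my own helper name.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """Size in bytes after gzip at the highest level (like gzip -9)."""
    return len(gzip.compress(data, compresslevel=9))


text = "Привет, мир! " * 200
as_utf8 = text.encode("utf-8")
as_ascii = text.encode("ascii", "xmlcharrefreplace")

if __name__ == "__main__":
    for label, data in (("UTF-8", as_utf8),
                        ("ASCII with references", as_ascii)):
        print(label, len(data), "bytes raw,",
              compressed_size(data), "bytes gzipped")
```

The ASCII form is clearly larger before compression; how much of that
difference survives gzip depends on the document.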

A Makefile like this helps (needs GNU make; Cygwin on Windows):

SRCDIR=src/
DESTDIR=htdocs/
SRCFILES=$(shell find ${SRCDIR} -name "*.html")
DESTFILES=$(patsubst ${SRCDIR}%,${DESTDIR}%,${SRCFILES})
ALLFILES=${DESTFILES} $(addsuffix .gz,${DESTFILES})

.PHONY: all
all: ${ALLFILES}

# Transform each source document to ASCII; toAscii is the stylesheet above
${DESTDIR}%.html: ${SRCDIR}%.html
	mkdir -p $(dir $@)
	saxon $< toAscii >$@

# Compress the published file, keeping the uncompressed one
${DESTDIR}%.html.gz: ${DESTDIR}%.html
	gzip -9 -c $< >$@


Since I didn't copy existing files but wrote this from memory, there might be
typos.
And I know it's also possible to solve this task with Ant in a more
platform-independent way, and probably a bit faster because Ant will start
the JVM for the transformation only once, if properly used.


I understand your protest, but your protest is not necessary.


Greetings

Christian

Received on Friday, 7 December 2001 08:19:04 UTC