- From: Christian Wolfgang Hujer <Christian.Hujer@itcqis.com>
- Date: Fri, 7 Dec 2001 14:17:06 +0100
- To: "Vadim Plessky" <lucy-ples@mtu-net.ru>, <www-html@w3.org>
Hello Vadim,

> -----Original Message-----
> On Friday 07 December 2001 00:20, Christian Wolfgang Hujer wrote:
> [...]
> |
> | I recommend the use of ASCII only and encoding all Unicode characters
> | with a character number greater than 159 (128 to 159 are of no interest,
> | they are control characters and may not be used in XML documents anyway)
> | using their corresponding character entities, e.g. &#252; for the German u
> | Umlaut or &#260; for the Polish A with "ogonek".
> |
>
> Hello Christian!
>
> I guess you have never used Cyrillic - as your advice (quoted above) is
> absolutely useless for Cyrillic-based alphabets.

That's partially not true :)
I haven't used Cyrillic that much; I only use Cyrillic, next to Klingon and Bopomofo, in XML courses to demonstrate the power of Unicode to students.
But my advice is definitely not useless; it is very useful for all non-Latin alphabets.

To be precise, I didn't mention that I meant *publishing*, not *writing*.
Now I say it: I mean the encoding for publishing, not the encoding for writing.

> You should use ISO-8859-1 or its successor, ISO-8859-15, only
> when your page uses this character range.

...and one is too lazy to use ASCII.

> For all other cases, you should use Unicode (UTF-8).
> Unicode TTF fonts are widely available nowadays, so I see no problem with
> transition to Unicode. Windows 2000 has good support for Unicode,
> KDE (Linux, UNIX, FreeBSD) supports Unicode natively and I guess MacOS X too.
> So all major platforms completed migration and supporting *legacy* techniques
> like &#252; for Umlaut makes no sense anymore.

That's where I cannot agree.
- Does your cell phone have Unicode/UTF-8 support?
- Do Opera 5, 4, 3.6, Voyager, iBrowse, AWeb have Unicode/UTF-8 support?
- How many users do Amiga OS, Atari, BeOS, Mac OS 9 and older, some older Linuxes, BSDs etc. have?

So
a) Legacy encodings are bad for known reasons
b) UTF-8 is still not supported enough

What's left? Yes, ASCII.

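For illustration only, here is a minimal Python sketch of what "ASCII plus character references" looks like for German, Polish and Cyrillic text. It is not part of the toolchain described in this message; the function name `to_ascii_with_charrefs` is hypothetical, and the sketch assumes the input is already a decoded Unicode string.

```python
def to_ascii_with_charrefs(text: str) -> bytes:
    """Encode text as pure ASCII.

    Every character outside ASCII is written as a decimal numeric
    character reference (&#N;); ASCII characters, including markup
    such as < and >, pass through unchanged.
    """
    return text.encode("ascii", errors="xmlcharrefreplace")


if __name__ == "__main__":
    # German u umlaut and Polish A with ogonek, as in the quoted advice.
    print(to_ascii_with_charrefs("ü"))   # b'&#252;'
    print(to_ascii_with_charrefs("Ą"))   # b'&#260;'
    # Cyrillic works the same way; each letter becomes one reference.
    print(to_ascii_with_charrefs("<p>Привет</p>").decode("ascii"))
```

The result contains only bytes below 128, so any client that can read ASCII markup can transport it; whether the client can then display the referenced characters is a separate question of fonts.
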
Of course I do not suggest you *write* using ASCII; that can be annoying, even in German, where ä, Ä, ö, Ö, ü, Ü and ß are required. How annoying must it be in Chinese!
Write in whatever encoding you like.
I suggest you *publish* in ASCII because that's always supported.

A simple transformation like

<xsl:transform
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
    <!-- ASCII output: the serializer escapes everything it cannot represent -->
    <xsl:output
        indent="no"
        encoding="ASCII"
        doctype-public="-//W3C//XHTML Basic 1.0//EN"
        doctype-system="http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd"
    />
    <!-- Identity template: copy the input through -->
    <xsl:template match="node()|@*">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>
</xsl:transform>

will read your UTF-8, Cyrillic or whatever XHTML document and output an ASCII document with all non-ASCII characters encoded as numeric character references.
And, as a side effect, it will remove much (not all) superfluous whitespace and comments, too (if you don't like that, set indent to yes and add a copy template for comments).
The transformation is not perfect; strip-space and preserve-space are missing.

And last but not least, ASCII is not a legacy encoding; it is a subset of UTF-8.

And there's no reason to worry about the growing file size.
Simply use "cat file.html | gzip -9 -f >file.html.gz" when publishing, and most browsers will receive a compressed version (assuming the server is set up to offer the .gz variant).

A Makefile like this helps (needs Cygwin on Windows):

SRCDIR=src/
DESTDIR=htdocs/
SRCFILES=$(shell find ${SRCDIR} -name "*.html")
DESTFILES=$(patsubst ${SRCDIR}%,${DESTDIR}%,${SRCFILES})
ALLFILES=${DESTFILES} $(addsuffix .gz,${DESTFILES})

.PHONY: all
all: ${ALLFILES}

# Transform each source document to ASCII (recipe lines start with a tab)
${DESTDIR}%.html: ${SRCDIR}%.html
	saxon $< toAscii >$@

# Compress each published document next to the uncompressed one
${DESTDIR}%.html.gz: ${DESTDIR}%.html
	cat $< | gzip -9 -f >$@

Since I didn't copy existing files but wrote it from memory, there might be typos.

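The file-size point can be checked with a small Python sketch. It is an illustration, not the Makefile's method: the sample page and the helper name `gzip_bytes` are made up, and real pages will compress differently than this highly repetitive text.

```python
import gzip


def gzip_bytes(data: bytes) -> bytes:
    """Compress data at the highest level, like `gzip -9`."""
    return gzip.compress(data, compresslevel=9)


# Hypothetical sample page: repetitive Cyrillic text inside a paragraph.
page = "<p>" + "Привет, мир! " * 200 + "</p>"

utf8_page = page.encode("utf-8")
# ASCII with numeric character references for every non-ASCII character.
ascii_page = page.encode("ascii", errors="xmlcharrefreplace")

if __name__ == "__main__":
    print("raw bytes     utf-8:", len(utf8_page), " ascii:", len(ascii_page))
    print("gzipped bytes utf-8:", len(gzip_bytes(utf8_page)),
          " ascii:", len(gzip_bytes(ascii_page)))
```

The uncompressed ASCII version is several times larger than the UTF-8 one, because each Cyrillic letter turns into a seven-byte reference, but the references are so regular that gzip removes most of that overhead.
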
And I know it's also possible to solve this task with Ant in a more platform-independent way, and probably a bit faster, because Ant, if properly used, will start the JVM for the transformation only once.

I understand your protest, but your protest is not necessary.

Greetings

Christian
Received on Friday, 7 December 2001 08:19:04 UTC