RE: encodings, and "publishing documents" [Re: Are the public HTML DTDs valid XML?] from Christian Wolfgang Hujer on 2001-12-07 (www-html@w3.org from December 2001)

From: Christian Wolfgang Hujer <Christian.Hujer@itcqis.com>
Date: Fri, 7 Dec 2001 18:01:04 +0100
To: "Vadim Plessky" <lucy-ples@mtu-net.ru>, <www-html@w3.org>
Message-ID: <000101c17f40$c1a2f000$3495e23e@andromedacwh>
Hello Vadim,

> -----Original Message-----
> |   I haven't used Cyrillic that much, I only use Cyrillic, next to Klingon
> | and Bopomofo, in XML courses to demonstrate students the power of Unicode.
> | But my advice is definitely not useless, but also very useful for all
> | non-Latin alphabets.
>
> Now I should ask you what is Klingon and Bopomofo. :-)
>
> well, I know that this is quite common practice to encode *non-ASCII*
> characters (using &xxxx; ).
> I found MS Word guilty in such broken practice, Macromedia products have
> same problems (not always but quite often). Allaire HomeSite tend to do
> this as well (and often doesn't understand cut'n'pasted Cyrillic due to
> this reason).
> I refer here to Windows versions of those programs.
> BTW it partially explains why I do not use Windows anymore :-))
What about that practice is broken?

It is good practice to encode all non-ASCII characters using &xxxx; because
then any software that can read ASCII can read the document.
If you use UTF-8, even Internet Explorer 6.0 will get XHTML documents wrong
when the <meta/> declaration for the charset is missing.
And sometimes Internet Explorer does not interpret the charset declaration
in a <meta/> element at all.
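
To illustrate what I mean by encoding non-ASCII characters, here is a minimal
sketch in Python. It is only an illustration, not the tool I actually use;
the function name and the sample text are made up for this example. It writes
every non-ASCII character as a decimal numeric character reference, so the
result contains nothing but ASCII bytes.

```python
def to_ascii_with_references(text: str) -> bytes:
    """Encode text as ASCII, replacing each non-ASCII character
    with a decimal numeric character reference such as &#252;."""
    # "xmlcharrefreplace" is a standard error handler of str.encode().
    return text.encode("ascii", errors="xmlcharrefreplace")


# Hypothetical sample text: German umlauts plus a Cyrillic name.
sample = "Grüße, Вадим"
published = to_ascii_with_references(sample)
print(published)
# b'Gr&#252;&#223;e, &#1042;&#1072;&#1076;&#1080;&#1084;'
```

Any program that can read ASCII can read these bytes, whatever it assumes
about the charset.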


> |   To be precise, I didn't mention I meant *publishing*, not *writing*.
> |   Now I say it.
> |   I mean the encoding for publishing, not the encoding for writing.
>
> ok, now I am confused.
> Anyway, let me explain how I see typical publishing of *documents* (under
> *documents* I understand typical business memo, article in newspaper, etc.)
> You type article/text in word processor. Then you "Save As HTML". You get
> HTML or XHTML file as an output.
> What encoding you get in such HTML/XHTML depends on your word processor.
> I use KWord in such cases, and save docs as XHTML Strict with CSS2
> formatting.
> MS IE, Mozilla, Netscape, Konqueror do not have problems opening/rendering
> such docs.
> Default encoding for such HTML exported from KWord is Unicode/UTF8.
If *publishing* means *saving* documents from a "word processor" (ouch,
that's not the right tool to generate HTML, especially if its name is
Microsoft Word), then you're right.

But for me, even Macromedia DreamWeaver is not the right tool to create XHTML
documents. I want valid documents. So I threw away HomeSite, Fusion,
FrontPage, DreamWeaver and all those.

I write XHTML by hand and use transformations for all tasks such as adding
tables of contents, headers, footers, style and so on.
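
As a rough idea of what such a transformation step does, here is a small
sketch. It is written in Python with the standard ElementTree module rather
than as a real stylesheet, and it is not my actual setup: the sample document,
the function name and the use of h2 headings with id attributes are all
assumptions for this example, and the XHTML namespace is left out to keep it
short.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-written source document (namespace omitted for brevity).
SOURCE = """<html><body>
<h2 id="intro">Introduction</h2><p>First part.</p>
<h2 id="usage">Usage</h2><p>Second part.</p>
</body></html>"""


def add_toc(xhtml: str) -> str:
    """Insert a list of links to all h2 headings at the top of the body."""
    root = ET.fromstring(xhtml)
    body = root.find("body")
    toc = ET.Element("ul")
    for heading in body.iter("h2"):
        item = ET.SubElement(toc, "li")
        link = ET.SubElement(item, "a", href="#" + heading.get("id"))
        link.text = heading.text
    # Insert only after the loop so the tree is not modified while iterating.
    body.insert(0, toc)
    return ET.tostring(root, encoding="unicode")


result = add_toc(SOURCE)
print(result)
```

The hand-written source stays free of generated parts; the table of contents
exists only in the published output.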

> [...]
> |   > For all other cases, you should use Unicode (UTF-8).
> |   > Unicode TTF fonts are widely available nowadays, so I see no problem
> |   > with transition to Unicode. Windows 2000 has good support for Unicode,
> |   > KDE (Linux, UNIX, FreeBSD) supports Unicode natively and I guess
> |   > MacOS X too.
> |   > So all major platforms completed migration and supporting *legacy*
> |   > techniques like &uuml; for Umlaut make no sense anymore.
> |
> |   That's where I cannot agree.
>
> why?
Mac OS 9 and older have a big market share in the USA and a small (but not
too small) market share in Europe. Unicode is a problem for browsers on
Mac OS.

> |
> |   - Does your cell phone have Unicode/UTF-8 support?
>
> it's pretty well known that current models of mobile phones are terrible.
> I hope you don't use some "recent model with WAP support", do you?
> Anyway, until 3G cellular networks become common, mobile phone users will
> not use Internet from phones.
>
> |   - Do Opera 5, 4, 3.6, Voyager, iBrowse, AWeb have Unicode/UTF-8 support?
>
> It's known that Opera5 has problems with Unicode support.
> IIRC this was one of the (official) reasons why MSN blocked access for
> Opera browser. (you can check some links on my web page,
> http://kde2.newmail.ru)
> Please get me correctly: I like Opera browser, it has nice features.
> But fact that Opera5 can't support Unicode correctly - is problem of
> company named Opera Software.
> I use Konqueror, it has good Unicode support.
> as about Voyager, iBrowse, AWeb - I guess these are some minor/experimental
> browsers? I haven't heard about those ones.
> If they do not support Unicode - then they should get support ASAP.
> Otherwise they will disappear earlier than they matured :-)
Well, they are all about 8 years old, except for AWeb, which is a bit
younger.

Their OS, Amiga OS, just doesn't have a big market share.

> |   - How many users do Amiga OS, Atari, BeOS, Mac OS 9 and older, some
> |   older Linuxes, BSDs etc. have?
>
> my recent research on nation-wide sites (in Russia) shows that web surfers
> with MacOS have 0.9% market share, and Linux users from 1.3% to 2.6%,
> depending on method of calculation. This means that Windows has about 96.5%
> market share. I am going to write article about it but that article is not
> yet ready.
> Amiga OS, Atari, BeOS - is history.
> // don't get me wrong I was programming on AtariST in 1988. But again,
> that's history.

Vintage cars are also history, but streets are still built in a way that
vintage cars can drive on them.
Amiga OS, Atari, BeOS might be history to *you*, but not to the freaks that
"still" use them.
But this is not the place to discuss that.

But just as you would like to see support for Linux, they would like to see
support for their platforms, simply through the use of standards, and a
chance to migrate to newer technologies step by step.


> |   So
> |   a) Legacy encodings are bad for known reasons
> |   b) UTF-8 is still not supported enough
> |   What's left?
> |   Yes, ASCII.
>
> Not for Cyrillic users.
Yes, for all users. That's what character entities are for.

Of course, as I already said, I am not asking you to *write* them. I just
said that UTF-8 as an encoding isn't supported widely enough, so in general
it's best to use ASCII and character entities for publishing.

> There are appx. 300 million people using Cyrillic. It's usage includes
> Russian language but not limited to it. Bulgarian, Serbian, Macedonian,
> Ukrainian, Belorussian languages, other ex-USSR countries use Cyrillic
> alphabet.
> ASCII knows nothing  about Cyrillic.
But character entities do.

> What options are left for us? Well, people in Russia use windows-cp1251
> encoding, invented and implemented by Microsoft.
To me that's not an "encoding" at all, it's ******** ;)

> ASCII has no usage here. So, frankly speaking your (mine) choice is between
> 2 options: cp-1251 and Unicode.  I prefer Unicode but agree that cp1251 has
> dominance, due to the fact that Microsoft software is installed on more
> than 90% of desktops.
I prefer the third choice: ASCII, with character entities for the Unicode
characters of the Cyrillic alphabet.
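
To show that nothing gets lost on the way, here is a small round-trip sketch
in Python (again only an illustration; the sample word is my own choice). The
Cyrillic text is published as pure ASCII with numeric character references,
and a reader that resolves the references gets the original characters back.

```python
import html

# Hypothetical sample word in Cyrillic.
cyrillic = "Привет"

# Publishing step: pure ASCII, one numeric reference per Cyrillic letter.
ascii_form = cyrillic.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_form)
# &#1055;&#1088;&#1080;&#1074;&#1077;&#1090;

# Reading step: resolving the references restores the original text.
restored = html.unescape(ascii_form)
print(restored == cyrillic)
# True
```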

> |   Of course I do not suggest you *write* using ASCII, that can be
> |   annoying,
> |   even in German, where it is required to use &auml;, &Auml;, &ouml;,
> | &Ouml;, &uuml;, &Uuml; and &szlig;. How annoying must it be in Chinese!
> |   I suggest write in whatever encoding you like.
> |
> |   I suggest you *publish* in ASCII because that's always supported.
>
> well, let me back here XML (while I understand that it can be partially
> off-topic on www-html mailing list)
> default encoding for XML is UTF-8. So frankly speaking I do not
> understand why you want to use ASCII when UTF-8 is default (standard)
UTF-8 and UTF-16 are both defaults. Since ASCII is a subset of UTF-8, every
ASCII document is also a valid UTF-8 document, so ASCII automatically is
default, too.
So what's the point about "default"?
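
A quick sketch in Python of why ASCII falls under the default (the sample
markup is made up for this example): an ASCII document has byte-for-byte the
same form in UTF-8, and an XML parser given no encoding declaration reads it
without trouble and resolves the character references.

```python
import xml.etree.ElementTree as ET

# Hypothetical ASCII-only document with numeric character references.
document = "<p>Gr&#252;&#223;e</p>"

ascii_bytes = document.encode("ascii")
utf8_bytes = document.encode("utf-8")
print(ascii_bytes == utf8_bytes)
# True

# No encoding declaration: the parser falls back to the XML default.
element = ET.fromstring(ascii_bytes)
print(element.text)
# Grüße
```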


> As I have mentioned, I use KWord for documents. KWord's native format is
> XML.
> XML documents are encoded in UTF8. To save disk space, all XML files are
> gzipped. (.tar.gz)
> KWord "publishes" docs in HTML, XHTML, PostScript or PDF. New export
> filters are coming, but current list covers more than 99% of typical usage.
Great. So what's the point?

> So thanks for proposed conversion method but I think it's rather
> useless for me, as I use more advanced techniques ;-)

"more advanced" - you should tell that to James J. Clark, the inventor of
XSLT.

KWord won't automatically add icons for off-site or foreign-language links,
so what is supposed to be "more advanced" about it?
Anyway, I prefer vim ;) (which handles Unicode and UTF-8; I use UTF-8 for
writing as well, but ASCII for publishing).


Of course the final solution is UTF-8. But as long as many major browsers,
including IE, have problems with UTF-8, an intermediate but compatible
solution is required. And ASCII is compatible with everything except EBCDIC
and the like.


Greetings

Christian
Received on Friday, 7 December 2001 12:02:57 UTC