- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Tue, 18 Mar 2003 12:52:49 +0200 (EET)
- To: html-tidy@w3.org
I first sent this message to HTML-Kit maintainers. I was kindly told that the feature I wrote about is actually a Tidy feature. Please excuse me for describing the problem in HTML-Kit terms. I hope you can translate them into Tidy concepts. I have found the "Strip surplus tags in Word 2000 pages" in HTML-Kit very useful. However, it seems to go too far in one issue: it removes language markup. Specifically, when a document is saved in "HTML" format by Word, information about language is included. This seems to apply to such information as specified by the author's settings of the language for different parts of the document (as opposite to Word's heuristic guesses of language). It is useful for an author to set the language for text fragments in Word, since he gets spelling checking according to each language. It is potentially very useful to carry this information over to HTML. Although language markup is still used rather little, it has great potential, as described e.g. at http://www.w3.org/TR/html4/struct/dirlang.html http://www.w3.org/TR/WCAG10/#gl-abbreviated-and-foreign And part of making the potential a reality is to favor the use of language markup - preferably so that it's easy to the author. And Word makes it relatively easy. (It's apparently not automatic, though. If Word heuristically recognizes the language of a text fragment, it applies suitable spelling checking algorithms but doesn't seem to store the information about language into the document when saved onto disk. Perhaps the reason is that the authors of Word thought that it would be redundant to store information that can be reconstructed on the fly.) What happens when I, say, paint some words in a document otherwise in Finnish and use Word's menu to set the language of those words English (in a situation where Word's heuristics hasn't worked)? I don't really know the intrinsics of Word, but what it writes into "HTML" is like this: <p class=MsoNormal><span lang=FI>(text in Finnnish)</span><span lang=EN-US style='mso-ansi-language:EN-US'>(text in English)</span><span lang=FI>(text in Finnish)</span></p> This isn't optimally clever, and the style attribute with a Microsoft specific stuff should stripped off, but the lang attributes _should_ be preserved, together with any <span> markup needed to carry them. The lang attribute is completely standard (though HTML-Kit might wish to add equivalent xml:lang attributes as well) and structural. Note that currently HTML-Kit strips away e.g. a LANG attribute from <body> as well, so this isn't just a by-product of stripping <span>. -- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Tuesday, 18 March 2003 05:54:25 UTC