Stripping lang markup

I first sent this message to HTML-Kit maintainers. I was kindly told that
the feature I wrote about is actually a Tidy feature. Please excuse me for
describing the problem in HTML-Kit terms. I hope you can translate
them into Tidy concepts.

I have found the "Strip surplus tags in Word 2000 pages" feature in
HTML-Kit very useful.

However, it seems to go too far in one respect: it removes language markup.
Specifically, when a document is saved in "HTML" format by Word,
information about language is included. This seems to apply to
language information that the author has set explicitly for
different parts of the document (as opposed to Word's heuristic guesses
of language).

It is useful for an author to set the language for text fragments in
Word, since this gives spell checking appropriate to each language.
It is potentially very useful to carry this information over to HTML.
Although language markup is still used rather little, it has great
potential, as described e.g. at
http://www.w3.org/TR/html4/struct/dirlang.html
http://www.w3.org/TR/WCAG10/#gl-abbreviated-and-foreign

And part of making the potential a reality is to favor the use of
language markup - preferably so that it's easy for the author. And Word
makes it relatively easy. (It's apparently not automatic, though.
If Word heuristically recognizes the language of a text fragment,
it applies suitable spelling checking algorithms but doesn't seem to
store the information about language into the document when saved onto
disk. Perhaps the reason is that the authors of Word thought that it would
be redundant to store information that can be reconstructed on the fly.)

What happens when I, say, select some words in a document otherwise in
Finnish and use Word's menu to set the language of those words to English
(in a situation where Word's heuristics haven't worked)? I don't really
know the internals of Word, but what it writes into "HTML" is like this:

<p class=MsoNormal><span lang=FI>(text in Finnish)</span><span
lang=EN-US style='mso-ansi-language:EN-US'>(text in English)</span><span
lang=FI>(text in Finnish)</span></p>

This isn't optimally clever, and the style attribute with its
Microsoft-specific content should be stripped off, but the lang attributes
_should_ be preserved, together with any <span> markup needed to carry them.
The lang attribute is completely standard (though HTML-Kit might
wish to add equivalent xml:lang attributes as well) and structural.
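
As a rough illustration of the requested behavior (not a description of how
Tidy or HTML-Kit work internally), the Python sketch below removes style
attributes that contain only "mso-" declarations and leaves the lang
attributes and their <span> elements untouched. The helper name
strip_mso_style and the regular expression are assumptions made up for this
example. A real cleaner would work on parsed markup rather than on raw text.

```python
import re

# Hypothetical helper, not part of Tidy or HTML-Kit.
# Matches a quoted style attribute whose value consists solely of
# Microsoft-specific "mso-..." declarations.
MSO_STYLE = re.compile(
    r"""\s+style=(['"])\s*(?:mso-[^;'"]*;?\s*)+\1""",
    re.IGNORECASE,
)

def strip_mso_style(html: str) -> str:
    """Drop mso-only style attributes; lang attributes are not touched."""
    return MSO_STYLE.sub("", html)

# The Word output quoted in the message above.
sample = (
    "<p class=MsoNormal><span lang=FI>(text in Finnish)</span><span\n"
    "lang=EN-US style='mso-ansi-language:EN-US'>(text in English)</span><span\n"
    "lang=FI>(text in Finnish)</span></p>"
)

print(strip_mso_style(sample))
```

With the sample above, all three lang attributes survive and only the style
attribute disappears. Style attributes that mix "mso-" declarations with
ordinary CSS are deliberately left alone in this sketch.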

Note that currently HTML-Kit strips away e.g. a LANG attribute from
<body> as well, so this isn't just a by-product of stripping <span>.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Tuesday, 18 March 2003 05:54:25 UTC