
Re: Stripping lang markup

From: Charles Reitzel <creitzel@rcn.com>
Date: Tue, 18 Mar 2003 11:28:13 -0500
To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
Cc: html-tidy@w3.org

Hi Jukka,

Thanks for the note.  You have found the right place.

+1 on your proposal.  I agree that the text language identification, e.g. 
<span lang="fr">Foo</span>, should be kept.  It is a simple tweak to 
avoid stripping it.

Comments anyone?

take it easy,

At 12:52 PM 3/18/2003 +0200, Jukka K. Korpela wrote:
>I have found the "Strip surplus tags in Word 2000 pages" in HTML-Kit very 
>useful.
>However, it seems to go too far in one issue: it removes language 
>markup.  Specifically, when a document is saved in "HTML" format by Word, 
>information about language is included. This seems to apply to such 
>information as specified by the author's settings of the language for
>different parts of the document (as opposed to Word's heuristic guesses 
>of language).
>It is useful for an author to set the language for text fragments in Word, 
>since he gets spelling checking according to each language.  It is 
>potentially very useful to carry this information over to HTML.  Although 
>language markup is still used rather little, it has great potential, as 
>described e.g. at
>And part of making the potential a reality is to favor the use of language 
>markup - preferably so that it's easy for the author. And Word makes it 
>relatively easy.
>What happens when I, say, select some words in a document otherwise in 
>Finnish and use Word's menu to set the language of those words to English 
>(in a situation where Word's heuristics haven't worked)? I don't really
>know the internals of Word, but what it writes into "HTML" is like this:
><p class=MsoNormal><span lang=FI>(text in Finnish)</span><span lang=EN-US
>style='mso-ansi-language:EN-US'>(text in English)</span><span 
>lang=FI>(text in
>This isn't optimally clever, and the style attribute with the 
>Microsoft-specific stuff should be stripped off, but the lang attributes 
>_should_ be preserved, together with any <span> markup needed to carry them.
>The lang attribute is completely standard (though HTML-Kit might wish to 
>add equivalent xml:lang attributes as well) and structural.
>Note that currently HTML-Kit strips away e.g. a LANG attribute from <body> 
>as well, so this isn't just a by-product of stripping <span>.
>Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
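
For anyone who wants to experiment with the behavior proposed above, here is 
a minimal sketch in Python (standard library only).  It is not how Tidy or 
HTML-Kit is implemented; the names strip_mso_declarations, WordCleaner and 
clean are made up for illustration.  It drops "mso-" declarations from style 
attributes and keeps every lang attribute, including one on <body>.  
Comments, doctype declarations and empty-element syntax are not handled.

```python
import html
from html.parser import HTMLParser


def strip_mso_declarations(style):
    """Drop 'mso-*' declarations from a style value; keep the rest."""
    kept = []
    for decl in style.split(";"):
        decl = decl.strip()
        if decl and not decl.lower().startswith("mso-"):
            kept.append(decl)
    return "; ".join(kept)


class WordCleaner(HTMLParser):
    """Hypothetical cleaner: re-emits markup, keeping lang attributes."""

    def __init__(self):
        # Leave character references alone so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if name == "style":
                value = strip_mso_declarations(value or "")
                if not value:
                    continue  # nothing left once the mso-* parts are gone
            if value is None:
                parts.append(name)  # attribute without a value
            else:
                # lang, class, etc. pass through unchanged, quoted.
                parts.append('%s="%s"' % (name, html.escape(value, quote=True)))
        self.out.append("<%s>" % " ".join(parts))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def clean(markup):
    """Return markup with Word-specific style removed and lang preserved."""
    parser = WordCleaner()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = ("<p class=MsoNormal><span lang=FI>suomea</span>"
              "<span lang=EN-US style='mso-ansi-language:EN-US'>English</span></p>")
    print(clean(sample))
```

Run on the sample, the style attribute disappears while both lang attributes 
and their <span> elements survive.  Whether an emptied <span> with no lang 
should then be removed is a separate decision this sketch does not make.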
Received on Tuesday, 18 March 2003 11:17:06 UTC
