Re: Stripping lang markup

Hi Jukka,

Thanks for the note.  You have found the right place.

+1 on your proposal.   I agree that the text language identification, e.g. 
<span lang="fr">Foo</span>, should be kept.    It is a simple tweak to 
avoid doing so.

Comments anyone?

take it easy,
Charlie


At 12:52 PM 3/18/2003 +0200, Jukka K. Korpela wrote:
>I have found the "Strip surplus tags in Word 2000 pages" in HTML-Kit very 
>useful.
>
>However, it seems to go too far in one issue: it removes language 
>markup.  Specifically, when a document is saved in "HTML" format by Word, 
>information about language is included. This seems to apply to such 
>information as specified by the author's settings of the language for
>different parts of the document (as opposite to Word's heuristic guesses 
>of language).
>
>It is useful for an author to set the language for text fragments in Word, 
>since he gets spelling checking according to each language.  It is 
>potentially very useful to carry this information over to HTML.  Although 
>language markup is still used rather little, it has great potential, as 
>described e.g. at
>http://www.w3.org/TR/html4/struct/dirlang.html
>http://www.w3.org/TR/WCAG10/#gl-abbreviated-and-foreign
>
>And part of making the potential a reality is to favor the use of language 
>markup - preferably so that it's easy to the author. And Word makes it 
>relatively easy.
>
>What happens when I, say, paint some words in a document otherwise in 
>Finnish and use Word's menu to set the language of those words English (in 
>a situation where Word's heuristics hasn't worked)? I don't really
>know the intrinsics of Word, but what it writes into "HTML" is like this:
>
><p class=MsoNormal><span lang=FI>(text in Finnnish)</span><span lang=EN-US
>style='mso-ansi-language:EN-US'>(text in English)</span><span 
>lang=FI>(text in
>Finnish)</span></p>
>
>This isn't optimally clever, and the style attribute with a Microsoft 
>specific stuff should stripped off, but the lang attributes _should_ be 
>preserved, together with any <span> markup needed to carry them.
>The lang attribute is completely standard (though HTML-Kit might wish to 
>add equivalent xml:lang attributes as well) and structural.
>
>Note that currently HTML-Kit strips away e.g. a LANG attribute from <body> 
>as well, so this isn't just a by-product of stripping <span>.
>
>--
>Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Tuesday, 18 March 2003 11:17:06 UTC