- From: Charles Reitzel <creitzel@rcn.com>
- Date: Tue, 18 Mar 2003 11:28:13 -0500
- To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
- Cc: html-tidy@w3.org
Hi Jukka,

Thanks for the note. You have found the right place.

+1 on your proposal. I agree that the text language identification, e.g.
<span lang="fr">Foo</span>, should be kept. It is a simple tweak to avoid
stripping it.

Comments anyone?

take it easy,
Charlie

At 12:52 PM 3/18/2003 +0200, Jukka K. Korpela wrote:
>I have found the "Strip surplus tags in Word 2000 pages" feature in
>HTML-Kit very useful.
>
>However, it seems to go too far in one respect: it removes language
>markup. Specifically, when a document is saved in "HTML" format by Word,
>information about language is included. This seems to apply to such
>information as specified by the author's settings of the language for
>different parts of the document (as opposed to Word's heuristic guesses
>of language).
>
>It is useful for an author to set the language for text fragments in Word,
>since he gets spell checking according to each language. It is
>potentially very useful to carry this information over to HTML. Although
>language markup is still used rather little, it has great potential, as
>described e.g. at
>http://www.w3.org/TR/html4/struct/dirlang.html
>http://www.w3.org/TR/WCAG10/#gl-abbreviated-and-foreign
>
>And part of making the potential a reality is to favor the use of language
>markup - preferably so that it's easy for the author. And Word makes it
>relatively easy.
>
>What happens when I, say, select some words in a document otherwise in
>Finnish and use Word's menu to set the language of those words to English
>(in a situation where Word's heuristics haven't worked)? I don't really
>know the internals of Word, but what it writes into "HTML" is like this:
>
><p class=MsoNormal><span lang=FI>(text in Finnish)</span><span lang=EN-US
>style='mso-ansi-language:EN-US'>(text in English)</span><span
>lang=FI>(text in
>Finnish)</span></p>
>
>This isn't optimally clever, and the style attribute with its
>Microsoft-specific content should be stripped off, but the lang attributes
>_should_ be preserved, together with any <span> markup needed to carry them.
>The lang attribute is completely standard (though HTML-Kit might wish to
>add equivalent xml:lang attributes as well) and structural.
>
>Note that currently HTML-Kit strips away e.g. a LANG attribute from <body>
>as well, so this isn't just a by-product of stripping <span>.
>
>--
>Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
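
The cleanup proposed above (drop the Microsoft-specific style attribute, keep lang and the <span> that carries it) could be sketched roughly as follows. This is only an illustration in Python, not Tidy's or HTML-Kit's actual code. The names WordCleaner and clean_word_html are hypothetical, and the rule for what counts as Word-specific (style values containing "mso-", class names starting with "Mso") is an assumption based on the sample markup quoted in the message.

```python
from html import escape
from html.parser import HTMLParser


class WordCleaner(HTMLParser):
    """Re-emit HTML, dropping Word-specific attributes but keeping lang.

    Sketch only: comments, doctypes and processing instructions are not
    re-emitted, and <br/> style tags come out as an open/close pair.
    """

    def __init__(self):
        # Leave character references alone so they can be passed through.
        super().__init__(convert_charrefs=False)
        self.out = []

    @staticmethod
    def _keep(name, value):
        # Assumed heuristic: Word marks its own styling with "mso-"
        # properties and "Mso..." class names. Everything else, including
        # lang, is kept.
        if name == "style" and value and "mso-" in value.lower():
            return False
        if name == "class" and value and value.lower().startswith("mso"):
            return False
        return True

    def handle_starttag(self, tag, attrs):
        kept = [(n, v) for n, v in attrs if self._keep(n, v)]
        rendered = "".join(
            f" {n}" if v is None else f' {n}="{escape(v, quote=True)}"'
            for n, v in kept
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def clean_word_html(html):
    """Return html with Word-specific attributes removed (hypothetical helper)."""
    parser = WordCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = (
        "<p class=MsoNormal><span lang=FI>suomea</span>"
        "<span lang=EN-US style='mso-ansi-language:EN-US'>English</span></p>"
    )
    print(clean_word_html(sample))
```

In this sketch every <span> is kept and only its attributes are filtered, so the language information survives. A real implementation would probably also want to remove spans that end up with no attributes at all, and to treat a LANG attribute on <body> the same way as one on <span>.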
Received on Tuesday, 18 March 2003 11:17:06 UTC