Re: Stripping lang markup

I am not positive whether the "lang" attribute is part of the HTML 4.0 
specification.  I do not use that attribute, so I am not 
really familiar with the issue.

However, I am very familiar with using MS Word's "save as html" 
feature and find the proprietary HTML that M$ uses to be annoying and 
frustrating.  What's worse is Word XP, which has now introduced "smart 
tags" that litter each document with vast quantities 
of unneeded proprietary XML markup for all sorts of idiotic things.  
It would be nice if M$ could truly make the tags "smart" by allowing 
them to be turned off so that those of us who do not like them can 
opt out of their brand of HTML.  

It's truly ridiculous.  Once you remove the proprietary HTML, the 
document size is reduced 5- to 10-fold.

peace,
dude

> 
> Hi Jukka,
> 
> Thanks for the note.  You have found the right place.
> 
> +1 on your proposal.   I agree that the text language
> identification, e.g. <span lang="fr">Foo</span>, should be kept.  
> It is a simple tweak to avoid stripping it.
> 
> Comments anyone?
> 
> take it easy,
> Charlie
> 
> 
> At 12:52 PM 3/18/2003 +0200, Jukka K. Korpela wrote:
>> I have found the "Strip surplus tags in Word 2000 pages" in
>> HTML-Kit very useful.
>> 
>> However, it seems to go too far in one issue: it removes language
>> markup.  Specifically, when a document is saved in "HTML" format
>> by Word, information about language is included. This seems to
>> apply to such information as specified by the author's settings
>> of the language for different parts of the document (as opposed
>> to Word's heuristic guesses of language).
>> 
>> It is useful for an author to set the language for text fragments
>> in Word, since he gets spelling checking according to each
>> language.  It is potentially very useful to carry this
>> information over to HTML.  Although language markup is still used
>> rather little, it has great potential, as described e.g. at
>> http://www.w3.org/TR/html4/struct/dirlang.html
>> http://www.w3.org/TR/WCAG10/#gl-abbreviated-and-foreign
>> 
>> And part of making the potential a reality is to favor the use of
>> language markup - preferably so that it's easy for the author. And
>> Word makes it relatively easy.
>> 
>> What happens when I, say, paint some words in a document
>> otherwise in Finnish and use Word's menu to set the language of
>> those words to English (in a situation where Word's heuristics
>> haven't worked)? I don't really know the internals of Word, but
>> what it writes into "HTML" is like this:
>> 
>> <p class=MsoNormal><span lang=FI>(text in Finnish)</span><span
>> lang=EN-US style='mso-ansi-language:EN-US'>(text in
>> English)</span><span lang=FI>(text in
>> Finnish)</span></p>
>> 
>> This isn't optimally clever, and the style attribute with its
>> Microsoft-specific content should be stripped off, but the lang
>> attributes _should_ be preserved, together with any <span> markup
>> needed to carry them. The lang attribute is completely standard
>> (though HTML-Kit might wish to add equivalent xml:lang attributes
>> as well) and structural.
>> 
>> Note that currently HTML-Kit strips away e.g. a LANG attribute
>> from <body> as well, so this isn't just a by-product of stripping
>> <span>.
>> 
>> --
>> Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
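
The filtering Jukka proposes in the quoted message can be sketched roughly as follows. This is only an illustration in Python using the standard library's html.parser, not how HTML-Kit actually implements its "Strip surplus tags" feature. The names WordCleaner and clean_word_html are made up for this example, and the rule for what counts as "Word-specific" (style values mentioning mso-, class names starting with Mso) is an assumption.

```python
from html import escape
from html.parser import HTMLParser


class WordCleaner(HTMLParser):
    """Re-emit HTML, dropping Word-specific attributes but keeping lang.

    Hypothetical helper for illustration only. Comments, doctypes and
    processing instructions are silently dropped in this sketch.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    @staticmethod
    def _keep(name, value):
        # Assumption: "Word-specific" means a style value mentioning mso-
        # or a class name starting with Mso. Everything else, including
        # lang on <span> and <body>, is kept.
        if name == "style" and value and "mso-" in value.lower():
            return False
        if name == "class" and value and value.startswith("Mso"):
            return False
        return True

    def _render_attrs(self, attrs):
        parts = []
        for name, value in attrs:
            if not self._keep(name, value):
                continue
            if value is None:
                parts.append(f" {name}")
            else:
                # The parser unescapes attribute values, so escape again.
                parts.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{tag}{self._render_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._render_attrs(attrs)} />")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def clean_word_html(source):
    """Return source with Word-specific attributes removed, lang preserved."""
    parser = WordCleaner()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Feeding it the paragraph from Jukka's example should leave the <span> elements and their lang attributes in place while removing class=MsoNormal and the mso-ansi-language style. A <body lang=FI> start tag would likewise keep its lang attribute. The sketch does not try to merge adjacent spans or add xml:lang.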

Received on Tuesday, 18 March 2003 12:14:19 UTC