- From: Lee Passey <lee@www.dysfunctionals.org>
- Date: Wed, 20 Feb 2002 14:42:26 -0700
- To: "'Tidy Development'" <tidy-develop@lists.sourceforge.net>
- CC: Dave Raggett <dave.raggett@openwave.com>, html-tidy <html-tidy@w3.org>
I've spent a fair amount of time on an off during the last couple of weeks looking at "mshtm" output from Word 2000. While it has many, many flaws, over-abundance of non-breaking spaces is not one of them. I suspect that the --bare option was more designed to allow people to gracefully downgrade from the most advanced named entities more than to clean up Micro$ofts mess -- that's what I use it for. Along these same lines, another recent named entity that Word 2000 seems enamoured with, and which many of my older browsers support only half-heartedly, is …. So I would like to add some code to the MakeBare option which will convert that entity to  . . . (that's three periods delimited by three non-breaking spaces). Unless I am faced with vehement objections, I'll check in these changes about this time next week. In the meantime, I'm going to continue to explore how we can deal with mshtm more effectively. Dave Raggett wrote: >If I remember rightly, Word 2000 was creating way too many >non-breaking spaces, and it was going to make the user's >work much easier to just get rid of them. Remember the -bare >option is a draconian clean up of the messy HTML exported by >Word 2000. Ideally, some kind soul would volunteer to write >an expert system to figure out the authors intentions from >amongst the complexity of the Word output. This for instance >would involve understanding the various uses of tab stops >and symbols for lists. > >--bare is probably the wrong name. I think I added it to >avoid polluting the work I had done on the word2000 option. >It would be good for someone to investigate the Office 2000 >issue in more detail and come up with a recommendation. > >Regards, > >-- Dave Raggett <dave.raggett@openwave.com> or <dsr@w3.org> >W3C Visiting Fellow, see http://www.w3.org/People/Raggett >tel/fax: +44 1225 866-240 (or 867-351) +44 771 213 7629 (gsm) > >-----Original Message----- >From: tidy-develop-admin@lists.sourceforge.net >[mailto:tidy-develop-admin@lists.sourceforge.net]On Behalf Of Jelks >Cabaniss >Sent: Saturday, February 02, 2002 3:34 PM >To: 'Tidy Development' >Subject: RE: [Tidy-dev] MakeBare and hyphens > > >Lee Passey wrote: > >> Filters from Word and PowerPoint often use smart >> quotes resulting in character codes between 128 >> and 159. Unfortunately, the corresponding HTML 4.0 >> entities for these are not widely supported. The >> following converts dashes and quotation marks to >> the nearest ASCII equivalent. >> > >>The code maps m- and n-dashes to hyphen, right and left >>double quotes (which never seem to look good on the screen) >>to " and right and left single quotes to '. >> > >Ah, so -bare is kind of a "downgrade Unicode characters to their ASCII >equivalents"? In that case, it could indeed be useful at times. > >>the -word-2000 >>option correctly maps these codes to their unicode >>equivalents, which are in the range 0x2013 to 0x201E, but >>only the most recent UAs know how to display them correctly. >> > >Last I looked, NS4 would display these correctly if the &#nnn; >representation were used (at least the dashes, don't recall about the >curly quotes). So if the author wanted, he could invoke >-numeric-entities. > >>Unfortunately, it also maps to  (the reqular >>ASCII space) which _is_ fairly universally recognized. If I >>ruled the world, I would remove that "feature." >> > >Agreed. Getting rid of 's is an assumption sometimes warranted, >but sometimes otherwise. > > >/Jelks > > >_______________________________________________ >Tidy-develop mailing list >Tidy-develop@lists.sourceforge.net >https://lists.sourceforge.net/lists/listinfo/tidy-develop > > > > >_______________________________________________ >Tidy-develop mailing list >Tidy-develop@lists.sourceforge.net >https://lists.sourceforge.net/lists/listinfo/tidy-develop >
Received on Wednesday, 20 February 2002 16:45:09 UTC