RE: don't collapse two spaces at the end of a sentence

From: Reitzel, Charlie <CReitzel@arrakisplanet.com>
Date: Wed, 2 Jan 2002 19:16:35 -0500
Message-ID: <B5C79DDBC655D311B6BD0008C7E64D76013C1A29@exchange.arrakisplanet.com>
To: "'Richard A. O'Keefe'" <ok@atlas.otago.ac.nz>, html-tidy@w3.org
Interesting points.  In particular, your previous point about the rendering
of HTML _source_ vs. the document itself is well taken.  

A couple questions:

1) Can you give us a link to current TeX sources?  I'll bet these will be
generally useful.

2) Can you give us a reference to the Unicode sentence break algorithm?  I
searched at www.unicode.org, but didn't see it.  I did find line break
algorithms, but that is something else.

3) Can you give some guidance on where, within the TeX sources, you would
find the sentence end detection code (and, by implication, how you arrive at
your size estimate for sentence end support)?

In the end, I think it boils down to priorities.  I get the impression that
decent HTML handling is more important than source niceties.  For example, I
would guess that decent Asian language support is more important than
handling two spaces after sentences.  Patches are always welcome, however.

take it easy,
Charlie


-----Original Message-----
From: Richard A. O'Keefe [mailto:ok@atlas.otago.ac.nz]
Sent: Monday, December 17, 2001 9:12 PM
To: Todd_Lewis@unc.edu; html-tidy@w3.org; lee@novonyx.com
Subject: Re: don't collapse two spaces at the end of a sentence


     I understood the original problem to be that when Tidy
     rewraps raw blocks of text, it doesn't do the two-space
     two step.

The problem is not that it doesn't _add_ double-spacing,
but that it doesn't _preserve_ double-spacing that is already there.
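
To make the distinction concrete, here is a minimal sketch in Python (not
Tidy's actual code; the function name and the default width are made up for
illustration).  It assumes that "preserving" means: a gap of two or more
spaces inside a source line is kept as two spaces after rewrapping, and
everything else collapses to one.  Note that it needs no sentence detection
at all.

```python
import re

def rewrap_preserving_gaps(text, width=68):
    """Hypothetical helper: re-wrap `text` to `width` columns, keeping any
    gap of two or more spaces between words as exactly two spaces."""
    # Pair every word with the whitespace that follows it in the source.
    tokens = re.findall(r"(\S+)(\s*)", text)
    lines = []
    current = ""
    pending_gap = " "  # separator to place before the next word
    for word, gap in tokens:
        candidate = current + pending_gap + word if current else word
        if current and len(candidate) > width:
            # The word does not fit: start a new line and drop the gap.
            lines.append(current)
            current = word
        else:
            current = candidate
        # Only a run of two or more literal spaces counts as a deliberate
        # double space; a gap containing a newline becomes a single space.
        pending_gap = "  " if re.fullmatch(r" {2,}", gap) else " "
    if current:
        lines.append(current)
    return "\n".join(lines)

sample = "First sentence.  Second one is here.\nThird  bit."
print(rewrap_preserving_gaps(sample, width=20))
```

One limitation of this sketch: a double space that falls at the end of a
source line is lost, because the line break hides it.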

     All the issues you brought up about how to determine the
     end of sentences (in various languages no less) have been
     worked out for years in TeX, and the code is free for the
     taking.

Since HTML 4 and XHTML are based on Unicode, it may be relevant to note that
the Unicode standard includes a method for determining sentence boundaries.
It's not claimed to be perfect, but it works pretty well for a wide range of
languages and scripts.

     If it were important enough to some coder to preserve his
     two spaces (or "correct" it in HTML from other authors /
     sources), then he could take the appropriate part of TeX's
     code and incorporate it into Tidy, thereby doubling its
     size (or thereabouts -- I'm guessing).

A very wild guess indeed.  A better guess would be 0.5%.
Received on Wednesday, 2 January 2002 19:16:42 GMT
