W3C home > Mailing lists > Public > html-tidy@w3.org > April to June 2001

RE: SourceForge Project Approved

From: Reitzel, Charlie <CReitzel@arrakisplanet.com>
Date: Tue, 29 May 2001 11:57:18 -0400
Message-ID: <B5C79DDBC655D311B6BD0008C7E64D76013C1573@exchange.arrakisplanet.com>
To: "'Richard A. O'Keefe'" <ok@atlas.otago.ac.nz>, Valeri.Atamaniouk@nokia.com, html-tidy@w3.org
Hey guys, why don't we take this over to Source Forge where it will do some
good?  I think you are both capturing some important points for the library
to address.  If things go well this summer, we may have the opportunity to
confront these issues in the fall.

My digs: 
Many C compilers (all PC implementations) implement a decent suballocator
within malloc/free.  Some Unix compilers may not, yet.  I'm not sure.  Tidy
already uses a wrapper function for memory allocation.  On platforms that
need it, there is no reason why we can't call a sub-allocator.

I think setjmp/longjmp are a last resort.  I'm glad to see agreement about
_not_ exposing these to the public interface.  If need be, this can be
implemented later w/out breaking application code.  Valeri, I think you
identified the thread-safety problem exactly.

I would rather see a clean, exception-safe coding style used.  I.e. would
shouldn't leave data structures in a munged state if an error is
encountered.  Thus, cleanup will alway be possible.  This goes hand-in-hand
with a "no globals" design.  Thus, free-ing a Tidy document object will
release all related resources.  

For the time being, I'd rather see the discussion focus on what the public
interface for error handling should look like.  I think the application
should control the lifetime of the document objects.  For example, it may be
desireable to keep the object around to report the error, including error
message(s) and locator(s) into the input document.

More on SF.

take it easy,
Charlie

-----Original Message-----
From: Richard A. O'Keefe [mailto:ok@atlas.otago.ac.nz]
Sent: Sunday, May 27, 2001 8:22 PM
To: Valeri.Atamaniouk@nokia.com; html-tidy@w3.org; ok@atlas.otago.ac.nz
Subject: RE: SourceForge Project Approved


Valeri.Atamaniouk@nokia.com wrote:

	Tidy operates with 32-bit characters :).

Yes, and the W3C DOM ***forbids*** that.

	And it actually consumes more memory but I believe this is the
	fastest solution (ANSI C requires int to be the fastest integral
	type of at least 16bit length).

Since neither C89 nor C99 requires int to be 32 bits wide, this is
irrelevant to a discussion of 32-bit characters.

	Unfortunately I think it is better not to use standard library
	functions:  they may have support for Unicode (UTF-32/UCS4) but

My point exactly, but misunderstood.  They *may*, but they *might not*.
My point was not that wchar_t is a 32-bit type (in many C systems it is
only an 8-bit type) but simply that it is not guaranteed to be a 16-bit
type, which is what the W3C DOM requires.

Tidy operates in an enviroment with "something" on ASCII subset.  It is
better because allow to process a lot of encodings with a minimal
effort.

I wrote:
	> Once again, requiring Tidy-as-a-library to use some sort of 
	> library-level
	> exception handling interface would make it of very little use to C
	> programmers trying to use it.  It would make life *more* 
	> complicated for
	> them, not less.  (It's different in a language with 
	> language-level exceptions.)
	> To do this just so one could conform to a deeply flawed 
	> interface designed for
	> other purposes entirely strikes me as, um, perverse.

Valeri.Atamaniouk@nokia.com wrote:

	Just an example:  is you process something and in
	some_very_deep_function you encounter a problem, you'll have to
	return a error code, check it in the higher-level function,
	return error code etc.  At the end you'll have to release all
	the memory allocated (supposing that structure's integrity is not
	destroyed).

It's not clear what the point of this reply is.  I think we can reasonably
take in that most people reading this list know what exception handling is
for.  I'm sure Dave Raggett does.

The thing is, how is this relevant to HTML Tidy?  The whole point of Tidy
is to *be* the error handling code.  Well, I exaggerate.  But its job is
to take sows' ears and turn them into silk purses, from messy input to
create a clean data structure.

	Another approach: you create a memory allocator and start
processing. All
	functions get their memory from this allocator. If there is a error,
an
	exception is emulated (longjmp would be fine) to the starting point.
Having
	it we just release allocator (and all the memory wich was allocated
from it
	in one operation). This makes unnessesary numerous error checks.
Besides
	mechanism with preallocation of a chunk of a memory and then
distributing it
	as needed works faster, than standard malloc/realloc/free. You will
be
	suprised how much faster.
	
Not really.  I have been allocating memory in great big blocks and running
a simple suballocator for hmm, the last 20 years.  Anyone who is interested
will find the technique expounded in Donald Knuth's book "The Stanford
GraphBase".  For what it's worth, that is a collection of graph algorithms
with data structure doing very complex things, with no general exception-
handling machinery in sight.

The fact remains that the W3C DOM is framed in terms of exception handling
(although strangely, some obvious erroneous operations, like creating a
comment with an embedded "--", are *forbidden* to raise an exception) and
there is no "standard model" or consensus for how to translate that to C.
Converting Tidy to use the W3C DOM would not provide any portability
benefits to anyone whatsoever.

The story is very different for JTidy, except that Java programmers like
the W3C DOM so much that they are flocking to an unfinished rival called
JDOM.
Received on Tuesday, 29 May 2001 11:57:43 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 3 April 2012 06:13:45 GMT