
RE: SourceForge Project Approved

From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
Date: Thu, 24 May 2001 11:25:14 +1200 (NZST)
Message-Id: <200105232325.LAA409489@atlas.otago.ac.nz>
To: Valeri.Atamaniouk@nokia.com, html-tidy@w3.org

Valeri.Atamaniouk@nokia.com wrote:

	In general I agree with you :). But also what prevents you from creating all
	those fine things in C :).

Time and money.  The thing is that Tidy doesn't *need* any of these things
for what it does; it has a data structure that suits it perfectly well.

	DOM do not make any limits on implementation,

This turns out not to be true.  The DOM places very severe limits on how
you implement things, limits which dramatically increase both storage costs
(by a factor of at least 2.5) and running time (I have measured an
approximately 20% slowdown forced by one particular DOM requirement).

	and it is possible to implement strings, storage reclamation &
	exceptions.  As far as I understand the last two would be really
	useful as they would definitely improve performance (exit(2) is not
	an appropriate solution for a library function :)).

Strings?  Yes, but the DOM explicitly requires immutable *UNICODE* strings.
Even if using UTF-8 internally would save you nearly a factor of two in
space, the DOM does not allow you to do that.  (Note that the wchar_t type
and wcs* strings in standard C don't help, because they are commonly 32-bit
characters, not 16-bit characters, which is what the DOM absolutely demands.)
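
To make the space argument concrete, here is a minimal sketch in C.  The
helper names are invented for this illustration and belong to neither Tidy
nor the DOM, and the code assumes its input is valid, NUL-terminated UTF-8.
It compares the bytes needed for the same text as UTF-8 and as 16-bit code
units, and checks whether the local wchar_t happens to be 16 bits wide.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers, for illustration only. */

/* Number of 16-bit code units needed to hold a valid UTF-8 string. */
size_t utf16_units(const char *utf8)
{
    size_t units = 0;
    const unsigned char *p;

    assert(utf8 != NULL);
    for (p = (const unsigned char *)utf8; *p != 0; p++) {
        if ((*p & 0xC0) == 0x80)
            continue;                   /* continuation byte, same character */
        units += (*p >= 0xF0) ? 2 : 1;  /* 4-byte sequences need a surrogate pair */
    }
    return units;
}

/* Storage for the text kept as UTF-8 (terminator not counted). */
size_t utf8_bytes(const char *utf8)
{
    return strlen(utf8);
}

/* Storage for the same text kept as 16-bit code units. */
size_t utf16_bytes(const char *utf8)
{
    return 2 * utf16_units(utf8);
}

/* Whether this platform's wchar_t is exactly 16 bits wide.
   The width is implementation-defined, so the answer varies. */
int wchar_is_16_bit(void)
{
    return sizeof(wchar_t) * CHAR_BIT == 16;
}
```

For markup-heavy, mostly ASCII text such as "<em>hi</em>", utf16_bytes()
reports twice what utf8_bytes() reports; the gap narrows as the share of
non-ASCII characters grows.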

Why should someone implement a 16-bit string library that Tidy doesn't
need and wouldn't particularly benefit from, just because a data structure
that was designed for Javascript demands them?

The Boehm conservative garbage collector for C exists, is freely available,
has seen a lot of use, and is generally a Fine Thing.  However, it is a
*conservative* garbage collector, that being pretty much the best that is
possible in C (where you can take a pointer, convert it to an integer,
mangle the integer, and then days later demangle the integer, convert it
back to a pointer, and expect to be able to use the pointer as if nothing
had happened), and will on occasion leak space that could have been reclaimed.
It was a *major* piece of work.
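
The pointer mangling described above can be sketched as follows.  This is
only an illustration: the mask and the function names are made up, and it
assumes the platform provides uintptr_t.  While the pointer is hidden, no
word in memory holds the block's address, so nothing that scans memory for
pointers can tell that the block is still in use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Arbitrary mask chosen for this sketch. */
#define HIDE_MASK ((uintptr_t)0x5A5A5A5Au)

/* Turn a pointer into an integer that no longer looks like an address. */
uintptr_t hide_pointer(void *p)
{
    return (uintptr_t)p ^ HIDE_MASK;
}

/* Undo hide_pointer(): the same XOR restores the original integer value. */
void *reveal_pointer(uintptr_t hidden)
{
    return (void *)(hidden ^ HIDE_MASK);
}

/* Allocate a cell, keep only the mangled form of its address for a while,
   then recover the pointer and use it as if nothing had happened.
   Returns the stored value, or -1 if allocation fails. */
int roundtrip_demo(void)
{
    int *cell = malloc(sizeof *cell);
    uintptr_t hidden;
    int *back;
    int value;

    if (cell == NULL)
        return -1;
    *cell = 42;

    hidden = hide_pointer(cell);
    cell = NULL;            /* the plain address is no longer stored here */

    /* ... arbitrary time may pass; only `hidden` refers to the block ... */

    back = reveal_pointer(hidden);
    assert(back != NULL);
    value = *back;
    free(back);
    return value;
}
```

With plain malloc/free this is harmless.  Under any automatic collector it
is the kind of code that rules out precise tracing of references in C.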

Tidy-as-a-program doesn't *need* garbage collection.

Tidy-as-a-library *will* need careful storage management design, which is
one reason why I'd like to see tidy-as-a-library wait until known bug-fixes
are installed and tested.  But if Tidy-as-a-library is to be usable in
other people's C code, it had better not demand that *they* write for
garbage collection too.

Exceptions:  there are a couple of versions around for C; one of them comes
as an example with FunnelWeb.  I've done my own, too.  However, no-one in
their right mind would say that library-level exception-handling in C could
be expected to improve performance.

Once again, requiring Tidy-as-a-library to use some sort of library-level
exception handling interface would make it of very little use to C
programmers trying to use it.  It would make life *more* complicated for
them, not less.  (It's different in a language with language-level exceptions.)
To do this just so one could conform to a deeply flawed interface designed for
other purposes entirely strikes me as, um, perverse.
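
As a rough sketch of what library-level exception handling asks of a C
caller, consider the following.  Every name here is hypothetical; this is
not Tidy's interface, nor the FunnelWeb package mentioned above, just the
usual setjmp/longjmp pattern.  Note that every call site needs a context,
a setjmp guard, and some care about which variables survive the jump.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical "exception" context that the caller must supply. */
typedef struct {
    jmp_buf env;
    volatile int code;  /* volatile so it is well-defined after longjmp */
} ExcContext;

/* "Throw": record the error code and unwind to the matching setjmp. */
void exc_throw(ExcContext *ctx, int code)
{
    ctx->code = code;
    longjmp(ctx->env, 1);
}

/* Hypothetical library routine: length of text, "throws" 2 on NULL. */
int lib_length(ExcContext *ctx, const char *text)
{
    size_t n = 0;

    assert(ctx != NULL);
    if (text == NULL)
        exc_throw(ctx, 2);
    while (text[n] != '\0')
        n++;
    return (int)n;
}

/* What each caller has to write: returns the length, or the negated
   error code if the library "threw". */
int call_with_handler(const char *text)
{
    ExcContext ctx;

    ctx.code = 0;
    if (setjmp(ctx.env) != 0)
        return -ctx.code;       /* the "catch" branch */
    return lib_length(&ctx, text);
}

/* The same routine with a plain return code, for comparison:
   returns 0 on success and stores the length in *out, else returns 2. */
int lib_length_rc(const char *text, int *out)
{
    size_t n = 0;

    if (text == NULL || out == NULL)
        return 2;
    while (text[n] != '\0')
        n++;
    *out = (int)n;
    return 0;
}
```

The return-code version needs no context object and no setjmp guard, and
it cannot skip a caller's cleanup code by jumping over it.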

	> The next time someone suggests that Tidy should use the W3C DOM, let's
	> require them to implement, oh, one of the W3C suggestions:
	>     change <I>...</I> to <EM>...</EM>
	>        and <B>...</B> to <STRONG>...</STRONG>
	> in both the W3C DOM and Tidy's data structure, and see 
	> whether they still
	> think it would be a good idea.  If they do, then demand that 
	> they do it.

	Agreed again.  But the DOM specification limits only the
	_minimal_ functionality.  You may provide your own as well.

If you provide extra operations *inside* the DOM "capsule", you lose the
alleged portability advantages of the DOM.  People trying to write code
according to the DOM interfaces will not be able to use your operations,
because those operations are not in the DOM.

If you provide extra operations *outside* the DOM "capsule", fine,
as long as you make it clear to people that to move their code to a different
implementation of the DOM they will have to copy the source code of your
extra operations.

However, putting operations like "change the name of this element" outside
the DOM "capsule" makes them more expensive than they needed to be.
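
A rough sketch of the difference, under stated assumptions: the Node type
below is invented for this example and is not Tidy's actual data
structure, and it assumes (as in DOM Level 1 and Level 2 Core) that an
element's name is read-only through the DOM interfaces, so code outside
the "capsule" can only rename by building a replacement element.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical minimal tree node, for illustration only. */
typedef struct Node {
    const char *name;
    struct Node *parent, *first_child, *last_child, *prev, *next;
} Node;

/* Allocate a node with the given name; all links start out NULL. */
Node *node_new(const char *name)
{
    Node *n = calloc(1, sizeof *n);

    if (n != NULL)
        n->name = name;
    return n;
}

/* Append child as the last child of parent. */
void node_append(Node *parent, Node *child)
{
    assert(parent != NULL && child != NULL);
    child->parent = parent;
    child->prev = parent->last_child;
    child->next = NULL;
    if (parent->last_child != NULL)
        parent->last_child->next = child;
    else
        parent->first_child = child;
    parent->last_child = child;
}

/* With direct access to the structure, a rename is one assignment. */
void rename_in_place(Node *n, const char *name)
{
    n->name = name;
}

/* Without it: make a new element, move the children across, splice the
   new element into the old one's position, and discard the old element.
   Returns the new element, or NULL if allocation fails. */
Node *rename_by_replacement(Node *old, const char *name)
{
    Node *fresh = node_new(name);
    Node *c;

    if (fresh == NULL)
        return NULL;

    fresh->first_child = old->first_child;
    fresh->last_child = old->last_child;
    for (c = fresh->first_child; c != NULL; c = c->next)
        c->parent = fresh;              /* one update per child */

    fresh->parent = old->parent;
    fresh->prev = old->prev;
    fresh->next = old->next;
    if (old->prev != NULL)
        old->prev->next = fresh;
    else if (old->parent != NULL)
        old->parent->first_child = fresh;
    if (old->next != NULL)
        old->next->prev = fresh;
    else if (old->parent != NULL)
        old->parent->last_child = fresh;

    free(old);
    return fresh;
}
```

The second routine also has to carry over attributes in a real tree, and
through the DOM interfaces each step is a separate method call.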

	But anyway your arguments seem good to me.  But I still think
	that automatic storage reclamation and exception 'emulation'
	(via setjmp/longjmp) would be really useful.

Oh, they are.  That's why Lisp, Smalltalk, Prolog, Erlang, Mercury, ...
have GC and exceptions.  When Dave Raggett wrote Tidy in C, he gave us
    - portability (any reasonable C compiler should handle Tidy)
    - smallness (Tidy doesn't carry any "baggage" it doesn't need for the job)
    - speed (so that it is feasible to clean up large sites quickly),
and *not* implementing Unicode strings, automatic storage reclamation,
or exception handling was one of the means he used to give us those things.

It would be a pity if Tidy-as-a-library were to lose any of these merits.

Nothing in any of my messages should be construed as arguing against a
*separate* adapter library that converts Tidy's tree to a W3C-conformant
DOM should anyone happen to need such a thing.  Tidy-to-SAX should be
straightforward, and SAX-to-DOM is a well-trodden route.

Received on Wednesday, 23 May 2001 19:25:41 UTC
