- From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
- Date: Thu, 24 May 2001 11:25:14 +1200 (NZST)
- To: Valeri.Atamaniouk@nokia.com, html-tidy@w3.org
Valeri.Atamaniouk@nokia.com wrote:

> In general I agree with you :). But also what prevents you from creating all those fine things in C :).

Time and money. The thing is that Tidy doesn't *need* any of these things for what it does; it has a data structure that suits it perfectly well.

> DOM do not make any limits on implementation,

This turns out not to be true. The DOM places very severe limits on how you implement things, limits which dramatically increase storage costs (by a factor of at least 2.5) and running time (I have measured an approximately 20% slowdown forced by one particular DOM requirement).

> and it is possible to implement strings, storage reclamation & exceptions. As far as I understand the last two would be really useful as definitely improve performance (exit(2) is not an appropriate solution for library function :)).

Strings? Yes, but the DOM explicitly requires immutable *UNICODE* strings. Even if using UTF-8 internally would save you nearly a factor of two in space, the DOM does not allow you to do that. (Note that the wchar_t type and wcs* functions in standard C don't help, because wchar_t is commonly a 32-bit character type, not the 16-bit type that the DOM absolutely demands.) Why should someone implement a 16-bit string library that Tidy doesn't need and wouldn't particularly benefit from, just because a data structure that was designed for Javascript demands it?

The Boehm conservative garbage collector for C exists, is freely available, has seen a lot of use, and is generally a Fine Thing. However, it is a *conservative* garbage collector, that being pretty much the best that is possible in C (where you can take a pointer, convert it to an integer, mangle the integer, and then days later demangle the integer, convert it back to a pointer, and expect to be able to use the pointer as if nothing had happened), and will on occasion leak space that could have been reclaimed. It was a *major* piece of work.
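The pointer mangling described above can be sketched in a few lines of C. This is only an illustration: the names `hide`, `reveal`, `MASK` and `demo_hidden_pointer` are hypothetical, and the XOR mask stands in for any reversible scrambling. While the pointer is held only in scrambled form, no word in memory looks like the address of the block, which is why a collector for C cannot be exact about which blocks are still live.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scrambling key; any reversible transformation would do. */
#define MASK ((uintptr_t)0x5a5a5a5aU)

/* Turn a pointer into an integer that no longer looks like an address. */
static uintptr_t hide(void *p) { return (uintptr_t)p ^ MASK; }

/* Undo the scrambling and recover the original pointer. */
static void *reveal(uintptr_t h) { return (void *)(h ^ MASK); }

/* Store a value, keep only the scrambled pointer, then read it back.
   Returns the stored value, or -1 if allocation fails. */
int demo_hidden_pointer(void)
{
    int *cell = malloc(sizeof *cell);
    if (cell == NULL)
        return -1;
    *cell = 42;

    uintptr_t hidden = hide(cell);
    cell = NULL;                 /* no ordinary pointer to the block remains */

    int *back = reveal(hidden);  /* "days later": demangle and use it again */
    int value = *back;
    free(back);
    return value;
}
```

With plain malloc/free this works as the language allows. A collector that scanned memory during the window where only `hidden` exists would find nothing resembling a pointer to the block.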
Tidy-as-a-program doesn't *need* garbage collection. Tidy-as-a-library *will* need careful storage management design, which is one reason why I'd like to see tidy-as-a-library wait until known bug-fixes are installed and tested. But if Tidy-as-a-library is to be usable in other people's C code, it had better not demand that *they* write for garbage collection too.

Exceptions: there are a couple of versions around for C; one of them comes as an example with FunnelWeb. I've done my own, too. However, no-one in their right mind would say that library-level exception handling in C could be expected to improve performance. Once again, requiring Tidy-as-a-library to use some sort of library-level exception handling interface would make it of very little use to C programmers trying to use it. It would make life *more* complicated for them, not less. (It's different in a language with language-level exceptions.) To do this just so one could conform to a deeply flawed interface designed for other purposes entirely strikes me as, um, perverse.

> > The next time someone suggests that Tidy should use the W3C DOM, let's require them to implement, oh, one of the W3C suggestions:
> >     change <I>...</I> to <EM>...</EM>
> >     and <B>...</B> to <STRONG>...</STRONG>
> > in both the W3C DOM and Tidy's data structure, and see whether they still think it would be a good idea. If they do, then demand that they do it.
>
> Agreed again. But regarding DOM implementation it limits only the _minimal_ functionality. You may provide your own as well.

If you provide extra operations *inside* the DOM "capsule", you lose the alleged portability advantages of the DOM. People trying to write code according to the DOM interfaces will not be able to use your operations, because those operations are not in the DOM.
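For a sense of what the renaming exercise looks like when the element name is an ordinary mutable field, here is a minimal sketch. The `Node` type and `rename_elements` are hypothetical and are not Tidy's actual data structure. By contrast, `tagName` on a DOM Level 1 or Level 2 `Element` is read-only, so code restricted to the DOM interfaces has to create a new element, move the children and attributes across, and replace the old node.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tree node for illustration; not Tidy's real structure. */
typedef struct Node {
    const char  *name;          /* element name, e.g. "i" or "b" */
    struct Node *first_child;
    struct Node *next_sibling;
} Node;

/* Rename every element called `from` to `to`, in place.
   Walks the sibling chain iteratively and recurses into children.
   Returns the number of elements renamed. */
int rename_elements(Node *n, const char *from, const char *to)
{
    int changed = 0;
    for (; n != NULL; n = n->next_sibling) {
        if (n->name != NULL && strcmp(n->name, from) == 0) {
            n->name = to;       /* the whole "rename" is one assignment */
            changed++;
        }
        changed += rename_elements(n->first_child, from, to);
    }
    return changed;
}
```

Calling `rename_elements(root, "i", "em")` and then `rename_elements(root, "b", "strong")` covers both substitutions without allocating or relinking any node.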
If you provide extra operations *outside* the DOM "capsule", fine, as long as you make it clear to people that to move their code to a different implementation of the DOM they will have to copy the source code of your extra operations. However, putting operations like "change the name of this element" outside the DOM "capsule" makes them more expensive than they need to be.

> But anyway your arguments seem good for me. But I still think that automatic storage reclamation and exception 'emulation' (via setjmp/longjmp) would be really useful.

Oh, they are. That's why Lisp, Smalltalk, Prolog, Erlang, Mercury, ... have GC and exceptions. When Dave Raggett wrote Tidy in C, he gave us

- portability (any reasonable C compiler should handle Tidy)
- smallness (Tidy doesn't carry any "baggage" it doesn't need for the job)
- speed (so that it is feasible to clean up large sites quickly).

and *not* implementing Unicode strings, automatic storage reclamation, or exception handling was one of the means he used to give us those things. It would be a pity if Tidy-as-a-library were to lose any of these merits.

Nothing in any of my messages should be construed as arguing against a *separate* adapter library that converts Tidy's tree to a W3C-conformant DOM, should anyone happen to need such a thing. Tidy-to-SAX should be straightforward, and SAX-to-DOM is a well-trodden route.
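For readers who have not seen the setjmp/longjmp style of exception emulation discussed in this thread, a minimal sketch follows. Every name here (`current_handler`, `pending_code`, `raise_error`, `checked_div`, `safe_div`, `ERR_DIV_ZERO`) is hypothetical; none of it is part of Tidy. The point is the bookkeeping: each protected region has to save, install and restore a handler by hand, which is work that a plain error-code return would not need.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

#define ERR_DIV_ZERO 7                    /* hypothetical error code */

static jmp_buf *current_handler = NULL;   /* innermost active handler */
static int      pending_code    = 0;      /* code carried by the "throw" */

/* "Throw": record the code and jump to the innermost handler.
   With no handler installed there is nowhere to go, so give up. */
static void raise_error(int code)
{
    if (current_handler == NULL)
        abort();
    pending_code = code;
    longjmp(*current_handler, 1);
}

/* A deeply nested routine that reports failure by "throwing". */
static int checked_div(int a, int b)
{
    if (b == 0)
        raise_error(ERR_DIV_ZERO);
    return a / b;
}

/* Library-style entry point: converts the "exception" back into an
   ordinary return code. Returns 0 on success, the error code otherwise;
   *result is left untouched on failure. */
int safe_div(int a, int b, int *result)
{
    jmp_buf  here;
    jmp_buf *saved = current_handler;     /* set before setjmp, never changed */

    if (setjmp(here) == 0) {
        current_handler = &here;          /* install handler ("try") */
        *result = checked_div(a, b);
        current_handler = saved;          /* normal exit: uninstall */
        return 0;
    }
    current_handler = saved;              /* arrived via longjmp ("catch") */
    return pending_code;
}
```

A caller sees only `safe_div`, for example `if (safe_div(x, y, &q) != 0) { /* handle error */ }`, so the jump machinery stays private to the library rather than being imposed on the code that uses it.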
Received on Wednesday, 23 May 2001 19:25:41 UTC