W3C home > Mailing lists > Public > html-tidy@w3.org > April to June 2001

Re: Html-Tidy BUG ???

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 21 Jun 2001 02:53:24 +0200
To: "Reitzel, Charlie" <CReitzel@arrakisplanet.com>
Cc: "'Gary L Peskin'" <garyp@firstech.com>, html-tidy@w3.org
Message-ID: <8gg2jtop4drm7208m4objnoh4mcf851deg@4ax.com>
* Reitzel, Charlie wrote:
>DOM compatibility is an issue for the C version of Tidy as well.

In my first HTML::Parser::Tidy hack, I had something like

void _process_tree(CHANDLER* hdl, Lexer *lexer, Node *node) {
	Node *content;

	if (node == null)
		return;

	if (node->type == TextNode)
		_fire_characters_e(hdl->handlers[E_CHARACTERS], lexer,
                    node);

	else if (node->type == RootNode) {
		for (content = node->content;
				content != null;
				content = content->next)
			_process_tree(hdl, lexer, content);
	}

	else if (node->type == StartTag) {
		_fire_start_element_e(hdl->handlers[E_START_ELEMENT],
                    node);

		for (content = node->content;
				content != null;
				content = content->next)
			_process_tree(hdl, lexer, content);

		_fire_end_element_e(hdl->handlers[E_END_ELEMENT], node);
	}
	else {
		printf("uups, type: %d\n", node->type);
	}
}

This traverses the given Tidy Node* and fires the appropriate SAX events
to the given Perl SAX handler e.g. to rebuild a DOM structure with
XML::Handler::BuildDOM or XML::XPath::Builder, see the
XML::Parser::PerlSAX for more informations.

Something that drove me nuts was this:

void _fire_characters_e(SV* handler, Lexer *lexer, Node *node) {
	HV* hash;
	STRLEN len = 0;
	uint i, count, bytes, c = 0, pos = 0;
	char *p;

	New(42, p, (node->end - node->start + 6), char);

	for (i = node->start; i < node->end;)
	{
		bytes = 1;

		c = (unsigned char)lexer->lexbuf[i];

		if (c > 0x7F) {
			bytes += GetUTF8((unsigned char *)lexer->lexbuf
                                   + i, &c);
		}

		for (count = 0; count < bytes; count++) {
			p[pos+count] = lexer->lexbuf[i+count];
		}

		i += bytes;
		pos += bytes;
	}
	p[pos] = '\0';

	hash = newHV();
	hv_store(hash, "Data", 4, newSVpv(p, len), 0);

	_fire_event(handler, hash);

	Safefree(p);
}

Since node->end isn't the real end but the start of the last UTF-8
sequence...

Anyway, as you can see, adding SAX support is quite easy. I don't think
it's a good idea (since it would mean a lot of work with small use) to
add DOM support to Tidy. If someone is likely to do so anyway, we could
have another dom.c that provides this functionality. Nearly all
information required for DOM Level 1 support is stored in the Node*.
-- 
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
Received on Wednesday, 20 June 2001 20:52:28 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 3 April 2012 06:13:45 GMT