Re: Using TidyLib as an HTML parser from John Snelson on 2008-01-22 (html-tidy@w3.org from January to March 2008)

From: John Snelson <john.snelson@oracle.com>
Date: Tue, 22 Jan 2008 13:30:40 +0000
To: Arnaud Desitter <arnaud02@users.sourceforge.net>
CC: html-tidy@w3.org
Message-ID: <4795F000.4050701@oracle.com>

Arnaud Desitter wrote:
> On 22/01/2008, John Snelson <john.snelson@oracle.com> wrote:
>> Is there a better way to do what I want? I would be quite happy to
>> implement a new API method to do this if that's required - does anyone
>> else think this would be useful?
> 
> Please refer to http://tidy.sf.net/issue/1636028.
> Your contribution to a new API would be welcome. Please post it using the
> tidy patch tracker.

Thanks for the pointer. The linked bug report doesn't make it obvious
what the correct fix is. Should I change tidyNodeGetText() to return
the unescaped value of the node, or should I add a new method?

Here's what I propose - I'll add a new method:

Bool tidyNodeGetValue( TidyDoc tdoc, TidyNode tnod, TidyBuffer* buf );

For attribute, text, comment, and processing instruction nodes this 
method will fill the buffer with the value of the node. The value will 
be unescaped, and not serialized (no "<!--" or "<?" etc.).
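
To make "unescaped, and not serialized" concrete, here is a minimal, self-contained sketch in C. It does not call TidyLib: unescape_value() and the kEntities table are hypothetical names invented for illustration. It assumes only the five predefined XML entities need decoding; a real implementation would also have to cover the full HTML entity table and numeric character references.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entity table, for illustration only (not TidyLib's). */
typedef struct { const char *name; char ch; } Entity;

static const Entity kEntities[] = {
    { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' },
    { "&quot;", '"' }, { "&apos;", '\'' }
};

/* Copies `src` into `dst`, decoding the entities listed above in a single
 * pass. Unknown entities are copied through unchanged.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or (size_t)-1 if `dst` (capacity `cap`) is too small. */
static size_t unescape_value(const char *src, char *dst, size_t cap)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        char ch = *src;
        size_t step = 1;

        if (ch == '&') {
            size_t i;
            for (i = 0; i < sizeof kEntities / sizeof kEntities[0]; ++i) {
                size_t n = strlen(kEntities[i].name);
                if (strncmp(src, kEntities[i].name, n) == 0) {
                    ch = kEntities[i].ch;   /* decoded character */
                    step = n;               /* skip the whole entity */
                    break;
                }
            }
        }
        if (out + 1 >= cap)                 /* keep room for the NUL */
            return (size_t)-1;
        dst[out++] = ch;
        src += step;
    }
    if (cap == 0)
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}
```

With this sketch, the escaped text "a &lt; b" becomes "a < b", and "&amp;lt;" becomes the literal text "&lt;" because the decoded output is never scanned a second time.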

Some questions:

1) Are there other node types the method should work for?
2) Should I respect the specified output encoding, or use UTF-8? (For 
instance, the tidyNodeGetName() function always returns UTF-8)
3) What should I do about characters that cannot be represented in the
output encoding?
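
As one possible answer to questions 2 and 3, the sketch below keeps the value in UTF-8 internally and substitutes '?' when converting to a narrower encoding, reporting how many characters were lost. It is an illustration only: utf8_to_latin1() is a hypothetical helper, not a TidyLib function, ISO-8859-1 is just an example target, and the decoder assumes mostly well-formed input (overlong sequences are not rejected).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of TidyLib.
 * Transcodes NUL-terminated UTF-8 in `src` to ISO-8859-1 in `dst`.
 * Code points above U+00FF, and malformed sequences, are written as '?'.
 * Returns the number of replaced characters, or -1 if `dst` (capacity
 * `cap`, including the terminating NUL) is too small. */
static int utf8_to_latin1(const unsigned char *src, unsigned char *dst,
                          size_t cap)
{
    size_t out = 0;
    int replaced = 0;

    assert(src != NULL && dst != NULL);
    while (*src != 0) {
        unsigned long cp;
        size_t extra, i;
        unsigned char b = *src++;

        /* Decode the lead byte and find how many continuation bytes follow. */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; }

        for (i = 0; i < extra && (*src & 0xC0) == 0x80; ++i)
            cp = (cp << 6) | (unsigned long)(*src++ & 0x3F);
        if (i < extra)
            cp = 0xFFFD;                    /* truncated sequence */

        if (out + 1 >= cap)                 /* keep room for the NUL */
            return -1;
        if (cp <= 0xFF) {
            dst[out++] = (unsigned char)cp;
        } else {
            dst[out++] = '?';               /* unrepresentable in Latin-1 */
            ++replaced;
        }
    }
    if (cap == 0)
        return -1;
    dst[out] = 0;
    return replaced;
}
```

A nonzero return value lets the caller decide whether a lossy result is acceptable. Other policies, such as failing outright or always returning UTF-8 regardless of the output encoding, are equally plausible.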

John

-- 
John Snelson, Oracle Corporation            http://snelson.org.uk/john
Berkeley DB XML:        http://www.oracle.com/database/berkeley-db/xml
XQilla:                                  http://xqilla.sourceforge.net

Received on Tuesday, 22 January 2008 13:31:53 UTC