Re: Using TidyLib as an HTML parser

From: John Snelson <john.snelson@oracle.com>
Date: Tue, 22 Jan 2008 21:05:08 +0000
Message-ID: <47965A84.9030909@oracle.com>
To: Arnaud Desitter <arnaud02@users.sourceforge.net>
CC: html-tidy@w3.org

I've uploaded my patch that implements tidyNodeGetValue(), which can be 
found here:



Arnaud Desitter wrote:
> On 22/01/2008, John Snelson <john.snelson@oracle.com> wrote:
>> Arnaud Desitter wrote:
>>> On 22/01/2008, John Snelson <john.snelson@oracle.com> wrote:
>>>> Is there a better way to do what I want? I would be quite happy to
>>>> implement a new API method to do this if that's required - does anyone
>>>> else think this would be useful?
>>> Please refer to http://tidy.sf.net/issue/1636028.
>>> Your contribution to a new API would be welcome. Please post it using the
>>> tidy patch tracker.
>> Thanks for the pointer. From the bug report linked, it's not obvious
>> what the correct way to fix this is. Should I change tidyNodeGetText()
>> to return the unescaped value of the node, or should I add a new method?
> From the bug reports, please add a new function.
>> Here's what I propose - I'll add a new method:
>> Bool tidyNodeGetValue( TidyDoc tdoc, TidyNode tnod, TidyBuffer* buf );
>> For attribute, text, comment, and processing instruction nodes this
>> method will fill the buffer with the value of the node. The value will
>> be unescaped, and not serialized (no "<!--" or "<?" etc.).
>> Some questions:
>> 1) Are there other node types the method should work for?
>> 2) Should I respect the specified output encoding, or use UTF-8? (For
>> instance, the tidyNodeGetName() function always returns UTF-8)
> Could you add that to include/tidy.h, please?
>> 3) What should I do about unrepresentable characters?
> IMO, UTF-8 is a good choice. Bjorn or others may comment.
> Because it is a new function, there is no backward compatibility issue
> so it can be modified until it feels right.
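
To make the "unescaped, not serialized" distinction in the proposal above concrete, here is a toy sketch in plain C. It does not use TidyLib and is not the patch itself: toy_unescape() and toy_comment_value() are hypothetical helpers invented for illustration, operating on fixed-size char arrays instead of a TidyBuffer, and only the four basic entities are decoded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of TidyLib: copy `in` to `out`, decoding
   &amp; &lt; &gt; &quot;. Returns 1 on success, 0 if `out` is too small. */
int toy_unescape(const char *in, char *out, size_t outsz)
{
    static const struct { const char *name; char ch; } ents[] = {
        { "&amp;", '&' }, { "&lt;", '<' }, { "&gt;", '>' }, { "&quot;", '"' }
    };
    size_t o = 0;

    if (outsz == 0)
        return 0;
    while (*in) {
        char c = *in;
        size_t step = 1;
        for (size_t i = 0; i < sizeof ents / sizeof ents[0]; i++) {
            size_t n = strlen(ents[i].name);
            if (strncmp(in, ents[i].name, n) == 0) {
                c = ents[i].ch;   /* emit the decoded character */
                step = n;         /* and skip the whole entity  */
                break;
            }
        }
        if (o + 1 >= outsz)
            return 0;             /* no room for char plus terminator */
        out[o++] = c;
        in += step;
    }
    out[o] = '\0';
    return 1;
}

/* Hypothetical helper: strip the "<!--" and "-->" delimiters from a
   serialized comment, leaving only its value. Returns 0 on bad input
   or if `out` is too small. */
int toy_comment_value(const char *in, char *out, size_t outsz)
{
    size_t len = strlen(in);

    if (len < 7 || strncmp(in, "<!--", 4) != 0 ||
        strcmp(in + len - 3, "-->") != 0)
        return 0;
    len -= 7;                     /* length without the delimiters */
    if (len + 1 > outsz)
        return 0;
    memcpy(out, in + 4, len);
    out[len] = '\0';
    return 1;
}
```

With these helpers, the serialized text "a &amp; b &lt; c" yields the value "a & b < c", and the serialized comment "<!-- note -->" yields " note ". The real function would of course read the value from the node rather than re-parse serialized output.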

John Snelson, Oracle Corporation            http://snelson.org.uk/john
Berkeley DB XML:        http://www.oracle.com/database/berkeley-db/xml
XQilla:                                  http://xqilla.sourceforge.net
Received on Tuesday, 22 January 2008 21:07:02 UTC
