Re: Using TidyLib as an HTML parser

On 22/01/2008, John Snelson <john.snelson@oracle.com> wrote:
> Arnaud Desitter wrote:
> > On 22/01/2008, John Snelson <john.snelson@oracle.com> wrote:
> >> Is there a better way to do what I want? I would be quite happy to
> >> implement a new API method to do this if that's required - does anyone
> >> else think this would be useful?
> >
> > Please refer to http://tidy.sf.net/issue/1636028.
> > Your contribution to a new API would be welcome. Please post it using the
> > tidy patch tracker.
>
> Thanks for the pointer. From the bug report linked, it's not obvious
> what the correct way to fix this is. Should I change tidyNodeGetText()
> to return the unescaped value of the node, or should I add a new method?

From the bug reports, please add a new function.

>
> Here's what I propose - I'll add a new method:
>
> Bool tidyNodeGetValue( TidyDoc tdoc, TidyNode tnod, TidyBuffer* buf );
>
> For attribute, text, comment, and processing instruction nodes this
> method will fill the buffer with the value of the node. The value will
> be unescaped, and not serialized (no "<!--" or "<?" etc.).
>
> Some questions:
>
> 1) Are there other node types the method should work for?
> 2) Should I respect the specified output encoding, or use UTF-8? (For
> instance, the tidyNodeGetName() function always returns UTF-8)

Could you add that to include/tidy.h, please?

> 3) What should I do about unrepresentable characters?

IMO, UTF-8 is a good choice. Bjorn or others may comment.
Because it is a new function, there is no backward-compatibility issue,
so it can be modified until it feels right.
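
To make the "value, not serialized" part of the proposal concrete, here is
a minimal, self-contained sketch. It is not TidyLib code: the function
value_from_serialized() is a hypothetical stand-in, it assumes XML-style
"?>" terminators for processing instructions, and it only shows the
delimiter stripping, not the unescaping of entities.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in, not part of TidyLib: given the serialized form of
   a comment ("<!--...-->") or processing instruction ("<?...?>"), copy only
   the node value (the text between the delimiters) into out.
   Returns 1 on success, 0 if the input is not a complete comment/PI
   or the output buffer is too small. */
static int value_from_serialized(const char *text, char *out, size_t outsz)
{
    const char *close;
    size_t olen, clen, tlen, n;

    assert(text != NULL && out != NULL);
    tlen = strlen(text);

    if (strncmp(text, "<!--", 4) == 0) {
        olen = 4;
        close = "-->";
    } else if (strncmp(text, "<?", 2) == 0) {
        olen = 2;
        close = "?>";   /* assumption: XML-style PI terminator */
    } else {
        return 0;       /* neither a comment nor a PI */
    }

    clen = strlen(close);
    if (tlen < olen + clen || strcmp(text + tlen - clen, close) != 0)
        return 0;       /* closing delimiter missing */

    n = tlen - olen - clen;
    if (n + 1 > outsz)
        return 0;       /* value plus terminating NUL does not fit */

    memcpy(out, text + olen, n);
    out[n] = '\0';
    return 1;
}
```

With this sketch, the serialized comment "<!-- hi -->" yields the value
" hi ", which is the kind of result the new function would put in the
TidyBuffer.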

Regards,

>
> John
>
> --
> John Snelson, Oracle Corporation            http://snelson.org.uk/john
> Berkeley DB XML:        http://www.oracle.com/database/berkeley-db/xml
> XQilla:                                  http://xqilla.sourceforge.net
>

Received on Tuesday, 22 January 2008 13:58:07 UTC