
Re: Using TidyLib as an HTML parser

From: Arnaud Desitter <arnaud02@users.sourceforge.net>
Date: Wed, 23 Jan 2008 09:54:34 +0000
Message-ID: <a240ddd00801230154j538e3c6cr7bd02811e1eaf068@mail.gmail.com>
To: "John Snelson" <john.snelson@oracle.com>
Cc: html-tidy@w3.org

Thanks. I will get down to it when time allows.
You can always post a revised patch in issue 1877642 if you have new ideas.
Regards,

On 22/01/2008, John Snelson <john.snelson@oracle.com> wrote:
> I've uploaded my patch that implements tidyNodeGetValue(), which can be
> found here:
>
> http://sourceforge.net/tracker/index.php?func=detail&aid=1877642&group_id=27659&atid=390965
>
> John
>
> Arnaud Desitter wrote:
> > On 22/01/2008, John Snelson <john.snelson@oracle.com> wrote:
> >> Arnaud Desitter wrote:
> >>> On 22/01/2008, John Snelson <john.snelson@oracle.com> wrote:
> >>>> Is there a better way to do what I want? I would be quite happy to
> >>>> implement a new API method to do this if that's required - does anyone
> >>>> else think this would be useful?
> >>> Please refer to http://tidy.sf.net/issue/1636028.
> >>> Your contribution to a new API would be welcome. Please post it using the
> >>> tidy patch tracker.
> >> Thanks for the pointer. From the bug report linked, it's not obvious
> >> what the correct way to fix this is. Should I change tidyNodeGetText()
> >> to return the unescaped value of the node, or should I add a new method?
> >
> > From the bug reports, please add a new function.
> >
> >> Here's what I propose - I'll add a new method:
> >>
> >> Bool tidyNodeGetValue( TidyDoc tdoc, TidyNode tnod, TidyBuffer* buf );
> >>
> >> For attribute, text, comment, and processing instruction nodes this
> >> method will fill the buffer with the value of the node. The value will
> >> be unescaped, and not serialized (no "<!--" or "<?" etc.).
> >>
> >> Some questions:
> >>
> >> 1) Are there other node types the method should work for?
> >> 2) Should I respect the specified output encoding, or use UTF-8? (For
> >> instance, the tidyNodeGetName() function always returns UTF-8)
> >
> > Could you add that to include/tidy.h please?
> >
> >> 3) What should I do about unrepresentable characters?
> >
> > IMO, UTF-8 is a good choice. Bjorn or others may comment.
> > Because it is a new function, there is no backward compatibility issue
> > so it can be modified until it feels right.
>
> --
> John Snelson, Oracle Corporation            http://snelson.org.uk/john
> Berkeley DB XML:        http://www.oracle.com/database/berkeley-db/xml
> XQilla:                                  http://xqilla.sourceforge.net
>
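
To make "unescaped, and not serialized" in the proposal above concrete, here is a small standalone sketch in C. It is not code from the patch and does not use TidyLib: `unescape_basic` is a hypothetical helper, and it only handles the five predefined entities, whereas a real implementation would also need numeric and named HTML entities. It shows the kind of difference a caller might expect between escaped markup text and a node's value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of TidyLib. Copies src into dst while
 * decoding the five predefined entities in a single pass. Returns the
 * number of bytes written (excluding the terminating NUL), or (size_t)-1
 * if dst is too small to hold the result plus the NUL. */
size_t unescape_basic(const char *src, char *dst, size_t dstlen)
{
    static const struct { const char *name; char ch; } ents[] = {
        { "&amp;", '&' }, { "&lt;", '<' }, { "&gt;", '>' },
        { "&quot;", '"' }, { "&apos;", '\'' }
    };
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    while (*src) {
        char ch = *src;
        size_t step = 1;

        if (ch == '&') {
            /* Replace a recognised entity with the character it names. */
            for (size_t i = 0; i < sizeof ents / sizeof ents[0]; i++) {
                size_t n = strlen(ents[i].name);
                if (strncmp(src, ents[i].name, n) == 0) {
                    ch = ents[i].ch;
                    step = n;
                    break;
                }
            }
        }
        /* Always keep one byte free for the terminating NUL. */
        if (out + 1 >= dstlen)
            return (size_t)-1;
        dst[out++] = ch;
        src += step;
    }
    if (dstlen == 0)
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}
```

Because the decoding is single-pass, an input such as "&amp;lt;" yields the literal text "&lt;" rather than "<", which is the behaviour one would normally want from an unescaped value.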
Received on Wednesday, 23 January 2008 09:56:53 GMT
