Re: Using htmltidy to parse: getting the "body" of a node

From: Lee Passey <lee@novomail.net>
Date: Thu, 02 Oct 2003 13:29:18 -0600
Message-ID: <3F7C7C8E.9060106@novomail.net>
To: joe user <palehaole@yahoo.com>
Cc: html-tidy@w3.org

joe user wrote:

> Hello Tidy people,
> 
> I am trying to use Tidy to do its magic on (possibly
> broken) html files, for input to other layers of
> processing in C.  I need to get access to the body of
> stuff.
> 
> For example, in this block:
> 
> <p>This is some text.</p>
> 
> how do I get access to the "This is some text." part? 
> I can get a stream of TidyNodes, which have
> attributes, but what about the actual content?  I
> assume that the entire sequence of <p>Text</p> counts
> as a single TidyNode?
> 
> Thanks for any tips on this.


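#include <string.h>     //  memcpy

//  Fills rgBuffer with the text of node and its descendants, copied from the
//  lexer's buffer, up to nBufSize minus one bytes, then null terminates it.
//  Text is UTF-8 encoded. Element markup is not included, but the contents of
//  comment nodes are. Returns the number of bytes written, not counting the
//  terminator.
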
int getTextFromNode( TidyDocImpl* doc, Node *node, char *rgBuffer, int nBufSize )
{
    int len = 0;
    Node *pTemp;
    
    nBufSize--;     //  So we have room to null terminate.

    for (pTemp = node->content; NULL != pTemp && len < nBufSize; pTemp = pTemp->next )
    {
        len += getTextFromNode( doc, pTemp, &rgBuffer[ len ], nBufSize - len );
    }
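    //  Only text and comment nodes carry character data in the lexer buffer.
    if (len < nBufSize && (node->type == TextNode || node->type == CommentTag ))
    {
        int nToCopy = node->end - node->start;
        if (0 != nToCopy )
        {
            //  Clamp to the space remaining after what is already in rgBuffer.
            if (nToCopy > nBufSize - len)
                nToCopy = nBufSize - len;
            memcpy( &rgBuffer[ len ], &doc->lexer->lexbuf[ node->start ], nToCopy );
        }
        len += nToCopy;
    }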
    rgBuffer[ len ] = '\0';
    return len;
}
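
On the question of whether <p>Text</p> is a single node: the function above
walks node->content, which reflects that the element and its text are separate
nodes, with the text node hanging off the element as a child. The following is
only a self-contained sketch of that same traversal so it can be compiled
without the Tidy sources. MiniNode, mini_get_text and example_paragraph_text
are hypothetical stand-ins invented for illustration, not Tidy's real types or
API, and the flat lexbuf string is just an assumed stand-in for the lexer
buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for Tidy's internal node types (not the real ones). */
typedef enum { MINI_ELEMENT, MINI_TEXT } MiniNodeType;

typedef struct MiniNode {
    MiniNodeType type;
    size_t start;               /* offset of the first text byte in the buffer */
    size_t end;                 /* offset one past the last text byte */
    struct MiniNode *content;   /* first child, or NULL */
    struct MiniNode *next;      /* next sibling, or NULL */
} MiniNode;

/* Copies the text of node and its descendants into out, writing at most
 * cap bytes including the terminator. Returns the bytes written, not
 * counting the terminator. */
static size_t mini_get_text(const char *lexbuf, const MiniNode *node,
                            char *out, size_t cap)
{
    size_t len = 0;
    const MiniNode *child;

    assert(lexbuf != NULL && node != NULL && out != NULL && cap > 0);

    /* Depth-first: gather each child's text while there is room left. */
    for (child = node->content; child != NULL && len + 1 < cap; child = child->next)
        len += mini_get_text(lexbuf, child, out + len, cap - len);

    if (node->type == MINI_TEXT && len + 1 < cap) {
        size_t n = node->end - node->start;
        size_t room = cap - 1 - len;    /* keep one byte for the terminator */
        if (n > room)
            n = room;
        memcpy(out + len, lexbuf + node->start, n);
        len += n;
    }
    out[len] = '\0';
    return len;
}

/* Builds a tree shaped like <p>This is some <b>text</b>.</p> and extracts
 * its text into out. */
static size_t example_paragraph_text(char *out, size_t cap)
{
    static const char lexbuf[] = "This is some text.";
    MiniNode tail      = { MINI_TEXT,    17, 18, NULL,       NULL  };  /* "." */
    MiniNode bold_text = { MINI_TEXT,    13, 17, NULL,       NULL  };  /* "text" */
    MiniNode bold      = { MINI_ELEMENT,  0,  0, &bold_text, &tail };
    MiniNode lead      = { MINI_TEXT,     0, 13, NULL,       &bold };  /* "This is some " */
    MiniNode para      = { MINI_ELEMENT,  0,  0, &lead,      NULL  };

    return mini_get_text(lexbuf, &para, out, cap);
}
```

Calling example_paragraph_text with a 64-byte buffer yields the paragraph's
text with the <b> markup dropped. With an 8-byte buffer the result is cut off
after seven bytes so the terminator still fits. Unlike the function above, this
sketch reserves the terminator byte at the point of copying rather than by
decrementing the size on entry, so nested elements do not each give up a byte
of capacity.
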
Received on Thursday, 2 October 2003 15:29:33 UTC