- From: Lee Passey <lee@novomail.net>
- Date: Thu, 02 Oct 2003 13:29:18 -0600
- To: joe user <palehaole@yahoo.com>
- Cc: html-tidy@w3.org
joe user wrote:
> Hello Tidy people,
>
> I am trying to use Tidy to do its magic on (possibly
> broken) html files, for input to other layers of
> processing in C. I need to get access to the body of
> stuff.
>
> For example, in this block:
>
> <p>This is some text.</p>
>
> how do I get access to the "This is some text." part?
> I can get a stream of TidyNodes, which have
> attributes, but what about the actual content? I
> assume that the entire sequence of <p>Text</p> counts
> as a single TidyNode?
>
> Thanks for any tips on this.
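// Requires <string.h> for memcpy, plus Tidy's internal headers for
// TidyDocImpl, Node and the lexer.
//
// Fills rgBuffer with the text found in node and its descendants, read from
// the lexer's buffer, up to nBufSize minus one bytes, and null-terminates it.
// Text is UTF-8 encoded. Tags are not included, but comment contents are.
// Returns the number of bytes copied, not counting the terminator.
int getTextFromNode( TidyDocImpl* doc, Node *node, char *rgBuffer, int nBufSize )
{
    int len = 0;
    Node *pTemp;

    if (nBufSize < 1)
        return 0; // No room even for the terminator.
    nBufSize--; // So we have room to null terminate.

    // Collect the text of the children first, in document order.
    for (pTemp = node->content; NULL != pTemp && len < nBufSize; pTemp = pTemp->next )
    {
        // The + 1 hands back the terminator slot the callee reserves for itself.
        len += getTextFromNode( doc, pTemp, &rgBuffer[ len ], nBufSize - len + 1 );
    }

    if (len < nBufSize && (node->type == TextNode || node->type == CommentTag ))
    {
        int nToCopy = node->end - node->start;
        // Clamp to the space that is still free, not to the whole buffer.
        if (nToCopy > nBufSize - len)
            nToCopy = nBufSize - len;
        if (nToCopy > 0)
        {
            memcpy( &rgBuffer[ len ], &doc->lexer->lexbuf[ node->start ], nToCopy );
            len += nToCopy;
        }
    }
    rgBuffer[ len ] = '\0';
    return len;
}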
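
For anyone who wants to try the traversal without building against Tidy's
internal headers, here is a rough, self-contained sketch of the same idea.
MiniNode, collectText and miniGetText are made-up stand-ins for illustration,
not Tidy's types or API, and lexbuf is just a plain string playing the role
of the lexer's buffer. It assumes, as the function above does, that a text
node stores start/end offsets into that buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for Tidy's internal Node; NOT the real definition. */
typedef enum { MiniElement, MiniText } MiniType;

typedef struct MiniNode {
    MiniType type;
    size_t start;                   /* offset of the first text byte in lexbuf */
    size_t end;                     /* offset one past the last text byte      */
    const struct MiniNode *content; /* first child                             */
    const struct MiniNode *next;    /* next sibling                            */
} MiniNode;

/* Appends the text under node to buf at offset len, never storing more than
   cap bytes in total. Returns the new length. */
static size_t collectText(const char *lexbuf, const MiniNode *node,
                          char *buf, size_t len, size_t cap)
{
    const MiniNode *child;

    /* Children first, in document order, stopping once the buffer is full. */
    for (child = node->content; child != NULL && len < cap; child = child->next)
        len = collectText(lexbuf, child, buf, len, cap);

    if (node->type == MiniText && len < cap) {
        size_t n = node->end - node->start;
        if (n > cap - len)
            n = cap - len;          /* clamp to the space still free */
        memcpy(buf + len, lexbuf + node->start, n);
        len += n;
    }
    return len;
}

/* Copies at most bufSize - 1 bytes of text and null-terminates the result.
   Returns the number of bytes copied, not counting the terminator. */
size_t miniGetText(const char *lexbuf, const MiniNode *node,
                   char *buf, size_t bufSize)
{
    size_t len;

    assert(buf != NULL && bufSize > 0);
    len = collectText(lexbuf, node, buf, 0, bufSize - 1);
    buf[len] = '\0';
    return len;
}
```

Passing the running length and the total capacity down the recursion, instead
of a shrinking sub-buffer, means the terminator slot is reserved only once, at
the top level.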
Received on Thursday, 2 October 2003 15:29:33 UTC