- From: Lee Passey <lee@novomail.net>
- Date: Thu, 02 Oct 2003 13:29:18 -0600
- To: joe user <palehaole@yahoo.com>
- Cc: html-tidy@w3.org
joe user wrote:

> Hello Tidy people,
>
> I am trying to use Tidy to do its magic on (possibly
> broken) html files, for input to other layers of
> processing in C. I need to get access to the body of
> stuff.
>
> For example, in this block:
>
> <p>This is some text.</p>
>
> how do I get access to the "This is some text." part?
> I can get a stream of TidyNodes, which have
> attributes, but what about the actual content? I
> assume that the entire sequence of <p>Text</p> counts
> as a single TidyNode?
>
> Thanks for any tips on this.

// Needs <string.h> for memcpy, plus Tidy's internal headers for
// TidyDocImpl and Node.
//
// Fills rgBuffer with all the text at or below node, copied from the
// lexer's buffer, up to nBufSize minus one characters. Text is UTF-8
// encoded. No markup is included, but the contents of comment nodes are
// copied along with ordinary text.
//
// In <p>This is some text.</p> the text is a child TextNode of the <p>
// element node, which is why the function walks node->content.
int getTextFromNode( TidyDocImpl* doc, Node *node, char *rgBuffer, int nBufSize )
{
    int len = 0;
    Node *pTemp;

    nBufSize--; // So we have room to null terminate.

    for (pTemp = node->content; NULL != pTemp && len < nBufSize; pTemp = pTemp->next )
    {
        // Pass the remaining room plus one, because the callee reserves
        // its own terminator byte again.
        len += getTextFromNode( doc, pTemp, &rgBuffer[ len ], nBufSize - len + 1 );
    }

    if (len < nBufSize && (node->type == TextNode || node->type == CommentTag ))
    {
        int nToCopy = node->end - node->start;
        if (nToCopy > 0)
        {
            // Clamp to the room that is left, not to the whole buffer.
            if (nToCopy > nBufSize - len)
                nToCopy = nBufSize - len;
            memcpy( &rgBuffer[ len ], &doc->lexer->lexbuf[ node->start ], nToCopy );
            len += nToCopy;
        }
    }

    rgBuffer[ len ] = '\0';
    return len;
}
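To try the recursion without building against Tidy's internal headers, here is a minimal self-contained sketch of the same idea. MockNode, mock_get_text, and mock_demo are hypothetical names made up for illustration; they are not part of Tidy and only mirror the fields the function above relies on (type, start, end, content, next). The sketch assumes text nodes are slices of one shared buffer and skips comment nodes for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for Tidy's internal Node and lexer buffer.
   These are NOT Tidy's real definitions. */
typedef enum { MockElementNode, MockTextNode } MockNodeType;

typedef struct MockNode {
    MockNodeType type;
    size_t start;              /* offset of the first text byte in lexbuf */
    size_t end;                /* offset one past the last text byte */
    struct MockNode *content;  /* first child, or NULL */
    struct MockNode *next;     /* next sibling, or NULL */
} MockNode;

/* Copies the text under node into out, which holds cap bytes (cap >= 1).
   Output is always NUL-terminated; returns bytes written before the NUL. */
size_t mock_get_text(const char *lexbuf, const MockNode *node,
                     char *out, size_t cap)
{
    size_t room, len = 0;
    const MockNode *child;

    assert(lexbuf != NULL && node != NULL && out != NULL && cap >= 1);
    room = cap - 1;  /* keep one byte for the terminator */

    /* Depth-first: gather text from the children in document order. */
    for (child = node->content; child != NULL && len < room; child = child->next)
        len += mock_get_text(lexbuf, child, out + len, cap - len);

    /* A text node contributes its own slice of the shared buffer. */
    if (node->type == MockTextNode && len < room) {
        size_t n = node->end - node->start;
        if (n > room - len)
            n = room - len;  /* clamp to the space that is left */
        memcpy(out + len, lexbuf + node->start, n);
        len += n;
    }

    out[len] = '\0';
    return len;
}

/* Builds <p>This is <b>some</b> text.</p> by hand and extracts its text.
   Returns the number of bytes written to out. */
size_t mock_demo(char *out, size_t cap)
{
    static const char lexbuf[] = "This is some text.";
    MockNode t3 = { MockTextNode, 12, 18, NULL, NULL };  /* " text."   */
    MockNode t2 = { MockTextNode, 8, 12, NULL, NULL };   /* "some"     */
    MockNode b  = { MockElementNode, 0, 0, &t2, &t3 };   /* <b>        */
    MockNode t1 = { MockTextNode, 0, 8, NULL, &b };      /* "This is " */
    MockNode p  = { MockElementNode, 0, 0, &t1, NULL };  /* <p>        */
    return mock_get_text(lexbuf, &p, out, cap);
}
```

Calling mock_demo with a 64-byte buffer leaves "This is some text." in it, with the nested <b> text picked up in document order. With an 8-byte buffer the result is truncated to "This is" and still NUL-terminated.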
Received on Thursday, 2 October 2003 15:29:33 UTC