Robot Having a problem with Non-HTML-tag <!--- ...---->

Hi,
I have found a similar problem with the robot. If an HTML page contains comments
of the form "<!--- ....--->", which many authoring tools use, the robot
stops parsing the page at the point where the comment begins and does not
resynchronize afterwards. Is this a bug or a feature?
Regards, Karl-Otto
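
For illustration, here is a minimal sketch of how a scanner could step over such a comment and resynchronize. It is not libwww's parser code: the helper name skip_comment is hypothetical, and the sketch assumes a lenient rule in which a comment opens with "<!--" and closes at the first "-->" that follows. Under that assumption, "<!--- ... --->" is treated as an ordinary comment.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libwww.
 * If p points at "<!--", return a pointer just past the first "-->"
 * that follows it. If the comment is never closed, return a pointer
 * to the terminating NUL. If p is not at a comment, return p unchanged.
 */
const char *skip_comment(const char *p)
{
    const char *end;

    assert(p != NULL);

    if (strncmp(p, "<!--", 4) != 0)
        return p;                 /* not a comment opener */

    end = strstr(p + 4, "-->");   /* lenient: first "-->" closes it */
    if (end == NULL)
        return p + strlen(p);     /* unterminated: consume the rest */

    return end + 3;               /* resume right after "-->" */
}
```

A caller would invoke this whenever it sees '<' and continue scanning from the returned pointer, so markup after the comment is still parsed.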

Raffaele Sena wrote:

> >
> > In this function ... e.g.:
> >
> > PRIVATE void unparsedBeginElement (HText* pDataStruct,
> >                                    const char* cpszBuffer, int iLength)
> > {
> > HTPrint("\n\nFound a unparsed Element -> [%d]*%s*\n", iLength, cpszBuffer);
> > }
> >
> > ... I only receive the unknown tag in cpszBuffer and its length in iLength.
> > But how can I access the rest of the tag, i.e. its parameters?
> >
> > Hope you can help me ;)
> >
>     Unfortunately, there is no way without changing libwww.
>
>     I noticed that some time ago, but there isn't an easy fix.
>
>     The way the SGML parser works today is that when it first finds a tag,
>     it checks whether the tag is valid before parsing the attributes. If it
>     is not, it calls unparsed_begin_element with no attributes and then
>     throws the attributes away.
>
>     I guess it could be changed to enter a state where it collects
>     everything up to the end tag and then calls the appropriate callback
>     (but then you'll have to parse the full line).
>
>     A better way could be to collect the attributes without checking them
>     and pass them to the callback in the attributes array, maybe in the
>     form "ATTRNAME=VALUE", which should be easy to parse.
>
>     ...but either way, it needs to be implemented.
>
>     In your specific case, you may want to add the <EMBED> tag to
>     HTMLPDTD.c and HTMLPDTD.h (just put it in the right place :)
>
> -- Raffaele
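
To make the "ATTRNAME=VALUE" suggestion concrete, here is a small sketch of how a callback might split one such string. This is only an assumption about what the proposed format could look like. libwww does not pass attributes of unknown tags this way today, and the helper name split_attribute is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libwww.
 * Splits a writable "NAME=VALUE" string in place at the first '='.
 * On return, *name points at NAME. *value points at VALUE, or is NULL
 * if the string has no '=' (a bare attribute with no value).
 * Returns 1 if a value was found, 0 otherwise.
 */
int split_attribute(char *attr, char **name, char **value)
{
    char *eq;

    assert(attr != NULL && name != NULL && value != NULL);

    *name = attr;
    eq = strchr(attr, '=');
    if (eq == NULL) {
        *value = NULL;            /* bare attribute, nothing to split */
        return 0;
    }

    *eq = '\0';                   /* terminate NAME where '=' was */
    *value = eq + 1;              /* VALUE starts right after '=' */
    return 1;
}
```

Splitting at the first '=' keeps any later '=' characters inside the value, which matters for values such as URLs with query strings. Quote stripping and case folding of the name are left out of this sketch.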

Received on Friday, 20 August 1999 15:59:12 UTC