Re: DOM Structure Problems - Suggestions for modification??? from Scott Dossett on 2000-08-04 (html-tidy@w3.org from July to September 2000)

From: Scott Dossett <sdossett@metaphoria.net>
Date: Fri, 4 Aug 2000 11:45:22 -0400
To: "HTMLTidy" <html-tidy@w3.org>
Message-ID: <000001bffe43$cd570780$046f10ac@duesenberg>
In reference to the discussion below (please read):

Problem:
    JTidy's DOM structure is not constructed correctly due to "incorrect"
parsing of singleton tags (e.g. META, LINK, BASE), which are treated as
containers.

Solution:
    I made the following changes:

    1.)  Modified the parseXMLElement() method of ParserImpl class to treat
specific singleton tags (META, IMG, LINK, AREA, BASE, BR, COL, FRAME, HR,
ISINDEX, INPUT, NEXTID, WBR) as singletons rather than container tags.  When
one of these known singleton tags (recognized using a hash-based comparison)
is received as a Start_Tag token, the method does not recurse; it simply
marks the tag as closed and inserts it at the end of the list.

    2.)  Modified the parseXMLElement() method of ParserImpl class to
properly process the OPTION tag for storage in the DOM structure.
    OPTION tags may only be followed by text data (according to the W3C HTML
4.0 DTD), so getNextToken() is called: if the next node is a text node, it
is appended to the list and then OPTION is appended; otherwise the next
token is pushed back, using ungetToken(), and processed the next time.
/OPTION tokens are ignored and no warnings are thrown.  Note that this
required a hack on the error reporting options for OPTION.

    3.)  Added a class, SingletonCmp, that performs the hash-based comparison
of tags against the known singletons described in change 1.
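
    To illustrate changes 1 and 2, here is a minimal, self-contained sketch
of the idea.  It is not the actual JTidy patch: SingletonSketch, its Node
class and the string token format are hypothetical stand-ins for ParserImpl,
JTidy's Node and the lexer, and a ListIterator's previous() plays the role
of ungetToken().  Attaching the text to the OPTION node is my reading of
change 2 for this toy tree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.ListIterator;
import java.util.Locale;
import java.util.Set;

public class SingletonSketch {
    // Tag names copied from the list in change 1.
    static final Set<String> SINGLETONS = new HashSet<>(Arrays.asList(
        "meta", "img", "link", "area", "base", "br", "col", "frame", "hr",
        "isindex", "input", "nextid", "wbr"));

    static boolean isSingleton(String tagName) {
        return SINGLETONS.contains(tagName.toLowerCase(Locale.ROOT));
    }

    /** Hypothetical minimal tree node; not JTidy's Node class. */
    static final class Node {
        final String name;
        final Node parent;
        final List<Node> children = new ArrayList<>();
        boolean closed;

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
            if (parent != null) {
                parent.children.add(this); // insert at end of parent's list
            }
        }
    }

    /** Tokens: "<x>" is a start tag, "</x>" an end tag, anything else text. */
    static Node build(String... tokens) {
        Node root = new Node("#root", null);
        Node current = root;
        ListIterator<String> it = Arrays.asList(tokens).listIterator();
        while (it.hasNext()) {
            String tok = it.next();
            if (!tok.startsWith("<")) {
                new Node("#text", current).closed = true;
            } else if (tok.equalsIgnoreCase("</option>")) {
                // /OPTION tokens are ignored, with no warning.
            } else if (tok.startsWith("</")) {
                // End tag: close the open container (no mismatch checks here).
                current.closed = true;
                if (current.parent != null) {
                    current = current.parent;
                }
            } else {
                String name = tok.substring(1, tok.length() - 1);
                Node node = new Node(name, current);
                if (isSingleton(name)) {
                    node.closed = true; // do not recurse into a singleton
                } else if (name.equalsIgnoreCase("option")) {
                    node.closed = true;
                    if (it.hasNext()) {
                        String next = it.next();
                        if (!next.startsWith("<")) {
                            new Node("#text", node).closed = true;
                        } else {
                            it.previous(); // push back, like ungetToken()
                        }
                    }
                } else {
                    current = node; // ordinary container: descend
                }
            }
        }
        return root;
    }

    public static void main(String[] args) {
        Node root = build("<head>", "<meta>", "<title>", "</title>",
                          "<link>", "</head>");
        Node head = root.children.get(0);
        for (Node child : head.children) {
            // Prints "meta -> head", "title -> head", "link -> head".
            System.out.println(child.name + " -> " + child.parent.name);
        }
    }
}
```

    With the singleton check in place, TITLE and LINK end up as siblings of
META under HEAD instead of as its children.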


Thanks,
Scott

----- Original Message -----
From: "Andy Quick" <ac.quick@sympatico.ca>
To: "Scott Dossett" <sdossett@metaphoria.net>
Sent: Wednesday, August 02, 2000 9:35 PM
Subject: Re: DOM Structure Problems - Suggestions for modification???


> Scott,
>
> As pointed out on the html-tidy list, Java tidy's DOM tree is a thin layer
> built on C tidy's internal parse tree.  Even though it is incorrect, I
> would prefer that Java tidy not deviate from C tidy when it comes to
> changing the structure of the parse tree.
>
> I suggest that if you want to make the change, post it to the list and ask
> Dave Raggett to make the change to C tidy as well.  Then I will port it to
> Java tidy.
>
> Unfortunately, Java tidy's DOM interface is an incomplete attempt at full
> DOM compliance - there is quite a lot of work to be done to complete it,
> and I don't have the time to do it at present.  If you want full DOM
> compliance right now, I suggest you use Java tidy to generate XHTML into a
> buffer, then pass the buffer to a real XML parser such as Xerces from
> xml.apache.org.
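
A minimal sketch of the second half of that workaround, assuming the XHTML
buffer has already been produced.  It uses the JDK's standard JAXP
DocumentBuilder rather than a separately installed Xerces, and the class and
method names (XhtmlDomSketch, parentOfFirst) and the sample markup are
hypothetical.  The XHTML has no DOCTYPE, so the parser does not need to fetch
a DTD.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class XhtmlDomSketch {
    /** Returns the name of the parent of the first element called tagName. */
    static String parentOfFirst(String xhtml, String tagName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            Node node = doc.getElementsByTagName(tagName).item(0);
            return node.getParentNode().getNodeName();
        } catch (Exception e) {
            // Sketch only: wrap parser and I/O errors in an unchecked exception.
            throw new IllegalStateException("could not parse XHTML buffer", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the buffer that Java tidy would generate.
        String xhtml = "<html><head>"
            + "<meta name=\"author\" content=\"x\" />"
            + "<title>t</title>"
            + "</head><body></body></html>";
        System.out.println(parentOfFirst(xhtml, "title")); // prints "head"
    }
}
```

Because the META element is explicitly closed in XHTML, an XML parser keeps
TITLE as a child of HEAD.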
>
> Andy Quick
> ----- Original Message -----
> From: Scott Dossett <sdossett@metaphoria.net>
> To: <ac.quick@sympatico.ca>
> Sent: Thursday, July 27, 2000 4:05 PM
> Subject: DOM Structure Problems - Suggestions for modification???
>
>
> > There seem to be some problems with the JTidy HTML DOM structure.  For
> > example, when a <META> tag is included in the <HEAD> portion of an html
> > doc, all other tags within the <HEAD> portion of the document are
> > considered children of the <META> tag (apparently due to the lack of
> > trailing slash in the <META> tag).  Therefore, whenever I call the
> > getParent() method on any of these nodes following <META>, <META> is
> > returned as the parent instead of <HEAD>.
> >
Received on Friday, 4 August 2000 14:34:52 UTC