Re: Error recovery spec

James Clark scripsit:

> I assume this means that every element is a PossibleChild of every
> other element.

Yes.  That should be reformulated in terms of NonPossibleChild properties.

> In the default case (when there is no document type info available),
> does this produce the same result as what's in the spec currently?

I believe so but have not proved it.  Someone might want to use a
protocol verifier.

In addition, some language is needed for what to do at EOF.  Essentially,
EOF is the end-tag of an element that is a NonPossibleChild of every
element.
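
One reading of the EOF rule, sketched in Python.  Everything here is
an assumption made for illustration: the EOF sentinel, the table
layout (parent name mapped to its set of NonPossibleChild names), and
the simplest possible closing rule, which only ever inspects the top
of the stack.  It is not the spec's wording.

```python
EOF = object()  # hypothetical sentinel standing in for end of input


def non_possible_child(child, parent, table):
    """True if `child` may not appear directly inside `parent`."""
    if child is EOF:
        return True  # EOF is treated as a NonPossibleChild of every element
    return child in table.get(parent, set())


def close_for(incoming, stack, table):
    """Pop every open element that cannot contain `incoming`.

    Returns the popped names in the order their end-tags would be emitted.
    """
    closed = []
    while stack and non_possible_child(incoming, stack[-1], table):
        closed.append(stack.pop())
    return closed
```

With a table of {"p": {"p"}} and a stack of html, body, p, a second
"p" closes only the open "p", while EOF closes whatever is left.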

TagSoup has three more wrinkles:

1) The "form" and "table" elements have the UnClosable property, which
means that end-tags are never inserted for them except at EOF.

2) Character data can be a NonPossibleChild and may have a PreferredParent
too (for HTML it is "p").  It is never necessary to push it on the stack,
fortunately.

3) Attempts to explicitly close the root element are ignored, leaving
the matter to EOF processing.  This means that a second root element
will be the final child of the first root element, and so on recursively.
This could replace #doc-insertion.  I don't know how strongly you feel
about keeping it; with wrinkles #2 and #3 in place, streaming processing
is now possible.

TagSoup has the strong property that pushing an element onto the stack
always involves emitting a start-tag (when viewed in terms of streaming),
and popping an element from the stack always involves emitting an end-tag.
This guarantees that the output forms a hierarchy and is thus well-formed,
provided that character data and names meet the repertoire restrictions
(which TagSoup makes sure they do by brute force).
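
A minimal sketch of that invariant, in Python rather than TagSoup's
own code.  The Emitter class and its method names are invented for
illustration, and the sketch assumes names and character data already
meet the repertoire restrictions, so it does no escaping.

```python
class Emitter:
    """Hypothetical stack wrapper: push emits a start-tag, pop an end-tag."""

    def __init__(self):
        self.stack = []  # names of the currently open elements
        self.out = []    # emitted output, in order

    def push(self, name):
        # Pushing and emitting the start-tag always happen together.
        self.stack.append(name)
        self.out.append(f"<{name}>")

    def pop(self):
        # Popping and emitting the end-tag always happen together.
        name = self.stack.pop()
        self.out.append(f"</{name}>")

    def text(self, data):
        # Character data is emitted directly and never pushed.
        self.out.append(data)

    def eof(self):
        # At EOF every element still open is closed, innermost first.
        while self.stack:
            self.pop()
        return "".join(self.out)
```

Because the stack is the only place tags come from, any sequence of
push, pop, and text calls followed by eof yields properly nested
output.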

It is further true (again, I have not proved this formally) that provided
there are no loops in the PreferredParent graph, TagSoup will always
make progress and always terminate, however convoluted the input.
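
The no-loops precondition is cheap to check up front.  A sketch in
Python, assuming each element has at most one PreferredParent and that
the schema is available as a plain dict from element name to parent
name (both assumptions, as is the function name):

```python
def find_preferred_parent_loop(preferred_parent):
    """Return a name that lies on a PreferredParent loop, or None.

    `preferred_parent` maps an element name to its single preferred
    parent; elements with no preferred parent are absent from the map.
    """
    for start in preferred_parent:
        seen = set()
        node = start
        # Follow the chain of preferred parents until it runs out
        # or revisits a name already seen on this walk.
        while node in preferred_parent:
            if node in seen:
                return node
            seen.add(node)
            node = preferred_parent[node]
    return None
```

For example, {"li": "ul", "ul": "body"} has no loop, whereas
{"a": "b", "b": "a"} does.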

-- 
As you read this, I don't want you to feel      John Cowan
sorry for me, because, I believe everyone       cowan@ccil.org
will die someday.                               http://www.ccil.org/~cowan
        --From a Nigerian-type scam spam

Received on Monday, 17 December 2012 15:52:25 UTC