Re: Precision and error handling (was URL work in HTML 5) from Eric J. Bowman on 2012-10-08 (www-tag@w3.org from October 2012)

From: Eric J. Bowman <eric@bisonsystems.net>
Date: Mon, 8 Oct 2012 03:54:05 -0600
To: "Michael[tm] Smith" <mike@w3.org>
Cc: Noah Mendelsohn <nrm@arcanedomain.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, Robin Berjon <robin@w3.org>, Larry Masinter <masinter@adobe.com>, W3C TAG <www-tag@w3.org>
Message-Id: <20121008035405.296a6a71.eric@bisonsystems.net>
"Michael[tm] Smith" wrote:
>
> > 
> > Excellent point.  My server will consume user-created HTML 5, which
> > it won't process with an HTML 5 parser.
> >
> 
> Why not? Is the HTML that you're processing not meant to also ever be
> consumed by browsers?
> 

I didn't realize it was a requirement that servers providing HTML 5
representations must use an HTML 5 parser on the back end.  ;-)

>
> I don't think that's true. I think the correct architectural choice
> is to tie the media type to the spec that attempts to be the most
> comprehensive specification for the language. That's no different
> from the case of the HTML4 spec. There was not a separate author spec
> for HTML4 -- there was just one spec.
> 

No, there wasn't a separate browser spec for HTML 4, so yes this is
different. ;-)  I would also say the spec in question isn't attempting
to be the most comprehensive spec for the language, but rather for a
particular application of that language. Presentation is a separate
issue from accepting a subset of markup in user-created content, which
surely only needs reference to the author document, as do other
examples of back-end HTML usage.

Structured, static HTML (embedded and as documents) conveys semantics
which have long made it ideal for message transport and storage, quite
separate from its use as a presentation language.  This is why devices
can interpret data tables for those who can't rely on visual rendering
of table/row relationships.  I like storing HTML 4.01 <table> markup in
XML DB cells + XQuery, better than SQL+PHP.

>
> > All it needs to know is the syntax of HTML 5.
> 
> The spec provides that information:
> 
>   http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#writing
> 

The point being, not all consumers of HTML 5 are going to meet any
conformance level specified for HTML 5 parsers.  Tying the media type
to the browser spec implies, or creates the expectation, that certain
processing rules will be followed.  Tying to the author document only
creates expectations around the semantics of the markup, i.e. <p> is a
paragraph.

Which is all I need in order to process a document I have no intention
to render -- is a tag block-level, nestable etc.  This is how I wind up
with an XML DB, where each URI/cell represents one day, holding a multi-
dimensional array expressed as a <table> (think, "of box scores") where
some <td>'s may contain further semantic HTML, e.g. <ol> for pitch tally.
Highly XQuery-able for custom stats, like who threw the most pitches in
a given month in the National League; re-using HTML semantics is a great
way to build Web apps around randomly-structured data (as opposed to
re-structuring the data for RDBMS efficiency), funner to code anyway.
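
As a rough illustration of the kind of query described above (in Python
rather than the XQuery mentioned, and not the actual system), the sketch
below totals pitches per pitcher across daily tables.  The column layout
(name, league, <ol> of per-inning pitch counts), the sample data, and the
function name pitch_totals are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical layout: one well-formed <table> per day; each row is a
# pitcher, and the last cell holds an <ol> with one <li> per inning.
DAYS = [
    """<table>
         <tr><td>Smith</td><td>NL</td><td><ol><li>15</li><li>22</li></ol></td></tr>
         <tr><td>Jones</td><td>AL</td><td><ol><li>30</li></ol></td></tr>
       </table>""",
    """<table>
         <tr><td>Smith</td><td>NL</td><td><ol><li>18</li></ol></td></tr>
         <tr><td>Lopez</td><td>NL</td><td><ol><li>25</li><li>20</li><li>14</li></ol></td></tr>
       </table>""",
]

def pitch_totals(day_tables, league):
    """Sum the <li> pitch counts per pitcher for one league."""
    totals = Counter()
    for markup in day_tables:
        table = ET.fromstring(markup)
        for row in table.iter("tr"):
            cells = row.findall("td")
            name, row_league = cells[0].text, cells[1].text
            if row_league != league:
                continue
            totals[name] += sum(int(li.text) for li in cells[2].iter("li"))
    return totals

totals = pitch_totals(DAYS, "NL")
leader, pitches = totals.most_common(1)[0]
print(leader, pitches)
```

Nothing here depends on rendering rules; the code only relies on the
markup being well-formed and on <table>, <tr>, <td>, <ol> and <li>
meaning what the author document says they mean.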

So I don't need all those rendering and error-correction rules, and
would like to avoid creating the mistaken impression that I'm following
them by using a media type tied to processing rules I don't feel
beholden to implement.  What I am re-using is syntax and semantics, and
as this is the most general-purpose takeaway from HTML 5, the author
doc ought to be tied to the media type, to make it more widely
applicable to how HTML has traditionally been used.

>
> > Otherwise an expectation is created, that consumers accepting HTML
> > will parse it a certain way.
> 
> I don't think that's necessarily true. The HTML spec explicitly
> defines particular conformance classes, and makes it clear which
> parts of the spec apply to which particular conformance classes and
> which do not.
> 

Whereas I'm slurping a restricted subset of elements and attributes,
and rejecting any junk rather than trying to fix it, as the expected
source of input is either my UI or other well-behaved (i.e. AtomPub)
client.  Nothing else is trusted enough by the server for it to attempt
error correction or recovery, and nothing the HTML parser "fixes" would
be generated by my UI or a decent AtomPub editor.
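
A minimal sketch of "reject rather than fix", assuming the input arrives
as well-formed XML and that a simple element/attribute whitelist is
enough.  The ALLOWED table and the accept function are hypothetical
names for the example, not the rules actually used on the server.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist: element name -> attributes permitted on it.
ALLOWED = {
    "div": set(),
    "p": set(),
    "em": set(),
    "a": {"href"},
    "ol": set(),
    "li": set(),
}

def accept(markup):
    """Return True only if markup is well-formed and stays inside ALLOWED."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return False  # not well-formed: reject, never attempt repair
    for el in root.iter():
        if el.tag not in ALLOWED:
            return False  # element outside the subset
        if not set(el.attrib) <= ALLOWED[el.tag]:
            return False  # attribute outside the subset
    return True

print(accept("<div><p>Hi <em>there</em></p></div>"))
print(accept("<div><p>unclosed</div>"))
print(accept("<div><script>x</script></div>"))
```

Note that a flat whitelist like this says nothing about nesting (it
would pass a <p> inside a <p>); content-model constraints of that kind
need a schema language such as RELAX NG or Schematron.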

Any resemblance between my results and any conformance class you
mention is therefore strictly unintended and entirely coincidental.  I
will never be as liberal in the markup I allow, as the browser vendors
admittedly need to be.  So I lack the motivation to become an expert on
the HTML parser for my small subset of well-formed allowable markup.

My real concern, as I have yet to take the HTML 5 plunge, is that some
user comes along who knows more about the HTML parser than I do and is
able to use that knowledge to vandalize my site via a legitimate post,
by causing some markup quirk I wasn't aware of or expecting, and hadn't
tested my CSS against (like, say, nested <p>'s).  That is, if my own
legacy content doesn't self-vandalize first, on certain pages and in a
way that can only be detected by eyeballs, assuming I implement the HTML
parser on the back end to process user-created content...

> 
> You can have a parser that follows the tokenization rules in the
> HTML5 spec and that then uses SAX or whatever to expose the start
> tags, characters, etc., to the rest of your application without ever
> constructing a DOM.
> 

Or I can continue to use XSLT to process all input and output markup,
and if I follow the rules for authoring HTML 5, browsers ought to do
what I expect.  I see no reason to re-architect my system to support
HTML 5 -- all HTML is produced using XSLT templates, each capable of
executing on the server or any capable client, so altering one .xsl 
file can upgrade an entire website from HTML 4 to HTML 5.

Aging demo here:  http://charger.bisonsystems.net/conneg/
(date/content links and 'type' menu work, nothing else, no posting)

The server accepts and stores user-generated HTML embedded in Atom.
The XSLT templates transform this Atom structure into HTML, or not,
based on conneg between atom+xml, xhtml+xml (client-side XSLT), or
text/html.  I'll get around to HTML 5, but it will be generated via
XSLT calling Atom w/ embedded HTML.
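
The conneg step between those three media types could look roughly like
the sketch below.  It is a simplified reading of the Accept header (q
values and wildcard ranges only), and the server preference order in
OFFERED is an assumption for the example, not the demo's actual
configuration.

```python
# Media types named in the text, in a hypothetical server preference order.
OFFERED = ["application/atom+xml", "application/xhtml+xml", "text/html"]

def parse_accept(header):
    """Parse an Accept header into (media_range, q) pairs."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # unparseable q: treat as not acceptable
        ranges.append((fields[0].lower(), q))
    return ranges

def quality(media_type, ranges):
    """Return the q of the most specific range matching media_type."""
    major = media_type.split("/")[0]
    best_specificity, q = 0, 0.0
    for rng, rng_q in ranges:
        if rng == media_type:
            specificity = 3
        elif rng == major + "/*":
            specificity = 2
        elif rng == "*/*":
            specificity = 1
        else:
            continue
        if specificity > best_specificity:
            best_specificity, q = specificity, rng_q
    return q

def negotiate(header, offered=OFFERED):
    """Pick the offered type with the highest q; ties go to server order."""
    ranges = parse_accept(header)
    best, best_q = None, 0.0
    for media_type in offered:
        q = quality(media_type, ranges)
        if q > best_q:
            best, best_q = media_type, q
    return best  # None means nothing acceptable (a 406 in HTTP terms)

print(negotiate("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
```

A client that prefers text/html gets the server-side transform, one
that asks for application/xhtml+xml can run the XSLT itself, and a feed
reader asking for application/atom+xml gets the stored Atom untouched.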

The big change is allowing HTML 5 syntax in the user-created content.
I've tried various filtering setups with HTML Tidy and TagSoup, but
found it's best to use libxslt or another engine configurable to read
HTML.  All I'm doing is extracting user-generated markup from an Atom
wrapper, which matches my own rules about what elements and attributes
are allowed, and how to handle those that aren't.  The serialization is
XHTML, and an output validator may be enabled to ensure XSLT output
conforms to my rules after editing .xsl files.  Garbage in, gold out.
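
For readers unfamiliar with the Atom wrapper, here is a small sketch of
pulling the embedded XHTML out of an entry, in Python rather than the
XSLT actually used.  The sample entry and the function name
extract_xhtml are made up for illustration; the two namespace URIs are
the standard Atom and XHTML ones.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical entry: Atom carries XHTML content inside a single div.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example post</title>
  <content type="xhtml">
    <div xmlns="http://www.w3.org/1999/xhtml"><p>Hello, <em>world</em>.</p></div>
  </content>
</entry>"""

def extract_xhtml(entry_xml):
    """Return the XHTML div wrapped by an Atom entry's content element."""
    entry = ET.fromstring(entry_xml)
    content = entry.find(ATOM + "content")
    if content is None or content.get("type") != "xhtml":
        raise ValueError("entry has no XHTML content")
    div = content.find(XHTML + "div")
    if div is None:
        raise ValueError("XHTML content must be wrapped in a div")
    return div

div = extract_xhtml(ENTRY)
names = [el.tag.replace(XHTML, "") for el in div.iter()]
print(names)
```

The element names listed at the end are what a rule set about allowed
elements and attributes would then be applied to.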

I'd like to expand/alter my subset of allowed markup to take advantage
of HTML 5, even before I've updated the output XSLT to produce HTML 5.
But note these are *my* rules, not those of an HTML parser.  I'm a hard
sell for re-architecting my system around an HTML parser, when all I
really want to do is update the markup I'm allowing folks to post by
editing in place the XSLT/RELAX NG/Schematron of a working system --
which will still have older HTML content embedded in Atom that isn't
HTML 5, so I want no surprises from my legacy content, either.

Anyway, allowing some new syntax = easy; re-architecting around an HTML
parser = fraught with peril, expensive, and unnecessary.

-Eric
Received on Monday, 8 October 2012 09:54:30 UTC