- From: Eric J. Bowman <eric@bisonsystems.net>
- Date: Mon, 8 Oct 2012 03:54:05 -0600
- To: "Michael[tm] Smith" <mike@w3.org>
- Cc: Noah Mendelsohn <nrm@arcanedomain.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, Robin Berjon <robin@w3.org>, Larry Masinter <masinter@adobe.com>, W3C TAG <www-tag@w3.org>
"Michael[tm] Smith" wrote: > > > > > Excellent point. My server will consume user-created HTML 5, which > > it won't process with an HTML 5 parser. > > > > Why not? Is the HTML that you're processing not meant to also ever be > consumed by browsers? > Didn't realize it's a requirement that servers providing HTML 5 representations, must use an HTML 5 parser on the back end? ;-) > > I don't think that's true. I think the correct architectural choice > is to tie the media type to the spec that attempts to be the most > comprehensive specification for the language. That's no different > from the case of the HTML4 spec. There was not a separate author spec > for HTML4 -- there was just one spec. > No, there wasn't a separate browser spec for HTML 4, so yes this is different. ;-) I would also say the spec in question isn't attempting to be the most comprehensive spec for the language, but rather for a particular application of that language. Presentation is a separate issue from accepting a subset of markup in user-created content, which surely only needs reference to the author document, as do other examples of back-end HTML usage. Structured, static HTML (embedded and as documents) conveys semantics which have long made it ideal for message transport and storage, quite separate from its use as a presentation language. This is why devices can interpret data tables for those who can't rely on visual rendering of table/row relationships. I like storing HTML 4.01 <table> markup in XMLDB cells +Xquery, better than SQL+PHP. > > > All it needs to know is the syntax of HTML 5. > > The spec provides that information: > > http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#writing > The point being, not all consumers of HTML 5 are going to meet any conformance level specified for HTML 5 parsers. Tying the media type to the browser spec implies, or creates the expectation, that certain processing rules will be followed. 
Tying to the author document only creates expectations around the semantics of the markup, i.e. <p> is a paragraph. Which is all I need in order to process a document I have no intention to render -- is a tag block-level, nestable, etc.

This is how I wind up with an XML DB, where each URI/cell represents one day, holding a multi-dimensional array expressed as a <table> (think, "of box scores") where some <td>'s may contain further semantic HTML, i.e. <ol> for pitch tally. Highly XQuery-able for custom stats, like who threw the most pitches in a given month in the National League; re-using HTML semantics is a great way to build Web apps around randomly-structured data (as opposed to re-structuring the data for RDBMS efficiency), funner to code anyway.

So I don't need all those rendering and error-correction rules, and would like to avoid creating the mistaken impression that I'm following them by using a media type tied to processing rules I don't feel beholden to implement. What I am re-using is syntax and semantics, and as this is the most general-purpose takeaway from HTML 5, the author doc ought to be tied to the media type, to make it more widely applicable to how HTML has traditionally been used.

> > Otherwise an expectation is created, that consumers accepting HTML
> > will parse it a certain way.
>
> I don't think that's necessarily true. The HTML spec explicitly
> defines particular conformance classes, and makes it clear which
> parts of the spec apply to which particular conformance classes and
> which do not.

Whereas I'm slurping a restricted subset of elements and attributes, and rejecting any junk rather than trying to fix it, as the expected source of input is either my UI or other well-behaved (i.e. AtomPub) client. Nothing else is trusted enough by the server for it to attempt error correction or recovery, and nothing the HTML parser "fixes" would be generated by my UI or a decent AtomPub editor.
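
For illustration only, here is a minimal sketch of that reject-don't-repair approach, written in Python rather than XSLT. The whitelist contents and the names `ALLOWED`, `Rejected`, and `check_fragment` are hypothetical, and the sketch assumes the input is meant to be well-formed XML-syntax markup, so anything that fails to parse is refused outright.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist: element name -> attribute names allowed on it.
ALLOWED = {
    "p": set(),
    "em": set(),
    "strong": set(),
    "ol": set(),
    "ul": set(),
    "li": set(),
    "a": {"href"},
    "table": set(),
    "tr": set(),
    "td": {"colspan", "rowspan"},
}


class Rejected(ValueError):
    """Raised when a fragment falls outside the allowed subset."""


def _walk(el, ancestors):
    """Check one element, then its children; raise on the first violation."""
    if el.tag not in ALLOWED:
        raise Rejected(f"element not allowed: <{el.tag}>")
    extra = set(el.attrib) - ALLOWED[el.tag]
    if extra:
        raise Rejected(f"attribute not allowed on <{el.tag}>: {sorted(extra)}")
    if el.tag == "p" and "p" in ancestors:
        raise Rejected("nested <p> is not allowed")
    for child in el:
        _walk(child, ancestors + (el.tag,))


def check_fragment(markup):
    """Parse a fragment and return its wrapper element, or raise Rejected.

    Nothing is repaired: malformed input and unknown names both fail.
    """
    try:
        # Wrap in a container so several top-level elements still parse.
        root = ET.fromstring("<div>" + markup + "</div>")
    except ET.ParseError as err:
        raise Rejected(f"not well-formed: {err}") from err
    for child in root:
        _walk(child, ())
    return root
```

Calling `check_fragment("<p>ok <em>fine</em></p>")` returns the parsed wrapper, while an unclosed tag, a `<script>` element, or an `onclick` attribute each raise `Rejected`. The only nesting rule checked is nested `<p>`; a real content model would need a schema.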
Any resemblance between my results and any conformance class you mention is therefore strictly unintended and entirely coincidental. I will never be as liberal in the markup I allow as the browser vendors admittedly need to be. So I lack the motivation to become an expert on the HTML parser for my small subset of well-formed allowable markup.

My real concern, as I have yet to take the HTML 5 plunge, is that some user comes along who knows more about the HTML parser than I do and is able to use that knowledge to vandalize my site via a legitimate post, by causing some markup quirk I wasn't aware of or expecting, and hadn't tested my CSS against (like, say, nested <p>'s). That is, if my own legacy content doesn't self-vandalize, on certain pages and in a way that can only be detected by eyeballs, assuming I implement it on the back end to process user-created content...

> You can have a parser that follows the tokenization rules in the
> HTML5 spec and that then uses SAX or whatever to expose the start
> tags, characters, etc., to the rest of your application without ever
> constructing a DOM.

Or I can continue to use XSLT to process all input and output markup, and if I follow the rules for authoring HTML 5, browsers ought to do what I expect. I see no reason to re-architect my system to support HTML 5 -- all HTML is produced using XSLT templates, each capable of executing on the server or any capable client, so altering one .xsl file can upgrade an entire website from HTML 4 to HTML 5. Aging demo here:

http://charger.bisonsystems.net/conneg/

(date/content links and 'type' menu work, nothing else, no posting)

The server accepts and stores user-generated HTML embedded in Atom. The XSLT templates transform this Atom structure into HTML, or not, based on conneg between atom+xml, xhtml+xml (client-side XSLT), or text/html. I'll get around to HTML 5, but it will be generated via XSLT calling Atom w/ embedded HTML.
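
As a rough sketch of how a server might choose among those three representations, the Python below parses an Accept header and picks the acceptable type with the highest q-value. This is not the code behind the demo. The function names, the server-side preference order in `OFFERED`, and the fallback to text/html when no header is sent are all assumptions made for illustration.

```python
# Hypothetical server preference order; ties in q-value go to the earlier entry.
OFFERED = ["text/html", "application/xhtml+xml", "application/atom+xml"]


def parse_accept(header):
    """Return a list of (media_range, q) pairs from an Accept header value."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media = fields[0].lower()
        if not media:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        ranges.append((media, q))
    return ranges


def specificity(media_range, offered):
    """0 = no match, 1 = */*, 2 = type/*, 3 = exact match."""
    if media_range == offered:
        return 3
    r_type, _, r_sub = media_range.partition("/")
    if r_sub == "*" and r_type == offered.split("/")[0]:
        return 2
    if media_range == "*/*":
        return 1
    return 0


def negotiate(accept_header, offered=OFFERED, default="text/html"):
    """Pick a representation, or None if nothing offered is acceptable."""
    ranges = parse_accept(accept_header or "")
    if not ranges:
        return default
    best, best_q = None, 0.0
    for candidate in offered:
        # The most specific matching range decides this candidate's q.
        spec, q = max(
            ((specificity(r, candidate), rq) for r, rq in ranges),
            key=lambda m: m[0],
        )
        if spec and q > best_q:
            best, best_q = candidate, q
    return best
```

A `None` result would correspond to a 406 response. A client sending only `application/atom+xml` gets the raw Atom, while one sending `*/*` gets whichever type the server lists first.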
The big change is allowing HTML 5 syntax in the user-created content. I've tried various filtering setups with HTML Tidy and TagSoup, but found it's best to use libxslt or other engine configurable to read HTML. All I'm doing is extracting user-generated markup from an Atom wrapper, which matches my own rules about what elements and attributes are allowed, and how to handle those that aren't. The serialization is XHTML, and an output validator may be enabled to ensure XSLT output conforms to my rules after editing .xsl files. Garbage in, gold out.

I'd like to expand/alter my subset of allowed markup to take advantage of HTML 5, even before I've updated the output XSLT to produce HTML 5. But note these are *my* rules, not those of an HTML parser. I'm a hard sell for re-architecting my system around an HTML parser, when all I really want to do is update the markup I'm allowing folks to post by editing-in-place the XSLT/RELAX NG/Schematron of a working system -- which will still have older HTML content embedded in Atom that isn't HTML 5, so I want no surprises from my legacy content, either.

Anyway, allowing some new syntax = easy; re-architecting around HTML parser = fraught with peril, expensive, and unnecessary.

-Eric
Received on Monday, 8 October 2012 09:54:30 UTC