[whatwg] several messages about XML syntax and HTML5

Alexey Feldgendler wrote:
> 
> LiveJournal, a popular blogging service, inserts hand-authored content into hand-authored templates. While the templates are written by competent authors who (mostly) know how to write proper HTML, blog posts are most often written by people who barely learnt how to use a bunch of tags. LiveJournal does some simple preprocessing (breaks paragraphs on newlines and strips dangerous markup like <script>) but otherwise leaves the content as is. That's why most blog pages on LiveJournal aren't even close to being valid HTML.
> 
> 

Actually, LiveJournal's HTML sanitizer[1] is not as simple as you 
suggest here. It also attempts to "fix" various markup errors by, 
for example:
  * Auto-closing badly-nested or unclosed elements
  * Escaping instances of bare special characters (<, >, &, etc.)
  * Adding quotes around all attribute values
  * ...

Of course, it isn't a validator. Apart from some special cases 
(filtering <script>, for example), it has no knowledge of the content 
models of the various HTML elements. That makes it a good example of 
why it's infeasible for a tool like this to "fix" a user's mess when 
dealing with hand-authored markup.
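
To make the fixes listed above, and their limits, a bit more concrete, 
here is a minimal sketch in Python. It is not LiveJournal's code (that 
is Perl, linked below) and it is not safe to use as a real sanitizer. 
The names Cleaner, clean, VOID and BLOCKED are made up for this 
illustration, and the only library it assumes is the standard html and 
html.parser modules.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, deliberately tiny lists for illustration only.
VOID = {"br", "hr", "img", "input"}   # elements that never get a close tag
BLOCKED = {"script"}                  # elements dropped along with their content


class Cleaner(HTMLParser):
    """Re-serializes tag soup as well-formed markup; knows no content models."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []          # output fragments
        self.open_tags = []    # stack of elements still open
        self.skip_depth = 0    # > 0 while inside a blocked element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Always emit attribute values inside double quotes.
        rendered = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
        )
        self.out.append("<%s%s>" % (tag, rendered))
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in BLOCKED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return  # stray close tag: drop it
        # Close everything opened after the matching start tag, then the tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            # Escape bare <, > and & in text.
            self.out.append(escape(data, quote=False))

    def close(self):
        super().close()
        # Auto-close whatever the author left open.
        while self.open_tags:
            self.out.append("</%s>" % self.open_tags.pop())


def clean(markup):
    cleaner = Cleaner()
    cleaner.feed(markup)
    cleaner.close()
    return "".join(cleaner.out)
```

For instance, clean("<b>bold <i>both</b> text") gives 
"<b>bold <i>both</i></b> text", and clean("<p>one<p>two") gives 
"<p>one<p>two</p></p>". The second result has every element closed, but 
it is still not valid HTML, because a p element cannot contain another 
p. Nothing in the sketch knows that.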

LiveJournal has a WYSIWYG editor in addition to accepting 
hand-edited HTML, but since it's based on the in-browser designMode 
feature, it often generates worse markup than most users do.


(It doesn't help that there is a coding standard in force for 
LiveJournal which mandates XHTML served as text/html across the board 
and that the system itself injects invalid HTML into the otherwise-valid 
templates.)

[1] LiveJournal actually has two of these. One is stream-based and is 
used to fix up the template output:
    <http://code.sixapart.com/svn/miscperl/trunk/HTMLCleaner.pm>
The other is a lot more picky and is used for fixing up content such 
as user entries and comments:
    <http://code.sixapart.com/svn/livejournal/trunk/cgi-bin/cleanhtml.pl>

The links to the source code are included in case anyone is interested.

Received on Friday, 8 December 2006 10:58:10 UTC