- From: Martin Atkins <mart@degeneration.co.uk>
- Date: Fri, 08 Dec 2006 18:58:10 +0000
Alexey Feldgendler wrote:
>
> LiveJournal, a popular blogging service, inserts hand-authored content
> into hand-authored templates. While the templates are written by
> competent authors who (mostly) know how to write proper HTML, blog
> posts are most often written by people who barely learnt how to use a
> bunch of tags. LiveJournal makes some simple preprocessing (breaks
> paragraphs on newlines and strips dangerous markup like <script>) but
> otherwise leaves the content as is. That's why most blog pages on
> LiveJournal aren't even close to being valid HTML.
>

Actually, LiveJournal's HTML sanitizer[1] is not as simple as you
suggest here. It attempts to "fix" various markup errors, such as:

 * Auto-closing badly-nested or unclosed elements
 * Escaping instances of bare special characters (<, >, &, etc.)
 * Adding quotes around all attribute values
 * ...

Of course, it isn't a validator, so apart from some special cases
(filtering <script>, for example) it has no knowledge of the content
models of the various HTML elements. That makes it a good example of
how infeasible it is for a tool like this to "fix" a user's mess when
dealing with hand-made markup.

LiveJournal also has a WYSIWYG editor in addition to accepting
hand-edited HTML, but since it's based on the in-browser designMode
feature it often generates worse markup than most users do.

(It doesn't help that there is a coding standard in force for
LiveJournal which mandates XHTML served as text/html across the board,
and that the system itself injects invalid HTML into the
otherwise-valid templates.)

[1] LiveJournal actually has two of these. One is stream-based and is
used to fix up the template output:
<http://code.sixapart.com/svn/miscperl/trunk/HTMLCleaner.pm>
...while the other is a lot more picky and is used for fixing up
content such as user entries and comments:
<http://code.sixapart.com/svn/livejournal/trunk/cgi-bin/cleanhtml.pl>

Links to the source code are included in case anyone is interested.
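
To make the three kinds of fix-up listed above concrete, here is a minimal sketch in Python. It is not LiveJournal's code (both cleaners linked in [1] are Perl), and the names Cleaner, clean, VOID and BLOCKED are hypothetical, chosen only for this illustration. It assumes the standard library's html.parser is a good enough tokenizer for the purpose.

```python
from html import escape
from html.parser import HTMLParser

# Assumption for this sketch: a small hand-picked set of empty elements.
VOID = {"br", "hr", "img", "input", "meta", "link"}
# Assumption: elements dropped together with their contents.
BLOCKED = {"script"}


class Cleaner(HTMLParser):
    """Hypothetical cleaner: re-serialises tokens with the fixes applied."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of cleaned output
        self.open_tags = []  # stack of elements still awaiting a close tag
        self.skip_depth = 0  # > 0 while inside a blocked element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Always quote attribute values, escaping them on the way out.
        rendered = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append("<%s%s />" % (tag, rendered))
        else:
            self.out.append("<%s%s>" % (tag, rendered))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in BLOCKED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.open_tags:
            return  # stray close tag: drop it
        # Auto-close anything opened after the element being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            # Text arrives unescaped, so bare <, > and & get escaped here.
            self.out.append(escape(data, quote=False))

    def finish(self):
        self.close()  # flush any text the parser is still buffering
        while self.open_tags:
            self.out.append("</%s>" % self.open_tags.pop())
        return "".join(self.out)


def clean(markup):
    cleaner = Cleaner()
    cleaner.feed(markup)
    return cleaner.finish()
```

For instance, clean("<b><i>hi</b>") returns "<b><i>hi</i></b>", and clean("<a href=x>AT&T") returns '<a href="x">AT&amp;T</a>'. Like the real thing, this sketch knows nothing about content models: it will happily emit a <p> inside a <b>, because it only balances tags and never checks what is allowed where.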
Received on Friday, 8 December 2006 10:58:10 UTC