Is this something for the primer?

I recently ran into a nasty bug in my implementation, and I wonder whether its cause should be mentioned in the primer. Not sure...

The manifestation of the bug: in some cases the encoding (i.e., the UTF-8 encoding) of literals went wrong.

The cause was none of the obvious ones (missing encoding on output, that sort of thing), though of course that is where I started. I then realized that it goes wrong only when I use the HTML5 parser, and it looked a bit random at first... To make a long story short, it is related to content sniffing.

What happens (I guess) is the following: the HTML5 parser ignores a possible <?xml ...?> declaration of the encoding (which is of course o.k.); instead, it looks into the <head> to see whether a <meta> element specifying the encoding is present. If it finds one, it uses the encoding specified there; otherwise it falls back to the default, which is the Windows encoding. That said, the parser can also be started with an explicit encoding parameter, in which case that encoding is used.
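
A minimal Python sketch of that order of precedence, as I understand it. This is not the code of any real HTML5 parser: the name `pick_encoding`, the `window` parameter and its value, the simplified regular expression, and the choice of windows-1252 as the fallback are all assumptions made for illustration.

```python
import re

# Simplified pattern covering <meta charset="..."> and the http-equiv form;
# a real parser's prescan is considerably more careful than this.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def pick_encoding(raw, explicit=None, window=512, fallback="windows-1252"):
    """Return the encoding a sniffing parser might settle on (illustrative only)."""
    # 1. An explicit encoding handed to the parser always wins.
    if explicit:
        return explicit
    # 2. Otherwise look for a <meta> declaration, but only near the start.
    match = META_CHARSET.search(raw[:window])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Nothing found in the window: fall back to the default.
    return fallback
```

Note that step 2 only ever sees the first `window` bytes of the document.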

Why does it go wrong? If one has an RDFa source with lots of prefix definitions on the <html> element, then sniffing may fail because it looks at the first ??? characters only (I do not remember the number off the top of my head), and the <meta> element may lie beyond that point. And that was the reason for my bug.
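
The effect can be reproduced in a few lines of Python. This is only a sketch: the `WINDOW` value is a made-up, deliberately small stand-in for the real sniffing limit, the prefix URIs are invented, and "the <meta> starts inside the window" is a crude approximation of what a real parser checks.

```python
WINDOW = 64  # hypothetical sniffing window; the real figure is larger

# An <html> element carrying many prefix definitions (the URIs are made up).
prefixes = " ".join(
    'xmlns:p{0}="http://example.org/ns/p{0}#"'.format(i) for i in range(10)
)
page = (
    '<html {}><head><meta charset="utf-8"></head>'
    '<body><p property="p0:name">Gödel</p></body></html>'
).format(prefixes).encode("utf-8")

# The prefix definitions push the <meta> element past the sniffing window...
meta_offset = page.find(b"<meta")
sniffed = "utf-8" if 0 <= meta_offset < WINDOW else "windows-1252"

# ...so the UTF-8 bytes are decoded with the fallback encoding instead.
text = page.decode(sniffed)
print(meta_offset, sniffed, "Gödel" in text)
```

With these values the <meta> element starts well beyond the window, the fallback is chosen, and the literal no longer reads "Gödel" after decoding.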

What I did to counter the bug: I now look into the HTTP response header (which I already did anyway), and if I find an encoding there, I pass it to the parser as the explicit encoding. But that, of course, presupposes that the content encoding and the HTTP response are in sync (which should be the case, but, well...). And, of course, this does not work for local files (which I simply treat as UTF-8).
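
A sketch of that workaround in Python, using only the standard library. The function name `encoding_for` and its parameters are hypothetical and not taken from my implementation; the result would be handed to whatever explicit-encoding parameter the parser offers.

```python
from email.message import Message


def encoding_for(content_type=None, is_local_file=False):
    """Choose the explicit encoding for the parser, or None to let it sniff."""
    if is_local_file:
        return "utf-8"  # local files are simply treated as UTF-8
    if not content_type:
        return None  # no Content-Type header at all
    # Let the standard library parse the header's charset parameter.
    header = Message()
    header["Content-Type"] = content_type
    return header.get_content_charset()  # None if there is no charset parameter


print(encoding_for("text/html; charset=UTF-8"))
print(encoding_for("text/html"))
print(encoding_for(is_local_file=True))
```

When the function returns None, the parser is left to its own sniffing, so the original problem can still occur for servers that send no charset.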

So... there is potential for practical problems here. The default profile mechanism alleviates this, because many of the prefix definitions become unnecessary (foaf, rdf, etc.), and so does the profile mechanism in general. But the potential problem remains nevertheless.

The primer could draw attention to this problem and advise authors to put the prefix definitions on the <body> element instead of the <html> element. If they are declared there, the problem does not occur.

Opinions?

Ivan

P.S. Yes, it sounds simple once solved, but it took me an hour or maybe more to realize that this was the problem! It was a way for me to fight jet lag; I figured it out in my hotel room in Hyderabad...

----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
PGP Key: http://www.ivan-herman.net/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf

Received on Wednesday, 20 April 2011 11:44:58 UTC