Re: [Semantic Data Extractor]

On Thursday 10 September 2009, Patrick Boens wrote:
> Hello,
>
> When I use the "Semantic Data Extractor" on <http://www.latosensu.be/>, I get:
>
> Using org.apache.xerces.parsers.SAXParser
>
> Exception net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException:
> The markup in the document following the root element must be well-formed.
>
> org.xml.sax.SAXParseException: The markup in the document following the
> root element must be well-formed.
>
> However, when I validate this page with the W3C validator, it seems that
> the document is perfectly well-formed.
>
> I don't know exactly why the parser blows up.

Neither do I, nor can I reproduce the problem at the moment.  But here are 
some things to look into that I noticed when grabbing the above URL with wget 
and libwww-perl's HEAD tool:

- No charset parameter in HTTP Content-Type header
- XHTML 1.1 served as text/html
- No XML declaration (IIRC this means XML processors will default to UTF-8,
  or UTF-16 if there is a byte order mark)
- The document contents are ISO-8859-1
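
The first point can be checked from the response headers alone.  A minimal 
sketch in Python, using the standard library's email.message to pick apart a 
Content-Type value.  The helper name charset_of is made up for this example, 
and the header strings are typed in by hand for illustration, not captured 
from the server:

```python
from email.message import Message


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None if absent
    return msg.get_content_charset()


# What the headers looked like: a media type with no charset parameter
print(charset_of("text/html"))
# What a header that declares the encoding would look like (hypothetical)
print(charset_of("text/html; charset=ISO-8859-1"))
```

With no charset parameter the helper returns None, which leaves a consumer to 
work out the encoding from the document itself.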

Unlike the markup validator, plain XML parsers most likely ignore the charset 
given in the document's meta http-equiv tag, so they would try to read the 
ISO-8859-1 bytes as UTF-8.
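
To illustrate the general effect (not necessarily the exact error reported 
above), here is a small sketch in Python.  It uses the expat-based parser 
from the standard library rather than Xerces, and the sample document is 
invented for the example, so take it as an assumption-laden demonstration 
only:

```python
import xml.etree.ElementTree as ET

# Invented sample: a meta http-equiv charset and one non-ASCII character
DOC = (
    '<html><head>'
    '<meta http-equiv="Content-Type" '
    'content="text/html; charset=iso-8859-1"/>'
    '</head><body><p>caf\u00e9</p></body></html>'
)
XML_DECL = '<?xml version="1.0" encoding="ISO-8859-1"?>'


def parses_as_xml(data):
    """Return True if the bytes parse as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# No XML declaration, ISO-8859-1 bytes: the parser assumes UTF-8 and fails
latin1_no_decl = DOC.encode("iso-8859-1")
# No XML declaration, UTF-8 bytes: matches the default, so it parses
utf8_no_decl = DOC.encode("utf-8")
# ISO-8859-1 bytes with a matching XML declaration: parses
latin1_with_decl = (XML_DECL + DOC).encode("iso-8859-1")

for name, data in [
    ("ISO-8859-1, no declaration", latin1_no_decl),
    ("UTF-8, no declaration", utf8_no_decl),
    ("ISO-8859-1, with declaration", latin1_with_decl),
]:
    print(name, "->", "ok" if parses_as_xml(data) else "parse error")
```

The meta element makes no difference to the XML parser here.  Only the XML 
declaration (or re-encoding the file as UTF-8) changes the outcome.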

Received on Friday, 11 September 2009 19:25:45 UTC