Re: Feed Validator : Parsing error in Atom [entity preceding closing tag] from Sam Ruby on 2006-05-25 (www-validator@w3.org from May 2006)

From: Sam Ruby <rubys@intertwingly.net>
Date: Thu, 25 May 2006 07:39:00 -0400
To: www-validator@w3.org
CC: Neil Smith <Neil_Smith@hargreaveslansdown.co.uk>
Message-ID: <44759754.9010208@intertwingly.net>

David Dorward wrote:
> On Thu, May 25, 2006 at 11:15:27AM +0100, Neil Smith wrote:
> 
>>When submitting a document in Atom format to the feed validator service 
>>http://validator.w3.org/feed/check.cgi
>>
>>Including an &amp; entity followed by a single character in the
>>range a-zA-Z immediately before the closing </title> tag causes
>>the feed validator to report "EOF in middle of entity":
> 
> I'm not an expert on Atom, but I believe this is what is happening:
> 
> Your title element has a type attribute that specifies it contains
> HTML and so the text must have special characters represented by
> character references.
> 
> This HTML is being represented in XML, so any special characters in
> the HTML source must also be represented as character entities.
>  
> Thus: foo&bar in text becomes
>       foo&amp;bar in HTML and
>       foo&amp;amp;bar in XML encoded HTML
> 
> You've only encoded the ampersand once, so are getting a warning.

Exactly.
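
To make the two levels of escaping concrete, here is a minimal sketch using
escape() and unescape() from Python's xml.sax.saxutils. The variable names
are only illustrative, and this is not what the feed validator itself runs;
it just reproduces the foo&bar progression described above with the A&L
example.

```python
from xml.sax.saxutils import escape, unescape

text = "A&L"               # what the reader should finally see
html = escape(text)        # HTML source for that text
xml_html = escape(html)    # the HTML source carried inside an XML element

print(text)
print(html)
print(xml_html)

# An XML parser undoes exactly one level, handing the HTML source back.
print(unescape(xml_html) == html)
```

A title with type="html" needs the last form, so that one level of
unescaping still leaves well-formed HTML.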

>>Using more than one alpha character after the &amp; entity does not
>>cause this error in the validator.  It should of course be
>>reasonable to end a title element in, for example, E&amp;O, or in our
>>case the abbreviation for a company, e.g. A&amp;L
> 
> I'm now entering the realm of guesswork, but I suspect that you can't
> have named entities with only one letter, so the parser knows that &O;
> isn't a real entity, but that &Ox; could be.

It seems that the parser doesn't like unclosed entities at the end of the
string.  If you have access to Python, you can experiment with the
following code:

---

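# The title content exactly as it appears in the feed (escaped once).
text="Viridian results higher on Irish businessE&amp;O"

from HTMLParser import HTMLParser, HTMLParseError
from xml.sax.saxutils import unescape

try:
  parser=HTMLParser()
  # unescape() undoes the XML level of escaping, leaving a bare "&O".
  parser.feed(unescape(text))
  # close() tells the parser that no more input is coming.
  parser.close()
  print 'ok'
except HTMLParseError, error:
  print error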

---
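
As a rough way to spot the same condition without going through the HTML
parser, the sketch below checks whether a title, after one level of XML
unescaping, ends in an "&" that is never closed by a ";". The helper
ends_with_open_entity and its regular expression are hypothetical, written
for this illustration only; they are an assumption about what "unclosed
entity at the end of the string" means, not the parser's or the validator's
actual logic.

```python
import re
from xml.sax.saxutils import unescape

# Hypothetical pattern: '&' followed by a name (or '#' and digits) that runs
# to the end of the string without a terminating ';'.
OPEN_ENTITY_AT_END = re.compile(r"&[a-zA-Z#][a-zA-Z0-9]*$")

def ends_with_open_entity(xml_text):
    """Return True if the unescaped text ends in an unterminated entity."""
    html = unescape(xml_text)  # undo the XML level of escaping
    return OPEN_ENTITY_AT_END.search(html) is not None

once = ends_with_open_entity("Irish businessE&amp;O")       # escaped once
twice = ends_with_open_entity("Irish businessE&amp;amp;O")  # escaped twice
print(once)
print(twice)
```

The singly escaped title is flagged and the doubly escaped one is not, which
matches the advice above to encode the ampersand twice.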

> (I read the mailing list, please address responses there and do not CC
> me.)

OK ;-)

- Sam Ruby

Received on Thursday, 25 May 2006 11:39:43 UTC