[Bug 14993] The list of named character references at http://www.w3.org/TR/html5/named-character-references.html (8.5 Named character references) should also be available in an easy-to-parse format (e.g. plain text or json). This will allow developers to use it with

From: <bugzilla@jessica.w3.org>
Date: Tue, 28 Feb 2012 11:59:15 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1S2Li7-0003Bj-Jp@jessica.w3.org>
https://www.w3.org/Bugs/Public/show_bug.cgi?id=14993

--- Comment #7 from David Carlisle <davidc@nag.co.uk> 2012-02-28 11:59:14 UTC ---
(In reply to comment #6)
> That would work for me.  Note that unlike the list you linked the trailing ';'
> should be included where necessary in the list of HTML5 references (the list at
> http://www.w3.org/TR/html5/named-character-references.html includes both
> references with and without the ';').

I wouldn't want to put the ; in the names. It's a syntactic feature of html
that lets you omit the ; in some cases, but the name of the entity doesn't
include the ; (and including it would make it a lot harder to use that data in
xml). However, there would be no problem in having an additional json array
that listed the ones that don't need the ;. Actually I don't think unicode.xml
has that information: all the rest of the html entity list is extracted from
that file, but the additional ones without the ; are currently added during
that extraction process. I should probably record that list in the source file
anyway, for consistency.
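
As a rough illustration of the layout suggested above, here is a minimal Python
sketch. The dictionary, the list name semicolon_optional and the helper
html_reference_forms are all hypothetical, and only a handful of entries are
shown; this is not the actual format of any published file.

```python
import json

# Hypothetical layout: entity names are stored WITHOUT the trailing ';'.
entities = {
    "amp": "\u0026",
    "lt": "\u003C",
    "nbsp": "\u00A0",
    "DotDot": " \u20DC",  # leading space, as in the list discussed in this bug
}

# Hypothetical additional array: names that HTML also accepts without the ';'.
semicolon_optional = ["amp", "lt", "nbsp"]


def html_reference_forms(name):
    """Return the ways a name may be written as a reference in HTML."""
    forms = ["&" + name + ";"]
    if name in semicolon_optional:
        forms.append("&" + name)
    return forms


# Both structures serialise to plain JSON, so XML-oriented tools can use the
# names directly and HTML-oriented tools can consult the extra array.
document = json.dumps(
    {"entities": entities, "semicolonOptional": semicolon_optional}
)
```

Keeping the ; out of the keys means the same table works for XML, where the ;
is always required, while an HTML tokenizer can still look up which names
tolerate omission.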

> I also noticed that in your list the "DotDot" entry (and a few others) is
> equivalent to " \u20DC" (with a leading space), whereas the entry in the HTML5
> list only mentions U+020DC (without mentioning the leading space).  This is a
> combining character, so the reason for the extra space might be to prevent it
> to combine with the previous character. 

Yes, it is to ensure the resulting documents meet the "w3c normalisation form"
(from one of the charmod drafts that never progressed to recommendation
status), which said that entities should never start with a combining
character, so that entity expansion and unicode normalisation can be performed
in either order. There are 4 such cases, documented here:

http://www.w3.org/2003/entities/2007doc/Overview.html#chars_math-multiple-tables
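
To make the "either order" point concrete, here is a small Python sketch. The
entity name xacute and both tables are made up for illustration, the expander
only handles the &name; form, and the canonical combining class is used as an
approximate test for "combining character". U+0301 is used instead of U+20DC
because it composes with a preceding letter under NFC, which makes the
difference visible.

```python
import re
import unicodedata


def starts_with_combining(expansion):
    """Approximate test: first character has a non-zero combining class."""
    return bool(expansion) and unicodedata.combining(expansion[0]) != 0


def expand(text, table):
    """Replace &name; using a (hypothetical) table; unknown names are kept."""
    return re.sub(
        r"&(\w+);", lambda m: table.get(m.group(1), m.group(0)), text
    )


def orders_agree(source, table):
    """Compare expand-then-normalise with normalise-then-expand."""
    nfc = lambda s: unicodedata.normalize("NFC", s)
    return nfc(expand(source, table)) == expand(nfc(source), table)


# Hypothetical entity defined with and without a leading space.
bare = {"xacute": "\u0301"}
spaced = {"xacute": " \u0301"}

source = "a&xacute;"
bare_ok = orders_agree(source, bare)      # expansion starts with a combiner
spaced_ok = orders_agree(source, spaced)  # expansion starts with a space
```

With the bare definition, expanding first lets the accent combine with the
preceding "a" during normalisation, whereas normalising first leaves an
uncomposed sequence, so the two orders give different strings. With the
leading space the expansion cannot combine with the text before it, and both
orders agree.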

> I don't know if this should be the
> same for the HTML5 list too though.

I thought that this had been raised before but I don't see it in an existing
bug.

Received on Tuesday, 28 February 2012 11:59:23 GMT
