Re: HTML in EPUB 3.4 (was: Publishing Maintenance WG Teleconference - April 24 2025) from Romain Deltour on 2025-05-09 (public-pm-wg@w3.org from May 2025)

From: Romain Deltour <rdeltour@gmail.com>
Date: Fri, 9 May 2025 09:50:56 +0200
To: matt.garrish@gmail.com, Eric Hellman <eric@hellman.net>
Cc: W3C PM Working Group <public-pm-wg@w3.org>, Ivan Herman <ivan@w3.org>, Laurent Le Meur <laurent.lemeur@edrlab.org>
Message-Id: <1860CBE0-BBF2-487B-8757-C39E3F738B3C@gmail.com>
Hi,

Sorry I missed this thread.

> > Is that an oversight by epubcheck (and thus not implicated in 3.3 -> 3.4), or is that something intended?
>  Epubcheck doesn’t fully integrate all of the validator.nu java code checks, so it’s more of a development time and budget issue. There have been plans to try and better integrate the two, but I’m not the person to ask on where that stands. Maybe Romain will chime in with more insight.
>  Epubcheck largely relies on the relaxng and schematron checks with some additional code integration, so I expect the checks on table rows are probably in the Java code that currently isn’t implemented.
>  But this is where having html support could have benefits beyond the syntax options alone. If more development goes into validation then some of these current discrepancies will disappear.

Matt is right: the reason EPUBCheck is not fully integrating the validator.nu HTML checker is mostly due to timing and budgeting constraints. It is also due to the project history.

I personally believe that integrating the full validator.nu checks is sensible and should be the project’s objective. But it requires significant work on EPUBCheck, and can hardly be done in one's free time or as pure maintenance work.

As for the specific issue of table row content validation: yes, this is a typical case of something covered by the HTML checker’s code outside the schemas, so it is neither included in nor reported by EPUBCheck.
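
To illustrate why this kind of rule tends to live in code rather than in a grammar, here is a minimal Python sketch that flags table rows whose column count differs from the first row. It is not the HTML checker's actual code and not anything EPUBCheck ships; the names RowWidthChecker and uneven_rows are made up for the example, and it assumes a simplified table model (colspan is honoured, rowspan and nested tables are ignored).

```python
from html.parser import HTMLParser


class RowWidthChecker(HTMLParser):
    """Records the column width of every <tr>, grouped by <table>.

    Simplified on purpose: colspan is counted, rowspan and nested
    tables are NOT handled (hypothetical helper, for illustration only).
    """

    def __init__(self):
        super().__init__()
        self.tables = []     # one list of row widths per <table>
        self._width = None   # width of the row being read, None outside a row

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._close_row()  # tolerate an omitted </tr>
            self._width = 0
        elif tag in ("td", "th") and self._width is not None:
            span = dict(attrs).get("colspan") or "1"
            # Fall back to 1 for missing, zero, or malformed colspan values.
            self._width += max(1, int(span)) if span.isdecimal() else 1

    def handle_endtag(self, tag):
        if tag in ("tr", "table"):
            self._close_row()

    def _close_row(self):
        if self._width is not None:
            self.tables[-1].append(self._width)
            self._width = None


def uneven_rows(markup):
    """Return (table, row, width, expected) for rows unlike the first row."""
    checker = RowWidthChecker()
    checker.feed(markup)
    checker.close()
    problems = []
    for t, widths in enumerate(checker.tables):
        for r, width in enumerate(widths[1:], start=1):
            if width != widths[0]:
                problems.append((t, r, width, widths[0]))
    return problems
```

Calling uneven_rows on a table whose second row has one cell fewer than the first returns a single tuple pointing at that row. Counting cells across sibling rows needs state that spans elements, which is awkward to express in a RELAX NG content model and is the sort of thing a procedural check handles more naturally.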

Romain.



> On 29 Apr 2025, at 20:58, <matt.garrish@gmail.com> <matt.garrish@gmail.com> wrote:
> 
> > Is that an oversight by epubcheck (and thus not implicated in 3.3 -> 3.4), or is that something intended?
>  Epubcheck doesn’t fully integrate all of the validator.nu java code checks, so it’s more of a development time and budget issue. There have been plans to try and better integrate the two, but I’m not the person to ask on where that stands. Maybe Romain will chime in with more insight.
>  Epubcheck largely relies on the relaxng and schematron checks with some additional code integration, so I expect the checks on table rows are probably in the Java code that currently isn’t implemented.
>  But this is where having html support could have benefits beyond the syntax options alone. If more development goes into validation then some of these current discrepancies will disappear.
>  Matt
>  From: Eric Hellman <eric@hellman.net> 
> Sent: April 29, 2025 2:42 PM
> To: matt.garrish@gmail.com
> Cc: Ivan Herman <ivan@w3.org>; Laurent Le Meur <laurent.lemeur@edrlab.org>; W3C PM Working Group <public-pm-wg@w3.org>
> Subject: Re: HTML in EPUB 3.4 (was: Publishing Maintenance WG Teleconference - April 24 2025)
>  Matt,
>  Just as a concrete example, epubcheck does not care about the html5 rules about the number of cells in a table row, whereas the html5 validator marks them as errors.
>  Is that an oversight by epubcheck (and thus not implicated in 3.3 -> 3.4), or is that something intended?
>  Does EPUB3 XML use HTML5 content models? (like the "transparent" models?)
>  I apologize if I've not paid attention to the standards discussion, except when a file fails epubcheck. (I probably have more epubfiles that pass epubcheck than anyone on the list!)
>  Eric
> 
> 
>> On Apr 29, 2025, at 11:11 AM, <matt.garrish@gmail.com> <matt.garrish@gmail.com> wrote:
>>  And to be clear, XHTML support in EPUB 3 is not XHTML 1.1 support like it was in EPUB 2.
>>  XHTML in the context of EPUB 3 is the xml syntax of html (previously known as html5). We’re looking to add the html syntax. Two parts of the same thing.
>>  So, yes, epubcheck already uses part of the validator.nu code base to check epub 3 files. This change might make things a little easier in the long run since it would now encompass both syntaxes instead of one – no more hiving off the unwanted parts.
>>  Matt
>>  From: Ivan Herman <ivan@w3.org> 
>> Sent: April 29, 2025 9:52 AM
>> To: Eric Hellman <eric@hellman.net>
>> Cc: Laurent Le Meur <laurent.lemeur@edrlab.org>; W3C PM Working Group <public-pm-wg@w3.org>
>> Subject: Re: HTML in EPUB 3.4 (was: Publishing Maintenance WG Teleconference - April 24 2025)
>>   
>> 
>> 
>>> On 29 Apr 2025, at 15:19, Eric Hellman <eric@hellman.net> wrote:
>>>  Ivan,
>>>  Is it contemplated that the epub 3.4 validator will use a proper html5 validator for html content files? So as to have two different validators for two different flavors of epub 3.4 content files? 
>>>  In other words, will epub 3.4 require html5 validation of html5 content files?
>> 
>>  Yes. 
>>  To be clear: at this moment the reason we are putting all this out there is because we have to check the feasibility of all this. The maintainers of epubcheck are part of this discussion, and they will have to tell us about the details; their experiences will be vital. But, from the specification side we are talking about the full HTML5 being checked.
>> 
>> 
>> 
>>> Will epub 3.4 allow only a subset of html5 features? If so, how will it validate compliance?
>> 
>>  No. There is no plan to subset HTML5. That would be a bad idea, it would clearly be error-prone.
>>  There is some "profiling" done in EPUB 3.3 already[1][2], which is serialization independent. The only issue that needs an HTML-specific alternative is the usage of epub:type, which relies on namespaces; the new version introduces epub-type as the way for HTML[3].
>>  [1] https://www.w3.org/TR/epub-33/#sec-xhtml
>> [2] https://www.w3.org/TR/epub-34/#sec-xhtml
>> [3] https://www.w3.org/TR/epub-34/#sec-epub-type-attribute
>> 
>> 
>> 
>>>  Or perhaps, will epub 3.4 unify a valid DOM across both xml and html content files?
>> 
>>  I am not sure what you mean. EPUB 3.3 (and probably earlier versions as well) refers to the HTML specification (and the DOM is clearly defined in that spec). No change there. But we also referred to the XML serialization thereof (which is part of the HTML spec). The problem is that the HTML specification, whilst defining the XML serialization[4], clearly states that it is discouraged. So EPUB 3.4 does not do anything really new in this respect, only to remove the required usage of that XML serialization.
>>  Maybe what is misleading: we did not rely on DTD/XML Schemas, i.e., the old versions of XHTML 1.1. Afaik, epubcheck does not do that, but uses the html validator itself (but I may be wrong on that).
>>  [4] https://html.spec.whatwg.org/multipage/xhtml.html#the-xhtml-syntax
>>  Ivan
>> 
>> 
>> 
>>>  Because "publishers may use HTML content documents in future EPUB publications" is not exactly a plan.
>>>  Eric
>>> 
>>> 
>>> 
>>>> On Apr 29, 2025, at 1:33 AM, Ivan Herman <ivan@w3.org> wrote:
>>>>  Hi Eric 
>>>>  (this comment is with my staff contact hat put down…)
>>>>  I have the impression, based on your comments below, that one factor is not clear (and we should be very careful about the messaging on that aspect). The plan is not to replace XHTML by HTML. We could not and should not do that; we have a strong constraint in our charter whereby we should keep backward compatibility. Any valid EPUB 3.3 document must remain valid EPUB 3.4. What we propose is that publishers may use HTML content documents in future EPUB publications. In other words, for example, publishers are not expected to "convert" their XHTML content files to HTML (which would definitely not be obvious, just as you say).
>>>>  That is also why the WG has decided, during the charter discussion, to keep to the EPUB 3.4 denomination, b.t.w.
>>>>  Cheers
>>>>  Ivan 
>>>> 
>>>> 
>>>> 
>>>>> On 24 Apr 2025, at 15:48, Eric Hellman <Eric@hellman.net> wrote:
>>>>>  A few points, based on using HTML5 files as source format for 75,000 different ebooks.
>>>>>  1. My "tooling" would easily switch to html5-inside EPUB.
>>>>>  2. Communicating the change to users would be impossible unless it was called "EPUB4". Also, forget EPUB4, it should be EPUB5.
>>>>>  3. Publishers wanting to switch will discover that their converted XHTML files don't validate to HTML5 because the HTML5 validator is able to find more errors than the DTD/schema based XML validators. In particular, the requirement that tables should have the right number of cells in every row, while present in the early standards, was never checked by validators, and IS checked by HTML5 validators.
>>>>>  Eric
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Apr 24, 2025, at 3:31 AM, Laurent Le Meur <laurent.lemeur@edrlab.org> wrote:
>>>>>>  Just a warning: I asked the developer of FBReader - one of our members - for his opinion on this evolution. His answer is in brief: 
>>>>>>  "That's a major change that will require significant additional effort on our end. So we will not be happy. However, I think it's not an absolute nightmare for us." 
>>>>>>  Note: FBReader does not use a Web view for rendering EPUB. It uses an XML format internally, not HTML. It also provides basic support for plain HTML files through a separate mechanism. Therefore, it must port some modern features from the XML parser to the HTML parser.
>>>>>>  Conclusion: It is a logical move, but we must communicate extensively and in advance (at least 1 year) so that reading system developers can prepare for that evolution. 
>>>>>>  Best regards
>>>>>> Laurent LE MEUR / EDRLab
>>>>>>  << Attend the Digital Publishing Summit, 16-17 June 2025, Dublin - https://www.edrlab.org/events/digital-publishing-summit-2025/ >>
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 24 Apr 2025, at 09:15, Gregorio Pellegrino - Fondazione LIA <gregorio.pellegrino@fondazionelia.org> wrote:
>>>>>>>  Very interesting. I did some quick tests with the HTML-EPUB. It seems that the reading solutions read it without problems, while the supply chain tools (EPUBCheck, Ace, etc.) generate blocking errors.
>>>>>>>  This confirms what was discussed earlier: this change to the specification impacts the supply chain (where many tools based on XML technologies are present) more than the reading solutions, which in many cases are able to read HTML without problems.
>>>>>>>  In the meantime, I send my regrets for today: I am at the IAAP Europe conference in Brno and during the meeting time I have to moderate a panel discussion 😊
>>>>>>>  Gregorio
>>>>>>>  From: Toshiaki Koike <koike@voyager.co.jp>
>>>>>>> Date: Thursday, 24 April 2025 at 05:34
>>>>>>> To: public-pm-wg@w3.org <public-pm-wg@w3.org>
>>>>>>> Subject: Re: [AGENDA] Publishing Maintenance WG Teleconference - April 24 2025
>>>>>>> Hi all,
>>>>>>>  I have created a script to experimentally convert existing XHTML-based EPUB 3 files to HTML-based EPUB 3. Using this script, I converted a Japanese EPUB 3 sample into an HTML-based sample.
>>>>>>> As expected, EPUBCheck v5.2.1 reports errors for this file.
>>>>>>>  https://github.com/toshiakikoike/html-based-epub-experimental
>>>>>>>   
>>>>>>  
>>>>>  
>>>>  
>>>> ----
>>>> Ivan Herman, W3C 
>>>> Home: http://www.w3.org/People/Ivan/
>>>> mobile: +33 6 52 46 00 43
>>>> 
>>>> 
>>>>  
>>>  
>>  
Received on Friday, 9 May 2025 07:51:16 UTC