One thought here is that the HTML elements/attributes for scholarly content should follow from how best to capture in HTML5 the concepts, information, and structure that scholarly/academic articles contain. That is, rather than jumping straight to HTML5, first enumerate the concepts in some language-agnostic way and then see which HTML5 construct fits each one best. The suggestions so far, both Johannes' below and the entire set of SH-and-friends candidate languages, presumably already reflect this exercise, but since they have apparently reached several different conclusions, maybe it would be worth going through the analysis explicitly again?

There's no shortage of material to draw on, given that there have been years (decades really) of defining the concepts important to research/scholarly articles. These are embodied in the many DTDs and other schemas, both public and proprietary, that publishers and consumers of such content have defined. I'm thinking here of everything from Majour headers through commercial publishers' DTDs to PubMed and JATS. If the efforts that have been listed already have done this analysis, can that be shared here? Our own "WileyML" [1] intends to capture all of the concepts that Wiley has found necessary so far for its academic/journal content (but we are in flux, anticipating the future). These now extend beyond academic/scholarly articles, but there’s a pared-down list below that I’ve tried to restrict to concepts relevant to scholarly/academic articles.

As usual, the devil is in the details, a prime example being something as fundamental as paragraphs. The reality is that HTML's <p>, even in HTML5, does not fully capture the semantic notion of "paragraph", since a paragraph can, for example, contain displayed equations (it's relatively common for an equation to be followed by “where x is….”, clearly part of the same paragraph, although that’s not the only case). However, the proper HTML for displayed objects such as equations involves a <div>, which cannot occur within <p>.
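To make the paragraph problem concrete, here is a minimal sketch of the equation case. The class name `equation` and the formula itself are only illustrative assumptions, not part of any proposal so far.

```html
<!-- Intended: ONE paragraph that contains a displayed equation. -->
<p>The rest energy of a body is
  <div class="equation">E = mc<sup>2</sup></div>
where m is its mass and c is the speed of light.</p>

<!-- What an HTML5 parser builds from the markup above: the <div> start
     tag implicitly closes the open <p>, the "where ..." text ends up
     outside any paragraph, and the stray </p> produces an empty <p>. -->
<p>The rest energy of a body is
  </p>
<div class="equation">E = mc<sup>2</sup></div>
where m is its mass and c is the speed of light.<p></p>
```

So the semantic paragraph either has to be split into two <p> elements around the equation, or be represented by some other wrapper that is allowed to contain both text and block-level children.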

And of course it's not just a matter of specifying which elements, attributes, and values may be needed; structuring and additional rules may also be appropriate. Still, the list could be a good start. Is it worth us filling this out to agree on all the concepts we want to capture for scholarly/academic articles and then specifying the best HTML5 construct for each? (No doubt the answer for many is <span> with some attribute(s).) We could also add restrictions/structuring. RELAX NG and Schematron anyone? :)

Scholarly/Academic Concepts, not including OASIS tables and MathML
Metadata        HTML /structure/constraints
journal level
DOI for journal
issn (print)
issn (electronic)
id (journal)
title (of journal)
abbreviated title (of journal)
subject (of journal)
issue level
position in volume
DOI for issue
title (issue/supplement)
copyright owner (issue)
copyright line (issue)
volume number
issue number
supplement number
editor (for special issue)
date issue started
date issue completed
cover date
cover date (display form)
article level
article type
article status
position in issue
page total
word count
access type (open?)
ToC heading for article 1st level
ToC heading for article 2nd level
ToC heading for article 3rd level
MedLine PubType
MeSH checkword
MeSH descriptor
MeSH descriptor major topic?
MeSH descriptor tree number
MeSH descriptor unique ID
MeSH qualifier
MeSH qualifier major topic
MeSH qualifier tree number
MeSH qualifier unique ID
link to typeset version
link to typeset version first page
link to plain text version
link to author manuscript version
embargo end date for author manuscript
title - ToC form
title - short (running)
title - short authors (running)
erratum target DOI
retraction target DOI
subject (article level)
subject relevance
editorial office ID
file ID
society ID
supplier ID
title (main article title)
subtitle (article)
article category title
pageHeading title
first page
last page
article copyright
doi (article)
online pub date
        creator role
        affiliation link
        current affiliation link
        given names
        name prefix (van der)
        family name
        name suffix (Jr.)
        titles after names
        preferred display name
        alternative name
        job title
        biographical info
        biographical photo
        email (for creator)
manuscript received date
manuscript revised date
manuscript accepted date
funding agency
funding grant number
funder DOI
Fundref name
license (legalStatement)
supporting information
corresponding author info
         country code
                country part
header footnotes
abstract type
abstract language
abstract title
keyword classification

Body Content
accession ID (e.g. GenBank ID)
appendix of an article
bold text
block that can float (box, quotation, graphic, pull quote, sidebar, text)
block that is fixed (box, dialogue, graphic, poetry, quotation, signature block, text)
caption of a figure
chemical structure (image and possible description and number)
computer code (block of lines)
data for a media resource, such as hex coding or TeX
definition list (abbreviation list)
displayedItem (equation, reaction)
email address
fixed case text
feature that can float
feature fixed in place
fixed italic text
field in a record
figure part
fixed roman text (text that must stay in roman)
italic text
information asset, such as a chemical name or gene
inline graphic
label (for an irregularly numbered object)
letter (such as a letter to the editor)
line (e.g. of computer code or poetry)
lineated text (group of lines, possibly numbered)
link to another object
list (various styles)
list item
list item pair wrapper
paired list (such as for definitions)
paired list column header
math statement attribution or other detail
math statement (theorem, lemma, etc.)
mediaResource (binary resource, possibly with MIME type etc.)
note (footnote, "marginal", or assigned to a whole object)
end notes
paragraph (semantic)
laboratory protocol
protocol materials
protocol procedure/recipe
protocol section
protocol step
record similar to a database record (with fields)
region of an image
salutation in a letter
small caps
source for a figure, table, etc.
span (for CSS styling, assigning an ID, etc.)
sub-article (such as an historical article for commentary)
tabular content that can float
tabular content fixed in place
term definition
title of a section, figure, table, list item, etc.

References/Bibliographies (some of the other elements above can also be used in citations)
article title in a citation
author in a citation
bibliographic item (may be several citations)
bibliography section
book series title in a citation
book title in a citation
chapter title in a citation
citation (may also occur inline in main body text)
corporate or collaborative group name in a citation
defendant in a legal citation
journal title in a citation
title in a citation other than book, journal, article, e.g. dissertation or online resource
plaintiff in a legal citation
publisher location in a citation
publisher name in a citation
publication year in a citation
statute title in a legal citation
volume number in a citation

Hi Johannes,

thanks for sharing that list, it is useful. I'm just adding some parts that we've seen needed below (not necessarily exhaustive). The list we have comes from encoding actual articles into SH.

> **Block level elements for textual contents**
> - P
> - H1-H3
> - Blockquote
> - Code
> - UL/OL

Point of terminology: I got tired of saying the many variations on "block content", "blocks such as paragraphs and tables", or "blocks but of text not, like, sections and stuff". Instead I minted "hunks", which means exactly: the blockish things inside sections, that aren't the title.

We've found a need for pretty much arbitrary header depth, not just beyond h3 but in cases beyond h6. For that we use h6 with aria-level set to the real depth.
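As a rough sketch of what that looks like in markup (the heading texts are made up for illustration):

```html
<!-- Depths 1 to 6 map onto the native elements h1 to h6. -->
<h5>Depth-5 heading</h5>
<h6>Depth-6 heading</h6>

<!-- Beyond depth 6 the element stays h6, and aria-level carries the
     real depth for assistive technology and for tooling. -->
<h6 aria-level="7">Depth-7 heading</h6>
<h6 aria-level="8">Depth-8 heading</h6>
```

A consumer that needs the true depth can read `aria-level` when it is present and fall back to the digit in the element name otherwise.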

We handle things like code (and images, tables, block equations) all as figures (even without a figcaption, which is fine). Beyond consistently making them captionable, this also provides nice common hooks for styling (and, as a bonus, a container inside of which to set up horizontal scrolling on small screens, which all of these types can need).
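A minimal sketch of the pattern follows. The content and the single CSS rule are illustrative assumptions; putting the scroll on the figure itself is just one way to do it.

```html
<style>
  /* One hook for every figure-like hunk: scroll sideways on narrow
     screens instead of overflowing the page. */
  figure { overflow-x: auto; }
</style>

<!-- Code as a figure, with a caption. -->
<figure>
  <pre><code>print("hello, world")</code></pre>
  <figcaption>Listing 1. A short example program.</figcaption>
</figure>

<!-- A table as a figure, without a caption (also fine). -->
<figure>
  <table>
    <tr><th>Sample</th><th>Mass (g)</th></tr>
    <tr><td>A</td><td>1.2</td></tr>
  </table>
</figure>

<!-- A block equation as a figure. -->
<figure>
  <math display="block"><mi>E</mi><mo>=</mo><mi>m</mi><msup><mi>c</mi><mn>2</mn></msup></math>
</figure>
```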

> **Inline text elements**
> - Links (standard HTML links)
> - Footnotes (Have to be displayed off to the side or below the text,
> and need to be able to contain all the things that body elements can
> contain)

I've listed a few more:

Of those, the most notable are the ones that enable internationalisation (ruby and friends), inline math or code, and simply making it possible to hang semantics off an added span.
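For illustration, a small sketch of those inline cases in one paragraph. The class name `gene` is a hypothetical hook, not something defined in SH.

```html
<p>
  <!-- Ruby annotation for internationalised text. -->
  The word <ruby>漢字<rp>(</rp><rt>かんじ</rt><rp>)</rp></ruby> is annotated
  with its reading;
  <!-- Inline code and inline math. -->
  the command <code>git status</code> and the expression
  <math><mi>x</mi><mo>+</mo><mn>1</mn></math> sit inside the sentence;
  <!-- A plain span as a place to hang extra semantics. -->
  and <span class="gene">BRCA1</span> is marked up as a named entity.
</p>
```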

• Robin Berjon • @robinberjon • intelligent science publishing •

