Re: Listing what is out there? from Peter Murray-Rust on 2015-11-26 (public-scholarlyhtml@w3.org from November 2015)

From: Peter Murray-Rust <pm286@cam.ac.uk>
Date: Thu, 26 Nov 2015 16:34:28 +0000
To: Silvio Peroni <silvio.peroni@unibo.it>
Cc: Ivan Herman <ivan@w3.org>, Sarven Capadisli <info@csarven.ca>, W3C Scholarly HTML CG <public-scholarlyhtml@w3.org>
Message-ID: <CAD2k14OeEPYifk1drPFsUSRtxWZnj+j+XGFHtdYPx6Hft1LxmA@mail.gmail.com>

Greetings all, and delighted to talk with W3C stalwarts and see this
exciting development happening. (I was very involved in developing XML and
SAX and ran the XML-DEV list 18 years ago...)

Very brief history. We met in Cambridge 4-5 years ago because we needed a
consistent semantic output for/from scientific documents. We chose HTML -
effectively pre-HTML5, and argued that what was necessary was an agreement
on HTML document structure and labelling of sections and fragments.

As simple as possible, and responsive to the community. We came up with the
name ScholarlyHTML, and are delighted that this is valuable today.

2 years ago I started contentmine.org to extract facts from scientific
publications. Because the inputs are so varied (HTML, XHTML, PDF, XML,
DOCX, LaTeX, etc.) we normalize them to (X)HTML and create consistent
labelling. (For example, Senay Kafka at the European Bioinformatics
Institute has created 20 labels for scientific scholarly documents
(Introduction, Methods, Acknowledgements, etc.) with regexes to identify
them.) I'd seriously consider adopting them as she has done the research
work.
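
To illustrate the regex-based labelling described above, here is a minimal
sketch in Python. The label names and patterns are hypothetical, made up for
illustration; they are not the actual 20 labels or regexes mentioned in this
message, and `label_heading` is an assumed helper name.

```python
import re

# Optional leading section number such as "2." or "3 ".
_NUM = r"^\s*(\d+\.?\s*)?"

# Hypothetical label -> pattern table (illustration only; NOT the real
# label set or regexes referred to in the message).
SECTION_PATTERNS = {
    "introduction": re.compile(_NUM + r"(introduction|background)\b", re.I),
    "methods": re.compile(
        _NUM + r"(materials\s+and\s+methods|methods?|experimental)\b", re.I
    ),
    "results": re.compile(_NUM + r"results?\b", re.I),
    "discussion": re.compile(_NUM + r"discussion\b", re.I),
    "acknowledgements": re.compile(_NUM + r"acknowledge?ments?\b", re.I),
}


def label_heading(text):
    """Return the first label whose pattern matches the heading, else None."""
    for label, pattern in SECTION_PATTERNS.items():
        if pattern.match(text):
            return label
    return None


if __name__ == "__main__":
    for heading in ["1. Introduction", "Materials and Methods", "References"]:
        print(heading, "->", label_heading(heading))
```

Unmatched headings return None rather than a guess, which leaves room for the
flexibility in labelling argued for here.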

I think the details can and should be changed from our draft as it was
pre-HTML5. The primary principle is simply:
- well-structured HTML
- with clear sectioning
- labelled with community agreed labels, but with flexibility

and of course specs, examples, validators (very useful for building
software).
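
As a rough idea of what a validator for these principles might check, here is
a small sketch using only the Python standard library. It assumes, purely for
illustration, that each section is a <section> element carrying its label in
a `data-label` attribute; that attribute name and the `check` function are
hypothetical and are not taken from any ScholarlyHTML draft.

```python
from html.parser import HTMLParser


class SectionChecker(HTMLParser):
    """Records <section> elements that lack a label or a heading."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.stack = []     # one dict per currently open <section>
        self.problems = []  # human-readable findings

    def handle_starttag(self, tag, attrs):
        if tag == "section":
            # "data-label" is an assumed attribute name, not a spec term.
            label = dict(attrs).get("data-label")
            self.stack.append({"label": label, "has_heading": False})
        elif tag in self.HEADINGS and self.stack:
            # A heading counts for the innermost open section only.
            self.stack[-1]["has_heading"] = True

    def handle_endtag(self, tag):
        if tag == "section" and self.stack:
            info = self.stack.pop()
            if not info["label"]:
                self.problems.append("section without a label")
            if not info["has_heading"]:
                self.problems.append(
                    "section %r without a heading" % (info["label"],)
                )


def check(html_text):
    """Return a list of problems found; an empty list means none found."""
    checker = SectionChecker()
    checker.feed(html_text)
    checker.close()
    return checker.problems
```

For example, check('<section data-label="methods"><h2>Methods</h2></section>')
returns an empty list, while a bare <section> with no label and no heading
yields two findings.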

Not sure if I am yet subscribed - if not please forward this to the list.

Best

P.

On Thu, Nov 26, 2015 at 4:18 PM, Silvio Peroni <silvio.peroni@unibo.it>
wrote:

> Hi Ivan and Sarven,
>
> Good find. It looks like this is it:
>
> https://github.com/ScholarlyHTML/spec
>
>
> Sarven, thanks for this.
>
> Ah. Sorry Silvio, I have not seen this when I replied.
>
> Yep, but even that one seems to be pretty shallow, too…
>
> Anyway, I will add this to the list.
>
>
> Ivan, as far as I know, it is the format used in the ContentMine.org
> <http://contentmine.org> project, which is pretty alive. I’ve CCed Peter
> Murray-Rust to the conversation, in case he would like to add something
> about the ScholarlyHTML guidelines they use.
>
> Have a nice day :-)
>
> S.
>
> ----------------------------------------------------------------------------
> Silvio Peroni, Ph.D.
> Department of Computer Science and Engineering
> University of Bologna, Bologna (Italy)
> Tel: +39 051 2094871
> E-mail: silvio.peroni@unibo.it
> Web: http://www.essepuntato.it
> Twitter: essepuntato
>
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069

Received on Thursday, 26 November 2015 16:34:58 UTC