- From: Peter Murray-Rust <pm286@cam.ac.uk>
- Date: Thu, 26 Nov 2015 16:34:28 +0000
- To: Silvio Peroni <silvio.peroni@unibo.it>
- Cc: Ivan Herman <ivan@w3.org>, Sarven Capadisli <info@csarven.ca>, W3C Scholarly HTML CG <public-scholarlyhtml@w3.org>
- Message-ID: <CAD2k14OeEPYifk1drPFsUSRtxWZnj+j+XGFHtdYPx6Hft1LxmA@mail.gmail.com>
Greetings all, and delighted to talk with W3C stalwarts and see this exciting development happening. (I was very involved in developing XML and SAX and ran the XML-DEV list 18 years ago...)

Very brief history. We met in Cambridge 4-5 years ago because we needed a consistent semantic output for/from scientific documents. We chose HTML - effectively pre-HTML5 - and argued that what was necessary was an agreement on HTML document structure and labelling of sections and fragments. As simple as possible, and responsive to the community. We came up with the name ScholarlyHTML, and are delighted that this is valuable today.

2 years ago I started contentmine.org to extract facts from scientific publications. Because the inputs are so varied (HTML, XHTML, PDF, XML, DOCX, LaTeX, etc.) we normalize them to (X)HTML and create consistent labelling. (For example, Senay Kafka at Eur Bioinf Inst has created 20 labels for scientific scholarly documents (Introduction, Methods, Acknowledgements, etc.) with regexes to identify them.) I'd seriously consider adopting them as she has done the research work.

I think the details can and should be changed from our draft as it was pre-HTML5. The primary principle is simply:

- well-structured HTML
- with clear sectioning
- labelled with community agreed labels, but with flexibility

and of course specs, examples, validators (very useful for building software).

Not sure if I am yet subscribed - if not please forward this to the list.

Best

P.

On Thu, Nov 26, 2015 at 4:18 PM, Silvio Peroni <silvio.peroni@unibo.it> wrote:

> Hi Ivan and Sarven,
>
> Good find. It looks like this is it:
>
> https://github.com/ScholarlyHTML/spec
>
> Sarven, thanks for this.
>
> Ah. Sorry Silvio, I have not seen this when I replied.
>
> Yep, but even that one seems to be pretty shallow, too…
>
> Anyway, I will add this to the list.
>
> Ivan, as far as I know, it is the format used in the ContentMine.org
> <http://contentmine.org> project, which is pretty alive. I’ve CCed Peter
> Murray-Rust to the conversation, in case he would like to add something
> about the ScholarlyHTML guidelines they use.
>
> Have a nice day :-)
>
> S.
>
> ----------------------------------------------------------------------------
> Silvio Peroni, Ph.D.
> Department of Computer Science and Engineering
> University of Bologna, Bologna (Italy)
> Tel: +39 051 2094871
> E-mail: silvio.peroni@unibo.it
> Web: http://www.essepuntato.it
> Twitter: essepuntato

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
Received on Thursday, 26 November 2015 16:34:58 UTC