Re: Publication of scientific research from Phillip Lord on 2013-04-25 (public-lod@w3.org from April 2013)

From: Phillip Lord <phillip.lord@newcastle.ac.uk>
Date: Thu, 25 Apr 2013 14:25:29 +0100
To: Hugh Glaser <hg@ecs.soton.ac.uk>
Cc: Sarven Capadisli <info@csarven.ca>, "<public-lod@w3.org>" <public-lod@w3.org>
Message-ID: <87wqrqu4hi.fsf@zerg32.ncl.ac.uk>
You might be interested in this:

http://bio-ontologies.knowledgeblog.org/table-of-contents

These are papers from a workshop that I used to organise. The content,
as you can see, is in HTML and includes images and so forth. What is
perhaps less obvious is that the source in most cases is a Word
doc. All the content, including the images, was posted from Word. We did
have to do a little reformatting (the conference template is really the
most unhelpful that it could be -- my fault, I wrote it). It takes
around 5-10 minutes a paper on average (there is quite a wide variance).

What's more, the content has some semantic markup. The journal,
publication date, authors, and title are all clearly described in the
HTML; you can also retrieve this metadata as RDF, if you like. This
metadata was not added independently; it was present in the underlying
Word doc.

We added this by simply adding a little markup using shortcodes
([author]Phillip Lord[/author]). Of course this is entirely horrible, to
the point that a reviewer of my last grant called it a "drunk under a
lamppost idea". But it does work without requiring any modification of
Word. And it works for Wikipedia. Besides, there is nothing wrong with
being drunk under a lamppost occasionally.
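
To make the shortcode idea concrete, here is a minimal sketch in Python
of how such markup could be pulled out of a document's text. This is not
the code our tool runs; the function names are hypothetical, and the
[date] shortcode in the usage note is an assumption (only [author]
appears above).

```python
import re
from html import escape

# Matches [name]value[/name]; the backreference \1 requires that the
# closing tag carries the same name as the opening tag.
SHORTCODE = re.compile(r"\[(\w+)\](.*?)\[/\1\]", re.DOTALL)


def extract_shortcodes(text):
    """Return (metadata, cleaned_text) for text containing shortcodes.

    metadata maps each shortcode name to the list of values found, in
    document order. cleaned_text is the input with the shortcode tags
    removed and the enclosed values left in place.
    """
    metadata = {}
    for name, value in SHORTCODE.findall(text):
        metadata.setdefault(name, []).append(value.strip())
    cleaned = SHORTCODE.sub(lambda m: m.group(2), text)
    return metadata, cleaned


def to_meta_tags(metadata):
    """Render the metadata as HTML <meta> elements, one per value.

    Using the shortcode name directly as the meta name is an assumption
    made for illustration, not a description of any real plugin.
    """
    lines = []
    for name in sorted(metadata):
        for value in metadata[name]:
            lines.append('<meta name="{}" content="{}">'.format(
                escape(name, quote=True), escape(value, quote=True)))
    return "\n".join(lines)
```

Calling extract_shortcodes("By [author]Phillip Lord[/author],
[date]2013-04-25[/date].") would give the author and date as metadata
and leave readable prose behind, which is the whole point: the author
only types a few bracketed tags in Word.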

It's also possible to combine the web and PDF. So, for instance, this link:

http://www.russet.org.uk/blog/2366


is my OWLED paper. In this case, the title, author, and date come from
arXiv, and the abstract is transcluded from there. In short, it's an
overlay journal (article). The English summary and reviews are
independent, subsidiary content. In this case, the knowledge comes
from arXiv, where it has been added independently. I took this route
because, sad though it is to say, getting a Word doc on the web is much
easier than getting a LaTeX document up.
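
As a rough illustration of the overlay idea, the sketch below reads
title, authors, date, and abstract from an Atom entry, which is the
format arXiv's public API returns. It is a sketch under stated
assumptions rather than a description of how the page above is built:
the function name is hypothetical, fetching the XML is left out, and
only the standard Atom elements are read.

```python
import xml.etree.ElementTree as ET

# Atom namespace in the "{uri}" form that ElementTree expects.
ATOM = "{http://www.w3.org/2005/Atom}"


def parse_atom_entry(xml_text):
    """Extract basic metadata from the first Atom entry in xml_text.

    Accepts either a whole feed or a bare entry element. Whitespace
    inside each field is collapsed, since titles and abstracts are
    often wrapped across lines.
    """
    root = ET.fromstring(xml_text)
    entry = root if root.tag == ATOM + "entry" else root.find(ATOM + "entry")
    if entry is None:
        raise ValueError("no Atom entry found")

    def field(tag):
        node = entry.find(ATOM + tag)
        if node is None or not node.text:
            return ""
        return " ".join(node.text.split())

    authors = [
        " ".join(name.text.split())
        for name in entry.findall(ATOM + "author/" + ATOM + "name")
        if name.text
    ]
    return {
        "title": field("title"),
        "authors": authors,
        "date": field("published"),
        "abstract": field("summary"),
    }
```

The returned dictionary could then be dropped into the page template,
with the locally written summary and reviews sitting alongside it.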

We even have this working for CEUR-WS, although in this case, we lose
the abstracts; I would describe how we achieved this, but really, you
don't want to know.

The HTML is messy, of course, and dependent on the underlying tool. The
use of shortcodes is unprincipled and hideous. But it does work. And we
can add as much semantics as authors can be bothered with. Given that
the latter will be the limiting factor, I don't think it's a bad way
forward.

Phil

Hugh Glaser <hg@ecs.soton.ac.uk> writes:
> I hate PDF with a passion, by the way, but in the socio thingy of
> being an editor of a proceedings, it can be an enormous pain when
> people submit HTML that has local links to images, etc., even from MS
> Word documents.
>
> Cheers
>
> On 24 Apr 2013, at 18:23, Sarven Capadisli <info@csarven.ca>
>  wrote:
>
>> On 04/24/2013 05:37 PM, Andrea Splendiani wrote:
>>> There are two main issues in moving beyond PDF.
>>> 
>>> One, probably minor, is that there are larger constraints. Some
>>> people need their work to be somewhere "understood" by their
>>> organization. This is a bit less relevant for conferences than for
>>> journals, but still an issue.
>>> 
>>> The other is that some bits of a research paper can lend themselves
>>> to formalization. But there is a lot of variability. In some cases
>>> you are closer to what web languages can represent, e.g. a finding
>>> in RDF, some algorithm shown in JavaScript,... But what if somebody
>>> is publishing a description of an information system? It may get so
>>> far from a standard way to talk about things that you won't gain
>>> much with a structured representation.
>>> 
>>> pdf + other technologies, when it applies, could be a good idea,
>>> though.
>> 
>> I can't quite make out the core of the issues that you are trying to describe. So, from what I understand:
>> 
>> We could maybe at least give this HTML thing a try. And, later worry about semantic alignments?
>> 
>> IMHO, there is no compelling reason to research and try PDF + other
>> technologies, when we have HTML+RDF + other technologies already in place
>> and staring right at us.
>> 
>> -Sarven
>> 

-- 
Phillip Lord,                           Phone: +44 (0) 191 222 7827
Lecturer in Bioinformatics,             Email: phillip.lord@newcastle.ac.uk
School of Computing Science,            http://homepages.cs.ncl.ac.uk/phillip.lord
Room 914 Claremont Tower,               skype: russet_apples
Newcastle University,                   twitter: phillord
NE1 7RU
Received on Thursday, 25 April 2013 13:25:54 UTC