Re: html for scholarly communication: RASH, Scholarly HTML or Dokieli? from Silvio Peroni on 2017-09-09 (public-scholarlyhtml@w3.org from September 2017)

From: Silvio Peroni <silvio.peroni@unibo.it>
Date: Sat, 9 Sep 2017 17:05:21 +0200
To: Johannes Wilm <johanneswilm@gmail.com>
CC: Peter Murray-Rust <pm286@cam.ac.uk>, Scholarly HTML community group <public-scholarlyhtml@w3.org>, Sarven Capadisli <info@csarven.ca>
Message-ID: <6E827543-6DC8-4781-91A4-5DB0D78499E2@unibo.it>
Hi,

I agree. I don't think we need hundreds of tags; rather the opposite. We can also use existing standards, e.g. DPUB-ARIA, to better characterise the actual semantics of a particular structure by specifying appropriate @role values.
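
To make the @role idea concrete, here is a rough sketch, not an agreed format. It assumes plain HTML sections annotated with three DPUB-ARIA roles (doc-abstract, doc-biblioref, doc-bibliography); the sample markup and the helper name collect_roles are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical fragment: generic tags, with the semantics carried by @role.
SAMPLE = """
<section role="doc-abstract"><p>We study scholarly markup.</p></section>
<section><p>As shown in <a role="doc-biblioref" href="#ref1">[1]</a>.</p></section>
<section role="doc-bibliography"><ul><li id="ref1">A reference.</li></ul></section>
"""


class RoleCollector(HTMLParser):
    """Collects (tag, role) pairs for every DPUB-ARIA role, in document order."""

    def __init__(self):
        super().__init__()
        self.roles = []

    def handle_starttag(self, tag, attrs):
        # @role may hold several space-separated tokens; keep the doc-* ones.
        value = dict(attrs).get("role") or ""
        for token in value.split():
            if token.startswith("doc-"):
                self.roles.append((tag, token))


def collect_roles(html):
    """Return the DPUB-ARIA roles found in an HTML string."""
    parser = RoleCollector()
    parser.feed(html)
    parser.close()
    return parser.roles


if __name__ == "__main__":
    for tag, role in collect_roles(SAMPLE):
        print(tag, role)
```

The point is only that a handful of generic tags plus roles is enough for a machine to find the abstract, the reference pointers and the bibliography.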

I'm not particularly worried about in-text reference pointers (or in-text citations, as Johannes wrote) right now. But yes, I believe they should be handled by the SH-CG. That is certainly something to discuss after the article analysis.

To answer Peter: the SH-CG should provide a "standard" way of using a minimal set of HTML tags to describe a scholarly article (independently of the discipline under consideration), and it should be flexible enough, e.g. via RDFa, to allow users to assign discipline-specific semantics to the various tags.
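
As a rough illustration of what reading such a minimal structure "into a machine" could look like, here is a sketch that assumes the div class names from Peter's outline quoted below (abstract, maintext, assets, references, publisher, supplemental). These names are not an agreed vocabulary, and the sample article and the function split_article are hypothetical.

```python
from html.parser import HTMLParser

# Assumed class names, taken from Peter's outline; not a standard.
SECTION_CLASSES = {
    "abstract", "maintext", "assets",
    "references", "publisher", "supplemental",
}

# Hypothetical article using those classes.
ARTICLE = """
<html><head><title>Example</title></head><body>
<div class="abstract"><p>Short summary.</p></div>
<div class="maintext"><div class="sec"><p>Core text.</p></div><p>More.</p></div>
<div class="references"><p>Nielsen 1972</p></div>
<div class="publisher"><p>Cookie banner</p></div>
</body></html>
"""


class SectionSplitter(HTMLParser):
    """Groups text by the nearest enclosing div with a known section class."""

    def __init__(self):
        super().__init__()
        self.sections = {}  # section name -> list of text chunks
        self.stack = []     # one entry per open div: section name or None

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = set((dict(attrs).get("class") or "").split())
        known = sorted(classes & SECTION_CLASSES)
        self.stack.append(known[0] if known else None)

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Innermost open div that carries a known section class, if any.
        name = next((n for n in reversed(self.stack) if n), None)
        if name and text:
            self.sections.setdefault(name, []).append(text)


def split_article(html, drop=("publisher",)):
    """Return {section: text}, leaving out the sections listed in drop."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return {k: " ".join(v) for k, v in parser.sections.items() if k not in drop}


if __name__ == "__main__":
    print(split_article(ARTICLE))
```

Finer, discipline-specific markup (for instance RDFa attributes inside the main text) would simply be ignored by a reader like this one, which is the behaviour Peter asks for.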

Have a nice day :-)

S.

> On 9 Sep 2017, at 16:44, Johannes Wilm <johanneswilm@gmail.com> wrote:
> 
> Hey,
> just quickly: humanists can probably remember just as much or as little as people in other fields. So 200 or 300 tags is, as with JATS, not something the average scientist can remember, so this must sit somewhere further down in the production chain. I will have to look at it in detail again for the humanities, but for the social sciences the main difficulty I have run into is citations, and getting them expressed in systems made for other sciences. It generally boils down to cases like:
> 
> "As Wallström (1945: p. 56) points out... ."
> "This has been written about by a number of people (see for example Plexstein 1956: p. 27, Nielsen 1972: p. 46, Austen 2012: p. 33)."
> 
> Having in-text citations rather than footnotes is often just a matter of style, but it does not work the same way if several works need to be referenced in the same citation, or if the author's name needs to be part of the sentence and not inside the parentheses.
> 
> Biblatex and some other systems can handle things like this, and I would think that making sure it also works in what we come up with should not be too difficult. But... let's try it out. Maybe I'm wrong and we need hundreds of special tags, and then we can forget about it.
> 
>> On 9 Sep 2017 2:29 pm, "Peter Murray-Rust" <pm286@cam.ac.uk> wrote:
>> This is a hard problem. We need novel imaginative solutions.
>> 
>> Thousands of committed expert people have built systems for document structure. As an example, TEI (mainly digital humanities), which exemplifies the semantics-presentation division, has 500 tags/concepts (https://en.wikipedia.org/wiki/Text_Encoding_Initiative). If we are to take a humanities-based approach then we cannot ignore this history. Similarly JATS (originally biomedical) has ca 250 tags (at least that is my current count in downloading actual JATS). Add in computer science, geoscience, law, theses, grants, and much else and we have 1000 tags.
>> 
>> So what do we want SH-CG for? My requirement is simple to state. There are (my own figures from analysing CrossRef over several months) ca 7000 "articles" published a day from 500 publishers and "most" have "HTML". 
>> 
>> So what I want is to be able to read this into my machine without having to have 200+ formats and 200+ semantics. I don't expect the semantics to cover all nuances of a discipline (I and Henry developed Chemical Markup Language and even that doesn't capture everything chemists want to publish). So I'd settle for something with about 3-4 div-types 
>> 
>> HTML
>>   HEAD - contains metadata - bibliographic, authors, titles, funders, acknowledgements, etc. This is a solved problem
>>   BODY - 
>>     DIV class="abstract/summary" // this seems to be fairly universally required
>>     DIV class="maintext" - the core of the article
>>     DIV class="assets" - figures, tables, schemes
>>   FOOT:
>>      DIV class="references" (or citations)
>>      DIV class="publisher" - all the stuff that a publisher considers important and I want to strip out
>>      DIV class="supplemental" - many things "attached" to the article and published in parallel - data sets
>> 
>> and then we can add finer markup which can be used or ignored as the reader wishes.
>> 
>> 
>>> On Sat, Sep 9, 2017 at 9:40 AM, Johannes Wilm <johanneswilm@gmail.com> wrote:
>>> Yes, so anthropology is somewhere between humanities and social sciences. History is more clearly humanities. Political sciences or sociology should require all the things anthropology requires plus some more and would therefore probably be better picks to represent the social sciences. 
>>> 
>>> Do we have someone in those fields? Then I could concentrate on history. Open access and a license that allows remixing would be preferable.
>>> 
>>>> On 9 Sep 2017 9:46 am, "Silvio Peroni" <silvio.peroni@unibo.it> wrote:
>>>> Hi Sarven, all,
>>>> 
>>>>>> Anyone fancy doing a comparative analysis or even mocking up the same
>>>>>> (ideally rather complex) article in ScholarlyHTML, RASH, and anything
>>>>>> else we'd care to compare/discuss?
>>>>> 
>>>>> Great ideas! We can all pitch in from our respective areas.
>>>> 
>>>> That's a great idea, indeed! However, please, we should not come up with a huge set of examples, only one per discipline. Just to be more precise, I would suggest using the categorisation in https://en.wikipedia.org/wiki/List_of_academic_fields, taking the first-level entities of that taxonomy, i.e.:
>>>> 
>>>> - Humanities
>>>> - Social sciences
>>>> - Natural sciences
>>>> - Formal sciences
>>>> - Professions and applied sciences
>>>> 
>>>> I think it would be enough to have one paper for each of the aforementioned fields to start with. It would be better if the selected papers are Open Access, just to avoid useless discussions with publishers about the right to share an article in another channel when the person sharing it is not its author. I know that subfields of each field may have different needs in terms of article content, but we cannot cover the whole literature at this point, can we?
>>>> 
>>>> I think Sarven and I could cover the “Formal sciences” part. In particular, when selecting a paper in the Computer Science sub-field (since that is where we actually work), I believe we need to pick something that includes mathematical formulas.
>>>> 
>>>> Could someone else help with the other fields?
>>>> 
>>>> Have a nice day :-)
>>>> 
>>>> S.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ----------------------------------------------------------------------------
>>>> Silvio Peroni, Ph.D.
>>>> Department of Computer Science and Engineering
>>>> University of Bologna, Bologna (Italy)
>>>> Tel: +39 051 2095393
>>>> E-mail: silvio.peroni@unibo.it
>>>> Web: https://www.unibo.it/sitoweb/silvio.peroni/en
>>>> Twitter: essepuntato
>>>> 
>> 
>> 
>> 
>> -- 
>> Peter Murray-Rust
>> Reader Emeritus in Molecular Informatics
>> Unilever Centre, Dept. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069
Received on Saturday, 9 September 2017 15:05:46 UTC