Re: How to mark up a document other than a web page? from Herbert Van de Sompel on 2023-01-21 (public-bioschemas@w3.org from January 2023)

From: Herbert Van de Sompel <hvdsomp@gmail.com>
Date: Sat, 21 Jan 2023 18:22:21 +0100
To: Franck Michel <fmichel@i3s.unice.fr>
Cc: Carole Goble <carole.goble@manchester.ac.uk>, "LJ.Garcia" <lj.garcia.co@gmail.com>, Stian Soiland-Reyes <soiland-reyes@manchester.ac.uk>, Yvan Le Bras <yvan.le-bras@mnhn.fr>, Dan Bolser <dan.bolser@gmail.com>, public-bioschemas <public-bioschemas@w3.org>, Fabien Gandon <fabien.gandon@inria.fr>, Pierre-Antoine Champin <pierre-antoine@w3.org>
Message-ID: <CAOywMHeqM3rTCnjUkUjxXm5Rbyixm92g3-1r5x_R1=DfRj6-uw@mail.gmail.com>
hi Franck,

Yes, what you describe totally aligns with the Signposting approach. In the
spec, you will see that it also talks about using HTTP Link headers and/or
Link Sets, because these make it possible to convey information (e.g.
metadata) for resources that are not HTML and hence cannot have embedded
metadata.
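
As a rough client-side sketch of the above (in Python, standard library only),
a consumer could pull the "describedby" targets out of a Link header value as
follows. The helper names parse_link_header and describedby_links are
hypothetical, the second link in the sample header (a BibTeX description) is
made up for illustration, and the parser is deliberately simplified rather
than a complete implementation of the Web Linking syntax.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w*-]+(?:\s*=\s*(?:"[^"]*"|[^",;\s]+))?)*)'
)
# One parameter: ;name="quoted value" or ;name=token (the value is optional).
_PARAM_RE = re.compile(r';\s*([\w*-]+)(?:\s*=\s*(?:"([^"]*)"|([^",;\s]+)))?')


def parse_link_header(value):
    """Return a list of (target, params) tuples from a Link header value."""
    links = []
    for target, raw_params in _LINK_RE.findall(value):
        params = {}
        for name, quoted, bare in _PARAM_RE.findall(raw_params):
            # Keep the first occurrence of each parameter name.
            params.setdefault(name.lower(), quoted or bare)
        links.append((target, params))
    return links


def describedby_links(value, media_type=None):
    """Targets whose rel includes "describedby", optionally filtered by type."""
    found = []
    for target, params in parse_link_header(value):
        rels = params.get("rel", "").lower().split()
        if "describedby" in rels and media_type in (None, params.get("type")):
            found.append(target)
    return found


# Sample header: the first link comes from the thread, the second is hypothetical.
header = (
    '<https://domain.org/myImage.jpg?q=markup&format=application/ld+json>'
    '; rel="describedby"; type="application/ld+json", '
    '<https://domain.org/myImage.bib>'
    '; rel="describedby"; type="application/x-bibtex"'
)

print(describedby_links(header, "application/ld+json"))
```

With the sample header, only the JSON-LD target is printed; calling
describedby_links(header) without a media type returns both targets.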

The "describedby" relationship, like all other relationships used in
Signposting, is listed in the IANA Link Relations registry at
https://www.iana.org/assignments/link-relations/link-relations.xhtml . All
these relationships have been defined in formal specifications/standards.

Typed links, like the ones used by Signposting, are commonly used in
RESTful interfaces. They're really mainstream in that way but don't seem to
be well known in the scholarly communication landscape. With Signposting we
want to try and change that by promoting this very low barrier
interoperability approach.
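
To give an idea of how low the barrier is on the server side, here is a
minimal sketch (Python standard library) of a handler that answers HEAD
requests for a non-HTML resource with a typed link to its metadata. The
RESOURCES mapping, the paths, the port and the names link_header,
LinkingHandler and serve are all assumptions made for illustration; a real
deployment would more likely configure this in the web server itself.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping: resource path -> (media type, URL of its JSON-LD metadata).
RESOURCES = {
    "/document.pdf": ("application/pdf", "/document_metadata.json"),
}


def link_header(metadata_url):
    """Build a Link header value that points at a JSON-LD description."""
    return f'<{metadata_url}>; rel="describedby"; type="application/ld+json"'


class LinkingHandler(BaseHTTPRequestHandler):
    """Answers HEAD requests only; enough to show the typed link."""

    def do_HEAD(self):
        entry = RESOURCES.get(self.path)
        if entry is None:
            self.send_error(404)
            return
        media_type, metadata_url = entry
        self.send_response(200)
        self.send_header("Content-Type", media_type)
        self.send_header("Link", link_header(metadata_url))
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def serve(port=8000):
    """Run the sketch locally; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), LinkingHandler).serve_forever()
```

After calling serve(), a HEAD request for /document.pdf on that port should
come back with the Link header and no body, so a client learns where the
metadata lives without downloading the PDF.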

Greetings

Herbert

On Fri, Jan 20, 2023 at 5:08 PM Franck Michel <fmichel@i3s.unice.fr> wrote:

> Dear all,
>
> Thx Leyla for pointing to Signposting, and thx Herbert for the links. I
> did not know about this project but this is indeed very similar to what I
> propose, yet with a different scope.
>
> Signposting suggests using the HTTP Link header to link scholarly
> resources on the web with metadata about them in various formats, such as
> authoring information or bibliographic information in BibTeX, RIS etc.,
> together with content negotiation.
>
> My proposal can be fully complementary to this. It seeks the
> generalized use of Schema.org (+ extensions such as Bioschemas of course)
> to mark up any resource. The HTTP Link header is used to point to the
> markup data.
> The "describedby" relation is probably better suited than the "meta" that
> I used in my example. But anyway, the idea is that you can benefit from the
> large scope of Schema.org, Bioschemas and other extensions to describe all
> your resources that are not web pages.
> Content negotiation can be used as shown in the Signposting examples, so
> if I extend my earlier example, that would give something like this:
>
> curl -I -H "Accept: application/ld+json" https://domain.org/myImage.jpg
>
> Link: <https://domain.org/myImage.jpg?q=markup&format=application/ld+json>
>       ; rel="describedby"
>       ; type="application/ld+json"
>
> Note that the query part of the URL, "q=markup&format=application/ld+json",
> can be anything you like; it is just a trick for the web server's URL
> rewriting module to point to the markup data for that specific
> resource.
>
> Regarding Jerven's remark about XMP, indeed, as I said in my previous
> email: "XMP (Extensible Metadata Platform) allows embedding metadata in
> binary files. That's (...) limited to a few file types and requires
> parsing the content of the file itself."
> By contrast, as explained by the authors of Signposting, using the HTTP
> Link header allows querying the headers only (with the HTTP HEAD method), so
> you can get the metadata without even having to download the resource
> itself, which may be big.
> Plus, this relies on native HTTP mechanisms only, so you don't need one
> specific library to parse the header of PDFs, another one for images and so
> on.
>
> Franck.
>
> On 20/01/2023 at 10:42, Herbert Van de Sompel wrote:
>
> hi all,
>
> Thanks Carole for adding me to the conversation.
>
> Yes, indeed, Signposting in general, and the FAIR Signposting Profile
> specifically, were introduced as a lightweight mechanism to address the
> issue at hand:
> * https://signposting.org/
> * https://signposting.org/FAIR/
>
> Since Dataverse was mentioned in the email exchange, I can report that
> support for the FAIR Signposting Profile was implemented for Dataverse and
> should come with the next release, see
> https://github.com/IQSS/dataverse/issues/5962
>
> I am happy to answer any questions.
>
> Greetings
>
> Herbert
>
> On Fri, Jan 20, 2023 at 10:20 AM Carole Goble <
> carole.goble@manchester.ac.uk> wrote:
>
>> Looping in Herbert Van de Sompel, worldwide Signposting expert
>>
>>
>>
>> Carole
>>
>>
>>
>>
>>
>> Professor Carole Goble CBE FREng FBCS CITP
>>
>> Department of Computer Science
>>
>> The University of Manchester,
>>
>> Manchester, M13 9PL, UK
>>
>>
>>
>> Head of Node ELIXIR-UK <https://elixiruknode.org/>
>>
>>
>>
>> PLEASE Do not send me a calendar invite and expect me to see it. (i)
>> Invites only work 50% of the time (ii) if they do work they do not appear
>> as email so I don’t know they are there until it is too late.
>>
>> Want me at a meeting? Email me. Don’t just silently sneak into a diary I
>> do not use.
>>
>>
>>
>> *From:* LJ.Garcia <lj.garcia.co@gmail.com>
>> *Sent:* 19 January 2023 18:41
>> *To:* Franck Michel <fmichel@i3s.unice.fr>; Stian Soiland-Reyes <
>> soiland-reyes@manchester.ac.uk>
>> *Cc:* Yvan Le Bras <yvan.le-bras@mnhn.fr>; Carole Goble <
>> carole.goble@manchester.ac.uk>; Dan Bolser <dan.bolser@gmail.com>;
>> public-bioschemas <public-bioschemas@w3.org>; Fabien Gandon <
>> fabien.gandon@inria.fr>
>> *Subject:* Re: How to mark up a document other than a web page?
>>
>>
>>
>> Hi Franck,
>>
>>
>>
>> What you mention about using the HTTP header reminds me of Signposting (
>> https://signposting.org/). Have you seen this approach? I still have
>> to catch up on this subject, so I am adding more people with better
>> knowledge of it to the loop.
>>
>>
>>
>> Kind regards,
>>
>>
>>
>> On Thu, Jan 5, 2023 at 5:48 PM Franck Michel <fmichel@i3s.unice.fr>
>> wrote:
>>
>> Dear all,
>>
>> Thank you for your remarks and comments. Actually I feel like the
>> discussion has already gone way beyond my initial question and proposition.
>>
>> My point was to figure out a simple way to provide metadata about any
>> kind of resource on the web, not only web pages, in the form of Schema.org
>> markup.
>>
>> RO-Crate is definitely a very interesting initiative but it primarily
>> concerns communities used to dealing with large data repositories like
>> Zenodo or Dataverse. Besides, it requires encapsulating the produced
>> objects within a package (archive) that contains all necessary additional
>> metadata. This is great for enforcing FAIR ROs, but apart from such
>> specific needs, an image on the web will remain available as a raw jpg or
>> png file, same thing for a pdf, music, spreadsheet etc. We cannot expect
>> each web master to encapsulate those objects in RO-Crate packages.
>>
>> A way to mark up an object is to create a web page that links to this
>> object, and add markup on that page. But whenever the object is accessed
>> directly by its URL, it has no more markup data. As a result, SEO practices
>> have terrible recommendations like naming image files with a super long
>> name containing the name of the thing being represented, its description,
>> the image resolution etc. Ugly, right? XMP (Extensible Metadata Platform)
>> allows embedding metadata in binary files. That's much better but this is
>> limited to a few file types and requires parsing the content of the
>> file itself.
>>
>> So my point is: we can link objects on the web to their metadata with a
>> mechanism that has been there since HTTP 1.0 (RFC1945
>> <https://datatracker.ietf.org/doc/html/rfc1945#page-59>, 1996!), that is
>> almost the beginning of the web: the HTTP Link header. Hence the example of
>> a web server that returns a pdf document along with this header:
>>     Link: <document_metadata.json>; rel="meta"; type="application/ld+json"
>>
>> Upside: it neither breaks nor imposes anything. HTTP clients that don't
>> care about or understand JSON-LD will just ignore it. Those that can consume
>> JSON-LD will fetch the metadata and use the Schema.org annotations to do
>> whatever they want. This way, search engines will know precisely what's in
>> the object, making tools like Google Images able to index images much more
>> effectively.
>> Downside: there has to be a second HTTP GET request to retrieve the JSON-LD
>> metadata. No big deal.
>>
>> Does it make sense or is it just totally obvious?
>>
>> Franck.
>>
>> On 05/01/2023 at 11:55, Yvan Le Bras wrote:
>>
>> Hi Franck, Carole, hi everyone,
>>
>>
>>
>> Let me first wish you all a happy new year !
>>
>>
>>
>> Sorry if I misunderstood or if I am totally wrong, but it seems
>> important to me to try to set out my point of view ;)
>>
>>
>>
>> Looking at your question, Franck, and notably at Carole's answer, it
>> seems to me that 1/ schema.org is made to mark up web pages and e-mail
>> messages 2/ using an intermediate "metadata layer", which can be RDFa or
>> JSON-LD for example.
>>
>>
>>
>> Thus, to add schema.org vocabulary to "files", it appears to me the
>> best is to use a metadata standard that describes the data and, for
>> example, also the URLs to download the data files, and that can then be
>> exposed in RDFa or JSON-LD, for example through web pages where the
>> schema.org vocabulary is used... So, as structured data accessible on the
>> internet.
>>
>>
>>
>> Thus, we can use RO-Crate or another standardized way to produce RO
>> metadata using schema.org on JSON-LD web pages (for example, we do so in
>> ecology using the "Ecological Metadata Language" standard, and we can look
>> at the structured data in the data catalog, like here
>> https://data.pndb.fr/view/urn:uuid:99abf52c-b271-4b66-ae50-c504e492bc4c
>> where we notably use the "schemaVersion", "url", "datePublished",
>> "dateModified", "description", "keywords", "creator", "temporalCoverage",
>> "subjectOf", "fileFormat", "spatialCoverage", "geo", "latitude",
>> "longitude", "variableMeasured" schema.org terms)
>>
>>
>>
>> => Here I give the EML-oriented example because it allows us to have
>> detailed metadata, notably with "variableMeasured", which gives our
>> datasets particularly high FAIRness.
>>
>>
>>
>> Please, don't hesitate to comment !
>>
>>
>>
>> Wishing you a very good end of week,
>>
>>
>>
>> Best,
>>
>>
>>
>> Yvan
>>
>>
>> ------------------------------
>>
>> *From: *"Carole Goble" <carole.goble@manchester.ac.uk>
>> *To: *"Dan Bolser" <dan.bolser@gmail.com>, "Franck
>> Michel" <fmichel@i3s.unice.fr>
>> *Cc: *"public-bioschemas" <public-bioschemas@w3.org>, "Fabien Gandon"
>> <fabien.gandon@inria.fr>
>> *Sent: *Thursday 5 January 2023 11:09:11
>> *Subject: *RE: How to mark up a document other than a web page?
>>
>>
>>
>> https://zenodo.org/record/7147703#.Y7agoxXP2F4   is a longer talk that
>> sets up the RO-Crate vision
>>
>>
>>
>> Carole
>>
>>
>> *From:* Carole Goble <carole.goble@manchester.ac.uk>
>> *Sent:* 05 January 2023 09:54
>> *To:* Dan Bolser <dan.bolser@gmail.com>; Franck
>> Michel <fmichel@i3s.unice.fr>
>> *Cc:* public-bioschemas@w3.org; Fabien Gandon <fabien.gandon@inria.fr>;
>> Carole Goble <carole.goble@manchester.ac.uk>
>> *Subject:* RE: How to mark up a document other than a web page?
>>
>>
>>
>> I have forwarded this thread to RO-Crate folks to pitch in
>>
>>
>>
>> RO-Crate https://www.researchobject.org/ro-crate/ packages files and
>> annotates them with rich metadata (using BagIt). It uses JSON-LD and
>> schema.org. It’s an example of using schema.org for multiple files, not
>> web pages.
>>
>>
>>
>> RO-Crate has gained a lot of traction in organisations needing to
>> exchange digital objects with structured machine readable metadata, and is
>> designed to be repository neutral – that is, enable inter-repo exchange.
>> Zenodo and Dataverse have work ongoing to build compliance.
>>
>> https://zenodo.org/record/7376356#.Y7adghXP2F4 is a talk about the
>> repository overlay aspect of RO-Crate
>>
>>
>>
>> Carole
>>
>>
>> *From:* Dan Bolser <dan.bolser@gmail.com>
>> *Sent:* 05 January 2023 09:39
>> *To:* Franck Michel <fmichel@i3s.unice.fr>
>> *Cc:* public-bioschemas@w3.org; Fabien Gandon <fabien.gandon@inria.fr>
>> *Subject:* Re: How to mark up a document other than a web page?
>>
>>
>>
>> https://www.tomforth.co.uk/scienceandpdfs/
>>
>>
>>
>> Looks useful
>>
>>
>>
>>
>>
>> On Wed, Jan 4, 2023, 5:50 PM Franck Michel <fmichel@i3s.unice.fr> wrote:
>>
>> Dear community,
>>
>> First of all, let me wish you all a happy, richly marked up new year ;).
>>
>> Schema.org is meant to mark up resources of any kind on the internet,
>> not just web pages. While presenting Bioschemas, I once had this question:
>> how do I mark up a PDF file? More generally, how do you mark up any resource
>> other than HTML or XML-based content, like a PDF, image, CSV, Excel sheet,
>> zip archive etc.?
>>
>> I recently asked this during a BSC meeting but it seemed that nobody had
>> really faced this use case yet. And I did a quick Google search but nothing
>> came up. So I'd be interested in having your thoughts on this.
>>
>> A basic solution would be to insert markup in the web page that provides
>> the download link. Not so satisfying since, when an application downloads
>> the file using its direct URL, there is no more markup.
>>
>> I could think of a simple solution that uses the HTTP Link header to
>> point to a file containing the markup data (similarly to what's been done
>> in JSON-LD <https://www.w3.org/TR/json-ld/#interpreting-json-as-json-ld>
>> or CSVW <https://www.w3.org/TR/tabular-data-model/#link-header>). The
>> exchange would look like this:
>>
>> GET /document.pdf HTTP/1.1
>> Host: example.com
>>
>> ====================================
>>
>> HTTP/1.1 200 OK
>> Content-Type: application/pdf
>> Link: <document_metadata.json>; rel="meta"; type="application/ld+json"
>> ...
>>
>> Where document_metadata.json is a JSON-LD description of the file and its
>> topic (written with Schema.org and Bioschemas of course). I'm not sure
>> whether rel="meta" is the best choice here, but that's just an example.
>>
>> Note that some metadata may already be embedded in pdf and image files by
>> means of XMP <https://en.wikipedia.org/wiki/Extensible_Metadata_Platform>,
>> where Schema.org types and properties could be used. But this does not work
>> with every type of file, plus applications may want to use only HTTP-based
>> mechanisms to get the markup data, rather than having to read the content of
>> binary files.
>>
>> Have you seen this kind of use case and usage somewhere? Any other
>> solution you could think of? Do search engines expect this kind of linking
>> to external markup files?
>>
>> Thx in advance. Regards,
>>    Franck.
>>
>> --
>>
>> Franck MICHEL, CNRS research engineer
>>
>> Université Côte d’Azur, CNRS, Inria
>>
>> I3S laboratory (UMR 7271)
>>
>>
>>
>>
>>
>> --
>>
>> --
>>
>> Yvan Le Bras, PhD    @Yvan2935
>>
>> <°))))><
>>
>> Scientific and technical lead, "Pole National de Données de
>> Biodiversité"   https://www.pndb.fr/
>>
>> Bureau 34, Station marine de Concarneau BP 225, 29182 Concarneau CEDEX ---
>> MNHN Unité de service PatriNat Paris
>>
>> tel.: +33 (0) 2 98 50 99 35 / +33 (0) 6.10.43.96.51
>>
>> yvan.le-bras@mnhn.fr
>>
>>
>>
>>
>
> --
> ==================
> Herbert Van de Sompel
> https://hvdsomp.info
> https://orcid.org/0000-0002-0715-6126
>
>
>

-- 
==================
Herbert Van de Sompel
https://hvdsomp.info
https://orcid.org/0000-0002-0715-6126
Received on Saturday, 21 January 2023 17:22:47 UTC