Re: How to mark up a document other than a web page? from Herbert Van de Sompel on 2023-01-21 (public-bioschemas@w3.org from January 2023)

From: Herbert Van de Sompel <hvdsomp@gmail.com>
Date: Sat, 21 Jan 2023 18:27:11 +0100
To: Franck Michel <fmichel@i3s.unice.fr>
Cc: Carole Goble <carole.goble@manchester.ac.uk>, "LJ.Garcia" <lj.garcia.co@gmail.com>, Stian Soiland-Reyes <soiland-reyes@manchester.ac.uk>, Yvan Le Bras <yvan.le-bras@mnhn.fr>, Dan Bolser <dan.bolser@gmail.com>, public-bioschemas <public-bioschemas@w3.org>, Fabien Gandon <fabien.gandon@inria.fr>, Pierre-Antoine Champin <pierre-antoine@w3.org>
Message-ID: <CAOywMHeOFkSF1zLnSfy4EbfBCYTLJcoR7RQ3qZQ_Fnf+ukXUjg@mail.gmail.com>
hi Franck,

I forgot to mention that one can also use a "profile" attribute (with a URI
as value) on a link to convey additional information about the format of
the linked document. That's handy, e.g. when pointing to JSON(-LD) because
JSON is all over the place. The "profile" attribute lets you be more
expressive than the MIME type alone: it indicates which kind of JSON(-LD)
format is being used.
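
To illustrate, here is a minimal Python sketch, not taken from the Signposting
spec. It parses a Link header value and picks the JSON-LD description that
carries a given profile URI. The example.org URLs, the profile URI and the
helper names parse_link_header and select_link are assumptions made for
illustration only, and the small regex parser handles simple headers like
this one rather than every form the Link header syntax allows.

```python
import re

# One link: <target> followed by zero or more ";name=value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w*-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)')
PARAM_RE = re.compile(r';\s*([\w*-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value):
    """Return a list of (target, params) pairs found in a Link header value."""
    links = []
    for target, raw_params in LINK_RE.findall(value):
        params = {}
        for name, quoted, bare in PARAM_RE.findall(raw_params):
            params[name.lower()] = quoted or bare
        links.append((target, params))
    return links


def select_link(links, rel, media_type, profile=None):
    """Return the first target matching rel, type and (optionally) profile."""
    for target, params in links:
        if rel not in params.get("rel", "").lower().split():
            continue
        if params.get("type") != media_type:
            continue
        if profile is not None and profile not in params.get("profile", "").split():
            continue
        return target
    return None


# Hypothetical header: two descriptions of the same resource.
header = (
    '<https://example.org/meta/item1.jsonld>; rel="describedby"; '
    'type="application/ld+json"; profile="https://schema.org/", '
    '<https://example.org/meta/item1.bib>; rel="describedby"; '
    'type="application/x-bibtex"'
)

links = parse_link_header(header)
chosen = select_link(links, "describedby", "application/ld+json",
                     profile="https://schema.org/")
print(chosen)
```

A client that only understands one flavour of JSON-LD can thus skip links
whose profile it does not recognise, even though the MIME type is the same.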

Greetings

Herbert

On Sat, Jan 21, 2023 at 6:22 PM Herbert Van de Sompel <hvdsomp@gmail.com>
wrote:

> hi Franck,
>
> Yes, what you describe totally aligns with the Signposting approach. In
> the spec, you will see that it also talks about using HTTP Link headers
> and/or Link Sets because these make it possible to convey information (e.g.
> metadata) for resources that are not HTML and hence cannot have embedded metadata.
>
> The "describedby" relationship, as all other relationships that are used
> in Signposting, is listed in the IANA Link Relations registry at
> https://www.iana.org/assignments/link-relations/link-relations.xhtml .
> All these relationships have been defined in formal
> specifications/standards.
>
> Typed links, like the ones used by Signposting, are commonly used in
> RESTful interfaces. They're really mainstream in that way but don't seem to
> be well known in the scholarly communication landscape. With Signposting we
> want to try and change that by promoting this very low barrier
> interoperability approach.
>
> Greetings
>
> Herbert
>
> On Fri, Jan 20, 2023 at 5:08 PM Franck Michel <fmichel@i3s.unice.fr>
> wrote:
>
>> Dear all,
>>
>> Thx Leyla for pointing to Signposting, and thx Herbert for the links. I
>> did not know about this project but this is indeed very similar to what I
>> propose, yet with a different scope.
>>
>> Signposting suggests using the HTTP Link header to link scholarly
>> resources on the web with metadata about them, in various formats, like
>> authoring information, bibliographic information in BibTeX, RIS etc.,
>> together with content negotiation.
>>
>> My proposition can totally be complementary with this. It seeks the
>> generalized use of Schema.org (+ extensions such as Bioschemas of course)
>> to mark up any resource. The HTTP Link header is used to point to the
>> markup data.
>> The "describedby" relation is probably better suited than the "meta" that
>> I used in my example. But anyway, the idea is that you can benefit from the
>> large scope of Schema.org, Bioschemas and other extensions, to describe all
>> your resources that are not webpages.
>> Content negotiation can be used as shown in the examples of Signposting,
>> so if I extend my earlier example, that would give something like this:
>>
>> curl -I -H "Accept: application/ld+json" https://domain.org/myImage.jpg
>>
>> Link:
>> <https://domain.org/myImage.jpg?q=markup&format=application/ld+json>
>>       ; rel="describedby"
>>       ; type="application/ld+json"
>>
>> Note that the query part of the URL "q=markup&format=application/ld+json"
>> can be anything you like, it's just a trick to be used by the URL rewriting
>> module of the web server, to point to the markup data for that specific
>> resource.
>>
>> Regarding Jerven's remark about XMP, indeed like I said in my previous
>> email: "XMP (Extensible Metadata Platform) allows to embed metadata in
>> binary files. That's (...) limited to a few file types and this requires to
>> parse the content of the file itself."
>> By contrast, as explained by the authors of Signposting, using the HTTP
>> Link header lets you query the headers only (with the HTTP method HEAD), so
>> you can get the link to the metadata without even having to download the
>> resource itself, which may be big.
>> Plus, this relies on native HTTP mechanisms only, so that you don't need
>> a specific library to parse the header of pdfs, another one for images and
>> so on.
>>
>> Franck.
>>
>> On 20/01/2023 at 10:42, Herbert Van de Sompel wrote:
>>
>> hi all,
>>
>> Thanks Carole for adding me to the conversation.
>>
>> Yes, indeed, Signposting in general, and the FAIR Signposting Profile
>> specifically, were introduced as a lightweight mechanism to address the
>> issue at hand:
>> * https://signposting.org/
>> * https://signposting.org/FAIR/
>>
>> Since Dataverse was mentioned in the email exchange, I can report that
>> support for the FAIR Signposting Profile was implemented for Dataverse and
>> should come with the next release, see
>> https://github.com/IQSS/dataverse/issues/5962
>>
>> I am happy to answer any questions.
>>
>> Greetings
>>
>> Herbert
>>
>> On Fri, Jan 20, 2023 at 10:20 AM Carole Goble <
>> carole.goble@manchester.ac.uk> wrote:
>>
>>> Looping in Herbert Van de Sompel, worldwide Signposting expert
>>>
>>>
>>>
>>> Carole
>>>
>>>
>>>
>>>
>>>
>>> Professor Carole Goble CBE FREng FBCS CITP
>>>
>>> Department of Computer Science
>>>
>>> The University of Manchester,
>>>
>>> Manchester, M13 9PL, UK
>>>
>>>
>>>
>>> Head of Node ELIXIR-UK <https://elixiruknode.org/>
>>>
>>>
>>>
>>> PLEASE Do not send me a calendar invite and expect me to see it. (i)
>>> Invites only work 50% of the time (ii) if they do work they do not appear
>>> as email so I don’t know they are there until it is too late.
>>>
>>> Want me at a meeting? Email me. Don’t just silently sneak into a diary I
>>> do not use.
>>>
>>>
>>>
>>> *From:* LJ.Garcia <lj.garcia.co@gmail.com>
>>> *Sent:* 19 January 2023 18:41
>>> *To:* Franck Michel <fmichel@i3s.unice.fr>; Stian Soiland-Reyes <
>>> soiland-reyes@manchester.ac.uk>
>>> *Cc:* Yvan Le Bras <yvan.le-bras@mnhn.fr>; Carole Goble <
>>> carole.goble@manchester.ac.uk>; Dan Bolser <dan.bolser@gmail.com>;
>>> public-bioschemas <public-bioschemas@w3.org>; Fabien Gandon <
>>> fabien.gandon@inria.fr>
>>> *Subject:* Re: How to mark up a document other than a web page?
>>>
>>>
>>>
>>> Hi Franck,
>>>
>>>
>>>
>>> What you mention about using the HTTP header reminds me of Signposting (
>>> https://signposting.org/). Have you seen this approach? I still have
>>> to catch up with this subject, so I am adding more people to the loop with
>>> better knowledge of it.
>>>
>>>
>>>
>>> Kind regards,
>>>
>>>
>>>
>>> On Thu, Jan 5, 2023 at 5:48 PM Franck Michel <fmichel@i3s.unice.fr>
>>> wrote:
>>>
>>> Dear all,
>>>
>>> Thank you for your remarks and comments. Actually I feel like the
>>> discussion has already gone way beyond my initial question and proposition.
>>>
>>> My point was to figure out a simple way to provide metadata about any
>>> kind of resource on the web, not only web pages, in the form of Schema.org
>>> markup.
>>>
>>> RO-Crate is definitely a very interesting initiative but it primarily
>>> concerns communities used to dealing with large data repositories like
>>> Zenodo or Dataverse. Besides, it requires encapsulating the produced
>>> objects within a package (archive) that contains all necessary additional
>>> metadata. This is great for enforcing FAIR ROs, but apart from such
>>> specific needs, an image on the web will remain available as a raw jpg or
>>> png file, same thing for a pdf, music, spreadsheet etc. We cannot expect
>>> each web master to encapsulate those objects in RO-Crate packages.
>>>
>>> A way to mark up an object is to create a web page that links to this
>>> object, and add markup on that page. But whenever the object is accessed
>>> directly by its URL, it has no more markup data. As a result, SEO practices
>>> have terrible recommendations like naming image files with a super long
>>> name containing the name of the thing being represented, its description,
>>> the image resolution etc. Ugly, right? XMP (Extensible Metadata Platform)
>>> allows to embed metadata in binary files. That's much better but this is
>>> limited to a few file types and this requires to parse the content of the
>>> file itself.
>>>
>>> So my point is: we can link objects on the web to their metadata with a
>>> mechanism that has been there since HTTP 1.0 (RFC1945
>>> <https://datatracker.ietf.org/doc/html/rfc1945#page-59>, 1996!), that
>>> is almost the beginning of the web: the HTTP Link header. Hence the example
>>> of a web server that returns a pdf document along with this header:
>>>     Link: <document_metadata.json>; rel="meta";
>>> type="application/ld+json"
>>>
>>> Upside: it does not break nor impose anything. HTTP clients that don't
>>> care or understand JSON-LD will just ignore it. Those that can consume
>>> JSON-LD will fetch the metadata and use the Schema.org annotations to do
>>> whatever they want. This way, search engines will know precisely what's in
>>> the object, making tools like Google Image able to index images much more
>>> effectively.
>>> Downside: a second HTTP GET request is needed to retrieve the
>>> JSON-LD metadata. No big deal.
>>>
>>> Does it make sense or is it just totally obvious?
>>>
>>> Franck.
>>>
>>> On 05/01/2023 at 11:55, Yvan Le Bras wrote:
>>>
>>> Hi Franck, Carole, hi everyone,
>>>
>>>
>>>
>>> Let me first wish you all a happy new year !
>>>
>>>
>>>
>>> Sorry if I misunderstood or if I am totally wrong, but it seems important
>>> to me to try to lay out my point of view ;)
>>>
>>>
>>>
>>> Looking at your question Franck, and notably at the answer from Carole, it
>>> seems to me that 1/ schema.org is made to mark up web pages and e-mail
>>> messages 2/ using an intermediate "metadata layer", which can be RDFa or
>>> JSON-LD for example.
>>>
>>>
>>>
>>> Thus, to add schema.org vocabulary to "files", it appears to me that the
>>> best option is to use a metadata standard that describes the data, including
>>> for example the URLs to download the data files, and that can then be exposed
>>> in RDFa or JSON-LD, for example through web pages where the schema.org
>>> vocabulary is used... So, as structured data accessible on the internet.
>>>
>>>
>>>
>>> Thus, we can use RO-Crate or another standardized way to produce RO
>>> metadata using schema.org in JSON-LD on web pages (for example, we do so
>>> in Ecology using the "Ecological Metadata Language" standard, and we can look
>>> at the structured data on the data catalog like here
>>> https://data.pndb.fr/view/urn:uuid:99abf52c-b271-4b66-ae50-c504e492bc4c
>>> where we are using notably the "schemaVersion", "url", "datePublished",
>>> "dateModified", "description", "keywords", "creator", "temporalCoverage",
>>> "subjectOf", "fileFormat", "spatialCoverage", "geo", "latitude",
>>> "longitude", "variableMeasured" schema.org terms)
>>>
>>>
>>>
>>> => Here I give the EML-oriented example because it allows us to have
>>> detailed metadata, notably with "variableMeasured", which
>>> gives our datasets a particularly high FAIRness.
>>>
>>>
>>>
>>> Please, don't hesitate to comment !
>>>
>>>
>>>
>>> Wishing you a very good end of week,
>>>
>>>
>>>
>>> Best,
>>>
>>>
>>>
>>> Yvan
>>>
>>>
>>> ------------------------------
>>>
>>> *From: *"Carole Goble" <carole.goble@manchester.ac.uk>
>>> *To: *"Dan Bolser" <dan.bolser@gmail.com>,
>>> "Franck Michel" <fmichel@i3s.unice.fr>
>>> *Cc: *"public-bioschemas" <public-bioschemas@w3.org>,
>>> "Fabien Gandon" <fabien.gandon@inria.fr>
>>> *Sent: *Thursday 5 January 2023 11:09:11
>>> *Subject: *RE: How to mark up a document other than a web page?
>>>
>>>
>>>
>>> https://zenodo.org/record/7147703#.Y7agoxXP2F4   is a longer talk that
>>> sets up the RO-Crate vision
>>>
>>>
>>>
>>> Carole
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Carole Goble <carole.goble@manchester.ac.uk>
>>> *Sent:* 05 January 2023 09:54
>>> *To:* Dan Bolser <dan.bolser@gmail.com>; Franck
>>> Michel <fmichel@i3s.unice.fr>
>>> *Cc:* public-bioschemas@w3.org; Fabien Gandon <fabien.gandon@inria.fr>;
>>> Carole Goble <carole.goble@manchester.ac.uk>
>>> *Subject:* RE: How to mark up a document other than a web page?
>>>
>>>
>>>
>>> I have forwarded this thread to RO-Crate folks to pitch in
>>>
>>>
>>>
>>> RO-Crate https://www.researchobject.org/ro-crate/  packages files and
>>> annotates them with rich metadata (using Bagit). It uses JSON-LD and
>>> schema.org. It’s an example of using schema.org for multiple files not
>>> web pages.
>>>
>>>
>>>
>>> RO-Crate has gained a lot of traction in organisations needing to
>>> exchange digital objects with structured machine readable metadata, and is
>>> designed to be repository neutral – that is, enable inter-repo exchange.
>>> Zenodo and Dataverse have work ongoing to build compliance.
>>>
>>> https://zenodo.org/record/7376356#.Y7adghXP2F4 is a talk about the
>>> repository overlay aspect of RO-Crate
>>>
>>>
>>>
>>> Carole
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Dan Bolser <dan.bolser@gmail.com>
>>> *Sent:* 05 January 2023 09:39
>>> *To:* Franck Michel <fmichel@i3s.unice.fr>
>>> *Cc:* public-bioschemas@w3.org; Fabien Gandon <fabien.gandon@inria.fr>
>>> *Subject:* Re: How to mark up a document other than a web page?
>>>
>>>
>>>
>>> https://www.tomforth.co.uk/scienceandpdfs/
>>>
>>>
>>>
>>> Looks useful
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jan 4, 2023, 5:50 PM Franck Michel <fmichel@i3s.unice.fr> wrote:
>>>
>>> Dear community,
>>>
>>> First of all, let me wish you all a happy, richly marked up new year ;).
>>>
>>> Schema.org is meant to mark up resources of any kind on the internet,
>>> not just web pages. While presenting Bioschemas, I once had this question:
>>> how do I mark up a pdf file? More generally, how to mark up any resource
>>> other than an html or xml-based content, like pdf, image, csv, Excel sheet,
>>> zip archive etc. ?
>>>
>>> I recently asked this during a BSC meeting but it seemed that nobody had
>>> really faced this use case yet. And I did a quick Google search but nothing
>>> came up. So I'd be interested in having your thoughts on this.
>>>
>>> A basic solution would be to insert markup in the web page that provides
>>> the download link. Not so satisfying since, when an application downloads
>>> the file using its direct URL, there is no more markup.
>>>
>>> I could think of a simple solution that uses the HTTP Link header to
>>> point to a file containing the markup data (similarly to what's been done
>>> in JSON-LD <https://www.w3.org/TR/json-ld/#interpreting-json-as-json-ld>
>>> or CSVW <https://www.w3.org/TR/tabular-data-model/#link-header>). The
>>> exchange would look like this:
>>>
>>> GET /document.pdf HTTP/1.1
>>> Host: example.com
>>>
>>> ====================================
>>>
>>> HTTP/1.1 200 OK
>>> Content-Type: application/pdf
>>> Link: <document_metadata.json>; rel="meta"; type="application/ld+json"
>>> ...
>>>
>>> Where document_metadata.json is a JSON-LD description of the file and
>>> its topic (written with Schema.org and Bioschemas of course). I'm not sure
>>> whether rel="meta" is the best choice here, but that's just an example.
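
For illustration only, the content of such a document_metadata.json file
could be produced along these lines. This is a minimal Python sketch: the
build_metadata helper, the example title and the choice of the Schema.org
DigitalDocument type are assumptions, not something prescribed in this thread.

```python
import json


def build_metadata(url, name, encoding_format):
    """Build a small Schema.org description of a non-HTML resource."""
    return {
        "@context": "https://schema.org/",
        "@type": "DigitalDocument",  # assumed type; pick what fits the resource
        "@id": url,
        "url": url,
        "name": name,
        "encodingFormat": encoding_format,
    }


# Hypothetical values matching the exchange shown above.
document = build_metadata(
    url="https://example.com/document.pdf",
    name="Example report",
    encoding_format="application/pdf",
)

# This string is what the server would return for document_metadata.json.
serialized = json.dumps(document, indent=2)
print(serialized)
```

Bioschemas types and properties would be added to the same dictionary in the
same way, alongside the plain Schema.org ones.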
>>>
>>> Note that some metadata may already be embedded in pdf and image files
>>> by means of XMP
>>> <https://en.wikipedia.org/wiki/Extensible_Metadata_Platform>, where
>>> Schema.org types and properties could be used. But this does not work with
>>> any type of file, plus applications may want to use only HTTP-based
>>> mechanisms to get the markup data, rather than have to read the content of
>>> binary files.
>>>
>>> Have you seen this kind of use case and usage somewhere? Any other
>>> solution you could think of? Do search engines expect this kind of linking
>>> to external markup files?
>>>
>>> Thx in advance. Regards,
>>>    Franck.
>>>
>>> --
>>>
>>> Franck MICHEL, CNRS research engineer
>>>
>>> Université Côte d’Azur, CNRS, Inria
>>>
>>> I3S laboratory (UMR 7271)
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> --
>>>
>>> Yvan Le Bras, PhD   @Yvan2935
>>> <°))))><
>>> Scientific and technical lead, "Pole National de Données de
>>> Biodiversité"   https://www.pndb.fr/
>>> Bureau 34, Station marine de Concarneau BP 225, 29182 Concarneau CEDEX ---
>>> MNHN Unité de service PatriNat Paris
>>> tel.:  +33 (0) 2 98 50 99 35 / +33 (0) 6.10.43.96.51
>>> yvan.le-bras@mnhn.fr
>>>
>>>
>>>
>>>
>>
>> --
>> ==================
>> Herbert Van de Sompel
>> https://hvdsomp.info
>> https://orcid.org/0000-0002-0715-6126
>>
>>
>>
>
> --
> ==================
> Herbert Van de Sompel
> https://hvdsomp.info
> https://orcid.org/0000-0002-0715-6126
>


-- 
==================
Herbert Van de Sompel
https://hvdsomp.info
https://orcid.org/0000-0002-0715-6126
Received on Saturday, 21 January 2023 17:27:36 UTC