Re: How to mark up a document other than a web page? from jerven Bolleman on 2023-01-19 (public-bioschemas@w3.org from January 2023)

From: jerven Bolleman <jerven.bolleman@sib.swiss>
Date: Thu, 19 Jan 2023 20:14:07 +0100
To: "LJ.Garcia" <lj.garcia.co@gmail.com>, Franck Michel <fmichel@i3s.unice.fr>, Stian Soiland-Reyes <soiland-reyes@manchester.ac.uk>
Cc: Yvan Le Bras <yvan.le-bras@mnhn.fr>, Carole Goble <carole.goble@manchester.ac.uk>, Dan Bolser <dan.bolser@gmail.com>, public-bioschemas <public-bioschemas@w3.org>, Fabien Gandon <fabien.gandon@inria.fr>
Message-ID: <5d1d38b9-3b30-5c71-c14b-9182804e1a0f@sib.swiss>
Hi Franck,

If the document is a PDF, it can have RDF embedded in it (with some
limits, I believe); this is called XMP.
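
To make this concrete, here is a minimal Python sketch of what an XMP
packet carrying a few Schema.org properties could look like. The
function name build_xmp_packet and the choice of https://schema.org/ as
namespace are assumptions for illustration; the sketch only builds the
packet text and assumes the property names are valid XML names. Writing
the packet into the PDF's metadata stream needs a PDF library and is
not shown.

```python
from xml.sax.saxutils import escape

# Assumption: the https form of the Schema.org namespace is used.
SCHEMA_NS = "https://schema.org/"


def build_xmp_packet(properties: dict[str, str]) -> str:
    """Return an XMP packet (RDF/XML in an xpacket wrapper) as text.

    `properties` maps Schema.org property names to plain-text values.
    """
    fields = "\n".join(
        f"      <schema:{name}>{escape(value)}</schema:{name}>"
        for name, value in properties.items()
    )
    return (
        # Standard xpacket header; the id value is fixed by the XMP spec.
        '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
        '<x:xmpmeta xmlns:x="adobe:ns:meta/">\n'
        '  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n'
        # rdf:about="" means "the document that contains this packet".
        f'    <rdf:Description rdf:about="" xmlns:schema="{SCHEMA_NS}">\n'
        f"{fields}\n"
        "    </rdf:Description>\n"
        "  </rdf:RDF>\n"
        "</x:xmpmeta>\n"
        '<?xpacket end="w"?>'
    )


packet = build_xmp_packet({"name": "Cells & tissues", "keywords": "biology"})
print(packet)
```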

For the OpenDocument family of file formats, I know the metadata is also
based on RDF, so you can just add schema.org markup in RDF/XML. For XLSX
as produced by Excel, you might be able to do something in its packaging,
but I don't think that is widely supported.

For SVG one can use RDFa (which we do for www.swissbiopics.org).
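
As a rough illustration of the RDFa route (not necessarily how
swissbiopics.org does it), the Python sketch below adds RDFa 1.1
attributes to an SVG root element and prepends <title> and <desc>
children carrying Schema.org properties. The helper name annotate_svg
and the choice of ImageObject, name and description are assumptions.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def annotate_svg(svg_text: str, name: str, description: str) -> str:
    """Return the SVG with RDFa attributes describing the image itself."""
    # Serialize SVG elements without a namespace prefix.
    ET.register_namespace("", SVG_NS)
    root = ET.fromstring(svg_text)

    # RDFa 1.1: default vocabulary, subject (this document) and its type.
    root.set("vocab", "https://schema.org/")
    root.set("about", "")
    root.set("typeof", "ImageObject")

    # <title> and <desc> double as human-readable text and RDFa literals.
    title = ET.Element(f"{{{SVG_NS}}}title", {"property": "name"})
    title.text = name
    desc = ET.Element(f"{{{SVG_NS}}}desc", {"property": "description"})
    desc.text = description
    root.insert(0, title)
    root.insert(1, desc)

    return ET.tostring(root, encoding="unicode")


svg = '<svg xmlns="http://www.w3.org/2000/svg"><circle r="10"/></svg>'
print(annotate_svg(svg, "Animal cell", "Schematic view of an animal cell"))
```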

For JPEG/PNG the markup will need to be outside of the document.
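
A minimal sketch of that external route, combined with the HTTP Link
header proposed in the quoted messages below: one helper builds a
JSON-LD description of an image, the other picks the JSON-LD target out
of a Link header value. The names image_metadata and find_jsonld_link
are hypothetical, and the parser is deliberately simplified (it ignores
parameters without a value). It matches on the media type rather than
on the rel value, since the thread leaves the choice of rel open.

```python
import json
import re
from urllib.parse import urljoin


def image_metadata(content_url: str, name: str, description: str) -> dict:
    """Build a Schema.org ImageObject description as a JSON-LD dict."""
    return {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": content_url,
        "name": name,
        "description": description,
    }


# One link-value: <target> followed by ;name=value parameters.
_LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,]+))*)'
)
_PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')


def find_jsonld_link(link_header: str, base_url: str):
    """Return the absolute URL of the first JSON-LD link, or None."""
    for target, params in _LINK_RE.findall(link_header):
        attrs = {
            key.lower(): (quoted or unquoted)
            for key, quoted, unquoted in _PARAM_RE.findall(params)
        }
        if attrs.get("type") == "application/ld+json":
            # Relative targets are resolved against the requested URL.
            return urljoin(base_url, target)
    return None


header = '<document_metadata.json>; rel="meta"; type="application/ld+json"'
print(find_jsonld_link(header, "http://example.com/document.pdf"))
print(json.dumps(image_metadata("http://example.com/cell.png",
                                "Animal cell", "Schematic view"), indent=2))
```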

Regards,
Jerven




On 19/01/2023 19:41, LJ.Garcia wrote:
> Hi Franck,
> 
> What you mention about using the HTTP header reminds me of Signposting
> (https://signposting.org/). Have you seen this approach? I still have to
> catch up with this subject, so I am adding more people with better
> knowledge of it to the loop.
> 
> Kind regards,
> 
> On Thu, Jan 5, 2023 at 5:48 PM Franck Michel <fmichel@i3s.unice.fr> wrote:
> 
>     Dear all,
> 
>     Thank you for your remarks and comments. Actually I feel like the
>     discussion has already gone way beyond my initial question and
>     proposition.
> 
>     My point was to figure out a simple way to provide metadata about
>     any kind of resource on the web, not only web pages, in the form of
>     Schema.org markup.
> 
>     RO-Crate is definitely a very interesting initiative but it
>     primarily concerns communities used to dealing with large data
>     repositories like Zenodo or Dataverse. Besides, it requires
>     encapsulating the produced objects within a package (archive) that
>     contains all necessary additional metadata. This is great for
>     enforcing FAIR ROs, but apart from such specific needs, an image on
>     the web will remain available as a raw JPG or PNG file, and the same
>     goes for a PDF, music, a spreadsheet, etc. We cannot expect every
>     webmaster to encapsulate those objects in RO-Crate packages.
> 
>     One way to mark up an object is to create a web page that links to
>     this object, and add markup on that page. But whenever the object is
>     accessed directly by its URL, the markup data is no longer there. As
>     a result, SEO practices include terrible recommendations like naming
>     image files with a super long name containing the name of the thing
>     being represented, its description, the image resolution, etc. Ugly,
>     right? XMP (Extensible Metadata Platform) allows metadata to be
>     embedded in binary files. That's much better, but it is limited to a
>     few file types and requires parsing the content of the file itself.
> 
>     So my point is: we can link objects on the web to their metadata
>     with a mechanism that has been there since HTTP 1.0 (RFC1945
>     <https://datatracker.ietf.org/doc/html/rfc1945#page-59>, 1996!),
>     that is, almost since the beginning of the web: the HTTP Link header.
>     Hence the example of a web server that returns a PDF document along
>     with this header:
>          Link: <document_metadata.json>; rel="meta"; type="application/ld+json"
> 
>     Upside: it neither breaks nor imposes anything. HTTP clients that
>     don't care about or don't understand JSON-LD will just ignore it.
>     Those that can consume JSON-LD will fetch the metadata and use the
>     Schema.org annotations to do whatever they want. This way, search
>     engines will know precisely what's in the object, making tools like
>     Google Images able to index images much more effectively.
>     Downside: a second HTTP GET request is needed to retrieve the
>     JSON-LD metadata. No big deal.
> 
>     Does it make sense or is it just totally obvious?
> 
>     Franck.
> 
>     On 05/01/2023 at 11:55, Yvan Le Bras wrote:
>>     Hi Franck, Carole, hi everyone,
>>
>>     Let me first wish you all a happy new year!
>>
>>     Sorry if I misunderstood or if I am totally wrong, but it seems
>>     important to me to try to share my point of view ;)
>>
>>     Looking at your question, Franck, and notably at Carole's answer,
>>     it seems to me that 1/ schema.org is made to mark up web pages and
>>     e-mail messages, 2/ using an intermediate "metadata layer", which
>>     can be RDFa or JSON-LD, for example.
>>
>>     Thus, to add schema.org vocabulary to "files", it appears to me
>>     that the best approach is to use a metadata standard that describes
>>     the data (including, for example, URLs to download the data files)
>>     and that can then be exposed in RDFa or JSON-LD, for example
>>     through web pages where the schema.org vocabulary is used... So, as
>>     structured data accessible on the internet.
>>
>>     Thus, we can use RO-Crate or another standardized way to produce RO
>>     metadata using schema.org in JSON-LD on web pages. For example, in
>>     ecology we do so using the "Ecological Metadata Language" (EML)
>>     standard, and the structured data can be seen on the data catalog,
>>     as here:
>>     https://data.pndb.fr/view/urn:uuid:99abf52c-b271-4b66-ae50-c504e492bc4c
>>     where we notably use the schema.org terms "schemaVersion", "url",
>>     "datePublished", "dateModified", "description", "keywords",
>>     "creator", "temporalCoverage", "subjectOf", "fileFormat",
>>     "spatialCoverage", "geo", "latitude", "longitude" and
>>     "variableMeasured".
>>
>>     => I give the EML-oriented example here because it allows us to
>>     have detailed metadata, notably with "variableMeasured", which
>>     makes our datasets noticeably more FAIR.
>>
>>     Please don't hesitate to comment!
>>
>>     Wishing you a very good end of week,
>>
>>     Best,
>>
>>     Yvan
>>
>>     ------------------------------------------------------------------------
>>     *From: *"Carole Goble" <carole.goble@manchester.ac.uk>
>>     *To: *"Dan Bolser" <dan.bolser@gmail.com>, "Franck Michel"
>>     <fmichel@i3s.unice.fr>
>>     *Cc: *"public-bioschemas" <public-bioschemas@w3.org>, "Fabien Gandon"
>>     <fabien.gandon@inria.fr>
>>     *Sent: *Thursday, 5 January 2023 11:09:11
>>     *Subject: *RE: How to mark up a document other than a web page?
>>
>>     https://zenodo.org/record/7147703#.Y7agoxXP2F4 is a longer talk
>>     that sets up the RO-Crate vision
>>
>>     Carole
>>
>>     Professor Carole Goble CBE FREng FBCS CITP
>>
>>     Department of Computer Science
>>
>>     The University of Manchester,
>>
>>     Manchester, M13 9PL, UK
>>
>>     Head of Node ELIXIR-UK <https://elixiruknode.org/>
>>
>>     PLEASE Do not send me a calendar invite and expect me to see it.
>>     (i) Invites only work 50% of the time (ii) if they do work they do
>>     not appear as email so I don’t know they are there until it is too
>>     late.
>>
>>     Want me at a meeting? Email me. Don’t just silently sneak into a
>>     diary I do not use.
>>
>>     *From:* Carole Goble <carole.goble@manchester.ac.uk>
>>     *Sent:* 05 January 2023 09:54
>>     *To:* Dan Bolser <dan.bolser@gmail.com>; Franck Michel
>>     <fmichel@i3s.unice.fr>
>>     *Cc:* public-bioschemas@w3.org; Fabien Gandon
>>     <fabien.gandon@inria.fr>; Carole Goble
>>     <carole.goble@manchester.ac.uk>
>>     *Subject:* RE: How to mark up a document other than a web page?
>>
>>     I have forwarded this thread to RO-Crate folks to pitch in
>>
>>     RO-Crate (https://www.researchobject.org/ro-crate/) packages files
>>     and annotates them with rich metadata (using BagIt). It uses JSON-LD
>>     and schema.org. It's an example of using schema.org for multiple
>>     files, not web pages.
>>
>>     RO-Crate has gained a lot of traction in organisations needing to
>>     exchange digital objects with structured machine readable
>>     metadata, and is designed to be repository neutral – that is,
>>     enable inter-repo exchange. Zenodo and Dataverse have work ongoing
>>     to build compliance.
>>
>>     https://zenodo.org/record/7376356#.Y7adghXP2F4 is a talk about
>>     the repository overlay aspect of RO-Crate
>>
>>     Carole
>>
>>     *From:* Dan Bolser <dan.bolser@gmail.com>
>>     *Sent:* 05 January 2023 09:39
>>     *To:* Franck Michel <fmichel@i3s.unice.fr>
>>     *Cc:* public-bioschemas@w3.org; Fabien Gandon
>>     <fabien.gandon@inria.fr>
>>     *Subject:* Re: How to mark up a document other than a web page?
>>
>>     https://www.tomforth.co.uk/scienceandpdfs/
>>
>>     Looks useful
>>
>>     On Wed, Jan 4, 2023, 5:50 PM Franck Michel <fmichel@i3s.unice.fr>
>>     wrote:
>>
>>         Dear community,
>>
>>         First of all, let me wish you all a happy, richly marked up
>>         new year ;).
>>
>>         Schema.org is meant to mark up resources of any kind on the
>>         internet, not just web pages. While presenting Bioschemas, I
>>         once had this question: how do I mark up a PDF file? More
>>         generally, how do I mark up any resource other than HTML or
>>         XML-based content, such as a PDF, image, CSV, Excel sheet, zip
>>         archive, etc.?
>>
>>         I recently asked this during a BSC meeting but it seemed that
>>         nobody had really faced this use case yet. And I did a quick
>>         Google search but nothing came up. So I'd be interested in
>>         having your thoughts on this.
>>
>>         A basic solution would be to insert markup in the web page
>>         that provides the download link. Not so satisfying since, when
>>         an application downloads the file using its direct URL, there
>>         is no more markup.
>>
>>         I could think of a simple solution that uses the HTTP Link
>>         header to point to a file containing the markup data
>>         (similarly to what's been done in JSON-LD
>>         <https://www.w3.org/TR/json-ld/#interpreting-json-as-json-ld>
>>         or CSVW
>>         <https://www.w3.org/TR/tabular-data-model/#link-header>). The
>>         exchange would look like this:
>>
>>         GET /document.pdf HTTP/1.1
>>         Host: example.com
>>
>>         ====================================
>>
>>         HTTP/1.1 200 OK
>>         Content-Type: application/pdf
>>         Link: <document_metadata.json>; rel="meta"; type="application/ld+json"
>>         ...
>>
>>         Where document_metadata.json is a JSON-LD description of the
>>         file and its topic (written with Schema.org and Bioschemas of
>>         course). I'm not sure whether rel="meta" is the best choice
>>         here, but that's just an example.
>>
>>         Note that some metadata may already be embedded in pdf and
>>         image files by means of XMP
>>         <https://en.wikipedia.org/wiki/Extensible_Metadata_Platform>,
>>         where Schema.org types and properties could be used. But this
>>         does not work with any type of file, plus applications may
>>         want to use only HTTP-based mechanisms to get the markup data,
>>         rather than have to read the content of binary files.
>>
>>         Have you seen this kind of use case and usage somewhere? Any
>>         other solution you could think of? Do search engines expect
>>         this kind of linking to external markup files?
>>
>>         Thx in advance. Regards,
>>            Franck.
>>
>>         -- 
>>
>>         Franck MICHEL, CNRS research engineer
>>
>>         Université Côte d’Azur, CNRS, Inria
>>
>>         I3S laboratory (UMR 7271)
>>
>>
>>
>>     -- 
>>     Yvan Le Bras, PhD         @Yvan2935             <°))))><
>>     Scientific and technical lead, "Pole National de Données de
>>     Biodiversité" https://www.pndb.fr/
>>     Bureau 34, Station marine de Concarneau BP 225, 29182 Concarneau
>>     CEDEX --- MNHN Unité de service PatriNat Paris
>>     tel.: +33 (0) 2 98 50 99 35 / +33 (0) 6.10.43.96.51
>>     yvan.le-bras@mnhn.fr
> 

-- 

 *Jerven Tjalling Bolleman*
Principal Software Developer
*SIB | Swiss Institute of Bioinformatics*
1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
t +41 22 379 58 85
Jerven.Bolleman@sib.swiss - www.sib.swiss
Received on Thursday, 19 January 2023 19:14:24 UTC