Re: How to mark up a document other than a web page?

 Hi Franck,

What you mention about using the HTTP header reminds me of Signposting (
https://signposting.org/). Have you seen this approach? I still have to
catch up on this subject, so I am adding more people with better knowledge
of it to the loop.
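
To make the comparison concrete, here is a rough sketch of how a Signposting-style Link header value could be assembled. It assumes the "describedby" and "cite-as" link relations that Signposting relies on; the helper name format_link_header, the example.org URL and the DOI are hypothetical placeholders, not something taken from this thread.

```python
def format_link_header(links):
    """Serialize (target, params) pairs as one HTTP Link header value."""
    parts = []
    for target, params in links:
        # Each parameter is written as ; name="value" after the <target>.
        attrs = "".join(f'; {name}="{value}"' for name, value in params.items())
        parts.append(f"<{target}>{attrs}")
    return ", ".join(parts)


# Hypothetical targets, for illustration only.
header = format_link_header([
    ("https://example.org/document_metadata.json",
     {"rel": "describedby", "type": "application/ld+json"}),
    ("https://doi.org/10.1234/hypothetical", {"rel": "cite-as"}),
])
print(header)
```

Several links travel in a single header, separated by commas, so one response can point to the metadata and to a persistent identifier at the same time.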

Kind regards,

On Thu, Jan 5, 2023 at 5:48 PM Franck Michel <fmichel@i3s.unice.fr> wrote:

> Dear all,
>
> Thank you for your remarks and comments. Actually I feel like the
> discussion has already gone way beyond my initial question and proposition.
>
> My point was to figure out a simple way to provide metadata about any kind
> of resource on the web, not only web pages, in the form of Schema.org
> markup.
>
> RO-Crate is definitely a very interesting initiative but it primarily
> concerns communities used to dealing with large data repositories like
> Zenodo or Dataverse. Besides, it requires encapsulating the produced
> objects within a package (archive) that contains all necessary additional
> metadata. This is great for enforcing FAIR ROs, but apart from such
> specific needs, an image on the web will remain available as a raw jpg or
> png file, same thing for a pdf, music, spreadsheet etc. We cannot expect
> each web master to encapsulate those objects in RO-Crate packages.
>
> A way to mark up an object is to create a web page that links to this
> object, and add markup on that page. But whenever the object is accessed
> directly by its URL, it has no more markup data. As a result, SEO practices
> have terrible recommendations like naming image files with a super long
> name containing the name of the thing being represented, its description,
> the image resolution etc. Ugly, right? XMP (Extensible Metadata Platform)
> allows embedding metadata in binary files. That's much better, but it is
> limited to a few file types and it requires parsing the content of the
> file itself.
>
> So my point is: we can link objects on the web to their metadata with a
> mechanism that has been there since HTTP 1.0 (RFC1945
> <https://datatracker.ietf.org/doc/html/rfc1945#page-59>, 1996!), that is
> almost the beginning of the web: the HTTP Link header. Hence the example of
> a web server that returns a pdf document along with this header:
>     Link: <document_metadata.json>; rel="meta"; type="application/ld+json"
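
A minimal server-side sketch of this idea, using only the Python standard library. It assumes a purely hypothetical convention where the metadata for "document.pdf" sits next to it as "document_metadata.json"; the names metadata_link_for, MetadataLinkHandler and serve are illustrative, and rel="meta" is simply taken from the example above.

```python
import os
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import quote


def metadata_link_for(file_path):
    """Return a Link header value if '<stem>_metadata.json' sits next to file_path."""
    stem, _ext = os.path.splitext(os.path.basename(file_path))
    sidecar = stem + "_metadata.json"  # hypothetical naming convention
    if os.path.isfile(os.path.join(os.path.dirname(file_path), sidecar)):
        # The relative target is resolved by clients against the request URL.
        return f'<{quote(sidecar)}>; rel="meta"; type="application/ld+json"'
    return None


class MetadataLinkHandler(SimpleHTTPRequestHandler):
    """Static file handler that advertises sidecar metadata in a Link header."""

    def end_headers(self):
        link = metadata_link_for(self.translate_path(self.path))
        if link:
            self.send_header("Link", link)
        super().end_headers()


def serve(port=8000):
    """Serve the current directory on localhost until interrupted."""
    ThreadingHTTPServer(("127.0.0.1", port), MetadataLinkHandler).serve_forever()
```

Calling serve() from a directory that holds both files would return the pdf unchanged, with the extra header attached; files without a sidecar are served exactly as before.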
>
> Upside: it does not break or impose anything. HTTP clients that don't
> care about or don't understand JSON-LD will just ignore it. Those that can
> consume JSON-LD will fetch the metadata and use the Schema.org annotations
> to do whatever they want. This way, search engines will know precisely
> what's in the object, making tools like Google Images able to index images
> much more effectively.
> Downside: there has to be a second HTTP GET request to retrieve the JSON-LD
> metadata. No big deal.
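
On the client side, the extra step could look roughly like the sketch below: read the Link header of the first response, keep the links that announce JSON-LD, and resolve them against the URL of the resource before issuing the second GET. The parser is deliberately simplified (it does not handle every corner of the Link header syntax), and the function names are hypothetical.

```python
import re
from urllib.parse import urljoin

# Simplified patterns: a <target> followed by its ;name=value parameters.
LINK_VALUE = re.compile(r'<([^>]*)>([^<]*)')
PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value):
    """Parse a Link header value into a list of (target, params) pairs."""
    links = []
    for target, raw_params in LINK_VALUE.findall(value):
        params = {}
        for name, quoted, bare in PARAM.findall(raw_params):
            # Keep the first occurrence of each parameter.
            params.setdefault(name.lower(), quoted or bare)
        links.append((target, params))
    return links


def find_jsonld_metadata(link_header, resource_url, rel="meta"):
    """Return absolute URLs of the JSON-LD documents linked with the given rel."""
    found = []
    for target, params in parse_link_header(link_header):
        rels = params.get("rel", "").lower().split()
        if rel in rels and params.get("type") == "application/ld+json":
            found.append(urljoin(resource_url, target))
    return found


header = '<document_metadata.json>; rel="meta"; type="application/ld+json"'
print(find_jsonld_metadata(header, "http://example.com/docs/document.pdf"))
# prints ['http://example.com/docs/document_metadata.json']
```

A client that does not know about the convention never calls this code and simply ignores the header, which is the "does not break anything" property described above.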
>
> Does it make sense or is it just totally obvious?
>
> Franck.
>
> On 05/01/2023 at 11:55, Yvan Le Bras wrote:
>
> Hi Franck, Carole, hi everyone,
>
> Let me first wish you all a happy new year!
>
> Sorry if I misunderstood or if I am totally wrong, but it seems important
> to me to try to explain my point of view ;)
>
> Looking at your question, Franck, and notably at Carole's answer, it
> seems to me that 1/ Schema.org is made to mark up web pages and e-mail
> messages, 2/ using an intermediate "metadata layer", which can be RDFa or
> JSON-LD for example.
>
> Thus, to add Schema.org vocabulary to "files", it appears to me that the
> best approach is to use a metadata standard that describes the data,
> including for example the URLs to download the data files, and that can
> then be exposed in RDFa or JSON-LD through web pages where the Schema.org
> vocabulary is used... that is, as structured data accessible on the
> internet.
>
> Thus, we can use RO-Crate or another standardized way to produce RO
> metadata using Schema.org in JSON-LD on web pages (for example, we do so in
> ecology using the "Ecological Metadata Language" standard, and the
> structured data can be seen on the data catalog, as here:
> https://data.pndb.fr/view/urn:uuid:99abf52c-b271-4b66-ae50-c504e492bc4c
> where we notably use the "schemaVersion", "url", "datePublished",
> "dateModified", "description", "keywords", "creator", "temporalCoverage",
> "subjectOf", "fileFormat", "spatialCoverage", "geo", "latitude",
> "longitude" and "variableMeasured" Schema.org terms).
>
> => Here I give the EML-oriented example because it allows us to have
> detailed metadata, notably with "variableMeasured", which gives our
> datasets a notably higher FAIRness.
>
> Please don't hesitate to comment!
>
> Wishing you a very good end of week,
>
> Best,
>
> Yvan
>
> ------------------------------
> *From: *"Carole Goble" <carole.goble@manchester.ac.uk>
> *To: *"Dan Bolser" <dan.bolser@gmail.com>, "Franck Michel"
> <fmichel@i3s.unice.fr>
> *Cc: *"public-bioschemas" <public-bioschemas@w3.org>, "Fabien Gandon"
> <fabien.gandon@inria.fr>
> *Sent: *Thursday 5 January 2023 11:09:11
> *Subject: *RE: How to mark up a document other than a web page?
>
> https://zenodo.org/record/7147703#.Y7agoxXP2F4   is a longer talk that
> sets up the RO-Crate vision
>
>
>
> Carole
>
>
>
>
>
> Professor Carole Goble CBE FREng FBCS CITP
>
> Department of Computer Science
>
> The University of Manchester,
>
> Manchester, M13 9PL, UK
>
>
>
> Head of Node ELIXIR-UK <https://elixiruknode.org/>
>
>
>
> PLEASE Do not send me a calendar invite and expect me to see it. (i)
> Invites only work 50% of the time (ii) if they do work they do not appear
> as email so I don’t know they are there until it is too late.
>
> Want me at a meeting? Email me. Don’t just silently sneak into a diary I
> do not use.
>
>
>
> *From:* Carole Goble <carole.goble@manchester.ac.uk>
> *Sent:* 05 January 2023 09:54
> *To:* Dan Bolser <dan.bolser@gmail.com>; Franck Michel
> <fmichel@i3s.unice.fr>
> *Cc:* public-bioschemas@w3.org; Fabien Gandon <fabien.gandon@inria.fr>;
> Carole Goble <carole.goble@manchester.ac.uk>
> *Subject:* RE: How to mark up a document other than a web page?
>
>
>
> I have forwarded this thread to RO-Crate folks to pitch in
>
>
>
> RO-Crate https://www.researchobject.org/ro-crate/ packages files and
> annotates them with rich metadata (using BagIt). It uses JSON-LD and
> schema.org. It’s an example of using schema.org for multiple files, not
> web pages.
>
>
>
> RO-Crate has gained a lot of traction in organisations needing to exchange
> digital objects with structured machine readable metadata, and is designed
> to be repository neutral – that is, enable inter-repo exchange. Zenodo and
> Dataverse have work ongoing to build compliance.
>
> https://zenodo.org/record/7376356#.Y7adghXP2F4 is a talk about the
> repository overlay aspect of RO-Crate
>
>
>
> Carole
>
>
> *From:* Dan Bolser <dan.bolser@gmail.com>
> *Sent:* 05 January 2023 09:39
> *To:* Franck Michel <fmichel@i3s.unice.fr>
> *Cc:* public-bioschemas@w3.org; Fabien Gandon <fabien.gandon@inria.fr>
> *Subject:* Re: How to mark up a document other than a web page?
>
>
>
> https://www.tomforth.co.uk/scienceandpdfs/
>
>
>
> Looks useful
>
>
>
>
>
> On Wed, Jan 4, 2023, 5:50 PM Franck Michel <fmichel@i3s.unice.fr> wrote:
>
> Dear community,
>
> First of all, let me wish you all a happy, richly marked up new year ;).
>
> Schema.org is meant to mark up resources of any kind on the internet, not
> just web pages. While presenting Bioschemas, I once had this question: how
> do I mark up a pdf file? More generally, how do we mark up any resource
> other than html or xml-based content, like a pdf, image, csv, Excel sheet,
> zip archive etc.?
>
> I recently asked this during a BSC meeting but it seemed that nobody had
> really faced this use case yet. And I did a quick Google search but nothing
> came up. So I'd be interested in having your thoughts on this.
>
> A basic solution would be to insert markup in the web page that provides
> the download link. Not so satisfying since, when an application downloads
> the file using its direct URL, there is no more markup.
>
> I could think of a simple solution that uses the HTTP Link header to point
> to a file containing the markup data (similarly to what's been done in
> JSON-LD <https://www.w3.org/TR/json-ld/#interpreting-json-as-json-ld> or
> CSVW <https://www.w3.org/TR/tabular-data-model/#link-header>). The
> exchange would look like this:
>
> GET /document.pdf HTTP/1.1
> Host: example.com
>
> ====================================
>
> HTTP/1.1 200 OK
> Content-Type: application/pdf
> Link: <document_metadata.json>; rel="meta"; type="application/ld+json"
> ...
>
> Where document_metadata.json is a JSON-LD description of the file and its
> topic (written with Schema.org and Bioschemas of course). I'm not sure
> whether rel="meta" is the best choice here, but that's just an example.
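
For illustration, the content of document_metadata.json could be produced along these lines. This is only a sketch: every value is a hypothetical placeholder, and the choice of the Schema.org DigitalDocument type with the name, encodingFormat and about properties is an assumption about what a publisher might pick, not a recommendation from this thread.

```python
import json

# All values below are hypothetical placeholders for illustration.
metadata = {
    "@context": "https://schema.org",
    "@type": "DigitalDocument",
    "@id": "http://example.com/document.pdf",
    "name": "Example report",
    "encodingFormat": "application/pdf",
    "about": {"@type": "Thing", "name": "Example topic"},
}

# Serialized form that the server would return for document_metadata.json.
document_metadata_json = json.dumps(metadata, indent=2)
print(document_metadata_json)
```

Using the URL of the pdf as "@id" ties the description to the file itself rather than to a landing page.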
>
> Note that some metadata may already be embedded in pdf and image files by
> means of XMP <https://en.wikipedia.org/wiki/Extensible_Metadata_Platform>,
> where Schema.org types and properties could be used. But this does not work
> with any type of file, plus applications may want to use only HTTP-based
> mechanisms to get the markup data, rather than have to read the content of
> binary files.
>
> Have you seen this kind of use case and usage somewhere? Any other
> solution you could think of? Do search engines expect this kind of linking
> to external markup files?
>
> Thx in advance. Regards,
>    Franck.
>
> --
>
> Franck MICHEL, CNRS research engineer
>
> Université Côte d’Azur, CNRS, Inria
>
> I3S laboratory (UMR 7271)
>
>
>
> --
> Yvan Le Bras, PhD
> @Yvan2935
> <°))))><
> Scientific and technical lead, "Pole National de Données de Biodiversité"
> https://www.pndb.fr/
> Bureau 34, Station marine de Concarneau BP 225, 29182 Concarneau CEDEX ---
> MNHN Unité de service PatriNat Paris
> tel.: +33 (0) 2 98 50 99 35 / +33 (0) 6.10.43.96.51
> yvan.le-bras@mnhn.fr
>
>
>

Received on Thursday, 19 January 2023 18:41:40 UTC