How to mark up a document other than a web page? from Franck Michel on 2023-01-04 (public-bioschemas@w3.org from January 2023)

From: Franck Michel <fmichel@i3s.unice.fr>
Date: Wed, 4 Jan 2023 18:49:35 +0100
To: "public-bioschemas@w3.org" <public-bioschemas@w3.org>, Fabien Gandon <fabien.gandon@inria.fr>
Message-ID: <2eb8003c-d77c-35f4-a3d3-50ee62ba7949@i3s.unice.fr>

Dear community,

First of all, let me wish you all a happy, richly marked up new year ;).

Schema.org is meant to mark up resources of any kind on the internet,
not just web pages. While presenting Bioschemas, I once had this
question: how do I mark up a PDF file? More generally, how to mark up
any resource that is not HTML or XML-based content, such as a PDF, an
image, a CSV file, an Excel sheet, a zip archive, etc.?

I recently asked this during a BSC meeting but it seemed that nobody had 
really faced this use case yet. And I did a quick Google search but 
nothing came up. So I'd be interested in having your thoughts on this.

A basic solution would be to insert markup in the web page that provides
the download link. This is not very satisfying: when an application
downloads the file using its direct URL, it never sees the markup.
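
To make this basic solution concrete, here is a minimal Python sketch that
builds the JSON-LD script element a landing page could embed. The function
name landing_page_markup and the choice of the Schema.org MediaObject type
(with contentUrl and encodingFormat) are assumptions made for illustration;
a Bioschemas profile may call for a different type.

```python
import json


def landing_page_markup(file_url, name, encoding_format):
    """Return a JSON-LD <script> element describing a downloadable file.

    Hypothetical helper: the MediaObject type is an assumption, not a
    recommendation taken from Schema.org or Bioschemas guidance.
    """
    description = {
        "@context": "https://schema.org",
        "@type": "MediaObject",
        "name": name,
        "contentUrl": file_url,
        "encodingFormat": encoding_format,
    }
    body = json.dumps(description, indent=2)
    # Escape "</" so the JSON cannot close the script element early;
    # "\/" is a legal JSON escape for "/".
    body = body.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    print(landing_page_markup(
        "https://example.com/document.pdf", "Example document", "application/pdf"
    ))
```

The limitation stays the same: this markup lives in the landing page only,
so a client that fetches https://example.com/document.pdf directly gets the
file without any description.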

I could think of a simple solution that uses the HTTP Link header to
point to a file containing the markup data (similarly to what's been
done in JSON-LD
<https://www.w3.org/TR/json-ld/#interpreting-json-as-json-ld> or CSVW
<https://www.w3.org/TR/tabular-data-model/#link-header>). The exchange
would look like this:

GET /document.pdf HTTP/1.1
Host: example.com

====================================

HTTP/1.1 200 OK
Content-Type: application/pdf
Link: <document_metadata.json>; rel="meta"; type="application/ld+json"
...

Where document_metadata.json is a JSON-LD description of the file and 
its topic (written with Schema.org and Bioschemas of course). I'm not 
sure whether rel="meta" is the best choice here, but that's just an example.
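
On the client side, this could be consumed roughly as in the Python sketch
below. It is only an illustration under stated assumptions: the function
name find_metadata_links is hypothetical, rel="meta" is the example value
used above rather than an agreed relation type, and the parsing is
deliberately simplified (it does not handle commas or semicolons inside
quoted parameter values).

```python
import re
from urllib.parse import urljoin


def find_metadata_links(link_header, base_url, rel="meta",
                        media_type="application/ld+json"):
    """Return absolute URLs of Link targets matching the given rel and type.

    Hypothetical helper; a simplified reading of the Link header syntax.
    """
    matches = []
    # Each link value looks like: <target>; name="value"; name=value
    link_pattern = r'<([^>]*)>((?:\s*;\s*[^;,]+)*)'
    param_pattern = r';\s*([\w-]+)\s*=\s*("[^"]*"|[^;,\s]+)'
    for target, params_text in re.findall(link_pattern, link_header):
        params = {}
        for name, value in re.findall(param_pattern, params_text):
            params[name.lower()] = value.strip('"')
        # rel may carry several space-separated relation types.
        rels = params.get("rel", "").split()
        if rel in rels and params.get("type") == media_type:
            # The target is resolved against the URL of the requested file.
            matches.append(urljoin(base_url, target))
    return matches


if __name__ == "__main__":
    header = '<document_metadata.json>; rel="meta"; type="application/ld+json"'
    print(find_metadata_links(header, "https://example.com/document.pdf"))
```

With the response shown above, the relative target document_metadata.json
would resolve against the URL of the PDF, so the client would then fetch
https://example.com/document_metadata.json and parse it as JSON-LD.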

Note that some metadata may already be embedded in PDF and image files
by means of XMP
<https://en.wikipedia.org/wiki/Extensible_Metadata_Platform>, where
Schema.org types and properties could be used. But this does not work
with every type of file, and applications may want to use only
HTTP-based mechanisms to get the markup data, rather than having to read
the content of binary files.

Have you seen this kind of use case and usage somewhere? Any other 
solution you could think of? Do search engines expect this kind of 
linking to external markup files?

Thanks in advance. Regards,
    Franck.

-- 
Franck MICHEL, CNRS research engineer
Université Côte d’Azur, CNRS, Inria
I3S laboratory (UMR 7271)

Received on Wednesday, 4 January 2023 17:49:50 UTC