- From: Jerven Bolleman <jerven.bolleman@sib.swiss>
- Date: Thu, 19 Jan 2023 20:14:07 +0100
- To: "LJ.Garcia" <lj.garcia.co@gmail.com>, Franck Michel <fmichel@i3s.unice.fr>, Stian Soiland-Reyes <soiland-reyes@manchester.ac.uk>
- Cc: Yvan Le Bras <yvan.le-bras@mnhn.fr>, Carole Goble <carole.goble@manchester.ac.uk>, Dan Bolser <dan.bolser@gmail.com>, public-bioschemas <public-bioschemas@w3.org>, Fabien Gandon <fabien.gandon@inria.fr>
Hi Franck,

If the document is a PDF, it can have RDF embedded in it (with some
limits, I believe); this is called XMP. For the OpenDocument family of
file formats, I know the metadata is also based on RDF, so you can just
add schema.org markup in RDF/XML. For XLSX as produced by Excel you
might be able to do something in its packaging, but I don't think that
is widely supported. For SVG one can use RDFa (which we do for
www.swissbiopics.org). For JPEG/PNG the markup will need to be outside
of the document.

Regards,
Jerven

On 19/01/2023 19:41, LJ.Garcia wrote:
> Hi Franck,
>
> What you mention about using the HTTP header reminds me of Signposting
> (https://signposting.org/). Have you seen this approach? I still have
> to catch up on this subject, so I am adding more people with better
> knowledge of it to the loop.
>
> Kind regards,
>
> On Thu, Jan 5, 2023 at 5:48 PM Franck Michel <fmichel@i3s.unice.fr>
> wrote:
>
>     Dear all,
>
>     Thank you for your remarks and comments. Actually I feel like the
>     discussion has already gone way beyond my initial question and
>     proposition.
>
>     My point was to figure out a simple way to provide metadata about
>     any kind of resource on the web, not only web pages, in the form of
>     Schema.org markup.
>
>     RO-Crate is definitely a very interesting initiative, but it
>     primarily concerns communities used to dealing with large data
>     repositories like Zenodo or Dataverse. Besides, it requires
>     encapsulating the produced objects within a package (archive) that
>     contains all necessary additional metadata. This is great for
>     enforcing FAIR ROs, but apart from such specific needs, an image on
>     the web will remain available as a raw jpg or png file, same thing
>     for a pdf, music, spreadsheet etc. We cannot expect each web master
>     to encapsulate those objects in RO-Crate packages.
>
>     A way to mark up an object is to create a web page that links to
>     this object, and add markup on that page. But whenever the object is
>     accessed directly by its URL, the markup data is no longer there. As
>     a result, SEO practices include terrible recommendations like naming
>     image files with a super long name containing the name of the thing
>     being represented, its description, the image resolution etc. Ugly,
>     right? XMP (Extensible Metadata Platform) allows embedding metadata
>     in binary files. That's much better, but it is limited to a few
>     file types and requires parsing the content of the file itself.
>
>     So my point is: we can link objects on the web to their metadata
>     with a mechanism that has been there since HTTP 1.0 (RFC 1945
>     <https://datatracker.ietf.org/doc/html/rfc1945#page-59>, 1996!),
>     that is, almost since the beginning of the web: the HTTP Link
>     header. Hence the example of a web server that returns a pdf
>     document along with this header:
>
>         Link: <document_metadata.json>; rel="meta"; type="application/ld+json"
>
>     Upside: it neither breaks nor imposes anything. HTTP clients that
>     don't care about or understand JSON-LD will just ignore it. Those
>     that can consume JSON-LD will fetch the metadata and use the
>     Schema.org annotations to do whatever they want. This way, search
>     engines will know precisely what's in the object, making tools like
>     Google Images able to index images much more effectively.
>     Downside: there has to be a second HTTP GET request to retrieve the
>     JSON-LD metadata. No big deal.
>
>     Does it make sense or is it just totally obvious?
>
>     Franck.
>
>     On 05/01/2023 at 11:55, Yvan Le Bras wrote:
>> Hi Franck, Carole, hi everyone,
>>
>> Let me first wish you all a happy new year!
>>
>> Sorry if I misunderstood or if I am totally wrong, but it seems
>> important to me to try to explain my point of view ;)
>>
>> Looking at your question, Franck, and notably at the answer from
>> Carole, it seems to me that 1/ schema.org is made to mark up web
>> pages and e-mail messages 2/ using an intermediate "metadata layer",
>> which can be RDFa or JSON-LD for example.
>>
>> Thus, to add schema.org vocabulary to "files", it appears to me that
>> the best approach is to use a metadata standard that describes the
>> data (including, for example, URLs to download the data files) and
>> that can then be exposed in RDFa or JSON-LD, for example through web
>> pages where the schema.org vocabulary is used... That is, as
>> structured data accessible on the internet.
>>
>> Thus, we can use RO-Crate or another standardized way to produce RO
>> metadata using schema.org on JSON-LD web pages (for example, we do so
>> in ecology using the "Ecological Metadata Language" standard, and we
>> can look at the structured data on the data catalog, like here:
>> https://data.pndb.fr/view/urn:uuid:99abf52c-b271-4b66-ae50-c504e492bc4c
>> where we notably use the "schemaVersion", "url", "datePublished",
>> "dateModified", "description", "keywords", "creator",
>> "temporalCoverage", "subjectOf", "fileFormat", "spatialCoverage",
>> "geo", "latitude", "longitude", "variableMeasured" schema.org terms)
>>
>> => Here I give the EML-oriented example because it allows us to
>> have detailed metadata, notably with "variableMeasured", which
>> gives our datasets a markedly higher FAIRness.
>>
>> Please, don't hesitate to comment!
>>
>> Wishing you a very good end of week,
>>
>> Best,
>>
>> Yvan
>>
>> ------------------------------------------------------------------------
>> *From:* "Carole Goble" <carole.goble@manchester.ac.uk>
>> *To:* "Dan Bolser" <dan.bolser@gmail.com>, "Franck Michel" <fmichel@i3s.unice.fr>
>> *Cc:* "public-bioschemas" <public-bioschemas@w3.org>, "Fabien Gandon" <fabien.gandon@inria.fr>
>> *Sent:* Thursday 5 January 2023 11:09:11
>> *Subject:* RE: How to mark up a document other than a web page?
>>
>> https://zenodo.org/record/7147703#.Y7agoxXP2F4 is a longer talk
>> that sets up the RO-Crate vision
>>
>> Carole
>>
>> Professor Carole Goble CBE FREng FBCS CITP
>> Department of Computer Science
>> The University of Manchester,
>> Manchester, M13 9PL, UK
>> Head of Node ELIXIR-UK <https://elixiruknode.org/>
>>
>> PLEASE Do not send me a calendar invite and expect me to see it.
>> (i) Invites only work 50% of the time (ii) if they do work they do
>> not appear as email so I don’t know they are there until it is too
>> late.
>>
>> Want me at a meeting? Email me. Don’t just silently sneak into a
>> diary I do not use.
>>
>> *From:* Carole Goble <carole.goble@manchester.ac.uk>
>> *Sent:* 05 January 2023 09:54
>> *To:* Dan Bolser <dan.bolser@gmail.com>; Franck Michel <fmichel@i3s.unice.fr>
>> *Cc:* public-bioschemas@w3.org; Fabien Gandon <fabien.gandon@inria.fr>; Carole Goble <carole.goble@manchester.ac.uk>
>> *Subject:* RE: How to mark up a document other than a web page?
>>
>> I have forwarded this thread to RO-Crate folks to pitch in
>>
>> RO-Crate (https://www.researchobject.org/ro-crate/) packages files and
>> annotates them with rich metadata (using BagIt). It uses JSON-LD
>> and schema.org. It’s an example of using schema.org for multiple
>> files, not web pages.
>>
>> RO-Crate has gained a lot of traction in organisations needing to
>> exchange digital objects with structured machine-readable
>> metadata, and is designed to be repository neutral – that is, to
>> enable inter-repo exchange. Zenodo and Dataverse have work ongoing
>> to build compliance.
>>
>> https://zenodo.org/record/7376356#.Y7adghXP2F4 is a talk about
>> the repository overlay aspect of RO-Crate
>>
>> Carole
>>
>> *From:* Dan Bolser <dan.bolser@gmail.com>
>> *Sent:* 05 January 2023 09:39
>> *To:* Franck Michel <fmichel@i3s.unice.fr>
>> *Cc:* public-bioschemas@w3.org; Fabien Gandon <fabien.gandon@inria.fr>
>> *Subject:* Re: How to mark up a document other than a web page?
>>
>> https://www.tomforth.co.uk/scienceandpdfs/
>>
>> Looks useful
>>
>> On Wed, Jan 4, 2023, 5:50 PM Franck Michel <fmichel@i3s.unice.fr> wrote:
>>
>>     Dear community,
>>
>>     First of all, let me wish you all a happy, richly marked up
>>     new year ;).
>>
>>     Schema.org is meant to mark up resources of any kind on the
>>     internet, not just web pages. While presenting Bioschemas, I
>>     once had this question: how do I mark up a pdf file? More
>>     generally, how to mark up any resource other than html or
>>     xml-based content, like a pdf, image, csv, Excel sheet, zip
>>     archive etc.?
>>
>>     I recently asked this during a BSC meeting but it seemed that
>>     nobody had really faced this use case yet. And I did a quick
>>     Google search but nothing came up. So I'd be interested in
>>     having your thoughts on this.
>>
>>     A basic solution would be to insert markup in the web page
>>     that provides the download link. Not so satisfying since, when
>>     an application downloads the file using its direct URL, there
>>     is no more markup.
>>
>>     I could think of a simple solution that uses the HTTP Link
>>     header to point to a file containing the markup data
>>     (similarly to what's been done in JSON-LD
>>     <https://www.w3.org/TR/json-ld/#interpreting-json-as-json-ld>
>>     or CSVW
>>     <https://www.w3.org/TR/tabular-data-model/#link-header>). The
>>     exchange would look like this:
>>
>>         GET /document.pdf HTTP/1.1
>>         Host: example.com
>>
>>         ====================================
>>
>>         HTTP/1.1 200 OK
>>         Content-Type: application/pdf
>>         Link: <document_metadata.json>; rel="meta"; type="application/ld+json"
>>         ...
>>
>>     Where document_metadata.json is a JSON-LD description of the
>>     file and its topic (written with Schema.org and Bioschemas of
>>     course). I'm not sure whether rel="meta" is the best choice
>>     here, but that's just an example.
>>
>>     Note that some metadata may already be embedded in pdf and
>>     image files by means of XMP
>>     <https://en.wikipedia.org/wiki/Extensible_Metadata_Platform>,
>>     where Schema.org types and properties could be used. But this
>>     does not work with every type of file, and applications may
>>     want to use only HTTP-based mechanisms to get the markup data,
>>     rather than having to read the content of binary files.
>>
>>     Have you seen this kind of use case and usage somewhere? Any
>>     other solution you could think of? Do search engines expect
>>     this kind of linking to external markup files?
>>
>>     Thx in advance. Regards,
>>     Franck.
>>
>>     --
>>     Franck MICHEL, CNRS research engineer
>>     Université Côte d’Azur, CNRS, Inria
>>     I3S laboratory (UMR 7271)
>>
>> --
>> Yvan Le Bras, PhD @Yvan2935 <°))))><
>> Scientific and technical lead, "Pole National de Données de Biodiversité" https://www.pndb.fr/
>> Bureau 34, Station marine de Concarneau BP 225, 29182 Concarneau CEDEX --- MNHN Unité de service PatriNat Paris
>> tel.: +33 (0) 2 98 50 99 35 / +33 (0) 6.10.43.96.51
>> yvan.le-bras@mnhn.fr
>

--
*Jerven Tjalling Bolleman*
Principal Software Developer
*SIB | Swiss Institute of Bioinformatics*
1, rue Michel Servet - CH 1211 Geneva 4 - Switzerland
t +41 22 379 58 85
Jerven.Bolleman@sib.swiss - www.sib.swiss
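
For readers who want to try the Link-header idea discussed in this thread, here is a minimal client-side sketch in Python (standard library only). It assumes the server advertises the metadata as in Franck's example, with rel="meta" and type="application/ld+json"; Franck himself notes that rel="meta" may not be the best choice. The function names parse_link_header and find_metadata_url are hypothetical, and the parser is deliberately simplified: it does not handle a "<" inside a quoted parameter value.

```python
import re
from urllib.parse import urljoin

# One link-value: "<target>" followed by everything up to the next "<".
_LINK_VALUE = re.compile(r'<([^>]*)>([^<]*)')
# One parameter: ;name="quoted value" or ;name=token
_PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,"]+))')


def parse_link_header(value):
    """Split a Link header value into (target, params) pairs.

    Simplified sketch, not a complete implementation of the header grammar.
    """
    links = []
    for match in _LINK_VALUE.finditer(value):
        target, raw_params = match.group(1), match.group(2)
        params = {}
        for name, quoted, bare in _PARAM.findall(raw_params):
            params[name.lower()] = quoted or bare
        links.append((target, params))
    return links


def find_metadata_url(resource_url, link_header,
                      rel="meta", media_type="application/ld+json"):
    """Return the absolute URL of the JSON-LD metadata, or None.

    The rel and media_type defaults mirror the example in the thread;
    they are assumptions, not an agreed convention.
    """
    for target, params in parse_link_header(link_header):
        rels = params.get("rel", "").lower().split()
        if rel in rels and params.get("type", "").lower() == media_type:
            # Relative targets are resolved against the resource's own URL.
            return urljoin(resource_url, target)
    return None


if __name__ == "__main__":
    header = '<document_metadata.json>; rel="meta"; type="application/ld+json"'
    print(find_metadata_url("http://example.com/document.pdf", header))
    # prints http://example.com/document_metadata.json
```

In practice the header value would come from the HTTP response for the resource itself, and the client would then issue the second GET request for the returned URL to fetch the JSON-LD description.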
Received on Thursday, 19 January 2023 19:14:24 UTC