Re: My format can beat up your format (Was: Re: DeSeRe’24 Workshop on Decentralised search and Recommendations)

On 2024-03-19 14:35, Leonard Rosenthol wrote:
> I wasn’t trying to say that PDF is better than your format – I was 
> simply trying to make sure that no misinformation was spread.  Nothing more.

:) The subject line is intended as a lighthearted joke because I'm aware 
of where these kinds of discussions tend to go.

>>I presume you're referring to extracting / mapping XMP
> 
> No, that is not what I am referring to at all.  I am referring to the 
> feature of PDF called “Tagged PDF”, which has been part of the standard 
> for 25 years.  And don’t forget that PDF has been an ISO standard (ISO 
> 32000) since 2008 and is a normative reference in the HTML5 
> specification (which is the basis for the “open web”).

As in 
https://html.spec.whatwg.org/multipage/system-state.html#dom-navigator-pdfviewerenabled 
that's part of NavigatorPlugins (non-normative)? Either way, my point 
wasn't that there are no references to PDF or it is entirely unsupported 
in any way. I find working with HTML (over PDF) as source to be

> In PDF 2.0 (ISO 32000-2), we added support for RDFa as part of that.  

Thanks for bringing that to my attention. I stand corrected. As 
mentioned in my earlier email, I was running on knowledge prior to 
ISO 32000-2. So, I have to say that I'm amazed that RDFa even made its 
way into PDF!

> Here is a picture from a presentation that I give on the topic showing 
> the tagging with RDFa semantics and the associated derived HTML.
> 
> [Image: screenshot of the tagging with RDFa semantics and the derived HTML]

If you have a reference to an example PDF+RDFa document that you can 
link to - private to me is also okay - as well as an HTML+RDFa 
serialization, I'd love to inspect them. From my reading of ISO 
32000-2:2020 (with errata), since `/O` allows RDFa attributes, and 
presumably any conforming value, it doesn't have the same limitations as 
XMP (e.g., any subject could be described via `about=`, I take it?).

> There are numerous open source tools & libraries that give you access to 
> this information when present in a PDF, as well as tools for creating it 
> in the first place.  Even common publishing solutions such as 
> Open/LibreOffice, various (La)TeX implementations and even commercial 
> solutions also support creation of Tagged PDFs.

I acknowledge that this is useful in a pipeline where the "graph" inside 
those documents - or as an alternative representation for the PDF, 
whether in HTML+RDFa, Turtle, or something else - can be extracted and 
be accessible from a Solid storage.

I find HTML(+RDFa) to be the most frictionless and lossless to work with 
for a wide range of information as the source format, especially when a 
human- and machine-readable view needs to end up in the browser. 
JavaScript doesn't even need to enter the picture until HTTP 
write-operations are needed or for the behaviour layer on the document.
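
To make the "human- and machine-readable from one source" point concrete, 
here is a minimal sketch in Python (standard library only). The document, 
the example.org URL, the values, and the `TinyRDFaSketch` class are all 
hypothetical. It only understands `about`, `property` and `content` with 
full IRIs, so it is nowhere near a conforming RDFa processor; a real 
pipeline would use a proper RDFa library instead.

```python
from html.parser import HTMLParser

# Hypothetical document: the example.org URL and all values are made up.
HTML_DOC = """
<article about="https://example.org/notes/formats#this">
  <meta property="http://purl.org/dc/terms/created" content="2024-03-20">
  <h1 property="http://purl.org/dc/terms/title">My format can beat up your format</h1>
  <p>Written by <span property="http://purl.org/dc/terms/creator">A. Author</span>.</p>
</article>
"""

# Elements that never get an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"meta", "link", "br", "img", "hr", "input"}


class TinyRDFaSketch(HTMLParser):
    """Collects (subject, property, literal) triples for the simplest RDFa pattern.

    Assumption: only `about`, `property` and `content` are handled; prefixes,
    `typeof`, `resource`, `rel`/`rev`, datatypes and chaining are ignored.
    """

    def __init__(self):
        super().__init__()
        self.triples = []
        self._stack = []  # one entry per open (non-void) element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inherited = self._stack[-1]["subject"] if self._stack else None
        subject = attrs.get("about") or inherited
        prop = attrs.get("property")
        if tag in VOID_TAGS:
            # A void element can only carry its value in `content`.
            if subject and prop and attrs.get("content") is not None:
                self.triples.append((subject, prop, attrs["content"]))
            return
        self._stack.append({"tag": tag, "subject": subject, "property": prop,
                            "content": attrs.get("content"), "text": []})

    def handle_data(self, data):
        # Text belongs to every open element that declares a property.
        for entry in self._stack:
            if entry["property"]:
                entry["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack or self._stack[-1]["tag"] != tag:
            return
        entry = self._stack.pop()
        if entry["subject"] and entry["property"]:
            value = entry["content"]
            if value is None:
                value = "".join(entry["text"]).strip()
            self.triples.append((entry["subject"], entry["property"], value))


def extract_triples(html):
    parser = TinyRDFaSketch()
    parser.feed(html)
    parser.close()
    return parser.triples


triples = extract_triples(HTML_DOC)
for subject, prop, value in triples:
    print(subject, prop, repr(value))
```

The same markup renders as an ordinary article in a browser without any 
script, while the attributes carry the statements a machine would pick up.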

> Personally, I don’t think there is a “best solution for all cases of 
> information sharing”.  It is entirely dependent on whether the goal is 
> to share information with humans, with machines or with both.  It is 
> also important to consider additional requirements such as 
> longevity/stability of the information.  And so folks should always 
> choose what works best for them and their use cases.   (and with that 
> said, I think the Solid platform and its technologies bring some 
> excellent pieces to the world – which is why I am here in this group!)

I agree on all points. Which brings me back to the subject line of this 
email... where we tend to get into the weeds about formats/serializations.. 
tabs vs. spaces.. as you know, it is pretty easy to run into why Turtle 
can beat up JSON-LD or vice-versa... (until of course HTML enters the 
chat). Meanwhile none of the plumbing matters to the end-user.

-Sarven
https://csarven.ca/#i

Received on Wednesday, 20 March 2024 10:54:36 UTC