My format can beat up your format (Was: Re: DeSeRe’24 Workshop on Decentralised search and Recommendations)

On 2024-03-19 12:36, Leonard Rosenthol wrote:
> Sarven – I have no connection to this conference, but the general answer 
> to the question is that PDF is a semantically rich format which includes 
> full support for structural semantics, content semantics as well as RDFa 
> compatible “markup”.   So given a properly constructed PDF, retrieval of 
> such information is well defined.  In fact, there is an industry 
> standard for deterministically deriving equivalent HTML(+RDF, if 
> present) - https://pdfa.org/resource/deriving-html-from-pdf/ .


Is the suggestion that when senders or receivers are equipped with the 
processing algorithm, they can extract the structured data - using the 
term liberally - inside the PDF?

I presume you're referring to extracting / mapping XMP (specifically 
RDF/XML) in PDF to HTML(+RDF) as per 
https://pdfa.org/resource/iso-16684-xmp/ , which is paywalled.

Fortunately when I last researched this topic ( 
https://csarven.ca/linked-research-decentralised-web ), I archived the 
then publicly available XMP specification:

https://web.archive.org/web/20190710075340/https://wwwimages2.adobe.com/content/dam/acom/en/devnet/xmp/pdfs/XMP%20SDK%20Release%20cc-2016-08/XMPSpecificationPart1.pdf

IIRC, creating and consuming PDF/XMP required proprietary tools, and the 
tooling that exists is underdeveloped. It is hidden/grey metadata. The 
RDF within XMP can only have one unique subject (of triple), which is 
not even intended to identify the document itself. The RDF in XMP is 
essentially a (broken) subset of RDF/XML.
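
To make the single-subject point concrete, here is a minimal sketch in 
Python (standard library only). It is an illustration under stated 
assumptions, not the algorithm from deriving-html-from-pdf or from the 
XMP specification: it assumes the XMP packet is stored uncompressed in 
the PDF so that a plain byte scan can find it, and the helper name 
xmp_subjects is hypothetical. It collects the rdf:about value of every 
rdf:Description element in the first packet it finds.

```python
import re
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumption: the XMP packet sits uncompressed in the file, so a byte
# scan for the x:xmpmeta wrapper element is enough to locate it.
XMPMETA_RE = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def xmp_subjects(pdf_bytes: bytes) -> set:
    """Return the distinct rdf:about values in the first XMP packet.

    Hypothetical helper for illustration. Returns an empty set when no
    uncompressed packet is found.
    """
    match = XMPMETA_RE.search(pdf_bytes)
    if match is None:
        return set()
    root = ET.fromstring(match.group(0))
    subjects = set()
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # A missing rdf:about is treated like the empty string here.
        subjects.add(desc.get(f"{{{RDF_NS}}}about", ""))
    return subjects
```

Usage would be along the lines of 
xmp_subjects(open("paper.pdf", "rb").read()), where paper.pdf is a 
placeholder file name. If the result holds a single value, often the 
empty string, every statement in the packet hangs off that one subject.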

That said, if there are alternative formats/models that 
deriving-html-from-pdf follows, can you please refer me to a specification?

While I don't refute that some things are possible with PDF and it has 
done well (thinking mostly print-centric, among other things), having it 
actually play well with/in the open web platform requires 
reverse-engineering the flows/formats/data, in a nutshell. (I'd be happy 
to be corrected on how well PDFs work with respect to read-write HTTP 
operations, deep linking, and so forth.)

PDF/XMP's intricate complexity doesn't align with the rule of least 
power design principle ( https://www.w3.org/DesignIssues/Principles#PLP 
, https://www.w3.org/2001/tag/doc/leastPower.html , 
https://en.wikipedia.org/wiki/Rule_of_least_power ). FWIW.

I don't see a compelling reason to package knowledge in PDF and share it 
on the web, especially when there are *open* *industry standards* that 
literally work better in just about every way for knowledge sharing (and 
interactivity)... on the web. Others' mileage may vary.

-Sarven
https://csarven.ca/#i

Received on Tuesday, 19 March 2024 13:01:01 UTC