- From: Sarven Capadisli <info@csarven.ca>
- Date: Tue, 19 Mar 2024 14:00:53 +0100
- To: public-solid@w3.org
On 2024-03-19 12:36, Leonard Rosenthol wrote:
> Sarven – I have no connection to this conference, but the general answer
> to the question is that PDF is a semantically rich format which includes
> full support for structural semantics, content semantics as well as RDFa
> compatible “markup”. So given a properly constructed PDF, retrieval of
> such information is well defined. In fact, there is an industry
> standard for deterministically deriving equivalent HTML(+RDF, if
> present) - <https://pdfa.org/resource/deriving-html-from-pdf/>.

Is the suggestion that when senders or receivers are equipped with the
processing algorithm, they can extract the structured data - using the
term liberally - inside the PDF?

I presume you're referring to extracting / mapping XMP (specifically
RDF/XML) in PDF to HTML(+RDF) as per
https://pdfa.org/resource/iso-16684-xmp/ , which is paywalled.
Fortunately, when I last researched this topic (
https://csarven.ca/linked-research-decentralised-web ), I archived the
then publicly available XMP specification:

https://web.archive.org/web/20190710075340/https://wwwimages2.adobe.com/content/dam/acom/en/devnet/xmp/pdfs/XMP%20SDK%20Release%20cc-2016-08/XMPSpecificationPart1.pdf

IIRC, PDF/XMP required proprietary tools to create and consume, and
those tools were underdeveloped. It is hidden/grey metadata. The RDF
within XMP can only have one unique subject (of a triple), and that
subject is not even intended to identify the document itself. The RDF in
XMP is essentially a (broken) subset of RDF/XML.

That said, if there are alternative formats/models that
deriving-html-from-pdf follows, can you please refer me to a
specification?

While I don't refute that some things are possible with PDF and that it
has done well (thinking mostly print-centric, among other things),
having it actually play well with/in the open web platform requires
reverse-engineering the flows/formats/data, in a nutshell.
(I'd be happy to be corrected on how well PDFs work with respect to
read-write HTTP operations, deep linking, and so forth.)

PDF/XMP's intricate complexity doesn't align with the rule of least
power design principle (
https://www.w3.org/DesignIssues/Principles#PLP ,
https://www.w3.org/2001/tag/doc/leastPower.html ,
https://en.wikipedia.org/wiki/Rule_of_least_power ).

FWIW, I don't see a compelling reason to package knowledge in PDF and
share it on the web, especially when there are *open* *industry
standards* that literally work better in just about every way for
knowledge sharing (and interactivity)... on the web. Others' mileage may
vary.

-Sarven
https://csarven.ca/#i
Received on Tuesday, 19 March 2024 13:01:01 UTC