[data-shapes] Use case: lab report enrichment (#584)

recalcitrantsupplant has just created a new issue for https://github.com/w3c/data-shapes:

== Use case: lab report enrichment ==
I have a number of PDF lab reports from different chemical analysis laboratories, which contain similar information in different formats. These have been OCR'd to JSON, which holds the extracted text, its position, confidence scores, etc. I then convert this to RDF, at which point I need to enrich it by calculating a number of additional values: some can be derived directly from the RDF, while others need to be piped through an external process such as human review, an LLM, or ML models.

As this is a proof of concept I am currently building most of this logic into SPARQL queries. With more time I would break these into more granular rules, which would then be grouped for a particular lab's format.
 
The main things I would be looking for when using rules would be:
- ability to define granular, versioned rules and record provenance for each execution (time, etc.)
- ability to group rules
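
To make these two requirements concrete, here is a minimal Python sketch of what I have in mind. It is not tied to any rules engine or to the draft; the names `Rule`, `RuleGroup`, `strip-less-than` and `lab-a-format` are all hypothetical and exist only for illustration. Each rule carries a version, rules are grouped per lab format, and running a group returns a provenance record per rule execution.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One granular rule (hypothetical structure, for illustration only)."""
    name: str
    version: str
    apply: Callable[[dict], dict]  # takes current bindings, returns new bindings


@dataclass
class RuleGroup:
    """A named group of rules, e.g. one group per lab report format."""
    label: str
    rules: list[Rule] = field(default_factory=list)

    def run(self, row: dict) -> tuple[dict, list[dict]]:
        out = dict(row)
        provenance = []
        for rule in self.rules:
            executed_at = datetime.now(timezone.utc)
            out.update(rule.apply(out))
            # Record which rule, which version, and when it ran
            provenance.append({
                "group": self.label,
                "rule": rule.name,
                "version": rule.version,
                "executed_at": executed_at.isoformat(),
            })
        return out, provenance


# Hypothetical rule and group names
strip_less_than = Rule(
    "strip-less-than", "1.0.0",
    lambda r: {"Result": r["Text_Result"].removeprefix("< ")},
)
group = RuleGroup("lab-a-format", [strip_less_than])
enriched, prov = group.run({"Text_Result": "< 0.5"})
```

In RDF terms, I would expect each provenance record to become a handful of triples attached to the rule execution rather than a Python dict.
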

Some current example SPARQL:
```sparql
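  # Strip a trailing NXX code (N plus two digits) from the raw chemical name
  BIND(REPLACE(?chem_name_raw, "N\\d{2}$", "") AS ?Original_Chem_Name)
  # Strip a leading NXX code from the raw value
  BIND(REPLACE(?value_raw, "^N\\d{2}", "") AS ?Text_Result)
  # Keep the leading NXX code itself, or "" when there is none
  BIND(
    IF(REGEX(STR(?value_raw), "^N\\d{2}"),
       SUBSTR(STR(?value_raw), 1, 3),
       ""
    ) AS ?NXX
  )
  # Split a leading "< " marker from the numeric text
  BIND(REPLACE(?Text_Result, "^< ", "") AS ?Result)
  BIND(IF(STRSTARTS(?Text_Result, "< "), "<", "") AS ?Prefix)
  # Count significant figures from the remaining digits
  BIND(STRLEN(
    REPLACE(
      REPLACE(STR(?Result), "\\.", ""),  # 1. remove decimal point
      "^0+", ""                          # 2. strip all leading zeros
    )
  ) AS ?Result_Sig_Figs)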
```
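
The string handling in these BIND expressions can be checked outside a triple store. The following is a rough Python mirror of the same steps, written as a sketch. The function names and the sample inputs are made up for illustration and are not taken from a real report.

```python
import re


def clean_chem_name(chem_name_raw: str) -> str:
    """Mirror of ?Original_Chem_Name: strip a trailing NXX code."""
    # \Z anchors at the very end of the string, like "$" in the SPARQL regex
    return re.sub(r"N\d{2}\Z", "", chem_name_raw)


def parse_value(value_raw: str) -> dict:
    """Mirror of the remaining BINDs for one raw result string."""
    # ?NXX: the leading code (N plus two digits), or "" if absent
    match = re.match(r"N\d{2}", value_raw)
    nxx = match.group(0) if match else ""
    # ?Text_Result: raw value without the leading code
    text_result = re.sub(r"^N\d{2}", "", value_raw)
    # ?Prefix and ?Result: split off a leading "< "
    prefix = "<" if text_result.startswith("< ") else ""
    result = re.sub(r"^< ", "", text_result)
    # ?Result_Sig_Figs: drop decimal points, strip leading zeros, count the rest
    digits = re.sub(r"^0+", "", result.replace(".", ""))
    return {
        "NXX": nxx,
        "Text_Result": text_result,
        "Prefix": prefix,
        "Result": result,
        "Result_Sig_Figs": len(digits),
    }


# Hypothetical sample values
sample = parse_value("N01< 0.050")
plain = parse_value("12.30")
```

With these inputs, `sample` has `NXX` of `"N01"`, `Prefix` of `"<"`, `Result` of `"0.050"` and 2 significant figures, while `plain` has 4. The count is purely a character count, so a result of `"0"` yields 0 and any non-digit characters left in the result would be counted too.
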
Other examples which can be derived from the RDF include flagging certain values for human review, based on thresholds and other context in the table.

Example lab report the data is extracted from:
<img width="1677" height="1026" alt="Image" src="https://github.com/user-attachments/assets/fdd94cd7-7605-4459-8aea-934307a872de" />

This all seems within the scope of the current draft to me. The versioning can be additional triples which I manage, and the provenance can be built into the rule logic itself.

Please view or discuss this issue at https://github.com/w3c/data-shapes/issues/584 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Tuesday, 23 September 2025 09:25:24 UTC