Re: How should I query ActivityStreams objects containing both JSON and HTML?

It looks like you’re making an algebra. Have you considered using the SPARQL algebra? https://www.w3.org/2001/sw/DataAccess/rq23/rq24-algebra.html

> On Mar 31, 2023, at 5:02 PM, Bob Wyman <bob@wyman.us> wrote:
> 
> Some have asked "Why are you doing this?"... Well, I'm working on an "ActivityStreams" variant of WebSub that provides content-based, rather than just topic-based, subscriptions. (WebSub only supports subscribing to named feeds -- that's topic-based pubsub.)
> 
> In the system I'm building, content-based prospective subscriptions would be composed of arbitrarily complex Boolean queries. Thus, you might subscribe to:
>> $..object.type == 'Note' AND $..object.author == 'alice@example.com' AND $..object.content != 'foobar'
> 
> Given this query, whenever Alice publishes a Note that does not contain the word 'foobar', a notification would be generated. (Actually, an "ActivityPub" instance could use this internally as a way to express a user's filtered subscriptions.) Also, note that I'm thinking of using "=" and "!=" for "contains" or fuzzy matching, and "==" and "!==" for exact or strict matching.
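
To make the semantics of the example subscription concrete, here is a minimal sketch of evaluating it against one incoming activity. Everything in it is an assumption for illustration: the (path, operator, value) tuples, the function names, and my reading of the operators ("=" and "!=" as contains / does not contain, "==" and "!==" as exact match / exact mismatch). It handles only paths of the form $..a.b and only AND, and it checks one subscription at a time, so it says nothing about matching millions of subscriptions efficiently.

```python
from typing import Any, Iterator, List, Tuple

# Hypothetical representation of one ANDed term: (path, operator, wanted value).
Term = Tuple[str, str, str]


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` at any depth (the `$..key` step)."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def select(doc: Any, path: str) -> Iterator[Any]:
    """Resolve a path of the form `$..first.rest.of.path`."""
    if not path.startswith("$.."):
        raise ValueError("this sketch only handles paths of the form $..a.b")
    first, *rest = path[3:].split(".")
    for node in find_key(doc, first):
        for key in rest:
            node = node.get(key) if isinstance(node, dict) else None
        if node is not None:
            yield node


def term_matches(doc: Any, term: Term) -> bool:
    """Evaluate one term; negated operators must hold for every selected value."""
    path, op, wanted = term
    values = [str(v) for v in select(doc, path)]
    if op == "=":      # contains
        return any(wanted in v for v in values)
    if op == "==":     # exact match
        return any(v == wanted for v in values)
    if op == "!=":     # does not contain
        return all(wanted not in v for v in values)
    if op == "!==":    # not exactly equal
        return all(v != wanted for v in values)
    raise ValueError(f"unknown operator: {op}")


def subscription_matches(doc: Any, terms: List[Term]) -> bool:
    """A subscription here is a plain AND of its terms."""
    return all(term_matches(doc, t) for t in terms)


subscription = [
    ("$..object.type", "==", "Note"),
    ("$..object.author", "==", "alice@example.com"),
    ("$..object.content", "!=", "foobar"),
]
activity = {
    "type": "Create",
    "object": {
        "type": "Note",
        "author": "alice@example.com",
        "content": "Hello, world",
    },
}
print(subscription_matches(activity, subscription))  # True
```

One design choice worth flagging: with a negated operator, an activity that has no value at the path at all counts as a match, which may or may not be what a subscriber expects.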
> 
> Of course, the real challenge of such a system is implementing the matching efficiently. That's something I've done for similar systems in the past. (Some may remember PubSub.com, or the Prospective Search Infrastructure (PSI) that I built at Google.) Ideally, a server should be able to support thousands or millions of these Prospective Queries being processed in real time as new posts are ingested.
> 
> In most cases, the right thing to do will be to simply convert any HTML elements to text and then allow searching within that text using =, ==, !=, or !==. However, some applications will publish HTML that is structured in some useful manner (e.g., as a table or according to some standard format). For instance, a weather report in HTML might include class or id attributes that allow parsing out individual bits of data like temperature or humidity. In those cases, it would be useful to be able to query parts of the included HTML. Thus the need for an XPath-like function.
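
Both cases in the paragraph above can be sketched with Python's standard html.parser module: flattening HTML to text for plain searching, and pulling out only the text inside elements carrying a given class. The helper names and the weather markup are made up for illustration, and the sketch assumes every non-void element is explicitly closed.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class _TextCollector(HTMLParser):
    """Collects text nodes, optionally only those inside a given class."""

    def __init__(self, wanted_class: Optional[str] = None) -> None:
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0                  # > 0 while inside a matching element
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.wanted_class is None or self.depth:
            self.chunks.append(data)


def html_to_text(html: str, wanted_class: Optional[str] = None) -> str:
    """Return whitespace-normalized text, optionally limited to one class."""
    parser = _TextCollector(wanted_class)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical weather report markup.
content = (
    '<p>Weather for <b>Boston</b>: '
    '<span class="temperature">12</span> C, '
    '<span class="humidity">81</span>%</p>'
)
print("Boston" in html_to_text(content))     # True
print(html_to_text(content, "temperature"))  # 12
```

Joining the text chunks with spaces keeps words from neighboring elements from fusing together, at the cost of occasional stray spaces before punctuation, which seems the safer trade-off for a "contains" search.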
> 
> bob wyman
> 
> On Fri, Mar 31, 2023 at 6:05 PM Bob Wyman <bob@wyman.us> wrote:
>> In a Mastodon.social post <https://mastodon.social/@bobwyman/110120087223817037>, I asked:
>> 
>> XPath and JSONPath are similar, but different. (See JSONPath spec <https://goessner.net/articles/JsonPath/>) This presents a problem for me since I'm building a system to query ActivityStreams objects that can include HTML wrapped in JSON.
>> 
>> Should I:
>> 1. Use XPath syntax for both JSON and HTML?
>> 2. Use JSONPath syntax for both JSON and HTML? (If so, is there a reasonable extension to JSONPath to support selecting on HTML attributes?)
>> 3. Switch between JSONPath and XPath depending on the underlying datatype? (e.g., embedding XPath in JSONPath.)
>> If you were writing a query, would you accept needing to know both syntaxes?
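
For what it's worth, the third option (switching syntax by datatype) might look roughly like the sketch below: a JSON step list selects the string value, then an XPath expression runs against the HTML inside it. The query() function and its argument shapes are hypothetical, not a proposed syntax. The sketch assumes the embedded HTML is well-formed XML with a single root element, since it uses xml.etree.ElementTree, which also supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET
from typing import Any, List, Optional


def query(doc: Any, json_keys: List[str], xpath: Optional[str] = None) -> List[Any]:
    """Walk `json_keys` through the JSON, then optionally apply `xpath`
    to the selected string, treating it as a well-formed XML fragment."""
    node = doc
    for key in json_keys:
        node = node[key]
    if xpath is None:
        return [node]
    root = ET.fromstring(node)  # raises ParseError if the HTML is not well-formed
    return [element.text for element in root.findall(xpath)]


# Hypothetical activity with structured HTML in its content.
activity = {
    "object": {
        "type": "Note",
        "content": (
            "<div><span class='temperature'>12</span>"
            "<span class='humidity'>81</span></div>"
        ),
    }
}
print(query(activity, ["object", "type"]))  # ['Note']
print(query(activity, ["object", "content"], ".//span[@class='temperature']"))  # ['12']
```

Real-world HTML is often not well-formed XML, so a production version would presumably need an HTML parser in place of ElementTree.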
>> 
>> I would appreciate any advice you might be able to provide. Also, I would be interested to hear if anyone else has already been faced with and addressed this issue.
>> 
>> bob wyman

Received on Saturday, 1 April 2023 00:04:26 UTC