Re: How should I query ActivityStreams objects containing both JSON and HTML?

Some have asked "Why are you doing this?"... Well, I'm working on an
"ActivityStreams" variant of WebSub that provides content-based, rather
than just topic-based, subscriptions. (WebSub only supports subscribing to
named feeds -- that's topic based pubsub.)

In the system I'm building, content-based prospective subscriptions would
be composed of arbitrarily complex Boolean queries. Thus, you might
subscribe to:

> $..object.type == 'Note' AND $..object.author == 'alice@example.com' AND
> $..object.content != 'foobar'


Given this query, whenever Alice publishes a Note that does not contain the
word 'foobar,' a notification would be generated. (Actually, an
"ActivityPub" instance could use this internally as a way to express a
user's filtered subscriptions.) Also, note that I'm thinking of using "="
and "!=" for "contains" or fuzzy matching, and "==" and "!==" for exact or
strict matching.
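
To make the intended semantics concrete, here is a minimal Python sketch of
how the example subscription might be evaluated against a single object. It
is only an illustration: the helper names (descend, field_values, matches)
are hypothetical, "$..object.x" is read as "field x of any value stored
under an 'object' key at any depth", "!=" is treated as "no value contains
the text", and a missing content field is assumed to satisfy "!=".

```python
from typing import Any, Iterator


def descend(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` at any depth (like `$..key`)."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from descend(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from descend(item, key)


def field_values(activity: dict, field: str) -> list:
    """Collect the values of `$..object.<field>` (hypothetical helper)."""
    return [obj[field] for obj in descend(activity, "object")
            if isinstance(obj, dict) and field in obj]


def matches(activity: dict) -> bool:
    """Evaluate the example subscription from the text above."""
    types = field_values(activity, "type")
    authors = field_values(activity, "author")
    contents = field_values(activity, "content")
    return ("Note" in types                          # == exact match
            and "alice@example.com" in authors       # == exact match
            and not any(isinstance(c, str) and "foobar" in c
                        for c in contents))          # != does not contain


activity = {
    "type": "Create",
    "object": {
        "type": "Note",
        "author": "alice@example.com",
        "content": "hello world",
    },
}
print(matches(activity))
```

This evaluates one query against one object; it says nothing about how to
do so efficiently for many queries at once.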

Of course, the real challenge of such a system is implementing the matching
efficiently. That's something I've done for similar systems in the past.
(Some may remember PubSub.com, or the Prospective Search Infrastructure
(PSI) that I built at Google.) Ideally, a server should be able to support
thousands or millions of these Prospective Queries being processed in real
time as new posts are ingested.
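
One general way to avoid evaluating every stored query against every new
post is to index the queries themselves by one exact-match term each, so
that only candidate subscriptions are fully evaluated. The Python sketch
below illustrates that idea only; it is not a description of how PubSub.com
or PSI worked, and the class and parameter names (SubscriptionIndex, add,
match, residual) are hypothetical.

```python
from collections import defaultdict


class SubscriptionIndex:
    """Index subscriptions by one required (field, exact value) term."""

    def __init__(self):
        # (field, value) -> list of (subscription id, residual predicate)
        self._by_term = defaultdict(list)

    def add(self, sub_id, field, value, residual):
        """Register a subscription under its indexed exact-match term.

        `residual` checks the remaining clauses of the Boolean query.
        """
        self._by_term[(field, value)].append((sub_id, residual))

    def match(self, obj):
        """Return ids of subscriptions that match a newly ingested object."""
        hits = []
        for field, value in obj.items():
            if not isinstance(value, str):
                continue  # only string-valued fields are indexed here
            for sub_id, residual in self._by_term.get((field, value), ()):
                if residual(obj):
                    hits.append(sub_id)
        return hits


index = SubscriptionIndex()
index.add("sub-1", "author", "alice@example.com",
          lambda o: o.get("type") == "Note"
          and "foobar" not in o.get("content", ""))
index.add("sub-2", "author", "bob@example.com", lambda o: True)

note = {"type": "Note", "author": "alice@example.com", "content": "hello"}
print(index.match(note))
```

With this layout, a post by Alice never touches subscriptions indexed under
other authors, however many of them there are.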

In most cases, the right thing to do will be to simply convert any HTML
elements to text and then allow searching within that text using =, ==, !=,
or !==. However, some applications will publish HTML that is structured in
some useful way (e.g., as a table or according to some standard format).
For instance, a weather report in HTML might include class or id attributes
that allow parsing out individual bits of data like temperature or
humidity. In those cases, it would be useful to be able to query parts of
the included HTML. Thus the need for an XPath-like function.
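
Both cases can be sketched with the Python standard library. The example
below is a rough illustration under stated assumptions: the function names
(html_to_text, select_html) and the sample weather markup are made up, and
select_html assumes the embedded HTML is well-formed XML, since
xml.etree.ElementTree supports only a limited XPath subset and is not an
HTML parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Convert HTML to plain text so =, ==, != or !== can be applied."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


def select_html(html, path):
    """Return the text of elements selected by an XPath-like `path`.

    Assumes `html` is well-formed XML (an assumption of this sketch).
    """
    root = ET.fromstring(html)
    return [(el.text or "").strip() for el in root.findall(path)]


# Hypothetical object whose content is structured HTML wrapped in JSON.
obj = {
    "type": "Note",
    "content": '<table class="weather"><tr>'
               '<td class="temperature">21</td>'
               '<td class="humidity">40</td>'
               '</tr></table>',
}

print(html_to_text(obj["content"]))
print(select_html(obj["content"], ".//td[@class='temperature']"))
```

A query language would still need some syntax for handing the second
argument of select_html to the embedded HTML, which is the question raised
in the original post quoted below.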

bob wyman

On Fri, Mar 31, 2023 at 6:05 PM Bob Wyman <bob@wyman.us> wrote:

> In a Mastodon.social post
> <https://mastodon.social/@bobwyman/110120087223817037>, I asked:
>
> XPath and JSONPath are similar, but different. (See JSONPath spec
> <https://goessner.net/articles/JsonPath/>) This presents a problem for me
> since I'm building a system to query ActivityStreams objects that can
> include HTML wrapped in JSON.
>
> Should I:
>
>    - Use XPath syntax for both JSON and HTML?
>    - Use JSONPath syntax for both JSON and HTML? (If so, is there a
>    reasonable extension to JSONPath to support selecting on HTML attributes?)
>    - Switch between JSONPath and XPath depending on the underlying
>    datatype? (e.g. Embedding XPath in JSONPath.)
>
> If you were writing a query, would you accept needing to know both
> syntaxes?
>
> I would appreciate any advice you might be able to provide. Also, I would
> be interested to hear if anyone else has already been faced with and
> addressed this issue.
>
> bob wyman
>

Received on Saturday, 1 April 2023 00:02:35 UTC