- From: Bob Wyman <bob@wyman.us>
- Date: Wed, 5 Apr 2023 23:03:19 -0400
- To: Johannes Ernst <johannes.ernst@gmail.com>
- Cc: public-swicg@w3.org
- Message-ID: <CAA1s49UGaToCUoy9czLjRJi2H3BOtQrMMtWWnd7EqqQhVbyXJA@mail.gmail.com>
Johannes,
You wrote:
> [Search] is probably far less a technical problem than one of successful
> communication
I think there are more issues associated with "search" than you suggest.
Below are just a few issues, other than those related to "communications,"
that should be considered:
- Rights and Obligations:
- Assuming that the law establishes that "All Rights Are Reserved" to
the content creator, what rights must be granted, by a creator, to permit
search?
- What rights are not reserved to the creator? (Note: These may vary
by jurisdiction.)
- May individuals maintain searchable databases of content
received for their own personal use?
- What, if any, rights are granted by law and do not need to be
granted by creators? (Fair use, etc.?)
- What mechanism or syntax will be used by creators to
express grants of rights to others? (Rights Expression Language?)
- Can content creators constrain the audience who may discover their
content via search? (i.e. To just members of groups, etc.?) If so, how?
- What obligations or limitations do search providers have?
- If content is signed, must the result of a search be verifiable?
- If a license requires attribution, how is that requirement
satisfied? Also, what about licenses embedded in indexed posts?
- May content be summarized? If so, to what degree?
- If a post includes images or media, must they be retained in the
search result?
- If creators limit the "right to store or archive" how does that
affect search providers?
- May content from multiple posts be combined to produce
derivative works? (e.g. large language model (LLM) systems?)
- Kinds of search:
- Retrospective search: Searching for things that have been published
in the past (i.e. traditional "search")
- Prospective search: Requesting notification whenever an object
matching some query is published in the future.
- Should results of prospective searches be delivered in the same
manner as posts addressed to a user, or should they be displayed via some
other mechanism?
- Cross-matching: Enforcement of creator-specified audience
constraints on delivery of search results (i.e. While search results must
match the searcher's constraints, the searcher's attributes must match the
creator's audience constraints. See question about audience-constraints
above.)
- Search API?
- Should the specs be extended to provide a standard search
interface, for both retrospective and prospective search?
- Should the standard API provide "universal search?" (i.e. both
retrospective and prospective search in a single interface)
- If a standard API is provided, where should it be defined?
- Search addendum to ActivityStreams Collections?
- Extension to the ActivityPub Client2Server interface?
- Extension to the ActivityPub Server2Server interface?
- Query syntax?
- Traditional text search engine syntax? (Google-like and easy to
use)
- SQL-like filters (i.e. as in WHERE clauses)
- JsonPath? (with XPath for searching within HTML/XML content?)
- SPARQL? (Semantic Web, very powerful, but very hard to use.)
- How should result rate limits be expressed and enforced? (e.g. no
more than XXX results/hour...)
- Search implementation:
- Are there useful systems for effectively and efficiently
implementing distributed
<https://en.wikipedia.org/wiki/Distributed_search_engine> or federated
search <https://en.wikipedia.org/wiki/Federated_search>? (If so,
should normal instances be encouraged to participate in such distributed or
federated systems?) Will the European Common DataSpaces
<https://dataspaces.info/#concepts> project provide anything of use
here?
- Can/Should IPFS (InterPlanetary File System
<https://en.wikipedia.org/wiki/InterPlanetary_File_System>) be
leveraged?
- Given that search systems will often have broad audiences, and can
be much more resource intensive than Social Web instances, is there a need
to find ways to monetize these systems? If so, what means are acceptable?
- Alternatives to crawling. (How do we prevent search crawlers from
overloading instances?)
- FeedMesh for ActivityPub? For blogging, we built a system that
allowed major blog search providers (Blogdigger, Blo.gs, Google, PubSub,
VeriSign and Yahoo) to share what their crawlers found. This reduced load
on individual blogs and also ensured that all search providers
distinguished their services based on their quality of service, not just
the number of blogs they crawled. This may have later led to PubSubHubbub
and then to WebSub <https://www.w3.org/TR/websub/>...
- WebSub for ActivityPub? We could define an Activity* variant of
WebSub to which instances would forward copies of public, searchable posts
for distribution to others, including providers of either retrospective or
prospective search. This would eliminate the need for search crawlers to
impose load on instances.
- Can/Should we build standard Web Components
<https://www.webcomponents.org/> for the entering of search queries and
display of search results in order to make it easier for people to adopt
this capability?
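
To make the "prospective search" and "cross-matching" items above a bit more
concrete, here is a rough sketch in Python. It is only an illustration under
assumptions of my own: the names Post, Subscription, audience_groups and
on_publish are hypothetical and come from no ActivityPub or ActivityStreams
spec, queries are reduced to "all terms must appear," and audience constraints
are reduced to group membership. A post is delivered to a standing query only
if it matches the searcher's terms AND the searcher satisfies the creator's
audience constraint.

```python
from dataclasses import dataclass


@dataclass
class Post:
    author: str
    text: str
    # Hypothetical audience constraint: an empty set means "anyone may
    # discover this post via search".
    audience_groups: frozenset = frozenset()


@dataclass
class Subscription:
    """A standing (prospective) query registered by a searcher."""
    searcher: str
    terms: frozenset  # lowercase terms; all must appear in the post text
    searcher_groups: frozenset = frozenset()


def query_matches(sub: Subscription, post: Post) -> bool:
    # The post must satisfy the searcher's constraints.
    words = set(post.text.lower().split())
    return sub.terms <= words


def audience_allows(sub: Subscription, post: Post) -> bool:
    # The searcher must satisfy the creator's constraints.
    if not post.audience_groups:
        return True
    return bool(post.audience_groups & sub.searcher_groups)


def on_publish(post: Post, subscriptions: list) -> list:
    """Return the searchers to notify when a new post is published."""
    return [
        s.searcher
        for s in subscriptions
        if query_matches(s, post) and audience_allows(s, post)
    ]


subs = [
    Subscription("bob", frozenset({"activitypub", "search"}), frozenset({"swicg"})),
    Subscription("carol", frozenset({"activitypub"})),
    Subscription("dave", frozenset({"websub"}), frozenset({"swicg"})),
]

restricted = Post("alice", "New ActivityPub search proposal", frozenset({"swicg"}))
public = Post("alice", "ActivityPub search")

print(on_publish(restricted, subs))
print(on_publish(public, subs))
```

In this sketch carol's query matches the restricted post, but she is not
notified because she is in none of the groups the creator named. A real
system would need a far richer query language and rights model than this.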
This is just a quick summary of issues off the top of my head. I'm sure
that others in the group can add additional issues that should be
considered.
bob wyman
Received on Thursday, 6 April 2023 03:03:38 UTC