What's the best way to search the fediverse?

Yingwu@lemmy.dbzer0.com · 9 months ago

What's the best way to search the fediverse?

Lovable Sidekick@lemmy.world · 9 months ago

What if search itself were a federated function? Although I’m a software dev I really don’t know much about the mechanics of large-scale search engines such as Google, but I know their server farms somehow share the load of performing searches and maintaining whatever database they maintain to optimize searching. Seems like the fediverse could do search in a similar way. I’m just saying your critique of the idea, although well thought out, seems like a critique of a particular strategy. It’s not obvious to me that the very idea of federated search is outlandish.

Jupiter Rowland@sh.itjust.works · 9 months ago

Still, the issue would be to find all instances of all Fediverse server applications.

I mean, the idea was to cover the whole Fediverse with that search. Literally everything.

Like, imagine I spin up my own instance of Forte on a home server to try it out and see if it already works.

How’s a Fediverse search engine supposed to know about my brand-new Forte instance? Clairvoyance? Hah. A crawler? Yeah, right, as if any crawler out there was fast enough to discover a brand-new instance of something that doesn’t have a running instance at all yet. At least not beyond enclosed, experimental instances detached from the rest of the Fediverse.

I mean, instead of Forte, I could also install what Forte was forked from, namely something colloquially referred to as (streams). Something that intentionally doesn’t have a name, doesn’t have a brand identity, doesn’t have a unified server identifier. Unlike Mastodon whose instances all identify as “mastodon” and Lemmy whose instances all identify as “lemmy” and Hubzilla whose instances all identify as “hubzilla”, (streams) instances don’t all identify the same. That field is customisable. And it has been customised for as long as (streams) has been around. You can’t reliably crawl (streams) instances. Instead of “streams”, they can identify as “y” (because Y is not X) or “get ready to rumbly” (public instance actually) or “bunny of doom” or “diversi spiritus”.

In fact, crawlers would have to be able to identify any kind of Fediverse server software. Even if someone has only just forked something, a crawler would be able to recognise it as Fediverse server software. If you hard-code server identifiers into the crawler, it’d be out-of-date as soon as someone decides to fork Mastodon or Misskey or Firefish or Sharkey or whatever again. And, as mentioned above, you can’t crawl (streams) instances by identifier.

It simply is impossible to discover and index the whole Fediverse by crawling, Google-style. And if a Fediverse search engine can’t discover a (streams) instance that identifies as “y”, it can’t index the content coming from the man who created (streams) and Forte and still occasionally develops both. The man who created the oldest still existing Fediverse project, Friendica, as well as the Swiss Army knife of the Fediverse, Hubzilla, and the very concept of nomadic identity. One of the most competent and experienced Fediverse devs ever. A crawler couldn’t find him.

Still, the search engine needs to know all Fediverse instances, right?

Well, if crawling fails, and crawling does fail, there’s only one way to achieve that: Each instance would have to announce its presence to anything that’s supposed to be able to search the Fediverse.

But in order to be able that, each instance would have to know everything that can search the Fediverse. And all instances of it. Every single one of them.

And if it shall announce its existence when it spins up for the first time, it will have to know all these search instances immediately before spinning up.

How can it possibly know them all before even going online itself?

Two options. Either a centralised list of all search instance that’s being updated as soon as a new one is spun up.

But you said, “federated.” As in not centralised.

Or the list would have to be built into the source code as it’s being git pulled from the code repository. In fact, the list would have to be git pulled from the code repository immediately before the server spins up so that it’s up-to-date when the server spins up. This would mean that the whole server software would have to be updated before start-up.

Of course, each Fediverse server software project that’s started from scratch would have to implement this list, otherwise its instances couldn’t be found.

But how is this list supposed to be kept up-to-date?

I mean, let’s suppose what has been spun up here is something that has Fediverse search built in. It itself would have to be added to this list so that other new instances can announce themselves to this new instance, so that it can find them and index their content.

So how is this new search-equipped instance supposed to be added to the list of search instances?

Shall it add itself to the list by manipulating the production code of all Fediverse server applications that have Fediverse search built in? Past the maintainers and without their consent?

Perfect search that covers 100% of the Fediverse has to rely on lists of some kind, that’s clear. The Fediverse changes too quickly to be crawlable. It’s too diverse to be crawlable. And it has server software which itself is inherently uncrawlable because it’s undiscoverable by design.

But such lists are impossible to always be kept up-to-date, too.

Lovable Sidekick@lemmy.world · 9 months ago

Thanks for putting so much time and thought into the discussion. All the problems you talk about exist for every search engine in actual use today. For example, publishing a site on a brand new domain has the exact problem you’re describing with spinning up a new Forte instance. There can be a 24-hr lag before DNS can reliably find the site. Perfect search is an aspirational goal. The realistic goal is to satisfy most needs. No matter how many words you throw at it, I don’t think federated search is an outlandish idea at all.

Jupiter Rowland@sh.itjust.works · 9 months ago

I’m not even only talking about a 24-hour lag. I’m talking about parts of the Fediverse never being discovered at all. After all, the Fediverse doesn’t have a centralised DNS of its own in which all instances are registered but only them, where a search crawler could simply look them up.

Even if someone developed a Web search crawler much like the Google Bot, something that crawls the entire WWW looking for Fediverse instances, how is it supposed to tell Fediverse instances from websites that aren’t Fediverse instances?

I bet the first two proposals for solutions wouldn’t work with (streams).

The first proposal would probably be to go for the instance type, like “mastodon” or “lemmy” or “mbin” or “akkoma” or “misskey” or whatever. This, however, would require valid instance types to be manually added to a kind of config file from which the search crawler could look valid instance types up. This, in turn, would only work if this list was constantly kept complete and up-to-date.

This means: Whenever someone launches a new project, the identifier of this project will have to be added to the list. Whenever someone forks something into a new project, ditto. Now let the devs of the crawler have as little time as the Plume devs or as the sole Firefish dev early this year, and the list of Fediverse instance types will spend months outdated with new projects missing, and the crawler won’t recognise the instances of these new projects as Fediverse projects.

Oh, and it wouldn’t work with (streams) at all. See, (streams) is intentionally without a name, without a brand identity and even without a unified, pre-defined, fixed instance type. It isn’t like all instances identify as “streams” or “(streams)”. Some identify as “streams”, but many others have unique types. The crawler wouldn’t know these identifiers as valid Fediverse instance types (how is that crawler supposed to know that “bunny of doom” is a Fediverse identifier), and thus, it wouldn’t be able to identify (streams) instances as Fediverse instances.

Now you could say that (streams) is so tiny that it wouldn’t hurt to sweep it under the rug. Nobody would notice.

But that’d exactly be the problem. One of the (streams) users is the guy who created (streams) and everything before it all the way back to Mistpark in 2010, the one man who developed more Fediverse protocols and server applications than anyone, the man who invented nomadic identity and magic single sign-on: Mike Macgirvin. He is on one out of only two instances that identify as “y” (because Y is not X).

He is one of the few people in the Fediverse who actually post about what’s possible in the Fediverse that goes way beyond Mastodon. Not only possible, but readily available right now. He started advertising (streams) in the wake of the mass-migration of Twitter users to Mastodon. And if his most recent creation, Forte, manages to take off, he’ll probably advertise that. If (streams) wasn’t caught by crawlers, nobody would read his advertisement except those who already follow him, and I guess half of them already know his creations and what they can do.

Hard-coding the custom identifiers of (streams) instances into the list is a stupid idea, too. The instance type is not defined upon installation in a config file. It’s an admin-side free-text field that can be changed anytime with no consequences for connections, just because the admin feels like it.

Okay, so here’s the second proposal: Go for nodeinfo. The problem this time: Mike has also intentionally removed almost all nodeinfo code from (streams). He didn’t want (streams) to participate in that eternal rat race between Fediverse projects and Fediverse instances for the best stats on FediDB, Fediverse Observer and The Federation. In fact, (streams) is entirely absent from all three. This, too, is intentional.

If anyone has a better idea, I’m all ears.