I am a bit worried about the number of people here who, like me, are fairly recent arrivals, and who are using the so-called 'extreme' and 'unreasonable' reactions to them wanting to 'hack' the fediverse to write off the whole conversation around consent as somehow not relevant.
The NSA monitors anything you do on the internet anyway, so why are you complaining about tech bros wanting to harvest out in the open, yada yada.
It seems we need to define what consent actually is.
AGAIN.
And here’s the point at which we go off the rails (towards the end of
the thread; the earlier section is quite well expressed):
Most people in tech do not want to hear this, because it invalidates
the vast majority of their business models, AI/ML training data,
business intel operations, and so forth. Anything that’s based on
gathering data that is ‘public’ suddenly becomes suspect, if the
above is applied.
And yes, that includes internet darlings like the Internet Archive,
which also operates on a non-consensual, opt-out model.
It’s the Western Acquisition, claiming ownership without permission.
It’s so ingrained in white, Western internet culture that there are
now whole generations who consider anything that can be read by the
crawler they wrote in a weekend to be fair game, regardless of what
the user’s original intent was.
Republishing, reformatting, archiving, aggregating, all without the
user being fully aware, because if they were, they would object.
It’s dishonest as fuck, and no different from colonial attitudes
towards natural resources.
“It’s there, so we can take it.”
We then have some reasonable responses from others in the thread:
Re: Internet Archive, I think many of us don’t believe/accept that
businesses, organizations, genuine public figure politicians,
etc. have a right to control how their publications of public
relevance are archived & shared. The problem is that IA isn’t able
to mechanically distinguish between those cases and teenagers’
personal diary-like blogs (chosen as example at opposite end of
spectrum).
This is the difference between the internet archive and an ML model:
the archive does not claim ownership.
Finally, a thought of mine own:
Sindarina seems to fundamentally miss
the central idea of the world wide web, that is, publicly sharing
information. This does not mean the work may be used for any purpose
whatsoever, as the content of many websites is either copyrighted or
CC-BY-SA. But publishing anything on the www or in print opens it
by necessity to aggregation and archival. I routinely save webpages to disk.
To run with the cafe analogy that has been brought up, one cannot post
a note to the cafe’s bulletin board and at the same time expect that
no one else may take a photo of it, then perhaps share it with some
acquaintances.
This is a far cry from the data harvesting done by Google, Microsoft,
Apple & co., or the dubiously collected data used to train “automated
plagiarism engine[s],” as Arthur Besse put it not too long ago.
It’s fair that maybe the architecture of public inbox/outbox protocols isn’t suited for this kind of use (juxtapose with Matrix).
However, consider this: some people on the fediverse simply don’t want to be indexed. Indexing should be opt-in instead of opt-out, for people who explicitly want it. People aren’t against search; they’re against non-consensual search.
I think it’s important for the culture of the fediverse that such civility is encouraged. Because on the fediverse, the community can actually make a difference. By blocking federation with offenders, we can guide the culture of fedi. And it’s better for it.
Running with the idea that people can “technically” do what they want because of the nature of the protocols is counter-productive, because we actually can do something.
All a search implementer has to do is adapt to that culture, and they’ll be fine. So I don’t see why there’s such push-back against this viewpoint.
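The gap between opt-in and opt-out can be made concrete with a small sketch. It assumes each profile carries an optional boolean flag, here called `indexable`; the field name and the profile dictionaries are illustrative assumptions, not any particular server's API. The only case where the two policies disagree is the user who never set the flag at all.

```python
from typing import Any, Mapping


def may_index_opt_in(profile: Mapping[str, Any]) -> bool:
    """Opt-in policy: index only if the user explicitly set the flag to True."""
    return profile.get("indexable") is True


def may_index_opt_out(profile: Mapping[str, Any]) -> bool:
    """Opt-out policy: index unless the user explicitly set the flag to False."""
    return profile.get("indexable") is not False


# Hypothetical profiles, for illustration only.
said_yes = {"name": "alice", "indexable": True}
said_no = {"name": "bob", "indexable": False}
said_nothing = {"name": "carol"}  # never touched the setting
```

Under the opt-in policy, `said_nothing` is left out of the index; under the opt-out policy the same user is indexed by default.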
I would fully agree that other internet protocols are much better suited to information not meant to be broadcast publicly.
Civility is great, and should be highly encouraged. That’s largely why I like Lemmy. Each instance can guide its community in line with its values, whatever those may be, block offenders, and generally forge the space it wishes.
However, I think Besse’s comments on setting the correct expectations in the public sphere are worth considering.
For a different internet example: all the messages I send in any chatroom on an IRC server will inevitably be logged by someone, especially in popular rooms. Any assumption to the contrary would be naïve, and demanding that people not keep a log of any of my publicly broadcast messages would be laughed at by the operators. It’s a public space, and sending anything to that space necessarily means I forgo my ability to control who sees, aggregates, archives, or shares that information. My choice to put the information into that space is the opt-in mechanism, just as books or interviews do the same offline in print.
It’s not so much the protocol as it is how making things public fundamentally works.
With respect to your thoughts: just because the (corporate) internet works this way now, doesn’t mean it should. I don’t want people scraping my posts. I find it creepy. The fediverse (some parts of it, at least) was, for many people and for a long time, a place they could go to connect with people without needing to argue about the legal definition of consent. The fact that people can technically get away with scraping my posts isn’t permission to do so. And, obviously, just turning off your computer isn’t an option, because, at least in the global north-west, you need to have an online presence to be involved in society.
Nobody is claiming that the web is a place for healthy relationships with corporations. It isn’t. The web is a place corporations constructed to make more money. This is about working together to build something better.
I’m happy that you’re comfortable with this model, but I don’t want people who operate like this to intrude on the spaces we’re building to get away from it. It’s just like, a courtesy thing. Will there need to be protocol changes to technologically force people not to do this? Probably. Should there have to be? I really wish I could say there didn’t need to be.
just because the (corporate) internet works this way now, doesn’t mean it should
The web worked this way before there was a large corporate presence. Scraping was common during the blogosphere period and robots.txt was the solution everyone at the time agreed on and that’s been the standard ever since.
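As a rough illustration of how a well-behaved crawler honors robots.txt, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `ExampleSearchBot`, and the `example.social` URLs are all made up for the example. Note that robots.txt is itself an opt-out mechanism: it only restrains crawlers that choose to check it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example instance; the paths and the
# bot name are illustrative, not taken from any real server.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Disallow: /

User-agent: *
Disallow: /users/
Allow: /about
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A polite crawler would call something like `may_fetch("ExampleSearchBot", "https://example.social/users/alice")` before each request and skip the page when the answer is `False`.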
I’m happy that you’re comfortable with this model, but I don’t want people who operate like this to intrude on the spaces we’re building to get away from it
We’re not intruding on this space. We’ve been in the fediverse for just as long or longer; the fediverse has been scrapable since 2008.
Totally. And while it was scrapable, and scraped a lot, I wish there had been a lot more systematic public scraping of the “federated social web” (as it was called before the terrible name “fediverse” was adopted) back then - I had a lot of public conversations on identi.ca and StatusNet which I wish I could still see, but they now exist only in a bunch of private databases I don’t have access to. 😢
I think blurring the lines between public and private spaces is the
opposite of informing consent. Cultivating unrealistic expectations of
“privacy” and control in what are ultimately public spaces is actually
bad.
I tried to single out the world wide web, as opposed to the internet
at large, because the two are not synonymous. It’s rather absurd to
publicly serve webpages to any querying IP address and maintain that
the receiving computer is not to save said pages to disk.
All this to say: I find it difficult to argue that web publications
should or could be exempt from aggregation and archival (or scraping,
to put it another way). I understand that the ease with which bots do
this can be disconcerting, however.
If we stay with the cafe bulletin board, getting a detailed overview of all the
postings on the board is akin to scraping the whole thing. If we extend
our analogy instead to a somewhat more significant example, library
catalogs do the same with books, magazines, and movies.
This is the cost of publishing, be that in print or online. It must be
expected that some person has a copy of every- and anything one has
ever written or posted publicly, and perhaps even catalogued it. A way around
this might be to move away from the web to another part of the internet,
like Matrix, as alma suggested.
I assume the non-consensual collection of various (meta-)data is what
you refer to when talking about intrusion and money making.
Lemmy, like many projects, seeks to offer an alternative to corporate,
data-gobbling social media sites, but doesn’t eliminate the ability
to search through its webpages.
We then have some reasonable responses from others in the thread:
Rich Felker @[email protected]
Arne Babenhauserheide @[email protected]
*snip*
I think Besse makes a great point here: