Wikipedia:Wikipedia Signpost/2022-11-28/Opinion

Opinion

Privacy on Wikipedia in the cyberpunk future

Ladsgroup has actively edited Wikipedia since 2006. On the Persian Wikipedia, he's a bureaucrat, oversighter and check user. He currently works as a Staff Database Architect for the WMF. As a volunteer he helps build tools for CheckUsers. This article was written in his role as a volunteer and any opinions expressed do not necessarily reflect the opinions of The Signpost, the WMF, or of other Wikipedians.

When you edit Wikipedia, your edits are public. We all know that. But do you know what that actually entails?


Can others tell if I have multiple accounts (such as sockpuppets)?

Some trusted users, called CheckUsers, are able to see your IP address and user agent, meaning they can infer where you live, and perhaps where you study or work. They don't disclose such information, and its use is subject to a strict policy. However, that's not the only way you can be identified.

The way you use language is unique to you; it's like a fingerprint, and there is a substantial body of research on this. With simple natural language processing tools, you can extract discussions from Wikipedia and link accounts that have similar linguistic fingerprints.

[Graph: "socks" vs. "not socks" comparison.] Here's an example of two socks on the Persian Wikipedia and two users that are not, analysed using a simple NLP system.

What does this mean? It means people will be able to find, guess or confirm their suspicions on other accounts you have. They will be able to link between multiple accounts without needing access to private data that could reveal where you live or work.
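As an illustration, the core idea behind such linguistic fingerprinting can be sketched in a few lines of Python. This is a toy example, not the tool discussed in this article: the sample comments are invented, and the feature choice (character trigram frequencies compared by cosine similarity) is a simplifying assumption, whereas real stylometric systems use far richer features.

```python
from collections import Counter
from math import sqrt

def ngram_profile(text, n=3):
    """Build a character n-gram frequency profile for a piece of text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity of two profiles: 0 = unrelated, 1 = identical style."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented talk-page comments: A and B mimic the same writing habits, C does not
comment_a = "I strongly believe this article needs more reliable sources, to be honest."
comment_b = "To be honest, I strongly believe the section needs more reliable sources."
comment_c = "Citation formatting should follow the established style guideline here."

sim_ab = cosine_similarity(ngram_profile(comment_a), ngram_profile(comment_b))
sim_ac = cosine_similarity(ngram_profile(comment_a), ngram_profile(comment_c))
```

With enough comments per account, consistently high pairwise similarity between two accounts is exactly the kind of signal that lets an observer guess they share an author.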

Who can analyze my edits?

Wikimedia projects are public: the license means that all information hosted on them can be reused for any purposes whatsoever, and the privacy policy allows for analysis of edits or other information publicly shared for any reason.

That means anyone with the resources or knowledge can analyze trends in your edit history: when you edit, what words you use, which articles you have edited. As technology has advanced, tools for analyzing user data have advanced too, ranging from basic edit counters to complex anti-abuse machine learning systems such as ORES and some anti-vandal bots. Academics have begun using this public data to develop machine learning and artificial intelligence models for combating abuse on Wikipedia, and volunteer developers have created systems that use natural language processing to help identify malicious actors.
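Even the timestamps attached to every public edit reveal patterns. The sketch below uses made-up timestamps and a deliberately simple analysis (it is not any particular tool): tallying edits by hour of day can hint at an editor's time zone or daily routine.

```python
from collections import Counter
from datetime import datetime

# Invented public edit timestamps for one account, as exposed in dumps or the API
timestamps = [
    "2022-11-01T18:42:00Z", "2022-11-02T19:05:00Z", "2022-11-03T18:55:00Z",
    "2022-11-05T02:10:00Z", "2022-11-06T18:30:00Z",
]

# Count edits per UTC hour; the dominant hours hint at the editor's waking hours
hours = Counter(
    datetime.fromisoformat(t.replace("Z", "+00:00")).hour for t in timestamps
)
most_active_hour = hours.most_common(1)[0][0]
```

Cross-referencing such activity profiles with other public traces (for example, posting times on social media) is one of the linking attacks the next paragraph warns about.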

As with anything, these technologies can be abused. That's one of the risks of an open project: an oppressive government or a big company can invest in it and download Wikimedia dumps. They can even go further and cross-check it with social media posts. While not likely in most cases, in areas of the world where free speech is limited, you should be conscious of what information you share on Wikipedia and other Wikimedia projects.

Besides external entities, volunteers have been building such tools to help CheckUsers do their job better, with the potential to limit access to private data. The tool whose graphs are shown here is already in use on several wikis, but the developer makes it available only to the CheckUsers of each wiki. The tool doesn't just give a number; it builds plots and graphs to make decision-making easier.

Can we ban using AI tools?

Legally, there's nothing we can do to stop external entities from using this data – it's enshrined in our license and privacy policy[1] that it's free to use for whatever purpose people see fit.

Because of this, restrictions on the use of natural language processing or other automated or AI abuse detection systems that do not directly edit Wikimedia projects are not possible. Communities could amend their local policies to prohibit blocks based on such technologies or to prohibit consideration of such analysis when deciding whether or not there is cause to use the CheckUser tool. Local projects cannot, however, prevent use of natural language processing or other tools completely because of the nature of the license and the current privacy policy.

Notes

  1. ^ From the Privacy policy: "You should be aware that specific data made public by you or aggregated data that is made public by us can be used by anyone for analysis and to infer further information, such as which country a user is from, political affiliation and gender."