Wikipedia talk:Database download/Archive 3


dump of Articles for Creation requested

There is a plan to speedy-delete around 90,000 drafts from Articles for Creation: Wikipedia_talk:Criteria_for_speedy_deletion#G13:_Abandoned_Articles_for_creation_submissions. Since these pages are in the Wikipedia_talk space, I don't think they are normally available as a dump. I'm requesting that they be made available for mirroring and preservation purposes before their deletion from Wikipedia. My specific request is for a dump of Wikipedia_talk:WikiProject_Articles_for_creation and all its subpages. rybec 04:06, 7 April 2013 (UTC)

I'm pretty sure you can do this yourself using Special:Export. All the entries you are interested in should be in Category:Pending AfC submissions, meaning you can even have the tool generate the list of articles to export automatically. Failing this, AutoWikiBrowser can be used to build a list of relevant articles which can be cut and pasted into Special:Export - TB (talk) 09:11, 5 June 2013 (UTC)
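
For anyone scripting this, here is a minimal sketch assuming the standard MediaWiki API and the Special:Export endpoint (the category is the one mentioned above; limits and field names may need adjusting):

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    EXPORT = "https://en.wikipedia.org/wiki/Special:Export"

    # 1. List the members of the category via the API (handles paging).
    titles, params = [], {
        "action": "query", "list": "categorymembers",
        "cmtitle": "Category:Pending AfC submissions",
        "cmlimit": "500", "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        titles += [m["title"] for m in data["query"]["categorymembers"]]
        if "continue" not in data:
            break
        params.update(data["continue"])

    # 2. POST the title list to Special:Export to get an XML dump of those pages.
    xml = requests.post(EXPORT, data={
        "pages": "\n".join(titles),  # one title per line
        "curonly": "1",              # current revisions only; omit for full history
    }).text
    with open("afc_export.xml", "w", encoding="utf-8") as f:
        f.write(xml)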

Downloading Wikipedia, step-by-step instructions

Can someone replace this paragraph with step-by-step instructions on how to download Wikipedia into a usable format for offline use?

  • What software programs do you need? (I've read that MySQL is no longer supported; I checked Apache's website but have no idea which of their many programs is needed to make a usable offline Wikipedia.)
  • What are the web addresses of those programs?
  • What version of those programs do you need?
  • Is a dump what you need to download?
  • Can someone write a brief layman's translation of the different files at http://download.wikimedia.org/enwiki/latest/ (what exactly each file contains, or what you need to download if you are trying to do a certain thing, e.g. you want the latest version of the articles in text only)? You can get a general idea from the title, but it's still not very clear.
  • What program do you need to download the files on that page?
  • If you get Wikipedia downloaded and set up, how do you update it, or replace a single article with a newer version?— Preceding unsigned comment added by 68.217.171.51 (talk) 19:03, 1 September 2006 (UTC)

The instructions on m:Help:Running MediaWiki on Debian GNU/Linux tell you how to install a fresh MediaWiki installation. From there, you will still need to get the XML dump and run mwdumper. — Preceding unsigned comment added by Wizzy (talkcontribs) 15:32, 20 September 2006 (UTC)
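
As a rough sketch of that import step (assuming mwdumper.jar, a MySQL command-line client, and an empty MediaWiki database called "wikidb" already exist; exact mwdumper flags may differ between versions):

    import subprocess

    # Convert the XML dump to SQL with mwdumper and pipe it straight into MySQL.
    mwdumper = subprocess.Popen(
        ["java", "-jar", "mwdumper.jar", "--format=sql:1.5",
         "enwiki-latest-pages-articles.xml.bz2"],
        stdout=subprocess.PIPE,
    )
    mysql = subprocess.Popen(
        ["mysql", "-u", "wikiuser", "-p", "wikidb"],
        stdin=mwdumper.stdout,
    )
    mwdumper.stdout.close()  # so mwdumper sees a broken pipe if mysql exits early
    mysql.wait()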

Romanian version?

Hi guys. Is it possible to do the same with the Romanian version? Thanks — Preceding unsigned comment added by 78.31.56.113 (talk) 02:39, 16 June 2013 (UTC)

Please see http://dumps.wikimedia.org/rowiki/, jonkerz ♠talk 02:55, 16 June 2013 (UTC)

Is there a mobile version for download?

Or is it possible to convert the whole thing for mobile viewing? Oktoberstorm (talk) 11:50, 17 September 2013 (UTC)

Interwiki links

Could someone please help me with a question? If I download the English Wikipedia, will the links to other languages be on the pages? I understand they won't be active offline, but will they be there as read-only? Lampman (talk) 00:27, 14 January 2014 (UTC)

New version of bzip2 executable?

Wikipedia:Database download#Dealing with compressed files contains a link to ftp://sources.redhat.com/pub/bzip2/v102/bzip2-102-x86-win32.exe which is nice. Is there a way to link to an executable version of the current version 1.06? Thanks! GoingBatty (talk) 00:54, 1 February 2014 (UTC)

How to download all user talk pages?

How can I download all versions of all user talk pages? Srijankedia (talk) 00:00, 11 March 2014 (UTC)

How to create a Torrent?

I notice that the March and April dumps are missing from meta:data dump torrents. Is there an idiot's guide to transforming a new dump from dumps.wikimedia.org to the torrents page? -- John of Reading (talk) 07:37, 7 April 2014 (UTC)

(More) Actually the burnbit.com home page makes it idiot-proof. Even I have managed to do it. -- John of Reading (talk) 18:52, 15 June 2014 (UTC)

Torrent download.

John of Reading, why do you keep taking down the link to the torrent from 3 February 2014? — Preceding unsigned comment added by 84.50.57.71 (talk) 11:02, 14 August 2014 (UTC)

Several years of torrent links are listed at meta:Data dump torrents#enwiki with the latest at the top - the most recent was added only yesterday. If a link to "the latest torrent" is displayed at Wikipedia:Database download as well, they are sure to get out of sync. I think it is safer to have no torrent links here, and send people over to the page at meta. -- John of Reading (talk) 15:14, 14 August 2014 (UTC)

Anywhere to get a hard drive with the most recent dump on it?

Is there anywhere to get the most recent dump, or some subset thereof (say, enwiki talk and WP namespaces with histories intact), without downloading it? I live in a rural area with a lousy connection such that even downloading the current version (which wouldn't actually be helpful to me anyway) is not practical. Seems like hard drives with the database preloaded are something someone may have thought to offer. Anyone come across this? In other words, I want to have a hard drive with the data on it shipped to me. Thoughts? Thanks. --98.101.162.68 (talk) 22:30, 23 August 2014 (UTC)

I suggest that those who are editing the BitTorrent-related specifics should be thoroughly familiar with the BitTorrent protocol. BitTorrenting is not the simple job of making 'burnbit' torrents that many here seem to think it is; it's more than that.

Recently my edit (which had been in place for more than a year) was changed to a few words without explaining the benefits!

My edit: Additionally, there are numerous benefits using BitTorrent over a HTTP/FTP download. In an HTTP/FTP download, there is no way to compute the checksum of a file during downloading. Because of that limitation, data corruption is harder to prevent when downloading a large file. With BitTorrent downloads, the checksum of each block/chunk is computed using a SHA-1 hash during the download. If the computed hash doesn't match with the stored hash in the *.torrent file, that particular block/chunk is avoided. As a result, a BitTorrent download is generally more secure against data corruption than a regular HTTP/FTP download.[1]

New edit: ** Use BitTorrent to download the data dump, as torrenting has many benefits.

Even on the new Meta page, I see no reference to the 'benefits' he/she is referring to.

The new edit is made by this IP: 184.66.130.12

Could I know why this edit was made, and why he/she didn't provide the correct reference when editing? Why did he/she decide to delete that whole reference?

I'm amazed what could be the reason!

106.66.159.57 (talk) 09:45, 6 October 2014 (UTC)
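
(As an aside, here is a minimal illustration of the per-piece check described in the quoted edit; a real client reads the piece length and the concatenated piece hashes from the .torrent metainfo, which is omitted here.)

    import hashlib

    def piece_ok(piece: bytes, expected_sha1_hex: str) -> bool:
        """Return True if a downloaded block/chunk matches its stored SHA-1 hash."""
        return hashlib.sha1(piece).hexdigest() == expected_sha1_hex

    # A BitTorrent client discards and re-requests any piece for which this check
    # fails, so corruption is caught during the download rather than afterwards.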

References

  1. ^ "Exploring the Use of BitTorrent as the Basis for a Large Trace Repository" (PDF). University of Massachusetts (USA). Archived from the original (PDF) on 2013-12-20. Retrieved 2013-12-20. {{cite web}}: Cite uses deprecated parameter |authors= (help)

Privacy: Timestamps in database downloads and history

I haven't looked into the database dumps, but even if you only get a snapshot and not the edit history, you can still track users (and their timestamps) by downloading it frequently.

By going into a user's contribution history or a page's "View history" you can track users. Should the timestamps there, and in the dumps, be accurate down to the second?

It seems to me that for talk pages the timestamps in signatures would be preserved, but apart from that, couldn't all available information on users be shown with, say, day-level granularity of timestamps, or not at all? Is the timestamp important to anyone? I feel like I should see all of mine (when logged in), but then I could infer a timeframe for others between my own edits. Still, I would have to edit a lot, and only for those pages and users between my edits.

I'm not sure how timestamps appear with respect to time zones. Because of screen scraping, maybe reduced granularity would be pointless, as it could be evaded?

I see that when I edit pages they are synced to Google very fast - anyone know how? Ongoing database dumps? comp.arch (talk) 12:31, 26 October 2014 (UTC)

We are missing an offline wiki reader program (that preferably can read compressed wiki dump files)

It would be awesome to download Wikipedia and have a program read the data archive, like a browser does online, but for offline viewing; all I would need is a wiki dump file and the program (a basic 'reader' program).

I don't need the 'talk', 'discussion', or 'modification' parts, just as long as internal hyperlinks are resolved to the right page, and if possible the needed page is extracted 'on the fly' rather than having to extract the whole wiki dump file onto my hard drive.

There are a lot of compressed text readers, but a compressed HTML browser would be fine, with the pages stripped of redundant data like headers and other online-only elements that only make sense when you are online.

I basically want a book reader, where the main page is like a search engine, to load the sub pages (of whatever topic I want to read of Wikipedia). — Preceding unsigned comment added by 72.45.33.155 (talk) 08:25, 6 December 2014 (UTC)

Yes, that would be awesome. I added a list of all the offline Wikipedia readers I know about to Wikipedia:Database download#Offline Wikipedia readers. I hope at least one of them works for you, User:72.45.33.155.
Is there a mainspace article that has a section discussing such readers, other than list of Wikipedia mobile applications? --DavidCary (talk) 16:38, 26 March 2015 (UTC)

Is there a way to get diff update periodically?

>> pages-articles.xml.bz2 – Current revisions only, no talk or user pages; this is probably what you want, and is approximately 11 GB compressed (expands to over 49 GB when uncompressed).

If I want to cache the whole of Wikipedia locally, I can download the whole 11 GB. But what if I want to update it periodically, once a day or once a week? Do I have to download the whole 11 GB again, or is there a better way?

Balkierode (talk) 22:33, 14 July 2015 (UTC)


Pagelinks table size

Hello,

I have downloaded the pagelinks table; on disk it's about 32 GB. Does anyone know how big it is once imported into a database? My import is currently greater than 50 GB and still going.

Thanks

kcg2015 — Preceding unsigned comment added by Kcg2015 (talkcontribs) 06:42, 15 October 2015 (UTC)

File system limits

A file is data - operating systems can and do split that data and store it across many parts of a medium (fragmentation). Everyone knows that a non-fragmented data file is limited in size. — Preceding unsigned comment added by 138.130.64.105 (talk) 07:41, 28 September 2016 (UTC)

Frustrating and NOT fair

This is frustrating: I'm looking for a link to download Wikipedia, but there's nothing. These pages are too technical for "normal" people to understand. Even then, the pages talk about XML without providing any context, etc. There's still a long way to go before these resources become democratic. BTW, where is all the money we donate going? Can some of it go toward making the terminology friendlier to normal people? — Preceding unsigned comment added by 74.59.33.46 (talk) 22:33, 17 April 2017 (UTC)

These pages are not intended for users without technical knowledge; reader programs/apps should deal with this in the background. Some things should be clearer though, especially multistream - I don't understand why Wikimedia is even still offering the non-multistream archive. I mean, if it's for the savings, offer a 7z instead and you can knock another 20% off. Neither do I understand why the hell this page recommends getting the non-multistream archive. W3ird N3rd (talk) 03:09, 14 July 2017 (UTC)

Multistream?

How do I open only a part of the multistream file? It says that the reader should handle it for me, but every time I try to open it, it begins decompressing the whole file. What do I need to enable/download in order to successfully use the multistream file? — Preceding unsigned comment added by 73.25.216.142 (talk) 16:51, 7 October 2017 (UTC)

Creating EN Mirror

I am seeking help in creating a daily-updated EN Wikipedia mirror. This mirror needs all meta, but not necessarily media from Commons; media from Commons can be stage 2. The mirror needs to be updated every 24 hours (or more frequently if feasible). Hardware and bandwidth resources are available for this project. I have read Dr. Kent L. Miller's work on wiki mirrors (http://www.nongnu.org/wp-mirror/), but I'm not sure it is the best approach. Has anyone made a successful mirror as I have described? Seeking all input (failures and successes), please. — Preceding unsigned comment added by Thebluegold (talkcontribs) 18:28, 9 October 2017 (UTC)

A copy of another wiki-project

How do I make a copy of another wiki project that runs on MediaWiki? If this is possible only with the help of bots, please show the instructions for AutoWikiBrowser. 78.111.186.100 (talk) 16:55, 13 January 2018 (UTC)

I have trouble connecting to Wikidata from MySQL Workbench

I am trying to establish a remote connection to the Wikipedia database from MySQL Workbench. I followed the steps listed on https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database. However, when I try to open the connection, it says "Could not connect the SSH tunnel": "Authentication failed, access denied". I have generated an SSH key pair and converted the private key into OpenSSH format, but it did not help. — Preceding unsigned comment added by Yangwudi398 (talkcontribs) 16:21, 9 October 2018 (UTC)

Linux File System Support

Now that ZFS is supported on Linux, and legally so [1], it would make sense to me to add ZFS to the list of file systems Linux supports. — Preceding unsigned comment added by 132.28.253.252 (talk) 05:21, 7 November 2018 (UTC)

Clarify multistream description

The should I get multistream? section mentions that you can seek into the bz2 file, but doesn’t explicitly say whether the offset refers to bytes in the compressed file or its uncompressed contents. User alkamid on #wikimedia-tech connect (I don’t know their wiki username) tried it out and found out that it refers to the compressed file; can someone with the appropriate privileges edit this page to clarify that? --Lucas Werkmeister (WMDE) (talk) 16:41, 14 November 2018 (UTC)

Done. -- ArielGlenn (talk) 16:58, 14 November 2018 (UTC)
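
To illustrate the clarification above, a minimal sketch of using the index (assuming the usual "offset:pageid:title" index lines, where the offset is a byte position in the compressed .bz2 file):

    import bz2

    def read_stream(dump_path: str, compressed_offset: int) -> str:
        """Decompress a single bz2 stream starting at the given compressed offset."""
        with open(dump_path, "rb") as f:
            f.seek(compressed_offset)       # offsets index the *compressed* file
            decomp = bz2.BZ2Decompressor()  # one decompressor per stream
            chunks = []
            while not decomp.eof:
                data = f.read(64 * 1024)
                if not data:
                    break
                chunks.append(decomp.decompress(data))
        return b"".join(chunks).decode("utf-8")

    # Each stream holds a batch of <page> elements (100 pages by default), so the
    # returned text still needs a little XML parsing to pick out one article.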

Hojatoleslam

The file enwiki-20190201-pages-articles-multistream.xml.bz2 contains

<title>Hojatoleslam</title>

twice. --77.173.90.33 (talk) 00:31, 22 February 2019 (UTC)

The March 1st stream also has duplicates, so when importing, either do not add a UNIQUE INDEX on the title column, or add an "ON DUPLICATE KEY UPDATE" or something. (March 1st duplicates are "Ablaze! (magazine)", "You Wreck Me", "Carlitos Correia", "Larry Wu-Tai Chin", "Homalocladium", "Parapteropyrum" and "Manuel Rivera Garrido".) --77.173.90.33 (talk) 12:19, 23 March 2019 (UTC)
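
For reference, a minimal sketch of the second workaround, assuming a simple custom table and the pymysql client (the table and column names are made up for illustration):

    import pymysql

    conn = pymysql.connect(host="localhost", user="wiki", password="secret",
                           database="wikidump", charset="utf8mb4")
    with conn.cursor() as cur:
        # A duplicate title overwrites the earlier row instead of aborting
        # the import with a unique-key violation.
        cur.execute(
            "INSERT INTO articles (title, text) VALUES (%s, %s) "
            "ON DUPLICATE KEY UPDATE text = VALUES(text)",
            ("Hojatoleslam", "...page text..."),
        )
    conn.commit()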

Pages Meta-History Names

The complete revision history of Wikipedia articles is distributed across multiple bzip dumps. Is there any way to know which bzip file contains which article? Is there any mapping for this? -- Descentis (talk) 09:53, 28 July 2019 (UTC)
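
For reference, the history dump file names appear to encode page-ID ranges (e.g. ...pages-meta-history1.xml-p10p2101.bz2 covering page IDs 10 to 2101), so a sketch like the following could locate the right file for a given page ID; the naming pattern is an assumption based on the dump listings:

    import re

    def file_for_page_id(page_id, dump_filenames):
        """Return the pages-meta-history file whose pSTARTpEND range covers page_id."""
        pattern = re.compile(r"pages-meta-history\d*\.xml-p(\d+)p(\d+)")
        for name in dump_filenames:
            m = pattern.search(name)
            if m and int(m.group(1)) <= page_id <= int(m.group(2)):
                return name
        return None

    # The page ID for an article can be looked up via the API
    # (action=query&titles=...&format=json) or from the page table dump.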

Download template namespace only?

Is it possible to download only the template namespace? I don't have a lot of storage space and would prefer not to download literally everything. --Trialpears (talk) 14:18, 4 August 2019 (UTC)
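
For reference, one approach is to stream the pages-articles dump and keep only pages whose <ns> is 10 (the Template namespace), so nothing ever has to be stored uncompressed. A sketch, with the XML namespace string as an assumption that varies with the dump's export schema version:

    import bz2
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # check the dump's root element

    def extract_templates(dump_path, out_path):
        """Copy only Template-namespace pages (<ns>10</ns>) out of a pages-articles dump."""
        with bz2.open(dump_path, "rb") as f, open(out_path, "w", encoding="utf-8") as out:
            for _, elem in ET.iterparse(f):
                if elem.tag == NS + "page":
                    if elem.findtext(NS + "ns") == "10":
                        title = elem.findtext(NS + "title")
                        text = elem.findtext(NS + "revision/" + NS + "text") or ""
                        out.write(f"== {title} ==\n{text}\n\n")
                    elem.clear()  # free the element once handled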

2000s Wikipedia

I need a dump of every revision of every WMF wiki page, in all namespaces, up to the end of 2009 in the EST time zone. That is, every revision before January 1, 2010 00:00 EST, including all the history of the pages up to that point. Does anyone know how to build a web crawler that would retrieve all of this data? Or what else should I do? PseudoSkull (talk) 04:23, 25 October 2019 (UTC)

Does WP-mirror work?

mw:Wp-mirror hasn't been supported since 2014.

Has anyone successfully installed this lately?

If not, will somebody with the necessary skills please test it to see if it works?

I look forward to your replies.    — The Transhumanist   13:08, 11 December 2019 (UTC)

Proposal to usurp shortcut WP:DD

Currently WP:DD redirects here, to Wikipedia:Database download. I would like to usurp this shortcut and point it at a different page. WP:DUMP is the more popular and actively used shortcut, so it was already preferred, and now I think it should be the recommended one.

I think the reuse of this shortcut is okay because WP:DD was established in 2005 and earlier today it had 53 uses, which is low usage over a long period of time. I changed most of those links from WP:DD to Wikipedia:Database download so that anyone following them will arrive where they want, and in anticipation of a new target for WP:DD.

The target I want for WP:DD is WP:Defining data, which is a draft guideline I am passing around. This is a new text and the usurpation may be premature, but in the context of low use of WP:DD, I think changing the redirect target anytime is fine. Thoughts from others? Blue Rasberry (talk) 16:09, 24 November 2019 (UTC)

This is a new text and the usurpation may be premature - for the record this was my main reason for reverting the usurpation. Primefac (talk) 12:09, 13 December 2019 (UTC)
I withdraw the request. I was premature in requesting this. After discussion with others I renamed this draft concept Wikipedia:Key information. It may yet get another rename after more discussion. For now no need to consider the shortcut request. Blue Rasberry (talk) 20:44, 30 December 2019 (UTC)

Question about page history

Hello, sorry for the question, but is it possible to get the entirety of the Wikipedia database, including the page change histories? Bigjimmmyjams (talk) 04:53, 27 April 2020 (UTC)

— Preceding unsigned comment added by Bigjimmmyjams (talkcontribs) 13:26, 20 April 2020 (UTC)

Bulk-download my own diffs?

Hi, I was wondering if there was some way to bulk-download the diffs of all my edits? Interested in seeing which websites I most commonly add ELs and citations to. Blythwood (talk) 22:53, 10 August 2020 (UTC)
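
For reference, a sketch using the API (list=usercontribs to enumerate the edits, then action=compare against each edit's parent revision); the username is a placeholder and the HTML-scanning part is left out:

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    USER = "ExampleUser"  # the account whose diffs you want

    params = {"action": "query", "list": "usercontribs", "ucuser": USER,
              "ucprop": "ids|title|timestamp", "uclimit": "500", "format": "json"}
    while True:
        data = requests.get(API, params=params).json()
        for contrib in data["query"]["usercontribs"]:
            if not contrib.get("parentid"):
                continue  # page creations have no parent revision to diff against
            diff = requests.get(API, params={
                "action": "compare", "fromrev": contrib["parentid"],
                "torev": contrib["revid"], "format": "json",
            }).json()
            html = diff.get("compare", {}).get("*", "")
            # ...scan `html` here for added external links and citation URLs...
        if "continue" not in data:
            break
        params.update(data["continue"])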

Reasonable crawling

The project page says a one-second delay between requests is reasonable. Is that a second after the start or after the end of the previous request? What about pages without wikicode: history pages, static resources, etc.? 84.120.7.178 (talk) 02:40, 12 September 2020 (UTC)

Help page

Should the beginning of this article have a help page template? I hesitate to remove it without confirmation. ― Qwerfjkl|   17:56, 20 April 2021 (UTC)

Why would it need to be removed? Primefac (talk) 18:27, 20 April 2021 (UTC)

Delta-Dumps or timestamp of last change

I am looking for a list of all the entries of https://de.wiktionary.org that have had any changes in the last week. Is there something like that? Under which URL is this list available? Or is there a list that contains the timestamp of the last change for each entry?

Background: A script I wrote looks every day at the page https://dumps.wikimedia.org/dewiktionary/latest/ for the date of the current version of the file dewiktionary-latest-all-titles-in-ns0.gz. If there is a new version, the script downloads it. This file contains all lemmas of de.wiktionary.org (only the lemmas, no contents). The script compares this list with the entries that are already in my database, marks entries as deleted if they are no longer in the current list, and marks entries for download that do not yet exist in my database. Another script then goes through all the new entries and downloads their contents in order, for example by going to this page: https://de.wiktionary.org/w/index.php?action=raw&title=Haus (using the respective lemma instead of "Haus", of course). To conserve both my own server's and Wiktionary's bandwidth, this script always pauses briefly between two lemmas, downloading the contents of at most about 100,000 lemmas per day. There are about 1,000,000 lemmas in total, so the download of the whole German Wiktionary took less than two weeks.

And here's the point: I want to keep my database as up to date as possible, but I also want to download as little data as possible. Above all, I don't want to download content that I already have in exactly the same version. So ideally I only want to download the contents of those lemmas where there has been a change since the last dump. Is there such a list? If so, where can I find it? A list where each lemma has a timestamp of its last change would also do. --Hubert1965 (talk) 15:46, 8 July 2021 (UTC)
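
For reference, a minimal sketch of the polite per-lemma fetch loop described above (the lemma list, delay, and database helper are placeholders):

    import time
    import requests

    RAW_URL = "https://de.wiktionary.org/w/index.php"

    def fetch_lemmas(lemmas, delay_seconds=1.0):
        """Yield (lemma, raw wikitext) pairs, pausing between requests."""
        for lemma in lemmas:
            resp = requests.get(RAW_URL, params={"action": "raw", "title": lemma})
            if resp.status_code == 200:
                yield lemma, resp.text
            time.sleep(delay_seconds)  # be gentle with the servers, as described above

    # for lemma, wikitext in fetch_lemmas(["Haus", "Baum"]):
    #     save_to_database(lemma, wikitext)  # hypothetical helper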

IPFS

Should the Wikipedia mirror on IPFS be mentioned? https://ipfs.kiwix.org/ It looks like just a regular HTTP website, but browsers with a compatible add-on will automatically switch to accessing the underlying IPFS, which could be described as a "browsable, partial BitTorrent" and is a distributed P2P filesystem in its own right. Cloud200 (talk) 14:25, 29 November 2021 (UTC)

How would i download all articles under a certain category/portal?

Title says it all. 47.54.193.251 (talk) 05:22, 21 February 2022 (UTC)

There is a tool for this under development at Kiwix but it's not available yet. The other Kiwix guy (talk) 15:53, 10 March 2022 (UTC)

incorrect information with Aard2

1. aarddict: the § Aard Dictionary link is not linked correctly; it should go to the HTML section "Aard Dictionary / Aard 2".

2. "It also has a successor Aard 2." is correct; however, Aard 2 has new functions, better compression, and allows pictures. Not sure why this is not mentioned. I could add it if you would allow me... BRM (talk) 11:00, 22 January 2023 (UTC)