Mirror Wikipedia on your own computer

No future - Last modification: Nov 30, 2020

With the end of the world coming up, it'll be handy to have a local mirror of Wikipedia. The whole database is a bit big to manage, but keeping only the current version of the pages from a snapshot makes it manageable. For example, in October 2019, the English Wikipedia pages (text only, no media) fit in 70GB of XML, for about 6 million articles. Using the pages-articles dump, which features all current articles without history or talk pages, there are in fact more than 19.6 million pages to import (templates, redirects, media descriptions...). After 7 months of importing on a Raspberry Pi 4, the database weighs 290GB on disk, without caching, and the pages are at least 7 months old and cannot be updated automatically.
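To make that import time concrete, a rough sanity check (assuming 30-day months) shows the import rate works out to about one page per second:

```shell
# Back-of-the-envelope check of the import rate: 19.6 million pages
# in roughly 7 months on a Raspberry Pi 4.
PAGES=19600000
SECONDS_7_MONTHS=$((7 * 30 * 24 * 3600))   # about 18.1 million seconds
RATE=$(awk "BEGIN { printf \"%.2f\", $PAGES / $SECONDS_7_MONTHS }")
echo "$RATE pages/second"   # roughly one page per second
```

At one page per second, even a modest speedup in the importer would save months.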

Some software can use these XML dumps directly and present them in a tailored browser: see the offline Wikipedia readers section. That is certainly easier to install, and some readers also come with the pages' media, but it's not as fun as having a real editable wiki. It's also not easy to find software that works on ARM processors, which matters if you want this running on a low-power Raspberry Pi 4. It seems Kiwix can make a wifi hotspot that offers a static version of Wikipedia: see the doc.
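For comparison, serving a static copy with Kiwix is only a couple of commands. A sketch, assuming you pick a real .zim archive from the Kiwix download site (the filename below is only an example of the naming pattern):

```shell
# Download a prebuilt .zim archive of Wikipedia (filename is an example;
# browse https://download.kiwix.org/zim/wikipedia/ for an actual one).
wget https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_nopic_2020-11.zim

# Serve it over HTTP; the wiki is then browsable at http://<host>:8080/
kiwix-serve --port 8080 wikipedia_en_all_nopic_2020-11.zim
```

No database, no import: the .zim file is the whole site, read-only.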

It's not really easy to mirror Wikipedia: there's not much recent documentation on this, and getting a website similar to what Wikipedia looks like requires using the same version of MediaWiki and all its extensions (more than 100). The size of the data makes the import hard to complete, and it's also complicated to get the media (images and films in pages). Here's a recent update on what works and what doesn't.

  1. Download the XML dumps here: https://dumps.wikimedia.org/backup-index.html.
  2. Install MediaWiki from git: https://www.mediawiki.org/wiki/Download_from_Git#Fetch_external_libraries.
  3. Import the XML dumps into your database. The documentation about this (https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps) is quite old, and the only method that still seems to work in 2019 is the one that is not recommended for importing this much data: the maintenance/importDump.php script.
  4. Set up a web server, for example nginx with MariaDB and PHP-FPM, and put the MediaWiki instance online.
  5. Next step: install extensions. Wikipedia uses a lot of MediaWiki extensions. Some are required during the import, but most of them only for page rendering. I suggest you fetch all extensions at the same time as the MediaWiki code, because several months later it will be harder to find the versions of all extensions known to work with the database dump you imported.
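The steps above can be sketched as a handful of shell commands. The dump date and filenames are examples from the October 2019 snapshot; adapt them to the dump you actually download:

```shell
# 1. Download the pages-articles dump (current revisions only,
#    no history, no talk pages). Dump date is an example.
wget https://dumps.wikimedia.org/enwiki/20191001/enwiki-20191001-pages-articles.xml.bz2

# 2. Install MediaWiki from git and fetch its external libraries.
#    Grab the extensions now too, at the same revision (see step 5).
git clone https://gerrit.wikimedia.org/r/mediawiki/core.git mediawiki
cd mediawiki
composer install --no-dev

# 3. Import the dump with the slow-but-working maintenance script.
#    importDump.php reads the XML on stdin; expect this to run for
#    months on a Raspberry Pi 4.
bzcat ../enwiki-20191001-pages-articles.xml.bz2 | php maintenance/importDump.php

# 4./5. Then point nginx + PHP-FPM at the mediawiki/ directory and
#    install the extensions Wikipedia uses before rendering pages.
```

Keeping the dump file around after the import is wise: if anything goes wrong, re-downloading 17GB is much cheaper than re-importing for months.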

I should add that my cheap SSD died a few months after the import completed, so I never managed to put it online, and I lost the 7 months of import because I couldn't copy the 290GB anywhere else. Also, having a 7-months-old (and counting) version of Wikipedia is not as fun as the idea sounded at the beginning, and there is no incremental update system; it would probably be slower than Wikipedia's change rate anyway. If the much faster import methods were still available, a bimonthly reimport could be done, but not with this slow XML importer.

Have fun!
