Create a temporary external Web mirror

13 Apr 2018

There are occasions when a website is relaying live information about ongoing events but its server cannot handle the number of visitors. This causes long load times and server errors, and everyone starts impatiently refreshing the page, which only makes things worse. You can provide a Web mirror to help reduce the server load, but it requires a bit of thought. The solution described here is simple and can be applied every time, without having to discuss installing a load balancer or other more complicated solutions with the original website's administrator.

Context: the Notre-Dame-des-Landes ZAD is under government attack, and its website is relaying live information about the cops' advance and the various evictions and destruction of homes. The site is at times unavailable and needs some support.

If you have a Web server with large bandwidth available, it is possible to build a mirror of the original website and update it every now and then to have a more or less up to date version. We are going to use HTTrack to build this mirror.

You need a VPS or any computer reachable from the Internet, shell access to it, and httrack installed. Go to a publicly available folder served by your Web server and build an initial mirror of the website. There are a few options to consider here; they are discussed below.

httrack "https://zad.nadir.org/" -v -K0 -r2 -x 

The -K0 option changes all HTML links in <a> tags to relative links, so they point to your local copy instead of the original website. It does not, however, handle stylesheet or script links in the HTML head.

The -r2 option limits the mirror to a depth of 2, which means only one level of link traversal is done. If you clone the home page of the website, the pages linked from the home page will be mirrored as well, but the links on those "subpages" will not be followed. Most of the time a depth of 2 is enough and lets you mirror only the most interesting content rather than the whole website, which would contribute to overloading the original server.
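To judge whether a given depth pulled in too little or too much, it can help to look at what the mirror actually contains. The following is a small sketch, not part of httrack: mirror_size is a hypothetical helper, and it assumes the mirror was written to a folder named after the host (for example ./zad.nadir.org) inside the directory where httrack was run.

```shell
#!/bin/sh
# mirror_size DIR
# Hypothetical helper: report how many .html files a mirror folder
# contains and how much disk space it uses (in KiB).
mirror_size() {
    mirror_dir="$1"
    # Count regular files ending in .html; tr strips the padding some
    # wc implementations add around the number.
    pages=$(find "$mirror_dir" -type f -name '*.html' | wc -l | tr -d ' ')
    # du -sk prints "<size in KiB><TAB><path>"; keep only the size.
    size=$(du -sk "$mirror_dir" | cut -f1)
    echo "$pages pages, $size KiB"
}
```

Calling mirror_size ./zad.nadir.org after a run with -r2, and again after a trial run with another depth, gives a rough idea of how much extra content (and extra load on the original server) each level costs. Only files with an .html extension are counted.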

The -x option changes all external links, even those in the HTML head, to a single external.html proxy page. This removes all direct connections to the original website and means that your mirrored pages won't contain links that would continue to be loaded from the original server. Note that most of the time this will also remove the styling and scripts of the mirrored pages!
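A rough way to verify that no direct link to the original server remains is to search the mirrored HTML for href or src attributes that still point at the original host. This is only a sketch under a few assumptions: find_leaks is a hypothetical helper name, grep is assumed to support -r and --include (GNU and BSD grep do), and the mirror folder is passed in explicitly.

```shell
#!/bin/sh
# find_leaks DIR HOST
# Hypothetical helper: list mirrored .html files in which an href or src
# attribute still points at HOST (absolute or protocol-relative URL).
find_leaks() {
    mirror_dir="$1"   # folder holding the mirror, e.g. ./zad.nadir.org
    host="$2"         # original host name, e.g. zad.nadir.org
    # Escape the dots so they match literally in the regular expression.
    host_re=$(printf '%s' "$host" | sed 's/\./\\./g')
    # -r: recurse, -l: print file names only, -E: extended regex.
    grep -rlE --include='*.html' \
        "(href|src)=[\"'](https?:)?//${host_re}" "$mirror_dir"
}
```

Running find_leaks ./zad.nadir.org zad.nadir.org prints the offending files, one per line. An empty result only means this pattern found nothing: references inside CSS url(...) values or inline scripts are not covered, so a look at the browser's network tab is still worthwhile.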

Once the website has been mirrored a first time, check that everything looks OK and that your Web browser does not try to connect to the original website. Then you can update the mirror every few minutes with the --update option, either directly in a shell loop (consider using tmux so it keeps running after you disconnect from the shell) or from your crontab.

while true ; do httrack "https://zad.nadir.org/" -v -K0 -r2 -x --update ; sleep 8m ; done

This updates only the modified files every 8 minutes. Choose the update interval wisely: if it is too short, you will just end up overloading the site yourself. Then spread the word about your mirror and hope people start using it instead of the original website. Maybe contact the original admins so they can put a link to your mirror somewhere.
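For the crontab variant, one thing the shell loop gives for free is lost: cron starts a new run on schedule even if the previous one has not finished. A minimal sketch of a wrapper that skips a run in that case could look like the following. The function name update_mirror, the lock folder name, and the paths in the comments are all hypothetical; the httrack options are the ones used above.

```shell
#!/bin/sh
# update_mirror DIR URL
# Hypothetical wrapper: run one httrack update inside DIR, unless a
# previous update is still running there.
update_mirror() {
    mirror_dir="$1"   # publicly served folder holding the mirror
    url="$2"          # original website, e.g. https://zad.nadir.org/
    lock="$mirror_dir/.update.lock"
    # mkdir either creates the folder or fails, atomically, which makes
    # it usable as a simple lock.
    if ! mkdir "$lock" 2>/dev/null; then
        echo "update already running in $mirror_dir, skipping" >&2
        return 0
    fi
    status=0
    ( cd "$mirror_dir" && httrack "$url" -v -K0 -r2 -x --update ) || status=$?
    rmdir "$lock"
    return "$status"
}

# Assumed setup: this file is saved as /usr/local/bin/update-mirror.sh
# and ends with a call such as
#   update_mirror /var/www/mirror "https://zad.nadir.org/"
# A matching crontab entry, running every 10 minutes, would be:
#   */10 * * * * /usr/local/bin/update-mirror.sh
```

The example uses */10 rather than */8 because cron step values restart at every hour, so */8 would not give an even 8-minute spacing. If an update is killed before it can remove the lock folder, later runs will keep skipping until the folder is deleted by hand.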