A Poor Man's CDN
Hosting large and often-downloaded files can be tricky, especially when you want users to have decent download speeds and 100% availability. This is the story of how Dash’s docsets are hosted.
First, some data:
- At the time of writing, there are 102 docsets hosted
- The total size of these docsets is 1.5 GB (archived)
- Bandwidth requirements are in the range of 5-7 TB / month
- It would cost about $600 / month to host them in a “regular” CDN (e.g. Amazon CloudFront). In contrast, my hosting only costs $20 / month (thanks to 4 VPSs from DigitalOcean)
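For context on that $600 figure: at a typical CDN egress rate of roughly $0.10-0.12 per GB (an assumption based on CloudFront's published pricing around that time), 5-7 TB of transfer works out to somewhere between $500 and $800 a month, while four $5 DigitalOcean droplets add up to the $20 mentioned above.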
Hosting the docsets
Some docsets are fairly large, so download speeds need to be decent. This is achieved by hosting the files in multiple data centers:
- 2 mirrors in New York (for North America)
- 1 mirror in San Francisco (for North America and Asia)
- 1 mirror in Amsterdam (for Europe – or at least Western Europe)
- Extra mirrors can be added in less than 2 minutes to account for spikes
South America, Eastern Europe, Africa and Australia are not directly covered, but download speeds there should still be decent; at least, no one has complained yet. More mirrors will be added whenever DigitalOcean opens new data centers.
Load balancing
Dash performs latency tests on all available mirrors by loading a small file from each of them. The mirrors are then prioritised by latency, and whenever a mirror goes down, Dash notices and avoids it. Mirrors with almost the same latency (within ±0.03 s) are considered equal, and one of them is picked at random.
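The selection logic is simple enough to sketch in code. Here's a rough Swift approximation of the idea (not Dash's actual implementation; the DocsetMirror type, the ping.txt file name and the URLs are made up for illustration, and only the ±0.03 s threshold comes from above):

```swift
import Foundation

// Rough sketch of latency-based mirror selection (illustrative only).
struct DocsetMirror {
    let baseURL: URL
    var latency: TimeInterval = .infinity // .infinity means "unreachable"
}

// Time how long it takes to fetch a tiny file from a mirror.
// Failures are reported as infinite latency so the mirror gets skipped.
func measureLatency(of mirror: DocsetMirror, completion: @escaping (TimeInterval) -> Void) {
    let pingURL = mirror.baseURL.appendingPathComponent("ping.txt") // hypothetical test file
    let start = Date()
    URLSession.shared.dataTask(with: pingURL) { data, _, error in
        completion(data != nil && error == nil ? Date().timeIntervalSince(start) : .infinity)
    }.resume()
}

// Pick the mirror to download from: sort by latency, treat mirrors within
// 0.03 s of the fastest one as equal, and choose among those at random.
func pickMirror(from mirrors: [DocsetMirror]) -> DocsetMirror? {
    let reachable = mirrors.filter { $0.latency != .infinity }
                           .sorted { $0.latency < $1.latency }
    guard let fastest = reachable.first else { return nil }
    let candidates = reachable.filter { $0.latency - fastest.latency <= 0.03 }
    return candidates.randomElement()
}
```

Picking randomly among "equal" mirrors has a nice side effect: it spreads the load across mirrors in the same region instead of hammering whichever one happens to respond a few milliseconds faster.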
This setup results in 100% uptime and very low bandwidth costs. I highly recommend considering a similar setup if you need to host large files for your app.
Hosting the docset feeds
The docset feeds are just small XML files which Dash polls to check for updates. These files are requested a lot: on each Dash launch and every 24 hours afterwards. As each docset has its own feed and most users have more than one docset installed, about 320k HTTP requests are made each day.
These requests are easily handled by an nginx web server on a 512 MB VPS in New York, and the feeds are also mirrored on GitHub. I tried Apache first, but it would sometimes use over 1 GB of RAM while serving these files and would end up failing completely, while nginx serves requests faster and uses less than 40 MB of RAM. I’ll talk about my experiences with nginx in a future post.
Whenever Dash needs to load a feed, it launches 2 threads which race to grab the feed (one from kapeli.com and one from GitHub); whichever thread finishes first wins and its result is used. Most of the time, the kapeli.com thread wins.
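Here's a minimal sketch of that race in Swift, assuming the URLSession API rather than whatever Dash actually uses; the feed URLs in the usage example are placeholders, and the case where both sources fail is left unhandled:

```swift
import Foundation

// Race two sources for the same feed; the first successful response wins
// and the other one is simply ignored. Illustrative sketch only.
func fetchFeed(primary: URL, fallback: URL, completion: @escaping (Data?) -> Void) {
    let lock = NSLock()
    var finished = false

    func deliver(_ data: Data?) {
        guard let data = data else { return } // ignore failed attempts
        lock.lock(); defer { lock.unlock() }
        guard !finished else { return }       // a faster source already won
        finished = true
        completion(data)
    }

    for url in [primary, fallback] {
        URLSession.shared.dataTask(with: url) { data, _, _ in
            deliver(data)
        }.resume()
    }
}

// Usage (placeholder URLs); in a command-line context you would need to keep
// the run loop alive long enough for the requests to complete.
fetchFeed(primary: URL(string: "https://kapeli.com/feeds/Example.xml")!,
          fallback: URL(string: "https://raw.githubusercontent.com/Kapeli/feeds/master/Example.xml")!) { data in
    print("Got feed: \(data?.count ?? 0) bytes")
}
```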
The chances of both kapeli.com and GitHub being unavailable at the same time are very, very small, so this approach has resulted in 100% uptime so far.