Wednesday, August 14, 2013

Setting up your own S3 storage on VPS servers

After using Amazon S3 for a while and seeing that its performance in Northern Europe is basically crap (even with CloudFront), it was time to try setting up an alternative on VPS servers that were closer to "home". I chose to look only at S3 API-compatible storage, as I already had apps using S3 as a storage backend and I wanted to move their storage to servers running in Northern Europe.

The requirements


Easy to scale
Replication
S3 Compatible API
Must run as an application on top of Ubuntu 12.04 LTS
Must use the storage available on a VPS
Can limit disk usage per node (a nice-to-have rather than an absolute requirement)
Can work on low to medium latency connections

The contenders

Riak CS


Runs on top of Riak
Riak is a database (key/value store) and therefore runs on top of the majority of *nix-based distributions
Easy to add nodes using command line tools to a Riak cluster
Has S3 compatible API
Replication is a must for a Riak cluster
No node is master/slave, all are equal (good for HA)
Has a nice web administration tool
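
To make "S3 compatible API" a bit more concrete: the idea is that an existing client keeps building and signing requests the same way and only the endpoint changes. The sketch below uses only the Python standard library and assumes the server accepts AWS signature version 2, which was the common scheme for S3 clients at the time. The host name `s3.example.internal`, the bucket, the key and the credentials are all made up for illustration, and the sketch ignores `x-amz-*` headers.

```python
import base64
import hashlib
import hmac
from email.utils import formatdate


def sign_s3_request(access_key, secret_key, method, resource, date,
                    content_md5="", content_type=""):
    """Return an AWS signature version 2 Authorization header value."""
    # Signature v2 string-to-sign for a request without x-amz-* headers.
    string_to_sign = "\n".join(
        [method, content_md5, content_type, date, resource]
    )
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode("ascii")
    return "AWS {}:{}".format(access_key, signature)


def build_get_request(endpoint_host, bucket, key, access_key, secret_key):
    """Build URL and headers for a path-style GET against any S3-style host."""
    date = formatdate(usegmt=True)  # RFC 1123 date, e.g. "... GMT"
    resource = "/{}/{}".format(bucket, key)
    return {
        "url": "http://{}{}".format(endpoint_host, resource),
        "headers": {
            "Date": date,
            "Authorization": sign_s3_request(
                access_key, secret_key, "GET", resource, date),
        },
    }


# Hypothetical endpoint and credentials: only the host differs from Amazon.
request = build_get_request("s3.example.internal", "my-bucket", "photo.jpg",
                            "EXAMPLE-ACCESS-KEY", "EXAMPLE-SECRET-KEY")
print(request["url"])
```

In practice an S3 client library does all of this for you; the point is only that moving from Amazon to a compatible store should come down to configuring a different host and a different key pair.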

Eucalyptus Walrus


Moderately difficult to add nodes
Has addons for S3 compatible APIs
Can use a normal folder for storage, but not if replication is used
Must have a block device to replicate
Easy to limit usage on a node
Difficult to configure replication
Must have a Cloud Controller, and for HA, a secondary controller

OpenStack Swift


Moderately difficult to scale
Can have problems with medium latency connections, because a write has to reach a majority of nodes
S3 compatible API as an addon
Must have block devices
Easy to limit usage on a node
Syncs through rsync. (I never liked rsync...)
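
The latency concern with majority writes can be illustrated with a small model. This is a simplified sketch, not Swift's actual implementation: it assumes a write returns as soon as a majority of replicas have acknowledged it, and the latency figures are invented.

```python
def quorum_write_latency(replica_latencies_ms):
    """Latency of a write that returns once a majority of replicas ack."""
    n = len(replica_latencies_ms)
    if n == 0:
        raise ValueError("at least one replica is required")
    quorum = n // 2 + 1
    # The write is as slow as the slowest replica inside the quorum,
    # i.e. the quorum-th fastest acknowledgement.
    return sorted(replica_latencies_ms)[quorum - 1]


# One local replica and two in other data centres (made-up numbers).
print(quorum_write_latency([1, 35, 40]))  # 35
# Two local replicas and one remote: the remote node no longer matters.
print(quorum_write_latency([1, 2, 40]))   # 2
```

With three replicas spread over different VPS providers, every write waits for at least one remote node, which is where a medium latency link starts to hurt.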

Cloudian


Has a Community Edition, but documentation is sign-up only (VMware/Citrix anyone?)
Claims to be OSS, but in reality: no.
I read up on the docs, but the documentation is sparse and it does not feel "production-ready"

Apache CloudStack


Has S3 API
Not usable as it requires a management server and a host/hypervisor system

Ceph


Is a distributed file system.
Easy to add nodes
Replicates across nodes
Has an S3 compatible API, although with some limitations (http://ceph.com/docs/next/radosgw/s3/)
Has a nice deploy-tool
Requires block devices for storage

Wrap up


Basically, this leaves two different directions:
setting up Riak CS directly on the system, or choosing Ceph, Walrus or Swift and setting up a file as a block device.
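
For the second direction, "a file as a block device" usually means a loop device on Linux. A minimal sketch, assuming a Linux VPS with root access: the path, size, loop device and mount point are hypothetical, and how a particular storage system then consumes the device is not covered here. The script only creates the backing file and prints the commands, since attaching the device requires root.

```python
import os


def create_backing_file(path, size_bytes):
    """Create a sparse file that can later back a loop device."""
    with open(path, "wb") as backing:
        # truncate() extends the file without writing data, so disk
        # blocks are only allocated once something is stored in it.
        backing.truncate(size_bytes)
    return os.path.getsize(path)


def loop_device_commands(path, loop_device, mount_point):
    """Return the commands (to be run as root) that attach and mount it."""
    return [
        ["losetup", loop_device, path],    # attach the file as a block device
        ["mkfs.ext4", loop_device],        # create a filesystem on it
        ["mount", loop_device, mount_point],
    ]


if __name__ == "__main__":
    # Hypothetical values: a 10 GiB backing file on loop0.
    size = create_backing_file("storage.img", 10 * 1024 ** 3)
    print("created", size, "bytes")
    for command in loop_device_commands("storage.img", "/dev/loop0",
                                        "/mnt/storage"):
        print(" ".join(command))
```

A side effect is that the size of the backing file caps how much disk the storage node can use, which covers the nice-to-have requirement of limiting disk usage per node.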

After reading up on the docs, I am considering both Ceph and Riak CS, and will start by testing Riak CS.

Both provide good Chef cookbooks, so for large scale deployments, take the time to set up Chef properly. If you plan to grow, it will save you time when you need that next node.

However, this is most likely going to be more expensive than using cloud storage, so do consider whether you want to spend your time on this or just pay for cloud storage.

Other OpenStack options would work fine as well, since the client library I am using supports both S3 and OpenStack storage.