For my final year at Aberystwyth, I am required to complete a ‘major project’, which will account for ⅓ of my marks this year. My degree scheme is Open Source Computing, which means I have to choose something relevant to open source in general. The best approach is to contribute to an existing project, and I’m happy to say I have chosen ownCloud, an open-source, self-hosted cloud solution built using PHP.
Problem Scope
Ideally, I would like to implement some kind of version control for the file store that ownCloud uses. This would allow restoration of old or deleted files and may well open up synchronisation possibilities.
There are several constraints to consider. The ownCloud project is popular largely because it depends only on PHP; the WebDAV server and Ampache server are both implemented in PHP and require no extra dependencies (other than PHP libraries/modules). Any version control implementation needs to ensure that WebDAV access is maintained and that no extra dependencies are introduced.
Broadly, version control can be split into two types: centralised vs. distributed.
Centralised Version Control
Centralised systems focus on a client-server model, where the client sends changes to the server, which then checks them in to the repository. On the client end, only the current copy of the repository is stored because the server has all the version history. Software such as SVN has a good track record for handling large files in terms of sane memory usage. That said, no VCS (distributed or otherwise) can intelligently merge large binary files.
As a side note, SVN’s HTTP access is built on WebDAV and the DeltaV extension (WebDAV support for versioning), and ownCloud uses WebDAV already. SabreDAV (the PHP WebDAV library used by ownCloud) does not support DeltaV, however, so it would have to be implemented from scratch. I’m assuming that doing this would provide compatibility with SVN front-ends.
Centralised systems also depend on a connection to the server, in order to communicate about file differences. Locking also requires a connection to the server, which may present problems when some clients are offline and make modifications.
Distributed Version Control
Distributed systems bring several advantages. A DVCS can work offline and merge changes once it is reconnected to the network. This is a more likely use-case for self-hosted cloud solutions such as ownCloud. Files can still be modified, even in different places in two separate repositories, and then merged intelligently when pushing/pulling to other remotes (again, not for binary files). A DVCS effectively provides synchronisation and version control in one convenient package.
Every DVCS traditionally has issues with large files, however, and several problems occur. Some binary files are compressed, meaning small changes to the file can cascade and generate large diffs. Generating diffs has its own problems, as most DVCS were designed for small files containing source code. They assume the file will fit in memory in order to generate a diff for it. Several approaches have been taken, which are explored below.
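To make the memory problem concrete, here is a minimal sketch of my own (not taken from any of the tools discussed) that compares two versions of a file one fixed-size block at a time, so neither file ever has to fit in memory. The function name changed_blocks and the 1 MiB default block size are assumptions chosen purely for illustration.

```python
def changed_blocks(old_path, new_path, block_size=1 << 20):
    """Return the indices of the fixed-size blocks that differ between two files.

    Only one block from each file is held in memory at any time.
    """
    changed = []
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        index = 0
        while True:
            old_block = old.read(block_size)
            new_block = new.read(block_size)
            if not old_block and not new_block:
                break  # both files are exhausted
            if old_block != new_block:
                changed.append(index)
            index += 1
    return changed
```

This naive comparison is cheap on memory, but it has an obvious weakness: inserting a single byte near the start of a file shifts every later block, so almost the whole file is reported as changed.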
git-annex
In git-annex, large files are taken out of the git repository and stored in ./.git/annex instead. Symbolic links are then created and committed to the repository. The back-end storage can be changed, with support ranging from S3 to bup to FTP. Obviously, this approach short-cuts most of the issues with large files described above, at the cost of losing any version history for those files. It runs on Linux and OS X, but there is no mention of Windows support.
git-bigfiles
The git-bigfiles project claims several advantages over Git, while also stating it aims to merge these changes back into mainline Git when they are stable enough. The project no longer appears to be maintained.
However, as of Git version 1.7.6 a new option has been added to git config – the core.bigFileThreshold option. According to the changelog:
Adding a file larger than core.bigfilethreshold (defaults to 1/2 Gig)
using “git add” will send the contents straight to a packfile without
having to hold it and its compressed representation both at the same
time in memory.
This prevents malloc() errors and similar out-of-memory issues when adding large files to a Git repository.
bup
The bup project takes a different approach. It uses a git-compatible repository but has its own algorithm (derived from the rsync algorithm), which takes up much less space. It achieves this by using a ‘rolling checksum’ to split large files into chunks and then storing only the chunks that differ.
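The idea of splitting on a rolling checksum can be sketched roughly as follows. This is my own simplified illustration, not bup’s actual algorithm: the checksum is a plain sum of the last few bytes, and the name split_chunks, the 48-byte window and the 0x3F mask are all arbitrary assumptions. A chunk boundary is declared wherever the low bits of the checksum are all ones, so boundaries depend on nearby content rather than on absolute position in the file.

```python
def split_chunks(data, window=48, mask=0x3F):
    """Split bytes into chunks at content-defined boundaries.

    A boundary falls wherever the sum of the last `window` bytes has all
    of the bits in `mask` set.
    """
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte                  # byte entering the window
        if i >= window:
            rolling -= data[i - window]  # byte leaving the window
        window_full = i + 1 >= window
        if window_full and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])      # trailing partial chunk
    return chunks
```

Because each boundary depends only on the preceding window of bytes, prepending data to a file changes the first chunk but leaves the later chunks identical, so they need not be stored again. A real tool would use a stronger checksum and would presumably bound the chunk sizes as well.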
It uses the Git packfile format, which allows a normal git front-end to access the repository. It also writes packfiles directly, without an intermediate stage, providing the same advantage as core.bigFileThreshold in Git 1.7.6 – it’s fast even with very large amounts of data.
Finally, it’s written in Python and C, which makes it relatively simple. The current version is a 0.25 release candidate, so it’s probably not ready for production use. Interestingly, git-annex can use bup as a storage back-end.
git-annex and bup?
Reading over that brief list (there are other projects, and I haven’t even looked at Mercurial yet) it seems that a combination of bup and git-annex might be suitable.
I’ll consider a single folder ~/owncloud. Inside there is a git repository which is kept synced with the ownCloud server, either directly or via a mounted WebDAV folder. This effectively solves all the synchronisation issues apart from large file support.
So, taking the approach of git-annex, all appropriate files (those over a certain size or of a certain file type) would be ‘annexed’ to a separate store. Providing symbolic links to these files makes the process transparent to the user, although this may be troublesome on Windows. This solves the issue of a frighteningly large Git repository but leaves us with another: how to version/archive these extra files and how to integrate that with ownCloud.
Efficient synchronisation can be achieved using Unison or rsync but, again, no version history is kept. bup builds on the same idea while keeping history: it uses an rsync-like algorithm to split large files into chunks, providing de-duplication for the entire file history but without the large delta issues of traditional Git. However, bup currently has no way to clean up old backup information; eventually even a bup repository will become unwieldy.
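A rough sketch of that annexing step might look like this. It is only an illustration under my own assumptions, not how git-annex itself is implemented: the function names, the 10 MiB threshold and the choice to name stored files by their SHA-256 digest are all hypothetical.

```python
import hashlib
import os
import shutil


def file_digest(path, block_size=1 << 20):
    """Hash a file in blocks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def annex_large_files(root, store, threshold=10 * 1024 * 1024):
    """Move files larger than `threshold` bytes into `store`, leaving symlinks."""
    store = os.path.abspath(store)
    os.makedirs(store, exist_ok=True)
    annexed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git metadata or into the store itself.
        dirnames[:] = [
            d for d in dirnames
            if d != ".git"
            and os.path.abspath(os.path.join(dirpath, d)) != store
        ]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or os.path.getsize(path) <= threshold:
                continue  # already annexed, or small enough for Git
            target = os.path.join(store, file_digest(path))
            if os.path.exists(target):
                os.remove(path)           # identical content already stored
            else:
                shutil.move(path, target)
            os.symlink(target, path)      # transparent stand-in for the file
            annexed.append(path)
    return annexed
```

Calling annex_large_files with the working folder and a store directory would leave small files untouched and replace each large one with a link. The os.symlink call is exactly the part that may be troublesome on Windows.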
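The de-duplication idea can be illustrated with a toy content-addressed store. Again this is a sketch under my own assumptions rather than bup’s real storage format: the ChunkStore class is hypothetical, it keeps everything in memory, and it uses fixed-size chunks to stay short. Each version is recorded as a list of chunk digests, and a chunk that has already been seen is never stored twice.

```python
import hashlib


class ChunkStore:
    """Toy in-memory store that keeps each distinct chunk only once."""

    def __init__(self):
        self.chunks = {}    # hex digest -> chunk bytes
        self.versions = []  # each version is a list of chunk digests

    def add_version(self, data, chunk_size=4096):
        """Record a new version of a file and return its version index."""
        manifest = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            key = hashlib.sha1(chunk).hexdigest()
            self.chunks.setdefault(key, chunk)  # store only unseen chunks
            manifest.append(key)
        self.versions.append(manifest)
        return len(self.versions) - 1

    def read_version(self, index):
        """Rebuild the full contents of a stored version."""
        return b"".join(self.chunks[key] for key in self.versions[index])
```

Storing a second version that changes one chunk out of three adds only one new chunk to the store. Cleaning up old history would mean dropping a version and then deleting any chunks that no remaining version refers to, which is the kind of clean-up the paragraph above says bup currently lacks.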
What bup can do is maintain compatibility with ordinary Git clients. It may be possible to use bup on the client machine to generate Git-compatible repositories, which are much more efficiently managed.
Time’s Up
I’ve been looking into this all day. I’m going to write some more posts going into more detail as I understand things better. Still, this shows how deep the problem goes.