Granite (v0.1.0)

7. Release early. Release often. And listen to your customers.
– Eric S. Raymond, The Cathedral and the Bazaar

First of all, my apologies for the nearly month-long absence. Most of that time has been spent wrapping my head around binary packing, zlib inflation/decompression (in fact the distinction still eludes me…) and playing with Granite, the product of all this tortuous work.

Granite is a pure PHP library for Git. There were several issues with the libraries I managed to find, the most useful of which was Glip.

Glip’s developer has no intention of supporting ‘push/pull/sync’ operations [1], although an alternative implementation exists [2] that permits pushing over “smart” HTTP. However, the Upliner Glip fork [2] hasn’t been updated for a year, and Glip [1] appears to have stalled.

Ideally, Granite will provide a simple, up-to-date implementation of Git reading, writing, with smart HTTP push and pull support. I’d like to start using this library for a variety of uses besides ownCloud, like a locally-installable version of GitHub.

ownCloud

So how does this fit into ownCloud versioning? Granite can read any object from the repository, including tags/branches (refs). Once I’ve developed some classes to represent each of the major objects, I can start writing a PHP StreamWrapper implementation. This should allow me to ‘tweak’ one of the existing OC_Filestorage providers, allowing access to a Git repository as if it were a local directory.

I’m still fuzzy on the details for ownCloud – I don’t much want to go changing the user interface, that’s somebody else’s code. Ideally, you would enable the ‘Git Integration’ (or whatever) application, mount a repository or two (or add existing folders to new repositories) and then be able to browse the current HEAD. For version rollback/recovery, the application should use a configuration value which points to HEAD by default, which the user can then override to allow the viewing of previous files.

Active or Passive Versioning?

That’s a poor choice of heading, but one of my major concerns is how often to make a new ‘version’. I don’t want it to be too granular (i.e. virtually every write results in a new commit) as repository history can get quite large. On the other hand, I don’t want to leave too long between changes, for obvious reasons.

What do you think? (I know people are reading this!) Client-side projects tend to use file change notification systems (SpiderOak, for example [3]) but that’s not really applicable here. Apple’s Time Machine seems to go for an hourly approach [4]. I’ll look more into the choices made by other projects.

Testing

The code is a bit of a shambles at the moment: I’m tired, and it’s just started working, so I want to share it with the world! Beyond all that though, I want to put Granite through its paces before using it to integrate with ownCloud. So please, download it, test it, break it, and tell me all about it. Hopefully the unit tests make sense to people, the tests should run regardless of whether a repository has packed objects or not.

[1] http://lists.fimml.at/glip-devel/0014.html
[2] https://github.com/Upliner/gitphp-glip/wiki/
[3] https://spideroak.com/blog/20091204132500-spideroak-releases-lightweight-filesystem-change-notification-utilities-for-windows-os-x-and-linux-gplv3
[4] https://discussions.apple.com/thread/1949414?start=0&tstart=0

Progress Report (25/11/11)

Finally, the Progress Report which has been distracting me from the rest of my work (including a Ruby on Rails assignment and more Git prototyping) is completed! 4,000+ words detailing the current considerations: future direction, a plan, initial requirements and acceptance tests, and a brief review of version control systems.

The focus for this week is to complete the Rails assignment to a decent standard, and work on completing the first milestone for the 7th December. After that, development should start picking up.

For those interested, it’s a bit of a formal read, but the report is available as a PDF.

DevXS

I spent the weekend of the 11th–13th November at the University of Lincoln for DevXS. The event was sponsored by a variety of University of Lincoln organisations, DevCSI and Amazon Web Services and was apparently a resounding success.

I arrived via train and met up with fellow Aberystwyth students, most of whom are Artificial Intelligence and Robotics students. Being an Open Source Computing student, I was looking for a project with a slightly different focus and found it in Team York from the University of York.

Over 24 hours of constant development we developed the Tasks for Chrome project and submitted it for consideration in the Public Platforms competition, and the general DevCSI competition.

Tasks for Chrome

The final product consisted largely of a PHP API wrapper to the Google Tasks API, which allowed us to do some interesting things with PHP before we talked to Google Tasks. For the the Public Platforms competition, Andrew wrote a JSON parser for reading lists that several Universities provide online. Using this, it was a simple matter to build a web interface that allowed simple CRUD (Create, Read, Update, Delete) functionality while also making the import of reading lists trivial.

A Chrome extension was also developed as a proof of concept: it was possible to rapidly develop it with jQuery and AJAX into a toolbar button with a popup window. The popup window could also perform CRUD operations and import reading lists. There are a number of possibilities from here, such as subscribing to data sources instead of using manual import (think integration with bug trackers). Perhaps you could even sync between two or more users, sharing a todo list between a group.

The Experience

The experience of the entire conference was more important, however. The 26 teams produced some incredible software, with some of my personal favourites being Ook Nog and the Unofficial University Guide. The software developed in such a short space of time, by ~200 self-organising individuals, was inspirational.

The people were incredible; mostly Computer Science students with a smattering of other disciplines, everybody was easy-going, willing to help, and in an amazing mood. A particular mention goes to Dave Challis from University of Southampton for helping me with jQuery AJAX callbacks.

After recovering for a few days, DevXS has reinvigorated my software development passion. Yesterday saw me stepping through the the Git packfile format and the Glip source code, in order to understand exactly what is going on at the bit/byte level (with an explanatory blog post to follow…) which I’ve been putting off for weeks. PHP developers don’t have to deal with binary formats very often, but it really is so much simpler than it looks.

This post is partly to explain the absence of any updates for the last week, and to encourage anybody with free time in February to attend the Dev8D conference, which is bigger, better and free to attend—it really was a life-changing experience.

Common Concepts: Repositories

This post covers the basic storage mechanisms used for the repositories of various systems, along with some advantages/disadvantages of each.

A repository is the database of version information used by a version control system. In centralised systems the repository is on the server, while distributed systems keep a copy on each developer’s machine.

Storage mechanisms for repositories vary, with varying amounts of documentation. Git’s storage and object models are very well described around the web and Mercurial has excellent documentation on the same subject. Bazaar is more difficult to describe, partly because it takes a different approach to the other systems and partly due to a lack of documentation.

There are, broadly speaking, three methods of storing version information:

Snapshots: In snapshot-based systems, each version of a file is saved individually, independent of other versions. This has a few consequences: the repository size quickly grows very large, so compression is typically used to reduce this. This is traded for a single disk access for any version (since they are all stored individually), making access to versions very quick. Git and Bazaar use snapshot-based systems.

Delta Compression: Delta compression stores the difference between two files. Delta storage can be used to efficiently store multiple versions of files, avoiding the large storage issue of snapshot systems (until the repository gets very large anyway). Consequently, the access time for old versions takes longer the more version history there is. Mercurial, Subversion and CVS use delta-based methods for repository storage.

A technique called version jumping [1] can be used to minimise the access costs, and offers some storage improvements. Subversion has a similar implementation called skip deltas.

Weaves: All version information is stored in a single file, in interleaved blocks. Metadata is added to each interleaved block, indicating the version it belongs to along with some other information. Any version of a file can be reconstructed with a single sequential read, although this takes longer the more interleaved blocks there are (i.e. the more history a file has). The trade-off is having to rewrite the file each time a new version is added. Bazaar used to utilise this method (although it is unclear what it uses now) and the original Source Code Control System had a similar mechanism.

Terminology

I’ve spent most of this week writing the draft of my Progress Report (due in two weeks) and a brief report on the relevant version control systems. While the Progress Report is going slowly, the version control report is getting rather large and complicated, so I’m rewriting each section as a blog post in order to proof read it.

This post covers some basic terminology concerning version control systems and identifies synonymous terms in other systems. Each of these areas is also covered in detail in the Common Concepts posts. I’m trying not to assume any knowledge of each areas, apart from a general technical background.

Terminology

Repository: A repository is the database used by a version control system to track information about file versions.

Commit: A commit (also a revision, version or changeset) is a snapshot of a file at a particular point in time. Different systems use different terms, but they are all essentially the same thing.

Tip: A tip (also the HEAD when discussing the main development branch) is the latest commit for a particular branch.

Branch: Words fall short when trying to explain branches. They are exactly as they sound in real life, with the exception that branches can be merged together again. An excellent example along with diagrams is available at gitguru.

Parent: Every commit has a parent (except the initial commit). In the case of a merge commit, there may be two or more parents. Parents can be traced back to the initial commit.

Working Directory: The working directory (also the working copy or the checked out copy) is where the developer’s local, private copy of the repository is stored. The developer makes changes to the relevant files and then commits the changes to the repository.

Check Out: A file is checked out from the repository into the working directory. Similarly, when the file is checked in (also committed), it is saved back to the repository.

Clone: To clone a repository is to make a complete copy of it.

Merge: Merging combines the changes of two branches into one branch. This often results in the creation of a merge commit which identifies the two (or more) parent commits and a descriptive message. Merging algorithms are very much out of scope, but most systems use the three-way merge algorithm.

Diff/Patch: A diff is the difference between two files, as generated by the diff program. A patch is simply a diff, saved to a file in a specified format. The patch program can be used to ‘apply’ changes described in a patch to a file.

A Brief Update

I’ve spent most of the previous week working on a report comparing the features of four version control systems. I’m up to 3000 words and I’ve yet to discuss any details—apparently it’s going to be quite long.

On the more practical side of things, an issue has arisen: Git doesn’t track metadata. At all. As far as I can tell, neither does Mercurial. Bazaar tracks the executable bit, but that appears to be all. The gibak project has an interesting approach: using Git hooks to add metadata to a file, which itself is version controlled. This allows you to check a file’s permissions, size, or any other data, for any version in the history.

An issue I’m avoiding thinking about for now is how to get a Git repository initialised on the server, without having Git installed. It should be a bare repository to allow pushing/pulling without conflicting with the working tree. I will look more into that and update when I know something.

I’d like to try and add that to the Glip prototype I’ve got going, along with an option to change which branch/commit is displayed in the file browser. After that, hopefully, things should start to pick up with both the report and the prototype.

Next Steps

I’ve shown myself that it’s fairly easy to insert a version control system into ownCloud. The next immediate step is allowing users to step back to previous commits, and then building support in for actually viewing files (downloads as well).

First, however, I need to start writing a requirements specification and make a start on the progress report. That and prototyping will take up most of my time this week.

Git vs. Other Distributed Systems

There’s also an interesting paper on Mercurial and it’s Revlog format; also, a report by Google on Git and Mercurial.

Mercurial has some interesting features and an apparent simplicity similar to Git, better Windows support, implemented mostly in Python (with some C) and apparently an implementation of version jumping (more below).

Literature

Efficient Distributed Backup with Delta Compression describes version jumping: essentially storing the entire version of a file when it is less than the combined storage cost of the deltas. It also makes retrieval essentially O(1), I think, regardless of how long ago the version was committed.

Mercurial also has excellent support for HTTP push/pull, which I need to investigate. Also, Google’s analysis notes that Git has support for HTTP and WebDAV pushes:

…designed such that you can have a Apache simply serve the Git repository as static content…

This definitely needs to be looked at, I’ll see how that affects my prototype as well.

Project Outline

So I’ve been putting this all together as part of my Project Outline, which is essentially a project proposal—it’s not assessed, it’s just a deliverable to my supervisor and the project co-ordinator. It contains an idea of what is being undertaken, some constraints and an initial reading list for the bibliography.

A Prototype

I’ve managed to make a small, very badly put together prototype which shows reading from the HEAD of a Git repository. This was rather easy with the fantastic Glip library.

Issues

  • File downloading/streaming through WebDAV doesn’t work for anything except text files
  • File sizes, modification times, etc. aren’t implemented (ignore the 5MB everywhere). Currently the modification time of everything is the last commit
  • Glip doesn’t seem to offer a way of getting file modification times and so on. It can read the index files though, so perhaps this can be implemented
  • lib/fakedirstream.php doesn’t implement methods for fopen() and so on. Writing these would hopefully make integration of file download trivial
  • Expects an initialised (at least 1 commit) Git repository in $SERVERROOT/data/$USERNAME/git/ (i.e. there should be a .git folder inside)—currently WebDAV breaks if it isn’t present for each user ;)
  • Currently hard-coded to the HEAD of the master branch
  • No testing, error checking, or any expectation that it will work on your system
  • Requires a an empty directory: $SERVERROT/data/$USERNAME/files/Glip in order to show up in the file manager

Code

Aside from that it’s in a branch called glip-integration on either my Gitorious or Github repository.

Git, Git-Annex and Bup

My previous post enlightened me as to just how damn complicated this is :)

The Problem, Simply Put

Git can’t handle large files. This is largely because it attempts to load each file into memory for comparison[1]. Git also has an issue effectively storing large files because large binary files don’t diff very well.

(Digression: Git stores every version of every object in a separate file, called loose objects. Since this obviously takes up lots of space, depending on certain config parameters, Git compresses these objects into packfiles using a delta algorithm and zlib, these are called packed objects.)

The Solution: The rsync Algorithm

If you look at synchronisation protocols, most of them are derived from rsync[2] in some way. They are incredibly fast and efficient and achieve this mostly through a rolling checksum. One project, bup, attempts to implement this using a git-formatted repository. Files are effectively ‘chunked’ into ~8KB blocks and then committed to Git directly. The resulting Git repository is compatible with existing Git clients.

Bup

Bup has a lot of advantages, it is extremely quick and is efficient with both memory usage and storage. Bup de-duplicates data by using content addressable storage, meaning if it tries to store a chunk that is already backed up, it simply uses the existing chunk instead. The inevitable trade-off is that files become entangled with their own history (and others) in a very complicated manner, meaning you can’t clean your repository history at the moment. There is also a trade-off with splitting files into lots of blocks.

(Digression: Git has trouble tracking lots (i.e. millions) of files. The packfiles aren’t the issue, they are particularly efficient at their job, the problem is their index files. Git keeps an associated index of what’s in a packfile in .idx files, which allows Git to jump straight to the compressed version of an object. This breaks down a bit with lots of packfiles. Bup’s solution is another layer: .midx files, which index the .idx files. Crucially, this is an issue for bup when it is trying to write very efficient packfiles, Git can still read everything as normal, despite this divergence in expected use.)

Problem Solved, Almost

So bup can efficiently handle almost any file, of any size, by utilising the rsync algorithm (amongst other things). The problem is that it creates a repository with thousands of little files.

A simple way of managing these would be nice, which is where git-annex comes in. Files are "annexed" into a separate storage location with symbolic links pointing to their location.

The obvious disadvantage is these files can no longer be versioned. However, git-annex supports bup as a storage back-end. This allows us to manage large files with Git (git-annex is also compatibile with Git), tracking their symbolic links for user friendliness, while bup takes care of the hard stuff in the background (which is also Git compatible).

Git 1.7.6+

As of Git 1.7.6, the release notes [3] show a new config option: core.bigFileThreshold. Files large than this size are written directly to packfiles, which prevents Git choking on "out of memory" errors. So as of Git 1.7.6 we can at least deal with large files, even if it’s still not very efficiently.

Some Notes

The git-annex project is written in Haskell for starters. Maybe it’s possible to implement this in a simpler manner on the git side, after all, it’s only moving files around rather than dealing with the actual git repository. Also, the packaged version of git-annex has a version of 0.14, while Github[4] indicates versions beyond that have been released. To use the latest (master) version of git-annex, Ubuntu 11.10 Oneiric Ocelot is required because of expected library versions.

Bup is currently on version 0.25 RC1, which isn’t very promising for production use; it also contains this snippet in the README:

This is a very early version. Therefore it will most probably not work for you, but we don’t know why. It is also missing some probably-critical features.

What it comes down to is that Git really can’t deal with large files. Fix the memory issue, you have a packfile issue (alleviated in 1.7.6+). Fix that, there’s a packfile index issue. Fix that and you end up writing meta-index files and I start to wonder why Git was used in the first place. Then again, it appears that nobody’s really fixed this properly despite the compelling use cases.

Looking at various pieces of literature (too many to reference) it’s possible to see that each of these problems can be solved in isolation. What nobody has done yet is package it all up into a single application, protocol or algorithm that can be used. So between now and the next post, I’m going to look into some requirements for the system ownCloud would like (by asking the mailing list nicely) and start looking for solutions.

References

  1. Bup Design Explanation (line 95 onwards)
  2. The rsync algorithm
  3. Git 1.7.6 Release Notes
  4. git-annex

On Synchronisation And Version Control

For my final year at Aberystwyth, I am required to complete a ‘major project’ which will form ⅓ of my marks this year. My degree scheme is Open Source Computing, which means I have to choose something relevant to open source in general. The best solution is to contribute to a project, which I’m happy to say is ownCloud, an open-source self-hosted cloud solution built using PHP.

Problem Scope

Ideally, what I would most like to implement is some kind of version control for the file store that ownCloud uses. This would allow restoration of old or deleted files and may well open up synchronisation possibilities.

There are several constraints to consider. The ownCloud project is popular largely because it depends only on PHP; the WebDAV server and Ampache server are both implemented in PHP and require no extra dependencies (other than PHP libaries/modules). Any version control implementation needs to ensure WebDAV access is maintained and no extra dependencies are introduced.

Broadly, version control can be split into two types: centralised vs. distributed.

Centralised Version Control

Centralised systems focus on a client-server model, where the client sends changes to the server which then checks them to the repository. On the client end, only the current copy of the repository is stored because the server has all the version history. Software such as SVN has a good track record for handling large files in terms of sane memory usage. No VCS (distributed or otherwise) can intelligently merge large binary files.

As a side note, SVN is implemented utilising WebDAV and the DeltaV extension (WebDAV support for versioning) and ownCloud uses WebDAV already. SabreDAV (the PHP WebDAV library used by ownCloud) does not support DeltaV, however, so it would have to be implemented from scratch. I’m assuming that doing this would provide compatibility with SVN front-ends.

Centralised systems also depend on a connection to the server, in order to communicate about file differences. Locking also requires a connection to the server, which may present problems when some clients are offline and make modifications.

Distributed Version Control

Distribute systems bring several advantages. A DVCS can work offline and provide merging when they are re-connected to the network. This is a more likely use-case for self-hosted cloud solutions such as ownCloud. Files can still be modified, even modified in different places in two separate repositories and then merged intelligently when pushing/pulling to other remotes (again, not for binary files). They effectively provide synchronisation and version control in one convenient package.

Every DVCS traditionally has issues with large files, however and several problems occur. Some binary files are compressed, meaning small changes to the file can cascade and generate large diffs. Generating diffs has it’s own problems, as most DVCS were designed for small files containing source code. They make the assumption the file will fit in memory in order to generate a diff for it. There are several approaches that have been taken, which are explored below.

git-annex

In git-annex large files are taken out of the git repository, and stored in ./.git/annex instead. Symbolic links are then created and committed to the repository. The back-end storage can be changed, with support ranging from S3, to bup to FTP. Obviously, this approach short-cuts most of the issues with large files described above, at the cost of losing any version history for those files. It runs on Linux and OS-X but there is no mention of Windows support.

git-bigfiles

The git-bigfiles project claims several advantages over Git, while also stating it aims to merge these changes back into mainline Git when they are stable enough. The project no longer appears to be maintained.

However, as of Git version 1.7.6 a new option has been added to git config – the core.bigFileThreshold option. According to the changelog:

Adding a file larger than core.bigfilethreshold (defaults to 1/2 Gig)
using “git add” will send the contents straight to a packfile without
having to hold it and its compressed representation both at the same
time in memory.

This prevents malloc() errors and similar out-of-memory issues when adding large files to a Git repository.

bup

The bup project takes a different approach. It uses a git-compatible repository but has it’s own algorithm (derived from the rsync algorithm) which takes up much less space. It achieves this by generating a ‘rolling checksum’ to split large files into chunks and then store the differences in the chunks.

It uses the Git packfile format, which allows a normal git front-end to access the repository. It also writes packfiles directly, without an intermediate stage, providing the same advantage as core.bigFileThreshold in Git 1.7.6 – it’s fast even with very large amounts of data.

Finally, it’s written in Python and C, which makes it relatively simple. The current version is a 0.25 release candidate, so it’s probably not ready for production use. Interestingly, git-annex can use bup as a storage back-end.

git-annex and bup?

Reading over that brief list (there are other projects, and I haven’t even looked at Mercurial yet) it seems that a combination of bup and git-annex might be suitable.

I’ll consider a single folder ~/owncloud. Inside there is a git repository which is kept synced with the ownCloud server, either directly or via a mounted WebDAV folder. This effectively solves all the synchronisation issues apart from large file support.

So, taking the approach of git-annex all appropriate files—over a certain size or of a certain file type—would be ‘annexed’ to a separate store. Providing symbolic links to these files makes the process transparent to the user, although this may be troublesome on Windows. This solves the issue of a frighteningly large Git repository but leaves us with another: how to version/archive these extra files and how to integrate that with ownCloud.

Efficient synchronisation can be achieved using Unison or rsync but again, no version history is kept. This is the approach that bup takes: using an rsync-like algorithm to split large files into chunks, providing de-duplication for the entire file history but without the large delta issues of traditional Git. However, another issue with bup is there is currently no way to clean up old backup information; eventually even a bup repository will become unwieldy.

What bup can do is maintain compatibility with ordinary Git clients. It may be possible to use bup on the client machine to generate Git-compatible repositories, which are much more efficiently managed.

Time’s Up

I’ve been looking into this all day. I’m going to write some more posts going into more detail as I understand things better. Still, this shows how deep the problem goes.

Follow

Get every new post delivered to your Inbox.

Join 37 other followers