Common Concepts: Repositories

This post covers the basic storage mechanisms used for the repositories of various systems, along with some advantages/disadvantages of each.

A repository is the database of version information used by a version control system. In centralised systems the repository is on the server, while distributed systems keep a copy on each developer’s machine.

Storage mechanisms for repositories vary, and so does the amount of documentation available for them. Git’s storage and object models are very well described around the web, and Mercurial has excellent documentation on the same subject. Bazaar is more difficult to describe, partly because it takes a different approach from the other systems and partly due to a lack of documentation.

There are, broadly speaking, three methods of storing version information:

Snapshots: In snapshot-based systems, each version of a file is saved whole, independently of other versions. The repository therefore grows large quickly, so compression is typically used to keep its size down. In exchange, any version can be retrieved with a single disk access (since they are all stored individually), making access to versions very quick. Git and Bazaar use snapshot-based systems.
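To make the trade-off concrete, here is a minimal sketch of a snapshot store. It is not the on-disk format of Git or Bazaar; the class name `SnapshotStore`, the in-memory dictionary and the choice to key each snapshot by a SHA-1 hash of its content are all assumptions made for illustration.

```python
import hashlib
import zlib


class SnapshotStore:
    """Toy snapshot store: every version is kept whole and compressed."""

    def __init__(self):
        # Maps a content hash to the compressed bytes of one full version.
        self._objects = {}

    def put(self, content: bytes) -> str:
        """Store a complete version and return the key used to find it again."""
        key = hashlib.sha1(content).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self._objects.setdefault(key, zlib.compress(content))
        return key

    def get(self, key: str) -> bytes:
        """Fetch any version with a single lookup, however old it is."""
        return zlib.decompress(self._objects[key])


store = SnapshotStore()
v1 = store.put(b"hello\nworld\n")
v2 = store.put(b"hello\nbrave new world\n")
```

Reading `v1` costs the same as reading `v2`: one lookup and one decompression, regardless of how many versions have been stored in between.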

Delta Compression: Delta compression stores the difference between two versions of a file rather than each version in full. This stores multiple versions of files efficiently, avoiding the large storage issue of snapshot systems (until the repository gets very large anyway). The cost is that a version has to be rebuilt by applying a chain of deltas, so access to old versions takes longer the more version history there is. Mercurial, Subversion and CVS use delta-based methods for repository storage.
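A rough sketch of delta storage follows. It assumes one particular layout, chosen because it matches the behaviour described above: the newest version is kept in full and each older version is reached through a reverse delta. The names `make_delta`, `apply_delta` and `ReverseDeltaStore` are made up for this example, and the line-based delta format (built with Python's `difflib`) does not correspond to the format of any of the systems mentioned.

```python
from difflib import SequenceMatcher


def make_delta(source, target):
    """Return (start, end, replacement) edits that turn source into target."""
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    return [(i1, i2, target[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_delta(source, delta):
    """Apply edits from last to first so earlier indices stay valid."""
    result = list(source)
    for start, end, replacement in reversed(delta):
        result[start:end] = replacement
    return result


class ReverseDeltaStore:
    """Toy delta store: newest version in full, older ones as reverse deltas."""

    def __init__(self, first_version):
        self.tip = list(first_version)  # newest version, stored whole
        self.back = []                  # back[i] turns version i+1 into version i

    def commit(self, new_version):
        new_version = list(new_version)
        self.back.append(make_delta(new_version, self.tip))
        self.tip = new_version
        return len(self.back)           # number of the version just stored

    def checkout(self, n):
        """Rebuild version n by walking back from the newest version."""
        lines = self.tip
        for delta in reversed(self.back[n:]):
            lines = apply_delta(lines, delta)
        return list(lines)


history = ReverseDeltaStore(["a", "b", "c"])     # version 0
history.commit(["a", "B", "c"])                  # version 1
history.commit(["a", "B", "c", "d"])             # version 2
```

Checking out version 2 applies no deltas at all, while version 0 needs every delta in the chain, which is why old versions get slower to reach as history grows.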

A technique called version jumping [1] can be used to minimise the access costs, and offers some storage improvements. Subversion has a similar implementation called skip deltas.
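The sketch below illustrates the skip-delta idea only, and rests on an assumption about the rule: as I understand Subversion's design notes, the delta for revision N is computed against the revision whose number is N with its lowest set bit cleared. The function names are hypothetical and nothing here is Subversion's actual code.

```python
def skip_delta_base(revision: int) -> int:
    """Assumed rule: clear the lowest set bit of the revision number."""
    return revision & (revision - 1)


def delta_chain(revision: int) -> list:
    """List the revisions visited when rebuilding `revision` from revision 0."""
    chain = []
    while revision > 0:
        chain.append(revision)
        revision = skip_delta_base(revision)
    chain.append(0)
    chain.reverse()
    return chain


chain_for_54 = delta_chain(54)
```

Under that rule, the number of deltas to apply equals the number of set bits in the revision number, so it grows roughly with the logarithm of the history length rather than with the length itself.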

Weaves: All versions of a file are stored in a single file, as interleaved blocks. Each block carries metadata indicating the version it belongs to, along with some other information. Any version of a file can be reconstructed with a single sequential read, although this takes longer the more interleaved blocks there are (i.e. the more history a file has). The trade-off is having to rewrite the file each time a new version is added. Bazaar used to use this method (although it is unclear what it uses now) and the original Source Code Control System had a similar mechanism.
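Here is a simplified sketch of a weave, under some loud assumptions: history is strictly linear, the metadata is attached to each line rather than to a block, and it records only the version that added the line and the version that removed it. The `Weave` class is hypothetical and is not the format used by Bazaar or SCCS.

```python
from difflib import SequenceMatcher


class Weave:
    """Toy weave for linear history: one list holds every line ever written."""

    def __init__(self):
        # Each entry is [text, added_in, deleted_in]; deleted_in stays None
        # while the line is still present in the newest version.
        self.entries = []
        self.versions = 0

    def extract(self, version):
        """Rebuild a version in one sequential pass over the weave."""
        return [text for text, added, deleted in self.entries
                if added <= version and (deleted is None or deleted > version)]

    def add(self, new_lines):
        """Interleave a new version into the weave and return its number."""
        version = self.versions + 1
        # Positions in the weave of the lines alive in the newest version.
        live = [i for i, entry in enumerate(self.entries) if entry[2] is None]
        old_lines = [self.entries[i][0] for i in live]
        opcodes = SequenceMatcher(a=old_lines, b=new_lines,
                                  autojunk=False).get_opcodes()
        # Work backwards so insertions do not shift positions still to be used.
        for tag, i1, i2, j1, j2 in reversed(opcodes):
            if tag == "equal":
                continue
            for i in live[i1:i2]:
                self.entries[i][2] = version  # line disappears in this version
            at = live[i1] if i1 < len(live) else len(self.entries)
            self.entries[at:at] = [[text, version, None]
                                   for text in new_lines[j1:j2]]
        self.versions = version
        return version


weave = Weave()
weave.add(["a", "b", "c"])   # version 1
weave.add(["a", "x", "c"])   # version 2
weave.add(["x", "c", "d"])   # version 3
```

Every call to `extract` scans the whole list, so the cost depends on the total number of lines ever written, and every call to `add` modifies that single list, which mirrors the rewrite cost described above.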
