Git-Annex

Abstract

Nathan will talk about DataLad and git-annex to members of PenguinsUnbound.

Date
2020-06-20 10:00 — 12:00
Location
Jitsi

Goal

  • Provide a high-level introduction to what git-annex is.
  • Compare and contrast the different ways git-annex can be used.
  • Convince 25% of the audience to try git-annex.
  • Provide plenty of resources for future investigation.

Introduction to git-annex

Background on the Author

Joey Hess really likes git.

  • etckeeper is a collection of tools to keep track of /etc/ in Git.
  • Ikiwiki is a wiki compiler that stores pages and history in Git.
  • git-annex manages files with git, without checking contents into git.

From the Author

Introduction to git-annex by the author, Joey Hess.

git-annex allows managing files with git, without checking the file contents into git. While that may seem paradoxical, it is useful when dealing with files larger than git can currently easily handle, whether due to limitations in memory, time, or disk space.

git-annex is designed for git users who love the command line. For everyone else, the git-annex assistant turns git-annex into an easy to use folder synchroniser.

git-annex quick start

git-annex can be used in many different ways

git-annex without any annexed files

git-annex is primarily known for “managing files with git without checking the file contents into git”. Even when that capability is not needed, git-annex and DataLad can still be useful.

  • DataLad can be used on git repositories that contain no “annex”.
  • Many of the auto commit/sync features of git-annex can be used even if the files are just checked into git.

git-annex can be overwhelmingly automated

The options below are listed roughly from most automated to least automated. Items higher on the list generally include the functionality of items lower on the list. See workflow for more info.

  1. git annex webapp. GUI for creation, configuration and management
  2. git annex assistant. Automates git annex sync and uses network to keep remotes in sync.
  3. git annex watch. Automates git add and git commit.
  4. git annex sync. Complex git fetch and git merge operations on request.

git-annex flexible configuration and manual usage

  • Fully manual use gives direct control on the command line via git and git-annex sub-commands.
  • Most commands can be configured on multiple different levels.
  • git-annex will usually stay out of the way if configured correctly.

git-annex configuration

git config vs git annex config

  • Configuration for git-annex will feel natural for those familiar with configuration for git.
  • git-annex can be configured with git config (and the associated configuration files).
  • git annex config works much like git config, but stores the configuration within the git repository itself (in the git-annex branch), so it is shared with every clone.
  • git config overrides configuration set via git annex config.
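The two layers can be tried side by side. The following is a minimal sketch, assuming a throwaway repository created with mktemp; the identity, the description "demo", and the chosen values are illustrative only. Note that git annex config accepts only a limited set of settings (annex.largefiles is one of them), and the step is skipped here when git-annex is not installed.

```shell
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo" || exit 1
git config user.name "Demo User"          # hypothetical identity for the demo
git config user.email "demo@example.com"

# Local layer: written to .git/config, affects this clone only.
git config annex.backend SHA256E
local_backend=$(git config --get annex.backend)
echo "local annex.backend: $local_backend"

# Shared layer: written to the git-annex branch, so it travels with the repository.
if command -v git-annex >/dev/null 2>&1; then
    git annex init "demo"
    git annex config --set annex.largefiles 'largerthan=100kb'
    git annex config --get annex.largefiles
else
    echo "git-annex not installed; skipping the git annex config step"
fi
```

A value set with git config in a particular clone takes precedence over the shared value, which makes the shared layer a sensible place for defaults.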

gitignore

  • Most git users should already be familiar with the purpose of this file.
  • Consider using existing templates (such as with gitignore.io).
  • Consider using gitignore whitelist with git-annex repositories.

The following snippet shows the general format for a gitignore whitelist:

# Ignore everything
*
# But descend into directories
!*/
# Recursively allow files under subtree
!/subtree/**
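
One way to check what such a whitelist really ignores is git check-ignore. The sketch below assumes a throwaway repository and a few hypothetical file names; it writes the three patterns shown above and reports each path as ignored or allowed.

```shell
# Build a scratch repository with the whitelist patterns from above.
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo" || exit 1
printf '%s\n' '*' '!*/' '!/subtree/**' > .gitignore

# Hypothetical files inside and outside the whitelisted subtree.
mkdir -p subtree/deep other
touch notes.txt other/c.txt subtree/a.txt subtree/deep/b.txt

# git check-ignore -q exits 0 when the path is ignored, 1 when it is not.
report=$(
    for path in notes.txt other/c.txt subtree/a.txt subtree/deep/b.txt; do
        if git check-ignore -q "$path"; then
            echo "ignored: $path"
        else
            echo "allowed: $path"
        fi
    done
)
echo "$report"
```

Because the first pattern matches everything, the .gitignore file itself is also ignored in this layout; it can be added with git add -f, or whitelisted with an extra pattern.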

gitattributes

The following snippet shows trivial usage for two common file types in C.

# C Source Code
*.c     text diff=c
# Compiled dynamic libraries (binary is a macro for -diff -merge -text)
*.so    binary
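
The effect of these attributes can be inspected with git check-attr, which works even for paths that do not exist yet. This is a small sketch in a throwaway repository; the file names main.c and libdemo.so are made up for the demonstration.

```shell
# Scratch repository containing only the two attribute lines from above.
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo" || exit 1
printf '%s\n' '*.c     text diff=c' '*.so    binary' > .gitattributes

# Ask git which attributes apply to a C source file and a shared library.
# The binary macro shows up as text, diff and merge all being unset.
git check-attr text diff merge -- main.c libdemo.so
```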

git-annex handles large files and metadata

Configure which files go into annex
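
One common way to do this is the annex.largefiles setting, which git annex add consults to decide whether a file goes into the annex or into plain git. The sketch below sets it through .gitattributes in a throwaway repository; the 100 kB threshold and the file names are arbitrary choices for illustration, and git check-attr is used only to show which expression applies to which path.

```shell
# Scratch repository with a per-path annex.largefiles policy.
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo" || exit 1
cat > .gitattributes <<'EOF'
# Annex anything larger than 100 kB ...
* annex.largefiles=(largerthan=100kb)
# ... but always keep C sources in plain git.
*.c annex.largefiles=nothing
EOF

# Show which expression git-annex would see for each (hypothetical) path.
git check-attr annex.largefiles -- video.mp4 main.c
```

Later lines in .gitattributes override earlier ones, so the catch-all rule comes first and the exceptions follow it.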

backends - simple

Simple Recommendation: Default to MD5E

  • DataLad defaults to the MD5E backend.
  • The MD5E backend provides content integrity checks at the lowest computational cost.
  • MD5E includes the file extension in the “hash key”, which avoids subtle issues with programs that rely on extensions.

How important is Security vs Performance?

  • Use “SHA256E” or “SHA512E” to improve security at the cost of performance.

backends - file ext in hash

How to choose whether to use a backend that includes the file extension?

  1. Will the same file contents possibly have multiple different file extensions?
  2. Do programs rely on the file extension of the file itself (not the symlink to the file)?
  • An extension in the key causes multiple copies of identical files that differ only in extension.
  • An extension in the key means the file at the end of the symlink has that extension as well.
  • See this bug report for more information.

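The following steps demonstrate how SHA256E creates distinct “hash keys” for four files with identical (empty) contents. The hash part is the same in every key; only the extension suffix differs, and the last key shows that only the trailing .c.d of a.b.c.d is kept: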

$ touch a a.b a.b.c a.b.c.d
$ git-annex add .
add a ok
add a.b ok
add a.b.c ok
add a.b.c.d ok
$ git-annex lookupkey *
SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.b
SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.b.c
SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.c.d
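
The parts of such a key can be reproduced by hand. The following sketch rebuilds the size and hash portion with standard tools; the function name sketch_key is hypothetical, sha256sum from GNU coreutils is assumed, and the extension suffix that git-annex appends is deliberately left out. It illustrates the naming format only and is not how git-annex is implemented.

```shell
# Rebuild the "SHA256E-s<size>--<hash>" part of a key for one file.
sketch_key() {
    file=$1
    size=$(( $(wc -c < "$file") ))               # size in bytes, whitespace stripped
    hash=$(sha256sum "$file" | cut -d' ' -f1)    # hex SHA-256 of the contents
    printf 'SHA256E-s%s--%s\n' "$size" "$hash"
}

# An empty file, like the ones created with touch above.
empty=$(mktemp)
key=$(sketch_key "$empty")
echo "$key"
```

For an empty file this yields the same string as the key of file a in the listing above.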

Metadata

Using metadata with views:

Find annex files

# Find mp3 files that are present in either of two repositories and have fewer than 3 copies:
git annex find \
    --not --exclude '*.mp3' --and \
    -\( --in usbdrive --or --in archive -\) --and \
    --not --copies 3

git-annex automates git operations and file transfers

git-annex can automate the movement of files

  • Can be used with files in “annex” or in “git”.
  • git-annex can get/drop/copy/move files with --auto, based on rules.
  • git annex sync --content chooses an action (get/drop/copy/move) for each file, based on rules.
  • The assistant is a daemon that can automate operations between remotes.
  • The webapp allows everything to be configured via a GUI.
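
A rough sketch of the manual, rule-driven variant follows. The function name annex_auto_sync and the default remote name "archive" are placeholders; the function is only defined here and has to be called from inside a git-annex repository that has such a remote.

```shell
# Sketch only: let the configured rules decide which files move.
annex_auto_sync() {
    remote=${1:-archive}                       # hypothetical remote name
    git annex get --auto .                     # fetch files this repository wants
    git annex copy --auto --to "$remote" .     # send files the remote wants or lacks copies of
    git annex drop --auto .                    # drop files that are no longer wanted here
}
```

Calling annex_auto_sync usbdrive would apply the same steps to a remote named usbdrive. git annex sync --content bundles comparable steps with the usual git fetch, merge, and push.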

preferred content - Configure which files should move and where

These commands can be used to help identify what would happen with the next git annex sync --content:

git annex find --in . --want-drop
git annex find --not --in . --want-get

preferred content - Practical example

Consider the path of files from a camera to a website in a small company.

Remotes:

  • (A) camera is “source” of photos
  • (B) laptop connected to camera is “client”
  • (C) external hard-drive is “archive”
  • (D) web server is “public”

Flow of files

  • (A)->(B): Always
  • (B)->(C): if file is placed in “archive/” directory
  • (B)->(D): if file is placed in “public/” directory
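
The flow above could be configured roughly as follows, using the standard preferred content groups that share the names used in the list. This is a sketch under assumptions: the remote names camera, harddrive, and webserver are invented, the function is only defined and must be run on the laptop inside the repository, and the exact behavior of each standard group should be checked against the git-annex preferred content documentation (for example, the standard archive group wants any content that is not already archived elsewhere, and the public group relies on a preferred directory that defaults to public).

```shell
# Sketch only: assign each repository to a standard group and
# let it use that group's standard preferred content expression.
configure_photo_flow() {
    git annex group camera source        # (A) hands files on, keeps nothing long term
    git annex wanted camera standard

    git annex group here client          # (B) the laptop this is run on
    git annex wanted here standard

    git annex group harddrive archive    # (C) long-term storage
    git annex wanted harddrive standard

    git annex group webserver public     # (D) only files under public/
    git annex wanted webserver standard
}
```

After this, git annex sync --content on the laptop would move files according to these rules. A stricter rule for the hard drive could be written as a custom expression, such as git annex wanted harddrive 'include=archive/*'.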

Sharing files from git-annex with git

via git on computer in your network

via SSH

via HTTP

via Gogs, a git-annex compatible git server

Sharing files from git-annex with Special Remotes

  • Can only be used with files in “annex”, not those in “git”.
  • Special remotes allow annexed file contents to be stored in a wide variety of places.
  • Almost all special remotes support encryption.

Special Remotes - Cloud services

Special Remotes - local usage

File-system based remotes are great for local servers or USB drives

  • directory for local use
  • rsync for use over a network
  • adb for use with android device
  • The directory and rsync special remotes intentionally use the same layout, so the same directory could be set up as both types of special remote. (comment by Joey H)
  • The main reason to use this rather than a bare git repo is that it supports encryption. (comment by Joey H)

Special Remotes - Bring your Own

Powerful external special remotes

Simple hook commands

  • Very simple to write, but less robust than external special remotes

Other tools that work with git-annex

Datalad handles publication and reproduction at scale

DataLad vs Git/Git-annex for modular data management:

This talk demonstrates how DataLad, building on these tools, aids workflows involving nested repositories (git submodules), and argues that such workflows are well suited to data management needs in science.

git-annex metadata GUI

git-annex-metadata-gui provides a graphical interface to the metadata functionality of git-annex.

git-annex-adapter - use from python

git-annex-adapter lets you interact with git-annex from within Python.

Necessary commands are executed using subprocess and use their batch versions whenever possible.

recastex - manage podcasts

recastex can take files and podcasts captured by git-annex and re-podcast them.

With recastex (RECAST annEX) you can now re-podcast the shows you have locally (to, say, your phone). This reduces network usage (brilliant for traveling when network costs are expensive) and improves privacy.

Recasting isn’t limited to podcasts. recastex casts all locally available media.

albumin - manage photos

albumin is a script to semi-automatically manage a photograph collection.

A script to semi-automatically manage a photograph collection using a git-annex repository. It analyzes the files for their dates and times, compares them and their identification method to existing data in the repository, and decides which information to keep.

pre-commit-annex for extract and exiftool

AnnexRemote

AnnexRemote is a helper module to easily develop special remotes for git annex.

AnnexRemote handles all the protocol stuff for you, so you can focus on the remote itself. It implements the complete external special remote protocol and fulfils all specifications regarding whitespaces etc. This is ensured by an excessive test suite. Extensions to the protocol are normally added within hours after they’ve been published.

Firefox plugin FlashGot - download manager

Conclusion

References

From git-annex branchable

public git-annex repositories

From git documentation

Specific Works

Others

Nathan Genetzky
Senior Software Engineer

Software Engineer by Day, Electronic Hobbyist by Night.
