Nathan will talk about Datalad and Git Annex to members of PenguinsUnbound.
Joey Hess really likes git.
Introduction to git-annex by the author, Joey Hess.
git-annex allows managing files with git, without checking the file contents into git. While that may seem paradoxical, it is useful when dealing with files larger than git can currently easily handle, whether due to limitations in memory, time, or disk space.
git-annex is designed for git users who love the command line. For everyone else, the git-annex assistant turns git-annex into an easy to use folder synchroniser.
apt-get install git-annex
git-annex is primarily known for “managing files with git without checking the file contents into git”. Even if this particular capability is not needed, it is still possible to benefit from git-annex or DataLad.
Generally listed from most automated to least automated. Items higher on the list likely include functionality mentioned lower in the list. See workflow for more info.
git annex sync and uses network to keep remotes in sync.
git add and git commit.
git fetch and git merge operations on request.
git-annex branch).
The following snippet shows the general format for a gitignore whitelist:
# Ignore everything
*
# But descend into directories
!*/
# Recursively allow files under subtree
!/subtree/**
The following snippet shows trivial gitattributes usage for two common file types in C.
# C Source Code
*.c text diff=c
# Compiled dynamic libraries (binary is a macro for -diff -merge -text)
*.so binary
Simple Recommendation: Default to MD5E
How important is Security vs Performance?
How to choose whether to use a backend that includes the file extension?
The following steps demonstrate how unique “hash keys” are created by SHA256E for 4 files with identical contents:
$ touch a a.b a.b.c a.b.c.d
$ git-annex add .
add a ok
add a.b ok
add a.b.c ok
add a.b.c.d ok
$ git-annex lookupkey *
SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.b
SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.b.c
SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.c.d
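The key layout shown above (backend name, file size, content hash, then trailing extension) can be sketched in Python. This is only an illustration, not git-annex's own code: the function name sha256e_key and the max_extensions parameter are hypothetical, and the rule of keeping the last two dot-separated parts is an assumption chosen to match the output above. The real tool applies further limits on extensions that are not modelled here.

```python
import hashlib


def sha256e_key(filename: str, content: bytes, max_extensions: int = 2) -> str:
    """Build a SHA256E-style key string (hypothetical helper, simplified rules)."""
    digest = hashlib.sha256(content).hexdigest()
    # Drop a leading dot so a hidden file such as ".bashrc" has no extension.
    parts = filename.lstrip(".").split(".")[1:]
    # Assumption: keep only the last `max_extensions` dot-separated parts.
    kept = parts[-max_extensions:] if max_extensions > 0 else []
    ext = "".join("." + p for p in kept)
    return f"SHA256E-s{len(content)}--{digest}{ext}"


# The four empty files from the example share one hash but get distinct keys.
for name in ["a", "a.b", "a.b.c", "a.b.c.d"]:
    print(sha256e_key(name, b""))
```

Because the extension is part of the key, identical content stored under different extensions yields different keys, which is the trade-off behind the question of whether to choose an "E" backend.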
Using metadata with views:
--in
-( and -) to group expressions
--and, --or, and --not to alter logic
# Find mp3 files in either of two repositories that have less than 3 copies:
git annex find \
--not --exclude '*.mp3' --and \
-\( --in usbdrive --or --in archive -\) --and \
--not --copies 3
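The logic of that expression can be hard to read at a glance, so here is a rough Python model of the same boolean test. It is a sketch under assumptions: the AnnexedFile record and the matches function are hypothetical, the file names are made up, and the number of copies is approximated as the number of known locations.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class AnnexedFile:
    # Hypothetical record of where a file's content is believed to live.
    name: str
    locations: set = field(default_factory=set)


def matches(f: AnnexedFile) -> bool:
    """Mirror the find expression: mp3 AND (usbdrive OR archive) AND NOT 3+ copies."""
    is_mp3 = fnmatchcase(f.name, "*.mp3")           # --not --exclude '*.mp3'
    in_either = bool(f.locations & {"usbdrive", "archive"})  # -( --in ... --or --in ... -)
    enough_copies = len(f.locations) >= 3           # --copies 3 (assumed: 3 or more)
    return is_mp3 and in_either and not enough_copies


files = [
    AnnexedFile("song.mp3", {"usbdrive"}),
    AnnexedFile("talk.mp3", {"usbdrive", "archive", "laptop"}),
    AnnexedFile("photo.jpg", {"archive"}),
    AnnexedFile("demo.mp3", {"laptop"}),
]
print([f.name for f in files if matches(f)])
```

Only the first record passes: the second has three copies, the third is not an mp3, and the fourth is in neither named repository.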
--auto
get/drop/copy/move files based on rules
--auto flag
These commands can be used to help identify what would happen with the next git annex sync --content:
git annex find --in . --want-drop
git annex find --not --in . --want-get
Consider the path of files from a camera to a website in a small company.
Remotes:
(A) camera is “source” of photos
(B) laptop connected to camera is “client”
(C) external hard-drive is “archive”
(D) web server is “public”
Flow of files
(A)->(B): Always
(B)->(C): if file is placed in “archive/” directory
(B)->(D): if file is placed in “public/” directory
via SSH
via HTTP
File-system based remotes are great for local servers or USB drives
Powerful external special remotes
Simple hook commands
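Returning to the camera-to-website example, the directory-based flow of files from the laptop (B) to the archive drive (C) or the web server (D) can be modelled in a few lines of Python. This only illustrates the decision rule; the function name extra_destinations and the sample paths are hypothetical, and in practice such rules would be configured in git-annex itself rather than written as a script.

```python
from pathlib import PurePosixPath
from typing import List


def extra_destinations(path: str) -> List[str]:
    """Return the remotes a file on the laptop (B) should also flow to."""
    parts = PurePosixPath(path).parts
    # The top-level directory decides the destination; a bare file name has none.
    top = parts[0] if len(parts) > 1 else ""
    if top == "archive":
        return ["archive"]   # (B)->(C): external hard-drive
    if top == "public":
        return ["public"]    # (B)->(D): web server
    return []                # stays on the laptop only


for p in ["archive/2019/img_001.jpg", "public/banner.jpg", "inbox/img_002.jpg"]:
    print(p, "->", extra_destinations(p))
```

Moving a file between directories on the laptop is therefore the only action a user needs to take to change where the file ends up.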
DataLad vs Git/Git-annex for modular data management:
This talk demos how DataLad, utilizing these tools, aids workflows involving nested repositories (git submodules), and argues that such workflows are highly suitable for data management needs in science.
git-annex-metadata-gui provides a graphical interface to the metadata functionality of git-annex.
git-annex-adapter lets you interact with git-annex from within Python.
Necessary commands are executed using subprocess and use their batch versions whenever possible.
recastex can take files and podcasts captured by git-annex and re-podcast them.
With recastex (RECAST annEX) you can now re-podcast the shows you have locally (to, say, your phone). This reduces network usage (brilliant for traveling when network costs are expensive) and improves privacy.
Recasting isn’t limited to podcasts. recastex casts all locally available media.
albumin is a script to semi-automatically manage a photograph collection.
A script to semi-automatically manage a photograph collection using a git-annex repository. It analyzes the files for their dates and times, compares them and their identification method to existing data in the repository, and decides which information to keep.
AnnexRemote is a helper module to easily develop special remotes for git annex.
AnnexRemote handles all the protocol stuff for you, so you can focus on the remote itself. It implements the complete external special remote protocol and fulfils all specifications regarding whitespaces etc. This is ensured by an excessive test suite. Extensions to the protocol are normally added within hours after they’ve been published.
From git-annex branchable
From git documentation
Specific Works
Others