We're often working on a project where we've been handed a large data set (say, a handful of files that are 1GB each), and are writing code to analyze it. All of the analysis code is in Git, so everybody can check changes in and out of our central repository. But what to do with the data sets that the code is working with?

In some ways I want the data in the repository: When users first clone the repository, the data should come with it. The data isn't 100% read-only: now and then a data point is corrected, or a minor formatting change happens. If minor changes happen to the data, users should be notified at the next checkout.

However, I don't want the data in the git repository: git cloning a spare copy (so I have two versions in my home directory) will pull a few GB of data I already have. I'd rather either have it in a fixed location or add links as needed. With data in the repository, copying to a thumb drive may be impossible, which is annoying when I'm just working on a hundred lines of code. If an erroneous data point is fixed, I'm never going to look at the erroneous version again. Changes to the data set can be tracked in a plain text file or by the person who provided the data (or just not at all).

It seems that I need a setup with a main repository for code and an auxiliary repository for data. Any suggestions or tricks for gracefully implementing this, either within git or in POSIX at large? Everything I've thought of is in one way or another a kludge.

Tagged git-annex. Asked at 13:52 by Make42.

Bring the power and distributed nature of git to bear on your large files with git-annex. git-annex allows managing large files with git, without storing the file contents in git. It can sync, back up, and archive your data, offline and online. Checksums and encryption keep your data safe and secure.

The process that initially populates the cache only uses 10GB, because annex.thin is enabled in the repo, so it hardlinks the content to the .git/annex/objects/ files, and annex.hardlink is enabled on the cache, so the populate step hardlinks the cache's .git/annex/objects/ files back to the repo's.

Please correct me if I am wrong, but git-annex assistant is a horrible choice. It doesn't check for changes within text files. It also sets the files to read-only every time they sync.

Not exactly true: you can still use all of Git's features with DVC, even merging (in fact, fitting into Gitflow is one of the points of data version control in the first place): dvc.
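The claim that populating the cache "only uses 10GB" rests on how hardlinks behave: two directory entries point at the same inode, so the content is stored once. The sketch below illustrates only that filesystem behaviour with the Python standard library; it does not call git-annex, and the file names `object` and `cache_copy` are hypothetical stand-ins for a file under .git/annex/objects/ and its counterpart in the cache.

```python
import os
import tempfile


def hardlink_demo(size_bytes: int = 1024) -> dict:
    """Create a file, hardlink it, and report what the filesystem says."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-ins, not real git-annex paths.
        original = os.path.join(tmp, "object")
        linked = os.path.join(tmp, "cache_copy")

        with open(original, "wb") as fh:
            fh.write(b"\0" * size_bytes)

        # A hardlink is a second name for the same inode, not a copy.
        os.link(original, linked)

        a, b = os.stat(original), os.stat(linked)
        return {
            # Same device and inode means the data blocks are shared.
            "same_inode": (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino),
            # Number of names currently pointing at the inode.
            "link_count": a.st_nlink,
            # Both names report the full size, but it is stored only once.
            "size_each": (a.st_size, b.st_size),
        }


if __name__ == "__main__":
    print(hardlink_demo())
```

Hardlinks only work within a single filesystem, so this kind of saving presumably requires the repository and the cache to live on the same one.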
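The question's idea that changes to the data set "can be tracked in a plain text file" could be approached with a checksum manifest: the small text file is committed alongside the code, while the data itself stays outside the repository. This is a minimal sketch under that assumption, not a replacement for git-annex or DVC; the function names and the `MANIFEST.sha256` file name are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so 1GB files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict:
    """Map each file's path (relative to data_dir) to its SHA-256 digest."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_of(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def write_manifest(manifest: dict, manifest_path: Path) -> None:
    """Write one 'digest  name' line per file, sorted by name."""
    text = "".join(f"{d}  {n}\n" for n, d in sorted(manifest.items()))
    manifest_path.write_text(text)


def read_manifest(manifest_path: Path) -> dict:
    """Parse the 'digest  name' lines back into a dict."""
    manifest = {}
    for line in manifest_path.read_text().splitlines():
        digest, name = line.split("  ", 1)
        manifest[name] = digest
    return manifest


def changed_files(recorded: dict, current: dict) -> list:
    """Names that were added, removed, or whose digest differs."""
    names = set(recorded) | set(current)
    return sorted(n for n in names if recorded.get(n) != current.get(n))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data"
        data.mkdir()
        (data / "points.csv").write_text("1,2\n")
        manifest_file = Path(tmp) / "MANIFEST.sha256"  # hypothetical name
        write_manifest(build_manifest(data), manifest_file)
        (data / "points.csv").write_text("1,3\n")  # a corrected data point
        print(changed_files(read_manifest(manifest_file), build_manifest(data)))
```

Running the comparison after a checkout (for example from a hook) would be one way to notify users that their local data no longer matches what the code expects.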