Michael Eriksson's Blog

A Swede in Germany

Dropping the ball on version control / Importing snapshots into Subversion


Unfortunately, computer stupidities are not limited to the ignorant or the stupid; they can also strike those who are lazy, overly optimistic, too pressed for time, whatnot.

A particularly interesting example is my own use of version control:

I am a great believer in version control, I have worked with several* such systems in my professional life, I cannot recall the last time that I worked somewhere without version control, and I have used version control very extensively in past private activities,** including for correspondence, to keep track of config files, and, of course, for my website.

*Off the top of my head, and in likely order of first encounter, PVCS, CVS, Subversion, Git, Perforce. There was also some use of RCS during my time at Uni. (Note that the choice of tools is typically made by the employer, or some manager working for the employer, and is often based on existing licences, company tradition, legacy issues, whatnot. This explains e.g. why Perforce comes after Git in the above listing.)

**Early on, CVS; later, Subversion.

However, at some point, I grew lazy, between long hours in the office, commutes, and whatnots, and I increasingly cut out the overhead—and, mostly, this worked well, because version control is often there for when things go wrong, just like insurance. For small and independent single files, like letters, more than this indirect insurance is rarely needed. (As opposed to greater masses of files, e.g. source code to be coordinated, tagged, branched, maintained in different versions, whatnot.) Yes, using proper version control is still both better and what I recommend, but it is not a game changer when it comes to letters and the like, unlike e.g. a switch from WYSIWYG to something markup-based.

Then I took up writing fiction—and dropped the ball completely. Of course, I should have used version control for this. I knew this very well, but I had been using Perforce professionally for a few years,* had forgotten the other interfaces, and intended to go with Git over the-much-more-familiar-to-me Subversion.

*Using Perforce for my writings was out of the question. The “user experience” is poor relative to e.g. Subversion and Git; ditto, in my impression, the flexibility; Perforce is extremely heavy-weight in setup; and I would likely have needed a commercial licence. Any advantages that Perforce might or might not have had in terms of e.g. “enterprise” functionality were irrelevant, and, frankly, brought nothing even in the long-running-but-smallish professional project(s) where I used it.

But I also did not want to get bogged down refreshing my memory of Git right then and there—I wanted to work on that first book. The result? I worked on the book (later, books) and postponed Git until next week, and next week, and next week, … My “version control” at this stage consisted of a cron-job* that created an automatic snapshot of the relevant files once a day.**

*Cron is a tool to automatically run certain tasks at certain times.

**Relative proper version control, this implies an extreme duplication of data, changes that are grouped semi-randomly (because they took place on the same day) instead of grouped by belonging, snapshots (as pseudo-commits) that include work-in-progress, snapshots that (necessarily) lack a commit message, snapshots that are made even on days with no changes, etc. However, it does make it possible to track the development to a reasonable degree, it allows a reasonable access to past data (should the need arise), and it proved a decent basis for a switch to version control (cf. below). (However, some defects present in the snapshots cannot be trivially repaired. For instance, going through the details of various changes between two snapshots in order to add truly helpful commit messages would imply an enormous amount of work and I used much more blanket messages below, mostly to identify which snapshot was the basis for the one or two commits.)
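A daily snapshot of this kind can be sketched in a few lines of shell. This is a minimal illustration, not the script actually used: the function name make_snapshot, the dated directory layout, and the crontab line in the comment are all assumptions.

```shell
#!/bin/sh
# Hypothetical crontab entry that would run a script containing the call
# below once a day at 03:00 (path and time are assumptions):
#   0 3 * * * /home/user/bin/snapshot.sh

# Copy everything below directory $1 into a dated subdirectory of $2,
# for example $2/2021-06-30.
make_snapshot() {
    src=$1
    dest_root=$2
    dest="$dest_root/$(date +%Y-%m-%d)"
    mkdir -p "$dest" && cp -R "$src"/. "$dest"/
}
```

A call like make_snapshot "$HOME/books" "$HOME/snapshots" copies the full tree every day, which is exactly the duplication of data mentioned above.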

Then came the end of 2021, I still had not set up Git, and my then notebook malfunctioned. While I make regular backups, and suffered only minimal data loss, this brought my writings to a virtual halt: with one thing and another, including a time-consuming switch from Debian to Gentoo and a winter depression, I just lost contact. (And my motivation had been low for quite some time before that.) Also see e.g. [1] and a handful of other texts from early 2022, which was not a good time for me.

In preparation to resume my work (by now in 2023…) on both my website and my books, I decided to do it properly this time. The website already used Subversion, which implied reacquainting myself with that tool, and I now chose to skip Git for the books and go with Subversion instead.*

*If in doubt, largely automatic conversion tools exist, implying that I can switch to Git if and when I am ready to do so, with comparatively little effort and comparatively little loss, even if I begin with Subversion. (And why did I not do so to begin with?) Also see excursion.

(Note: A full understanding of the below requires some acquaintance with Subversion or sufficiently similar tools, as well as some acquaintance with a few standard Unix tools.)

So, how to turn those few years of daily snapshots into a Subversion repository while preserving history? I began with some entirely manual imports, in order to get a feel for the work needed and the problems/complications that needed consideration. I did this by setting up an (initially empty) repository and working copy, copying the files from the first snapshot into the working copy, committing, throwing out the files,* copying in the files from the second snapshot,* taking a look at the changes through “svn status”, taking corresponding action, committing, etc.

*Leading to a very early observation that it is better to compare first and replace files later. Cf. parts of the below. (However, “throwing out the files” is not dangerous, as they are still present in the repository and can easily be restored.)

After a few iterations, I had enough of a feel to write a small shell script to do most of the work, proceeding by the general idea of checking (using “diff -rq” on the current working copy and the next snapshot) whether any of the already present files were gone (cf. below) and, if not, just replacing the data with the next snapshot, automatically generating “svn add” commands for any new files, and then committing.
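The general idea of such a script can be sketched as follows. This is an illustration under assumptions, not the original script: the function names only_in and import_step, the commit message format, and the choice to print the Subversion commands for review (rather than run them) are all hypothetical.

```shell
#!/bin/sh
# Read "diff -rq" output on stdin and print the paths, relative to
# directory $1, that diff reported as existing only below $1.
only_in() {
    awk -v p="Only in $1" '
        index($0, p) == 1 {
            rest = substr($0, length(p) + 1)   # ": name" or "/sub: name"
            c = substr(rest, 1, 1)
            if (c != ":" && c != "/") next     # another directory sharing the prefix
            i = index(rest, ": ")
            dir = substr(rest, 2, i - 2)       # "" or "sub"
            name = substr(rest, i + 2)
            print (dir == "" ? name : dir "/" name)
        }'
}

# $1 = working copy, $2 = next snapshot (both without a trailing slash).
# Returns 1 and changes nothing if something is gone from the snapshot.
import_step() {
    wc=$1
    snap=$2
    # LC_ALL=C keeps the "Only in ..." messages in the form parsed above.
    report=$(LC_ALL=C diff -rq -x .svn "$wc" "$snap")
    gone=$(printf '%s\n' "$report" | only_in "$wc")
    if [ -n "$gone" ]; then
        printf 'Gone from the snapshot, manual intervention needed:\n%s\n' "$gone" >&2
        return 1
    fi
    # Nothing is gone: overwrite the working copy with the snapshot contents.
    cp -R "$snap"/. "$wc"/ || return 1
    # Print the commands for new files and directories, then the commit.
    printf '%s\n' "$report" | only_in "$snap" | while IFS= read -r path; do
        printf 'svn add "%s"\n' "$wc/$path"
    done
    printf 'svn commit -m "Import of snapshot %s" "%s"\n' "$(basename "$snap")" "$wc"
}
```

A new directory shows up as a single entry, which is enough here, because “svn add” on a directory also adds its contents.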

The above “if not” applied most of the time and made for very fast work. However, every now and then, some files were gone, and I then chose to manually intervene and find a suitable combination of “svn remove” and, with an eye at preserving as much as possible of the historical developments, “svn move”.* (Had I been content with losing the historical developments, I could have let the script generate “svn remove” commands automatically too, turning any moves into independent actions of remove-old and add-new, and been done much faster.) After this + a commit, I would re-run the script, the “if not” would now apply and the correct remaining actions would be taken.**

*See excursion.

**If a file had been both moved and edited on the same day/in the same snapshot, there might now be some slight falsification of history, e.g. should I first have changed the contents and then moved the file. With the above procedure, Subversion would first see the move and then the change in contents. Likewise, a change in contents, a move, and a further change in contents would be mapped as a move followed by a single change in contents. However, both the final contents of the day and the final file name of the day are correctly represented in Subversion, which is the main thing.
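The manual intervention for a moved file amounts to a few Subversion commands. A minimal sketch, assuming a hypothetical helper emit_move that only prints the commands for review; the commit message is likewise an assumption.

```shell
#!/bin/sh
# Print the Subversion commands for recording a move inside working copy $1,
# from relative path $2 to relative path $3. A missing target directory is
# created first through "svn mkdir --parents".
emit_move() {
    wc=$1
    old=$2
    new=$3
    target_dir=$(dirname "$wc/$new")
    [ -d "$target_dir" ] || printf 'svn mkdir --parents "%s"\n' "$target_dir"
    printf 'svn move "%s" "%s"\n' "$wc/$old" "$wc/$new"
    printf 'svn commit -m "Move %s to %s" "%s"\n' "$old" "$new" "$wc"
}
```

For instance, emit_move wc tournament part2/23_tournament prints an “svn mkdir” line only if wc/part2 does not exist yet, followed by the move and the commit.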

All in all, this was surprisingly painless,* but it still required a handful of hours of work—and the result is and remains inferior to using version control from the beginning.

*I had feared a much longer process, to the point that I originally had contemplated importing just the latest state into Subversion, even at the cost of losing all history. (This was also, a priori, a potential outcome of those manual imports “to get a feel for the work needed”. Had that work been too bothersome, I would not have proceeded with the hundreds of snapshots.)

(There was a sometime string of annoyances, however, as I could go through ten or twenty days’ worth of just calling the script resp. of the “if not” case, run into a day requiring manual intervention, intervene, and proceed in the hope of another ten or twenty easy days—but instead run into several snapshots requiring manual intervention in a row. As a single snapshot requiring manual intervention takes longer than a month’s worth of snapshots that do not, this was a PITA.)

Excursion on disappearing files:
There were basically three reasons for why an old file disappeared between snapshots:

  1. I had (during the original work) moved it to another name and/or another directory. I now had to find the new name/location and do an “svn move” to reflect this in the repository. (And sometimes an “svn mkdir”, when the other directory did not already exist. If I were to begin again, I would make the “svn mkdir” automatic.) Usually, this was easy, as the name was normally only marginally changed, e.g. to go from “tournament” to “23_tournament”, corresponding to a chapter being assigned a position within the book; however, there were some exceptions. A particular hindrance in the first few iterations was that I failed to consider the behavior of the command line tool “diff” (not to be confused with “svn diff”), which I used to find differences between the state in the repository and the next snapshot: a call like “diff -rq” upon two directories does show what files are present in the one but not the other, but if a (sub-)directory is missing, only the directory itself is listed, not the files within it. (With the implication that I first have to “svn mkdir” the new directory, and only after will “diff -rq” show me the full differences in files.) This complication might have made me misinterpret a few early disappearing files as belonging to one of the following items, instead of this item, because I could not see that the file had been moved. Another complication was when a file had been given a new name with a less obvious connection, which happened on some rare occasions.
  2. I had outright deleted it, be it because the writing was crap, because the contents did not fit well with the rest of the story, or because it served some temporary purpose, e.g. as a reminder of some idea that I might or might not take up later. In a particularly weird case, I had managed to save a file with discus statistics with my writings, where it absolutely did not belong. (I am not certain how that happened.) These cases resulted in a simple “svn remove”.
  3. I had integrated the contents into another file and then deleted the original file, often with several smaller files being integrated into the same larger file through the course of one day. Here, I used a “svn remove” as a compromise. Ideally, I should have identified the earlier and later files, committed them together, and given them an informative commit message, but the benefits of this would have been in no proportion to the additional effort. (This is a particularly good example of how proper version control, with commits of changes as they happen, is superior to mere daily snapshots.)
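The “diff -rq” limitation from item 1 can be sidestepped by comparing full file lists instead. The following sketch swaps “diff -rq” for “find”, “sort”, and “comm”; this is not the approach used during the import, and the function names list_files and compare_trees are hypothetical.

```shell
#!/bin/sh
# List regular files below directory $1, relative to it, sorted,
# skipping Subversion metadata. File names containing newlines are not handled.
list_files() {
    ( cd "$1" && find . -name .svn -prune -o -type f -print ) |
        sed 's|^\./||' | LC_ALL=C sort
}

# Print "- path" for files only in the working copy ($1) and
# "+ path" for files only in the snapshot ($2).
compare_trees() {
    old=$(mktemp)
    new=$(mktemp)
    list_files "$1" > "$old"
    list_files "$2" > "$new"
    LC_ALL=C comm -23 "$old" "$new" | sed 's/^/- /'
    LC_ALL=C comm -13 "$old" "$new" | sed 's/^/+ /'
    rm -f "$old" "$new"
}
```

If “tournament” has been moved to “part2/23_tournament” and the directory “part2” is new, the output contains both “- tournament” and “+ part2/23_tournament”, so the move is visible without first creating the directory.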

In a more generic setting, I might also have had to consider the reverse of the last item, that parts or the whole of a larger file had been moved to smaller new files, but I knew that this had been so rare in my case, if it had happened at all, that I could ignore the possibility with no great loss. A similar case is the transfer of some parts of one file into another. This has happened from time to time, even with my books, e.g. when a scene has been moved from one chapter to another or when a part of a file with miscellanea has found a permanent home. However, it is still somewhat rare and the loss of (meta-)information is lesser than if e.g. an atomic “svn move” had been replaced with a disconnected “svn remove”–“svn add” sequence. (Other cases yet might exist, e.g. that a single file was partially moved to a new file, partially integrated into an old one. However, again, these cases were rare relative the three main items, and relatively little could be gained from pursuing the details.)

Excursion on some other observations:
During my imports, I sometimes had the impression that I had structured my files and directories in an unfortunate manner for use with version control, which could point to an additional benefit of using version control from day one. A particular issue is that I often use a directory “delete” to contain quasi-deleted files over just deleting them, and only empty this directory when I am sure that I do not need the files anymore (slightly similar to the Windows Recycle Bin, but on a one-per-directory basis and used in a more discretionary manner). Through the automatisms involved above, I had such directories present in the snapshot, added to Subversion during imports, files moved to them, files removed from them, etc. Is this sensible from a Subversion point of view, however? Chances are that I would either not have added these directories to the repository in the first place, had I used Subversion from the beginning, or that I would not have bothered with them at all, within or without the repository, as the contents of any file removed by “svn remove” are still present in the repository and restorable at will. Similarly, with an eye at the previous excursion, there were cases where I kept miscellanea or some such in one file, where it might have been more Subversion-friendly to use a separate directory and to put each item into its own file within that directory.

As a result of the above procedure, I currently have some files in the repository that do not belong there, because they are of a too temporary nature, notably PDFs generated based on the markup files. Had I gone with version control to begin with, they would not be present. As is, I will remove them at a later time, but even after removal they will unnecessarily bloat the repository, as the data is still saved in the history. (There might be some means of deleting the history too, but I have not investigated this.) Fortunately, the problem is limited, as I appear to have given such temporary files a separate directory outside of the snapshot area at a comparatively early stage.

When making the snapshots, I had taken no provisions to filter out “.swp” files, created by my editor, Vim, to prevent parallel editing in two Vims and to keep track of changes not yet “officially” written to disk. These had to be manually deleted before import. (Fortunately, possible with a single “find -iname '*.swp' -delete” over all the snapshots.) There might, my memory is vague, also have been some very early occurrence when I accidentally did add some “.swp” files to the repository and had to delete them again. Working with Subversion from day one, this problem would not have occurred.
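A slightly more cautious variant of that clean-up lists the swap files first and deletes them only on request. A minimal sketch; the function name purge_swap_files and its “delete” argument are assumptions, and “-iname” and “-delete” are extensions found in GNU and BSD find rather than POSIX.

```shell
#!/bin/sh
# List the Vim swap files below directory $1; delete them instead
# if the second argument is "delete".
purge_swap_files() {
    root=$1
    mode=${2:-list}
    if [ "$mode" = delete ]; then
        find "$root" -type f -iname '*.swp' -delete
    else
        find "$root" -type f -iname '*.swp' -print
    fi
}
```

Running purge_swap_files snapshots shows what would go; purge_swap_files snapshots delete removes it.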

I had a very odd issue with “svn mkdir”: Again and again, I used “svn add” instead, correctly received an error message, corrected myself with “svn mkdir”—and then made the exact same mistake the next time around.* The last few times, I came just short of swearing out loud. The issue is the odder, as the regular/non-svn command to create a directory is “mkdir”, which should make “svn mkdir” the obviously correct choice over “svn add”.

*If a directory already exists in the file system, it can be added with “svn add”, but new directories cannot be created this way. If in doubt, how is Subversion to know whether the argument given was intended as a new directory or as a new file?

Excursion on Git vs. Subversion:
Git is superior to Subversion in a great many ways and should likely be the first choice for most, with Subversion having as its main relative strength a lower threshold of knowledge for effective and efficient use.* However, Git’s single largest relative advantage is that it is distributed. Being distributed is great for various collaborative efforts, especially when the collaborators do not necessarily have constant access to a central repository, but is a mere nice-to-have in my situation. Chances are that my own main benefit from using Git for my books would have been a greater familiarity with Git, which would potentially have made me more productive in some later professional setting. (But that hinges on Git actually being used in those settings, and not e.g. Perforce. Cf. an above footnote.)

*But this could (wholly or partially) be a side-effect of different feature sets, as more functionality, all other factors equal, implies more to learn. (Unfortunately, my last non-trivial Git use is too far back for me to make a more explicit comparison.)

Excursion on automatic detection of what happened to deleted files:
I contemplated writing some code to attempt an automatic detection of moved files, e.g. by comparing file names or file contents. At an early stage, this did not seem worth the effort; at a later stage, it was a bit too late. Moreover, there are some tricky issues to consider, including that I sometimes legitimately have files with the same name in different directories (e.g. a separate preface for each of the books), and that files could not just have been renamed but also had their contents changed on the same day (also cf. above), which would have made a match based on file contents non-trivial.* Then there is the issue of multiple files being merged into a new file… My best bet might have been to implement a “gets 80 percent right based on filenames” solution and to take the losses on the remaining 20 percent.

*One fast-to-implement solution could be to use a tool like “diff” on versions of the files that have been reformatted to have one word per line, and see what proportion of the lines/words come out the same and/or whether larger blocks of lines/words come out the same. This is likely to be quite slow over a non-trivial number of files and is likely to be highly imperfect in results, however. (The problem with more sophisticated solutions, be they my own or found somewhere on the Internet, is that the time invested might be larger or considerably larger than the time saved.)
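Both ideas can be sketched briefly: a match on file names and a word-level comparison through “diff”. No such code was written for the import; the function names name_candidates, words, and similarity, and the percentage measure, are assumptions made for illustration.

```shell
#!/bin/sh
# For one vanished path ($1), print those new paths (read from stdin, one
# per line) whose file name contains the old file name,
# e.g. "tournament" matches "23_tournament".
name_candidates() {
    old_name=$(basename "$1")
    while IFS= read -r new_path; do
        case $(basename "$new_path") in
            *"$old_name"*) printf '%s\n' "$new_path" ;;
        esac
    done
}

# Print the contents of file $1 with one word per line.
words() {
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d'
}

# Print the percentage (0 to 100) of the words of file $1 that diff does
# not report as removed or changed when comparing against file $2.
similarity() {
    a=$(mktemp)
    b=$(mktemp)
    words "$1" > "$a"
    words "$2" > "$b"
    total=$(( $(wc -l < "$a") ))
    removed=$(diff "$a" "$b" | grep -c '^< ' || true)
    rm -f "$a" "$b"
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $(( (total - removed) * 100 / total ))
    fi
}
```

The name match returns several candidates when the same file name legitimately exists in more than one directory, and where to set the cut-off for the similarity value would still be a judgment call.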

Excursion on general laziness:
More generally, I seem to have grown more lazy with computer tools over the years. (As with version control, I will try to do better.) For instance, the point where I solve something through a complex regular expression instead of manual editing has shifted to require a greater average mass of text than twenty years ago. Back then, I might have erred on doing regular expressions even for tasks so small that I actually lost time relative manual editing, because I enjoyed the challenge; today, I rarely care about the challenge, might require some self-discipline to go the regexp route, and sometimes find myself doing manual editing even when I know that the regexp would have saved me a little time. (When more than “a little time” is at stake, that is a different story and I am as likely to go the regexp route as in the past.)

Excursion on “perfect is the enemy of good”:
This old saying is repeatedly relevant above, most notably in the original decision to go with Git (a metaphorical “perfect”) over Subversion (a metaphorical “good”), which indirectly led to no version control being used at all… I would have been much better off going with Subversion over going with daily snapshots. Ditto, going with Git over snapshots, even without a refresher, as the basic-most commands are fairly obvious (and partly coinciding with Subversion’s), and as I could have filled in my deficits over the first few days or weeks of work. (What if I screwed up? Well, even if I somehow, in some obscure manner, managed to lose, say, the first week’s worth of repository completely, I would still be no worse off than if I had had no repository to begin with, provided that the working copy was preserved.) However, and in reverse, I repeatedly chose “good” over “perfect” during the later import, in that I made compromises here and there (as is clear from several statements).

Excursion on books vs. code:
Note that books are easier to import in this manner than code. For instance, with code, we have concerns like whether any given state of the repository actually compiles. While this can fail even with normal work, the risk is considerably increased through importing snapshots in this manner, e.g. because snapshots (cf. above) can contain work-in-progress that would not have been committed. With languages like Java, renaming a class requires both a change of the file contents and the file name, as well as changes to all other files that reference the class, and all of this should ideally be committed together. Etc. Correspondingly, much greater compromises, or much greater corrective efforts, would be needed for code.

Excursion on number of files:
A reason for why this import was comparatively slow is the use of many files. (Currently, I seem to have 317 files in my working copy, not counting directories and various automatically generated Subversion files.) It would be possible to get by with a lot less, e.g. a single file per book, a TODO file, and some few various-and-sundry. However, while this would have removed the issue of moved files almost entirely, it would have been a very bad idea with an eye at the actual daily work. Imagine e.g. the extra effort needed to find the right passage for editing or the extra effort for repeatedly jumping back and forth between different chapters. Then there is the issue of later use of the repository, e.g. to revisit the history, to find where an error might have been introduced, whatnot—much easier with many smaller files than several large ones.

(As to what files I have, in a very rough guesstimate: about a quarter are chapters in one of the books, about two dozen are files like shell scripts and continually re-used LaTeX snippets, some few contain TODOs or similar, some few others have a different and varying character, and the remaining clear majority are various pieces of text in progress. The last include e.g. chapters-to-be, individual scenes/passages/whatnot that might or might not be included somewhere at some point, and mere ideas that might or might not be developed into something larger later on.)


Written by michaeleriksson

March 23, 2023 at 8:59 am
