Michael Eriksson's Blog

A Swede in Germany

Version control changing how the user works / diffs and line-breaks

In my recent writings on Subversion and version control, I also discussed how the use of version control can change how someone works (cf. parts of [1]). Since then, a much better example has occurred to me, namely the potentially strong incentive to reformat texts less often:

Both traditional diff-/merge-tools and traditional editors tend to be line-oriented (and for good reason: it makes many types of work easier). Ditto many other tools, especially those, like Subversion, that make heavy use of diff-/merge-abilities.

However, LaTeX, which I use for my books, treats line-breaks within a paragraph entirely or almost* entirely as if they were regular spaces. Similarly, LaTeX treats two consecutive spaces within a paragraph virtually as one space. Etc.

*It has been years since I studied the details, and it is very possible that I overlook some subtleties or special cases. However, for the purposes of the below, no such subtleties/whatnot are relevant.
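A rough Python model of the whitespace treatment just described, as a sketch only: it assumes that blank lines separate paragraphs and that any run of spaces, tabs, and single line-breaks within a paragraph counts as one space. The function name “latex_like_paragraphs” is made up for illustration, and real LaTeX has further subtleties (comments, control sequences, etc.) that are ignored here.

```python
import re


def latex_like_paragraphs(source: str) -> list[str]:
    """Approximate how the paragraph text is seen: blank lines split
    paragraphs; every other run of whitespace collapses to one space."""
    paragraphs = re.split(r"\n\s*\n", source.strip())
    return [" ".join(p.split()) for p in paragraphs]


# Two different layouts of the same raw text (hypothetical sample).
layout_a = "Both  diffs and\neditors are\nline-oriented.\n\nLaTeX is not."
layout_b = "Both diffs and editors\nare line-oriented.\n\nLaTeX is not."

print(latex_like_paragraphs(layout_a))
print(latex_like_paragraphs(layout_a) == latex_like_paragraphs(layout_b))  # True
```

Under this simplified model, the two layouts are the same text, although they share only one line (the last) when compared line by line.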

As so much of the textual formatting of the markup, e.g. with regard to line-breaks within paragraphs, does not matter, a LaTeX author will usually format the raw text in a manner that is convenient for viewing/editing as text, while relying on a mixture of LaTeX automatisms and, when needed, explicit instructions* of his own to generate a presentable formatting of the output (e.g. a generated PDF).

*For instance, “\,” adds a small horizontal space, suitable e.g. to put the “e.” and “g.” in “e.g.” slightly further apart than when written together, but not as far apart as when a full space is used. Use of the thinsp[ace] character entity reference in HTML has a similar effect. Contrast (assuming correct rendering) “e.g.”, “e. g.”, and “e. g.”.

However, a formatting change that is harmless with regard to LaTeX and its generated output might trip up a line-based diff or merge. For instance, a line-based diff would recognize a one-line difference (on the second line) between

We hold these truths to be self-evident,
that all men are created equal,
that they are endowed by their Creator with certain unalienable Rights,
that among these are Life,
Liberty and the pursuit of Happiness.

and

We hold these truths to be self-evident,
that all mice are created equal,
that they are endowed by their Creator with certain unalienable Rights,
that among these are Life,
Liberty and the pursuit of Happiness.

However, if we compare the first with

We hold these truths to be self-evident, that all men are created equal,
that they are endowed by their Creator
with certain unalienable Rights,
that among these are Life, Liberty and the pursuit of Happiness.

the result is a complete line-wise* difference—while LaTeX would have rendered both texts identically and while they read identically from a human point of view.

*The standard diff-command would show this as five lines disappearing and four new lines appearing. Some other tool (or some other set of settings) might show e.g. two lines disappearing, one line appearing, and three lines being changed.
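The two comparisons above can be reproduced with a short Python sketch. Python's “difflib” module stands in for the diff command here, which is an assumption of the example and not a claim about how any particular diff tool works internally; the helper name “line_changes” is made up.

```python
import difflib

original = [
    "We hold these truths to be self-evident,",
    "that all men are created equal,",
    "that they are endowed by their Creator with certain unalienable Rights,",
    "that among these are Life,",
    "Liberty and the pursuit of Happiness.",
]

# Same line-breaks, one word changed on the second line.
edited = list(original)
edited[1] = "that all mice are created equal,"

# Same words, different line-breaks.
reflowed = [
    "We hold these truths to be self-evident, that all men are created equal,",
    "that they are endowed by their Creator",
    "with certain unalienable Rights,",
    "that among these are Life, Liberty and the pursuit of Happiness.",
]


def line_changes(old, new):
    """Return (lines_removed, lines_added) for a line-based comparison."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    removed = added = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            removed += i2 - i1
            added += j2 - j1
    return removed, added


print(line_changes(original, edited))    # (1, 1)
print(line_changes(original, reflowed))  # (5, 4)
```

The one-word edit touches a single line, while the pure reformatting, which changes no word at all, replaces all five lines with four new ones.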

Such re-formatting, however, is very common. A notable case is a small edit in one line that increases the length of that line, followed by a semi-automatic reformatting* to keep all lines within the paragraph beneath a certain length (and all but the last close to that length). In a worst case, this leads to every single line in the paragraph being changed and creates a nightmare in terms of diffs and version control.

*In Vim, my editor of choice, e.g. by just typing “gq}” to format from the current cursor position to the end of the paragraph.
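The cascade caused by a small edit plus re-wrapping can be sketched in Python. Here, “textwrap” merely stands in for the editor's reformatting (it is not claimed to match Vim's behavior exactly), and the line limit of 40 characters, the sample edit, and the helper names are assumptions chosen to keep the example short.

```python
import difflib
import textwrap

WIDTH = 40  # assumed line limit, for illustration only


def reflow(text, width=WIDTH):
    """Re-wrap a paragraph so that no line exceeds the given width."""
    return textwrap.wrap(" ".join(text.split()), width=width,
                         break_on_hyphens=False)


def lines_touched(old, new):
    """Count the lines of old that a line-based comparison reports as changed."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return sum(i2 - i1
               for tag, i1, i2, j1, j2 in matcher.get_opcodes()
               if tag != "equal")


before_text = (
    "We hold these truths to be self-evident, that all men are created "
    "equal, that they are endowed by their Creator with certain "
    "unalienable Rights, that among these are Life, Liberty and the "
    "pursuit of Happiness."
)
# A small edit that lengthens one line of the wrapped paragraph.
after_text = before_text.replace("all men", "all men and mice")

before = reflow(before_text)
after = reflow(after_text)

print("\n".join(after))
print(lines_touched(before, after), "of", len(before), "lines differ")
```

With these inputs, the two added words push a word over the limit on the edited line, and every later line shifts in turn, so that only the first line of the paragraph survives unchanged.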

(In contrast, code is much less likely to be affected by such drastic changes. It happens, as with e.g. inconsistent use of tabs vs. spaces between different users/editors, but much more rarely.)

Excursion on mitigation:
The issue can be mitigated by instructing diffs to partially ignore white-space, with the effect e.g. that “abc_def” (one space, indicated by “_”) is treated as identical to “abc__def” (two spaces). However, this cannot be extended to line-breaks without reducing the benefits of diffs very considerably—and it would require a non-trivial intervention with the tools that I know. (For code, however, ignoring even regular white-spaces can be extremely helpful.)
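A minimal Python sketch of this mitigation and its limit, again with “difflib” standing in for a real diff tool and with made-up helper names: collapsing runs of spaces within each line makes “abc_def” and “abc__def” compare equal, but does nothing for a paragraph whose line-breaks have moved.

```python
import difflib
import re


def collapse_spaces(line):
    """Treat any run of spaces or tabs within a line as a single space."""
    return re.sub(r"[ \t]+", " ", line.strip())


def lines_differing(old, new, normalize=None):
    """Count lines of old reported as changed, optionally after normalizing."""
    if normalize is not None:
        old = [normalize(line) for line in old]
        new = [normalize(line) for line in new]
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return sum(i2 - i1
               for tag, i1, i2, j1, j2 in matcher.get_opcodes()
               if tag != "equal")


print(lines_differing(["abc def"], ["abc  def"]))                   # 1
print(lines_differing(["abc def"], ["abc  def"], collapse_spaces))  # 0

original = [
    "We hold these truths to be self-evident,",
    "that all men are created equal,",
    "that they are endowed by their Creator with certain unalienable Rights,",
    "that among these are Life,",
    "Liberty and the pursuit of Happiness.",
]
reflowed = [
    "We hold these truths to be self-evident, that all men are created equal,",
    "that they are endowed by their Creator",
    "with certain unalienable Rights,",
    "that among these are Life, Liberty and the pursuit of Happiness.",
]

# Ignoring the amount of white-space within lines does not help here.
print(lines_differing(original, reflowed, collapse_spaces))  # 5

# Treating line-breaks as spaces too makes the texts equal, but only by
# giving up the line structure that a line-based diff reports against.
print(" ".join(" ".join(original).split())
      == " ".join(" ".join(reflowed).split()))  # True
```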

Excursion on subtleties in formatting and my own markup:
Above, I tried to give the “We hold” examples using HTML’s “pre” tag for pre-formatted text. (An exception where HTML does care about line-breaks. It is otherwise almost entirely agnostic.) This failed, because I explicitly remove line-breaks from my own markup during generation, which makes my markup language even more agnostic than HTML (and thereby circumvents any use of HTML’s “pre” tag to preserve line-breaks).

The reason? In my early days of experimenting with W-rdpr-ss and “post by email”, I had the problem that W-rdpr-ss messed* up line-breaks, forcing me to take corrective actions.

*This is so long ago that I am uncertain of the details. However, I believe that it involved spuriously turning a simple line-break into a completely empty line or, equally spuriously, surrounding individual lines with “p” tags, thereby converting a line-break within a paragraph (which should have been ignored) into a paragraph break.

Instead, I proceeded by just giving the markup-indication for a “hard” line-break at the end of each line, which is converted into a “br” tag during HTML generation and remains in effect even as (regular) line-breaks are stripped.

(This might have been for the best, insofar as mixing HTML tags with my own markup is potentially dangerous. I have done it on a few occasions, e.g. to demonstrate thin spaces in this text, but I normally avoid it, as it relies on the target language/format being HTML. If I were to generate e.g. LaTeX as output instead, I could fall flat on my face.)
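The general idea of stripping regular line-breaks while keeping marked “hard” ones can be sketched in Python. This is not the actual generation code: the marker “//” and the function name “to_html” are purely hypothetical, as the real markup syntax is not given above.

```python
HARD_BREAK = "//"  # hypothetical marker; the real markup syntax is not given


def to_html(markup: str) -> str:
    """Strip regular line-breaks; turn marked hard line-breaks into br tags."""
    parts = []
    for line in markup.splitlines():
        line = line.strip()
        if line.endswith(HARD_BREAK):
            # The hard break survives as a tag, independent of the line-break.
            line = line[: -len(HARD_BREAK)].rstrip() + "<br>"
        if line:
            parts.append(line)
    # Regular line-breaks are dropped: the lines are joined with spaces.
    return " ".join(parts)


sample = (
    "We hold these truths to be self-evident, //\n"
    "that all men are created equal, //\n"
    "that they are endowed\n"
    "by their Creator."
)
print(to_html(sample))
```

In this sketch, only the two marked lines end in a “br” tag; the unmarked line-break between “endowed” and “by” simply becomes a space.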

Excursion on my own markup(s):
I have two similar-but-not-identical markup languages: Firstly, a more powerful and useful one that I use(d) for my website. Secondly, the more primitive one that I use for W-rdpr-ss. I am disinclined to make the latter more powerful than it is, as W-rdpr-ss causes so many odd disturbances that I cannot rely on the result (and as e.g. the above line-break stripping might cause problems here and there). Instead, I will get by until I finally get around to setting up my website for blogging and abandon W-rdpr-ss, at which time I will either return to the other language or modify the current one as needed.

Written by michaeleriksson

March 27, 2023 at 10:46 pm
