Git can behave in unexpected ways if a starry-eyed developer decides to rewrite history. When we speak of rewriting history, we are not actually directly editing history, but rather presenting it in another way. So what are some common ways people do this? What are the ramifications? And how do we do forensics when something goes wrong? I will attempt to answer all these questions along with offering my candid opinions.
Let us go over how people rewrite history.
The first way is by squashing commits. I find this to be the least reprehensible thing people do with their history. I am not speaking about developers who break things with a bad commit. That has nothing to do with Git. Regardless of the version control in place, developers should not be committing code which causes project breakage unless there is some good reason, such as needing help or sharing and collaborating on work in progress. That is something people do with public knowledge. I am referring to people squashing commits to keep change sets clean. This is a worthy goal and may even be a requirement for some projects.
Another reason for squashing commits may be that ongoing work has needless noise and an issue, which may be addressed by any number of commits, could be encapsulated in a single change set. This could also be an attempt to mitigate breakage of code. I see this as a worthy goal as well. I also see this as a source of problems when things are not always done rigorously, which is often the case in the corporate setting.
As a general rule of thumb, I do not like to squash history for any reason unless the people doing it are completely competent in programming and in using Git. This can be the case in an Open Source project because we can afford to be perfectionists and, in fact, we sometimes must be perfectionists because of the amount of scrutiny. Also, in an Open Source setting, we tend to have pseudo-infinite time and resources.
This is rarely the case in a corporate setting. People usually have more to do than they have time for, and compromises in quality are made more often than not in terms of developer time and developer expertise. In this environment, squashing commits can sometimes be disastrous because the commits and the reasons for them are lost, making it harder to make compromises when necessary. I will give an example. Feature X must be released because of a contractual obligation, but previously working code is broken by it. Now finding the reason feature X breaks something is much harder: the change set cannot be broken down into smaller chunks, because the small commits that made it up were squashed together.
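To make the loss concrete, here is a minimal sketch in a throwaway repository. The branch name feature-x, the file names, and the commit messages are all made up for illustration, and git merge --squash is only one of several ways a squash can happen.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "v1" > app.txt
git add app.txt
git commit -q -m "initial import"

# Three small, separately reviewable commits on a feature branch.
git checkout -q -b feature-x
for step in 1 2 3; do
    echo "step $step" > "part$step.txt"
    git add "part$step.txt"
    git commit -q -m "feature X, step $step"
done

# Squash them into a single commit on main.
git checkout -q main
git merge --squash feature-x >/dev/null
git commit -q -m "feature X"

# main now has the initial commit plus one squashed commit.
main_count=$(git rev-list --count main)

# The individual steps are not part of main's history.
if git merge-base --is-ancestor feature-x main; then
    reachable=yes
else
    reachable=no
fi

echo "commits on main: $main_count"
echo "feature-x commits reachable from main: $reachable"
```

Once the feature-x branch is deleted, the three steps are gone from anything a colleague will ever clone, and feature X can only be examined as one lump.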
The moral of the story is DO NOT SQUASH COMMITS! Keep change sets small and mostly working if possible. Be descriptive of why a certain change is made in commit messages and use the merge model so the change can be tracked back to where it originated.
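As a rough sketch of what I mean by the merge model, the script below keeps two small commits on a topic branch and merges them without squashing. The --no-ff flag is my own assumption here (it forces a merge commit even when a fast-forward would do), and the branch name fix-login and the file names are hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "login v1" > login.txt
git add login.txt
git commit -q -m "initial login code"

# Small, descriptive commits on a topic branch.
git checkout -q -b fix-login
echo "reject empty password" >> login.txt
git commit -q -am "Reject empty passwords: they bypassed the length check"
echo "test empty password" > login_test.txt
git add login_test.txt
git commit -q -m "Add regression test for empty passwords"

# Merge with an explicit merge commit so the origin of the change is recorded.
git checkout -q main
git merge -q --no-ff -m "Merge branch 'fix-login'" fix-login

# The merge commit's second parent is the tip of the topic branch.
second_parent=$(git rev-parse HEAD^2)
topic_tip=$(git rev-parse fix-login)
total=$(git rev-list --count HEAD)

echo "second parent: $second_parent"
echo "topic tip:     $topic_tip"
echo "commits reachable from main: $total"
```

Every small commit stays reachable from main, and the merge commit itself says where the work came from.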
A second common way to rewrite history is rebasing. I will just say it straight up: I hate rebasing and I see no reason for it. The *only* time I use the rebasing model is when I have to interact with Subversion or CVS. Fortunately this happens less and less nowadays. Commits with multiple parents do not exist in those version control systems, so there is nothing that can be done but rebase before code is pushed up to them. Each commit must appear to have one parent. Bad things will happen if commits are not rebased to Subversion branches.
In my mind, rebasing hinders Git and its inherent advantages. But this mindset of linear history is often carried over by folks who cannot get their minds out of those archaic ways of thinking. It might also be claimed to be a way for people to present a linear, easy to follow change set for public consumption. I think more often than not, it is done to placate the obsessive-compulsive tendencies of developers. It is even possible to reorder commits in a rebase interactively. Talk about confusing!
So what happens when history is replayed from another branch onto an existing branch? And why is it bad? Well, new commits have new SHA-1 hashes. How is it determined if something was committed or not when the SHA-1 hashes are not the same but the change sets themselves are? There are ways which I will describe later, but we should not make this any harder than it needs to be. And what about merges that have already been dealt with? Too bad sucker, everyone must deal with them again because the hashes are different and the parent history and merge commits are gone!
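The hash change is easy to see for yourself. This is a minimal sketch in a throwaway repository, with hypothetical branch and file names, that records the hash of a commit before and after it is rebased.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "base"

# One commit on a topic branch.
git checkout -q -b topic
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "add feature"
before=$(git rev-parse HEAD)

# Meanwhile main moves on.
git checkout -q main
echo "more" > main.txt
git add main.txt
git commit -q -m "advance main"

# Replay the topic commit on top of the new main.
git checkout -q topic
git rebase -q main
after=$(git rev-parse HEAD)
subject=$(git log -1 --format=%s)

echo "before rebase: $before"
echo "after rebase:  $after"
echo "subject:       $subject"
```

The message and the change are identical, but the commit has a new parent and therefore a new hash. Anyone who pulled the old one now holds a commit that Git considers unrelated to the new one.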
Alrightythen… we have figured out how to lose information, redo work that was already done, and make reverting harder when a problem arises! Rebasing is a bad idea in every way. DO NOT DO IT!
A third way to actually rewrite history is to wheel the history back.
git reset --hard HEAD~50 will wheel back history by 50 commits. This technique has its place. I have done it, normally on a non-shared branch when I am trying out some different branches to see if there are conflicts between them or to determine where something went wrong. But that is something I would never do to a shared repository. It is not worth it. Revert and share that. If that is not done, someone else will almost certainly have pulled the discarded commits already, and they will reappear in the repository as soon as that person pushes. And rightfully so! Developers are not helping anyone by trying to wheel back history. Share the revert and do it right.
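Here is a small sketch of the revert approach, assuming a throwaway repository with made-up file contents and messages. The bad commit stays in history and a new commit undoes it, so nothing that others have pulled is invalidated.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "good" > app.txt
git add app.txt
git commit -q -m "good change"

echo "bad" > app.txt
git commit -q -am "bad change"
bad=$(git rev-parse HEAD)

# Undo the bad change with a new commit instead of discarding history.
git revert --no-edit HEAD >/dev/null

content=$(cat app.txt)
count=$(git rev-list --count HEAD)

# The bad commit is still an ancestor of the branch tip.
if git merge-base --is-ancestor "$bad" HEAD; then
    still_in_history=yes
else
    still_in_history=no
fi

echo "file content: $content"
echo "commits: $count"
echo "bad commit still in history: $still_in_history"
```

The file is back to its good state, and the history honestly shows both the mistake and the fix.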
A fourth (very useful) way to rewrite history is to amend commits with git commit --amend. I actually do this all the time *if* I have not shared my work! Amending a commit to add another change, fix a mistake, or update the message to be more meaningful makes a lot of sense and no one should be afraid to do it. The better the commits, the better the history, and the more useful Git will be to the project.
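A minimal sketch of an amend on unshared work follows, with a hypothetical file and a deliberately misspelled first message. It fixes both the content and the message of the latest commit.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "first line" > notes.txt
git add notes.txt
git commit -q -m "add notse"
before=$(git rev-parse HEAD)

# Add the forgotten change and fix the typo in the message.
echo "second line" >> notes.txt
git add notes.txt
git commit -q --amend -m "add notes"
after=$(git rev-parse HEAD)

count=$(git rev-list --count HEAD)
subject=$(git log -1 --format=%s)

echo "before amend: $before"
echo "after amend:  $after"
echo "commits: $count, subject: $subject"
```

There is still exactly one commit, but it is a different commit with a different hash, which is precisely why this is only safe before the work has been shared.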
A fifth way is to split a commit or rebase interactively. This is usually not needed unless it becomes necessary to rescue someone from making a huge, monolithic commit, usually because they are still stuck in the old Subversion mindset. But it is possible to do. It might be useful if multiple issues become convoluted and only one should be delivered. In practice, it often makes more sense to pluck out the code manually and make new commits. Again, if code is already shared, the split (rebase interactively) or new commits must be preceded by a revert so others who may have pulled the code do not have unnecessary noise to fix. Remember, using Git is less work, not more.
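For the simplest case, a monolithic commit at the tip of an unshared branch, the split can be done without an interactive rebase. The sketch below assumes exactly that situation; the file names and messages are hypothetical, and a commit buried deeper in history would need the interactive rebase instead.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo "project" > README
git add README
git commit -q -m "initial commit"

# One monolithic commit mixing two unrelated issues.
echo "fix for issue A" > a.txt
echo "feature B" > b.txt
git add a.txt b.txt
git commit -q -m "fix A and add feature B"

# Undo the commit but keep its changes in the working tree.
git reset -q HEAD~1

# Commit each issue separately.
git add a.txt
git commit -q -m "fix issue A"
git add b.txt
git commit -q -m "add feature B"

count=$(git rev-list --count HEAD)
last_files=$(git diff-tree --no-commit-id --name-only -r HEAD)

echo "commits: $count"
echo "files in last commit: $last_files"
```

Now issue A can be delivered on its own if feature B is not ready.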
A sixth, more invasive way is to use what is known as filter-branch. The same caveats exist as with other methods of rewriting history. If people are basing their work off one version and history is completely rewritten, no work will merge cleanly. A filter-branch only makes sense when the repository is not shared, when huge swathes of history need to be fixed, and/or when other developers can be notified that the whole project needs to be cloned again. A typical use case would be an import of an existing code base into Git with certain errors needing to be cleaned up. For example, the committer’s name and email may need to be corrected, or certain files may need to be removed completely from history. This technique can even be used to clean up curse words or other bad commit messages.
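As an illustration of the name and email case, here is a minimal sketch that rewrites a wrong email address across an entire throwaway repository using the --env-filter option. The addresses are made up, and FILTER_BRANCH_SQUELCH_WARNING is set only to silence the warning that recent Git versions print before running filter-branch.

```shell
#!/bin/sh
set -eu

# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Dev"

# Simulate an import that recorded the wrong email address.
git config user.email "wrong@localhost"
echo "one" > file.txt
git add file.txt
git commit -q -m "first imported commit"
echo "two" >> file.txt
git commit -q -am "second imported commit"
old_tip=$(git rev-parse HEAD)

# Rewrite author and committer email on every commit of every ref.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --env-filter '
    if [ "$GIT_AUTHOR_EMAIL" = "wrong@localhost" ]; then
        GIT_AUTHOR_EMAIL="dev@example.com"
        export GIT_AUTHOR_EMAIL
    fi
    if [ "$GIT_COMMITTER_EMAIL" = "wrong@localhost" ]; then
        GIT_COMMITTER_EMAIL="dev@example.com"
        export GIT_COMMITTER_EMAIL
    fi
' -- --all >/dev/null 2>&1

new_tip=$(git rev-parse HEAD)
authors=$(git log --format=%ae | sort -u)
committers=$(git log --format=%ce | sort -u)

echo "old tip: $old_tip"
echo "new tip: $new_tip"
echo "author emails now: $authors"
```

Every commit gets a new hash, which is why anyone with an existing clone has to be told to clone again.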
Git is like any other tool or programming language. It can easily be misused and the results can be disastrous. Part of our responsibility as developers is to not misuse our tools and programming languages. It is part of the reason we are well paid and either admired or scorned. As a general rule it is best to clearly communicate intentions with both code and actions. Clever code, obfuscation, and non-standard practices should be avoided because costs typically far outweigh benefits. I consider rewriting history in Git to be clever, obfuscated and non-standard. It ought to be avoided unless that path makes overwhelming sense.
My next posting, which was originally going to be part of this one, will contain a few ways to deal with the bad situations caused by misuses I have just described.