
Changing the encoding of a git repository

Hello. Due to local specifics, the Linux machines at work use KOI8-R, so every commit in our git repository was made in that local encoding. After some time, we decided to convert the repository to UTF-8. In this article I want to describe how to change the encoding of an existing git repository and, at the same time, correct some of the mistakes made in individual commits.

Warning


In effect, a new repository is created. So before starting the procedure, you need to pause current development and merge all changes into the nominally central repository, where the conversion will be performed. Once the resulting repository has been checked, it must be cloned again on every machine.

Git and Encoding


Git operates on binary data, so it does not care about the encoding of file contents. Commit messages are likewise stored exactly as they were given to git, but each commit also gets an encoding header, which can be used later when the message is requested. If the encoding header is empty, git treats the message as UTF-8.

Two parameters in the [i18n] section configure this behavior:
 [i18n]
     commitencoding = UTF-8
     logoutputencoding = KOI8-R

The first sets the contents of the encoding header for the git commit and git commit-tree commands. The second tells git log, git show and git blame which encoding to convert the text to before showing it to the user. If neither parameter is set, git assumes that logoutputencoding is UTF-8; if only the first is set, git uses its value for the second as well.

This can lead to various errors. For example, if the encoding header of a commit does not match the actual encoding of its message but equals the value of logoutputencoding, git decides that no conversion is needed and prints the message as is. On machines whose locale encoding matches the actual message encoding, the text is displayed correctly, while everywhere else it comes out as garbage.
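The two [i18n] parameters can also be set from the command line instead of editing the config file by hand. The following is only a sketch: it works in a scratch repository created under a temporary directory (the demo_dir variable is an assumption made for illustration), so nothing in a real repository is touched.

```shell
# Create a scratch repository so that no real repository is modified.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir"

# Record new commit messages as UTF-8 ...
git config i18n.commitEncoding UTF-8
# ... but convert them to KOI8-R when git log, git show or git blame print them.
git config i18n.logOutputEncoding KOI8-R

# Read the values back to confirm what git will use.
commit_enc=$(git config --get i18n.commitEncoding)
log_enc=$(git config --get i18n.logOutputEncoding)
echo "commit encoding: $commit_enc, log output encoding: $log_enc"
```

Adding --global to the git config calls would apply the same settings to every repository of the current user rather than to this one only.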

To see the value of the encoding header of each commit, use the following command:

 git log --pretty="%h - '%e': %s"

Read more about the features of the git log command in its documentation.
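The header can also be seen in the raw commit object itself. Below is a minimal sketch, assuming a scratch repository with a made-up identity (Demo, demo@example.com) and a single empty commit; it shows that a commit made with i18n.commitEncoding set to KOI8-R carries an "encoding" line next to the author and committer lines.

```shell
# Scratch repository with a hypothetical identity, used only for this demo.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir"
git config user.name "Demo"
git config user.email "demo@example.com"

# Simulate the old setup: commits are recorded as KOI8-R.
git config i18n.commitEncoding KOI8-R
git commit -q --allow-empty -m "first commit"

# Print the raw commit object; the headers include an "encoding" line.
git cat-file commit HEAD

# %e extracts just the value of that header.
header=$(git log -1 --pretty="%e")
echo "encoding header: $header"
```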

Git filter-branch


So, we come to the main topic of this article. To "rewrite the history" of an existing repository, use the git filter-branch command. It replays all existing commits in order, processing their files or metadata with various filters along the way.

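This article uses three filters: --msg-filter, which rewrites commit messages; --env-filter, which changes the environment in which a commit is made (author and committer data, for example); and --tag-name-filter, which rewrites tag names.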

Each filter option is followed by a command that git filter-branch executes before recording a commit.

To go through the entire repository, pass the --all parameter, separated from the filter options by --, specify HEAD as the target, and rewrite the tags so that they point to the new commits. For the latter, add a tag-name filter with the cat command:

 git filter-branch <filters> --tag-name-filter 'cat' -- --all HEAD

Before changing the encoding of the commit messages, do not forget to set i18n.commitencoding to the correct value: it will be written to the headers of all commits produced by the rewrite.

To convert the encoding of a commit message, use the following command as the message filter:

 'iconv -c -s -f KOI8-R -t UTF-8' 


The git filter-branch command takes the following form:

 git filter-branch --msg-filter 'iconv -c -s -f KOI8-R -t UTF-8' \
     --tag-name-filter 'cat' -- --all HEAD
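Before running this on the real repository, the conversion can be rehearsed on a throwaway one. The sketch below is an illustration under assumptions, not the procedure used on the original repository: the scratch directory, the identity and the test message (the Russian word "тест" written as raw KOI8-R bytes) are all made up, and FILTER_BRANCH_SQUELCH_WARNING=1 is set only to skip the warning pause that newer Git versions add to filter-branch.

```shell
# Scratch repository with a hypothetical identity, used only for this rehearsal.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir"
git config user.name "Demo"
git config user.email "demo@example.com"

# Old setup: messages are recorded as KOI8-R.
git config i18n.commitEncoding KOI8-R
# "тест" as KOI8-R bytes (octal escapes), fed to git commit on stdin.
printf '\324\305\323\324\n' | git commit -q --allow-empty -F -
old_header=$(git log -1 --pretty="%e")

# New setup: switch the header value first, then rewrite the messages.
git config i18n.commitEncoding UTF-8
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
    --msg-filter 'iconv -c -s -f KOI8-R -t UTF-8' \
    --tag-name-filter 'cat' -- --all HEAD

# After the rewrite the header should be empty (git omits it for UTF-8)
# and the raw message should consist of UTF-8 bytes.
new_header=$(git log -1 --pretty="%e")
new_msg=$(git cat-file commit HEAD | tail -n 1)
echo "header before: '$old_header', after: '$new_header'"
```

Comparing the two header values, and looking at a few messages with git log on a UTF-8 terminal, gives a quick sanity check before the real repository is rewritten.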

Since "rewriting history" interferes rather harshly with the workflow, it makes sense (if you decide to do it at all) to fix as many mistakes as possible in the same pass. These may be incorrectly set environment parameters, files that should not have been stored in the repository, the encoding or part of the contents of individual files, and so on.

In particular, I found that a couple of commits had the wrong author e-mail address. Since at that time all commits had been made by me, the problem was solved simply by overwriting this parameter in every commit:

 git filter-branch --msg-filter 'iconv -c -s -f KOI8-R -t UTF-8' \
     --env-filter 'export GIT_AUTHOR_EMAIL="xxx@gmail.com"
                   export GIT_COMMITTER_EMAIL="xxx@gmail.com"' \
     --tag-name-filter 'cat' -- --all HEAD

Naturally, nothing prevents you from using more complex constructs with various conditions and so on.
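As an illustration of such a condition, the env-filter can replace the e-mail address only in commits that actually carry the wrong one, which matters once a repository has more than one author. This is a sketch under assumptions: the scratch repository, the addresses wrong@example.com and right@example.com and the two test commits are invented for the example, and FILTER_BRANCH_SQUELCH_WARNING=1 only skips the warning pause of newer Git versions.

```shell
# Scratch repository with one good commit and one commit with a wrong e-mail.
demo_dir=$(mktemp -d)
git init -q "$demo_dir"
cd "$demo_dir"
git config user.name "Demo"
git config user.email "right@example.com"
git commit -q --allow-empty -m "good commit"
GIT_AUTHOR_EMAIL="wrong@example.com" GIT_COMMITTER_EMAIL="wrong@example.com" \
    git commit -q --allow-empty -m "bad commit"

# Rewrite only the commits whose author or committer e-mail is the wrong one.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --env-filter '
if [ "$GIT_AUTHOR_EMAIL" = "wrong@example.com" ]; then
    export GIT_AUTHOR_EMAIL="xxx@gmail.com"
fi
if [ "$GIT_COMMITTER_EMAIL" = "wrong@example.com" ]; then
    export GIT_COMMITTER_EMAIL="xxx@gmail.com"
fi
' --tag-name-filter 'cat' -- --all HEAD

# List author and committer e-mails, newest commit first.
emails=$(git log --format='%ae %ce')
echo "$emails"
```

The same pattern extends to GIT_AUTHOR_NAME and GIT_COMMITTER_NAME if names need correcting as well.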

In general, the git filter-branch command offers very rich functionality for modifying and fixing a git repository. You can read about all of its features in its documentation.

Source: https://habr.com/ru/post/178069/

