Using iconv under Windows 10 thanks to Docker

Iconv

Iconv is a command-line program, provided on Unix and Unix-like systems, used to convert text files between different character encodings (https://en.wikipedia.org/wiki/Iconv).
For example, this command converts a file from ISO-8859-1 to UTF-8:
iconv -f iso-8859-1 -t utf-8 <infile> -o <outfile>
Sometimes we would like to use iconv to convert the encoding of some text files, but it’s not natively available for Windows.
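
To see what such a conversion does to the bytes, here is a minimal sketch that can be run anywhere iconv is available. The file names latin1.txt and utf8.txt are made up for the illustration, and the output goes through a shell redirection rather than -o, which should give the same result here.

```shell
# Work in a scratch directory so nothing existing is touched.
cd "$(mktemp -d)"

# "café" in ISO-8859-1: the é is the single byte 0xE9 (octal 351).
printf 'caf\351\n' > latin1.txt

# Convert to UTF-8: the é becomes the two bytes 0xC3 0xA9.
iconv -f iso-8859-1 -t utf-8 latin1.txt > utf8.txt

# Compare the sizes: 5 bytes before, 6 bytes after.
wc -c < latin1.txt
wc -c < utf8.txt
```

The one-byte difference is the accented character, which takes two bytes in UTF-8.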

Batch processing with iconv

However, a single iconv invocation writes a single output, so in practice it converts one file at a time.
So if we need to convert a whole set of files, we’ll need to use batch processing.
Let’s say we want to convert the Java sources (*.java) in the src directory from ISO-8859-1 to UTF-8 and overwrite the original files. We could try this:
find ./src -name "*.java" -type f -exec iconv -f iso88591 -t utf8 -o "{}" "{}" \;
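
Since that command overwrites files, it can be worth previewing what the find expression selects before running it. The sketch below uses a throwaway tree with made-up file names (A.java, B.java, README.txt) so that it can be tried safely.

```shell
# Build a small scratch tree to experiment with.
cd "$(mktemp -d)"
mkdir -p src/app
printf 'class A {}\n' > src/A.java
printf 'class B {}\n' > src/app/B.java
printf 'notes\n' > src/README.txt

# Same selection as the conversion command, but only print the matches.
find ./src -name "*.java" -type f -print

# Count the matches; README.txt is not selected.
find ./src -name "*.java" -type f | wc -l
```

Replacing -print with the -exec part then applies the conversion to exactly the files listed.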

Docker

Now let’s suppose we have neither a virtual machine running Linux nor WSL installed on Windows. That’s where Docker can be useful: it provides images in which iconv is already installed. This is the case with images like debian or ubuntu.
First, let’s verify that iconv is present in a given Docker image:
docker run --rm debian:11 which iconv
Or
docker run --rm ubuntu:20.04 which iconv
It should be located at /usr/bin/iconv.

If iconv is present, there is no need to build a custom image that installs it with apt.
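
The same check can also be written as a small shell test, for example to run inside the container shell before starting a conversion. This is only a sketch: the variable name iconv_status and the messages are arbitrary.

```shell
# command -v is a POSIX shell builtin; it succeeds only if iconv is on the PATH.
if command -v iconv > /dev/null 2>&1; then
    iconv_status="found at $(command -v iconv)"
else
    iconv_status="not found"
fi
echo "iconv: $iconv_status"
```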

The final step

The directory containing the sources that we want to convert has to be bind mounted so that our container can access it. We’ll use the -v option of docker run for that.
Here is how we can do that, assuming the sources we want to convert are located in the src directory:

docker run --rm -v "$(pwd)":/code debian:11 find /code/src -name "*.java" -type f -exec iconv -f iso88591 -t utf8 -o "{}" "{}" \;

Note

If you can’t use the pwd command because you don’t use a Unix-like terminal, please read this article: Sharing files between host and container in Docker Desktop for Windows.

But it may fail… 😬

Oh yes, if you try to overwrite the original file with iconv, it may fail! So it’s not always a good idea…

In fact, if you see the following error during the batch conversion (using find -exec):

iconv terminated by signal 7

or if, using iconv on a single file, you get the following error (with exit code 135, that is 128 + 7, so the same signal):

Bus error

Then the likely cause is that the input file is larger than 32 KB. iconv uses a buffer (32 KB, I guess) when reading the original file. If the buffer is full while the input file is still open and iconv tries to write to that same file, the write fails. So it may be a better strategy to write the converted files to another location.

So here is a better way to make it work:

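# -print0 with read -d '' keeps file names containing spaces intact (needs bash)
find . -name "*.java" -type f -print0 | while IFS= read -r -d '' file
do
    iconv -f iso88591 -t utf8 -o "$file.utf8" "$file" &&
    mv -f "$file.utf8" "$file"
done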
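
One thing the loop does not protect against is running it twice: a file that is already UTF-8 would be converted again and its accented characters garbled. As a rough sketch, the hypothetical helper convert_in_place below skips any file that already decodes as UTF-8. This is only a heuristic and an assumption on my side, not something iconv offers directly. It uses a redirection instead of -o, which should behave the same way here.

```shell
# Hypothetical helper: convert one file from ISO-8859-1 to UTF-8 in place,
# going through a temporary file so iconv never reads and writes the same file.
convert_in_place() {
    file=$1
    # Heuristic: if the file already decodes as UTF-8, leave it alone.
    if iconv -f utf-8 -t utf-8 "$file" > /dev/null 2>&1; then
        echo "skip: $file"
        return 0
    fi
    # Replace the original only when the conversion succeeded.
    if iconv -f iso-8859-1 -t utf-8 "$file" > "$file.utf8"; then
        mv -f "$file.utf8" "$file"
        echo "converted: $file"
    else
        rm -f "$file.utf8"
        return 1
    fi
}

# Demo on a scratch file (Sample.java is a made-up name).
cd "$(mktemp -d)"
printf 'caf\351\n' > Sample.java
convert_in_place Sample.java   # prints "converted: Sample.java"
convert_in_place Sample.java   # prints "skip: Sample.java"
```

Inside the loop, the helper would take the place of the iconv and mv pair. Pure ASCII files are skipped as well, which is harmless since converting them would change nothing.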

Bonus

You can try to detect the type and encoding of a file with the Unix command file. It can display the charset used for a text file.
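
As an illustration, here is a small sketch with two made-up sample files. Keep in mind that file guesses the charset from the bytes it sees, so the answer is an indication rather than a guarantee.

```shell
cd "$(mktemp -d)"
printf 'caf\351\n' > latin1.txt        # é as the single byte 0xE9
printf 'caf\303\251\n' > utf8.txt      # é as the two bytes 0xC3 0xA9

# -b drops the file name, --mime-encoding prints only the detected charset.
file -b --mime-encoding latin1.txt utf8.txt
```

Running it before and after a conversion is a quick way to check that iconv did what was expected.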

Alternative without Docker: Git Bash

If you don’t have Docker installed, but you installed Git for Windows, you can use Git Bash.

It contains basic Unix tools like find, file, and above all iconv 🤩!
