Channel: Debian User Forums

Programming • [Bash] cut and UTF-8 problem

I came across unexpected behaviour when transferring a bash script from a different distro to Debian.
Debian version is: DEBIAN_VERSION_FULL=13.2

The code below is a simplification to make the problem easy to see, not the original script itself:

LANG is set to a UTF-8 locale. I echo a string whose first character is a 2-byte UTF-8 character; "od -c" shows the two bytes of that character.
But piping the string into "cut -c 1-1", which should output that UTF-8 character, only outputs its first byte. "cut -c 2-2", which should give the second character "f", outputs the second byte of the UTF-8 character instead of the real second character. Evidently "cut" does not handle UTF-8 correctly here. Regardless of where such a UTF-8 character is in the string, cut only ever takes a single byte of it (last example).

Is there any explanation for this behaviour, or does anyone know a setting to make it work?

Just as a reminder: this example is very simplified, but it shows the effect. In the real script the "cut" uses different numbers and is only one stage in the middle of a longer pipeline, so there is no need to suggest bash built-in string functions for extracting a single character (that works). What I need is for cut to handle UTF-8 correctly.

I appreciate any input!

Code:

andreas@linux:~$ echo $LANG
de_DE.UTF-8
andreas@linux:~$ echo "öffnen" | od -c
0000000 303 266   f   f   n   e   n  \n
0000010
andreas@linux:~$ echo "öffnen" | cut -c 1-1
�
andreas@linux:~/ahnen/namelist$ echo "öffnen" | cut -c 1-1 | od -c
0000000 303  \n
0000002
andreas@linux:~$ echo "öffnen" | cut -c 2-2
�
andreas@linux:~$ echo "öffnen" | cut -c 2-2 | od -c
0000000 266  \n
0000002
andreas@linux:~$ echo "Töne" | cut -c 2-2 | od -c
0000000 303  \n
0000002

Statistics: Posted by andykim — 2026-01-09 04:50


