Trimming a string with Bash
Note: the below solution should actually work on all POSIX-compatible shells (including Bash and Zsh).
The problem
Let’s say we have a piece of text that looks like this:
this string
has whitespace
everywhere
why
and the task is to delete all of the leading and trailing whitespace characters, that is, trim the string, using Bash.
To make it clear which characters are actually present in the above, here’s the original string with escape sequences shown:
\n\t\t\t \nthis string \t\n has \t whitespace\n\n everywhere \r\n why\t \t\n\t\t
I’m restricting this problem to Bash because in Python it can be done in one line using the built-in strip method:
s.strip() # returns 'this string \t\n has \t whitespace\n\n everywhere \r\n why'
so let’s see how we can do it in Bash.
Finding solutions online
Googling the term ‘trim string bash’ gives this article as the first hit, which offers a couple of solutions.
Much to my surprise, after running their code on my seemingly simple example, none of them worked as expected, and either removed too few, or too many characters.
Other solutions I’ve encountered seem to be assuming we’re dealing with strings that don’t have any newline characters (\n).
Additionally, they often use other utilities (commonly sed, awk, and xargs), so it would be useful to have a solution using only Bash-isms instead.
String removal
In Bash, we can remove a given character or string from the front and the back of an input string using the following syntax:
x='some string'
echo "${x#s}" # will print 'ome string'
echo "${x%g}" # will print 'some strin'
We can also remove any one of the following characters in the square brackets as demonstrated in this SO answer by using:
# removes the specified (whitespace) characters from the beginning
echo "${x#[$'\r\t\n ']}"
Note that the list of characters to be removed is written as a string with a $ prefix ($'...'), so that escape sequences such as \t and \n are expanded to the actual characters, as explained in this Unix.SE answer.
Digging through the Bash man page reveals that we can instead use the character class [:space:] to specify the entire class of whitespace characters, so the generalization of the above is then:
# removes one whitespace character (of _any_ kind) from the beginning
echo "${x#[[:space:]]}"
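As a small sketch of what a single expansion does (the variable names x and y are only illustrative, and printf is used to build the test string so that \t becomes a real tab), note that each expansion removes at most one whitespace character:

```shell
# Test string: space, tab, space, then the text, then space, tab, space.
x="$(printf ' \t some string \t ')"

y="${x#[[:space:]]}"   # drops only the first leading space
y="${y%[[:space:]]}"   # drops only the last trailing space

# Brackets make the remaining whitespace visible: the tabs are still there.
printf '[%s]\n' "${y}"
echo "${#x} -> ${#y}"  # prints '17 -> 15'
```

This is why a single expansion is not enough to trim a string, and some form of repetition is needed.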
Test code
Since we can get the size of a string using ${#variable}, our task is straightforward: keep removing whitespace characters until there’s nothing else to remove, i.e. until size(string before trimming) == size(string after trimming), which is achieved with the following code:
s=' some string '
size_before=${#s}
size_after=0
while [ "${size_before}" -ne "${size_after}" ]
do
    size_before=${#s}
    s="${s#[[:space:]]}"
    s="${s%[[:space:]]}"
    size_after=${#s}
done
echo "${s}" # prints 'some string'
Note that using something like ${s##[[:space:]]} won’t help here.
According to the Bash manual, ## removes the longest prefix matching the pattern, but [[:space:]] only ever matches a single character, so the longest match is still one character long; if our original string was, say, \t\t\n\t actual string, it would just remove the first \t, and leave the rest as-is.
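To see this concretely, here is a small sketch. Its second half shows a commonly used alternative that is not part of the approach in this article: it first captures the whole leading run of whitespace using %% with a negated class, and then strips that run with a single #. The variable names (once, leading, trimmed) are only illustrative.

```shell
# Test string: tab, tab, newline, tab, space, then the text.
s="$(printf '\t\t\n\t actual string')"

# '##' asks for the longest match, but [[:space:]] matches exactly one
# character, so only a single tab is removed, just as with '#'.
once="${s##[[:space:]]}"
echo "${#s} -> ${#once}"        # prints '18 -> 17'

# Alternative: delete everything from the first non-whitespace character
# onwards, which leaves exactly the leading whitespace run...
leading="${s%%[![:space:]]*}"
# ...and then remove that run (quoted, so it is taken literally).
trimmed=${s#"${leading}"}
echo "${trimmed}"               # prints 'actual string'
```

The same trick works for the trailing side with ${s##*[![:space:]]} and %, at the cost of being harder to read than the loop.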
Putting it all together
To make things handy, we can put everything in a function called trimstring, which can then be added to a ~/.bashrc file or similar:
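trimstring(){
    if [ $# -ne 1 ]
    then
        echo "USAGE: trimstring STRING" >&2
        return 1
    fi
    s="${1}"
    size_before=${#s}
    size_after=0
    # strip one whitespace character per side until the length stops changing
    while [ "${size_before}" -ne "${size_after}" ]
    do
        size_before=${#s}
        s="${s#[[:space:]]}"
        s="${s%[[:space:]]}"
        size_after=${#s}
    done
    # printf instead of echo, so that strings like '-n' or ones containing backslashes are printed verbatim
    printf '%s\n' "${s}"
    return 0
}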
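A quick way to check the function against the example string from the beginning is sketched below. The block repeats a compact variant of trimstring so that it runs on its own; the variable names raw and trimmed are just for illustration, and printf is used to build the string so that the escape sequences become real characters.

```shell
# Compact variant of the function above, so this snippet is self-contained.
trimstring(){
    s="${1}"
    size_before=${#s}
    size_after=0
    while [ "${size_before}" -ne "${size_after}" ]
    do
        size_before=${#s}
        s="${s#[[:space:]]}"
        s="${s%[[:space:]]}"
        size_after=${#s}
    done
    printf '%s\n' "${s}"
}

# The example string, with escape sequences turned into real characters.
raw="$(printf '\n\t\t\t \nthis string \t\n has \t whitespace\n\n everywhere \r\n why\t \t\n\t\t')"
trimmed="$(trimstring "${raw}")"

echo "${#raw} -> ${#trimmed}"      # prints '63 -> 51'
printf '%s' "${trimmed}" | od -c   # dump the result with escapes visible
```

The inner whitespace (including the \r\n pair) is left untouched; only the six leading and six trailing characters are removed.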
After a lot of testing, below is a table of commonly used shells in which the above works and doesn’t work.
Shell | Ash | Bash | Dash | Csh | Tcsh | Ksh | Zsh | Fish |
---|---|---|---|---|---|---|---|---|
Works? | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ |
Appendix: solution using GNU Sed
After some more googling, the Sed-based solution from the article mentioned above works if we restrict ourselves to using GNU Sed, which has a -z option [1] that makes it treat the null character, rather than the newline, as the line separator; this means that $ will only match the end of the whole text stream instead of individual newline (\n) characters, while ^ will match the beginning of the stream.
This allows us to make the following script:
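trimstring_sed(){
    s="${1}"
    # pass the string as an argument to '%s', so that any % or \ in it is not interpreted by printf
    s="$(printf '%s' "${s}" | sed -z 's/^[[:space:]]*//')"
    s="$(printf '%s' "${s}" | sed -z 's/[[:space:]]*$//')"
    printf '%s\n' "${s}"
    return 0
}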
The above basically matches zero or more instances of any whitespace character at the beginning and end of the input string, and removes them.
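As a possible variation (a sketch only, assuming GNU Sed is installed as sed; the name trimstring_sed_once is made up for this example), both substitutions can be given to a single sed invocation, which halves the number of processes spawned per call. The string is handed to printf as an argument of '%s' so that percent signs and backslashes in it are kept literally.

```shell
# Hypothetical single-invocation variant; requires GNU Sed for the -z option.
trimstring_sed_once(){
    s="$(printf '%s' "${1}" | sed -z 's/^[[:space:]]*//; s/[[:space:]]*$//')"
    printf '%s\n' "${s}"
}

# The input contains a literal percent sign and embedded newlines and tabs.
trimstring_sed_once "$(printf ' \n\t 100%% done\n\t ')"   # prints '100% done'
```

This still starts an external process for every call, so it should not be expected to match the speed of the pure-shell loop.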
Comparing the timings of trimstring with trimstring_sed gives trimstring an obvious edge when it comes to speed though:
string="$(cat test.txt)" # contains the initial string
time for i in {1..1000}; do trimstring "${string}" > /dev/null; done
real 0m0.222s
user 0m0.217s
sys 0m0.006s
time for i in {1..1000}; do trimstring_sed "${string}" > /dev/null; done
real 0m3.521s
user 0m2.853s
sys 0m1.270s
Thus, there’s more than an order of magnitude difference (roughly 16x here: 0.222 s vs. 3.521 s for 1000 calls) between just using built-in Bash-isms and using an external tool like Sed. Of course, the Python solution is by far the fastest, but only if we’re already inside the interpreter, and aren’t using a shell in the first place.
[1] See this SO answer and/or this Unix.SE answer.