Using a bash script with sed to clean data

After creating a database the first things you will do with it are creating the schema (tables, relationships, indexes, etc) and importing data. One of the options for importing data is to import it from a flat file like txt, csv, tsv, etc. I’m developing an API now and when I imported data from a tsv (tab separated values) file Postgresql started to give errors like this in several lines:

128190:23: actual: ": A Short Film", expected: value separator (\t)

Inspecting the lines from the tsv file with grep they all had the same pattern:

tt0294236	short 'Arry and Arriet's Evening Out	'Arry and 'Arriet's Evening Out	0	1909	Comedy

All the lines that had a word starting with single quotes, like ‘Arry and Arriet’s Evening Out, or double quotes were showing errors, as the importer was trying to find a tab and mistakingly assuming the quotes were the separation between values and not the tab. So, in order to import the data I had to clean it before, removing single quotes from the start of values, which has no value for the API, and all double quotes, since the values are already strings they have no value as well. To clean it I’ve used Sed.

Sed is a stream editor in bash, what it means is it can perform editing operations on text coming from standard input or a file, like insertion, deleting, find and replacing, etc, so it’s really useful to clean data from files. The use case here is to find the data I don’t want and replace it.

Let’s work with two real case here, the lines are these ones:

tt0294236	short	'Arry and Arriet's Evening Out	'Arry and 'Arriet's Evening Out	0	1909	Comedy

tt9913170	tvEpisode	"American Gods"	"American Gods"	0	2019	Talk-Show

On the first one we will have to remove the single quote from the value’s start and on the second we will have to remove all the double quotes. The result has to be this:

tt0294236	short	Arry and Arriet's Evening Out	Arry and 'Arriet's Evening Out	0	1909	Comedy

tt9913170 tvEpisode	American Gods     American Gods	0	2019	Talk-Show

Sed uses regular expressions to find the data we want to replace, a simple table showing some of the patters would be this:

Pattern	Use case
.	Matches any character, including newline
^	Matches the null string at beginning of the pattern space, i.e. what appears after the circumflex must appear at the beginning of the pattern space
$	It is the same as ^, but refers to end of pattern space
\	Used to escape (use) invalid characters in strings

With these pattern you can do a lot of different stuff. We wil start from the basics here and do a simple find and replace. You need to save both lines with the quotes from above in a file, it can be called with-quotes.txt.

One example of how Sed works for replacement would be this:

sed -i 's/"what I want to replace"/"what I will use for as replacement"/g' file.txt

Explaining each part, we have sed which is the command we are using, -i means I’ll do the work on the current file (inline) and not produce the results on another file, s/ for substitution, “what I want to replace” is the pattern of what I want to remove, “what I will use for as replacement”, is what I’ll use for substitution at the places I have removed the characters, /g is for global, so I’ll do the substitution on all the cases and not just the first in each line and file.txt is the file we will do the work on.

So, translating for double quotes removal, which is one of the patterns we want, we have this:

sed -i 's/"//g' with-quotes.txt

Following the explanation I gave before, this will replace double quotes with nothing, so we are halfway to our goal, which is removing double quotes and single quotes at the start of values, and the with-quotes.txt should be like this:

tt0294236	short	'Arry and Arriet's Evening Out	'Arry and 'Arriet's Evening Out	0	1909	Comedy
tt9913170	tvEpisode	American Gods	American Gods	0	2019	Talk-Show

Now we have to remove the single at the start of values. As the values are separated by tabs, all the starting values are preceded by tabs, in which case all starting single quotes will have the following pattern: \t. The \t means the tab. We will replace this pattern with a tab only, and then we will have only the tab remaining and the single quote will disappear.

For this case we should use this command:

sed -i "s/\t'/\t/g" with-quotes.txt

And then we have clean data:

 tt0294236	short	Arry and Arriet's Evening Out	Arry and 'Arriet's Evening Out	0	1909	Comedy

 tt9913170	tvEpisode	American Gods	American Gods	0	2019	Talk-Show

As I have several files that have to be cleaned, it makes sense to have a bash script that does the job instead of running these command in each file. The bash script that will do the job is this one:

    #!/bin/bash
    echo "Type the file name in the current directory you want to clean"
    read filename

    echo "Cleaning double quotes"
    command_out=$(sed -i -e 's/"//g' ./$filename)
    echo $command_out
    
    echo "Cleaning phrases starting with single quotes"
    command_out=$(sed -i -e "s/\t'/\t/g" ./$filename)
    echo $command_out

    echo "Done!"

And then we can run these sed command or eventually other commands we have to use to clean the data in one step