Bash, the Data Scientist’s Magnifying Glass

रहिमन देखि बड़ेन को, लघु न दीजिए डारि
जहां काम आवै सुई, कहा करै तलवारि

“Looking at the big and grand, don’t look down upon the small,
Where the needle is needed, what use is the sword!” – Rahim


Python and R have emerged as the languages of choice for Data Science. This is not surprising, given their ability to rapidly prototype analyses, their tooling (like Jupyter Notebook), and the massive communities behind them that provide cheat sheets, tutorials and reliable packages.

These languages serve as microscopes into data. Once you know where to look, you can look as closely as you want.

This microscope approach works well when one knows what one is dealing with. However, most exploratory work a Data Scientist does begins with getting a “feel” of the data. That task is often better suited to a magnifying glass: something that offers a quick closer look at the data, without the overhead that comes with analysis under a microscope.

In the context of computing, Bash serves as that magnifying glass.

A few features justify the claim:

  1. A single Bash command can perform operations that would take several lines of code even in a language as simple as Python (e.g. something as basic as printing the first two lines of a file)
  2. It doesn’t require firing up another environment (of course, to the pedantic, the terminal emulator is an environment too)
  3. Many of the Unix commands available from Bash are more efficient than equivalent implementations one might write in Python
  4. They are available on all POSIX-compliant operating systems. This means you can run them whether you’re on macOS or Linux.
  5. This portability is very useful when working in environments like compute instances on AWS (Amazon Web Services) and the HPC grids/clusters popular in academic settings.

A reasonable command of Bash lets one prepare the ground for a more detailed analysis.
Here’s a walk-through of some useful bash commands.

Getting Started

  1. If you’re on a Linux or macOS system, just fire up your terminal
  2. If you’re on a Windows system
    1. Ask yourself why!
    2. Reconsider life choices.
    3. Install Ubuntu Bash for Windows after enabling the Windows Subsystem for Linux (in Windows 10)

Download an example dataset

Kaggle’s introductory data science/ML challenge is on data from the Titanic. We are going to use that dataset to demonstrate most of the commands here. You can find the dataset and the tutorial here: https://www.kaggle.com/c/titanic/data. The dataset contains three files. For now we can ignore the one named “gender_submission.csv”.

Getting a feel of the data

Often, faced with a new dataset, the first thing one has to do is see what the data is like. The tutorial at Kaggle recommends using Pandas to load the CSV file into a Jupyter Notebook and then calling the head() function to see what the training data looks like.

Just issuing the head command achieves the exact same effect: head train.csv
To control how many lines you see, use head -n 2 train.csv
Lastly, to have a look at the last few lines, use tail in exactly the same way: tail -n 10 train.csv
In both cases, the -n flag takes the number of lines to be displayed.
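As a sketch (using a tiny stand-in file, since we can’t assume train.csv is on your machine):

```shell
# A tiny stand-in for train.csv, so the snippet runs anywhere
printf 'PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n3,1,3\n4,1,1\n' > sample.csv

head -n 2 sample.csv   # header plus the first record
tail -n 1 sample.csv   # the last record only
```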


See your data in totality

Now that we know what to expect from the data, we may want to glance through the whole data set. We may be interested to see if there are some anomalous values in certain columns.

Run more train.csv and the whole file is printed to the shell. You can scroll down by pressing space or return. The more command moves through the data in only one direction: you can’t scroll up.
At the end of the file, more relinquishes control to the shell. If you want to move both up and down the file, use less train.csv. Inside less, you can scroll with the arrow keys.

You can search inside the text by typing “/<search string>” and hitting enter. Press “n” to go to the next match. Try searching for “,,”. This highlights every place where two commas appear together. We now know this CSV contains some blank fields.

When you’re done, press “q” to exit.


See how much data there is

Just use the wc command: wc train.csv prints the line, word and byte counts of the file. More usefully, wc -l train.csv prints just the number of lines. It reports 892 lines; minus the header, we now know the training data has 891 records!
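A minimal sketch on a stand-in file (remember that wc -l counts the header line too):

```shell
# Stand-in for train.csv: one header line plus three records
printf 'PassengerId,Survived\n1,0\n2,1\n3,1\n' > sample.csv

wc -l sample.csv                        # line count (header + records)
echo $(( $(wc -l < sample.csv) - 1 ))   # record count, excluding the header
```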

Find if your data has things you are looking for

We now know that the data has some blank fields. So let us find all of them to get a sense of the scale of the issue. Here you will find the grep command useful. Use grep ",," train.csv to print every line that matches the given pattern. But this isn’t too useful in its own right.


Now, say grep ",," train.csv | wc -l. The vertical bar character is called a “pipe”. It feeds the STDOUT of the command on its left into the STDIN of the command on its right. The output here is the number of lines with at least one empty field.
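A sketch of the same pipeline on a stand-in file with two blank fields:

```shell
# Records 1 and 3 have a blank Age field, which shows up as ",,"
printf 'Id,Name,Age,Fare\n1,Smith,,7.25\n2,Jones,38,71.28\n3,Brown,,8.05\n' > sample.csv

grep ',,' sample.csv           # show the offending rows
grep ',,' sample.csv | wc -l   # count them
```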

But this too is only somewhat useful. Let us go one step further and ask: how many rows have a missing Age? A head -n 1 train.csv shows that Age is the 6th column. Since this is a CSV file, the fields in each record are separated by commas.

So let us use the cut command to “cut” the contents of the file according to a delimiter. Say cut -d"," -f6 train.csv | more.


But there seems to be an issue: while the first entry says “Age”, the following lines are all the sex of the passengers. A head -n 3 train.csv tells you that the names of all passengers are written as “Last Name, First Name”. There’s an extra comma you’ll have to account for. Now type cut -d"," -f7 train.csv | grep "^$" | wc -l. This command prints all ages, then “greps” the blank ones using the regex anchors for beginning and end of line, and then counts the lines. We now know there are 177 records with the Age missing.
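A sketch of the shifted-column effect on a stand-in file (here the quoted comma in each name shifts Age from nominal field 4 to field 5, just as it shifts Age to field 7 in the real train.csv):

```shell
printf 'PassengerId,Name,Sex,Age,Fare\n1,"Braund, Mr. Owen",male,22,7.25\n2,"Heikkinen, Miss. Laina",female,,7.92\n' > sample.csv

# cut splits on every comma, including the one inside the quoted name,
# so Age lands in field 5 rather than field 4
cut -d"," -f5 sample.csv | grep "^$" | wc -l   # count the blank ages
```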

Work on data streams to pre-process

Now that we have a sense of where the blanks are in the data, we may want to replace them to allow some sort of processing. Let us say we decided to change all the blank ages to 30. We can use the sed command with a substitution expression: sed -i -e "s/male,,/male,30,/" train.csv. We use the string “male” to anchor on the blanks because another field can also be blank, but Age sits right next to Sex (and since “female” ends in “male”, the pattern covers both sexes).
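A sketch of the same substitution on a stand-in file:

```shell
# Records 1 and 3 have blank ages
printf 'PassengerId,Sex,Age,Fare\n1,male,,7.25\n2,female,30,71.28\n3,female,,7.92\n' > sample.csv

# "female" ends in "male", so this pattern fills blank ages for both sexes.
# Note: BSD/macOS sed wants `sed -i '' -e ...` instead of `sed -i -e ...`
sed -i -e 's/male,,/male,30,/' sample.csv

cat sample.csv   # no ",," remains
```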

So let’s see what the distribution of age is. Enter the following command: cut -d, -f7 train.csv | tail -n +2 | sort -n > age. Here we print all the age values, exclude the header line (tail -n +2 prints from the second line onwards), pipe the values into sort, ask it to sort numerically (using -n), and write the result to a file called age.
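The same pipeline, sketched on a stand-in column; piping the sorted file through uniq -c also gives a quick text histogram if you don’t have a plotting tool installed:

```shell
# A stand-in for the extracted Age column, header included
printf 'Age\n41\n22\n30\n22\n30\n30\n' > ages.csv

tail -n +2 ages.csv | sort -n > age   # drop the header, sort numerically
uniq -c age                           # count of passengers at each age
```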

Now let us run xmgrace age. [Note: this requires installing the Grace package.]

Below is a plot showing the value of age in each line :

Value of Age in each line, plot created using the Grace package in Bash

We can see in the above plot:

  • There are ages from 0 through 80
  • The plateaus tell us that many people share that age
  • The longest plateau is at 30 (the value we replaced blanks with)

Now try going to Data->Transformation->Histograms. Set the min, max and number of bins, and you’ll get a histogram. Use the magnification button to zoom into it.

Histogram created using the Grace Package in Bash

Perform some sort of conditional extraction

The cut command is great at what it does, but sometimes we need more power. Let’s try awk. Awk is a programming language in its own right, but we leave it to the reader to explore it in depth.
For now, try the following command: awk -F"," '{if($2==1&&$3==3){print $3}}' train.csv | wc -l. Here we tell awk to use “,” as the delimiter and print Pclass for every record where Survived is 1 and Pclass is 3; then we count the number of such records. What we get is the number of people who belonged to class 3 and survived: 119.
Now try changing the if condition to $2==0. We get 372. Repeat this with other combinations.
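A sketch of the same idea on a stand-in file (with the counts scaled down, of course):

```shell
printf 'Id,Survived,Pclass\n1,1,3\n2,0,3\n3,1,1\n4,1,3\n5,0,1\n' > sample.csv

awk -F"," '{if($2==1&&$3==3){print $3}}' sample.csv | wc -l   # class-3 survivors
awk -F"," '{if($2==0&&$3==3){print $3}}' sample.csv | wc -l   # class-3 non-survivors
```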

We can easily get a hint of the relation between passenger class and the chance of survival, with class 1 passengers being more likely to survive.

When we build a prediction model, we can be sure to use the Pclass column.

Conclusion

Bash comes with a bit of a learning curve, but the returns on that investment are large in terms of convenience. The portability of the environment also means you can work on most systems you will encounter as a Data Scientist.



Bonus

Tools you may want to check out if you fall in love with the white cursor blinking on the black background <3

Vim

A powerful and customisable text editor for the command line.

Emacs

A (potentially) complete operating system that allows you to edit files, take notes, work on your email, and do whatever else you can program it to do in Elisp.

Disclaimer: We don’t know much about Emacs other than that people who use it would have been severely offended if we didn’t include it in an article that mentioned Vim. They would then have proceeded to explain how Emacs is superior. They’re nice folk though! Not a cult! Definitely not a cult!

wget and curl

Useful for downloading your data, especially on remote machines.

Bash Threads

A simple job queue to run and manage multiple commands, exploiting a multi-core machine well.

Say you want to run 10000 matrix multiplications, each of which takes 5 seconds. You wouldn’t want to run them one by one, as that would take nearly 14 hours. You couldn’t fire them all off together either, as that would overwhelm your machine. But if you ran them 10 at a time, you would get a big speed-up without bringing the machine down. Just give bash_threads the 10000 commands and tell it that it can run up to 10 at a time; it will work through them as a queue.

Wondering how you’d type out 10000 commands into the command list for bash_threads? Think for loop in Bash!
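A sketch of generating such a command list with a for loop ("simulate.py" is a placeholder for your actual job, and how you feed the list to bash_threads depends on that tool’s own interface):

```shell
# Write one command per line; a job-queue tool can then consume this file
for i in $(seq 1 10000); do
    echo "python simulate.py --run $i"
done > commands.txt

wc -l < commands.txt   # 10000 commands, ready to be queued
```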

Pandoc and Mermaid

Take notes in Markdown and convert them into pretty PDFs using pandoc. Express flowcharts, sequence diagrams and Gantt charts in Mermaid and convert them to nice pictures.
Integrate Mermaid into pandoc to express notes and diagrams in a single file, and get a single PDF of notes with nice diagrams.


Also published on Medium.

Written by Abhirath Batra
