Back during one of my undergraduate astronomy data analysis courses, we briefly covered useful shell commands for searching and parsing data such as grep, sed, cut, and awk. At the time I remember thinking that they looked very useful, but without practice I quickly reverted to more familiar and less efficient methods.

Fast forward to the week before I started the Zipfian Academy. In our pre-course work, we were tasked with parsing and performing preliminary analysis on New York nursing home bed census data using three different approaches: SQLite3, Python, and Shell. Between the three methods, Shell was by far the shortest and (arguably) most elegant solution. In order to demonstrate my point, I will use some super-useful shell commands on two MLB data sets containing player batting stats.

Shell scripting commands can be dense, so I will do my best to spell out each command in the plainest English.

Current American League Batting Stats

Converting from a tab- or space-delimited format to a comma-delimited format

stats.tsv is currently delimited by whitespace of varying lengths, which is not a very useful format. The following command uses cat to send all of stats.tsv to standard output, then pipes (|) that output to sed -E, which uses “extended” regular expressions to find each run of one or more whitespace characters ([[:space:]]+) and replace it with a single comma. Finally, the resulting output is redirected (>) to stats.csv.

$ cat stats.tsv | sed -E "s/[[:space:]]+/,/g" > stats.csv
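
As an aside, the extra cat is not strictly necessary, since sed can read the file directly. The following sketch should give the same result (tr -s would be another way to collapse the whitespace):

$ sed -E "s/[[:space:]]+/,/g" stats.tsv > stats.csv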

Displaying a .csv file in well-formatted columns

Show the header and the top five rows of stats.csv, then pipe to column -s, -t, which turns the comma-delimited STDOUT into cleanly separated columns.

$ head -6 stats.csv | column -s, -t
RK  Last      F.  Team  Pos  G   AB   R   H   2B  3B  HR  RBI  BB  SO  SB  CS  AVG   OBP   SLG   OPS
1   Bautista  J   TOR   RF   44  157  35  47  8   0   11  30   37  32  1   1   .299  .437  .561  .997
2   Choo      S   TEX   LF   38  132  20  40  7   1   4   11   24  40  3   2   .303  .427  .462  .889
3   Dunn      A   CWS   1B   34  114  14  29  7   0   6   15   27  39  1   0   .254  .399  .474  .872
4   Napoli    M   BOS   1B   40  143  16  38  9   0   5   22   29  44  0   1   .266  .397  .434  .830
5   Ortiz     D   BOS   DH   40  152  22  46  8   0   11  25   22  29  0   0   .303  .392  .572  .964
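
To page through the entire file in this format, rather than just the first few rows, column pairs nicely with a pager. A quick sketch (the -S flag tells less not to wrap long lines):

$ column -s, -t < stats.csv | less -S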

Count unique teams with word count

tail -n +2 stats.csv will output lines from row 2 to the end of stats.csv. In other words, it outputs every line except the header. cut -d, will allow you to select fields from a comma-delimited file, and -f4 tells cut to take the 4th column (Team). Finally, sort the teams, perform uniq to remove duplicate adjacent lines, and use wc -l to count the lines.

$ tail -n +2 stats.csv | cut -d, -f4 | sort | uniq | wc -l
    15
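
A small variation worth knowing: swap wc -l for the -c flag on uniq and you get a per-team player count instead of just the number of teams. A sketch (output omitted):

$ tail -n +2 stats.csv | cut -d, -f4 | sort | uniq -c | sort -rn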

Show only Oakland Athletics players and sort by batting average

grep OAK performs a regular expression search for OAK, which returns all of the Athletics players in the list. Then sort numerically (-n) with a comma delimiter (-t,), keying on field 18, the AVG column (-k18,18), in descending order (-r), and pipe to column for the same clean formatting as before.

$ grep OAK stats.csv | sort -n -t, -k18,18 -r | column -s, -t
14  Moss       B  OAK  1B  42  143  20  41  8   1  9   36  15  31  1  0  .287  .372  .545  .917
29  Donaldson  J  OAK  3B  42  173  34  48  10  1  10  34  19  39  1  0  .277  .351  .520  .871
20  Lowrie     J  OAK  SS  42  162  24  42  14  0  3   18  22  20  0  0  .259  .360  .401  .761
51  Cespedes   Y  OAK  LF  39  144  23  37  11  1  7   22  16  28  0  1  .257  .329  .493  .822
24  Callaspo   A  OAK  2B  37  124  14  31  4   0  3   15  21  18  0  0  .250  .359  .355  .713
76  Reddick    J  OAK  RF  39  131  18  30  1   3  4   18  10  31  1  0  .229  .289  .374  .663
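
One caveat: grep OAK matches those three letters anywhere on a line, not just in the Team column. If you want to be strict about it, an awk filter on field 4 (the Team column) does the same job; here is a sketch:

$ awk -F, '$4 == "OAK"' stats.csv | sort -n -t, -k18,18 -r | column -s, -t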

Baseball Batting History

Switching gears, let’s look at a data set containing all MLB batters from 1871 to 2013. Many thanks to the Lahman Baseball Database for making the data easily accessible.

Display batting stats from the 1985 San Francisco Giants

This is a large data set so let’s get specific: how can I look up the player stats for the 1985 San Francisco Giants? This is easily done with chained grep commands. But first, in order to display the header along with the grep output, I send the header and the output into two different files, header.csv and giants1985.csv. Finally, I concatenate the two files with cat, and cut out unwanted columns (keeping columns 1-5 and 8-22).

$ head -1 Batting.csv > header.csv
$ grep SFN Batting.csv | grep 1985 > giants1985.csv
$ cat header.csv giants1985.csv | cut -d, -f -5,8-22 | column -s, -t
playerID   yearID  stint  teamID  lgID  AB   R   H    2B  3B  HR  RBI  SB  CS  BB  SO   IBB  HBP  SH  SF
adamsri02  1985    1      SFN     NL    121  12  23   3   1   2   10   1   1   5   23   3    1    3   0
bluevi01   1985    1      SFN     NL    30   0   4    1   0   0   0    0   0   3   12   0    0    8   0
brenlbo01  1985    1      SFN     NL    440  41  97   16  1   19  56   1   4   57  62   5    2    4   2
brownch02  1985    1      SFN     NL    432  50  117  20  3   16  61   2   3   38  78   4    11   1   0
davisch01  1985    1      SFN     NL    481  53  130  25  2   13  56   15  7   62  74   12   0    1   7
davisma01  1985    1      SFN     NL    12   0   3    0   1   0   0    0   1   0   5    0    0    4   0
deerro01   1985    1      SFN     NL    162  22  30   5   1   8   20   0   1   23  71   0    0    0   2
driesda01  1985    2      SFN     NL    181  22  42   8   0   3   22   0   0   17  22   3    1    0   3
garresc01  1985    1      SFN     NL    9    1   2    1   0   0   2    0   0   1   4    0    0    0   0
gladdda01  1985    1      SFN     NL    502  64  122  15  8   7   41   32  15  40  78   1    7    10  2
gottji01   1985    1      SFN     NL    51   6   10   2   0   3   3    0   1   1   30   0    0    4   0
greenda03  1985    1      SFN     NL    294  36  73   10  2   5   20   6   5   22  58   3    1    2   2
hammaat01  1985    1      SFN     NL    47   0   4    0   0   0   0    0   0   0   17   0    0    6   0
jeffcmi01  1985    2      SFN     NL    1    0   0    0   0   0   0    0   0   1   0    0    0    0   0
krukomi01  1985    1      SFN     NL    55   2   12   4   0   1   3    1   1   1   15   0    2    8   0
kuipedu01  1985    1      SFN     NL    5    0   3    0   0   0   0    0   0   1   0    0    0    2   0
lapoida01  1985    1      SFN     NL    60   4   10   1   0   0   6    0   0   6   11   0    0    5   0
laskebi01  1985    1      SFN     NL    30   1   4    0   0   0   1    0   0   3   12   0    1    5   0
lemasjo01  1985    1      SFN     NL    16   1   0    0   0   0   0    0   0   1   5    0    0    0   0
leonaje01  1985    1      SFN     NL    507  49  122  20  3   17  62   11  6   21  107  5    1    1   1
masonro01  1985    1      SFN     NL    11   1   1    0   0   0   0    0   0   0   5    0    1    0   0
mintogr01  1985    1      SFN     NL    8    1   0    0   0   0   1    0   0   1   6    0    0    0   0
moorebo01  1985    1      SFN     NL    2    0   0    0   0   0   0    0   0   0   0    0    0    0   0
nokesma01  1985    1      SFN     NL    53   3   11   2   0   2   5    0   0   1   9    0    1    0   0
rajsiga01  1985    1      SFN     NL    91   5   15   6   0   0   10   0   1   17  22   4    0    2   0
robinje01  1985    1      SFN     NL    0    0   0    0   0   0   0    0   0   0   0    0    0    0   0
roeniro01  1985    1      SFN     NL    133  23  34   9   1   3   13   6   2   35  27   3    0    1   1
thompsc01  1985    1      SFN     NL    111  8   23   5   0   0   6    0   0   2   10   0    0    1   0
trevial01  1985    1      SFN     NL    157  17  34   10  1   6   19   0   0   20  24   0    0    1   1
trillma01  1985    1      SFN     NL    451  36  101  16  2   3   25   2   0   40  44   0    1    11  2
uribejo01  1985    1      SFN     NL    476  46  113  20  4   3   26   8   2   30  57   8    2    5   0
wardco01   1985    1      SFN     NL    2    0   0    0   0   0   0    0   0   0   1    0    0    0   0
wellmbr01  1985    1      SFN     NL    174  16  41   11  1   0   16   5   2   4   33   1    4    5   1
willifr01  1985    1      SFN     NL    3    0   0    0   0   0   0    0   0   0   0    0    0    1   0
woodami01  1985    1      SFN     NL    82   12  20   1   0   0   9    6   1   5   3    0    0    1   0
youngjo02  1985    1      SFN     NL    230  24  62   6   0   4   24   3   2   30  37   1    1    1   1
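
As with the OAK example, the chained greps match SFN and 1985 anywhere on a line. A stricter sketch using awk, filtering on yearID (field 2) and teamID (field 4), would be:

$ awk -F, '$2 == 1985 && $4 == "SFN"' Batting.csv > giants1985.csv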

Total Career Hits

Since this data set contains multiple entries/rows for every player (one for each year they played), we need to aggregate their stats across years to come up with career hit totals. This can easily be done with awk arrays. Arrays in awk are tables with unique indices, so you can use a column as the index and use ++ to count rows that share the same index, or +=$col to sum a particular column over rows with the same index. In the command below, p[$1]++ creates an array that uses column 1 (playerID) as its index, then counts the number of rows with that index. This basically computes the length of a player’s career. H[$1]+=$10 is an array indexed by playerID, whose value is the running sum of column 10 (Hits) for each playerID. Finally, the statement starting with for (i in p) prints the playerID, career length, and hit total in CSV format for every playerID. Also note that the field separator switch in awk is -F.

$ awk -F, '{p[$1]++;H[$1]+=$10} END {for (i in p) print i","p[i]","H[i]}' Batting.csv > H.csv
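
One small wrinkle: the header row gets processed too, so H.csv ends up with an entry keyed on the literal string playerID. A sketch that skips the header with NR > 1:

$ awk -F, 'NR > 1 {p[$1]++;H[$1]+=$10} END {for (i in p) print i","p[i]","H[i]}' Batting.csv > H.csv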

All-time greatest hitters

A reverse sort on field 3 is all that is needed here. As expected, we see Pete Rose, Ty Cobb, Hank Aaron, Stan Musial, and Tris Speaker. Comparing these to Baseball-Reference, I noticed that I over-counted Pete Rose’s career length by one year. A quick grep on rosepe01 confirmed that he has two entries for 1984 because he switched from Montreal to Cincinnati mid-season.

$ sort -n -t, -r -k3,3 H.csv | head -5 | column -s, -t
rosepe01   25  4256
cobbty01   24  4189
aaronha01  23  3771
musiast01  22  3630
speaktr01  22  3514
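
The quick check mentioned above might look something like this sketch, where uniq -d prints only the seasons that appear more than once:

$ grep rosepe01 Batting.csv | cut -d, -f2 | sort | uniq -d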

Batting Averages

Now suppose I want to compute batting averages and append them as a column to H.csv. This can be done with another awk command, much like the one above (column 8 holds At Bats). Then, use join to merge the two files on their first field, playerID. Note that the awk command’s if-else statement avoids a division-by-zero error by printing 0.00 for any player with zero at bats.

$ awk -F, '{AB[$1]+=$8;H[$1]+=$10} END {for (i in H) if (AB[i] != 0) print i","H[i]/AB[i]; else print i",0.00"}' Batting.csv > AVG.csv
$ join -t, H.csv AVG.csv > HAVG.csv
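
One caveat: join expects both inputs to be sorted on the join field, and awk’s for (i in ...) loop makes no promises about ordering. If join complains about unsorted input, sorting both files in place on playerID first should fix it:

$ sort -t, -k1,1 -o H.csv H.csv
$ sort -t, -k1,1 -o AVG.csv AVG.csv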

Now repeat the sort above to show the greatest hitters, this time with their career batting averages.

$ sort -n -t, -r -k3,3 HAVG.csv | head -5 | column -s, -t
rosepe01   25  4256  0.302853
cobbty01   24  4189  0.366363
aaronha01  23  3771  0.304998
musiast01  22  3630  0.330842
speaktr01  22  3514  0.344679
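
If you prefer the conventional three-decimal presentation, the average column can be rounded with awk’s printf. A sketch (field 4 of HAVG.csv holds the average):

$ sort -n -t, -r -k3,3 HAVG.csv | head -5 | awk -F, '{printf "%s,%s,%s,%.3f\n", $1, $2, $3, $4}' | column -s, -t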

The Great Bambino

Let’s do a search for The Sultan of Swat. Since I know playerIDs start with the last name followed by the first two letters of the first name, I will do a grep on “ruthba”. Sure enough, George Herman Ruth Jr. comes up. In order to display his stats with column names, I first save the grep search to its own file, then cat the header and Babe’s stats together. Again, I use cut to be picky about which columns to view, and column for its fancy formatting.

$ grep ruthba Batting.csv > greatbambino.csv
$ cat header.csv greatbambino.csv | cut -d, -f 1,2,4,5,8-14 | column -s, -t
playerID  yearID  teamID  lgID  AB   R    H    2B  3B  HR  RBI
ruthba01  1914    BOS     AL    10   1    2    1   0   0   2
ruthba01  1915    BOS     AL    92   16   29   10  1   4   21
ruthba01  1916    BOS     AL    136  18   37   5   3   3   15
ruthba01  1917    BOS     AL    123  14   40   6   3   2   12
ruthba01  1918    BOS     AL    317  50   95   26  11  11  66
ruthba01  1919    BOS     AL    432  103  139  34  12  29  114
ruthba01  1920    NYA     AL    457  158  172  36  9   54  137
ruthba01  1921    NYA     AL    540  177  204  44  16  59  171
ruthba01  1922    NYA     AL    406  94   128  24  8   35  99
ruthba01  1923    NYA     AL    522  151  205  45  13  41  131
ruthba01  1924    NYA     AL    529  143  200  39  7   46  121
ruthba01  1925    NYA     AL    359  61   104  12  2   25  66
ruthba01  1926    NYA     AL    495  139  184  30  5   47  150
ruthba01  1927    NYA     AL    540  158  192  29  8   60  164
ruthba01  1928    NYA     AL    536  163  173  29  8   54  142
ruthba01  1929    NYA     AL    499  121  172  26  6   46  154
ruthba01  1930    NYA     AL    518  150  186  28  9   49  153
ruthba01  1931    NYA     AL    534  149  199  31  3   46  163
ruthba01  1932    NYA     AL    457  120  156  13  5   41  137
ruthba01  1933    NYA     AL    459  97   138  21  3   34  103
ruthba01  1934    NYA     AL    365  78   105  17  4   22  84
ruthba01  1935    BSN     NL    72   13   13   0   0   6   12
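
If you would rather skip the intermediate file, the header and the grep output can be grouped into a single pipeline. A sketch using shell command grouping:

$ { head -1 Batting.csv; grep ruthba Batting.csv; } | cut -d, -f 1,2,4,5,8-14 | column -s, -t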

How about the Colossus of Clout’s career totals?

$ grep ruthba HAVG.csv | column -s, -t
ruthba01  22  2873  0.342105

Why don’t you just use Pandas?

Yes, Pandas or R can do all of the above data manipulation and so much more. Why use awkward and opaque shell commands? Because they are efficient, built-in, and fantastic for simple data exploration and manipulation.