R and Quakes

Update: The 6.0 Napa earthquake occured a week after I originally published this post. I re-ran the code and analysis to include the latest activity.

I decided to do a bit of data exploration around major California earthquakes in the last 30 years. Specifically, I will look at earthquakes within 200 km of San Francisco and Los Angeles between August 1984 and August 2014.

USGS has a very accessible Earthquake API and online query form. I wrote a simple script in Python3 which queries the API and downloads the result.

Data preparation

Fetching the data

To pull down the data, use usgs.APIquery with the desired parameters. All parameters are defined here.

import usgs

# obtain data for the greater San Francisco area, start of 1984 through August 24, 2014 in PST
usgs.APIquery(starttime = "1984-01-01T07:00:00", endtime = "2014-08-25T07:00:00",
              minmagnitude = "1.5",
              latitude = "37.77", longitude = "-122.44",
              minradiuskm = "0", maxradiuskm = "200",
              reviewstatus = "reviewed",
              filename = "usgsQuery_SF_84-14.csv",
              format = "csv")

# obtain data for the greater Los Angeles area, start of 1984 through August 24, 2014 in PST
usgs.APIquery(starttime = "1984-01-01T07:00:00", endtime = "2014-08-25T07:00:00",
              minmagnitude = "1.5",
              latitude = "34.05", longitude = "-118.26",
              minradiuskm = "0", maxradiuskm = "200",
              reviewstatus = "reviewed",
              filename = "usgsQuery_LA_84-14.csv",
              format = "csv")

Switching over to use R, I then read the CSVs into dataframes. For convenience, I define an area label and combine the dataframes into one.

SFquakes = read.csv("usgsQuery_SF_84-14.csv", header = TRUE, stringsAsFactors = FALSE)
LAquakes = read.csv("usgsQuery_LA_84-14.csv", header = TRUE, stringsAsFactors = FALSE)
SFquakes$area = "SF"
LAquakes$area = "LA"
quakes = rbind(SFquakes, LAquakes)

Data scrubbing and binning

# remove columns with all NA values
quakes = quakes[,-7:-10]

# create column with time object
quakes$ptime = as.POSIXlt(strptime(quakes$time, "%Y-%m-%dT%T"))

# remove instances of mwb, an alternative form of magnitude measurement
quakes = quakes[-which(quakes$magType == "mwb"),]

# create columns with year bins
quakes$yearbins = strftime(cut(quakes$ptime, "year", right = F), "%Y")

# keep one significant figure after the decimal point for magnitude
quakes$mag = round(quakes$mag, 1)

# create columns with half and whole magnitude bins
quakes$Mhalfbins = cut(quakes$mag, seq(2.0,7.5,.5), right = F)
quakes$Mwholebins = cut(quakes$mag, seq(2.0,8.0,1), right = F)

# create a column with varied magnitude bins, necessary if plotting without a log scale
quakes$Mvariedbins = cut(quakes$mag, c(2.0,2.5,3.0,3.5,4.0,5.0,7.5), right = F)

Magnitude-energy conversion

One way to comprehend the destructive force of earthquakes is to convert from earthquake magnitude to a more imaginative unit of energy: the ton of TNT. To do this conversion, we need to be able to convert from magnitude to energy, perhaps with the help of Beno and Charles. Also, it is useful to know that one ton of TNT is equivalent to about 4.184e9 Joules.

The Gutenberg and Richter energy-magnitude relation (in Joules):

E[M] = 10^(1.5M + 4.8)

quakes$Ejoules = 10^(1.5*quakes$mag + 4.8) # units of Joules
quakes$Etnt = quakes$Ejoules/4.184e9       # units of kilotonnes, TNT

Calculate approximate distance from epicenter to city center

For this calculation I decided to use the Haversine formula. A great reference to implementing this method and other methods can be found here.

Rearth = 6371 # average earth radius

# approximate city centroids
quakes$area_lon = 0
quakes$area_lon[quakes$area == "SF"] = -122.44
quakes$area_lon[quakes$area == "LA"] = -118.26
quakes$area_lat = 0
quakes$area_lat[quakes$area == "SF"] = 37.77
quakes$area_lat[quakes$area == "LA"] = 34.05

# haversine formula
lat1 = quakes$area_lat*pi/180
lat2 = quakes$latitude*pi/180
dlat = (quakes$area_lat - quakes$latitude)*pi/180
dlon = (quakes$area_lon - quakes$longitude)*pi/180
a = sin(dlat/2)*sin(dlat/2) + cos(lat1)*cos(lat2)*sin(dlon/2)*sin(dlon/2)
c = 2*atan2(sqrt(a), sqrt(1 - a))
quakes$dist = Rearth*c

Ask the data

What were the largest earthquakes in recent history?

quakes = quakes[order(quakes$mag, decreasing = TRUE),] # sort by magnitude
print(quakes[quakes$mag >= 6.0,c("ptime", "mag", "area", "Etnt", "dist")])

Output (with annotation):

                    ptime mag area       Etnt      dist
1992-06-28 11:57:38 7.3   LA 1344028.02 159.76987 <-- Landers
1999-10-16 09:46:46 7.2   LA  951498.97 175.46360 <-- Hector Mine
1989-10-18 00:04:16 6.9   SF  337604.58  94.25933 <-- Loma Prieta
1994-01-17 12:30:55 6.7   LA  169203.10  31.30647 <-- Northridge
1992-06-28 15:05:33 6.5   LA   84802.44 135.38916 <-- related to Landers
1984-04-24 21:15:20 6.1   SF   21301.41  82.69884 <-- Morgan Hill
1992-04-23 04:50:23 6.1   LA   21301.41 162.05213 <-- Joshua Tree, preceeded Landers
   2014-08-24 10:20:44 6.0   SF   15080.24  51.28585 <-- Napa

To compare the largest earthquakes in each area, the Landers quake had an explosive force of 1.3 megatonnes of TNT, while the Loma Prieta had about 340 kilotonnes of explosive force.

Since epicenter location is a huge factor in the destructive power of an earthquake, it is interesting to note that the Landers quake was about 160 km away from the densely populated LA metro area, while the Northridge quake was 30 km away, and about 4.0 (10^.6) times less powerful.

Which major city was most affected by earthquakes?

# Mean distance, magnitude, and Energy in TNT
print(aggregate(data = quakes, cbind(dist, mag, Etnt) ~ area, mean))

Output:

  area     dist      mag      Etnt
1   LA 124.8353 2.796353 273.54752
2   SF 115.9873 2.589565  56.07969

The question of which city has been more affected is more complex than a simple calculation of the 30-year mean of distance and magnitude. However, on average, recent earthquakes surrounding LA have been about 10% closer than Bay Area earthquakes, 1.6 (10^.21) times more severe in magnitude, and with over 4 times more explosive force.

What is the combined yearly mean distance, magnitude, and energy?

# yearly averages
print(aggregate(data = quakes, cbind(dist, mag, Etnt) ~ yearbins, mean))

Output:

yearbins      dist      mag         Etnt
    1984 109.04522 3.351741  118.2368472
    1985 119.81664 3.106047    7.7798894
    1986 123.08929 3.196345   57.9139893
    1987  97.24530 3.098084   28.2857204
    1988 117.30689 3.134848   14.5052295
    1989  99.45154 3.223005  804.5486309
    1990 109.68262 3.072021   23.2815528
    1991 113.49903 3.020319   24.0027741
    1992 156.32608 3.237759  983.2753462
   1993 135.49608 2.986907    2.7064472
   1994  70.18247 3.157971  168.1940170
   1995 138.14896 3.124582   17.2462753
   1996 129.99745 3.138735    6.0027307
   1997 115.75065 3.156890    6.6688196
   1998 117.41048 3.114228    6.7265345
   1999 172.62893 3.256908 1310.8670301
   2000 149.68578 3.105828    4.8142657
   2001 137.13644 3.147203    7.0122122
   2002 117.39979 3.076856    4.3661773
   2003 121.76297 3.115162    6.3490834
   2004 119.03760 3.097368    8.9307051
   2005 121.67829 3.138785   13.8969863
   2006 121.59305 3.050495    3.9256528
   2007 114.70162 3.046964   19.8885765
   2008 113.91730 2.963265   12.9691261
   2009  96.35781 2.678468    3.1106334
   2010 104.13835 2.665835    1.2364846
   2011  99.73551 2.676221    1.5248688
   2012 104.98631 2.606157    1.1377617
   2013 125.22960 1.933424    0.3044841
   2014 118.96839 1.942429    6.8136214

Due to the exponential nature of the data, the Etnt feature is a clear indicator for seismologically active years. Also, it appears we are seeing increased detection efficiency in the last few years with the average magnitude decreasing. Note, however, that 2014 is not a complete year at the time of this blog post.

What is the magnitude-frequency distribution for the two areas of interest?

freqSF = as.data.frame(table(quakes[quakes$area == "SF","Mhalfbins"]))
names(freqSF) = c("magSF", "freqSF")
freqLA = as.data.frame(table(quakes[quakes$area == "LA","Mhalfbins"]))
names(freqLA) = c("magLA", "freqLA")

# event frequency
print(cbind(freqSF, freqLA))

Output:

     magSF freqSF   magLA freqLA
[2,2.5)   1215 [2,2.5)   1018
[2.5,3)   2069 [2.5,3)   2600
[3,3.5)   1540 [3,3.5)   2848
[3.5,4)    460 [3.5,4)   1055
[4,4.5)    169 [4,4.5)    365
[4.5,5)     44 [4.5,5)     87
[5,5.5)      9 [5,5.5)     35
[5.5,6)      2 [5.5,6)     10
[6,6.5)      2 [6,6.5)      1
[6.5,7)      1 [6.5,7)      2
[7,7.5)      0 [7,7.5)      2

Note: the units for frequency are in Event Counts per 30 years.

I am not an expert in seismology. However, my guess would be that this magnitude-frequency should theoretically follow a power-law where at one end, as the magnitude approaches zero, the frequency increases exponentially, and at the other end, as the magnitude exceeds 10 the frequency diminishes to almost 0. The fact that the frequency distribution peaks between [3,3.5) and falls off as the magnitude approaches 0, given my uneducated guess, suggests that the detection efficiency (exponentially) decreases as it approaches 0. See the magnitude versus frequency plot below.

Plots

Time for some plots. Let’s have a look at the following correlations:

time vs. magnitude
magnitude-frequency correlation

Load ggplot2 and scales. The scales package is necessary for log tick marks.

library("ggplot2")
library("scales")

time vs. magnitude

png("SF-LA_timeVmag.png", width = 1000, height = 800)
timeVmag = ggplot(na.omit(quakes), aes(ptime, mag)) +
           geom_point(aes(size = mag, color = dist)) +
           ggtitle("SF and LA Earthquakes, 1984-2014") +
           xlab("time") +
           ylab("earthquake magnitude") + 
           guides(size = guide_legend(title = "magnitude")) +
           guides(color = guide_legend(title = "distance [km]")) +
           scale_color_gradient(low = "red", high = "dark gray") +
           theme_grey(base_size = 12) +
           theme(text = element_text(size = 22)) +
           facet_wrap(~ area, ncol = 1)
print(timeVmag)
dev.off()

Output:

Magnitude over Time

The major earthquakes that pop out in this plot are: Landers in 1992, Loma Prieta in 1989, and Northridge at the beginning of 1994. What is also interesting is the amount of aftershocks and related earthquakes for the major Los Angeles area earthquakes. Another interesting aspect of this plot is the increased detection efficiency of 2 to 2.5 magnitude earthquakes since 2009. Perhaps this is due to the Quake-Catcher Network (QCN)?

magnitude-frequency correlation

freq = as.data.frame(table(quakes[quakes$mag >= 2.0, c("area","mag")]))
freq = freq[order(freq$area),]
freq[freq$Freq == 0.0,] = NA # bins with zero counts cause log plot issues
png("SF-LA_magVfreq.png", width = 1000, height = 800)
magVfreq = ggplot(na.omit(freq), aes(mag, Freq)) +
           geom_point(aes(color = area), size = 4) + 
           ggtitle("SF and LA Earthquakes, 1984-2014") + 
           ylab("frequency [event count per 31 years]") + 
           scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                         labels = trans_format("log10", math_format(10^.x))) +
           scale_x_discrete("magnitude", breaks = seq(2,7,1)) + 
           geom_smooth(method = "loess", aes(group = 1), color = "black") +
           theme_grey(base_size = 12) + 
           theme(text = element_text(size = 22))
print(magVfreq)
dev.off()

Output:

Magnitude versus Frequency

The correlation between magnitude and frequency for most of the Richter scale has a slope of about 10^1 events over one order of magnitude. Furthermore, as the magnitude increases, the spread of the data increases due to the insufficient amount of counts.

Code available on Github

The code for this post can be found under the exploration directory on github in my usgs repository: quakes.R

Many thanks to usgs.gov for providing easy access to their earthquake data.