SQL and Relational Databases

A relational database is a set of rectangular data frames called tables linked by keys relating one table to another. Software implementations of such data structures are known as relational database management systems (RDBMS). Most RDBMS use structured query language (SQL) to modify or search the relational database.

SQL provides a syntax for interfacing with relational data. It is largely a declarative language in that we use SQL to specify what we wish to accomplish, leaving the how to the RDBMS. While there are standards for SQL implementations put out by the International Organization for Standardization (ISO) and the American National Standards Institute (ANSI), there are several open source and commercial implementations that each have unique features.

I will try to focus on the commonalities, but will be using an SQLite engine in R for providing examples. One unique feature of SQLite is that it does not follow the client-server model. In this model, a physical computer storing the data and executing queries within the RDMBS, the server, is separate from the machine requesting the queries known as the client. The client-server model is popular in business, health care, and other domains as it allows security and monitoring of how the data is queried. It is also popular for many large open data projects (i.e ensemble) where it is beneficial for data to be centrally maintained and frequently accessed on the fly.

In order to write SQL queries within R I will be using the package dbplyr which provides an SQL back end to dplyr functions, but also allows SQL queries to be parsed as strings.

Lahman Examples

For our examples, we will use the Lahman package from R which contains historical baseball data from 1876-2018.

## # Source:   table<BATTING> [?? x 22]
## # Database: sqlite 3.29.0
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//Rtmp7D9J7P/lahman.sqlite]
##    playerID yearID stint teamID lgID      G    AB     R     H   X2B   X3B
##    <chr>     <int> <int> <chr>  <chr> <int> <int> <int> <int> <int> <int>
##  1 abercda…   1871     1 TRO    NA        1     4     0     0     0     0
##  2 addybo01   1871     1 RC1    NA       25   118    30    32     6     0
##  3 allisar…   1871     1 CL1    NA       29   137    28    40     4     5
##  4 allisdo…   1871     1 WS3    NA       27   133    28    44    10     2
##  5 ansonca…   1871     1 RC1    NA       25   120    29    39    11     3
##  6 armstbo…   1871     1 FW1    NA       12    49     9    11     2     1
##  7 barkeal…   1871     1 RC1    NA        1     4     0     1     0     0
##  8 barnero…   1871     1 BS1    NA       31   157    66    63    10     9
##  9 barrebi…   1871     1 FW1    NA        1     5     1     1     1     0
## 10 barrofr…   1871     1 BS1    NA       18    86    13    13     2     1
## # … with more rows, and 11 more variables: HR <int>, RBI <int>, SB <int>,
## #   CS <int>, BB <int>, SO <int>, IBB <int>, HBP <int>, SH <int>,
## #   SF <int>, GIDP <int>
## # A tibble: 24 x 5
##    file                class        nobs  nvar title                    
##    <chr>               <chr>       <dbl> <dbl> <chr>                    
##  1 AllstarFull         data.frame   4993     8 AllstarFull table        
##  2 Appearances         data.frame  99466    21 Appearances table        
##  3 AwardsManagers      data.frame    171     6 AwardsManagers table     
##  4 AwardsPlayers       data.frame   6026     6 AwardsPlayers table      
##  5 AwardsShareManagers data.frame    401     7 AwardsShareManagers table
##  6 AwardsSharePlayers  data.frame   6705     7 AwardsSharePlayers table 
##  7 Batting             data.frame  99846    22 Batting table            
##  8 BattingPost         data.frame  11294    22 BattingPost table        
##  9 CollegePlaying      data.frame  17350     3 CollegePlaying table     
## 10 Fielding            data.frame 167938    18 Fielding table           
## 11 FieldingOF          data.frame  12028     6 FieldingOF table         
## 12 FieldingPost        data.frame  11924    17 FieldingPost data        
## 13 HallOfFame          data.frame   4088     9 Hall of Fame Voting Data 
## 14 Managers            data.frame   3370    10 Managers table           
## 15 ManagersHalf        data.frame     93    10 ManagersHalf table       
## 16 Master              data.frame  18589    26 Master table             
## 17 Pitching            data.frame  43330    30 Pitching table           
## 18 PitchingPost        data.frame   4945    30 PitchingPost table       
## 19 Salaries            data.frame  24758     5 Salaries table           
## 20 Schools             data.frame   1207     5 Schools table            
## 21 SeriesPost          data.frame    298     9 SeriesPost table         
## 22 Teams               data.frame   2775    48 Teams table              
## 23 TeamsFranchises     data.frame    120     4 TeamFranchises table     
## 24 TeamsHalf           data.frame     52    10 TeamsHalf table

Working with tables

Search, subset, and limiting clauses

The basic structure of a SQL query contains a SELECT statement indicating which columns are desired and a FROM statement explaining where to find them. You can indicate a subset of rows using a WHERE clause.

In the model above, CREATE TABLE will only work with a client-server model if you have permissions to write to memory. Use a wild card * to select all columns in a table:

## # A tibble: 5 x 26
##   playerID birthYear birthMonth birthDay birthCountry birthState birthCity
##   <chr>        <int>      <int>    <int> <chr>        <chr>      <chr>    
## 1 aardsda…      1981         12       27 USA          CO         Denver   
## 2 aaronha…      1934          2        5 USA          AL         Mobile   
## 3 aaronto…      1939          8        5 USA          AL         Mobile   
## 4 aasedo01      1954          9        8 USA          CA         Orange   
## 5 abadan01      1972          8       25 USA          FL         Palm Bea…
## # … with 19 more variables: deathYear <int>, deathMonth <int>,
## #   deathDay <int>, deathCountry <chr>, deathState <chr>, deathCity <chr>,
## #   nameFirst <chr>, nameLast <chr>, nameGiven <chr>, weight <int>,
## #   height <int>, bats <chr>, throws <chr>, debut <chr>, finalGame <chr>,
## #   retroID <chr>, bbrefID <chr>, deathDate <dbl>, birthDate <dbl>

Above we use a LIMIT clause to grab just the first 5 rows of our table. This is good practice when developing new queries as it prevents large wait times only to discover a bug.

Limiting Clauses

Here we will explore the limiting clause WHERE in greater detail. Most of what we discuss in this section will also work with HAVING which operates on summary statistics rather than individual rows.

Boolean Logic

We can use Boolean comparisons in a WHERE clause as shown in the example below. We find all player-seasons since 2000 in which the player was credited with an RBI 100 or more times.

Here is our query:

And here it is in dbplyr:

## # Source:   SQL [?? x 3]
## # Database: sqlite 3.29.0
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//Rtmp7D9J7P/lahman.sqlite]
##    playerID  yearID   RBI
##    <chr>      <int> <int>
##  1 aloumo01    2000   114
##  2 anderga01   2000   117
##  3 bagweje01   2000   132
##  4 batisto01   2000   114
##  5 belleal01   2000   103
##  6 bondsba01   2000   106
##  7 cirilje01   2000   115
##  8 delgaca01   2000   137
##  9 dyeje01     2000   118
## 10 edmonji01   2000   108
## # … with more rows

Note that the code above creates and tests the query, but does not execute it over the entire table. To do that, we use collect().

## # A tibble: 496 x 3
##    playerID  yearID   RBI
##    <chr>      <int> <int>
##  1 aloumo01    2000   114
##  2 anderga01   2000   117
##  3 bagweje01   2000   132
##  4 batisto01   2000   114
##  5 belleal01   2000   103
##  6 bondsba01   2000   106
##  7 cirilje01   2000   115
##  8 delgaca01   2000   137
##  9 dyeje01     2000   118
## 10 edmonji01   2000   108
## # … with 486 more rows

IN

To select on a column by testing against a set of fixed values use IN.

## # A tibble: 6 x 3
##   Given             Last     born
##   <chr>             <chr>   <int>
## 1 Felipe Rojas      Alou     1935
## 2 Jesus Maria Rojas Alou     1942
## 3 Mateo Rojas       Alou     1938
## 4 Moises Rojas      Alou     1966
## 5 George Kenneth    Griffey  1950
## 6 George Kenneth    Griffey  1969

LIKE

Use a LIKE statement with a WHERE clause to get partial string matching. You can use % to match any sub-string.

## # A tibble: 26 x 3
##    Given              Last       born
##    <chr>              <chr>     <int>
##  1 Arthur Joseph      Griffin    1988
##  2 Alfredo Claudino   Griffin    1957
##  3 Bartholomew Joseph Griffith   1896
##  4 Clark Calvin       Griffith   1869
##  5 Robert Derrell     Griffith   1943
##  6 Douglas Lee        Griffin    1947
##  7 Frank Wesley       Griffith   1872
##  8 James Linton       Griffin    1886
##  9 Ivy Moore          Griffin    1896
## 10 Jeremy Richard     Griffiths  1978
## # … with 16 more rows

Most SQL implementations also have a REGEXP or REGEXLIKE function that works with regular expressions, but SQLite requires a user defined regex() for its use so we skip it here.

Combining limiting clauses

Limiting WHERE clauses can be combined using AND and OR. Statements can be negated using NOT.

## # A tibble: 25 x 4
##    Given         Last       born Place
##    <chr>         <chr>     <int> <chr>
##  1 Ednel Javier  Baez       1992 P.R. 
##  2 Jose Orlando  Berrios    1994 P.R. 
##  3 Victor Manuel Caratini   1993 P.R. 
##  4 Alexander     Claudio    1992 P.R. 
##  5 Joseph        Colon      1990 P.R. 
##  6 Carlos Javier Correa     1994 P.R. 
##  7 Noel A.       Cuevas     1991 P.R. 
##  8 Jose Eugenio  De Leon    1992 P.R. 
##  9 Edwin Orlando Diaz       1994 P.R. 
## 10 Enrique J.    Hernandez  1991 P.R. 
## # … with 15 more rows
## # A tibble: 55 x 4
##    Given                              Last        born Place
##    <chr>                              <chr>      <int> <chr>
##  1 Benny Peter                        Agbayani    1971 USA  
##  2 Dustin Kamakana Mai Ku'u Makualani Antolin     1989 USA  
##  3 Anthony Lee                        Barnette    1983 USA  
##  4 Chad Robert                        Bentz       1980 USA  
##  5 Douglas Edmund                     Capilla     1952 USA  
##  6 Shawn Anthony                      Chacon      1977 USA  
##  7 Steven Montague                    Cooke       1970 USA  
##  8 Aaron R.                           Cunningham  1986 USA  
##  9 Ronald Maurice                     Darling     1960 USA  
## 10 Joseph                             DeSa        1959 USA  
## # … with 45 more rows

ORDER BY

Use an ORDER BY clause with a comma separated list of columns to arrange the table.

## # A tibble: 152 x 3
##    playerID  yearID   RBI
##    <chr>      <int> <int>
##  1 cabremi01   2010   126
##  2 rodrial01   2010   125
##  3 bautijo02   2010   124
##  4 pujolal01   2010   118
##  5 gonzaca01   2010   117
##  6 guerrvl01   2010   115
##  7 vottojo01   2010   113
##  8 youngde03   2010   112
##  9 konerpa01   2010   111
## 10 canoro01    2010   109
## # … with 142 more rows

Aggregations, Group By

We can perform aggregations such as sums, means, and counts similar to summarize() from dplyr.

We can use a GROUP BY clause for the split-apply-combine pattern.

Here we find the players with the most total RBI since 2010.

## # A tibble: 3,304 x 2
##    playerID  RBI_total
##    <chr>         <int>
##  1 cabremi01       882
##  2 encared01       870
##  3 pujolal01       870
##  4 cruzne02        853
##  5 canoro01        839
##  6 beltrad01       801
##  7 gonzaad01       778
##  8 stantmi03       772
##  9 bruceja01       765
## 10 bautijo02       764
## # … with 3,294 more rows

Having

The operator defining a limiting clause on an aggregate variable is HAVING. It is essentially like WHERE except for operating on summary statistics rather than individual rows.

In the query below, observe that the HAVING clause comes after the GROUP BY but before the ORDER BY.

## # A tibble: 77 x 2
##    playerID  RBI_total
##    <chr>         <int>
##  1 cabremi01       882
##  2 encared01       870
##  3 pujolal01       870
##  4 cruzne02        853
##  5 canoro01        839
##  6 beltrad01       801
##  7 gonzaad01       778
##  8 stantmi03       772
##  9 bruceja01       765
## 10 bautijo02       764
## # … with 67 more rows

Joins

So far we have discussed working with single tables only. The SQL term for merging data from two or more tables is a ‘join’. All joins are based on the idea of equating rows that match on one or more variables. Below we will discuss the following types of joins:

Inner Join

What if we wanted to supplement our earlier table showing players with 500+ RBI since 2000 with information about those players? We could use an inner join of our RBI table with the ‘Master’ table to accomplish this.

## # Source:   SQL [?? x 3]
## # Database: sqlite 3.29.0
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//Rtmp7D9J7P/lahman.sqlite]
##    First     Last        RBI_TOTAL
##    <chr>     <chr>           <int>
##  1 Miguel    Cabrera           882
##  2 Edwin     Encarnacion       870
##  3 Albert    Pujols            870
##  4 Nelson    Cruz              853
##  5 Robinson  Cano              839
##  6 Adrian    Beltre            801
##  7 Adrian    Gonzalez          778
##  8 Giancarlo Stanton           772
##  9 Jay       Bruce             765
## 10 Jose      Bautista          764
## # … with more rows

In this example, an inner join suffices because each playerID in the batting table is linked to a playerID in the master table.

Left & Right (Outer) Joins

In a left join – sometimes called a left outer join – we add columns from the right table to the left table when matching rows are found. Rows from the left table with no matches from the right table are retained with columns from the right table filled in as NULL (i.e. NA). When there are multiple matches of a row from the left table to rows in the right table, these each become a row in the new table.

A right join is equivalent to a left join with the exception that the roles between right and left are reversed. My personal preference is to generally structure queries as left joins for simplicity.

Left joins are particularly useful when the information in the right table is only applicable to a subset of the rows from the left table. As an example, suppose we would like to know which US colleges and universities have produced the most “Rookie of the Year Awards” given to the best debuting player(s) each season.

To get started, we first test a query to find the last college attended.

## # Source:   SQL [?? x 3]
## # Database: sqlite 3.29.0
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//Rtmp7D9J7P/lahman.sqlite]
##    playerID  schoolID   yearID
##    <chr>     <chr>       <int>
##  1 aardsda01 rice         2003
##  2 abadan01  gamiddl      1993
##  3 abbeybe01 vermont      1892
##  4 abbotje01 kentucky     1994
##  5 abbotji01 michigan     1988
##  6 abbotky01 longbeach    1989
##  7 abercre01 flgateway    2000
##  8 abernte01 elon         1941
##  9 ablesha01 swesterntx   1904
## 10 accarje01 illinoisst   2003
## # … with more rows

Now, we find all distinct awards in the AwardPlayers table.

## # A tibble: 29 x 1
##    awardID                            
##    <chr>                              
##  1 ALCS MVP                           
##  2 All-Star Game MVP                  
##  3 Babe Ruth Award                    
##  4 Baseball Magazine All-Star         
##  5 Branch Rickey Award                
##  6 Comeback Player of the Year        
##  7 Cy Young Award                     
##  8 Gold Glove                         
##  9 Hank Aaron Award                   
## 10 Hutch Award                        
## 11 Lou Gehrig Memorial Award          
## 12 Most Valuable Player               
## 13 NLCS MVP                           
## 14 Outstanding DH Award               
## 15 Pitching Triple Crown              
## 16 Reliever of the Year Award         
## 17 Roberto Clemente Award             
## 18 Rolaids Relief Man Award           
## 19 Rookie of the Year                 
## 20 Silver Slugger                     
## 21 TSN All-Star                       
## 22 TSN Fireman of the Year            
## 23 TSN Guide MVP                      
## 24 TSN Major League Player of the Year
## 25 TSN Pitcher of the Year            
## 26 TSN Player of the Year             
## 27 TSN Reliever of the Year           
## 28 Triple Crown                       
## 29 World Series MVP

Next we test a query for finding all Rookie of the Year Awards.

## # Source:   SQL [?? x 6]
## # Database: sqlite 3.29.0
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//Rtmp7D9J7P/lahman.sqlite]
##    playerID  awardID            yearID lgID  tie   notes
##    <chr>     <chr>               <int> <chr> <chr> <chr>
##  1 robinja02 Rookie of the Year   1947 ML    <NA>  <NA> 
##  2 darkal01  Rookie of the Year   1948 ML    <NA>  <NA> 
##  3 sievero01 Rookie of the Year   1949 AL    <NA>  <NA> 
##  4 newcodo01 Rookie of the Year   1949 NL    <NA>  <NA> 
##  5 dropowa01 Rookie of the Year   1950 AL    <NA>  <NA> 
##  6 jethrsa01 Rookie of the Year   1950 NL    <NA>  <NA> 
##  7 mcdougi01 Rookie of the Year   1951 AL    <NA>  <NA> 
##  8 mayswi01  Rookie of the Year   1951 NL    <NA>  <NA> 
##  9 byrdha01  Rookie of the Year   1952 AL    <NA>  <NA> 
## 10 blackjo02 Rookie of the Year   1952 NL    <NA>  <NA> 
## # … with more rows

Finally, we use a left join of the tables for ROY awards and last college attended to match winners to their schools. We need a left join as many of the winners may never have played collegiate baseball.

## # Source:   SQL [?? x 4]
## # Database: sqlite 3.29.0
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//Rtmp7D9J7P/lahman.sqlite]
##    playerID   year league schoolID
##    <chr>     <int> <chr>  <chr>   
##  1 robinja02  1947 ML     ucla    
##  2 darkal01   1948 ML     ulala   
##  3 sievero01  1949 AL     <NA>    
##  4 newcodo01  1949 NL     <NA>    
##  5 dropowa01  1950 AL     uconn   
##  6 jethrsa01  1950 NL     <NA>    
##  7 mcdougi01  1951 AL     <NA>    
##  8 mayswi01   1951 NL     <NA>    
##  9 byrdha01   1952 AL     <NA>    
## 10 blackjo02  1952 NL     <NA>    
## # … with more rows

To complete the example, we modify the query to display which schools have produced the most ROY awards in total.

## # A tibble: 9 x 2
##   schoolID  ROY_awards
##   <chr>          <int>
## 1 arizonast          4
## 2 michigan           3
## 3 ucla               3
## 4 usc                3
## 5 floridaam          2
## 6 longbeach          2
## 7 oklahoma           2
## 8 texasam            2
## 9 wagner             2

Using show_query()

If you are new to SQL but comfortable with dplyr you may find show_query() useful for understanding how dplyr statements are translated to SQL.

## <SQL>
## SELECT *
## FROM (SELECT `playerID`, `schoolID`, COUNT() AS `n_year`
## FROM (SELECT *
## FROM `CollegePlaying`
## WHERE (`yearID` > 1900.0))
## GROUP BY `playerID`, `schoolID`)
## WHERE (`n_year` > 5.0)

Resources