SQL and Relational Databases

A relational database is a set of rectangular data frames called tables linked by keys relating one table to another. Software implementations of such data structures are known as relational database management systems (RDBMS). Most RDBMS use structured query language or SQL (“sequel” or “S-Q-L”) to modify or search the relational database.

SQL provides a syntax for interfacing with relational data. It is largely a declarative language in that we use SQL to specify what we wish to accomplish, leaving the how to the RDBMS. While there are standards for SQL implementations put out by the International Organization for Standardization (ISO) and the American National Standards Institute (ANSI), there are several open source and commercial implementations that each have unique features.

I will try to focus on the commonalities, but will be using an SQLite engine in R for providing examples. One unique feature of SQLite is that it does not follow the client-server model. In this model, a physical computer storing the data and executing queries within the RDMBS, the server, is separate from the machine requesting the queries known as the client.

The client-server model is popular in business, health care, and other domains as it allows security and monitoring of how the data is queried. It is also popular for many large open data projects (i.e ensemble) where it is beneficial for data to be centrally maintained and frequently accessed on the fly.

In order to write SQL queries within R I will be using the package dbplyr which provides an SQL back end to dplyr functions, but also allows SQL queries to be parsed as strings.

Lahman Examples

For our examples, we will use the Lahman package from R which contains historical baseball data from 1876-2018.

# Packages
library(tidyverse)
library(dbplyr)
library(Lahman)

# Create a local SQLite database of the Lahman data
lahman = lahman_sqlite()

# Query the batting table
lahman %>% tbl("BATTING")
## # Source:   table<BATTING> [?? x 22]
## # Database: sqlite 3.33.0
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//RtmpCJciUo/lahman.sqlite]
##    playerID yearID stint teamID lgID      G    AB     R     H   X2B   X3B    HR
##    <chr>     <int> <int> <chr>  <chr> <int> <int> <int> <int> <int> <int> <int>
##  1 abercda…   1871     1 TRO    NA        1     4     0     0     0     0     0
##  2 addybo01   1871     1 RC1    NA       25   118    30    32     6     0     0
##  3 allisar…   1871     1 CL1    NA       29   137    28    40     4     5     0
##  4 allisdo…   1871     1 WS3    NA       27   133    28    44    10     2     2
##  5 ansonca…   1871     1 RC1    NA       25   120    29    39    11     3     0
##  6 armstbo…   1871     1 FW1    NA       12    49     9    11     2     1     0
##  7 barkeal…   1871     1 RC1    NA        1     4     0     1     0     0     0
##  8 barnero…   1871     1 BS1    NA       31   157    66    63    10     9     0
##  9 barrebi…   1871     1 FW1    NA        1     5     1     1     1     0     0
## 10 barrofr…   1871     1 BS1    NA       18    86    13    13     2     1     0
## # … with more rows, and 10 more variables: RBI <int>, SB <int>, CS <int>,
## #   BB <int>, SO <int>, IBB <int>, HBP <int>, SH <int>, SF <int>, GIDP <int>
# Query 'LahmanData' with meta data
print(lahman %>% tbl('LahmanData') %>% collect(), n = 25)
## # A tibble: 24 x 5
##    file                class        nobs  nvar title                    
##    <chr>               <chr>       <dbl> <dbl> <chr>                    
##  1 AllstarFull         data.frame   4993     8 AllstarFull table        
##  2 Appearances         data.frame  99466    21 Appearances table        
##  3 AwardsManagers      data.frame    171     6 AwardsManagers table     
##  4 AwardsPlayers       data.frame   6026     6 AwardsPlayers table      
##  5 AwardsShareManagers data.frame    401     7 AwardsShareManagers table
##  6 AwardsSharePlayers  data.frame   6705     7 AwardsSharePlayers table 
##  7 Batting             data.frame  99846    22 Batting table            
##  8 BattingPost         data.frame  11294    22 BattingPost table        
##  9 CollegePlaying      data.frame  17350     3 CollegePlaying table     
## 10 Fielding            data.frame 167938    18 Fielding table           
## 11 FieldingOF          data.frame  12028     6 FieldingOF table         
## 12 FieldingPost        data.frame  11924    17 FieldingPost data        
## 13 HallOfFame          data.frame   4088     9 Hall of Fame Voting Data 
## 14 Managers            data.frame   3370    10 Managers table           
## 15 ManagersHalf        data.frame     93    10 ManagersHalf table       
## 16 Master              data.frame  18589    26 Master table             
## 17 Pitching            data.frame  43330    30 Pitching table           
## 18 PitchingPost        data.frame   4945    30 PitchingPost table       
## 19 Salaries            data.frame  24758     5 Salaries table           
## 20 Schools             data.frame   1207     5 Schools table            
## 21 SeriesPost          data.frame    298     9 SeriesPost table         
## 22 Teams               data.frame   2775    48 Teams table              
## 23 TeamsFranchises     data.frame    120     4 TeamFranchises table     
## 24 TeamsHalf           data.frame     52    10 TeamsHalf table

Working with tables

Search, subset, and limiting clauses

The basic structure of a SQL query contains a SELECT statement indicating which columns are desired and a FROM statement explaining where to find them. You can indicate a subset of rows using a WHERE clause.

CREATE TABLE my_table AS 
SELECT col1, col2, col3 
  FROM some_existing_table
  WHERE col1 IN ('A', 'B', 'C')

In the model above, CREATE TABLE will only work with a client-server model if you have permissions to write to memory. Use a wild card * to select all columns in a table:

lahman %>% 
  tbl(sql('
      SELECT *
        FROM master
        LIMIT 5
      ')) %>%
  collect()
## # A tibble: 5 x 26
##   playerID birthYear birthMonth birthDay birthCountry birthState birthCity
##   <chr>        <int>      <int>    <int> <chr>        <chr>      <chr>    
## 1 aardsda…      1981         12       27 USA          CO         Denver   
## 2 aaronha…      1934          2        5 USA          AL         Mobile   
## 3 aaronto…      1939          8        5 USA          AL         Mobile   
## 4 aasedo01      1954          9        8 USA          CA         Orange   
## 5 abadan01      1972          8       25 USA          FL         Palm Bea…
## # … with 19 more variables: deathYear <int>, deathMonth <int>, deathDay <int>,
## #   deathCountry <chr>, deathState <chr>, deathCity <chr>, nameFirst <chr>,
## #   nameLast <chr>, nameGiven <chr>, weight <int>, height <int>, bats <chr>,
## #   throws <chr>, debut <chr>, finalGame <chr>, retroID <chr>, bbrefID <chr>,
## #   deathDate <dbl>, birthDate <dbl>

Above we use a LIMIT clause to grab just the first 5 rows of our table. This is good practice when developing new queries as it prevents large wait times only to discover a bug.

Limiting Clauses

Here we will explore the limiting clause WHERE in greater detail. Most of what we discuss in this section will also work with HAVING which operates on summary statistics rather than individual rows.

Boolean Logic

We can use Boolean comparisons in a WHERE clause as shown in the example below. We find all player-seasons since 2000 in which the player was credited with an RBI 100 or more times.

Here is our query:

SELECT playerID, yearID, RBI 
  FROM batting 
 WHERE RBI >= 100 AND yearID >= 2000

And here it is in dbplyr:

## Get all 100+ RBI seasons since 2000
rbi100 =
  lahman %>% 
  tbl(sql('
      SELECT playerID, yearID, RBI 
        FROM batting 
        WHERE RBI >= 100 AND yearID >= 2000
      '))
print(rbi100)
## # Source:   SQL [?? x 3]
## # Database: sqlite 3.33.0
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//RtmpCJciUo/lahman.sqlite]
##    playerID  yearID   RBI
##    <chr>      <int> <int>
##  1 aloumo01    2000   114
##  2 anderga01   2000   117
##  3 bagweje01   2000   132
##  4 batisto01   2000   114
##  5 belleal01   2000   103
##  6 bondsba01   2000   106
##  7 cirilje01   2000   115
##  8 delgaca01   2000   137
##  9 dyeje01     2000   118
## 10 edmonji01   2000   108
## # … with more rows

Note that the code above creates and tests the query, but does not execute it over the entire table. To do that, we use collect().

## This is dbplyr for performing the query
rbi100 %>% collect()
## # A tibble: 518 x 3
##    playerID  yearID   RBI
##    <chr>      <int> <int>
##  1 aloumo01    2000   114
##  2 anderga01   2000   117
##  3 bagweje01   2000   132
##  4 batisto01   2000   114
##  5 belleal01   2000   103
##  6 bondsba01   2000   106
##  7 cirilje01   2000   115
##  8 delgaca01   2000   137
##  9 dyeje01     2000   118
## 10 edmonji01   2000   108
## # … with 508 more rows

IN

To select on a column by testing against a set of fixed values use IN.

SELECT nameGiven as Given, nameLast Last, birthYear AS born --Note equivalent ways to rename columns
  FROM master
 WHERE nameLast IN ("Alou", "Griffey")
## Find all players with last name Alou or Griffey
lahman %>% 
  tbl(sql('
      SELECT nameGiven as Given, nameLast Last, birthYear AS born
        FROM master
        WHERE nameLast IN ("Alou", "Griffey")
      ')) %>% 
  collect()
## # A tibble: 6 x 3
##   Given             Last     born
##   <chr>             <chr>   <int>
## 1 Felipe Rojas      Alou     1935
## 2 Jesus Maria Rojas Alou     1942
## 3 Mateo Rojas       Alou     1938
## 4 Moises Rojas      Alou     1966
## 5 George Kenneth    Griffey  1950
## 6 George Kenneth    Griffey  1969

LIKE

Use a LIKE statement with a WHERE clause to get partial string matching. You can use % to match any sub-string.

SELECT nameGiven as Given, nameLast Last, birthYear AS born
  FROM master
 WHERE nameLast LIKE "%riff%" 
## Find all players with last name containing a 'riff' sub-string
lahman %>% 
  tbl(sql('
      SELECT nameGiven as Given, nameLast Last, birthYear AS born
        FROM master
        WHERE nameLast LIKE "%riff%"
      ')) %>%
  collect()
## # A tibble: 26 x 3
##    Given              Last       born
##    <chr>              <chr>     <int>
##  1 Arthur Joseph      Griffin    1988
##  2 Alfredo Claudino   Griffin    1957
##  3 Bartholomew Joseph Griffith   1896
##  4 Clark Calvin       Griffith   1869
##  5 Robert Derrell     Griffith   1943
##  6 Douglas Lee        Griffin    1947
##  7 Frank Wesley       Griffith   1872
##  8 James Linton       Griffin    1886
##  9 Ivy Moore          Griffin    1896
## 10 Jeremy Richard     Griffiths  1978
## # … with 16 more rows

Most SQL implementations also have a REGEXP or REGEXLIKE function that works with regular expressions, but SQLite requires a user defined regex() for its use so we skip it here.

Combining limiting clauses

Limiting WHERE clauses can be combined using AND and OR. Statements can be negated using NOT.

SELECT nameGiven as Given, nameLast Last, birthYear AS born, birthCountry as Place
  FROM master
 WHERE birthCountry == "P.R." AND birthYear LIKE "199%"
## Find all players born in Puerto Rico during the 1990's
lahman %>% 
  tbl(sql('
      SELECT nameGiven as Given, nameLast Last, birthYear AS born, birthCountry as Place
        FROM master
        WHERE birthCountry == "P.R." AND birthYear LIKE "199%"
      ')) %>% 
  collect()
## # A tibble: 28 x 4
##    Given         Last      born Place
##    <chr>         <chr>    <int> <chr>
##  1 Ednel Javier  Baez      1992 P.R. 
##  2 Jose Orlando  Berrios   1994 P.R. 
##  3 Victor Manuel Caratini  1993 P.R. 
##  4 Willi Rafael  Castro    1997 P.R. 
##  5 Alexander     Claudio   1992 P.R. 
##  6 Joseph        Colon     1990 P.R. 
##  7 Carlos Javier Correa    1994 P.R. 
##  8 Noel A.       Cuevas    1991 P.R. 
##  9 Jose Eugenio  De Leon   1992 P.R. 
## 10 Edwin Orlando Diaz      1994 P.R. 
## # … with 18 more rows
## Find all players born in Alaska (AK) or Hawaii (HI)
lahman %>% 
  tbl(sql('
      SELECT nameGiven as Given, nameLast Last, birthYear AS born, birthCountry as Place
        FROM master
        WHERE birthState == "HI" OR birthState == "AK"
      ')) %>% 
  collect()
## # A tibble: 58 x 4
##    Given                              Last        born Place
##    <chr>                              <chr>      <int> <chr>
##  1 Benny Peter                        Agbayani    1971 USA  
##  2 Dustin Kamakana Mai Ku'u Makualani Antolin     1989 USA  
##  3 Anthony Lee                        Barnette    1983 USA  
##  4 Chad Robert                        Bentz       1980 USA  
##  5 Douglas Edmund                     Capilla     1952 USA  
##  6 Shawn Anthony                      Chacon      1977 USA  
##  7 Steven Montague                    Cooke       1970 USA  
##  8 Aaron R.                           Cunningham  1986 USA  
##  9 Ronald Maurice                     Darling     1960 USA  
## 10 Joseph                             DeSa        1959 USA  
## # … with 48 more rows

ORDER BY

Use an ORDER BY clause with a comma separated list of columns to arrange the table.

SELECT playerID, yearID, RBI 
  FROM batting 
 WHERE RBI >= 100 AND yearID >= 2010
 ORDER BY yearID, RBI
## Get all 100+ RBI seasons since 2010, ordered 
lahman %>% 
  tbl(sql('
      SELECT playerID, yearID, RBI 
        FROM batting 
        WHERE RBI >= 100 AND yearID >= 2010
        ORDER BY yearID, -RBI
      ')) %>%
  collect()
## # A tibble: 174 x 3
##    playerID  yearID   RBI
##    <chr>      <int> <int>
##  1 cabremi01   2010   126
##  2 rodrial01   2010   125
##  3 bautijo02   2010   124
##  4 pujolal01   2010   118
##  5 gonzaca01   2010   117
##  6 guerrvl01   2010   115
##  7 vottojo01   2010   113
##  8 youngde03   2010   112
##  9 konerpa01   2010   111
## 10 canoro01    2010   109
## # … with 164 more rows

Aggregations, Group By

We can perform aggregations such as sums, means, and counts similar to summarize() from dplyr.

We can use a GROUP BY clause for the split-apply-combine pattern.

Here we find the players with the most total RBI since 2010.

SELECT playerID, sum(RBI) as RBI_total
  FROM batting 
  WHERE yearID >= 2010
  GROUP BY playerID
  ORDER BY -RBI_total
## Count total RBIs since 2010 by player
lahman %>% 
  tbl(sql('
      SELECT playerID, sum(RBI) as RBI_total
        FROM batting 
        WHERE yearID >= 2010
        GROUP BY playerID
        ORDER BY -RBI_total
      ')) %>% 
  collect()
## # A tibble: 3,566 x 2
##    playerID  RBI_total
##    <chr>         <int>
##  1 pujolal01       963
##  2 cruzne02        961
##  3 encared01       956
##  4 cabremi01       941
##  5 canoro01        878
##  6 bruceja01       824
##  7 longoev01       817
##  8 braunry02       811
##  9 goldspa01       807
## 10 jonesad01       806
## # … with 3,556 more rows

Having

The operator defining a limiting clause on an aggregate variable is HAVING. It is essentially like WHERE except for operating on summary statistics rather than individual rows. In other words, HAVING refers to the output table specified in SELECT rather than the input table(s) specified using FROM.

In the query below, observe that the HAVING clause comes after the GROUP BY but before the ORDER BY.

SELECT playerID, sum(RBI) as RBI_total
  FROM batting 
 WHERE yearID >= 2010
 GROUP BY playerID
   HAVING RBI_total >= 500
 ORDER BY -RBI_total
## Players with 500+ RBIs since 2010
lahman %>% 
  tbl(sql('
      SELECT playerID, sum(RBI) as RBI_total
        FROM batting 
        WHERE yearID >= 2010
        GROUP BY playerID
        HAVING RBI_total >= 500
        ORDER BY -RBI_total
      ')) %>%
  collect()
## # A tibble: 96 x 2
##    playerID  RBI_total
##    <chr>         <int>
##  1 pujolal01       963
##  2 cruzne02        961
##  3 encared01       956
##  4 cabremi01       941
##  5 canoro01        878
##  6 bruceja01       824
##  7 longoev01       817
##  8 braunry02       811
##  9 goldspa01       807
## 10 jonesad01       806
## # … with 86 more rows

Joins

So far we have discussed working with single tables only. The SQL term for merging data from two or more tables is a ‘join’. All joins are based on the idea of equating rows that match on one or more variables. Below we will discuss the following types of joins:

Inner Join

What if we wanted to supplement our earlier table showing players with 500+ RBI since 2000 with information about those players? We could use an inner join of our RBI table with the ‘Master’ table to accomplish this.

SELECT m.nameFirst First, m.nameLast Last, sum(RBI) as RBI_TOTAL
  FROM batting b
  INNER JOIN master m ON b.playerID = m.playerID 
  WHERE yearID >= 2010
  GROUP BY b.playerID
  HAVING RBI_total >= 500
  ORDER BY -RBI_total
lahman %>% tbl(sql(
'
SELECT m.nameFirst First, m.nameLast Last, sum(RBI) as RBI_TOTAL
  FROM batting b
  INNER JOIN master m ON b.playerID = m.playerID 
  WHERE yearID >= 2010
  GROUP BY b.playerID
  HAVING RBI_total >= 500
  ORDER BY -RBI_total
'
))
## # Source:   SQL [?? x 3]
## # Database: sqlite 3.33.0
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//RtmpCJciUo/lahman.sqlite]
##    First    Last        RBI_TOTAL
##    <chr>    <chr>           <int>
##  1 Albert   Pujols            963
##  2 Nelson   Cruz              961
##  3 Edwin    Encarnacion       956
##  4 Miguel   Cabrera           941
##  5 Robinson Cano              878
##  6 Jay      Bruce             824
##  7 Evan     Longoria          817
##  8 Ryan     Braun             811
##  9 Paul     Goldschmidt       807
## 10 Adam     Jones             806
## # … with more rows

In this example, an inner join suffices because each playerID in the batting table is linked to a playerID in the master table.

Left & Right (Outer) Joins

In a left join – sometimes called a left outer join – we add columns from the right table to the left table when matching rows are found. Rows from the left table with no matches from the right table are retained with columns from the right table filled in as NULL (i.e. NA). When there are multiple matches of a row from the left table to rows in the right table, these each become a row in the new table.

A right join is equivalent to a left join with the exception that the roles between right and left are reversed. My personal preference is to generally structure queries as left joins for simplicity.

Left joins are particularly useful when the information in the right table is only applicable to a subset of the rows from the left table. As an example, suppose we would like to know which US colleges and universities have produced the most “Rookie of the Year Awards” given to the best debuting player(s) each season.

To get started, we first test a query to find the last college attended.

-- Last college attended
SELECT *
FROM CollegePlaying
GROUP BY playerID
HAVING yearID == max(YearID)
# Query to find last college atttended
lahman %>% tbl(sql(
'
SELECT *
FROM CollegePlaying
GROUP BY playerID
HAVING yearID == max(YearID)
'
))
## # Source:   SQL [?? x 3]
## # Database: sqlite 3.33.0
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//RtmpCJciUo/lahman.sqlite]
##    playerID  schoolID   yearID
##    <chr>     <chr>       <int>
##  1 aardsda01 rice         2003
##  2 abadan01  gamiddl      1993
##  3 abbeybe01 vermont      1892
##  4 abbotje01 kentucky     1994
##  5 abbotji01 michigan     1988
##  6 abbotky01 longbeach    1989
##  7 abercre01 flgateway    2000
##  8 abernte01 elon         1941
##  9 ablesha01 swesterntx   1904
## 10 accarje01 illinoisst   2003
## # … with more rows

Now, we find all distinct awards in the AwardPlayers table.

# Distinct Player Awards
lahman %>% tbl(sql(
'
SELECT distinct(awardID)
FROM AwardsPlayers
'
)) %>% 
  collect() %>%
  print(n=30)
## # A tibble: 29 x 1
##    awardID                            
##    <chr>                              
##  1 ALCS MVP                           
##  2 All-Star Game MVP                  
##  3 Babe Ruth Award                    
##  4 Baseball Magazine All-Star         
##  5 Branch Rickey Award                
##  6 Comeback Player of the Year        
##  7 Cy Young Award                     
##  8 Gold Glove                         
##  9 Hank Aaron Award                   
## 10 Hutch Award                        
## 11 Lou Gehrig Memorial Award          
## 12 Most Valuable Player               
## 13 NLCS MVP                           
## 14 Outstanding DH Award               
## 15 Pitching Triple Crown              
## 16 Reliever of the Year Award         
## 17 Roberto Clemente Award             
## 18 Rolaids Relief Man Award           
## 19 Rookie of the Year                 
## 20 Silver Slugger                     
## 21 TSN All-Star                       
## 22 TSN Fireman of the Year            
## 23 TSN Guide MVP                      
## 24 TSN Major League Player of the Year
## 25 TSN Pitcher of the Year            
## 26 TSN Player of the Year             
## 27 TSN Reliever of the Year           
## 28 Triple Crown                       
## 29 World Series MVP

Next we test a query for finding all Rookie of the Year Awards.

SELECT *
FROM AwardsPlayers
WHERE awardID LIKE "Rookie%"
# Query to find Rookie of the Year Awards
lahman %>% tbl(sql(
'
SELECT *
FROM AwardsPlayers
WHERE awardID LIKE "Rookie%"
'
)) 
## # Source:   SQL [?? x 6]
## # Database: sqlite 3.33.0
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//RtmpCJciUo/lahman.sqlite]
##    playerID  awardID            yearID lgID  tie   notes
##    <chr>     <chr>               <int> <chr> <chr> <chr>
##  1 robinja02 Rookie of the Year   1947 ML    <NA>  <NA> 
##  2 darkal01  Rookie of the Year   1948 ML    <NA>  <NA> 
##  3 sievero01 Rookie of the Year   1949 AL    <NA>  <NA> 
##  4 newcodo01 Rookie of the Year   1949 NL    <NA>  <NA> 
##  5 dropowa01 Rookie of the Year   1950 AL    <NA>  <NA> 
##  6 jethrsa01 Rookie of the Year   1950 NL    <NA>  <NA> 
##  7 mcdougi01 Rookie of the Year   1951 AL    <NA>  <NA> 
##  8 mayswi01  Rookie of the Year   1951 NL    <NA>  <NA> 
##  9 byrdha01  Rookie of the Year   1952 AL    <NA>  <NA> 
## 10 blackjo02 Rookie of the Year   1952 NL    <NA>  <NA> 
## # … with more rows

Finally, we use a left join of the tables for ROY awards and last college attended to match winners to their schools. We need a left join as many of the winners may never have played collegiate baseball.

SELECT roy.playerID playerID, roy.yearID year, lgID league, schoolID
  FROM AwardsPlayers roy
  LEFT JOIN 
    (SELECT *  --Final College Attended
      FROM CollegePlaying
      GROUP BY playerID
      HAVING yearID == max(YearID)
    ) c
 ON c.playerID = roy.playerID
 WHERE awardID LIKE "Rookie%"
# Query to find last college for ROY
lahman %>% tbl(sql(
'
SELECT roy.playerID playerID, roy.yearID year, lgID league, schoolID
  FROM AwardsPlayers roy
  LEFT JOIN 
    (SELECT *  --Last College Attended
      FROM CollegePlaying
      GROUP BY playerID
      HAVING yearID == max(YearID)
    ) c
 ON c.playerID = roy.playerID
 WHERE awardID LIKE "Rookie%"
'
))
## # Source:   SQL [?? x 4]
## # Database: sqlite 3.33.0
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//RtmpCJciUo/lahman.sqlite]
##    playerID   year league schoolID
##    <chr>     <int> <chr>  <chr>   
##  1 robinja02  1947 ML     ucla    
##  2 darkal01   1948 ML     ulala   
##  3 sievero01  1949 AL     <NA>    
##  4 newcodo01  1949 NL     <NA>    
##  5 dropowa01  1950 AL     uconn   
##  6 jethrsa01  1950 NL     <NA>    
##  7 mcdougi01  1951 AL     <NA>    
##  8 mayswi01   1951 NL     <NA>    
##  9 byrdha01   1952 AL     <NA>    
## 10 blackjo02  1952 NL     <NA>    
## # … with more rows

To complete the example, we modify the query to display which schools have produced the most ROY awards in total.

SELECT schoolID, count(c.playerID) as ROY_awards
  FROM AwardsPlayers roy
  LEFT JOIN 
    (SELECT *  --Last College Attended
      FROM CollegePlaying
      GROUP BY playerID
      HAVING yearID == max(YearID)
    ) c
 ON c.playerID = roy.playerID
 WHERE awardID LIKE "Rookie%"
   AND schoolID IS NOT NULL
 GROUP BY schoolID
 HAVING ROY_awards > 1
 ORDER BY -ROY_awards
# Which schools have produced the most ROY?
lahman %>% tbl(sql(
'
SELECT schoolID, count(c.playerID) as ROY_awards
  FROM AwardsPlayers roy
  LEFT JOIN 
    (SELECT *  --Last College Attended
      FROM CollegePlaying
      GROUP BY playerID
      HAVING yearID == max(yearID)
    ) c
 ON c.playerID = roy.playerID
 WHERE awardID LIKE "Rookie%"
   AND schoolID IS NOT NULL
 GROUP BY schoolID
 HAVING ROY_awards > 1
 ORDER BY -ROY_awards
'
)) %>% collect()
## # A tibble: 9 x 2
##   schoolID  ROY_awards
##   <chr>          <int>
## 1 arizonast          4
## 2 michigan           3
## 3 ucla               3
## 4 usc                3
## 5 floridaam          2
## 6 longbeach          2
## 7 oklahoma           2
## 8 texasam            2
## 9 wagner             2

Using show_query()

If you are new to SQL but comfortable with dplyr you may find show_query() useful for understanding how dplyr statements are translated to SQL.

# Players with more than 5 years at a single school.
lahman %>% 
  tbl('CollegePlaying') %>% 
  filter(yearID > 1900) %>%
  group_by(playerID, schoolID) %>%
  summarize(n_year = n()) %>%
  filter(n_year>5) %>%
  show_query()
## <SQL>
## SELECT *
## FROM (SELECT `playerID`, `schoolID`, COUNT() AS `n_year`
## FROM (SELECT *
## FROM `CollegePlaying`
## WHERE (`yearID` > 1900.0))
## GROUP BY `playerID`, `schoolID`)
## WHERE (`n_year` > 5.0)

Resources