SQL and Relational Databases

A relational database is a set of rectangular data frames called tables linked by keys relating one table to another. Software implementations of such data structures are known as relational database management systems (RDBMS). Most RDBMS use structured query language (SQL) to modify or search the relational database.

SQL provides a syntax for interfacing with relational data. It is largely a declarative language in that we use SQL to specify what we wish to accomplish,leaving the how to the RDBMS. While there are standards for SQL implementations put out by the International Organization for Standardization (ISO) and the American National Standards Institute (ANSI), there are several open source and commercial implementations that each have unique features.

I will try to focus on the commonalities, but will be using an SQLlite engine in R for providing examples. One unique feature of SQLite is that it does not follow a client-server model. In this model, a physical computer storing the data and executing queries within the RDMBS, the server, is separate from the machine requesting the queries known as the client. The client-server model is popular in business, health care, and other domains as it allows security and monitoring of how the data is queried. It is also popular for many large open data projects (i.e ensemble) where it is beneficial for data to be centrally maintained and frequently accessed on the fly.

In order to write SQL queries within R I will be using the package dbplyr which provides an SQL back end to dplyr functions, and also allows SQL queries to be parsed as strings.

Lahman Examples

For our examples, we will use the Lahman package from R which contains historical baseball data from 1876-2016.

# Packages
library(tidyverse)
library(dbplyr)
library(Lahman)

# Create a local SQLlite database of the Lahman data
lahman = lahman_sqlite()

# Query the batting table
lahman %>% tbl("BATTING")
## # Source:   table<BATTING> [?? x 22]
## # Database: sqlite 3.19.3
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//Rtmp8MUVms/lahman.sqlite]
##     playerID yearID stint teamID  lgID     G    AB     R     H   X2B   X3B
##        <chr>  <int> <int>  <chr> <chr> <int> <int> <int> <int> <int> <int>
##  1 abercda01   1871     1    TRO    NA     1     4     0     0     0     0
##  2  addybo01   1871     1    RC1    NA    25   118    30    32     6     0
##  3 allisar01   1871     1    CL1    NA    29   137    28    40     4     5
##  4 allisdo01   1871     1    WS3    NA    27   133    28    44    10     2
##  5 ansonca01   1871     1    RC1    NA    25   120    29    39    11     3
##  6 armstbo01   1871     1    FW1    NA    12    49     9    11     2     1
##  7 barkeal01   1871     1    RC1    NA     1     4     0     1     0     0
##  8 barnero01   1871     1    BS1    NA    31   157    66    63    10     9
##  9 barrebi01   1871     1    FW1    NA     1     5     1     1     1     0
## 10 barrofr01   1871     1    BS1    NA    18    86    13    13     2     1
## # ... with more rows, and 11 more variables: HR <int>, RBI <int>,
## #   SB <int>, CS <int>, BB <int>, SO <int>, IBB <int>, HBP <int>,
## #   SH <int>, SF <int>, GIDP <int>
# Query 'LahmanData' with meta data
print(lahman %>% tbl('LahmanData') %>% collect(), n = 25)
## # A tibble: 24 x 5
##                   file      class   nobs  nvar                     title
##                  <chr>      <chr>  <dbl> <dbl>                     <chr>
##  1         AllstarFull data.frame   4993     8         AllstarFull table
##  2         Appearances data.frame  99466    21         Appearances table
##  3      AwardsManagers data.frame    171     6      AwardsManagers table
##  4       AwardsPlayers data.frame   6026     6       AwardsPlayers table
##  5 AwardsShareManagers data.frame    401     7 AwardsShareManagers table
##  6  AwardsSharePlayers data.frame   6705     7  AwardsSharePlayers table
##  7             Batting data.frame  99846    22             Batting table
##  8         BattingPost data.frame  11294    22         BattingPost table
##  9      CollegePlaying data.frame  17350     3      CollegePlaying table
## 10            Fielding data.frame 167938    18            Fielding table
## 11          FieldingOF data.frame  12028     6          FieldingOF table
## 12        FieldingPost data.frame  11924    17         FieldingPost data
## 13          HallOfFame data.frame   4088     9  Hall of Fame Voting Data
## 14            Managers data.frame   3370    10            Managers table
## 15        ManagersHalf data.frame     93    10        ManagersHalf table
## 16              Master data.frame  18589    26              Master table
## 17            Pitching data.frame  43330    30            Pitching table
## 18        PitchingPost data.frame   4945    30        PitchingPost table
## 19            Salaries data.frame  24758     5            Salaries table
## 20             Schools data.frame   1207     5             Schools table
## 21          SeriesPost data.frame    298     9          SeriesPost table
## 22               Teams data.frame   2775    48               Teams table
## 23     TeamsFranchises data.frame    120     4      TeamFranchises table
## 24           TeamsHalf data.frame     52    10           TeamsHalf table

Working with tables

Search, subset, and limiting clauses

The basic structure of a SQL query contains a SELECT statement indicating which columns are desired and a FROM statement explaining where to find them. You can indicate a subset of rows using a WHERE clause.

CREATE TABLE my_table AS 
SELECT col1, col2, col3 
  FROM some_existing_table
  WHERE col1 IN ('A', 'B', 'C')

In the model above, CREATE TABLE will only work with a client-server model if you have permissions to write to memory. Use a wild card * to select all columns in a table:

lahman %>% 
  tbl(sql('
      SELECT *
        FROM master
        LIMIT 5
      ')) %>% collect()
## # A tibble: 5 x 26
##    playerID birthYear birthMonth birthDay birthCountry birthState
##       <chr>     <int>      <int>    <int>        <chr>      <chr>
## 1 aardsda01      1981         12       27          USA         CO
## 2 aaronha01      1934          2        5          USA         AL
## 3 aaronto01      1939          8        5          USA         AL
## 4  aasedo01      1954          9        8          USA         CA
## 5  abadan01      1972          8       25          USA         FL
## # ... with 20 more variables: birthCity <chr>, deathYear <int>,
## #   deathMonth <int>, deathDay <int>, deathCountry <chr>,
## #   deathState <chr>, deathCity <chr>, nameFirst <chr>, nameLast <chr>,
## #   nameGiven <chr>, weight <int>, height <int>, bats <chr>, throws <chr>,
## #   debut <chr>, finalGame <chr>, retroID <chr>, bbrefID <chr>,
## #   deathDate <dbl>, birthDate <dbl>

Above we use a LIMIT clause to grab just the first 5 rows of our table. This is good practice when developing new queries as it prevents large wait times only to discover a bug.

Limiting Clauses

Here we will explore the limiting clause WHERE in greater detail. Most of what we discuss in this section will also work with HAVING which operates on summary statistics rather than individual rows.

Boolean Logic

We can use Boolean comparisons in a WHERE clause as shown in the example below. We find all player-seasons since 2000 in which the player was credited with an RBI 100 or more times.

Here is our query:

SELECT playerID, yearID, RBI 
  FROM batting 
 WHERE RBI >= 100 AND yearID >= 2000

And here it is in dbplyr:

## Get all 100+ RBI seasons since 2000
rbi100 =
  lahman %>% 
  tbl(sql('
      SELECT playerID, yearID, RBI 
        FROM batting 
        WHERE RBI >= 100 AND yearID >= 2000
      '))
print(rbi100)
## # Source:   SQL [?? x 3]
## # Database: sqlite 3.19.3
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//Rtmp8MUVms/lahman.sqlite]
##     playerID yearID   RBI
##        <chr>  <int> <int>
##  1  aloumo01   2000   114
##  2 anderga01   2000   117
##  3 bagweje01   2000   132
##  4 batisto01   2000   114
##  5 belleal01   2000   103
##  6 bondsba01   2000   106
##  7 cirilje01   2000   115
##  8 delgaca01   2000   137
##  9   dyeje01   2000   118
## 10 edmonji01   2000   108
## # ... with more rows

Note that the code above creates and tests the query, but does not execute it over the entire table. To do that, we use collect().

## This is dblyr for peforming the query
rbi100 %>% collect()
## # A tibble: 460 x 3
##     playerID yearID   RBI
##        <chr>  <int> <int>
##  1  aloumo01   2000   114
##  2 anderga01   2000   117
##  3 bagweje01   2000   132
##  4 batisto01   2000   114
##  5 belleal01   2000   103
##  6 bondsba01   2000   106
##  7 cirilje01   2000   115
##  8 delgaca01   2000   137
##  9   dyeje01   2000   118
## 10 edmonji01   2000   108
## # ... with 450 more rows

IN

To select on a column by testing against a set of fixed values use IN.

SELECT nameGiven as Given, nameLast Last, birthYear AS born --Note equivalent ways to rename columns
  FROM master
 WHERE nameLast IN ("Alou", "Griffey")
## Find all players with last name Alou or Griffey
lahman %>% 
  tbl(sql('
      SELECT nameGiven as Given, nameLast Last, birthYear AS born
        FROM master
        WHERE nameLast IN ("Alou", "Griffey")
      ')) %>% collect()
## # A tibble: 6 x 3
##               Given    Last  born
##               <chr>   <chr> <int>
## 1      Felipe Rojas    Alou  1935
## 2 Jesus Maria Rojas    Alou  1942
## 3       Mateo Rojas    Alou  1938
## 4      Moises Rojas    Alou  1966
## 5    George Kenneth Griffey  1950
## 6    George Kenneth Griffey  1969

LIKE

Use a LIKE statement with a WHERE clause to get partial string matching. You can use % to match any sub-string.

SELECT nameGiven as Given, nameLast Last, birthYear AS born
  FROM master
 WHERE nameLast LIKE "%riff%" 
## Find all players with last name containing 'riff' substring
lahman %>% 
  tbl(sql('
      SELECT nameGiven as Given, nameLast Last, birthYear AS born
        FROM master
        WHERE nameLast LIKE "%riff%"
      ')) %>% collect()
## # A tibble: 25 x 3
##                 Given      Last  born
##                 <chr>     <chr> <int>
##  1      Arthur Joseph   Griffin  1988
##  2   Alfredo Claudino   Griffin  1957
##  3 Bartholomew Joseph  Griffith  1896
##  4       Clark Calvin  Griffith  1869
##  5     Robert Derrell  Griffith  1943
##  6        Douglas Lee   Griffin  1947
##  7       Frank Wesley  Griffith  1872
##  8       James Linton   Griffin  1886
##  9          Ivy Moore   Griffin  1896
## 10     Jeremy Richard Griffiths  1978
## # ... with 15 more rows

Most SQL implementations also have a REGEXP or REGEXLIKE function that works with regular expressions, but SQLite requires a user defined regex() for its use so we skip it here.

Combining limiting clauses

Limiting WHERE clauses can be combined using AND and/or OR. Statements can be negated using NOT.

SELECT nameGiven as Given, nameLast Last, birthYear AS born, birthCountry as Place
  FROM master
 WHERE birthCountry == "P.R." AND birthYear LIKE "199%"
## Find all players born in Puerto Rico during the 1990's
lahman %>% 
  tbl(sql('
      SELECT nameGiven as Given, nameLast Last, birthYear AS born, birthCountry as Place
        FROM master
        WHERE birthCountry == "P.R." AND birthYear LIKE "199%"
      ')) %>% collect()
## # A tibble: 16 x 4
##               Given      Last  born Place
##               <chr>     <chr> <int> <chr>
##  1     Ednel Javier      Baez  1992  P.R.
##  2     Jose Orlando   Berrios  1994  P.R.
##  3        Alexander   Claudio  1992  P.R.
##  4           Joseph     Colon  1990  P.R.
##  5    Carlos Javier    Correa  1994  P.R.
##  6     Jose Eugenio   De Leon  1992  P.R.
##  7    Edwin Orlando      Diaz  1994  P.R.
##  8       Enrique J. Hernandez  1991  P.R.
##  9 Francisco Miguel    Lindor  1993  P.R.
## 10     Jorge Yabiel     Lopez  1993  P.R.
## 11           Steven      Moya  1991  P.R.
## 12           Yadiel    Rivera  1992  P.R.
## 13     Eddie Manuel   Rosario  1991  P.R.
## 14    Giovanni Luis      Soto  1991  P.R.
## 15           Kennys    Vargas  1990  P.R.
## 16 Christian Rafael   Vazquez  1990  P.R.
## Find all players born in Alaska (AK) or Hawaii (HI)
lahman %>% 
  tbl(sql('
      SELECT nameGiven as Given, nameLast Last, birthYear AS born, birthCountry as Place
        FROM master
        WHERE birthState == "HI" OR birthState == "AK"
      ')) %>% collect()
## # A tibble: 53 x 4
##                                 Given       Last  born Place
##                                 <chr>      <chr> <int> <chr>
##  1                        Benny Peter   Agbayani  1971   USA
##  2 Dustin Kamakana Mai Ku'u Makualani    Antolin  1989   USA
##  3                        Anthony Lee   Barnette  1983   USA
##  4                        Chad Robert      Bentz  1980   USA
##  5                     Douglas Edmund    Capilla  1952   USA
##  6                      Shawn Anthony     Chacon  1977   USA
##  7                    Steven Montague      Cooke  1970   USA
##  8                           Aaron R. Cunningham  1986   USA
##  9                     Ronald Maurice    Darling  1960   USA
## 10                             Joseph       DeSa  1959   USA
## # ... with 43 more rows

ORDER BY

Use an ORDER BY clause with a comma separated list of columns to arrange the table.

SELECT playerID, yearID, RBI 
  FROM batting 
 WHERE RBI >= 100 AND yearID >= 2010
 ORDER BY yearID, RBI
## Get all 100+ RBI seasons since 2010, ordered 
rbi100 =
  lahman %>% 
  tbl(sql('
      SELECT playerID, yearID, RBI 
        FROM batting 
        WHERE RBI >= 100 AND yearID >= 2010
        ORDER BY yearID, -RBI
      '))
rbi100 %>% collect()
## # A tibble: 116 x 3
##     playerID yearID   RBI
##        <chr>  <int> <int>
##  1 cabremi01   2010   126
##  2 rodrial01   2010   125
##  3 bautijo02   2010   124
##  4 pujolal01   2010   118
##  5 gonzaca01   2010   117
##  6 guerrvl01   2010   115
##  7 vottojo01   2010   113
##  8 youngde03   2010   112
##  9 konerpa01   2010   111
## 10  canoro01   2010   109
## # ... with 106 more rows

Aggregations, Group By

We can perform aggregations such as sums, means, and counts similar to summarize() from dplyr.

We can use a GROUP BY clause for the split-apply-combine pattern.

Here we find the players with the most total RBI since 2010.

SELECT playerID, sum(RBI) as RBI_total
  FROM batting 
  WHERE yearID >= 2010
  GROUP BY playerID
  ORDER BY -RBI_total
## Count total RBIs since 2010 by player
lahman %>% 
  tbl(sql('
      SELECT playerID, sum(RBI) as RBI_total
        FROM batting 
        WHERE yearID >= 2010
        GROUP BY playerID
        ORDER BY -RBI_total
      ')) %>% collect()
## # A tibble: 2,795 x 2
##     playerID RBI_total
##        <chr>     <int>
##  1 cabremi01       800
##  2 gonzaad01       722
##  3 pujolal01       705
##  4 ortizda01       700
##  5  canoro01       692
##  6 beltrad01       665
##  7 encared01       656
##  8 bautijo02       651
##  9  cruzne02       637
## 10 bruceja01       627
## # ... with 2,785 more rows

Having

The operator defining a limiting clause on an aggregate variable is HAVING. It is essentially like WHERE except for operating on summary statistics rather than individual rows.

In the query below, observe that the HAVING clause comes after the GROUP BY but before the ORDER BY.

SELECT playerID, sum(RBI) as RBI_total
  FROM batting 
 WHERE yearID >= 2010
 GROUP BY playerID
   HAVING RBI_total >= 500
 ORDER BY -RBI_total
## Players with 500+ RBIs since 2010
lahman %>% 
  tbl(sql('
      SELECT playerID, sum(RBI) as RBI_total
        FROM batting 
        WHERE yearID >= 2010
        GROUP BY playerID
        HAVING RBI_total >= 500
        ORDER BY -RBI_total
      ')) %>% collect()
## # A tibble: 40 x 2
##     playerID RBI_total
##        <chr>     <int>
##  1 cabremi01       800
##  2 gonzaad01       722
##  3 pujolal01       705
##  4 ortizda01       700
##  5  canoro01       692
##  6 beltrad01       665
##  7 encared01       656
##  8 bautijo02       651
##  9  cruzne02       637
## 10 bruceja01       627
## # ... with 30 more rows

Joins

So far we have discussed working with single tables only. The SQL term for merging data from two or more tables is a ‘join’. All joins are based on the idea of equating rows that match on one or more variables. Below we will discuss the following types of joins:

Inner Join

What if we wanted to supplement our earlier table showing players with 500+ RBI since 2000 with information about those players? We could use an inner join of our RBI table with the ‘Master’ table to accomplish this.

SELECT m.nameFirst First, m.nameLast Last, sum(RBI) as RBI_TOTAL
  FROM batting b
  INNER JOIN master m ON b.playerID = m.playerID 
  WHERE yearID >= 2010
  GROUP BY b.playerID
  HAVING RBI_total >= 500
  ORDER BY -RBI_total
lahman %>% tbl(sql(
'
SELECT m.nameFirst First, m.nameLast Last, sum(RBI) as RBI_TOTAL
  FROM batting b
  INNER JOIN master m ON b.playerID = m.playerID 
  WHERE yearID >= 2010
  GROUP BY b.playerID
  HAVING RBI_total >= 500
  ORDER BY -RBI_total
'
))
## # Source:   SQL [?? x 3]
## # Database: sqlite 3.19.3
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//Rtmp8MUVms/lahman.sqlite]
##       First        Last RBI_TOTAL
##       <chr>       <chr>     <int>
##  1   Miguel     Cabrera       800
##  2   Adrian    Gonzalez       722
##  3   Albert      Pujols       705
##  4    David       Ortiz       700
##  5 Robinson        Cano       692
##  6   Adrian      Beltre       665
##  7    Edwin Encarnacion       656
##  8     Jose    Bautista       651
##  9   Nelson        Cruz       637
## 10      Jay       Bruce       627
## # ... with more rows

In this example, an inner join suffices because each playerID in the batting table is linked to a playerID in the master table.

Left & Right (Outer) Joins

In a left join – sometimes called a left outer join – we add columns from the right table to the left table when matching rows are found. Rows from the left table with no matches from the right table are retained with columns from the right table filled in as NULL (i.e. NA). When there are multiple matches of a row from the left table to rows in the right table, these each become a row in the new table.

A right join is equivalent to a left join with the exception that the roles between right and left are reversed. My personal preferences is to always structure queries as left joins for simplicity.

Left joins are particularly useful when the information in the right table is only applicable to a subset of the rows from the left table. As an example, suppose we would like to know which US colleges and universities have produced the most “Rookie of the Year Awards” given to the best debuting player(s) each season.

To get started, we first test a query to find the last college attended.

-- Last college attended
SELECT *
FROM CollegePlaying
GROUP BY playerID
HAVING yearID == max(YearID)
# Query to find last college atttended
lahman %>% tbl(sql(
'
SELECT *
FROM CollegePlaying
GROUP BY playerID
HAVING yearID == max(YearID)
'
))
## # Source:   SQL [?? x 3]
## # Database: sqlite 3.19.3
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//Rtmp8MUVms/lahman.sqlite]
##     playerID   schoolID yearID
##        <chr>      <chr>  <int>
##  1 aardsda01       rice   2003
##  2  abadan01    gamiddl   1993
##  3 abbeybe01    vermont   1892
##  4 abbotje01   kentucky   1994
##  5 abbotji01   michigan   1988
##  6 abbotky01  longbeach   1989
##  7 abercre01  flgateway   2000
##  8 abernte01       elon   1941
##  9 ablesha01 swesterntx   1904
## 10 accarje01 illinoisst   2003
## # ... with more rows

Now, we find all distinct awards in the AwardPlayers table.

# Distinct Player Awards
lahman %>% tbl(sql(
'
SELECT distinct(awardID)
FROM AwardsPlayers
'
)) %>% collect() %>% print(n=30)
## # A tibble: 29 x 1
##                                awardID
##                                  <chr>
##  1                            ALCS MVP
##  2                   All-Star Game MVP
##  3                     Babe Ruth Award
##  4          Baseball Magazine All-Star
##  5                 Branch Rickey Award
##  6         Comeback Player of the Year
##  7                      Cy Young Award
##  8                          Gold Glove
##  9                    Hank Aaron Award
## 10                         Hutch Award
## 11           Lou Gehrig Memorial Award
## 12                Most Valuable Player
## 13                            NLCS MVP
## 14                Outstanding DH Award
## 15               Pitching Triple Crown
## 16          Reliever of the Year Award
## 17              Roberto Clemente Award
## 18            Rolaids Relief Man Award
## 19                  Rookie of the Year
## 20                      Silver Slugger
## 21                        TSN All-Star
## 22             TSN Fireman of the Year
## 23                       TSN Guide MVP
## 24 TSN Major League Player of the Year
## 25             TSN Pitcher of the Year
## 26              TSN Player of the Year
## 27            TSN Reliever of the Year
## 28                        Triple Crown
## 29                    World Series MVP

Next we test a query for finding all Rookie of the Year Awards.

SELECT *
FROM AwardsPlayers
WHERE awardID LIKE "Rookie%"
# Query to find Rookie of the Year Awards
lahman %>% tbl(sql(
'
SELECT *
FROM AwardsPlayers
WHERE awardID LIKE "Rookie%"
'
)) 
## # Source:   SQL [?? x 6]
## # Database: sqlite 3.19.3
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//Rtmp8MUVms/lahman.sqlite]
##     playerID            awardID yearID  lgID   tie notes
##        <chr>              <chr>  <int> <chr> <chr> <chr>
##  1 robinja02 Rookie of the Year   1947    ML  <NA>  <NA>
##  2  darkal01 Rookie of the Year   1948    ML  <NA>  <NA>
##  3 sievero01 Rookie of the Year   1949    AL  <NA>  <NA>
##  4 newcodo01 Rookie of the Year   1949    NL  <NA>  <NA>
##  5 dropowa01 Rookie of the Year   1950    AL  <NA>  <NA>
##  6 jethrsa01 Rookie of the Year   1950    NL  <NA>  <NA>
##  7 mcdougi01 Rookie of the Year   1951    AL  <NA>  <NA>
##  8  mayswi01 Rookie of the Year   1951    NL  <NA>  <NA>
##  9  byrdha01 Rookie of the Year   1952    AL  <NA>  <NA>
## 10 blackjo02 Rookie of the Year   1952    NL  <NA>  <NA>
## # ... with more rows

Finally, we use a left join of the tables for ROY awards and last college attended to match winners to their schools. We need a left join as many of the winners may never have played collegiate baseball.

SELECT roy.playerID playerID, roy.yearID year, lgID league, schoolID
  FROM AwardsPlayers roy
  LEFT JOIN 
    (SELECT *  --Final College Attended
      FROM CollegePlaying
      GROUP BY playerID
      HAVING yearID == max(YearID)
    ) c
 ON c.playerID = roy.playerID
 WHERE awardID LIKE "Rookie%"
# Query to find last college for ROY
lahman %>% tbl(sql(
'
SELECT roy.playerID playerID, roy.yearID year, lgID league, schoolID
  FROM AwardsPlayers roy
  LEFT JOIN 
    (SELECT *  --Last College Attended
      FROM CollegePlaying
      GROUP BY playerID
      HAVING yearID == max(YearID)
    ) c
 ON c.playerID = roy.playerID
 WHERE awardID LIKE "Rookie%"
'
))
## # Source:   SQL [?? x 4]
## # Database: sqlite 3.19.3
## #   [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//Rtmp8MUVms/lahman.sqlite]
##     playerID  year league schoolID
##        <chr> <int>  <chr>    <chr>
##  1 robinja02  1947     ML     ucla
##  2  darkal01  1948     ML    ulala
##  3 sievero01  1949     AL     <NA>
##  4 newcodo01  1949     NL     <NA>
##  5 dropowa01  1950     AL    uconn
##  6 jethrsa01  1950     NL     <NA>
##  7 mcdougi01  1951     AL     <NA>
##  8  mayswi01  1951     NL     <NA>
##  9  byrdha01  1952     AL     <NA>
## 10 blackjo02  1952     NL     <NA>
## # ... with more rows

To complete the example, we modify the query to display which schools have produced the most ROY awards in total.

SELECT schoolID, count(c.playerID) as ROY_awards
  FROM AwardsPlayers roy
  LEFT JOIN 
    (SELECT *  --Last College Attended
      FROM CollegePlaying
      GROUP BY playerID
      HAVING yearID == max(YearID)
    ) c
 ON c.playerID = roy.playerID
 WHERE awardID LIKE "Rookie%"
   AND schoolID IS NOT NULL
 GROUP BY schoolID
 HAVING ROY_awards > 1
 ORDER BY -ROY_awards
# Which schools have produced the most ROY?
lahman %>% tbl(sql(
'
SELECT schoolID, count(c.playerID) as ROY_awards
  FROM AwardsPlayers roy
  LEFT JOIN 
    (SELECT *  --Last College Attended
      FROM CollegePlaying
      GROUP BY playerID
      HAVING yearID == max(yearID)
    ) c
 ON c.playerID = roy.playerID
 WHERE awardID LIKE "Rookie%"
   AND schoolID IS NOT NULL
 GROUP BY schoolID
 HAVING ROY_awards > 1
 ORDER BY -ROY_awards
'
)) %>% collect()
## # A tibble: 9 x 2
##    schoolID ROY_awards
##       <chr>      <int>
## 1 arizonast          4
## 2  michigan          3
## 3      ucla          3
## 4       usc          3
## 5 floridaam          2
## 6 longbeach          2
## 7  oklahoma          2
## 8   texasam          2
## 9    wagner          2

Using show_query()

If you are new to SQL but comfortable with dplyr you may find show_query() useful for understanding how dplyr statements are translated to SQL.

# Players with more than 5 years at a single school.
lahman %>% 
  tbl('CollegePlaying') %>% 
  filter(yearID > 1900) %>%
  group_by(playerID, schoolID) %>%
  summarize(n_year = n()) %>%
  filter(n_year>5) %>%
  show_query()
## <SQL>
## SELECT *
## FROM (SELECT `playerID`, `schoolID`, COUNT() AS `n_year`
## FROM (SELECT *
## FROM `CollegePlaying`
## WHERE (`yearID` > 1900.0))
## GROUP BY `playerID`, `schoolID`)
## WHERE (`n_year` > 5.0)

Exercises

Resources