A relational database is a set of rectangular data frames called tables linked by keys relating one table to another. Software implementations of such data structures are known as relational database management systems (RDBMS). Most RDBMS use structured query language or SQL (“sequel” or “S-Q-L”) to modify or search the relational database.
SQL provides a syntax for interfacing with relational data. It is largely a declarative language in that we use SQL to specify what we wish to accomplish, leaving the how to the RDBMS. While there are standards for SQL implementations put out by the International Organization for Standardization (ISO) and the American National Standards Institute (ANSI), there are several open source and commercial implementations that each have unique features.
I will try to focus on the commonalities, but will be using an SQLite engine in R for providing examples. One unique feature of SQLite
is that it does not follow the client-server model. In this model, a physical computer storing the data and executing queries within the RDMBS, the server, is separate from the machine requesting the queries known as the client.
The client-server model is popular in business, health care, and other domains as it allows security and monitoring of how the data is queried. It is also popular for many large open data projects (i.e ensemble) where it is beneficial for data to be centrally maintained and frequently accessed on the fly.
In order to write SQL queries within R I will be using the package dbplyr which provides an SQL back end to dplyr
functions, but also allows SQL queries to be parsed as strings.
For our examples, we will use the Lahman
package from R which contains historical baseball data from 1876-2018.
# Packages
library(tidyverse)
library(dbplyr)
library(Lahman)
# Create a local SQLite database of the Lahman data
lahman = lahman_sqlite()
# Query the batting table
lahman %>% tbl("BATTING")
## # Source: table<BATTING> [?? x 22]
## # Database: sqlite 3.33.0
## # [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//RtmpCJciUo/lahman.sqlite]
## playerID yearID stint teamID lgID G AB R H X2B X3B HR
## <chr> <int> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 abercda… 1871 1 TRO NA 1 4 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0
## 3 allisar… 1871 1 CL1 NA 29 137 28 40 4 5 0
## 4 allisdo… 1871 1 WS3 NA 27 133 28 44 10 2 2
## 5 ansonca… 1871 1 RC1 NA 25 120 29 39 11 3 0
## 6 armstbo… 1871 1 FW1 NA 12 49 9 11 2 1 0
## 7 barkeal… 1871 1 RC1 NA 1 4 0 1 0 0 0
## 8 barnero… 1871 1 BS1 NA 31 157 66 63 10 9 0
## 9 barrebi… 1871 1 FW1 NA 1 5 1 1 1 0 0
## 10 barrofr… 1871 1 BS1 NA 18 86 13 13 2 1 0
## # … with more rows, and 10 more variables: RBI <int>, SB <int>, CS <int>,
## # BB <int>, SO <int>, IBB <int>, HBP <int>, SH <int>, SF <int>, GIDP <int>
## # A tibble: 24 x 5
## file class nobs nvar title
## <chr> <chr> <dbl> <dbl> <chr>
## 1 AllstarFull data.frame 4993 8 AllstarFull table
## 2 Appearances data.frame 99466 21 Appearances table
## 3 AwardsManagers data.frame 171 6 AwardsManagers table
## 4 AwardsPlayers data.frame 6026 6 AwardsPlayers table
## 5 AwardsShareManagers data.frame 401 7 AwardsShareManagers table
## 6 AwardsSharePlayers data.frame 6705 7 AwardsSharePlayers table
## 7 Batting data.frame 99846 22 Batting table
## 8 BattingPost data.frame 11294 22 BattingPost table
## 9 CollegePlaying data.frame 17350 3 CollegePlaying table
## 10 Fielding data.frame 167938 18 Fielding table
## 11 FieldingOF data.frame 12028 6 FieldingOF table
## 12 FieldingPost data.frame 11924 17 FieldingPost data
## 13 HallOfFame data.frame 4088 9 Hall of Fame Voting Data
## 14 Managers data.frame 3370 10 Managers table
## 15 ManagersHalf data.frame 93 10 ManagersHalf table
## 16 Master data.frame 18589 26 Master table
## 17 Pitching data.frame 43330 30 Pitching table
## 18 PitchingPost data.frame 4945 30 PitchingPost table
## 19 Salaries data.frame 24758 5 Salaries table
## 20 Schools data.frame 1207 5 Schools table
## 21 SeriesPost data.frame 298 9 SeriesPost table
## 22 Teams data.frame 2775 48 Teams table
## 23 TeamsFranchises data.frame 120 4 TeamFranchises table
## 24 TeamsHalf data.frame 52 10 TeamsHalf table
The basic structure of a SQL query contains a SELECT
statement indicating which columns are desired and a FROM
statement explaining where to find them. You can indicate a subset of rows using a WHERE
clause.
CREATE TABLE my_table AS
SELECT col1, col2, col3
FROM some_existing_table
WHERE col1 IN ('A', 'B', 'C')
In the model above, CREATE TABLE
will only work with a client-server model if you have permissions to write to memory. Use a wild card *
to select all columns in a table:
## # A tibble: 5 x 26
## playerID birthYear birthMonth birthDay birthCountry birthState birthCity
## <chr> <int> <int> <int> <chr> <chr> <chr>
## 1 aardsda… 1981 12 27 USA CO Denver
## 2 aaronha… 1934 2 5 USA AL Mobile
## 3 aaronto… 1939 8 5 USA AL Mobile
## 4 aasedo01 1954 9 8 USA CA Orange
## 5 abadan01 1972 8 25 USA FL Palm Bea…
## # … with 19 more variables: deathYear <int>, deathMonth <int>, deathDay <int>,
## # deathCountry <chr>, deathState <chr>, deathCity <chr>, nameFirst <chr>,
## # nameLast <chr>, nameGiven <chr>, weight <int>, height <int>, bats <chr>,
## # throws <chr>, debut <chr>, finalGame <chr>, retroID <chr>, bbrefID <chr>,
## # deathDate <dbl>, birthDate <dbl>
Above we use a LIMIT
clause to grab just the first 5 rows of our table. This is good practice when developing new queries as it prevents large wait times only to discover a bug.
Here we will explore the limiting clause WHERE
in greater detail. Most of what we discuss in this section will also work with HAVING
which operates on summary statistics rather than individual rows.
We can use Boolean comparisons in a WHERE
clause as shown in the example below. We find all player-seasons since 2000 in which the player was credited with an RBI 100 or more times.
Here is our query:
And here it is in dbplyr
:
## Get all 100+ RBI seasons since 2000
rbi100 =
lahman %>%
tbl(sql('
SELECT playerID, yearID, RBI
FROM batting
WHERE RBI >= 100 AND yearID >= 2000
'))
print(rbi100)
## # Source: SQL [?? x 3]
## # Database: sqlite 3.33.0
## # [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//RtmpCJciUo/lahman.sqlite]
## playerID yearID RBI
## <chr> <int> <int>
## 1 aloumo01 2000 114
## 2 anderga01 2000 117
## 3 bagweje01 2000 132
## 4 batisto01 2000 114
## 5 belleal01 2000 103
## 6 bondsba01 2000 106
## 7 cirilje01 2000 115
## 8 delgaca01 2000 137
## 9 dyeje01 2000 118
## 10 edmonji01 2000 108
## # … with more rows
Note that the code above creates and tests the query, but does not execute it over the entire table. To do that, we use collect()
.
## # A tibble: 518 x 3
## playerID yearID RBI
## <chr> <int> <int>
## 1 aloumo01 2000 114
## 2 anderga01 2000 117
## 3 bagweje01 2000 132
## 4 batisto01 2000 114
## 5 belleal01 2000 103
## 6 bondsba01 2000 106
## 7 cirilje01 2000 115
## 8 delgaca01 2000 137
## 9 dyeje01 2000 118
## 10 edmonji01 2000 108
## # … with 508 more rows
To select on a column by testing against a set of fixed values use IN
.
SELECT nameGiven as Given, nameLast Last, birthYear AS born --Note equivalent ways to rename columns
FROM master
WHERE nameLast IN ("Alou", "Griffey")
## Find all players with last name Alou or Griffey
lahman %>%
tbl(sql('
SELECT nameGiven as Given, nameLast Last, birthYear AS born
FROM master
WHERE nameLast IN ("Alou", "Griffey")
')) %>%
collect()
## # A tibble: 6 x 3
## Given Last born
## <chr> <chr> <int>
## 1 Felipe Rojas Alou 1935
## 2 Jesus Maria Rojas Alou 1942
## 3 Mateo Rojas Alou 1938
## 4 Moises Rojas Alou 1966
## 5 George Kenneth Griffey 1950
## 6 George Kenneth Griffey 1969
Use a LIKE
statement with a WHERE
clause to get partial string matching. You can use %
to match any sub-string.
SELECT nameGiven as Given, nameLast Last, birthYear AS born
FROM master
WHERE nameLast LIKE "%riff%"
## Find all players with last name containing a 'riff' sub-string
lahman %>%
tbl(sql('
SELECT nameGiven as Given, nameLast Last, birthYear AS born
FROM master
WHERE nameLast LIKE "%riff%"
')) %>%
collect()
## # A tibble: 26 x 3
## Given Last born
## <chr> <chr> <int>
## 1 Arthur Joseph Griffin 1988
## 2 Alfredo Claudino Griffin 1957
## 3 Bartholomew Joseph Griffith 1896
## 4 Clark Calvin Griffith 1869
## 5 Robert Derrell Griffith 1943
## 6 Douglas Lee Griffin 1947
## 7 Frank Wesley Griffith 1872
## 8 James Linton Griffin 1886
## 9 Ivy Moore Griffin 1896
## 10 Jeremy Richard Griffiths 1978
## # … with 16 more rows
Most SQL implementations also have a REGEXP
or REGEXLIKE
function that works with regular expressions, but SQLite
requires a user defined regex()
for its use so we skip it here.
Limiting WHERE
clauses can be combined using AND
and OR
. Statements can be negated using NOT
.
SELECT nameGiven as Given, nameLast Last, birthYear AS born, birthCountry as Place
FROM master
WHERE birthCountry == "P.R." AND birthYear LIKE "199%"
## Find all players born in Puerto Rico during the 1990's
lahman %>%
tbl(sql('
SELECT nameGiven as Given, nameLast Last, birthYear AS born, birthCountry as Place
FROM master
WHERE birthCountry == "P.R." AND birthYear LIKE "199%"
')) %>%
collect()
## # A tibble: 28 x 4
## Given Last born Place
## <chr> <chr> <int> <chr>
## 1 Ednel Javier Baez 1992 P.R.
## 2 Jose Orlando Berrios 1994 P.R.
## 3 Victor Manuel Caratini 1993 P.R.
## 4 Willi Rafael Castro 1997 P.R.
## 5 Alexander Claudio 1992 P.R.
## 6 Joseph Colon 1990 P.R.
## 7 Carlos Javier Correa 1994 P.R.
## 8 Noel A. Cuevas 1991 P.R.
## 9 Jose Eugenio De Leon 1992 P.R.
## 10 Edwin Orlando Diaz 1994 P.R.
## # … with 18 more rows
## Find all players born in Alaska (AK) or Hawaii (HI)
lahman %>%
tbl(sql('
SELECT nameGiven as Given, nameLast Last, birthYear AS born, birthCountry as Place
FROM master
WHERE birthState == "HI" OR birthState == "AK"
')) %>%
collect()
## # A tibble: 58 x 4
## Given Last born Place
## <chr> <chr> <int> <chr>
## 1 Benny Peter Agbayani 1971 USA
## 2 Dustin Kamakana Mai Ku'u Makualani Antolin 1989 USA
## 3 Anthony Lee Barnette 1983 USA
## 4 Chad Robert Bentz 1980 USA
## 5 Douglas Edmund Capilla 1952 USA
## 6 Shawn Anthony Chacon 1977 USA
## 7 Steven Montague Cooke 1970 USA
## 8 Aaron R. Cunningham 1986 USA
## 9 Ronald Maurice Darling 1960 USA
## 10 Joseph DeSa 1959 USA
## # … with 48 more rows
Use an ORDER BY
clause with a comma separated list of columns to arrange the table.
## Get all 100+ RBI seasons since 2010, ordered
lahman %>%
tbl(sql('
SELECT playerID, yearID, RBI
FROM batting
WHERE RBI >= 100 AND yearID >= 2010
ORDER BY yearID, -RBI
')) %>%
collect()
## # A tibble: 174 x 3
## playerID yearID RBI
## <chr> <int> <int>
## 1 cabremi01 2010 126
## 2 rodrial01 2010 125
## 3 bautijo02 2010 124
## 4 pujolal01 2010 118
## 5 gonzaca01 2010 117
## 6 guerrvl01 2010 115
## 7 vottojo01 2010 113
## 8 youngde03 2010 112
## 9 konerpa01 2010 111
## 10 canoro01 2010 109
## # … with 164 more rows
We can perform aggregations such as sums, means, and counts similar to summarize()
from dplyr
.
We can use a GROUP BY
clause for the split-apply-combine pattern.
Here we find the players with the most total RBI since 2010.
SELECT playerID, sum(RBI) as RBI_total
FROM batting
WHERE yearID >= 2010
GROUP BY playerID
ORDER BY -RBI_total
## Count total RBIs since 2010 by player
lahman %>%
tbl(sql('
SELECT playerID, sum(RBI) as RBI_total
FROM batting
WHERE yearID >= 2010
GROUP BY playerID
ORDER BY -RBI_total
')) %>%
collect()
## # A tibble: 3,566 x 2
## playerID RBI_total
## <chr> <int>
## 1 pujolal01 963
## 2 cruzne02 961
## 3 encared01 956
## 4 cabremi01 941
## 5 canoro01 878
## 6 bruceja01 824
## 7 longoev01 817
## 8 braunry02 811
## 9 goldspa01 807
## 10 jonesad01 806
## # … with 3,556 more rows
The operator defining a limiting clause on an aggregate variable is HAVING
. It is essentially like WHERE
except for operating on summary statistics rather than individual rows. In other words, HAVING
refers to the output table specified in SELECT
rather than the input table(s) specified using FROM
.
In the query below, observe that the HAVING
clause comes after the GROUP BY
but before the ORDER BY
.
SELECT playerID, sum(RBI) as RBI_total
FROM batting
WHERE yearID >= 2010
GROUP BY playerID
HAVING RBI_total >= 500
ORDER BY -RBI_total
## Players with 500+ RBIs since 2010
lahman %>%
tbl(sql('
SELECT playerID, sum(RBI) as RBI_total
FROM batting
WHERE yearID >= 2010
GROUP BY playerID
HAVING RBI_total >= 500
ORDER BY -RBI_total
')) %>%
collect()
## # A tibble: 96 x 2
## playerID RBI_total
## <chr> <int>
## 1 pujolal01 963
## 2 cruzne02 961
## 3 encared01 956
## 4 cabremi01 941
## 5 canoro01 878
## 6 bruceja01 824
## 7 longoev01 817
## 8 braunry02 811
## 9 goldspa01 807
## 10 jonesad01 806
## # … with 86 more rows
So far we have discussed working with single tables only. The SQL term for merging data from two or more tables is a ‘join’. All joins are based on the idea of equating rows that match on one or more variables. Below we will discuss the following types of joins:
What if we wanted to supplement our earlier table showing players with 500+ RBI since 2000 with information about those players? We could use an inner join of our RBI table with the ‘Master’ table to accomplish this.
SELECT m.nameFirst First, m.nameLast Last, sum(RBI) as RBI_TOTAL
FROM batting b
INNER JOIN master m ON b.playerID = m.playerID
WHERE yearID >= 2010
GROUP BY b.playerID
HAVING RBI_total >= 500
ORDER BY -RBI_total
lahman %>% tbl(sql(
'
SELECT m.nameFirst First, m.nameLast Last, sum(RBI) as RBI_TOTAL
FROM batting b
INNER JOIN master m ON b.playerID = m.playerID
WHERE yearID >= 2010
GROUP BY b.playerID
HAVING RBI_total >= 500
ORDER BY -RBI_total
'
))
## # Source: SQL [?? x 3]
## # Database: sqlite 3.33.0
## # [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//RtmpCJciUo/lahman.sqlite]
## First Last RBI_TOTAL
## <chr> <chr> <int>
## 1 Albert Pujols 963
## 2 Nelson Cruz 961
## 3 Edwin Encarnacion 956
## 4 Miguel Cabrera 941
## 5 Robinson Cano 878
## 6 Jay Bruce 824
## 7 Evan Longoria 817
## 8 Ryan Braun 811
## 9 Paul Goldschmidt 807
## 10 Adam Jones 806
## # … with more rows
In this example, an inner join suffices because each playerID
in the batting
table is linked to a playerID
in the master
table.
In a left join – sometimes called a left outer join – we add columns from the right table to the left table when matching rows are found. Rows from the left table with no matches from the right table are retained with columns from the right table filled in as NULL
(i.e. NA
). When there are multiple matches of a row from the left table to rows in the right table, these each become a row in the new table.
A right join is equivalent to a left join with the exception that the roles between right and left are reversed. My personal preference is to generally structure queries as left joins for simplicity.
Left joins are particularly useful when the information in the right table is only applicable to a subset of the rows from the left table. As an example, suppose we would like to know which US colleges and universities have produced the most “Rookie of the Year Awards” given to the best debuting player(s) each season.
To get started, we first test a query to find the last college attended.
-- Last college attended
SELECT *
FROM CollegePlaying
GROUP BY playerID
HAVING yearID == max(YearID)
# Query to find last college atttended
lahman %>% tbl(sql(
'
SELECT *
FROM CollegePlaying
GROUP BY playerID
HAVING yearID == max(YearID)
'
))
## # Source: SQL [?? x 3]
## # Database: sqlite 3.33.0
## # [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//RtmpCJciUo/lahman.sqlite]
## playerID schoolID yearID
## <chr> <chr> <int>
## 1 aardsda01 rice 2003
## 2 abadan01 gamiddl 1993
## 3 abbeybe01 vermont 1892
## 4 abbotje01 kentucky 1994
## 5 abbotji01 michigan 1988
## 6 abbotky01 longbeach 1989
## 7 abercre01 flgateway 2000
## 8 abernte01 elon 1941
## 9 ablesha01 swesterntx 1904
## 10 accarje01 illinoisst 2003
## # … with more rows
Now, we find all distinct awards in the AwardPlayers
table.
# Distinct Player Awards
lahman %>% tbl(sql(
'
SELECT distinct(awardID)
FROM AwardsPlayers
'
)) %>%
collect() %>%
print(n=30)
## # A tibble: 29 x 1
## awardID
## <chr>
## 1 ALCS MVP
## 2 All-Star Game MVP
## 3 Babe Ruth Award
## 4 Baseball Magazine All-Star
## 5 Branch Rickey Award
## 6 Comeback Player of the Year
## 7 Cy Young Award
## 8 Gold Glove
## 9 Hank Aaron Award
## 10 Hutch Award
## 11 Lou Gehrig Memorial Award
## 12 Most Valuable Player
## 13 NLCS MVP
## 14 Outstanding DH Award
## 15 Pitching Triple Crown
## 16 Reliever of the Year Award
## 17 Roberto Clemente Award
## 18 Rolaids Relief Man Award
## 19 Rookie of the Year
## 20 Silver Slugger
## 21 TSN All-Star
## 22 TSN Fireman of the Year
## 23 TSN Guide MVP
## 24 TSN Major League Player of the Year
## 25 TSN Pitcher of the Year
## 26 TSN Player of the Year
## 27 TSN Reliever of the Year
## 28 Triple Crown
## 29 World Series MVP
Next we test a query for finding all Rookie of the Year Awards.
# Query to find Rookie of the Year Awards
lahman %>% tbl(sql(
'
SELECT *
FROM AwardsPlayers
WHERE awardID LIKE "Rookie%"
'
))
## # Source: SQL [?? x 6]
## # Database: sqlite 3.33.0
## # [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//RtmpCJciUo/lahman.sqlite]
## playerID awardID yearID lgID tie notes
## <chr> <chr> <int> <chr> <chr> <chr>
## 1 robinja02 Rookie of the Year 1947 ML <NA> <NA>
## 2 darkal01 Rookie of the Year 1948 ML <NA> <NA>
## 3 sievero01 Rookie of the Year 1949 AL <NA> <NA>
## 4 newcodo01 Rookie of the Year 1949 NL <NA> <NA>
## 5 dropowa01 Rookie of the Year 1950 AL <NA> <NA>
## 6 jethrsa01 Rookie of the Year 1950 NL <NA> <NA>
## 7 mcdougi01 Rookie of the Year 1951 AL <NA> <NA>
## 8 mayswi01 Rookie of the Year 1951 NL <NA> <NA>
## 9 byrdha01 Rookie of the Year 1952 AL <NA> <NA>
## 10 blackjo02 Rookie of the Year 1952 NL <NA> <NA>
## # … with more rows
Finally, we use a left join of the tables for ROY awards and last college attended to match winners to their schools. We need a left join as many of the winners may never have played collegiate baseball.
SELECT roy.playerID playerID, roy.yearID year, lgID league, schoolID
FROM AwardsPlayers roy
LEFT JOIN
(SELECT * --Final College Attended
FROM CollegePlaying
GROUP BY playerID
HAVING yearID == max(YearID)
) c
ON c.playerID = roy.playerID
WHERE awardID LIKE "Rookie%"
# Query to find last college for ROY
lahman %>% tbl(sql(
'
SELECT roy.playerID playerID, roy.yearID year, lgID league, schoolID
FROM AwardsPlayers roy
LEFT JOIN
(SELECT * --Last College Attended
FROM CollegePlaying
GROUP BY playerID
HAVING yearID == max(YearID)
) c
ON c.playerID = roy.playerID
WHERE awardID LIKE "Rookie%"
'
))
## # Source: SQL [?? x 4]
## # Database: sqlite 3.33.0
## # [/var/folders/t6/m1sbk4dn10181ns0jx_ss0s00000gn/T//RtmpCJciUo/lahman.sqlite]
## playerID year league schoolID
## <chr> <int> <chr> <chr>
## 1 robinja02 1947 ML ucla
## 2 darkal01 1948 ML ulala
## 3 sievero01 1949 AL <NA>
## 4 newcodo01 1949 NL <NA>
## 5 dropowa01 1950 AL uconn
## 6 jethrsa01 1950 NL <NA>
## 7 mcdougi01 1951 AL <NA>
## 8 mayswi01 1951 NL <NA>
## 9 byrdha01 1952 AL <NA>
## 10 blackjo02 1952 NL <NA>
## # … with more rows
To complete the example, we modify the query to display which schools have produced the most ROY awards in total.
SELECT schoolID, count(c.playerID) as ROY_awards
FROM AwardsPlayers roy
LEFT JOIN
(SELECT * --Last College Attended
FROM CollegePlaying
GROUP BY playerID
HAVING yearID == max(YearID)
) c
ON c.playerID = roy.playerID
WHERE awardID LIKE "Rookie%"
AND schoolID IS NOT NULL
GROUP BY schoolID
HAVING ROY_awards > 1
ORDER BY -ROY_awards
# Which schools have produced the most ROY?
lahman %>% tbl(sql(
'
SELECT schoolID, count(c.playerID) as ROY_awards
FROM AwardsPlayers roy
LEFT JOIN
(SELECT * --Last College Attended
FROM CollegePlaying
GROUP BY playerID
HAVING yearID == max(yearID)
) c
ON c.playerID = roy.playerID
WHERE awardID LIKE "Rookie%"
AND schoolID IS NOT NULL
GROUP BY schoolID
HAVING ROY_awards > 1
ORDER BY -ROY_awards
'
)) %>% collect()
## # A tibble: 9 x 2
## schoolID ROY_awards
## <chr> <int>
## 1 arizonast 4
## 2 michigan 3
## 3 ucla 3
## 4 usc 3
## 5 floridaam 2
## 6 longbeach 2
## 7 oklahoma 2
## 8 texasam 2
## 9 wagner 2
show_query()
If you are new to SQL but comfortable with dplyr
you may find show_query()
useful for understanding how dplyr statements are translated to SQL.
# Players with more than 5 years at a single school.
lahman %>%
tbl('CollegePlaying') %>%
filter(yearID > 1900) %>%
group_by(playerID, schoolID) %>%
summarize(n_year = n()) %>%
filter(n_year>5) %>%
show_query()
## <SQL>
## SELECT *
## FROM (SELECT `playerID`, `schoolID`, COUNT() AS `n_year`
## FROM (SELECT *
## FROM `CollegePlaying`
## WHERE (`yearID` > 1900.0))
## GROUP BY `playerID`, `schoolID`)
## WHERE (`n_year` > 5.0)