Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. --[Wikipedia][hadoop]
- Use `hdfs dfs` to work with files.
- Use `hdfs dfs -put <local_file> <path/new_file>` to put data into HDFS.
- Use `hdfs dfs -get <hdfs_file> <local_file>` to get data from HDFS.

%%bash
ssh cavium-thunderx.arc-ts.umich.edu
hdfs dfs -ls stats507
hdfs dfs -ls /user/jbhender/stats507
Alternatively, go to `/hadoop-fuse/user/<email>/` and use Linux file system commands without the `hdfs dfs` prefix.

cd /hadoop-fuse/
ls user/jbhender/stats507/
head -5 user/jbhender/stats507/rectangles.csv
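Because the FUSE mount exposes HDFS paths as ordinary files, ordinary file I/O should also work on them. The sketch below mimics `head -5` in Python. The helper name `head` and the demo file are hypothetical; a temporary local file stands in for a path under `/hadoop-fuse/`, which is only available on the cluster.

```python
import os
import tempfile
from itertools import islice


def head(path, n=5):
    """Return the first n lines of a text file, similar to `head -n`."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Stand-in for a file under /hadoop-fuse/user/<email>/ (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("id,width,height\n")
        for i in range(1, 10):
            f.write(f"{i},{i}.0,{i * 2}.0\n")
    first_lines = head(demo_path)

print(first_lines)
```

On the cluster, the same helper could be pointed at a file under the FUSE mount instead of the temporary file.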
When starting `pyspark`, the `--master yarn` option is necessary.

pyspark --master yarn --queue default --num-executors=8 --executor-memory=1g
The `pyspark` shell provides:

- `spark` - an instance of `SparkSession()`,
- `sc` - an instance of `SparkContext()`,
- `sqlContext` - an instance of `SQLContext()`.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
- Use `sc.parallelize()` to distribute the data.
- Use `.repartition()` or `.coalesce()` to redistribute an existing RDD.
- Register a DataFrame as a table with the `sqlContext.registerDataFrameAsTable()` method.
- Alternatively, use the DataFrame's `.registerTempTable()` method, so that tables don't persist across jobs.
- Run queries with `sqlContext.sql()`.
- Use `.show()` to print a DataFrame (e.g. one resulting from a SQL query).
- Use `.collect()` to gather the results into memory.
- Use `.persist()` to save results so they don't need to be recomputed.

`JOIN`s after `FROM`/`WHERE`.

%%sql
SELECT
FROM
WHERE
GROUP BY
HAVING
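To make the clause order concrete, here is a minimal sketch that runs one query with all five clauses. It uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, and the `rectangles` table with its columns and rows is hypothetical, made up for illustration only.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a Spark SQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rectangles (id INTEGER, color TEXT, width REAL, height REAL)"
)
conn.executemany(
    "INSERT INTO rectangles VALUES (?, ?, ?, ?)",
    [
        (1, "red", 2.0, 3.0),
        (2, "red", 1.0, 1.0),
        (3, "blue", 4.0, 5.0),
        (4, "blue", 0.5, 0.5),
        (5, "green", 1.0, 2.0),
    ],
)

query = """
SELECT color, COUNT(*) AS n, SUM(width * height) AS total_area
FROM rectangles
WHERE width >= 1.0                 -- filter rows before grouping
GROUP BY color                     -- one output row per color
HAVING SUM(width * height) > 5     -- filter groups after aggregation
"""

# Sort in Python so the result order is deterministic.
rows = sorted(conn.execute(query).fetchall())
print(rows)
```

Note the difference between the two filters: `WHERE` drops individual rows before they are grouped, while `HAVING` drops whole groups based on an aggregate.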
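- Name the source tables in `FROM`.
- Use `WHERE` to specify a subset of data to include using conditions on existing tables (in `FROM`).
- List the columns or computations to return in `SELECT`.
- Use `GROUP BY` for split-apply-combine operations.

- In `FROM`, use a short name to alias a table.
- In `SELECT`, rename a column or computation using `as`.
- Other uses of `AS`:
  - `CREATE TABLE <name> AS SELECT ...`
  - aliasing a subquery, `(SELECT ...) a`.

- Specify the join condition with `ON`, e.g. `ON a.id = b.id`.
- Stick to `LEFT JOIN` or `INNER JOIN` for consistency.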
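The aliasing and join notes above can be sketched together in one short example. It again uses Python's built-in `sqlite3` module as a stand-in for Spark SQL; the `students`, `scores`, and `mean_scores` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (id INTEGER, name TEXT);
CREATE TABLE scores (id INTEGER, score REAL);
INSERT INTO students VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
INSERT INTO scores VALUES (1, 90.0), (1, 80.0), (2, 70.0);

-- CREATE TABLE <name> AS SELECT ...
CREATE TABLE mean_scores AS
SELECT id, AVG(score) AS mean_score
FROM scores
GROUP BY id;
""")

# INNER JOIN keeps only students that have a matching row in mean_scores.
inner = conn.execute("""
SELECT a.name AS student, b.mean_score
FROM students a
INNER JOIN mean_scores b
ON a.id = b.id
ORDER BY a.id
""").fetchall()

# LEFT JOIN keeps every student; the subquery is aliased as b.
left = conn.execute("""
SELECT a.name AS student, b.mean_score
FROM students a
LEFT JOIN (SELECT id, AVG(score) AS mean_score FROM scores GROUP BY id) b
ON a.id = b.id
ORDER BY a.id
""").fetchall()

print(inner)
print(left)
```

The student without any scores appears only in the `LEFT JOIN` result, with a missing value (`None` in Python) for `mean_score`.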