Distributed computing describes the use of many computing nodes to process large amounts of data in parallel. In contrast to the parallel programming models discussed previously, the data itself is distributed in advance across the compute nodes, usually redundantly. A cluster manager then distributes computational tasks among the worker nodes and aggregates the results.
Two prominent frameworks for distributed computing are Hadoop and Spark. Hadoop makes heavy use of disk storage for working with large data sets, while Spark focuses on computations performed in working memory (RAM).
MapReduce refers to both a programming model and associated implementations for distributed computing. It resembles the familiar “split-apply-combine” pattern but is designed for processing large amounts of distributed data in parallel.
As the name suggests, it includes both map and reduce steps.
In a map step each element of a collection (think rows or subsets of data) is transformed, typically producing a <key, (value)> pair.
A reduce step aggregates the outputs from the map step that share the same key.
In most MapReduce implementations, there is also a shuffle step in which the outputs from the map step are reorganized by key. This would typically involve communication over a network.
This diagram illustrates this setup.
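To see the pattern end to end, here is a toy sketch in base R; the "documents" below are invented for illustration, and the shuffle is mimicked with split().

docs <- list(c("spark", "hadoop", "spark"), c("hadoop", "hdfs"))

# map: emit a <word, 1> pair for each word in each document
mapped <- unlist(lapply(docs, function(d) setNames(rep(1, length(d)), d)))

# shuffle: group the emitted values by key (the word)
shuffled <- split(unname(mapped), names(mapped))

# reduce: sum the values sharing each key
counts <- vapply(shuffled, sum, numeric(1))
counts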
As a more concrete business example, suppose a large online retail company would like to compute its total sales and average sale price by state, based on shipping address. Its sales data, however, is not organized in a central database but is distributed across a collection of databases determined by the location of the server farm that handled each order.
A MapReduce solution to this problem starts with an initial set of keys k1 describing each of the databases. Map functions take each key in k1 and return a set of new key-value pairs, <state, (n_sales, total_sales)>. Here state represents the new key (k2). These new pairs are then shuffled so that all pairs with the same k2 key (state) are collocated on a node. The reduce function aggregates these values as they arrive by summing n_sales and total_sales.
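A rough sketch of this logic in R follows; the per-database data frames are invented for illustration, and in a real deployment the map function would run on the node holding each database rather than in a single R session.

# stand-ins for the per-server-farm sales databases (invented data)
databases <- list(
  db1 = data.frame(state = c("MI", "OH", "MI"), sale = c(10, 25, 5)),
  db2 = data.frame(state = c("OH", "CA"), sale = c(40, 15))
)

# map: for each database, emit <state, (n_sales, total_sales)> pairs
map_fun <- function(db) {
  lapply(split(db$sale, db$state), function(s) c(n_sales = length(s), total_sales = sum(s)))
}
mapped <- unlist(lapply(databases, map_fun), recursive = FALSE)

# shuffle: regroup the pairs by their new key (the state)
state_key <- sub("^db[0-9]+\\.", "", names(mapped))
shuffled <- split(mapped, state_key)

# reduce: sum n_sales and total_sales within each state, then form the averages
reduced <- lapply(shuffled, function(pairs) Reduce(`+`, pairs))
result <- data.frame(
  state = names(reduced),
  n_sales = sapply(reduced, `[[`, "n_sales"),
  total_sales = sapply(reduced, `[[`, "total_sales")
)
result$avg_sale <- result$total_sales / result$n_sales
result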
Hadoop is software designed to enable processing of very large amounts of data using MapReduce by distributing both the data and the processing across many (potentially thousands of) servers.
Here is the description from the Hadoop manual (man hadoop):
DESCRIPTION
       Here's what makes Hadoop especially useful:

       Scalable
              Hadoop can reliably store and process petabytes.

       Economical
              It distributes the data and processing across clusters of
              commonly available computers. These clusters can number into
              the thousands of nodes.

       Efficient
              By distributing the data, Hadoop can process it in parallel
              on the nodes where the data is located. This makes it
              extremely rapid.

       Reliable
              Hadoop automatically maintains multiple copies of data and
              automatically redeploys computing tasks based on failures.
Hadoop consists of two main components: the HDFS file system and a resource manager called YARN.
HDFS is an acronym for the Hadoop Distributed File System, which is written in Java. It is a distributed file system designed to be fault tolerant. It follows a “write-once-read-many” model which makes it difficult to modify files once written. This is by design, as it allows assumptions to be made about the data and lets coherency be maintained across replicated data sets.
It is also partially based on the principle that “moving computation is cheaper than moving data”. Data sets are split into “blocks” which are stored locally and redundantly on “DataNodes”, typically one per node in a cluster. The location and organization of these blocks is maintained by a master “NameNode”.
This file system is structured like a Linux file system, with a folder hierarchy, but it is not a POSIX file system: it relaxes several requirements of that standard to enable faster read and write operations (“throughput”) and to facilitate streaming data applications.
However, most common Linux file system commands have implementations within the hdfs dfs command.
For instance, the following command would create a directory named with my uniqname within the HDFS /user/ directory.
hdfs dfs -mkdir /user/jbhender
A local data file example.csv could then be placed into this directory using -put.
hdfs dfs -put ./example.csv example.csv
List the files that have been put there:
hdfs dfs -ls /user/jbhender/
And, after some processing, a results file can be retrieved back to the local directory using -get.
hdfs dfs -get results.csv results.csv
Learn more about the HDFS file system shell at the Apache website.
YARN is the Hadoop resource manager; it plays a role similar to that of the Moab/TORQUE/PBS job scheduling system used on Flux.
Hadoop is designed to implement the MapReduce abstraction for data processing. Under the MapReduce paradigm, data processing tasks are split into multiple smaller jobs which can be executed in parallel using local subsets of the data – this is the map step – with the results then aggregated in a “reduce” step.
Apache Spark is another popular framework for computing on a distributed cluster.
As with Hadoop, Spark is built around the core concept of distributed data. The fundamental idea within Spark is the ‘Resilient Distributed Dataset’ (RDD). RDDs are immutable, meaning they cannot be modified after creation. Instead, we work with RDDs by applying transformations (such as filtering) that create new RDDs, or by performing actions (such as counting or summarizing) that return results.
By combining such operations, more complex statistical tasks such as fitting a GLM can be carried out. A collection of such tasks is available in the Spark machine learning library MLlib.
Since the release of Spark 2.0, there is a DataFrame API built on top of RDDs, which offers a more familiar table structure, as in R or SQL.
The native language of Spark is Scala, which resembles Java. However, there are interfaces for R and Python, making Spark more accessible. The PySpark API for Python is more mature than the R API.
A limited R interface called sparklyr is available from RStudio. A more complete but less familiar interface is SparkR.
We can install Spark and sparklyr as follows.
install.packages("sparklyr")
library(sparklyr)
spark_install(version = "2.1.0")
We will look at a simple example from sparklyr, available here.
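As a rough sketch of what such an example might look like (assuming a local Spark installation and using the built-in mtcars data; the columns and model below are purely illustrative), a sparklyr session could proceed as follows:

library(sparklyr)
library(dplyr)

# connect to a local Spark instance (a cluster master URL could be used instead)
sc <- spark_connect(master = "local")

# copy an R data frame into Spark as a Spark DataFrame
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed by Spark
mtcars_tbl %>%
  group_by(cyl) %>%
  summarize(n = n(), avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# a regression model fit through Spark's machine learning library (MLlib)
fit <- mtcars_tbl %>% ml_linear_regression(mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)

Note that collect() pulls the (small) summarized result back into R; until that point the computation stays in Spark.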
Learn more about Hadoop at http://hadoop.apache.org.
There are some additional examples on the Flux Hadoop User Guide.