R frontend for Spark


R on Spark

SparkR is an R package that provides a lightweight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster.


RDDs as Distributed Lists

SparkR exposes the RDD API of Spark as distributed lists in R. For example, we can read an input file from HDFS and process every line using lapply on an RDD.

  sc <- sparkR.init("local")
  lines <- textFile(sc, "hdfs://data.txt")
  wordsPerLine <- lapply(lines, function(line) { length(unlist(strsplit(line, " "))) })

In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey and collect.
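As a sketch of how these operations compose, the classic word count can be written with flatMap, reduceByKey and collect (assuming the `sc` and `lines` objects from the example above, and the SparkR-pkg RDD API):

```r
# Split each line into words, emit (word, 1) pairs, and sum per key.
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
pairs <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(pairs, "+", 2L)  # 2L = number of partitions
collect(counts)                        # list of (word, count) pairs
```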

Serializing closures

SparkR automatically serializes the necessary variables to execute a function on the cluster. For example, if you use some global variables in a function passed to lapply, SparkR will automatically capture these variables and copy them to the cluster. An example of using a random weight vector to initialize a matrix is shown below.

   lines <- textFile(sc, "hdfs://data.txt")
   initialWeights <- runif(n=D, min = -1, max = 1)
   createMatrix <- function(line) {
     as.numeric(unlist(strsplit(line, " "))) %*% t(initialWeights)
   }
   # initialWeights is automatically serialized
   matrixRDD <- lapply(lines, createMatrix)

Using existing R packages

SparkR also allows easy use of existing R packages inside closures. The includePackage command can be used to indicate packages that should be loaded before every closure is executed on the cluster. For example, to use the Matrix package in a closure applied on each partition of an RDD, you could run

  generateSparse <- function(x) {
    # Use the sparseMatrix function from the Matrix package
    sparseMatrix(i=c(1, 2, 3), j=c(1, 2, 3), x=c(1, 2, 3))
  }
  includePackage(sc, Matrix)
  sparseMat <- lapplyPartition(rdd, generateSparse)

Installing SparkR

SparkR requires Scala 2.10 and Spark version >= 1.1.0 (support for Spark 1.3.0 coming soon) and depends on the R package testthat (only required for running unit tests).

For the latest information, please refer to the README.

If you wish to try out SparkR, you can use install_github from the devtools package to directly install the package.

  install_github("amplab-extras/SparkR-pkg", subdir="pkg")

If you wish to clone the repository and build from source, you can use the build script included in the repository to build the package locally (see the README for details).


Running sparkR

If you have installed it directly from GitHub, you can include the SparkR package and then initialize a SparkContext. For example, to run with a local Spark master you can launch R and then run

  sc <- sparkR.init(master="local")
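For a quick end-to-end check (a sketch, assuming the parallelize and reduce functions of the RDD API described above), you can create an RDD from a local vector and reduce it:

```r
library(SparkR)
sc <- sparkR.init(master="local")
rdd <- parallelize(sc, 1:100, 2)  # distribute 1..100 over 2 partitions
total <- reduce(rdd, "+")         # should equal sum(1:100)
```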

If you have cloned and built SparkR, you can start using it by launching the SparkR shell with

  ./sparkR
SparkR also comes with several sample programs in the examples directory. To run one of them, use ./sparkR <filename> <args>. For example:

  ./sparkR examples/pi.R local[2]

You can also run the unit tests for SparkR by running the test script included in the repository (see the README for details).


SparkR package API documentation

Documentation for package ‘SparkR’

Report Issues and Feedback

You can report issues and provide feedback through the project's GitHub issues page. For more information, please refer to the README.