SparklingPandas

SparklingPandas aims to make it easy to use the distributed computing power of PySpark to scale your data analysis with Pandas. SparklingPandas builds on Spark's DataFrame class to give you a polished, pythonic, and Pandas-like API.

Using

You can install SparklingPandas with pip:

pip install sparklingpandas

Once installed, you can import the package and hack away:

import sparklingpandas

Make sure you have the SPARK_HOME environment variable set to the root directory of your Spark 1.4.0 (or above) distribution.
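A missing or mistyped SPARK_HOME is easy to overlook, so it can help to check it before starting a session. The snippet below is a minimal sketch using only the Python standard library. The helper name find_spark_home is hypothetical and not part of SparklingPandas, and whether SparklingPandas performs a similar check itself is not stated here.

```python
import os


def find_spark_home(environ=None):
    """Return the SPARK_HOME path, or raise if it is unset or not a directory.

    Hypothetical helper for illustration; not part of SparklingPandas.
    """
    # Default to the real process environment; passing a dict instead
    # makes the check easy to try out without touching os.environ.
    env = os.environ if environ is None else environ
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError(
            "SPARK_HOME is not set; point it at the root of your Spark distribution"
        )
    if not os.path.isdir(spark_home):
        raise RuntimeError(
            "SPARK_HOME does not point to a directory: %s" % spark_home
        )
    return spark_home
```

Calling find_spark_home() with no arguments reads the real environment and returns the path if it looks usable. It only confirms that the directory exists, not that it contains a Spark 1.4.0 (or above) distribution.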

Requirements

SparklingPandas requires Python 2.7 and a recent version of Spark (currently v1.4), available from http://spark.apache.org.

State

SparklingPandas is in early development. Feedback is taken seriously and is seriously appreciated. As you can tell, we SparklingPandas are a pretty serious bunch.

Videos

An early version of SparklingPandas was discussed in the video "Sparkling Pandas - using Apache Spark to scale Pandas" by Holden Karau and Juliet Hougland.

Support

Check out our Google group at https://groups.google.com/forum/#!forum/sparklingpandas