SparklingPandas aims to make it easy to use the distributed computing power of PySpark to scale your data analysis with Pandas. SparklingPandas builds on Spark's DataFrame class to give you a polished, pythonic, and Pandas-like API.


You can install SparklingPandas with pip:

pip install sparklingpandas

Once installed, you can import the package and hack away:

import sparklingpandas

Make sure you have the SPARK_HOME environment variable set to the root directory of your Spark 1.4.0 (or above) distribution.
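
If you want to confirm the environment before importing the package, a small check like the one below may help. It is only a sketch that uses the Python standard library; the helper name check_spark_home is hypothetical and not part of SparklingPandas, and it only verifies that SPARK_HOME is set and points at an existing directory, not that the Spark version is 1.4.0 or above.

```python
import os


def check_spark_home(environ=None):
    """Return the SPARK_HOME path if it is set and is an existing directory.

    Hypothetical helper for illustration; not part of SparklingPandas.
    """
    # Allow a custom mapping to be passed in, defaulting to the real environment.
    env = os.environ if environ is None else environ
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        raise EnvironmentError("SPARK_HOME is not set")
    if not os.path.isdir(spark_home):
        raise EnvironmentError("SPARK_HOME is not a directory: %s" % spark_home)
    return spark_home
```

Calling check_spark_home() before import sparklingpandas should surface a missing or mistyped SPARK_HOME early, with a clear error message.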


SparklingPandas requires a recent version of Spark (currently v1.4) and Python 2.7.


SparklingPandas is in early development. Feedback is taken seriously and is seriously appreciated. As you can tell, we SparklingPandas are a pretty serious bunch.


An early version of Sparkling Pandas was discussed in "Sparkling Pandas - using Apache Spark to scale Pandas" by Holden Karau and Juliet Hougland.


Check out our Google group, sparklingpandas.