SparklingPandas aims to make it easy to use the distributed computing power of PySpark to scale your data analysis with Pandas. SparklingPandas builds on Spark's DataFrame class to give you a polished, pythonic, and Pandas-like API.
You can install SparklingPandas with pip:
pip install sparklingpandas
Once installed, you can import the package and hack away.
Make sure you have the SPARK_HOME environment variable set to the root directory of your Spark 1.4.0 (or above) distribution.
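If you want to verify that setting before starting a session, a small check along these lines may help. This is only a sketch using the Python standard library: `check_spark_home` is a hypothetical helper name and is not part of SparklingPandas or Spark. It confirms only that the variable is set and points at an existing directory; it does not check the Spark version.

```python
import os


def check_spark_home(environ=None):
    """Return the SPARK_HOME path, or raise if it is unset or not a directory.

    Hypothetical helper for illustration; not part of SparklingPandas.
    """
    # Allow a custom mapping to be passed in (handy for testing);
    # fall back to the real process environment otherwise.
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        raise EnvironmentError("SPARK_HOME is not set")
    if not os.path.isdir(spark_home):
        raise EnvironmentError(
            "SPARK_HOME is not a directory: %s" % spark_home)
    return spark_home
```

Calling `check_spark_home()` before importing the package should fail early with a readable message rather than a less obvious error later on.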
SparklingPandas requires Python 2.7 and a recent version of Spark (currently v1.4), available from http://spark.apache.org.
This is in early development. Feedback is taken seriously and is seriously appreciated. As you can tell, us SparklingPandas are a pretty serious bunch.
An early version of Sparkling Pandas was discussed in "Sparkling Pandas - using Apache Spark to scale Pandas" by Holden Karau and Juliet Hougland.
Check out our Google group at https://groups.google.com/forum/#!forum/sparklingpandas