
Spark and Parquet for Optimized Data Processing

Bess Yang (qy561@nyu.edu), Iris Lu (hl5679@nyu.edu), Chloe Kwon (ekk294@nyu.edu)

Project Overview

This project focused on testing several hypotheses regarding movie ratings data from 1097 participants across 400 movies. The dataset included various behavioral and demographic attributes. We applied a variety of statistical techniques, including independent t-tests, ANOVA, and non-parametric tests, to investigate factors such as gender differences in movie enjoyment, sibling influences, and the consistency of quality in movie franchises.
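
As a small illustration of the kind of test involved, the following is a minimal sketch of an independent (Welch's) t-test comparing enjoyment between two gender groups. This is not our actual analysis code: the file and column names (movie_ratings.csv, participant_id, gender, rating) are hypothetical, and it assumes one row per participant-movie rating.

    # Minimal sketch, not the project's analysis code.
    # File and column names below are hypothetical.
    import pandas as pd
    from scipy import stats

    ratings = pd.read_csv("movie_ratings.csv")  # hypothetical input file

    # One mean enjoyment score per participant, split by gender.
    per_person = (ratings.groupby(["participant_id", "gender"])["rating"]
                         .mean()
                         .reset_index())
    male = per_person.loc[per_person["gender"] == "male", "rating"]
    female = per_person.loc[per_person["gender"] == "female", "rating"]

    # Welch's t-test does not assume equal variances between groups.
    t_stat, p_value = stats.ttest_ind(male, female, equal_var=False)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")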

Languages, Platforms, and Tools

  • Languages: Python, SQL

  • Tools: Apache Spark, Hadoop Distributed File System (HDFS), Parquet

  • Platforms: NYU's High-Performance Computing (HPC) environment, Google Cloud Dataproc

Results

  • Dataset Optimization

    • We tested three datasets (peopleSmallOpt2.parquet, peopleModerateOpt2.parquet, and peopleBigOpt2.parquet) using different optimization strategies. For each dataset, we recorded the minimum, maximum, and median execution times over 25 runs. Across all dataset sizes, the optimizations produced substantial performance gains, most notably for the largest dataset, where both the minimum and maximum execution times dropped. (A sketch of this kind of timing harness appears after this list.)

  • Further iterations with additional optimizations also showed improvements across the board, though we observed some fluctuation in timings because the benchmarks ran on a shared computing cluster.
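
To make the benchmarking procedure concrete, here is a minimal sketch of the kind of timing harness described above. It is not the exact project script: the groupBy aggregation and the zipcode column are stand-ins for the assignment's actual queries, and each run re-reads the file so timings reflect the cost of reading the Parquet data.

    # Minimal sketch of a min/median/max timing harness over 25 runs.
    # The query and "zipcode" column are hypothetical stand-ins.
    import statistics
    import time

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-benchmark").getOrCreate()

    def time_query(path, runs=25):
        # Time a representative aggregation on a fresh read, `runs` times.
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            df = spark.read.parquet(path)
            df.groupBy("zipcode").count().collect()  # stand-in query
            times.append(time.perf_counter() - start)
        return min(times), statistics.median(times), max(times)

    for path in ["peopleSmallOpt2.parquet",
                 "peopleModerateOpt2.parquet",
                 "peopleBigOpt2.parquet"]:
        lo, med, hi = time_query(path)
        print(f"{path}: min={lo:.2f}s median={med:.2f}s max={hi:.2f}s")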

My Contributions

In this project, we did not split the work among team members. Instead, each of us completed the entire assignment independently and then compared results. We discussed our findings and approaches, and ultimately produced a final version that we all agreed on.

I was responsible for writing the queries specified in the assignment and for implementing and testing the optimization strategies in Spark on the Parquet datasets. I designed and executed the scripts that processed the datasets and logged execution times across multiple runs for comparison. Through careful testing and tuning, we reduced the overhead of reading and processing the data, ensuring the pipeline scaled as data sizes grew. One such optimization strategy is sketched below.
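
As an illustration (not the project's exact code), the sketch below rewrites a Parquet dataset repartitioned and sorted on a frequently filtered column, so that Parquet row-group statistics let Spark skip irrelevant data on later reads. The input file name, the partition count, and the zipcode column are hypothetical stand-ins.

    # Minimal sketch of one optimization strategy: repartition and sort
    # on a frequently filtered column before writing Parquet. Names and
    # the partition count are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-optimize").getOrCreate()

    df = spark.read.parquet("peopleBig.parquet")  # hypothetical input

    (df.repartition(20, "zipcode")        # co-locate rows sharing a key
       .sortWithinPartitions("zipcode")   # tighten row-group min/max stats
       .write.mode("overwrite")
       .parquet("peopleBigOpt2.parquet"))

    # Queries filtering on the sort key can now skip whole row groups.
    opt = spark.read.parquet("peopleBigOpt2.parquet")
    print(opt.filter(opt["zipcode"] == "11201").count())

Sorting within partitions tightens each row group's min/max statistics, which is what allows Parquet's predicate pushdown to skip row groups entirely when a query filters on that column.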
