Photo by Kristopher Roller on Unsplash

Predicting Churn on Big-Data

…with the help of PySpark!

Project Definition

Sparkify is a fictional music-streaming service created by Udacity. The service is represented by a 26-million-row dataset that records the behavior and characteristics of 27 thousand users, logged second by second across October and November 2018. At 12 gigabytes, this dataset falls into the realm of ‘Big Data,’ which means that its size is prohibitive for a conventional computer to analyze. The GitHub repository for this project is here:

Pyspark-Udacity-Capstone/Sparkify_Cluster.ipynb at main · DJSherwood/Pyspark-Udacity-Capstone (github.com)

Problem Statement

For any service, the issue of a customer / user quitting or ‘churning’ is concerning. …


Photo by Matt Hudson on Unsplash

The worldwide pandemic has changed our lives, mostly for the worse. The loss of life, economic opportunity, and old-fashioned human contact will be felt for years to come. But there was one aspect of the pandemic that I enjoyed.

Traffic! Or, the lack thereof.

I wondered recently whether automobile crash numbers had improved during the pandemic as well. If there are fewer cars on the road, then surely there should be fewer crashes, right? After about 5 seconds of searching, I found this dataset:

US-Accidents: A Countrywide Traffic Accident Dataset — Sobhan Moosavi (smoosavi.org)

I am using “Version 4,” which…


Photo by Quinten de Graaf on Unsplash

There is a wealth of information about ML methods and stats for the aspiring data scientist, but not nearly enough about how a project should be organized. All too often I have seen data science projects presented in Jupyter Notebooks almost haphazardly. The lack of organization obscures the insight and can even hide disastrous methodological errors (like a data leak).

But there is a solution! By using pipelines from scikit-learn you can develop a consistent workflow that shows all of your data transformations and can be applied to many data science projects.
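To make the idea concrete, here is a minimal sketch of such a workflow. It is not taken from the article: the synthetic data, the step names "scale" and "model", and the choice of StandardScaler with LogisticRegression are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; any tabular X, y would do).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each transformation is a named step, so the whole workflow is visible in one place.
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),      # illustrative preprocessing step
    ("model", LogisticRegression()),  # illustrative estimator
])

# fit() learns the scaling statistics from the training split only,
# which is one way a pipeline helps guard against a data leak.
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to the test split before predicting.
accuracy = pipe.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the scaler and the model travel together in one object, the same sequence of transformations is applied at training and prediction time without being repeated by hand.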

The Pipeline() class obtains this…

Daniel Sherwood

Always reading, always learning, and always looking for opportunities to implement data science.
