The internet is teeming with resources on Big Data, Spark, machine learning, data science, Python, and analytic notebooks. However, given how much interest there is in these topics and how heavily they overlap, finding the best references isn't always as simple as a quick Google search. Which resources have you come across that combine them all and are good enough that you'd recommend them to a friend?
Here are a few of my own favorites:
- This quick tutorial on ML in PySpark by Databricks is a pretty good walk-through, and it doesn't stray into the proprietary Databricks-only classes: "MLlib and Machine Learning" in the Databricks Documentation
- This free video course by Eric Charles is pretty nice, including many examples in Scala, Python and R, all executed within Zeppelin. Some of my favorite chapters are:
  - Manipulating data frames in Zeppelin, in videos 2.6 and 2.7 (featuring Scala and SQL)
  - An introduction to visualizing data in Zeppelin, using ggplot2, matplotlib and AngularJS
  - Videos 2.10 and beyond, which introduce Spark ML and the critical pipeline concept
- Although we will generally prefer the scalable libraries of Spark ML, this introduction to ML with scikit-learn is a nice primer on the topic in Python.
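For anyone who hasn't met the pipeline concept mentioned above, here is a toy, library-free sketch of the idea in plain Python. It is not the Spark ML or scikit-learn API: every class name below (`MinMaxScaler`, `MinMaxModel`, `Rounder`, `Pipeline`, `PipelineModel`) is hypothetical and only mimics the general pattern, where fitting a chain of stages on training data yields a fitted model that can then transform new data the same way.

```python
# Toy illustration of the pipeline pattern; all classes are hypothetical
# stand-ins, not the real Spark ML or scikit-learn classes.

class MinMaxModel:
    """Fitted transformer: rescales values using the range learned in fit()."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def transform(self, values):
        span = (self.hi - self.lo) or 1.0  # avoid dividing by zero on constant input
        return [(v - self.lo) / span for v in values]


class MinMaxScaler:
    """Estimator: fit() learns the min and max and returns a fitted transformer."""
    def fit(self, values):
        return MinMaxModel(min(values), max(values))


class Rounder:
    """Stateless transformer: nothing to learn, so it has no fit()."""
    def transform(self, values):
        return [round(v, 2) for v in values]


class PipelineModel:
    """Result of fitting a pipeline: applies each fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


class Pipeline:
    """Chains stages; fit() fits each estimator on the output of the previous stage."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; plain transformers are used as-is.
            model = stage.fit(values) if hasattr(stage, "fit") else stage
            values = model.transform(values)
            fitted.append(model)
        return PipelineModel(fitted)


train = [10.0, 20.0, 30.0]
model = Pipeline([MinMaxScaler(), Rounder()]).fit(train)

scaled_train = model.transform(train)
scaled_new = model.transform([25.0])  # reuses the range learned from train
print(scaled_train, scaled_new)
```

The point of the pattern is that whatever was learned during `fit` (here, just the min and max) travels with the fitted model, so new data gets exactly the same preprocessing as the training data.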
What are some of your favorites?