Data Wrangling and Modelling with PySpark

Data Science Workshops / online

Data can be a valuable asset, especially when there’s a lot of it. Exploratory data analysis, business intelligence, and machine learning can benefit tremendously if such Big Data can be wrangled and modelled at scale. Apache Spark is an open-source distributed engine for querying and processing data. In this three-day hands-on workshop, you will learn how to leverage Spark and Python to process Big Data.

You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python and Jupyter Notebook environment for Spark. You’ll learn about different techniques for collecting and processing data. We’ll begin with Resilient Distributed Datasets (RDDs) and work our way up to DataFrames.

We provide examples of how to read data from files and how to specify schemas, either by reflection or programmatically. We discuss the concept of lazy execution in detail and demonstrate various transformations and actions specific to RDDs and DataFrames. We also show you how DataFrames can be manipulated using SQL queries.
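To give a feel for lazy execution before the workshop, here is a minimal pure-Python sketch. It is not Spark's API: the `LazyDataset` class and the `square` helper are hypothetical names invented for this illustration, and the code runs on a single machine. It only mimics the idea that transformations (`map`, `filter`) are recorded, while an action (`collect`, `count`) triggers the actual work.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional, Tuple


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not part of PySpark).

    Transformations only record what should happen; nothing is computed
    until an action such as collect() or count() is called.
    """

    def __init__(self, source: Iterable[Any],
                 ops: Optional[List[Tuple[str, Callable]]] = None) -> None:
        self._source = source
        self._ops = ops or []  # recorded ("map" | "filter", function) steps

    # --- transformations: return a new dataset, do no work ---
    def map(self, func: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(self._source, self._ops + [("map", func)])

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._ops + [("filter", pred)])

    # --- internal: apply the recorded steps to each element in order ---
    def _iterate(self) -> Iterator[Any]:
        for item in self._source:
            keep = True
            for kind, func in self._ops:
                if kind == "map":
                    item = func(item)
                elif not func(item):  # a filter step rejected the item
                    keep = False
                    break
            if keep:
                yield item

    # --- actions: trigger the computation ---
    def collect(self) -> List[Any]:
        return list(self._iterate())

    def count(self) -> int:
        return sum(1 for _ in self._iterate())


calls = []  # records every time square() really runs


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
pipeline = numbers.map(square).filter(lambda x: x % 2 == 1)

calls_before_action = len(calls)  # still 0: only the plan has been built
result = pipeline.collect()       # the action runs the recorded steps
print(calls_before_action, result)
```

In this sketch, every action re-runs the whole recorded pipeline from the source, so calling `pipeline.count()` after `collect()` invokes `square` five more times.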

We’ll show you how to apply supervised machine learning models such as linear regression, logistic regression, decision trees, and random forests. You’ll also see unsupervised machine learning models such as k-means and hierarchical clustering.

By the end of this workshop, you will have a solid understanding of how to process data using PySpark, and you will know how to use Spark’s machine learning library to build and train various machine learning models.

