About the Course

In this course you will learn about basic statistics and data types, data preparation, feature engineering, model fitting, and pipelines with grid search. Apache Spark™ is a fast, general-purpose engine for large-scale data processing, with built-in modules for streaming, machine learning, and graph processing. This course shows you how to use Spark’s machine learning pipelines to fit models and search for optimal hyperparameters on a Spark cluster.

Course Syllabus

Module 1 - Basic Statistics and Data Types

Vectors and Labelled Points
Local and Distributed Matrices
Summary Statistics, Correlations, and Random Data
Sampling
Hypothesis Testing

Module 2 - Preparing Data

Statistics, Random data and Sampling on Data Frames
Handling Missing Data and Imputing Values
Transformers and Estimators
Data Normalization
Identifying Outliers

Module 3 - Feature Engineering

Feature Vectors
Categorical Features
Using Explode, User Defined Functions, and Pivot
Principal Component Analysis (PCA) in Feature Engineering
RFormulas

Module 4 - Fitting a Model

Decision Trees
Random Forests
Gradient-Boosting Trees
Linear Methods
Evaluation

Module 5 - Pipeline and Grid Search

Predicting Grant Applications: Introduction
Predicting Grant Applications: Creating Features
Predicting Grant Applications: Building a Pipeline
Predicting Grant Applications: Cross Validation and Model Tuning
Predicting Grant Applications: Wrapping Up

General Information

Self-paced
Flexible enrolment
Audit multiple times
There is only ONE chance to pass the course, but multiple attempts per question

Recommended Existing Skills

General understanding of Scala
Experience with Java (preferred), Python, or another object-oriented language
General understanding of machine learning

Course Staff

Petro Verkhogliad

Petro Verkhogliad is Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.

Dr Priya Dev

Dr Priya Dev is a lecturer in statistics at ANU and UNSW and the founder of a mobile commerce startup, Qhopper. She completed a PhD in probability theory at ANU and Columbia University and has been a data analytics consultant to ASX-listed companies and global banks. Qhopper is a massively scalable mobile commerce platform built on the Lightbend platform using Scala and Spark. It bridges the technology gap for hospitality businesses, helping them create better experiences and connect with new and existing customers through their own online ordering, CRM, and business intelligence suite.

Joseph Santarcangelo Ph.D.

Joseph has a Ph.D. in Electrical Engineering. His research focuses on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has worked for IBM since completing his Ph.D.

Other Contributors

Agatha Colangelo also contributed.

Data Science for Scala
