DSCI-272: Predicting with Cloudera Machine Learning

Duration: 4 Days (32 Hours)

DSCI-272: Predicting with Cloudera Machine Learning Course Overview:

To collaborate effectively, enterprise data science teams need unified access to business data and to the tools and computing resources required to develop and deploy machine learning workflows. Cloudera Machine Learning (CML), part of the Cloudera Data Platform (CDP), addresses this need.

This four-day course covers machine learning workflows and their operation using CML. Participants explore, visualize, and analyze data, and then train, evaluate, and deploy machine learning models.

The course follows an end-to-end data science and machine learning workflow, based on realistic scenarios and datasets from a fictional technology company. Demonstrations and hands-on exercises use Python (with PySpark) in CML. By the end of the course, attendees should be able to work through complex data science and machine learning workflows with confidence.

Intended Audience:

  • The course is designed for data scientists who need to understand how to utilize Cloudera Machine Learning and the Cloudera Data Platform to achieve faster model development and deliver production machine learning at scale.
  • Data engineers, developers, and solution architects who collaborate with data scientists will also find this course valuable.

Learning Objectives of DSCI-272: Predicting with Cloudera Machine Learning:

Through lecture and hands-on exercises, you will learn how to:

  • Utilize Cloudera SDX and other components of the Cloudera Data Platform to locate data for machine learning experiments
  • Use an Applied ML Prototype (AMP)
  • Manage machine learning experiments
  • Connect to various data sources and explore data
  • Utilize Apache Spark and Spark ML
  • Deploy an ML model as a REST API
  • Manage and monitor deployed ML models

Introduction to CML

  • Overview
  • CML Versus CDSW
  • ML Workspaces
  • Workspace Roles
  • Projects and Teams
  • Settings
  • Runtimes/Legacy Engines
  • Editors and IDE
  • Git
  • Embedded Web Applications
  • AMPs
  • SDX Overview
  • Data Catalog
  • Authorization
  • Lineage
  • Data Visualization Overview
  • CDP Data Visualization Concepts
  • Using Data Visualization in CML
  • Experiments in CML
  • Entering Code
  • Getting Help
  • Accessing the Linux Command Line
  • Working With Python Packages
  • Formatting Session Output
  • How Spark Works
  • The Spark Stack
  • File Formats in Spark
  • Spark Interface Languages
  • Introduction to PySpark
  • How DataFrame Operations Become Spark Jobs
  • How Spark Executes a Job
  • Running a Spark Application
  • Reading data into a Spark SQL DataFrame
  • Examining the Schema of a DataFrame
  • Computing the Number of Rows and Columns of a DataFrame
  • Examining a Few Rows of a DataFrame
  • Stopping a Spark Application
  • Inspecting a DataFrame
  • Inspecting a DataFrame Column
  • Spark SQL DataFrames
  • Working with Columns
  • Working with Rows
  • Working with Missing Values
  • Spark SQL Data Types
  • Working with Numerical Columns
  • Working with String Columns
  • Working with Date and Timestamp Columns
  • Working with Boolean Columns
  • Complex Collection Data Types
  • Arrays
  • Maps
  • Structs
  • User-Defined Functions
  • Example 1: Hour of Day
  • Example 2: Great-Circle Distance
  • Working with Delimited Text Files
  • Working with Text Files
  • Working with Parquet Files
  • Working with Hive Tables
  • Working with Object Stores
  • Working with Pandas DataFrames
  • Combining and Splitting DataFrames
  • Joining DataFrames
  • Splitting a DataFrame
  • Summarizing Data with Aggregate Functions
  • Grouping Data
  • Pivoting Data
  • Window Functions
  • Example: Cumulative Count and Sum
  • Example: Compute Average Days Between Rides for Each Rider
  • Introduction to Machine Learning
  • Machine Learning Tools
  • Introduction to Apache Spark MLlib
  • Possible Workflows for Big Data
  • Exploring a Single Variable
  • Exploring a Pair of Variables
  • Monitoring Spark Applications
  • Configuring the Spark Environment
  • Assemble the Feature Vector
  • Fit the Linear Regression Model
  • Generate Label
  • Fit the Logistic Regression Model
  • Requirements for Hyperparameter Tuning
  • Tune the Hyperparameters Using Holdout Cross-Validation
  • Tune the Hyperparameters Using K-Fold Cross-Validation
  • Print and Plot the Home Coordinates
  • Fit a Gaussian Mixture Model
  • Explore the Cluster Profiles
  • Fit a Topic Model Using Latent Dirichlet Allocation
  • Recommender Models
  • Generate Recommendations
  • Fit the Pipeline Model
  • Inspect the Pipeline Model
  • Build a Scikit-Learn Model
  • Apply the Model Using a Spark UDF
  • Load the Serialized Model
  • Define a Wrapper Function to Generate a Prediction
  • Test the Function
  • Autoscaling Workloads
  • Working with GPUs
  • Why Monitor Models?
  • Common Model Metrics
  • Model Monitoring With Evidently
  • Continuous Model Monitoring

DSCI-272: Predicting with Cloudera Machine Learning Course Prerequisites

  • Familiarity with basic concepts of machine learning and predictive modeling.
  • A working understanding of the Cloudera Machine Learning (CML) platform.
  • Proficiency in using Python for data manipulation and analysis.
  • Basic knowledge of data preprocessing techniques and feature engineering.
  • Experience with data visualization and interpretation of machine learning models.

Training Exclusives

This course comes with the following benefits:

  • Practice labs
  • Training by certified trainers
  • Access to the recordings of your class sessions for 90 days
  • Digital courseware
  • 24/7 learner support
