Course Code: 19306

Data Science and Data Engineering for Architects Training

Duration:
4 Days
Delivery Format:
Instructor-Led Training, Virtual Instructor-Led Training


  • Course Overview
  • This Data Science training course is complemented by a variety of hands-on exercises that help attendees reinforce their theoretical knowledge of the material. Topics covered include:

    Applied data science, business analytics, and data engineering
    Common data science/machine learning algorithms for supervised and unsupervised machine learning
    NumPy, pandas, matplotlib, seaborn, scikit-learn
    Python REPLs
    Jupyter notebooks
    Data analytics life-cycle phases
    Data repairing and normalizing
    Data aggregation and grouping
    Data visualization and EDA
    Operational data analytics
    Distributed and scalable data processing
    Cloud machine learning and data engineering capabilities
  • Audience
  • IT architects and technical managers


  • Prerequisites
  • Participants should have a working knowledge of Python (or the programming background needed to pick up Python’s syntax quickly) and be familiar with core statistical concepts such as variance and correlation.

Course Details

  • Lesson 1. Python for Data Science
  • Python Data Science-Centric Libraries
  • SciPy
  • NumPy
  • pandas
  • Scikit-learn
  • Matplotlib
  • Seaborn
  • Python Dev Tools and REPLs
  • IPython
  • Jupyter Notebooks
  • Anaconda
  • Summary
  • Lesson 2. Data Visualization in Python
  • Why Do I Need Data Visualization?
  • Data Visualization in Python
  • Getting Started with matplotlib
  • A Basic Plot, Scatter Plots, Figures
  • Saving Figures to a File
  • Getting Started with seaborn
  • Histograms and KDE
  • Plotting Bivariate Distributions
  • Scatter Plots in seaborn, Pair plots in seaborn
  • Heatmaps
  • A Seaborn Scatterplot with Varying Point Sizes and Hues
  • Summary
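As a quick illustration of the plotting topics in this lesson, here is a minimal matplotlib sketch (the data is made up; seaborn builds on the same figure/axes objects shown here):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# a figure with two axes: a basic line plot and a scatter plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x), label="sin(x)")   # a basic plot
ax1.legend()
ax2.scatter(x[::5], np.cos(x[::5]))      # a scatter plot

fig.savefig("basic_plots.png")           # saving the figure to a file
```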
  • Lesson 3. Introduction to NumPy
  • What is NumPy?
  • The First Take on NumPy Arrays, The ndarray Data Structure
  • Understanding Axes, Indexing Elements in a NumPy Array
  • Re-Shaping, Commonly Used Array Metrics
  • Commonly Used Aggregate Functions
  • Sorting Arrays, Vectorization, Vectorization Visually
  • Broadcasting, Broadcasting Visually
  • Filtering, Array Arithmetic Operations
  • Reductions: Finding the Sum of Elements by Axis
  • Array Slicing, 2-D Array Slicing
  • The Linear Algebra Functions
  • Summary
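The core NumPy ideas above (re-shaping, axes, aggregate reductions, broadcasting, filtering, and slicing) can be sketched in a few lines:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # re-shape a 1-D range into a 3x4 array

# axis-aware reductions: axis=0 collapses rows, axis=1 collapses columns
col_sums = a.sum(axis=0)          # array([12, 15, 18, 21])
row_sums = a.sum(axis=1)          # array([ 6, 22, 38])

# broadcasting: the 1-D column means are stretched across every row of `a`
centered = a - a.mean(axis=0)

# boolean filtering and 2-D slicing
evens = a[a % 2 == 0]             # elements satisfying the mask
block = a[1:, :2]                 # rows 1 onward, first two columns
```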
  • Lesson 4. Introduction to pandas
  • What is pandas?
  • The DataFrame Object, The DataFrame's Value Proposition
  • Creating a pandas DataFrame, Getting DataFrame Metrics
  • Accessing DataFrame Columns, Accessing DataFrame Rows
  • Accessing DataFrame Cells, Deleting Rows and Columns
  • Adding a New Column to a DataFrame
  • Getting Descriptive Statistics of DataFrame Columns
  • Getting Descriptive Statistics of DataFrames
  • Reading From CSV Files
  • Writing to a CSV File
  • Summary
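A minimal sketch of the DataFrame operations listed above, using a small made-up table:

```python
import pandas as pd

# creating a DataFrame from a dict of columns
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Denver"],
    "temp_f": [85, 70, 60],
})

print(df.shape)            # DataFrame metrics: (rows, columns)
print(df["city"])          # accessing a column
print(df.loc[0])           # accessing a row by label
print(df.at[1, "temp_f"])  # accessing a single cell

df["temp_c"] = (df["temp_f"] - 32) * 5 / 9   # adding a new column
print(df["temp_f"].describe())               # descriptive statistics

df.to_csv("cities.csv", index=False)         # writing to a CSV file
df2 = pd.read_csv("cities.csv")              # reading from a CSV file
```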
  • Lesson 5. Repairing and Normalizing Data
  • Repairing and Normalizing Data
  • Dealing with the Missing Data
  • Sample Data Set, Getting Info on Null Data
  • Dropping a Column
  • Interpolating Missing Data in pandas
  • Replacing the Missing Values with the Mean Value
  • Scaling (Normalizing) the Data
  • Data Preprocessing with scikit-learn
  • Scaling with the scale() Function
  • The MinMaxScaler Object
  • Summary
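The repair-and-normalize steps above can be sketched with pandas alone; the min-max step at the end is the same [0, 1] rescaling that scikit-learn's MinMaxScaler performs (the sample column is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"reading": [10.0, np.nan, 30.0, np.nan, 50.0]})

print(df.isna().sum())      # getting info on null data

# two repair strategies: linear interpolation vs. mean imputation
interpolated = df["reading"].interpolate()                # 10, 20, 30, 40, 50
mean_filled = df["reading"].fillna(df["reading"].mean())  # NaN -> 30.0

# min-max scaling of the repaired column to the [0, 1] range
col = interpolated
scaled = (col - col.min()) / (col.max() - col.min())
```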
  • Lesson 6. Defining Data Science
  • What is Data Science?
  • Data Science, Machine Learning, AI?, The Data Science Ecosystem
  • Tools of the Trade, The Data-Related Roles, Data Scientists at Work
  • Examples of Data Science Projects, The Concept of a Data Product
  • Applied Data Science at Google
  • Data Science and ML Terminology: Features and Observations
  • Terminology: Labels and Ground Truth, Label Examples
  • Terminology: Continuous and Categorical Features
  • Encoding Categorical Features using One-Hot Encoding Scheme
  • Example of 'One-Hot' Encoding Scheme
  • Gartner's Magic Quadrant for Data Science and Machine Learning Platforms (a Labeling Example)
  • Machine Learning in a Nutshell, Common Distance Metrics
  • The Euclidean Distance, Decision Boundary Examples (Object Classification)
  • What is a Model?, Training a Model to Make Predictions
  • Types of Machine Learning, Supervised vs Unsupervised Machine Learning, Supervised Machine Learning Algorithms
  • Unsupervised Machine Learning Algorithms, Which ML Algorithm to Choose?
  • Bias-Variance (Underfitting vs Overfitting) Trade-off
  • Underfitting vs Overfitting (a Regression Model Example) Visually
  • ML Model Evaluation, Mean Squared Error (MSE) and Mean Absolute Error (MAE)
  • Coefficient of Determination, Confusion Matrix
  • The Binary Classification Confusion Matrix, The Typical Machine Learning Process
  • A Better Algorithm or More Data?, The Typical Data Processing Pipeline in Data Science
  • Data Discovery Phase, Data Harvesting Phase
  • Data Cleaning/Priming/Enhancing Phase, Exploratory Data Analysis and Feature Selection
  • Exploratory Data Analysis and Feature Selection Cont'd
  • ML Model Planning Phase, Feature Engineering
  • ML Model Building Phase, Capacity Planning and Resource Provisioning
  • Communicating the Results
  • Production Roll-out
  • Data Science Gotchas
  • Summary
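A few of the lesson's concepts (one-hot encoding, the Euclidean distance, and the MSE/MAE evaluation metrics) can be demonstrated concretely with made-up values:

```python
import numpy as np
import pandas as pd

# one-hot encoding a categorical feature
colors = pd.Series(["red", "green", "red", "blue"])
one_hot = pd.get_dummies(colors)            # columns: blue, green, red

# the Euclidean distance between two observations
p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])
euclidean = np.sqrt(((p - q) ** 2).sum())   # sqrt(9 + 16) = 5.0

# regression model evaluation metrics
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])
mse = ((y_true - y_pred) ** 2).mean()       # mean squared error
mae = np.abs(y_true - y_pred).mean()        # mean absolute error
```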
  • Lesson 7. Overview of the scikit-learn Library
  • The scikit-learn Library
  • The Navigational Map of ML Algorithms Supported by scikit-learn
  • Developer Support
  • scikit-learn Estimators, Models, and Predictors
  • Annotated Example of the LinearRegression Estimator
  • Annotated Example of the Support Vector Classification Estimator
  • Data Splitting into Training and Test Datasets
  • Data Splitting in scikit-learn
  • Cross-Validation Technique
  • Summary
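The estimator protocol and train/test split covered above can be sketched with a LinearRegression estimator on noiseless synthetic data (so the fit, and hence the coefficient of determination on the test set, is exact):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic data: y = 2x + 1 exactly
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1

# splitting into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# the standard estimator protocol: fit() on training data, then score/predict
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)   # coefficient of determination (R^2)
```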
  • Lesson 8. Classification Algorithms (Supervised Machine Learning)
  • Classification (Supervised ML) Use Cases
  • Classifying with k-Nearest Neighbors
  • k-Nearest Neighbors Algorithm Visually
  • Decision Trees, Decision Tree Terminology, Decision Tree Classification in the Context of Information Theory
  • Using Decision Trees, Properties of the Decision Tree Algorithm
  • The Simplified Decision Tree Algorithm
  • Random Forest, Properties of the Random Forest Algorithm
  • Support Vector Machines (SVMs), SVM Classification Visually
  • Properties of SVMs, Dealing with Non-Linear Class Boundaries
  • Logistic Regression (Logit), The Sigmoid Function
  • Logistic Regression Classification Example
  • Logistic Regression's Problem Domain
  • Naive Bayes Classifier (Supervised ML)
  • Naive Bayesian Probabilistic Model in a Nutshell
  • Bayes Formula
  • Document Classification with Naive Bayes
  • Summary
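Of the classifiers covered above, k-Nearest Neighbors is the simplest to code from scratch (this is also the subject of Lab 6); a minimal NumPy sketch with a made-up two-cluster dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of k closest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# two well-separated classes
X_train = np.array([[0, 0], [0, 1], [1, 0],
                    [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # near class "a"
```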
  • Lesson 9. Unsupervised Machine Learning Algorithms
  • PCA, PCA and Data Variance, PCA Properties
  • Importance of Feature Scaling Visually
  • Unsupervised Learning Type: Clustering
  • Clustering vs Classification
  • Clustering Examples
  • k-means Clustering
  • k-means Clustering in a Nutshell
  • k-means Characteristics
  • Global vs Local Minimum Explained
  • Summary
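The "k-means in a nutshell" idea above, alternating an assignment step with a centroid-update step, can be sketched directly in NumPy (toy data; a production run would use scikit-learn's KMeans):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # init from data points
    for _ in range(n_iter):
        # assignment step: label each point with its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: move each centroid to the mean of its cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# two well-separated blobs
X = np.array([[0, 0], [0, 1], [1, 0],
              [9, 9], [9, 10], [10, 9]], dtype=float)
labels, centroids = kmeans(X, k=2)
```

Note that k-means only finds a local minimum of the within-cluster distances; on real data it is typically run several times from different random initializations.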
  • Lab Exercises
  • Lab 1. Learning the CoLab Jupyter Notebook Environment
  • Lab 2. Data Visualization in Python
  • Lab 3. Understanding NumPy
  • Lab 4. Data Repairing
  • Lab 5. Understanding Common Metrics
  • Lab 6. Coding kNN Algorithm in NumPy (Optional)
  • Lab 7. Understanding Machine Learning Datasets in scikit-learn
  • Lab 8. Building Linear Regression Models
  • Lab 9. Spam Detection with Random Forest
  • Lab 10. Spam Detection with Support Vector Machines
  • Lab 11. Spam Detection with Logistic Regression
  • Lab 12. Comparing Classification Algorithms
  • Lab 13. Feature Engineering and EDA
  • Lab 14. Understanding PCA