wgu9/msds682-fall2023-data-streaming

MSDS 682 Data Streaming

Fall 2023/2024 | University of San Francisco | 2 Units

Author: Jeremy Gu (wgu9@usfca.edu or personal email jeremygu888@gmail.com)

ChangeLog

  • 8/30/2024: Updated course materials for USF
  • 9/18/2023: Create a place to publish Course notes and code

Course Overview

This hands-on course equips students with the essential skills for processing and analyzing real-time data streams at scale using modern data engineering tools and technologies. Students will gain practical experience with Apache Kafka and other streaming technologies while building foundational knowledge through real-world examples and projects.

What You'll Learn

  • Core Stream Processing Concepts: Understanding the fundamentals of real-time data processing vs. batch processing
  • Apache Kafka Ecosystem: Master Kafka topics, producers, consumers, and the broader ecosystem
  • Real-time Analytics: Build streaming applications that process data as it arrives
  • Industry Tools: Hands-on experience with Confluent Cloud, ksqlDB, Faust, and more
  • Professional Skills: Project presentations, technical communication, and interview preparation
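
To make the stream vs. batch distinction in the first bullet concrete, here is a minimal sketch in plain Python. It is not taken from the course materials; the function names and the sample readings are hypothetical. The batch version needs the full dataset before it can answer, while the streaming version emits an updated answer after every event and keeps only a count and a running mean.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch style: wait until the whole dataset is available, then compute once."""
    return sum(values) / len(values)


def streaming_mean(events: Iterable[float]) -> Iterator[float]:
    """Streaming style: update the result as each event arrives.

    Only a count and a running mean are stored, so memory use stays
    constant no matter how many events flow through.
    """
    count = 0
    mean = 0.0
    for value in events:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


if __name__ == "__main__":
    readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor readings
    print("batch:", batch_mean(readings))
    for i, current in enumerate(streaming_mean(readings), start=1):
        print(f"after event {i}: {current}")
```

Both approaches agree once all events have been seen; the difference is that the streaming version has a usable result at every step along the way.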

Key Technologies Covered

  • Apache Kafka - Distributed streaming platform
  • Confluent Cloud - Managed Kafka service
  • ksqlDB - Stream processing with SQL
  • Faust - Python stream processing library
  • FastAPI - Modern web framework for APIs
  • Schema Registry - Data schema management
  • Kafka Connect - Data integration framework
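
As a rough mental model for topics, producers, and consumers, the toy below mimics a partitioned, append-only log in memory. It is an illustration only and does not use or reproduce the Kafka client API: `ToyTopic`, `ToyConsumer`, and the CRC32-based partition choice are all assumptions made for this sketch. The two ideas it tries to show are that records with the same key land in the same partition (so their order is preserved), and that a consumer tracks its position with one offset per partition.

```python
import zlib
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (key, value)


class ToyTopic:
    """In-memory stand-in for a partitioned, append-only topic (not Kafka's API)."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Record]] = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str) -> Tuple[int, int]:
        """Append a record and return its (partition, offset)."""
        # Assumption: CRC32 is used only to get a stable hash for this toy;
        # real clients choose partitions with their own partitioner logic.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1


class ToyConsumer:
    """Reads new records by remembering one offset per partition."""

    def __init__(self, topic: ToyTopic) -> None:
        self.topic = topic
        self.offsets: Dict[int, int] = {p: 0 for p in range(len(topic.partitions))}

    def poll(self) -> List[Record]:
        """Return every record appended since the previous poll."""
        records: List[Record] = []
        for partition, log in enumerate(self.topic.partitions):
            start = self.offsets[partition]
            records.extend(log[start:])
            self.offsets[partition] = len(log)  # advance past what was read
        return records


if __name__ == "__main__":
    topic = ToyTopic(num_partitions=3)
    consumer = ToyConsumer(topic)
    topic.produce("rider-1", "requested")
    topic.produce("rider-1", "picked_up")
    print(consumer.poll())  # both records for rider-1, in order
    print(consumer.poll())  # nothing new since the last poll
```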

Course Structure

Phase 1: Foundation in Stream Processing (Lectures 1-7)

Build a solid understanding of data streaming concepts and Kafka fundamentals to prepare for the final project.

Phase 2: Hands-on Project Work (Lectures 8-9)

Practical application of concepts learned in Phase 1 with interactive project development sessions.

Phase 3: Advanced Topics (Lectures 10-14)

Explore deeper areas of data streaming and present final projects.

Detailed Schedule

| Week | Date   | Topic                               | Covered Material                            |
|------|--------|-------------------------------------|---------------------------------------------|
| 1    | Oct 20 | Intro & Data Streaming              | Course overview, stream vs batch processing |
|      | Oct 24 | Apache Kafka (Pt. 1)                | Architecture, setup, topics, producers      |
| 2    | Oct 27 | Apache Kafka (Pt. 2)                | Consumers, hands-on demos                   |
|      | Oct 31 | Apache Kafka (Pt. 3) & Data Schemas | FastAPI integration, Schema Registry        |
| 3    | Nov 3  | Data Schemas (Pt. 2)                | REST API, Avro, demonstrations              |
|      | Nov 7  | Kafka Connect & Review              | Connectors, course review                   |
| 4    | Nov 10 | Midterm Exam & Stream Processing    | Exam + stream processing strategies         |
|      | Nov 14 | ksqlDB (Pt. 1)                      | Joins, tables, streams                      |
| 5    | Nov 17 | ksqlDB (Pt. 2)                      | Windowing, aggregation, querying            |
|      | Nov 21 | Course Review                       | Review & interview tips                     |
| 6    | Nov 24 | Thanksgiving Break                  | No class                                    |
|      | Nov 28 | Data Pipeline (Pt. 1)               | Airflow concepts and demos                  |
| 7    | Dec 1  | Data Pipeline (Pt. 2)               | Advanced Airflow, career advice             |
|      | Dec 5  | Final Presentations                 | Student project presentations               |
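
The Week 5 session lists windowing and aggregation. As a preview of the idea, and not of ksqlDB syntax, the sketch below counts events per key in fixed, non-overlapping (tumbling) windows using plain Python. The function name, the `(timestamp, key)` event shape, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], window_size_s: int
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, key) using fixed, non-overlapping windows.

    Each event is (timestamp in seconds, key). An event belongs to the window
    whose start is its timestamp rounded down to a multiple of the window size.
    """
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp_s, key in events:
        window_start = (timestamp_s // window_size_s) * window_size_s
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical ride requests: (seconds since start, city)
    requests = [(0, "sf"), (15, "sf"), (59, "nyc"), (60, "sf"), (125, "nyc")]
    for (start, city), n in sorted(tumbling_window_counts(requests, 60).items()):
        print(f"window [{start}, {start + 60}) {city}: {n}")
```

Because the window is derived from each event's own timestamp, the counts can be updated one event at a time, which is what makes this kind of aggregation suitable for a stream.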

Prerequisites

Required Knowledge

  • Statistics: Mean, variance, data visualization, pandas (MSDS 504, MSDS 593)
  • Python: Classes, objects, command line familiarity
  • SQL & Data Pipelines: Database fundamentals (MSDS 681 recommended)
  • Machine Learning: Scikit-learn, model evaluation (MSDS 621, 630, 680 helpful)
  • Business Communication: Professional written and spoken English

Assignments & Grading

Grade Breakdown

  • Attendance & Participation: 5%
  • Individual Assignments (3): 25% total
    • Assignment 1: 10%
    • Assignment 3: 15%
  • Final Project: 45% total
    • Proposal: 5%
    • Written Report: 20%
    • Presentation: 20%
  • Midterm Exam: 25%

Grading Scale

  • A (90-100): Exceptional understanding - ready for professional application
    • A+: 96-100 | A: 93-95 | A-: 90-92
  • B (80-89): Competent understanding - meets business expectations
    • B+: 87-89 | B: 83-86 | B-: 80-82
  • C (70-79): Basic understanding - room for improvement
    • C+: 77-79 | C: 73-76 | C-: 70-72
  • F (<70): Limited understanding - unacceptable performance
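
For students who want to estimate where they stand, the weights and letter bands above can be combined as in the sketch below. This is an unofficial illustration: the component names and sample scores are hypothetical, and treating each band's lower bound as a cutoff without rounding is an assumption, not a stated course rule.

```python
from typing import Dict

# Percent weights from the Grade Breakdown above (they sum to 100).
WEIGHTS: Dict[str, int] = {
    "attendance": 5,
    "assignment_1": 10,
    "assignment_3": 15,
    "proposal": 5,
    "written_report": 20,
    "presentation": 20,
    "midterm": 25,
}

# Lower bounds from the Grading Scale above, highest first.
# Assumption: a total must reach the bound; no rounding is applied.
LETTER_FLOORS = [
    (96, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
]


def weighted_total(scores: Dict[str, float]) -> float:
    """Combine per-component scores (each 0-100) into a 0-100 course total."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(total: float) -> str:
    """Map a 0-100 total to a letter using the floors above."""
    for floor, letter in LETTER_FLOORS:
        if total >= floor:
            return letter
    return "F"


if __name__ == "__main__":
    sample = {  # hypothetical scores
        "attendance": 100, "assignment_1": 90, "assignment_3": 80,
        "proposal": 100, "written_report": 85, "presentation": 90,
        "midterm": 88,
    }
    total = weighted_total(sample)
    print(total, letter_grade(total))
```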

Final Project

The capstone individual project allows students to apply course concepts to a real-world streaming data problem.

Project Components

  1. Proposal (Due Nov 13): 2-page proposal with problem statement, implementation plan, and architecture diagram
  2. Written Report (Due Dec 3): 6-page technical report with code submission
  3. Presentation (Dec 5): 8-minute presentation with 4-minute Q&A

Project Requirements

  • Individual work demonstrating mastery of streaming concepts
  • Use of Apache Kafka for data ingestion
  • Real-time data processing and analytics
  • Peer review component for collaborative learning

Real-World Applications

Students will explore streaming applications across industries:

E-commerce

  • Real-time order processing
  • Dynamic pricing and recommendations
  • Inventory management

Transportation

  • GPS tracking and route optimization
  • Ride-sharing matching algorithms
  • Delivery logistics

Financial Services

  • Fraud detection
  • Risk assessment
  • Trading analytics

IoT & Monitoring

  • Sensor data processing
  • Predictive maintenance
  • Real-time alerting

Learning Outcomes

Upon completion, students will be able to:

  1. Design streaming architectures using Apache Kafka and related technologies
  2. Implement real-time data pipelines for production environments
  3. Develop streaming analytics applications with appropriate tools
  4. Evaluate trade-offs between streaming and batch processing approaches
  5. Communicate technical concepts effectively to business stakeholders
  6. Interview confidently for data engineering and streaming-focused roles

Course Resources

Required Texts

  • Kafka: The Definitive Guide (2nd Edition) - Shapira, Palino, Sivaram, Petty
  • Kafka in Action - Scott, Gamov, Klein

Online Resources

Tools & Platforms

  • Confluent Cloud (managed Kafka)
  • Python 3.8+
  • FastAPI
  • Docker (recommended)
  • Git/GitHub

Academic Policies

Collaboration Policy

  • Group discussions encouraged for concept understanding
  • All submitted work must be individual and original
  • AI tools (ChatGPT, etc.) permitted with proper attribution
  • Zero tolerance for plagiarism or cheating

Attendance Policy

  • Mandatory attendance for all lectures
  • Laptops closed unless instructed otherwise
  • Advance notice required for absences
  • No distractions (phones, etc.) during class

Instructor Information

Jeremy Gu
Email: wgu9@usfca.edu
Office Hours: Wednesday 8:45-9:30am (Virtual)

Background: Senior data science leader with 10+ years of experience at Uber, Amazon, and Shipt. Former CEO/founder of AI EdTech startup. Teaches graduate and executive courses at Stanford and USF.

Getting Started

Technical Setup

  1. Install Python 3.8+ and pip
  2. Set up development environment (VS Code recommended)
  3. Create Confluent Cloud account (free tier available)
  4. Clone course repository
  5. Install required packages: pip install -r requirements.txt

First Steps

  1. Review course materials and schedule
  2. Join Piazza for Q&A and discussions
  3. Complete environment setup
  4. Attend first lecture prepared with questions

Course Development

This course uses MkDocs Material for documentation:

python -m venv venv
source venv/bin/activate  # or .\venv\Scripts\Activate.ps1 on Windows
pip install mkdocs-material
pip install -r requirements.txt
mkdocs serve

Visit locally at http://127.0.0.1:8000/

Career Preparation

This course prepares students for roles including:

  • Data Engineer - Building and maintaining data pipelines
  • Machine Learning Engineer - Real-time ML model deployment
  • Data Scientist - Streaming analytics and insights
  • Software Engineer - Event-driven microservices
  • Data Architect - Designing scalable data systems

The course emphasizes practical skills, interview preparation, and professional communication to ensure career readiness in the rapidly growing field of real-time data processing.


"One day, you'll look back and thank your current self for the dedication and hard work you're investing now." - Course Philosophy
