MSDS 682 Data Streaming

Fall 2023/2024 | University of San Francisco | 2 Units

Author: Jeremy Gu (wgu9@usfca.edu or personal email jeremygu888@gmail.com)

ChangeLog

8/30/2024 Updated course materials for USF
9/18/2023 Created a place to publish course notes and code

Course Overview

This hands-on course equips students with the essential skills for processing and analyzing real-time data streams at scale using modern data engineering tools and technologies. Students will gain practical experience with Apache Kafka and other streaming technologies while building foundational knowledge through real-world examples and projects.

What You'll Learn

Core Stream Processing Concepts: Understanding the fundamentals of real-time data processing vs. batch processing
Apache Kafka Ecosystem: Master Kafka topics, producers, consumers, and the broader ecosystem
Real-time Analytics: Build streaming applications that process data as it arrives
Industry Tools: Hands-on experience with Confluent Cloud, ksqlDB, Faust, and more
Professional Skills: Project presentations, technical communication, and interview preparation
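
As a rough illustration of the stream vs. batch distinction listed above, the sketch below computes the same mean two ways: once over a complete dataset, and once incrementally as each value arrives. The function names and the sample readings are hypothetical and are not taken from the course materials.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch style: wait for the full dataset, then compute once."""
    return sum(values) / len(values)


def streaming_mean(values: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated result as each event arrives."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        # Incremental update; earlier events do not need to stay in memory.
        mean += (value - mean) / count
        yield mean


readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor readings
running = list(streaming_mean(readings))
print(running)               # one result per event
print(batch_mean(readings))  # one result after all events are available
```

The streaming version yields a usable answer after every event, while the batch version yields one answer only at the end; the final values agree.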

Key Technologies Covered

Apache Kafka - Distributed streaming platform
Confluent Cloud - Managed Kafka service
ksqlDB - Stream processing with SQL
Faust - Python stream processing library
FastAPI - Modern web framework for APIs
Schema Registry - Data schema management
Kafka Connect - Data integration framework
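
To make the topic, producer, and consumer vocabulary concrete before touching a real cluster, here is a toy in-memory model. It is not the Kafka API and does not use any Kafka client library: `ToyTopic` and `ToyConsumer` are hypothetical classes, and the byte-sum partitioning is a simplification (real clients use a proper hash function).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (key, value)


class ToyTopic:
    """In-memory stand-in for a topic: a list of append-only partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Record]] = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str) -> int:
        # The same key always maps to the same partition,
        # which preserves ordering per key.
        partition = sum(key.encode()) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition


class ToyConsumer:
    """Tracks one offset per partition and reads only unseen records."""

    def __init__(self, topic: ToyTopic) -> None:
        self.topic = topic
        self.offsets: Dict[int, int] = defaultdict(int)

    def poll(self) -> List[Record]:
        records: List[Record] = []
        for index, log in enumerate(self.topic.partitions):
            records.extend(log[self.offsets[index]:])
            self.offsets[index] = len(log)  # remember how far we have read
        return records


topic = ToyTopic(num_partitions=2)
topic.produce("order-1", "created")
topic.produce("order-2", "created")
topic.produce("order-1", "paid")

consumer = ToyConsumer(topic)
first = consumer.poll()   # all three records
second = consumer.poll()  # nothing new since the last poll
```

The point of the sketch is the offset bookkeeping: records stay in the log after being read, and each consumer decides where to resume.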

Course Structure

Phase 1: Foundation in Stream Processing (Lectures 1-7)

Build a solid understanding of data streaming concepts and Kafka fundamentals to prepare for the final project.

Phase 2: Hands-on Project Work (Lectures 8-9)

Practical application of concepts learned in Phase 1 with interactive project development sessions.

Phase 3: Advanced Topics (Lectures 10-14)

Explore deeper areas of data streaming and present final projects.

Detailed Schedule

Week	Date	Topic	Covered Material
1	Oct 20	Intro & Data Streaming	Course overview, stream vs batch processing
	Oct 24	Apache Kafka (Pt. 1)	Architecture, setup, topics, producers
2	Oct 27	Apache Kafka (Pt. 2)	Consumers, hands-on demos
	Oct 31	Apache Kafka (Pt. 3) & Data Schemas	FastAPI integration, Schema Registry
3	Nov 3	Data Schemas (Pt. 2)	REST API, Avro, demonstrations
	Nov 7	Kafka Connect & Review	Connectors, course review
4	Nov 10	Midterm Exam & Stream Processing	Exam + stream processing strategies
	Nov 14	ksqlDB (Pt. 1)	Joins, tables, streams
5	Nov 17	ksqlDB (Pt. 2)	Windowing, aggregation, querying
	Nov 21	Course Review	Review & interview tips
6	Nov 24	Thanksgiving Break	No class
	Nov 28	Data Pipeline (Pt. 1)	Airflow concepts and demos
7	Dec 1	Data Pipeline (Pt. 2)	Advanced Airflow, career advice
	Dec 5	Final Presentations	Student project presentations
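
The windowing and aggregation topics scheduled for week 5 can be previewed without any infrastructure. The sketch below counts events in fixed, non-overlapping (tumbling) windows in plain Python, which is roughly the idea behind a tumbling window in ksqlDB. The 60-second window size, the event tuples, and the function name are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # assumed window size for illustration


def tumbling_counts(events: Iterable[Tuple[int, str]]) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts: Counter = Counter()
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window.
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical (timestamp_in_seconds, event_type) pairs.
events = [(5, "checkout"), (42, "checkout"), (61, "checkout"), (75, "search"), (119, "checkout")]
result = tumbling_counts(events)
print(result)
```

A real stream processor would also need to handle late or out-of-order events, which this sketch ignores.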

Prerequisites

Required Knowledge

Statistics: Mean, variance, data visualization, pandas (MSDS 504, MSDS 593)
Python: Classes, objects, command line familiarity
SQL & Data Pipelines: Database fundamentals (MSDS 681 recommended)
Machine Learning: Scikit-learn, model evaluation (MSDS 621, 630, 680 helpful)
Business Communication: Professional written and spoken English

Assignments & Grading

Grade Breakdown

Attendance & Participation: 5%
Individual Assignments (3): 25% total
- Assignment 1: 10%
- Assignment 3: 15%
Final Project: 45% total
- Proposal: 5%
- Written Report: 20%
- Presentation: 20%
Midterm Exam: 25%

Grading Scale

A (90-100): Exceptional understanding - ready for professional application
- A+: 96-100 | A: 93-95 | A-: 90-92
B (80-89): Competent understanding - meets business expectations
- B+: 87-89 | B: 83-86 | B-: 80-82
C (70-79): Basic understanding - room for improvement
- C+: 77-79 | C: 73-76 | C-: 70-72
F (<70): Limited understanding - unacceptable performance
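
The grade breakdown and grading scale above can be combined as in the following sketch. It assumes each category is scored on a 0-100 scale and that the lower bound of each band applies to fractional totals without rounding; neither assumption is stated in the syllabus, and the example scores are made up.

```python
from typing import Dict

# Category weights from the Grade Breakdown.
WEIGHTS: Dict[str, float] = {
    "participation": 0.05,
    "assignments": 0.25,
    "final_project": 0.45,
    "midterm": 0.25,
}

# Lower bounds from the Grading Scale, highest first.
LETTER_CUTOFFS = [
    (96, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
]


def weighted_score(scores: Dict[str, float]) -> float:
    """Weighted total, assuming every category score is on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(score: float) -> str:
    """Map a total to a letter; assumes no rounding at band boundaries."""
    for cutoff, letter in LETTER_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


# Hypothetical student scores.
example = {"participation": 100, "assignments": 88, "final_project": 92, "midterm": 85}
total = weighted_score(example)
print(round(total, 2), letter_grade(total))
```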

Final Project

The capstone individual project allows students to apply course concepts to a real-world streaming data problem.

Project Components

Proposal (Due Nov 13): 2-page proposal with problem statement, implementation plan, and architecture diagram
Written Report (Due Dec 3): 6-page technical report with code submission
Presentation (Dec 5): 8-minute presentation with 4-minute Q&A

Project Requirements

Individual work demonstrating mastery of streaming concepts
Use of Apache Kafka for data ingestion
Real-time data processing and analytics
Peer review component for collaborative learning

Real-World Applications

Students will explore streaming applications across industries:

E-commerce

Real-time order processing
Dynamic pricing and recommendations
Inventory management

Transportation

GPS tracking and route optimization
Ride-sharing matching algorithms
Delivery logistics

Financial Services

Fraud detection
Risk assessment
Trading analytics

IoT & Monitoring

Sensor data processing
Predictive maintenance
Real-time alerting

Learning Outcomes

Upon completion, students will be able to:

Design streaming architectures using Apache Kafka and related technologies
Implement real-time data pipelines for production environments
Develop streaming analytics applications with appropriate tools
Evaluate trade-offs between streaming and batch processing approaches
Communicate technical concepts effectively to business stakeholders
Interview confidently for data engineering and streaming-focused roles

Course Resources

Required Texts

Kafka: The Definitive Guide (2nd Edition) - Shapira, Palino, Sivaram, Petty
Kafka in Action - Scott, Gamov, Klein

Online Resources

Tools & Platforms

Confluent Cloud (managed Kafka)
Python 3.8+
FastAPI
Docker (recommended)
Git/GitHub

Academic Policies

Collaboration Policy

Group discussions encouraged for concept understanding
All submitted work must be individual and original
AI tools (ChatGPT, etc.) permitted with proper attribution
Zero tolerance for plagiarism or cheating

Attendance Policy

Mandatory attendance for all lectures
Laptops closed unless instructed otherwise
Advance notice required for absences
No distractions (phones, etc.) during class

Instructor Information

Jeremy Gu
Email: wgu9@usfca.edu
Office Hours: Wednesday 8:45-9:30am (Virtual)

Background: Senior data science leader with 10+ years of experience at Uber, Amazon, and Shipt. Former CEO/founder of an AI EdTech startup. Teaches graduate and executive courses at Stanford and USF.

Getting Started

Technical Setup

Install Python 3.8+ and pip
Set up development environment (VS Code recommended)
Create Confluent Cloud account (free tier available)
Clone course repository
Install required packages: pip install -r requirements.txt

First Steps

Review course materials and schedule
Join Piazza for Q&A and discussions
Complete environment setup
Attend first lecture prepared with questions

Course Development

This course uses MkDocs Material for documentation:

python -m venv venv
source venv/bin/activate  # or .\venv\Scripts\Activate.ps1 on Windows
pip install mkdocs-material
pip install -r requirements.txt
mkdocs serve

Visit locally at http://127.0.0.1:8000/

Career Preparation

This course prepares students for roles including:

Data Engineer - Building and maintaining data pipelines
Machine Learning Engineer - Real-time ML model deployment
Data Scientist - Streaming analytics and insights
Software Engineer - Event-driven microservices
Data Architect - Designing scalable data systems

The course emphasizes practical skills, interview preparation, and professional communication to ensure career readiness in the rapidly growing field of real-time data processing.

"One day, you'll look back and thank your current self for the dedication and hard work you're investing now." - Course Philosophy
