Fall 2023/2024 | University of San Francisco | 2 Units
Author: Jeremy Gu (wgu9@usfca.edu or personal email jeremygu888@gmail.com)
- 8/30/2024: Updated course materials for USF
- 9/18/2023: Created a place to publish course notes and code
This hands-on course equips students with the essential skills for processing and analyzing real-time data streams at scale using modern data engineering tools and technologies. Students will gain practical experience with Apache Kafka and other streaming technologies while building foundational knowledge through real-world examples and projects.
- Core Stream Processing Concepts: Understanding the fundamentals of real-time data processing vs. batch processing
- Apache Kafka Ecosystem: Master Kafka topics, producers, consumers, and the broader ecosystem
- Real-time Analytics: Build streaming applications that process data as it arrives
- Industry Tools: Hands-on experience with Confluent Cloud, ksqlDB, Faust, and more
- Professional Skills: Project presentations, technical communication, and interview preparation
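
As a small illustration of the stream-versus-batch distinction named in the first bullet, the sketch below computes the same mean two ways. It is plain Python rather than Kafka code, and the function names and sample readings are hypothetical, chosen only for this example.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch style: the full dataset must be available before computing."""
    return sum(values) / len(values)


def streaming_mean(events: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated mean after every incoming event."""
    count = 0
    mean = 0.0
    for value in events:
        count += 1
        # Incremental update, so past events never need to be stored.
        mean += (value - mean) / count
        yield mean


# Hypothetical sensor readings arriving one at a time.
readings = [10.0, 20.0, 30.0, 40.0]
running = list(streaming_mean(readings))
print(running)               # a result is available after each event
print(batch_mean(readings))  # a single result once all data is present
```

The streaming version keeps only a count and the current mean, which is the property that lets this style of computation run over an unbounded stream.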
- Apache Kafka - Distributed streaming platform
- Confluent Cloud - Managed Kafka service
- ksqlDB - Stream processing with SQL
- Faust - Python stream processing library
- FastAPI - Modern web framework for APIs
- Schema Registry - Data schema management
- Kafka Connect - Data integration framework
Build a solid understanding of data streaming concepts and Kafka fundamentals to prepare for the final project.
Practical application of concepts learned in Phase 1 with interactive project development sessions.
Explore deeper areas of data streaming and present final projects.
| Week | Date | Topic | Covered Material |
|---|---|---|---|
| 1 | Oct 20 | Intro & Data Streaming | Course overview, stream vs batch processing |
| 1 | Oct 24 | Apache Kafka (Pt. 1) | Architecture, setup, topics, producers |
| 2 | Oct 27 | Apache Kafka (Pt. 2) | Consumers, hands-on demos |
| 2 | Oct 31 | Apache Kafka (Pt. 3) & Data Schemas | FastAPI integration, Schema Registry |
| 3 | Nov 3 | Data Schemas (Pt. 2) | REST API, Avro, demonstrations |
| 3 | Nov 7 | Kafka Connect & Review | Connectors, course review |
| 4 | Nov 10 | Midterm Exam & Stream Processing | Exam + stream processing strategies |
| 4 | Nov 14 | ksqlDB (Pt. 1) | Joins, tables, streams |
| 5 | Nov 17 | ksqlDB (Pt. 2) | Windowing, aggregation, querying |
| 5 | Nov 21 | Course Review | Review & interview tips |
| 6 | Nov 24 | Thanksgiving Break | No class |
| 6 | Nov 28 | Data Pipeline (Pt. 1) | Airflow concepts and demos |
| 7 | Dec 1 | Data Pipeline (Pt. 2) | Advanced Airflow, career advice |
| 7 | Dec 5 | Final Presentations | Student project presentations |
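
The windowing and aggregation topic in Week 5 can be previewed with a minimal sketch. This is not ksqlDB itself (where a tumbling window is declared in SQL); it is a plain-Python approximation of what a tumbling-window count computes. The event tuples, key names, and the `tumbling_window_counts` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]], window_size_s: int
) -> Dict[Tuple[int, str], int]:
    """Count events per key in fixed, non-overlapping time windows.

    Each event is (timestamp_in_seconds, key). The result maps
    (window_start, key) to the number of events in that window.
    """
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        # Every timestamp belongs to exactly one window.
        window_start = (timestamp // window_size_s) * window_size_s
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical ride-request events: (seconds since start, rider id).
events = [
    (0, "rider_a"),
    (12, "rider_a"),
    (59, "rider_b"),
    (60, "rider_a"),
    (61, "rider_b"),
]
counts = tumbling_window_counts(events, window_size_s=60)
print(counts)
```

A real stream processor also has to handle late and out-of-order events, which this sketch ignores.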
- Statistics: Mean, variance, data visualization, pandas (MSDS 504, MSDS 593)
- Python: Classes, objects, command line familiarity
- SQL & Data Pipelines: Database fundamentals (MSDS 681 recommended)
- Machine Learning: Scikit-learn, model evaluation (MSDS 621, 630, 680 helpful)
- Business Communication: Professional written and spoken English
- Attendance & Participation: 5%
- Individual Assignments (3): 25% total
- Assignment 1: 10%
- Assignment 3: 15%
- Final Project: 45% total
- Proposal: 5%
- Written Report: 20%
- Presentation: 20%
- Midterm Exam: 25%
- A (90-100): Exceptional understanding - ready for professional application
- A+: 96-100 | A: 93-95 | A-: 90-92
- B (80-89): Competent understanding - meets business expectations
- B+: 87-89 | B: 83-86 | B-: 80-82
- C (70-79): Basic understanding - room for improvement
- C+: 77-79 | C: 73-76 | C-: 70-72
- F (<70): Limited understanding - unacceptable performance
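
The weights and letter bands above can be combined into a short worked example. This is only a sketch: the component scores are made up, and because the bands are listed as whole-number ranges, the handling of fractional scores (each cutoff treated as a lower bound) is an assumption rather than stated course policy.

```python
from typing import Dict

# Component weights in percent, taken from the grading breakdown.
WEIGHTS_PCT: Dict[str, int] = {
    "attendance": 5,
    "assignments": 25,
    "final_project": 45,
    "midterm": 25,
}

# Lower bound of each letter band, highest first.
# Assumption: a fractional score falls into the band whose cutoff it meets.
LETTER_CUTOFFS = [
    (96, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
]


def weighted_score(scores: Dict[str, float]) -> float:
    """Combine per-component scores (0-100) into a final score (0-100)."""
    return sum(scores[name] * weight for name, weight in WEIGHTS_PCT.items()) / 100


def letter_grade(score: float) -> str:
    """Map a final score to a letter using the bands listed above."""
    for cutoff, letter in LETTER_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


# Hypothetical student scores for each component.
example = {"attendance": 100, "assignments": 90, "final_project": 85, "midterm": 80}
total = weighted_score(example)
print(total, letter_grade(total))
```

Here the final project dominates the result: 45% of the total comes from the proposal, report, and presentation together.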
The capstone individual project allows students to apply course concepts to a real-world streaming data problem.
- Proposal (Due Nov 13): 2-page proposal with problem statement, implementation plan, and architecture diagram
- Written Report (Due Dec 3): 6-page technical report with code submission
- Presentation (Dec 5): 8-minute presentation with 4-minute Q&A
- Individual work demonstrating mastery of streaming concepts
- Use of Apache Kafka for data ingestion
- Real-time data processing and analytics
- Peer review component for collaborative learning
Students will explore streaming applications across industries:
- Real-time order processing
- Dynamic pricing and recommendations
- Inventory management
- GPS tracking and route optimization
- Ride-sharing matching algorithms
- Delivery logistics
- Fraud detection
- Risk assessment
- Trading analytics
- Sensor data processing
- Predictive maintenance
- Real-time alerting
Upon completion, students will be able to:
- Design streaming architectures using Apache Kafka and related technologies
- Implement real-time data pipelines for production environments
- Develop streaming analytics applications with appropriate tools
- Evaluate trade-offs between streaming and batch processing approaches
- Communicate technical concepts effectively to business stakeholders
- Interview confidently for data engineering and streaming-focused roles
- Kafka: The Definitive Guide (2nd Edition) - Shapira, Palino, Sivaram, Petty
- Kafka in Action - Scott, Gamov, Klein
- Confluent Cloud (managed Kafka)
- Python 3.8+
- FastAPI
- Docker (recommended)
- Git/GitHub
- Group discussions encouraged for concept understanding
- All submitted work must be individual and original
- AI tools (ChatGPT, etc.) permitted with proper attribution
- Zero tolerance for plagiarism or cheating
- Mandatory attendance for all lectures
- Laptops closed unless instructed otherwise
- Advance notice required for absences
- No distractions (phones, etc.) during class
Jeremy Gu
Email: wgu9@usfca.edu
Office Hours: Wednesday 8:45-9:30am (Virtual)
Background: Senior data science leader with 10+ years of experience at Uber, Amazon, and Shipt. Former CEO/founder of an AI EdTech startup. Teaches graduate and executive courses at Stanford and USF.
- Install Python 3.8+ and pip
- Set up development environment (VS Code recommended)
- Create Confluent Cloud account (free tier available)
- Clone course repository
- Install required packages: `pip install -r requirements.txt`
- Review course materials and schedule
- Join Piazza for Q&A and discussions
- Complete environment setup
- Attend first lecture prepared with questions
This course uses MkDocs Material for documentation:

    python -m venv venv
    source venv/bin/activate  # or .\venv\Scripts\Activate.ps1 on Windows
    pip install mkdocs-material
    pip install -r requirements.txt
    mkdocs serve

Visit locally at http://127.0.0.1:8000/
This course prepares students for roles including:
- Data Engineer - Building and maintaining data pipelines
- Machine Learning Engineer - Real-time ML model deployment
- Data Scientist - Streaming analytics and insights
- Software Engineer - Event-driven microservices
- Data Architect - Designing scalable data systems
The course emphasizes practical skills, interview preparation, and professional communication to ensure career readiness in the rapidly growing field of real-time data processing.
"One day, you'll look back and thank your current self for the dedication and hard work you're investing now." - Course Philosophy