EECS 4415 - Big Data Systems

About

Course Description

Storing, managing, and processing datasets are foundational to both computer science and data science. The enormous size of today's data sets and the specific requirements of modern applications have necessitated a new generation of data management systems that emphasize distributed and fault-tolerant processing. New programming paradigms have evolved, an abundance of information platforms offering data management and analysis solutions has appeared, and a number of novel methods and tools have been developed. This course introduces the fundamentals of big data storage, retrieval, and processing systems. As these fundamentals are introduced, representative technologies are used to illustrate how big data systems can leverage very large data sets that become available through multiple sources and are characterized by diverse levels of volume (terabytes; billions of records), velocity (batch; real-time; streaming), and variety (structured; semi-structured; unstructured). The course aims to provide students with both theoretical knowledge and practical experience of the field by covering recent research on big data systems and their basic properties. Students consider both small and large datasets because both are equally important and justify different trade-offs. Topics include: software frameworks for distributed storage and processing of very large data sets, the MapReduce programming model, querying of structured data sets, column stores, key-value stores, document stores, graph databases, and distributed stream processing frameworks.

Topics

data-driven organizations
data ingestion
data quality
data storage (data lakes, RDBMS, columnar DBMS, NoSQL, HDFS, Key-Value stores, object storage)
data definition (CAP theorem, schema-on-read, schema-on-write)
big data analytics architectures
batch processing
interactive query processing
data stream processing
unified processing engines
tools/systems for data analytics and visualization (examples: OpenRefine, Apache Hadoop/MapReduce, Google BigTable/BigQuery, Twitter Storm/Heron, Apache Spark)

Lectures & Office Hours

Lectures: Tue and Thu, 16:00-19:00 (Online)

Office Hours: Tue, 13:00-14:00 (Online)

Team

Manos Papagelis (papaggel@gmail.com)

Tilemachos Pechlivanoglou

Textbooks

The course will rely mainly on the following textbooks.

Mining of Massive Datasets, 2nd Edition by Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman (freely available online)

Syllabus

Download the syllabus (v1.0)

Handouts

Lecture 1. Introduction [Slides]

Introduction, administrivia.

Readings:

Mining of Massive Datasets, 2nd Edition (chapter 1)
Introduction to Data Mining (chapter 1)

Lecture 2. Data-driven Organizations [Slides]

Data-driven organizations, DDO solutions reference model.

Readings:

Jeff Hammerbacher. Information Platforms and the Rise of the Data Scientist. Beautiful Data, 73-84, 2009. (local copy)
Hilary Mason, DJ Patil. Data Driven: Creating a Data Culture. O'Reilly Media, 2015. (local copy)

Lecture 3. Data Ingestion and Data Quality [Slides]

Data ingestion, ETL, data quality, data quality reference model, record linkage, entity resolution, string similarity, data quality scaling issues.

Readings:

Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios. Duplicate Record Detection: A Survey. IEEE TKDE, 2007. (local copy)
Erhard Rahm and Hong Hai Do. Data Cleaning: Problems and Current Approaches. IEEE Data Eng. Bull. 23.4, 2000. (local copy)
Richard Y. Wang and Diane M. Strong. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 1996. (local copy)
Vassiliadis, Panos. A survey of Extract–Transform–Load technology. International Journal of Data Warehousing and Mining (IJDWM) 5.3 (2009): 1-27. (local copy)
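
As an illustration of the string similarity and record linkage topics in this lecture, the following is a minimal sketch, not the method used in the course materials. It assumes Jaccard similarity over character bigrams as the similarity measure; the function names and the 0.6 threshold are hypothetical choices made for this example.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Jaccard similarity |A & B| / |A | B| over the n-gram sets of a and b."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def likely_duplicates(records, threshold=0.6):
    """Compare every pair of records and return the pairs at or above the threshold.

    The threshold is an assumed value for illustration, not a recommended setting.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = jaccard_similarity(records[i], records[j])
            if score >= threshold:
                pairs.append((records[i], records[j], round(score, 2)))
    return pairs


if __name__ == "__main__":
    print(likely_duplicates(["John Smith", "john  smith", "Jane Doe"]))
```

The all-pairs loop performs a number of comparisons that grows quadratically with the number of records, which is one reason data quality tasks raise scaling issues on large datasets.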

Lecture 4. Computing Platforms and Storage Systems [Slides]

Computing platforms, single-node computing, parallel computing, cluster computing, grid computing, data storage, data warehouse model, data lakes, data storage systems, relational DBMS, columnar DBMS, NoSQL, HDFS, Key-Value stores, object storage, software defined storage, CAP theorem, moving large data, data definition, schema-on-read, schema-on-write, big data analytics architectures, lambda architecture, kappa architecture.

Readings:

Marz, N., & Warren, J. (2015). Big data: Principles and best practices of scalable real-time data systems. Manning Publications Co. Sections 1.4–1.10.
Proper, H. A. (1997). Data schema design as a schema evolution process. Data & Knowledge Engineering, 22(2).
Allen, B., Bresnahan, J., Childers, L., Foster, I., et al. (2012). Software as a service for data scientists. Communications of the ACM, 55(2).
Ghemawat, S., Gobioff, H., & Leung, S. (2003). The Google file system. SOSP'03. (local copy)
Kreps, J., Narkhede, N., & Rao, J. (2011). Kafka: a Distributed Messaging System for Log Processing. NetDB 2011. (local copy)
Stonebraker M., et al. (2005). C-Store: A Column-oriented DBMS. VLDB. (local copy)
Chaudhuri S. (1998). An Overview of Query Optimization in Relational Systems (PODS tutorial).
Goetz Graefe (1993). Query Evaluation Techniques for Large Databases (ACM survey)
Kreps, J. (2013). The log: What every software engineer should know about real-time data's unifying abstraction. LinkedIn blog.
Krishna, S., & Tse, E. (2013). Hadoop platform as a service in the cloud. Netflix blog.

Lecture 5. Processing Systems - Batch Processing [Slides]

Batch processing, Hadoop MapReduce.

Readings:

Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. The Google file system. SIGOPS, 2003.
Dean, Jeffrey, and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. OSDI, 2004. (local copy)
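
To make the MapReduce programming model concrete, the following is a minimal single-process sketch of the classic word count example. It is not the Hadoop API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are hypothetical, and the shuffle step that a real framework performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data systems", "big data"]))
```

In a distributed setting, map tasks run in parallel on separate input splits and each reduce task receives all values for its assigned keys; the sketch keeps only that logical structure.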

Lecture 6. Processing Systems - Structured Data (Dremel/BigQuery) [Slides]

Structured data processing, interactive query processing, Google Dremel/BigQuery.

Readings:

Chang, Fay, et al. Bigtable: A distributed storage system for structured data. ACM TOCS, 2008.
Melnik, Sergey, et al. Dremel: interactive analysis of web-scale datasets. VLDB, 2010. (local copy)

Lecture 7. Processing Systems - Streaming Data (Twitter Storm/Heron) [Slides]

Data stream processing, Twitter/Apache Storm, Twitter Heron.

Readings:

Toshniwal, Ankit, et al. Storm@Twitter. SIGMOD, 2014. (local copy)
Kulkarni, Sanjeev, et al. Twitter Heron: Stream processing at scale. SIGMOD, 2015.

Lecture 8. Processing Systems - Unified Engine (Apache Spark) [Slides]

Unified processing engines (Spark), Resilient Distributed Datasets (RDDs).

Readings:

M Zaharia et al. Spark: Cluster computing with working sets. HotCloud 10, 2010. (local copy)
Matei Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. USENIX conference on NSDI, 2012.
Xiangrui Meng et al. MLlib: Machine learning in Apache Spark. JMLR, 2016.
Michael Armbrust et al. Spark SQL: Relational data processing in Spark. SIGMOD, 2015.

Lecture 9. Serving Data [Slides]

Analytics reporting, business intelligence (BI) tools, OLAP cube, cuboids, ROLAP, MOLAP, in-application/real-time analytics, serving at scale.

Readings:

Chaudhuri, Surajit, and Umeshwar Dayal. An overview of data warehousing and OLAP technology. ACM Sigmod record 26.1 (1997): 65-74.
Gray, Jim, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data mining and knowledge discovery 1.1 (1997): 29-53.
Sathe, Gayatri, and Sunita Sarawagi. Intelligent rollups in multidimensional OLAP data. VLDB. Vol. 1. 2001.
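
The OLAP cube and cuboid concepts from this lecture can be sketched in a few lines. The example below is an illustrative assumption, not code from the course: it computes a SUM aggregate for every subset of the dimensions, so that each subset corresponds to one cuboid. The function name `data_cube` and the sample column names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def data_cube(rows, dimensions, measure):
    """Compute SUM(measure) for every subset of dimensions (one cuboid per subset).

    Returns a dict mapping a tuple of dimension names to a dict that maps
    a tuple of dimension values to the aggregated total.
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


if __name__ == "__main__":
    sales = [
        {"region": "East", "product": "A", "sales": 10},
        {"region": "East", "product": "B", "sales": 5},
        {"region": "West", "product": "A", "sales": 7},
    ]
    cube = data_cube(sales, ["region", "product"], "sales")
    for dims, totals in cube.items():
        print(dims, totals)
```

With n dimensions there are 2^n cuboids, from the grand total (the empty tuple of dimensions) to the full group-by over all dimensions. This sketch rescans the rows for each cuboid; practical systems usually derive coarser cuboids from finer ones instead.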

Lecture 10. Course Review [Slides]

Comprehensive course review.

Lecture 11. NoSQL [Slides]

Structured, unstructured, and semi-structured data; what is NoSQL; NoSQL taxonomy.

Assignments

Tutorials

Tutorial 1: Python. Introduction to the Python programming language.
Tutorial 2: Docker & Python. Introduction to the Docker platform; running Python scripts using Docker containers.
Tutorial 3: TF-IDF. Explanation of the term frequency–inverse document frequency metric (TF-IDF).
Tutorial 4: HDFS. Introduction to the Hadoop Distributed File System (HDFS).
Tutorial 5: MapReduce. Introduction to the MapReduce programming model.
Tutorial 6: MapReduce in Docker. Introduction to running MapReduce in Docker.
Tutorial 7: The Hadoop Ecosystem. Introduction to significant tools/systems that form the Hadoop ecosystem.
Tutorial 8: Apache Spark. Introduction to Apache Spark (sample Twitter app code). See also an e-book about Apache Spark (local copy).
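
For the TF-IDF metric covered in Tutorial 3, the following is a minimal sketch under stated assumptions. Several TF-IDF variants exist; this one assumes tf = (term count) / (document length) and idf = ln(N / df), where N is the number of documents and df is the number of documents containing the term. The tutorial itself may use a different variant, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = ["big data systems", "big data", "stream processing"]
    for doc, scores in zip(docs, tf_idf(docs)):
        print(doc, scores)
```

Under this variant a term that appears in every document gets a weight of zero, since ln(N / N) = 0, which reflects the intuition that such a term does not help distinguish one document from another.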

Project

There will be no separate project this term. An open task/small project will be incorporated into the last assignment.

Resources

Datasets

Online resources of data.

Online resources of network data.

Software Tools and Libraries

Data cleansing/wrangling

OpenRefine (formerly Google Refine): Standalone open source desktop application for data cleanup and transformation to other formats
Google Cloud DataPrep: Data preparation and data cleansing

Graph/network analysis

SNAP: Library for working with massive network datasets (C++, Python)
NetworkX: Library for studying graphs and networks (Python)
JUNG: Library for modeling, analysis, and visualization of graphs (Java)
Metis: Family of programs for partitioning graphs

Graph/network exploration and visualization

Pajek: Program for large network analysis and visualization
Gephi: Program for graph visualization and exploration

Data Visualization

The Data Visualisation Catalogue: Reference library for different types of data visualisations
Highcharts: Libraries for aesthetically pleasing standard charts
Tableau: Software for exploratory data analysis
D3: Software for interactive visualizations
