Welcome to
EECS4415 Big Data Systems



About

Course Description

Storing, managing, and processing datasets are foundational to both computer science and data science. The enormous size of today's datasets and the specific requirements of modern applications have necessitated a new generation of data management systems that emphasize distributed and fault-tolerant processing. New programming paradigms have evolved, an abundance of platforms offering data management and analysis solutions has appeared, and a number of novel methods and tools have been developed. This course introduces the fundamentals of big data storage, retrieval, and processing systems. As these fundamentals are introduced, exemplary technologies are used to illustrate how big data systems can leverage very large datasets that become available through multiple sources and are characterized by diverse levels of volume (terabytes; billions of records), velocity (batch; real-time; streaming), and variety (structured; semi-structured; unstructured). The course aims to provide students with both theoretical knowledge and practical experience of the field by covering recent research on big data systems and their basic properties. Students consider both small and large datasets because both are equally important and justify different trade-offs. Topics include: software frameworks for distributed storage and processing of very large datasets, the MapReduce programming model, querying of structured datasets, column stores, key-value stores, document stores, graph databases, and distributed stream processing frameworks.

Topics
  • data-driven organizations
  • data ingestion
  • data quality
  • data storage (data lakes, RDBMS, columnar DBMS, NoSQL, HDFS, Key-Value stores, object storage)
  • data definition (CAP theorem, schema-on-read, schema-on-write)
  • big data analytics architectures
  • batch processing
  • interactive query processing
  • data stream processing
  • unified processing engines
  • tools/systems for data analytics and visualization (examples: OpenRefine, Apache Hadoop/MapReduce, Google BigTable/BigQuery, Twitter Storm/Heron, Apache Spark)
Lectures & Office Hours

Lectures: Wed, 17:30-20:30 at VH 3006 (Vari Hall)

Office Hours: Wed, 12:00-13:00 or by appointment (LAS3050, Lassonde building)

Team

Manos Papagelis (papaggel@gmail.com)

Tilemachos Pechlivanoglou

Textbooks

The course will rely mainly on the following textbooks.

Syllabus

Download the syllabus (v1.0)

Handouts

Lecture 1. Introduction [Slides]

Introduction, administrivia.

Readings:

Lecture 2. Data-driven Organizations [Slides]

Data-driven organizations, DDO solutions reference model.

Readings:

Lecture 3. Data Ingestion and Data Quality [Slides]

Data ingestion, ETL, data quality, data quality reference model, record linkage, entity resolution, string similarity, data quality scaling issues.
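As an illustration of the string similarity topic, here is a minimal sketch of edit-distance-based similarity of the kind often used as a building block in record linkage. It is not taken from the lecture slides; the function names `levenshtein` and `similarity`, and the normalization by the longer string's length, are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only one row of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Comparing every pair of records this way costs quadratic time in the number of records, which is one way to see why data quality scaling issues appear alongside this topic.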

Readings:

Lecture 4. Computing Platforms and Storage Systems [Slides]

Computing platforms, single-node computing, parallel computing, cluster computing, grid computing, data storage, data warehouse model, data lakes, data storage systems, relational DBMS, columnar DBMS, NoSQL, HDFS, Key-Value stores, object storage, software defined storage, CAP theorem, moving large data, data definition, schema-on-read, schema-on-write, big data analytics architectures, lambda architecture, kappa architecture.

Readings:

  • Marz, N., & Warren, J. (2015). Big data: Principles and best practices of scalable real-time data systems. Manning Publications Co. Sections 1.4–1.10.
  • Proper, H. A. (1997). Data schema design as a schema evolution process. Data & Knowledge Engineering, 22(2).
  • Allen, B., Bresnahan, J., Childers, L., Foster, I., et al. (2012). Software as a service for data scientists. Communications of the ACM, 55(2).
  • Ghemawat, S., Gobioff, H., & Leung, S. (2003). The Google file system. SOSP'03. (local copy)
  • Kreps, J., Narkhede, N., & Rao, J. (2011). Kafka: a Distributed Messaging System for Log Processing. NetDB 2011. (local copy)
  • Stonebraker M., et al. (2005). C-Store: A Column-oriented DBMS. VLDB. (local copy)
  • Chaudhuri S. (1998). An Overview of Query Optimization in Relational Systems (PODS tutorial).
  • Goetz Graefe (1993). Query Evaluation Techniques for Large Databases (ACM survey)
  • Kreps, J. (2013). The log: What every software engineer should know about real-time data's unifying abstraction. LinkedIn blog.
  • Krishna, S., & Tse, E. (2013). Hadoop platform as a service in the cloud. Netflix blog.
Lecture 5. Processing Systems - Batch Processing [Slides]

Batch processing, Hadoop MapReduce.
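To make the MapReduce programming model concrete, the following is a single-process sketch of word count split into map, shuffle, and reduce steps. It only imitates the structure of the model in plain Python; it does not use Hadoop, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; here everything runs in one process so that only the data flow is visible.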

Readings:

Lecture 6. Processing Systems - Structured Data (Dremel/BigQuery) [Slides]

Structured data processing, interactive query processing, Google Dremel/BigQuery.

Readings:

Lecture 7. Processing Systems - Streaming Data (Twitter Storm/Heron) [Slides]

Data stream processing, Twitter/Apache Storm, Twitter Heron.

Readings:

  • Toshniwal, Ankit, et al. Storm@Twitter. SIGMOD, 2014. (local copy)
  • Kulkarni, Sanjeev, et al. Twitter Heron: Stream processing at scale. SIGMOD, 2015.
Lecture 8. Processing Systems - Unified Engine (Apache Spark) [Slides]

Unified processing engines (Spark), Resilient Distributed Datasets (RDDs).

Readings:

  • M Zaharia et al. Spark: Cluster computing with working sets. HotCloud 10, 2010. (local copy)
  • Matei Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. USENIX conference on NSDI, 2012.
  • Xiangrui Meng et al. MLlib: Machine learning in Apache Spark. JMLR, 2016.
  • Michael Armbrust et al. Spark SQL: Relational data processing in Spark. SIGMOD, 2015.
Lecture 9. Serving Data [Slides]

Analytics reporting, business intelligence (BI) tools, OLAP cube, cuboids, ROLAP, MOLAP, in-application/real-time analytics, serving at scale.
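As a rough illustration of the OLAP cube and its cuboids, the sketch below computes one aggregate table for every subset of the dimensions, which is the idea behind generalizing group-by to a cube. It is a naive in-memory version that rescans the rows for each cuboid; the function name `data_cube`, the dictionary-based row format, and the choice of sum as the aggregate are assumptions for this example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def data_cube(rows: List[dict], dimensions: Sequence[str],
              measure: str) -> Dict[Tuple[str, ...], Dict[tuple, int]]:
    """Sum `measure` for every subset of `dimensions` (one cuboid each)."""
    cube: Dict[Tuple[str, ...], Dict[tuple, int]] = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals: Dict[tuple, int] = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)  # group-by key for this cuboid
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


# Hypothetical sample data, invented for this example.
sales = [
    {"region": "east", "year": 2018, "sales": 10},
    {"region": "east", "year": 2019, "sales": 5},
    {"region": "west", "year": 2018, "sales": 7},
]
cube = data_cube(sales, ["region", "year"], "sales")
```

With two dimensions the result holds four cuboids: the grand total under the empty key `()`, one table per single dimension, and the full `("region", "year")` table. The number of cuboids doubles with each added dimension, which is why precomputing and storing them is a design trade-off in its own right.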

Readings:

  • Chaudhuri, Surajit, and Umeshwar Dayal. An overview of data warehousing and OLAP technology. ACM Sigmod record 26.1 (1997): 65-74.
  • Gray, Jim, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data mining and knowledge discovery 1.1 (1997): 29-53.
  • Sathe, Gayatri, and Sunita Sarawagi. Intelligent rollups in multidimensional OLAP data. VLDB. Vol. 1. 2001.
Lecture 10. Course Review [Slides]

Comprehensive course review.

Assignments

Tutorials

Project

Project handouts:

Resources

Software Tools and Libraries

Data cleansing/wrangling

Graph/network analysis

  • SNAP Library for working with massive network datasets (C++, Python)
  • NetworkX Library for studying graphs and networks (Python)
  • JUNG Library for modeling, analysis, and visualization of graphs (Java)
  • Metis Family of programs for partitioning graphs

Graph/network exploration and visualization

  • Pajek Program for large network analysis and visualization
  • Gephi Program for graph visualization and exploration

Data Visualization