Storing, managing, and processing datasets are foundational to both computer science and data science. The enormous size of today's datasets and the specific requirements of modern applications have necessitated a new generation of data management systems, where the emphasis is on distributed and fault-tolerant processing. New programming paradigms have evolved, an abundance of information platforms offering data management and analysis solutions has appeared, and a number of novel methods and tools have been developed. This course introduces the fundamentals of big data storage, retrieval, and processing systems. As these fundamentals are introduced, exemplary technologies illustrate how big data systems can leverage very large datasets that become available through multiple sources and are characterized by diverse levels of volume (terabytes; billions of records), velocity (batch; real-time; streaming), and variety (structured; semi-structured; unstructured). The course aims to provide students with both theoretical knowledge and practical experience of the field by covering recent research on big data systems and their basic properties. Students consider both small and large datasets, because the two are equally important and justify different trade-offs. Topics include: software frameworks for distributed storage and processing of very large datasets, the MapReduce programming model, querying of structured datasets, column stores, key-value stores, document stores, graph databases, and distributed stream processing frameworks.
Lectures: Tue and Thu, 16:00-19:00 (Online)
Office Hours: Tue, 13:00-14:00 (Online)
The course will rely mainly on the following textbooks.
Download the syllabus (v1.0)
Introduction, administrivia.
Readings:
Data-driven organizations, DDO solutions reference model.
Readings:
Data ingestion, ETL, data quality, data quality reference model, record linkage, entity resolution, string similarity, data quality scaling issues.
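Record linkage and entity resolution hinge on string similarity measures like the ones listed above. A minimal sketch of one such measure, Jaccard similarity over character bigrams (the choice of bigrams and the example names are illustrative, not part of the course materials):

```python
def bigrams(s):
    """Return the set of character bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Jaccard similarity of the bigram sets of two strings (0.0 to 1.0)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)

# Two records likely referring to the same entity score high;
# unrelated records score low.
print(jaccard("Jon Smith", "John Smith"))  # high similarity
print(jaccard("Jon Smith", "Mary Jones"))  # low similarity
```

At scale, computing all pairwise similarities is quadratic, which is exactly the data quality scaling issue the lecture raises; practical systems use blocking or indexing to avoid comparing every pair.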
Readings:
Computing platforms, single-node computing, parallel computing, cluster computing, grid computing, data storage, data warehouse model, data lakes, data storage systems, relational DBMS, columnar DBMS, NoSQL, HDFS, key-value stores, object storage, software-defined storage, CAP theorem, moving large data, data definition, schema-on-read, schema-on-write, big data analytics architectures, lambda architecture, kappa architecture.
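One contrast from this lecture, schema-on-read versus schema-on-write, can be illustrated in a few lines: raw records are ingested as-is (as in a data lake or HDFS), and a schema is applied only at read time. The field names and defaults below are invented for the example:

```python
import json

raw_store = [                      # ingested without validation (schema-on-read)
    '{"user": "alice", "age": 31}',
    '{"user": "bob"}',             # missing field: accepted at write time
]

def read_with_schema(raw):
    """Apply a schema at read time, filling defaults for missing fields."""
    rec = json.loads(raw)
    return {"user": rec.get("user", "unknown"), "age": rec.get("age", None)}

records = [read_with_schema(r) for r in raw_store]
print(records)
```

A schema-on-write system (e.g., a relational DBMS) would instead reject the second record at ingestion time unless it satisfied the declared schema.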
Readings:
Batch processing, Hadoop MapReduce.
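The MapReduce model can be previewed with a toy, single-process word count, assuming the classic map, shuffle, and reduce phases; a real Hadoop job distributes these same phases across a cluster:

```python
from collections import defaultdict

def map_phase(doc):
    # Emit (word, 1) pairs, like a Hadoop Mapper.
    for word in doc.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group values by key, like the framework's shuffle/sort step.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts per word, like a Hadoop Reducer.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data systems", "big data storage"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'systems': 1, 'storage': 1}
```

The key property is that map and reduce operate on independent keys, which is what lets the framework parallelize them and re-run failed tasks.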
Readings:
Structured data processing, Interactive query processing, Google Dremel/BigQuery.
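The flavor of declarative querying over structured data can be seen with Python's built-in sqlite3. Dremel/BigQuery execute similar scan-and-aggregate queries over columnar storage at massive scale; the table and values here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, kind TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("alice", "read", 120), ("bob", "write", 300), ("alice", "write", 80)],
)

# A per-user aggregation: the kind of query Dremel parallelizes
# across thousands of nodes.
rows = conn.execute(
    "SELECT user, SUM(bytes) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('alice', 200), ('bob', 300)]
```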
Readings:
Data stream processing, Twitter/Apache Storm, Twitter Heron.
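A core stream-processing idea is windowed aggregation over an unbounded stream. A toy sketch of tumbling-window counts, the kind of operator a Storm/Heron topology runs continuously (the events and window size are illustrative):

```python
from collections import Counter

def tumbling_window_counts(events, window_size):
    """Yield a Counter per window; events are processed one at a time."""
    seen, counts = 0, Counter()
    for event in events:
        counts[event] += 1
        seen += 1
        if seen == window_size:      # window is full: emit and reset
            yield counts
            seen, counts = 0, Counter()
    if seen:                         # emit the final partial window
        yield counts

stream = ["click", "view", "click", "view", "view", "click", "click"]
for counts in tumbling_window_counts(stream, window_size=3):
    print(dict(counts))
```

Unlike batch processing, the input never "ends": results must be emitted incrementally, per window, rather than after a full pass over the data.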
Readings:
Unified processing engines (Spark), Resilient Distributed Dataset (RDDs).
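The central RDD idea, lazy transformations recorded as a lineage and evaluated only when an action runs, can be mimicked in plain Python. This toy class only imitates the shape of the Spark API; real RDDs are partitioned across a cluster and recomputed from their lineage after failures:

```python
class ToyRDD:
    def __init__(self, data, ops=()):
        self._data = data        # the source data
        self._ops = ops          # recorded lineage of transformations

    def map(self, f):            # transformation: lazy, returns a new RDD
        return ToyRDD(self._data, self._ops + (("map", f),))

    def filter(self, f):         # transformation: lazy
        return ToyRDD(self._data, self._ops + (("filter", f),))

    def collect(self):           # action: replays the recorded lineage
        out = list(self._data)
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16]
```

Note that no computation happens until `collect()` is called: `map` and `filter` only extend the lineage, which is also what enables Spark's fault tolerance.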
Readings:
Analytics reporting, business intelligence (BI) tools, OLAP cube, cuboids, ROLAP, MOLAP, in-application/real-time analytics, Serving at-Scale.
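Cuboids are the aggregations of a fact table over every subset of its dimensions, forming the lattice a MOLAP cube materializes. A small sketch with a made-up sales table:

```python
from itertools import combinations
from collections import defaultdict

facts = [  # (region, product, sales)
    ("EU", "book", 10),
    ("EU", "pen", 5),
    ("US", "book", 7),
]
dims = ("region", "product")

def cuboid(group_by):
    """Aggregate sales grouped by the chosen subset of dimensions."""
    idx = {"region": 0, "product": 1}
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[idx[d]] for d in group_by)
        totals[key] += row[2]
    return dict(totals)

# All 2^2 cuboids, from the apex (no dimensions) to the base cuboid.
for k in range(len(dims) + 1):
    for group_by in combinations(dims, k):
        print(group_by, cuboid(group_by))
```

With d dimensions there are 2^d cuboids, which is why MOLAP systems must decide which cuboids to precompute and which to derive on demand (ROLAP computes them from the base table at query time instead).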
Readings:
Comprehensive course review.
Structured, unstructured, semi-structured data, what is NoSQL, NoSQL taxonomy.
There will be no separate project this term. An open task/small project will be incorporated into the last assignment.
Online data resources.
Online network data resources.
Data cleansing/wrangling
Graph/network analysis
Graph/network exploration and visualization
Data visualization