Data analytics and visualization is an emerging discipline of immense importance to any data-driven organization. This project-focused course gives students knowledge of tools for data mining and visualization, and practical experience applying data mining and machine learning algorithms to the analysis of very large amounts of data. It also covers methods and models for communicating data results effectively through data visualization.

- finding similar items
- frequent itemsets
- mining data streams
- clustering
- dimensionality reduction
- link analysis
- mining graphs
- recommendation systems
- value of visualization
- exploratory data analysis
- visualization of multidimensional data
- visualization of networks
- tools/systems for data analytics and visualization (examples: OpenRefine, Apache Hadoop MapReduce, Apache Spark, Twitter Storm/Heron, Tableau, D3, Google BigTable, Google BigPicture)

**Lectures**: Mon 16:00-19:00 at VH 2005 (Vari Hall)

**Office Hours**: Drop by my office or by appointment (LAS3050, Lassonde building)

The course will rely mainly on the following textbooks.

- Mining of Massive Datasets, 2nd Edition by Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman (freely available online)
- Networks, Crowds, and Markets by David Easley, Jon Kleinberg (freely available online)
- Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, Vipin Kumar
- Social Media Mining by Reza Zafarani, Mohammad Ali Abbasi, Huan Liu (freely available online)
- The Visual Display of Quantitative Information, 2nd Edition by Edward Tufte
- Envisioning Information by Edward Tufte
- Interactive Data Visualization for the Web, 2nd Edition by Scott Murray

Download the syllabus (v1.0)

Introduction, administrivia.

Readings:

- Mining of Massive Datasets, 2nd Edition (chapter 1)
- Introduction to Data Mining (chapter 1)

Data-driven organizations, DDO solutions reference model.

Readings:

- Information Platforms and the Rise of the Data Scientist. Jeff Hammerbacher. Beautiful Data, 73-84, 2009.
- Data Driven: Creating a Data Culture. Hilary Mason, DJ Patil. O'Reilly Media, 2015.

Data ingestion, ETL, data quality, data quality reference model, record linkage, entity resolution, string similarity, data quality scaling issues.

Readings:

- Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios. Duplicate Record Detection: A Survey. IEEE TKDE, 2007.
- Erhard Rahm and Hong Hai Do. Data Cleaning: Problems and Current Approaches. IEEE Data Eng. Bull. 23.4, 2000.
- Richard Y. Wang and Diane M. Strong. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 1996.
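
As a small illustration of the string similarity topic in this lecture, the sketch below shows two measures commonly used in record linkage: Levenshtein edit distance and Jaccard similarity over character bigrams. The function names are illustrative choices and are not taken from the readings; this is a minimal sketch, not a production matcher.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic program, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def bigram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the sets of lowercase character bigrams."""
    def bigrams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # convention (an assumption): two bigram-free strings match
    return len(x & y) / len(x | y)
```

Edit distance suits short fields with typos, while set-based measures such as bigram Jaccard are cheaper to index, which matters for the scaling issues mentioned above.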

Computing platforms, single-node computing, parallel computing, cluster computing, grid computing, data storage, data warehouse model, data lakes, data storage systems, relational DBMS, columnar DBMS, NoSQL, HDFS, Key-Value stores, object storage, software defined storage, CAP theorem, moving large data, data definition, schema-on-read, schema-on-write, big data analytics architectures, lambda architecture, kappa architecture.

Readings:

- Marz, N., & Warren, J. (2015). Big data: Principles and best practices of scalable real-time data systems. Manning Publications Co. Sections 1.4–1.10.
- Proper, H. A. (1997). Data schema design as a schema evolution process. Data & Knowledge Engineering, 22(2).
- Allen, B., Bresnahan, J., Childers, L., Foster, I., et al. (2012). Software as a service for data scientists. Communications of the ACM, 55(2).
- Ghemawat, S., Gobioff, H., & Leung, S. (2003). The Google file system. SOSP'03.
- Kreps, J., Narkhede, N., & Rao, J. (2011). Kafka: a Distributed Messaging System for Log Processing. NetDB.
- Stonebraker M., et al. (2005). C-Store: A Column-oriented DBMS. VLDB.
- Chaudhuri S. (1998). An Overview of Query Optimization in Relational Systems (PODS tutorial).
- Goetz Graefe (1993). Query Evaluation Techniques for Large Databases (ACM survey)
- Kreps, J. (2013). The log: What every software engineer should know about real-time data's unifying abstraction. LinkedIn blog.
- Krishna, S., & Tse, E. (2013). Hadoop platform as a service in the cloud. Netflix blog.

Batch processing (Hadoop/MapReduce), interactive query processing (Dremel/BigQuery), data stream processing (Storm/Heron), unified processing engines (Spark).

Additional Slides:

- Dremel (slides): Interactive Analysis of Web-Scale Datasets
- Twitter Storm and Heron (slides): Scalable Streaming Analytics
- Spark (slides): Making Big Data Processing Simple

Readings:

- Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. The Google file system. SIGOPS, 2003.
- Dean, Jeffrey, and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008.
- Chang, Fay, et al. Bigtable: A distributed storage system for structured data. ACM TOCS, 2008.
- Melnik, Sergey, et al. Dremel: interactive analysis of web-scale datasets. VLDB, 2010.
- Toshniwal, Ankit, et al. Storm@twitter. SIGMOD, 2014.
- Kulkarni, Sanjeev, et al. Twitter heron: Stream processing at scale. SIGMOD, 2015.
- M Zaharia et al. Spark: Cluster computing with working sets. HotCloud 10, 2010.
- Matei Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. USENIX conference on NSDI, 2012.
- Xiangrui Meng et al. Mllib: Machine learning in apache spark. JMLR, 2016.
- Michael Armbrust et al. Spark sql: Relational data processing in spark. SIGMOD, 2015.

Association rules, Market-Basket model, frequent itemsets, A-Priori algorithm, PCY algorithm, SON algorithm.

Readings:

- Mining of Massive Datasets (chapter 6)
- Agrawal, Rakesh, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. SIGMOD record, 1993.
- Agrawal, Rakesh, and Ramakrishnan Srikant. Fast algorithms for mining association rules. VLDB, 1994.
- Zaki, Mohammed Javeed, et al. New Algorithms for Fast Discovery of Association Rules. KDD, 1997.
- Han, Jiawei, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. KDD, 2000.
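
The level-wise idea behind A-Priori can be sketched in a few lines: an itemset of size k can only be frequent if all of its subsets of size k-1 are frequent, so each pass counts only candidates built from the previous pass. The code below is a minimal in-memory sketch under that assumption; it ignores the memory-management concerns that motivate PCY and SON, and the function name and example baskets are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def apriori(baskets, min_support):
    """Return {frozenset(itemset): count} for itemsets with count >= min_support."""
    baskets = [frozenset(b) for b in baskets]

    # Pass 1: count single items and keep the frequent ones.
    item_counts = Counter(item for b in baskets for item in b)
    frequent = {frozenset([i]): c for i, c in item_counts.items()
                if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Candidates: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Monotonicity pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        # Pass k: count surviving candidates against the baskets.
        counts = Counter(c for b in baskets for c in candidates if c <= b)
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    example = [{"milk", "bread"}, {"milk", "bread", "beer"}, {"milk", "beer"},
               {"bread", "beer"}, {"milk", "bread", "beer", "eggs"}]
    for itemset, count in sorted(apriori(example, 3).items(),
                                 key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), count)
```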

Finding Similar Items, Shingling, Min-Hashing, Locality-Sensitive Hashing (LSH).

Readings:

- Mining of Massive Datasets (chapter 3)
- Manber, Udi. Finding similar files in a large file system. Usenix Winter. Vol. 94. 1994.
- Broder, Andrei Z. On the resemblance and containment of documents. Compression and Complexity of Sequences, 1997.
- Broder, Andrei Z., et al. Min-wise independent permutations. STOC, 1998.
- Indyk, Piotr, and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. STOC, 1998.
- Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. VLDB Vol. 99, No 6, 1999.
- Andoni, Alexandr, and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. FOCS, 2006.
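
To make the shingling and min-hashing steps concrete, the sketch below builds character k-shingles, computes a MinHash signature, and estimates Jaccard similarity as the fraction of signature positions that agree. It is a minimal sketch under stated assumptions: random linear hash functions modulo a large prime stand in for random permutations, `zlib.crc32` is used only to map shingles to stable integers, input sets are assumed non-empty, and all names are illustrative. The banding step of LSH is not shown.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime, assumed large enough for this sketch


def shingles(text, k=5):
    """Set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def make_hash_params(n, seed=0):
    """n random (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def minhash_signature(items, params):
    """For each hash function, keep the minimum hash value over the (non-empty) set."""
    codes = [zlib.crc32(s.encode("utf-8")) for s in items]
    return [min((a * x + b) % PRIME for x in codes) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    params = make_hash_params(200)
    d1 = shingles("the quick brown fox jumps over the lazy dog")
    d2 = shingles("the quick brown fox jumped over a lazy dog")
    print("exact:", jaccard(d1, d2))
    print("estimate:", estimate_jaccard(minhash_signature(d1, params),
                                        minhash_signature(d2, params)))
```

More hash functions give a tighter estimate at the cost of longer signatures; both signatures must be built with the same `params`.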

High dimensionality clustering, hierarchical clustering (dendrogram, Euclidean vs non-Euclidean cases), the k-means family of algorithms (initialization, picking k), the BFR algorithm, the CURE algorithm.

Readings:

- Mining of Massive Datasets (chapter 7)
- Zhang, Tian, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. SIGMOD, 1996.
- Bradley, Paul S., Usama M. Fayyad, and Cory Reina. Scaling Clustering Algorithms to Large Databases. KDD, 1998.
- Guha, Sudipto, Rajeev Rastogi, and Kyuseok Shim. CURE: an efficient clustering algorithm for large databases. SIGMOD, 1998.
- Ganti, Venkatesh, et al. Clustering large datasets in arbitrary metric spaces. ICDE, 1999.
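
The basic member of the k-means family (Lloyd's iteration) alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its points. The sketch below is a plain in-memory version with naive initialization, shown only to fix ideas; BFR and CURE exist precisely because this approach does not scale to data that does not fit in memory. The function signature, including the optional `init` argument, is an illustrative assumption.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, init=None, max_iters=100, seed=0):
    """Lloyd's algorithm. Returns (centroids, assignment).

    points: list of numeric tuples; init: optional list of k starting centroids.
    Without init, k distinct points are sampled at random (naive initialization).
    """
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for each point.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, assignment
```

For example, `kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2)` returns two centroids and a cluster index per point. The result depends on initialization, which is why the lecture treats initialization and the choice of k as separate topics.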

Anscombe's quartet, Bertin's visual variables, cognition and perception, colors, pre-attentive vs attentive processing, Gestalt principles, visual metaphors, Tufte's principles of graphical excellence, data sculpture.

Readings:

- Tufte, Edward, and P. Graves-Morris. The visual display of quantitative information. 1983. (2014).
- Pandey, Anshul Vikram, et al. How deceptive are deceptive visualizations?: An empirical analysis of common distortion techniques. CHI, 2015.
- Simons, Daniel J., and Christopher F. Chabris. Gorillas in our midst: Sustained inattentional blindness for dynamic events. Perception 28.9 (1999): 1059-1074.
- Simons, Daniel J., and Daniel T. Levin. Failure to detect changes to people during a real-world interaction. Psychonomic Bulletin & Review 5.4 (1998): 644-649.
- Anscombe, Francis J. Graphs in statistical analysis. The American Statistician 27.1 (1973): 17-21.
- Matejka, Justin, and George Fitzmaurice. Same stats, different graphs: Generating datasets with varied appearance and identical statistics through simulated annealing. CHI, 2017.
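
Anscombe's quartet is easy to check numerically. The sketch below computes the mean of x, the mean of y, and the Pearson correlation for each of the four datasets; the values are the commonly reproduced ones from Anscombe (1973), and the helper names are illustrative. The summary statistics come out nearly identical even though the four scatter plots look nothing alike, which is the argument for plotting data before summarizing it.

```python
from statistics import mean

# Datasets I-III share the same x values; dataset IV has its own.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def summary(xs, ys):
    """Summary statistics that are nearly identical across the quartet."""
    return {"mean_x": mean(xs), "mean_y": mean(ys), "r": pearson(xs, ys)}


if __name__ == "__main__":
    for name, (xs, ys) in QUARTET.items():
        s = summary(xs, ys)
        print(f"{name:>3}: mean_x={s['mean_x']:.2f} "
              f"mean_y={s['mean_y']:.2f} r={s['r']:.3f}")
```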

Taxonomy of visualization, visualization of qualitative and quantitative data (comparisons, proportions, relationships, hierarchies, maps, part-to-a-whole, distributions, patterns).

Readings:

- Tufte, Edward, and P. Graves-Morris. The visual display of quantitative information. 1983. (2014).
- Heer, Jeffrey, Michael Bostock, and Vadim Ogievetsky. A tour through the visualization zoo. Queue (8) 5, 2010.
- Heer, Jeffrey, and Ben Shneiderman. Interactive dynamics for visual analysis. Queue (10) 2, 2012.
- Severino Ribecca (catalogue curator). The data visualisation catalogue.

Introduction, administrivia, introduction to main problems about networks, basic mathematical concepts, bow-tie structure of the Web.

Readings:

- Networks, Crowds, and Markets (chapters 1, 2, 13)
- The structure and function of complex networks (by M. E. J. Newman)
- Social Media Mining (chapters 1, 2)

Optional readings:

- A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener. Graph structure in the Web. Computer Networks, 33, 2000.

Degree distributions, shortest paths, clustering coefficient, measuring power-laws.

Readings:

- Networks, Crowds, and Markets (chapter 2)
- The structure and function of complex networks (by M. E. J. Newman)
- Social Media Mining (chapter 3)

Optional readings:

- M. E. J. Newman, Power laws, Pareto distributions and Zipf's law, Contemporary Physics.
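
Two of the measures in this lecture, the degree distribution and the local clustering coefficient, can be computed directly from an adjacency structure. The sketch below assumes a simple undirected graph given as an edge list and uses plain dictionaries of sets; the function names are illustrative, and libraries such as NetworkX or SNAP provide equivalent, better-optimized routines.

```python
from collections import Counter
from itertools import combinations


def build_graph(edges):
    """Adjacency sets for a simple undirected graph (self-loops are dropped)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Map each degree d to the number of nodes with degree d."""
    return Counter(len(nbrs) for nbrs in adj.values())


def local_clustering(adj, node):
    """Fraction of pairs of neighbours of node that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention (an assumption): undefined cases count as 0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


if __name__ == "__main__":
    # A triangle a-b-c with a pendant node d attached to c.
    g = build_graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
    print(dict(degree_distribution(g)))
    print({n: round(local_clustering(g, n), 3) for n in sorted(g)})
```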

Project-focused course; no assignments.

Project handouts:

Online resources of data.

Online resources of network data.

- Stanford Network Analysis Project @ Stanford
- Social and Information Network Analysis @ Stanford
- The Koblenz Network Collection
- The Twitter Project Page at MPI-SWS
- Network Data by Mark Newman
- ICWSM datasets

Online data visualization resources.

Data cleansing/wrangling

- OpenRefine (formerly Google Refine) Standalone open source desktop application for data cleanup and transformation to other formats
- Google Cloud DataPrep Data preparation and data cleansing

Graph/network analysis

- SNAP Library for working with massive network datasets (C++, Python)
- NetworkX Library for studying graphs and networks (Python)
- JUNG Library for modeling, analysis, and visualization of graphs (Java)
- Metis Family of programs for partitioning graphs

Graph/network exploration and visualization

- Pajek Program for large network analysis and visualization
- Gephi Program for graph visualization and exploration

Data Visualization

- The data visualisation catalogue Reference library for different types of data visualisations
- Highcharts Libraries for aesthetically pleasing standard charts
- Google Charts Rich gallery of interactive charts and data tools by Google.
- Tableau Software for exploratory data analysis
- D3 Software for interactive visualizations

A list of useful online tutorials relating to the course material.

- Trajectory Data Mining: An Overview, Yu Zheng, ACM Transactions on Intelligent Systems and Technology, September 1, 2015.
- Learning Representations of Large-scale Networks, Jian Tang, Cheng Li and Qiaozhu Mei, ACM KDD 2017.
- Time Series Data Mining Using the Matrix Profile, Abdullah Mueen and Eamonn Keogh, ACM KDD 2017.

Similar courses about information networks and network analysis.

- Mining Massive Data Sets, Jure Leskovec, Jeffrey Ullman (Stanford)
- Networks, David Easley, Jon Kleinberg, Eva Tardos (Cornell)
- Social and Information Network Analysis, Jure Leskovec (Stanford)
- Information Networks, Manos Papagelis (YorkU)