Welcome to
EECS6414 Data Analytics and Visualization



About

Course Description

Data analytics and visualization is an emerging discipline of immense importance to any data-driven organization. This is a project-focused course that provides students with knowledge of tools for data mining and visualization, and with practical experience applying data mining and machine learning algorithms to the analysis of very large amounts of data. It also focuses on methods and models for communicating analysis results effectively through data visualization.

Topics
  • finding similar items
  • frequent itemsets
  • mining data streams
  • clustering
  • dimensionality reduction
  • link analysis
  • mining graphs
  • recommendation systems
  • value of visualization
  • exploratory data analysis
  • visualization of multidimensional data
  • visualization of networks
  • tools/systems for data analytics and visualization (examples: OpenRefine, Apache Hadoop MapReduce, Apache Spark, Twitter Storm/Heron, Tableau, D3, Google BigTable, Google BigPicture)
Lectures & Office Hours

Lectures: Mon 16:00-19:00 at VH 2005 (Vari Hall)

Office Hours: Drop by my office or by appointment (LAS3050, Lassonde building)

Instructor

Manos Papagelis

Contact: papaggel@gmail.com, papaggel@eecs.yorku.ca

Textbooks

The course will rely mainly on the following textbooks.

Syllabus

Download the syllabus (v1.0)

Handouts

Lecture 1. Introduction [Slides]

Introduction, administrivia.

Readings:

Lecture 2. Data-driven Organizations [Slides]

Data-driven organizations, DDO solutions reference model.

Readings:

  • Information Platforms and the Rise of the Data Scientist. Jeff Hammerbacher. Beautiful Data, 73-84, 2009.
  • Data Driven: Creating a Data Culture. Hilary Mason, DJ Patil. O'Reilly Media, 2015.
Lecture 3. Data Ingestion and Data Quality [Slides]

Data ingestion, ETL, data quality, data quality reference model, record linkage, entity resolution, string similarity, data quality scaling issues.

Readings:

  • Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios. Duplicate Record Detection: A Survey. IEEE TKDE, 2007.
  • Erhard Rahm and Hong Hai Do. Data Cleaning: Problems and Current Approaches. IEEE Data Eng. Bull. 23.4, 2000.
  • Richard Y. Wang and Diane M. Strong. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 1996.
Lecture 4. Computing Platforms and Storage Systems [Slides]

Computing platforms, single-node computing, parallel computing, cluster computing, grid computing, data storage, data warehouse model, data lakes, data storage systems, relational DBMS, columnar DBMS, NoSQL, HDFS, key-value stores, object storage, software-defined storage, CAP theorem, moving large data, data definition, schema-on-read, schema-on-write, big data analytics architectures, lambda architecture, kappa architecture.

Readings:

  • Marz, N., & Warren, J. (2015). Big data: Principles and best practices of scalable real-time data systems. Manning Publications Co. Sections 1.4–1.10.
  • Proper, H. A. (1997). Data schema design as a schema evolution process. Data & Knowledge Engineering, 22(2).
  • Allen, B., Bresnahan, J., Childers, L., Foster, I., et al. (2012). Software as a service for data scientists. Communications of the ACM, 55(2).
  • Ghemawat, S., Gobioff, H., & Leung, S. (2003). The Google file system. SOSP'03.
  • Kreps, J. Narkhede, N., Rao, J. (2011). Kafka: a Distributed Messaging System for Log Processing. NetDB.
  • Stonebraker M., et al. (2005). C-Store: A Column-oriented DBMS. VLDB.
  • Chaudhuri S. (1998). An Overview of Query Optimization in Relational Systems (PODS tutorial).
  • Goetz Graefe (1993). Query Evaluation Techniques for Large Databases (ACM survey)
  • Kreps, J. (2013). The log: What every software engineer should know about real-time data's unifying abstraction. LinkedIn blog.
  • Krishna, S., & Tse, E. (2013). Hadoop platform as a service in the cloud. Netflix blog.
Lecture 5. Processing Systems [Slides]

Batch processing (Hadoop/MapReduce), interactive query processing (Dremel/BigQuery), data stream processing (Storm/Heron), unified processing engines (Spark).

Additional Slides:

Readings:

  • Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. The Google file system. SIGOPS, 2003.
  • Dean, Jeffrey, and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008.
  • Chang, Fay, et al. Bigtable: A distributed storage system for structured data. ACM TOCS, 2008.
  • Melnik, Sergey, et al. Dremel: interactive analysis of web-scale datasets. VLDB, 2010.
  • Toshniwal, Ankit, et al. Storm@twitter. SIGMOD, 2014.
  • Kulkarni, Sanjeev, et al. Twitter heron: Stream processing at scale. SIGMOD, 2015.
  • M Zaharia et al. Spark: Cluster computing with working sets. HotCloud 10, 2010.
  • Matei Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. USENIX conference on NSDI, 2012.
  • Xiangrui Meng et al. Mllib: Machine learning in apache spark. JMLR, 2016.
  • Michael Armbrust et al. Spark sql: Relational data processing in spark. SIGMOD, 2015.
Lecture 6. Mining Frequent Itemsets [Slides]

Association rules, Market-Basket model, frequent itemsets, A-Priori algorithm, PCY algorithm, SON algorithm.

Readings:

  • Mining of Massive Datasets (chapter 6)
  • Agrawal, Rakesh, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. SIGMOD record, 1993.
  • Agrawal, Rakesh, and Ramakrishnan Srikant. Fast algorithms for mining association rules. VLDB, 1994.
  • Zaki, Mohammed Javeed, et al. New Algorithms for Fast Discovery of Association Rules. KDD, 1997.
  • Han, Jiawei, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. KDD, 2000.
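A minimal sketch of the idea behind the A-Priori algorithm listed for this lecture, restricted to frequent pairs. It is an illustration under stated assumptions, not the lecture's own code: the function name `frequent_pairs`, the toy baskets, and the support threshold of 3 are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, support):
    """Two-pass A-Priori restricted to pairs (illustrative sketch only)."""
    # Pass 1: count single items; keep those that meet the support threshold.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, c in item_counts.items() if c >= support}

    # Pass 2: count only pairs whose members are both frequent.
    # This pruning relies on monotonicity: a pair cannot be frequent
    # unless each of its items is frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: c for pair, c in pair_counts.items() if c >= support}


# Hypothetical toy data: five market baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
result = frequent_pairs(baskets, support=3)
for pair, count in sorted(result.items()):
    print(pair, count)
```

With these baskets, "eggs" and "cola" are dropped after the first pass, so the second pass never counts any pair that contains them.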
Lecture 7. Finding Similar Items [Slides]

Finding Similar Items, Shingling, Min-Hashing, Locality-Sensitive Hashing (LSH).

Readings:

  • Mining of Massive Datasets (chapter 3)
  • Manber, Udi. Finding similar files in a large file system. Usenix Winter. Vol. 94. 1994.
  • Broder, Andrei Z. On the resemblance and containment of documents. Compression and Complexity of Sequences, 1997.
  • Broder, Andrei Z., et al. Min-wise independent permutations. STOC, 1998.
  • Indyk, Piotr, and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. STOC, 1998.
  • Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. VLDB Vol. 99, No 6, 1999.
  • Andoni, Alexandr, and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. FOCS, 2006.
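A minimal sketch of shingling and min-hashing as listed for this lecture, assuming character 5-shingles and 200 salted hash functions built from `hashlib.blake2b`. The helper names, the shingle length, the number of hash functions, and the example strings are assumptions made for illustration, not part of the course material.

```python
import hashlib


def shingles(text, k=5):
    """Set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def _hash(item, seed):
    # One salted 64-bit hash per seed; the salt makes the functions differ.
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=200):
    """Signature: for each hash function, the minimum hash over the set."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical example documents.
doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox jumped over the lazy dogs")
exact = jaccard(doc_a, doc_b)
approx = estimate_similarity(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"exact Jaccard = {exact:.3f}, MinHash estimate = {approx:.3f}")
```

The estimate approaches the exact Jaccard similarity as the number of hash functions grows. Locality-sensitive hashing would then band these signatures so that only likely-similar pairs are compared, which this sketch does not attempt.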
Lecture 8. Clustering [Slides]

High dimensionality clustering, hierarchical clustering (dendrogram, Euclidean vs non-Euclidean cases), the k-means family of algorithms (initialization, picking k), the BFR algorithm, the CURE algorithm.

Readings:

  • Mining of Massive Datasets (chapter 7)
  • Zhang, Tian, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. SIGMOD, 1996.
  • Bradley, Paul S., Usama M. Fayyad, and Cory Reina. Scaling Clustering Algorithms to Large Databases. KDD, 1998.
  • Guha, Sudipto, Rajeev Rastogi, and Kyuseok Shim. CURE: an efficient clustering algorithm for large databases. SIGMOD, 1998.
  • Ganti, Venkatesh, et al. Clustering large datasets in arbitrary metric spaces. ICDE, 1999.
Lecture 9. Information Visualization (part 1/2) [Slides]

Anscombe's quartet, Bertin's visual variables, cognition and perception, colors, pre-attentive vs attentive processing, Gestalt principles, visual metaphors, Tufte's principles of graphical excellence, data sculpture.

Readings:

  • Tufte, Edward, and P. Graves-Morris. The visual display of quantitative information. 1983. (2014).
  • Pandey, Anshul Vikram, et al. How deceptive are deceptive visualizations?: An empirical analysis of common distortion techniques. CHI, 2015.
  • Simons, Daniel J., and Christopher F. Chabris. Gorillas in our midst: Sustained inattentional blindness for dynamic events. Perception 28.9 (1999): 1059-1074.
  • Simons, Daniel J., and Daniel T. Levin. Failure to detect changes to people during a real-world interaction. Psychonomic Bulletin & Review 5.4 (1998): 644-649.
  • Anscombe, Francis J. Graphs in statistical analysis. The American Statistician 27.1 (1973): 17-21.
  • Matejka, Justin, and George Fitzmaurice. Same stats, different graphs: Generating datasets with varied appearance and identical statistics through simulated annealing. CHI, 2017.
Lecture 10. Information Visualization (part 2/2) [Slides]

Taxonomy of visualization, visualizations of qualitative and quantitative data (comparisons, proportions, relationships, hierarchies, maps, part-to-a-whole, distributions, patterns).

Readings:

Lecture 11. Information Networks [Slides]

Introduction, administrivia, introduction to main problems about networks, basic mathematical concepts, bow-tie structure of the Web.

Readings:

Optional readings:

  • A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener. Graph structure in the Web. Computer Networks, 33, 2000.
Lecture 12. Network Measurements [Slides]

Degree distributions, shortest paths, clustering coefficient, measuring power-laws.

Readings:

Optional readings:

  • M. E. J. Newman, Power laws, Pareto distributions and Zipf's law, Contemporary Physics.
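A minimal sketch of two of the measurements listed for this lecture, the degree distribution and the local clustering coefficient, written in plain Python on a hypothetical four-node undirected graph. The function names and the edge list are assumptions for illustration; libraries such as NetworkX (listed under Resources) provide their own implementations.

```python
from collections import Counter
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Map each degree k to the number of nodes that have degree k."""
    return Counter(len(nbrs) for nbrs in adj.values())


def local_clustering(adj, node):
    """Fraction of pairs of neighbours of node that are themselves linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adj = build_adjacency(edges)
print(dict(degree_distribution(adj)))
print({n: round(local_clustering(adj, n), 3) for n in sorted(adj)})
```

Here node c has three neighbours but only one of the three possible neighbour pairs is linked, so its coefficient is 1/3, while a and b each sit in a closed triangle and score 1.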

Assignments

Project-focused course; no assignments.

Project

Resources

Software Tools and Libraries

Data cleansing/wrangling

Graph/network analysis

  • SNAP Library for working with massive network datasets (C++, Python)
  • NetworkX Library for studying graphs and networks (Python)
  • JUNG Library for modeling, analysis, and visualization of graphs (Java)
  • Metis Family of programs for partitioning graphs

Graph/network exploration and visualization

  • Pajek Program for large network analysis and visualization
  • Gephi Program for graph visualization and exploration

Data Visualization

  • The data visualisation catalogue Reference library for different types of data visualisations
  • Highcharts Libraries for aesthetically pleasing standard charts
  • Google Charts Rich gallery of interactive charts and data tools by Google.
  • Tableau Software for exploratory data analysis
  • D3 Software for interactive visualizations
Tutorials

A list of useful online tutorials relating to the course material.

Similar Courses

Similar courses about information networks and network analysis.