Department of Electrical Engineering and Computer Science

Distributed and Parallel High Utility Sequential Pattern Mining

Morteza Zihayat, Zane Zhenhua Hu, Aijun An and Yonggang Hu

Technical Report EECS-2016-03

York University

April 12, 2016

Abstract

The problem of mining high utility sequential patterns (HUSPs) has been studied recently. Existing solutions are mostly memory-based and assume that the data can fit into the main memory of a single computer. However, with the advent of big data, such an assumption no longer holds. In this paper, we propose a new framework for mining HUSPs in big data. A distributed and parallel algorithm called BigHUSP is proposed to discover HUSPs efficiently. At its heart, BigHUSP uses multiple MapReduce-like steps to process data in parallel. We also propose a number of pruning strategies to minimize the search space in a distributed environment, and thus decrease computational and communication costs, while still maintaining correctness. Our experiments with real-life and large synthetic datasets validate the effectiveness of BigHUSP for mining HUSPs from large sequence datasets.


The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
