An Efficient Distributed Model for Mining Sequential Patterns on a Large Sequence Dataset
Abstract
Sequential pattern mining is an active research area because of its many applications, and many studies have proposed efficient mining algorithms. As sequence datasets continue to grow, researchers have begun applying distributed processing models to the problem of mining sequential patterns from sequence databases. One algorithm that applies a distributed model to efficient sequential pattern mining is the sequential pattern mining algorithm based on the MapReduce model on the cloud (SPAMC). However, SPAMC is still limited when mining datasets that have a large number of distinct items. This article proposes a distributed algorithm to address this problem, called the distributed algorithm for sequential pattern mining on a large sequence dataset using dynamic bit vector structures on the MapReduce distributed programming model (DSPDBV). In addition, the algorithm uses several techniques to prune redundant candidates early and to reduce memory usage. Experimental results show that DSPDBV is highly efficient and scalable for large sequence datasets. Moreover, DSPDBV is more efficient than SPAMC when handling datasets that have a large number of distinct items.