João Galvão, December 2014

Segmentação de Vastos Volumes de Dados com o SNNagg (Clustering Vast Volumes of Data with SNNagg)

Maribel Yasmina Santos (superv.), Universidade do Minho, December 2014.
Keywords: density-based clustering, SNN, shared nearest neighbour
Abstract: Driven by recent advances in information technologies and by the massive use of electronic devices, the amount of generated data has been increasing at a very high rate. Data mining algorithms are used to handle these large amounts of data. This work focuses on clustering and, in particular, on the SNN (Shared Nearest Neighbour) algorithm, a density-based clustering algorithm. Clustering algorithms usually present high runtimes due to their quadratic complexity. In this work, the SNN algorithm is used to analyse spatial data.

The main objective of this work is to propose and evaluate solutions that reduce the processing time of the algorithm by excluding repeated objects from its most time-consuming task, the identification of the k-nearest neighbours of a point. This matters because the number of repeated objects found in a spatial data set is usually high. Following the Design Science Research methodology, this work presents three different approaches that reduce the processing time by excluding the repeated points from the process of identifying the nearest neighbours, the task responsible for the quadratic complexity of the algorithm. The excluded points are added later to the identified clusters. For the three proposed approaches, the obtained results show that it is possible to reduce the processing time without compromising the quality of the identified clusters.
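The general idea of excluding repeated points from the neighbour search and restoring them afterwards can be illustrated with a minimal Python sketch. It is not one of the three approaches of this work: the function names (`deduplicate`, `k_nearest`, `expand_labels`) are hypothetical, the neighbour search is a plain brute-force one, and the clustering step itself is left out, so the labels of the unique points are assumed to come from some later SNN step. How repeated points should influence density is also not modelled here.

```python
import math

def deduplicate(points):
    """Return the unique points plus, for every original point,
    the index of its representative in the unique list."""
    first_seen = {}
    unique = []
    mapping = []
    for p in points:
        if p not in first_seen:
            first_seen[p] = len(unique)
            unique.append(p)
        mapping.append(first_seen[p])
    return unique, mapping

def k_nearest(unique, k):
    """Brute-force k-nearest neighbours over the unique points only.
    This is the quadratic step, now run on fewer points."""
    neighbours = []
    for i, p in enumerate(unique):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(unique) if j != i
        )
        neighbours.append([j for _, j in dists[:k]])
    return neighbours

def expand_labels(unique_labels, mapping):
    """Give every original point, repeated ones included,
    the cluster label of its representative."""
    return [unique_labels[u] for u in mapping]

# Six points, two of which are repeats of an earlier point.
points = [(0, 0), (0, 0), (1, 0), (5, 5), (5, 5), (5, 6)]
unique, mapping = deduplicate(points)
neighbours = k_nearest(unique, k=1)

# Assumed output of a clustering step on the four unique points.
unique_labels = [0, 0, 1, 1]
labels = expand_labels(unique_labels, mapping)
```

In this toy input the neighbour search runs over four points instead of six; the saving grows with the share of repeated points in the data set.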