Bruno Faustino, November 2012

Implementation for Spatial Data of the Shared Nearest Neighbour with Metric Data Structures, João Moura Pires (superv.), UNL, November 2012.

Abstract: The Shared Nearest Neighbour (SNN) is a data clustering algorithm that identifies noise in data and finds clusters with different densities, shapes and sizes. These features make the SNN a good candidate to deal with spatial data. However, the SNN time complexity can be a bottleneck in spatial data clustering, since it has a time complexity in the worst case evaluated in O(n2), which compromises its scalability. In this thesis, it is proposed to use metric data structures in primary or secondary storage, to index spatial data and support the SNN in querying for the k-nearest neighbours, since this query is the source of the quadratic time complexity of the SNN. For low dimensional data, namely when dealing with spatial data, the time complexity in the average case of the SNN, using a metric data structure in primary storage (kd-Tree), is improved to at most O(n × log n). Furthermore, using a strategy to reuse the k-nearest neighbours between consecutive runs, it is possible to obtain a time complexity in the worst case of O(n). The experimental results were done using the kd-Tree and the DF-Tree, which work in primary and secondary storage, respectively. These results showed good performance gains in primary storage, when compared with the performance of a referenced imple- mentation of the SNN. In secondary storage, with 128 000 objects, the performance became comparable with the performance of the referenced implementation of the SNN.