How can we efficiently find similar items in vast datasets? This paper studies distance-based index structures for similarity queries in large metric spaces, where the task is to find approximate matches to a query item in a large collection and each distance computation between objects is expensive. The authors build on an approach that uses reference points (vantage points) to partition the data space hierarchically into spherical shell-like regions. They introduce the multivantage point tree structure (mvp-tree), which uses more than one vantage point to partition the space at each level. The mvp-tree also keeps precomputed distances between data points and vantage points and uses them at query time to filter out candidates without computing further distances. In experiments, mvp-trees outperformed vp-trees by 20% to 80%. Further experiments varied the number of vantage points and the use of precomputed distances. The results suggest that using a large number of vantage points and keeping precomputed distances gives more efficient filtering during search, making the mvp-tree a valuable tool for similarity search in large metric spaces.
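To make the vantage-point idea concrete, here is a minimal Python sketch of the simpler baseline, a vp-tree with one vantage point per node, together with a range query that prunes subtrees using the triangle inequality. It is an illustration under stated assumptions, not the paper's mvp-tree or its construction algorithm: the names `Node`, `build`, `range_search`, and `euclidean` are hypothetical, the first point is taken as the vantage point for simplicity, and the split radius is the median distance.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Metric = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    """Example metric; any function obeying the triangle inequality works."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


@dataclass
class Node:
    vantage: Point
    radius: float              # median distance from the vantage point
    inside: Optional["Node"]   # points p with d(vantage, p) <= radius
    outside: Optional["Node"]  # points p with d(vantage, p) > radius


def build(points: Sequence[Point], dist: Metric = euclidean) -> Optional[Node]:
    """Recursively split the points into an inner ball and an outer shell."""
    if not points:
        return None
    # Assumption: take the first point as the vantage point for simplicity.
    vantage, rest = points[0], list(points[1:])
    if not rest:
        return Node(vantage, 0.0, None, None)
    dists = [dist(vantage, p) for p in rest]
    radius = sorted(dists)[len(dists) // 2]
    inside = [p for p, d in zip(rest, dists) if d <= radius]
    outside = [p for p, d in zip(rest, dists) if d > radius]
    return Node(vantage, radius, build(inside, dist), build(outside, dist))


def range_search(node: Optional[Node], query: Point, r: float,
                 dist: Metric = euclidean,
                 out: Optional[List[Point]] = None) -> List[Point]:
    """Collect every indexed point within distance r of the query."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(query, node.vantage)  # the only distance computed at this node
    if d <= r:
        out.append(node.vantage)
    # Inner points satisfy d(v, p) <= radius, so d(q, p) >= d - radius.
    # The inner subtree can be skipped when d - radius > r.
    if d - r <= node.radius:
        range_search(node.inside, query, r, dist, out)
    # Outer points satisfy d(v, p) > radius, so d(q, p) > radius - d.
    # The outer subtree can be skipped when radius - d >= r.
    if d + r > node.radius:
        range_search(node.outside, query, r, dist, out)
    return out
```

Calling `range_search(build(points), query, r)` should return the same set of points as a linear scan while typically computing fewer distances for small `r`. The mvp-tree described in the paper goes further than this sketch by using multiple vantage points per node and by keeping precomputed distances to filter candidates.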
Published in ACM Transactions on Database Systems, this paper aligns with the journal's focus on efficient data management and retrieval techniques. The research on indexing large metric spaces for similarity queries directly addresses core topics within database systems. By presenting the mvp-tree structure and demonstrating its performance advantages, the paper offers a practical solution for handling similarity searches in large datasets.