This paper addresses the *view data lineage* problem in data warehousing: given a data item in a materialized warehouse view, identify the source data items that contributed to it. It presents a formal definition of the lineage problem, develops lineage tracing algorithms for relational views with aggregation, and proposes mechanisms for performing consistent lineage tracing in a multisource data warehousing environment. These results can form the basis of a tool that lets analysts examine warehouse data, choose specific view tuples, and then “drill-through” to identify the exact source tuples that were used to derive the view tuples of interest. Such a tool supports a deeper understanding of data provenance and facilitates data quality management in warehousing environments.
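To make the idea concrete, the following is a minimal Python sketch of lineage for a single aggregate view. It is an illustration under simplifying assumptions, not the paper's algorithm: the `SALES` table, the `total_by_store` view, and the `lineage` function are hypothetical names introduced here, and the view is a plain group-by with a sum over one in-memory source. Under those assumptions, the lineage of a view tuple is taken to be the set of source rows that fall in the same group.

```python
from collections import defaultdict

# Hypothetical source table with rows of the form (store, item, amount).
SALES = [
    ("north", "pen", 3),
    ("north", "ink", 7),
    ("south", "pen", 5),
]


def total_by_store(rows):
    """Materialize the view: SELECT store, SUM(amount) ... GROUP BY store."""
    totals = defaultdict(int)
    for store, _item, amount in rows:
        totals[store] += amount
    return sorted(totals.items())


def lineage(view_tuple, rows):
    """Return the source rows in the same group as the given view tuple.

    Assumption: for this simple aggregate view, a source row contributes
    to a view tuple exactly when it shares the grouping value (store).
    """
    store, _total = view_tuple
    return [row for row in rows if row[0] == store]


VIEW = total_by_store(SALES)
NORTH_LINEAGE = lineage(("north", 10), SALES)
```

Here an analyst who picks the view tuple `("north", 10)` would be led back to the two `north` rows in `SALES`. A real warehouse would also have to handle views over several sources and sources that change after the view is materialized, which is where the consistency mechanisms mentioned above come in.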
Published in ACM Transactions on Database Systems, this paper fits the journal's focus on database management systems and data warehousing. Its treatment of view data lineage is directly relevant to database research and falls within the journal's core topics.