Data quality maintenance in Data Integration Systems

Marotta, Adriana

Supervisor(s): Ruggia, Raúl

Abstract:

A Data Integration System (DIS) is an information system that integrates data from a set of heterogeneous and autonomous information sources and provides it to users. Quality in these systems consists of several factors that are measured on the data. Commonly considered factors include completeness, accuracy, accessibility, freshness, and availability. In a DIS, quality factors are associated with the sources, with the extracted and transformed information, and with the information the DIS provides to the user. At the same time, users can state quality requirements alongside their data requirements. DIS quality is considered better the closer it is to the user quality requirements. DIS quality depends on the quality of the data sources, on the data transformations, and on the quality required by users. It is therefore a property that varies as a function of changes in these three other properties. The general goal of this thesis is to provide mechanisms for maintaining DIS quality at a level that satisfies the user quality requirements, while minimizing the modifications to the system caused by quality changes. The proposal allows constructing and maintaining a DIS that is tolerant to quality changes. This means that the DIS is constructed taking into account forecasts of quality behavior, so that if changes occur in line with these forecasts, the system is not affected by them at all. These forecasts are provided by models of the quality behavior of DIS data, which must be kept up to date. With this strategy, the DIS is affected only when the quality behavior models change, instead of each time there is a quality variation in the system.
The thesis takes a probabilistic approach, which allows modeling the behavior of the quality factors at the sources and at the DIS, lets users state flexible quality requirements (using probabilities), and provides tools, such as certainty and mathematical expectation, that help decide which quality changes are relevant to DIS quality. The probabilistic models are monitored in order to detect source quality changes, a strategy that detects changes in quality behavior and not only isolated quality changes. We also propose monitoring other DIS properties that affect its quality and, for each change in them, deciding whether it affects the behavior of DIS quality, taking the DIS quality models into account. Finally, the probabilistic approach is also applied when determining which actions to take in order to improve DIS quality. To interpret the state of the DIS, we propose using statistics, which include, in particular, the history of the quality models.
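The idea of a probabilistic quality requirement, and of reacting only to changes in quality behavior, can be illustrated with a minimal sketch. It is not the thesis's own formulation: the names `QualityRequirement`, `probability_within`, `satisfies`, and `behavior_changed`, the empirical-frequency estimate, the tolerance value, and the sample freshness values (in hours) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class QualityRequirement:
    """Hypothetical requirement: P(value <= threshold) >= min_probability."""
    threshold: float
    min_probability: float


def probability_within(samples: Sequence[float], threshold: float) -> float:
    """Empirical estimate of P(value <= threshold) from observed samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(1 for value in samples if value <= threshold) / len(samples)


def satisfies(samples: Sequence[float], requirement: QualityRequirement) -> bool:
    """Check a probabilistic requirement against observed quality values."""
    estimate = probability_within(samples, requirement.threshold)
    return estimate >= requirement.min_probability


def behavior_changed(model_probability: float, samples: Sequence[float],
                     threshold: float, tolerance: float) -> bool:
    """True when recent observations deviate from the stored model
    probability by more than the (assumed) tolerance."""
    estimate = probability_within(samples, threshold)
    return abs(estimate - model_probability) > tolerance


# Illustrative freshness values in hours (made-up data).
history = [2, 5, 8, 12, 20, 30, 6, 9, 14, 3]
requirement = QualityRequirement(threshold=24, min_probability=0.75)
model_probability = probability_within(history, requirement.threshold)

# One window with a few stale values, one with a sustained degradation.
isolated_variation = [2, 5, 30, 12, 20, 26, 6, 9, 14, 3]
sustained_shift = [10, 26, 30, 40, 8, 28, 35, 12, 27, 31]

print(satisfies(history, requirement))
print(behavior_changed(model_probability, isolated_variation, 24, tolerance=0.2))
print(behavior_changed(model_probability, sustained_shift, 24, tolerance=0.2))
```

In this sketch a few stale values stay within the tolerance and leave the model untouched, while a sustained shift is flagged as a change in behavior, which is the only case in which the system would need to react.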


Bibliographic Details
2008
English
Universidad de la República
COLIBRI
https://hdl.handle.net/20.500.12008/34296
Open access
Creative Commons Attribution - NonCommercial - NoDerivatives License (CC BY-NC-ND 4.0)