Data Integration

What is Data Integration?

Data integration is the problem of combining data residing at different sources and providing the user with a unified view of these data. This important problem emerges in a variety of situations, both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories). Data integration appears with increasing frequency as both the volume of existing data and the need to share it grow rapidly. It has been the focus of extensive theoretical work, and numerous open problems remain to be solved. In practice, data integration is frequently called Enterprise Information Integration.

How do I integrate data and maintain its integrity when using multiple software systems?

Data integrity must be assured for each customer of the data, whether that customer is a person or a software package. The responsibility for data integrity therefore rests with the central repository, although other tools can operate on the data and even cross-check its integrity. Data-customer software derives its data from the central repository and performs consistency checks on that data to ensure its integrity. In the reverse direction, the output of the data-customer software (its solution) is transmitted back to the repository. This communication can be handled in a variety of ways, from vendor-defined APIs to file-based exchange specifications. Although it is essential that software components communicate, keep in mind that projects succeed primarily because each component excels at its designed task.
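As a rough illustration of a repository that owns integrity checks, the Python sketch below validates every record a data-customer tool sends back before storing it. The class name CentralRepository, the fields part_id and length_mm, and the validation rules are all hypothetical, chosen only for the example; a real repository would enforce its own schema and rules.

```python
from typing import Dict, List

# Hypothetical schema: every record must carry these fields.
REQUIRED_FIELDS = ("part_id", "length_mm")


class CentralRepository:
    """Toy central repository that is responsible for data integrity."""

    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}

    def check(self, record: dict) -> List[str]:
        """Return a list of integrity problems; an empty list means the record is consistent."""
        problems = [
            f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record
        ]
        length = record.get("length_mm")
        if length is not None and (
            not isinstance(length, (int, float)) or length <= 0
        ):
            problems.append("length_mm must be a positive number")
        return problems

    def submit(self, record: dict) -> None:
        """Accept a record from data-customer software only if it passes the checks."""
        problems = self.check(record)
        if problems:
            raise ValueError("; ".join(problems))
        # Store a copy so later changes by the caller cannot alter repository data.
        self._records[record["part_id"]] = dict(record)

    def fetch(self, part_id: str) -> dict:
        """Hand a copy of the stored record to a data customer."""
        return dict(self._records[part_id])
```

A data-customer tool would call fetch to obtain its input and submit to send its results back. It can also run check on its own copy as a cross-check, but the repository remains the final authority.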

Examples of Data Integration?

Consider a web application where a user can query a variety of information about cities, such as crime statistics, weather, hotels, and demographics. Traditionally, all of this information would have to exist in a single database with a single schema. Information of this breadth, however, is difficult and expensive for a single enterprise to collect. Even if the resources exist to gather the data, doing so would likely duplicate data already held in existing crime databases, weather websites, and census data.

A data integration solution addresses this problem by considering these external resources as materialized views over a virtual mediated schema. This means application developers construct a schema to best model the kinds of answers their users want. This virtual schema is called the mediated schema. Next, they design "wrappers" or adapters for each data source, such as the crime database and weather website. These adapters simply transform the local query results (those returned by the respective websites or databases) into an easily processed form for the data integration solution. When an application-user queries the mediated schema, the data integration solution transforms this query into appropriate queries over the respective data sources. Finally, the results of these queries are combined into the answer to the user's query.
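The flow described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the mediated schema fields, the two wrapper functions, and the in-memory dictionaries standing in for the crime database and weather website are all hypothetical, and a real system would translate queries far more generally than a lookup by city name.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical mediated schema: the fields the application exposes per city.
MEDIATED_FIELDS = ("city", "crime_rate", "temperature_c")


def crime_wrapper(city: str) -> Dict[str, object]:
    """Adapter for a stand-in crime database with its own column names."""
    source_rows = {"Springfield": {"CityName": "Springfield", "IncidentsPer1000": 12.5}}
    row = source_rows.get(city)
    if row is None:
        return {}
    # Map the source's local schema onto the mediated schema.
    return {"city": row["CityName"], "crime_rate": row["IncidentsPer1000"]}


def weather_wrapper(city: str) -> Dict[str, object]:
    """Adapter for a stand-in weather site that reports Fahrenheit."""
    source_rows = {"Springfield": {"loc": "Springfield", "temp_f": 68.0}}
    row = source_rows.get(city)
    if row is None:
        return {}
    # Convert units as part of the transformation to the mediated schema.
    celsius = round((row["temp_f"] - 32) * 5 / 9, 1)
    return {"city": row["loc"], "temperature_c": celsius}


WRAPPERS: List[Callable[[str], Dict[str, object]]] = [crime_wrapper, weather_wrapper]


def query_city(city: str) -> Dict[str, Optional[object]]:
    """Answer a query over the mediated schema by asking each source and merging."""
    answer: Dict[str, Optional[object]] = {field: None for field in MEDIATED_FIELDS}
    answer["city"] = city
    for wrapper in WRAPPERS:
        for field, value in wrapper(city).items():
            if field in answer:
                answer[field] = value
    return answer
```

Calling query_city("Springfield") asks both stand-in sources and merges their answers into one record shaped like the mediated schema. Fields that no source can supply stay as None.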

A convenience of this solution is that new sources can be added simply by constructing an adapter for them. This contrasts with extract, transform, load (ETL) systems or a single-database solution, where the entire new dataset must be manually integrated into the system.
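To make the point about adding sources concrete, here is a small Python sketch in which a new source is brought in by registering one more adapter, with no change to the data already reachable. The Mediator class, the adapter functions, and their hard-coded return values are hypothetical and exist only for this illustration.

```python
from typing import Callable, Dict, List

Adapter = Callable[[str], Dict[str, object]]


class Mediator:
    """Toy mediator that merges answers from whichever adapters are registered."""

    def __init__(self) -> None:
        self._adapters: List[Adapter] = []

    def register(self, adapter: Adapter) -> None:
        """Add a new data source by adding its adapter; nothing else changes."""
        self._adapters.append(adapter)

    def query(self, city: str) -> Dict[str, object]:
        answer: Dict[str, object] = {"city": city}
        for adapter in self._adapters:
            answer.update(adapter(city))
        return answer


def weather_adapter(city: str) -> Dict[str, object]:
    # Stand-in for a call to a weather website.
    return {"temperature_c": 20.0} if city == "Springfield" else {}


def hotel_adapter(city: str) -> Dict[str, object]:
    # Stand-in for a call to a hotel listing service.
    return {"hotel_count": 42} if city == "Springfield" else {}


mediator = Mediator()
mediator.register(weather_adapter)
before = mediator.query("Springfield")

# Adding the hotel source requires only one more registration.
mediator.register(hotel_adapter)
after = mediator.query("Springfield")
```

Here before contains only the weather field, while after also contains hotel_count, even though neither the mediator nor the weather adapter was modified.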

