Invited Talks

Erhard Rahm (University of Leipzig, Germany)

  • Title: Scalable Matching of Real-world Data
  • Abstract:
    Effective and efficient entity resolution, or object matching, is a key challenge for data integration. Despite the existence of numerous commercial tools and research prototypes, there are still significant quality, performance, and usability issues for real-world match tasks. For example, web data such as product offers from different online shops tend to be very difficult to match with each other. Furthermore, it is still difficult to find effective match strategies that combine several match algorithms. While machine learning approaches help in this respect, they depend on suitable training samples and often incur prohibitive execution times. We will discuss these issues and present our approaches to address them. In particular, we will present a learning-based strategy for matching product offers. We will also show how scalability can be improved by cloud-based entity resolution and by new load balancing schemes that deal with data skew (see the illustrative sketch after this entry). We will further present Dedoop (Deduplication with Hadoop), a tool for cloud-based entity resolution.
  • Bio:
    Erhard Rahm is a full professor of computer science at the University of Leipzig, Germany. He chairs the database group as well as a lab on web data integration (WDI Lab). He has held visiting research positions at IBM Research and Microsoft Research. His current work areas include data integration, schema and ontology management, and cloud data processing. Professor Rahm has published about 200 peer-reviewed research papers and has authored or co-edited several books. He received the VLDB 10-Year Best Paper Award in 2011 for a paper on schema matching.
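
To make the blocking idea behind scalable entity resolution concrete, here is a minimal, self-contained sketch (our own illustration, not Dedoop's actual code): records that share a simple blocking key are compared pairwise with a token-based Jaccard similarity, shrinking the quadratic comparison space to within-block pairs. The `blocking_key` prefix, the `threshold` value, and the record fields are hypothetical choices for illustration.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def blocking_key(record: dict) -> str:
    # Hypothetical blocking key: first three characters of the title.
    return record["title"].lower()[:3]

def match(records, threshold=0.6):
    # Partition records into blocks; only records within the same block
    # are compared, which is what makes the approach scale.
    blocks = {}
    for r in records:
        blocks.setdefault(blocking_key(r), []).append(r)
    for block in blocks.values():
        for r1, r2 in combinations(block, 2):
            if jaccard(r1["title"], r2["title"]) >= threshold:
                yield r1["id"], r2["id"]

offers = [
    {"id": 1, "title": "Canon EOS 600D DSLR Camera"},
    {"id": 2, "title": "Canon EOS 600D digital camera"},
    {"id": 3, "title": "Nikon D5100 DSLR Camera"},
]
print(list(match(offers)))  # -> [(1, 2)]
```

In a MapReduce setting such as Dedoop's, the blocking key would naturally serve as the map output key, and skewed blocks would be split across reducers by the load balancing schemes mentioned in the abstract.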

Ihab Ilyas (Qatar Computing Research Institute)

  • Title: Non-destructive Cleaning: Modeling and Querying Possible Data Repairs
  • Abstract:
    Real-world data often suffer from several types of data quality problems, such as duplicate records and integrity constraint violations. Current data cleaning procedures usually produce one clean instance (repair) of the input data by carefully choosing the parameters of the duplicate detection algorithms. Replacing the input dirty data with a single possible clean instance may result in unrecoverable errors, for example, when possible duplicate records are identified and merged in health care systems. In this talk, I present our recent approaches to probabilistic data cleaning, focusing on two problems: probabilistic record linkage, and modeling and querying the possible repairs of data that violate functional dependency constraints (a minimal sketch of such repairs follows this entry). I will show how to efficiently support relational queries under our proposed model, and how to allow new types of queries over the set of possible repairs.
  • Bio:
    Dr. Ihab Ilyas is a principal scientist at the Qatar Computing Research Institute. He received his PhD in computer science from Purdue University, West Lafayette. His main research is in the area of database systems, with special interest in top-k and rank-aware query processing, managing uncertain and probabilistic databases, information extraction, and data quality. He is an associate professor at the University of Waterloo, an IBM CAS faculty fellow since January 2006, and a recipient of the Ontario Early Researcher Award in 2008.
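
As a concrete illustration of possible repairs (our own minimal sketch, not the model presented in the talk), the following snippet detects violations of a functional dependency zip -> city and enumerates every repair obtained by keeping exactly one city per zip code; the relation, the dependency, and the repair semantics are simplified assumptions.

```python
from itertools import product

rows = [
    {"zip": "47907", "city": "West Lafayette"},
    {"zip": "47907", "city": "Lafayette"},   # violates zip -> city
    {"zip": "10001", "city": "New York"},
]

def possible_repairs(rows, lhs="zip", rhs="city"):
    # Collect the candidate right-hand-side values for each
    # left-hand-side value; more than one candidate means a violation.
    candidates = {}
    for r in rows:
        candidates.setdefault(r[lhs], set()).add(r[rhs])
    keys = sorted(candidates)
    # Every repair picks one rhs value per lhs value, so the cross
    # product of the candidate sets spans the space of possible repairs.
    for choice in product(*(sorted(candidates[k]) for k in keys)):
        assignment = dict(zip(keys, choice))
        yield [{lhs: r[lhs], rhs: assignment[r[lhs]]} for r in rows]

for i, repair in enumerate(possible_repairs(rows)):
    print(i, repair)   # prints the two possible repairs for zip 47907
```

A query over this space could, for instance, count in how many repairs a given tuple survives, which under a uniform distribution over repairs yields the tuple's probability. Materializing all repairs is exponential in general, which is presumably why the talk's model supports querying the repair space without enumerating it.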