An overview of end-to-end entity resolution for big data
An overview of end-to-end entity resolution for big data, Christophides et al., ACM Computing Surveys, Dec. 2020, Article No. 127. The ACM Computing Surveys are always a great way to get a quick orientation in a new subject area, and hot off the press is this survey on the entity resolution (aka record linking) problem.

In the realm of big data, accurately identifying and linking records that refer to the same real-world entity is a critical task. This process, known as entity resolution (ER) or record linkage, is essential for tasks such as data deduplication and integrating data from multiple sources. A recent survey by Christophides et al., published in ACM Computing Surveys in December 2020, provides an in-depth overview of end-to-end entity resolution for big data.
Entity resolution aims to uncover different descriptions of the same real-world entity, whether within a single data source or across multiple sources, in the absence of unique identifiers. Applied to records from a single source, ER amounts to deduplication; record linkage refers to joining records across different sources. Scaling the task is challenging because a naive approach compares every description with every other, which is quadratic in the input size.
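To make the quadratic cost concrete, here is a minimal sketch of the brute-force candidate set. The function names `naive_pair_count` and `all_pairs` are illustrative and do not come from the survey.

```python
from itertools import combinations


def naive_pair_count(n: int) -> int:
    """Number of unordered pairs among n descriptions: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def all_pairs(descriptions):
    """Return an iterator over every unordered pair (the brute-force candidate set)."""
    return combinations(descriptions, 2)


if __name__ == "__main__":
    for n in (1_000, 1_000_000):
        print(f"{n} descriptions -> {naive_pair_count(n)} comparisons")
```

A thousand descriptions already need 499,500 comparisons and a million need roughly 5 × 10^11, which is why the later stages focus on avoiding most of these pairs.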
At the core of an ER pipeline is the concept of entity descriptions. An individual record or document representing an entity is called an entity description. A collection of such descriptions is referred to as an entity collection. Two descriptions that correspond to the same real-world entity are termed matches or duplicates.
The general flow of an ER pipeline involves three main stages: blocking, block processing, and comparison. Blocking takes the input entity descriptions and assigns each one to one or more blocks based on blocking keys. Its primary goal is to reduce the number of comparisons made later on: only descriptions that share a block are compared, so blocking should place any two descriptions that have a chance of referring to the same real-world entity in a common block under at least one blocking key. The key to effective blocking is redundancy: placing every description into multiple blocks increases the likelihood that matching descriptions co-occur in at least one of them.
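One simple way to get redundant blocking keys is to use every token in a description as a key. The sketch below assumes descriptions are plain dictionaries keyed by an entity id; the helper names `tokens` and `token_blocking` and the sample records are hypothetical, and this is only one of many possible blocking schemes.

```python
import re
from collections import defaultdict


def tokens(description: dict) -> set[str]:
    """Lowercased alphanumeric tokens drawn from all attribute values."""
    out: set[str] = set()
    for value in description.values():
        out.update(re.findall(r"[a-z0-9]+", str(value).lower()))
    return out


def token_blocking(collection: dict[str, dict]) -> dict[str, set[str]]:
    """Map each token (the blocking key) to the ids of descriptions containing it."""
    blocks: dict[str, set[str]] = defaultdict(set)
    for entity_id, description in collection.items():
        for token in tokens(description):
            blocks[token].add(entity_id)
    # A block with a single member produces no comparisons, so drop it.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


# Made-up sample data for illustration only.
collection = {
    "a1": {"name": "Jon Smith", "city": "London"},
    "a2": {"name": "John Smith", "city": "london"},
    "b1": {"name": "Maria Lopez", "city": "Madrid"},
}
blocks = token_blocking(collection)
```

Here `a1` and `a2` land together in both the `smith` and `london` blocks, so a misspelling in one attribute (`Jon` vs `John`) does not stop them from being compared, while `b1` shares no block with either.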
Following blocking, block processing aims to further reduce the number of comparisons required. This stage involves techniques such as indexing and similarity estimation to identify potential matches within each block. By narrowing down the candidates, block processing prepares the data for the final comparison stage.
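As an illustration of how this stage can narrow the candidates, the sketch below makes two assumptions that are not spelled out in the text above: oversized blocks are skipped, and the number of blocks a pair shares is used as a cheap similarity estimate. The function name `process_blocks` and both thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations


def process_blocks(
    blocks: dict[str, set[str]],
    max_block_size: int = 50,
    min_shared_blocks: int = 2,
) -> set[tuple[str, str]]:
    """Return distinct candidate pairs that co-occur in enough small blocks."""
    pair_weights: Counter = Counter()
    for ids in blocks.values():
        # Very large blocks usually come from uninformative keys; skip them.
        if len(ids) > max_block_size:
            continue
        # Sorting gives each pair a canonical order, so repeats are counted once per block.
        for pair in combinations(sorted(ids), 2):
            pair_weights[pair] += 1
    return {pair for pair, weight in pair_weights.items() if weight >= min_shared_blocks}


# Made-up blocks for illustration only.
blocks = {
    "smith": {"a1", "a2", "c7"},
    "london": {"a1", "a2"},
    "the": {"a1", "a2", "b1", "c7"},
}
candidates = process_blocks(blocks, max_block_size=3, min_shared_blocks=2)
```

With these inputs the `the` block is skipped as too large, and only the pair `("a1", "a2")` shares two of the remaining blocks, so it is the single candidate passed on to comparison.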
The comparison stage is where detailed analysis takes place to determine whether pairs of descriptions within the same block are indeed matches. This involves various methods, including rule-based matching, probabilistic models, and machine learning algorithms. The choice of comparison techniques depends on factors such as data quality, the nature of the data, and the desired level of accuracy.
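A rule-based matcher can be as small as a similarity function plus a threshold. The sketch below uses Jaccard similarity over name tokens; the attribute name `name`, the 0.5 threshold, and the helper names are assumptions chosen for illustration, not a method prescribed by the survey.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def name_tokens(description: dict) -> set[str]:
    """Lowercased whitespace-separated tokens of the (assumed) 'name' attribute."""
    return set(str(description.get("name", "")).lower().split())


def is_match(left: dict, right: dict, threshold: float = 0.5) -> bool:
    """Declare a match when the name-token Jaccard similarity reaches the threshold."""
    return jaccard(name_tokens(left), name_tokens(right)) >= threshold


# Made-up records for illustration only.
r1 = {"name": "John A. Smith"}
r2 = {"name": "john smith"}
r3 = {"name": "Maria Lopez"}
```

Here `r1` and `r2` share two of three distinct tokens (similarity 2/3) and are declared a match, while `r1` and `r3` share none. Probabilistic or learned matchers replace the fixed rule with a model, but they are applied to the same candidate pairs.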
End-to-end entity resolution for big data is a complex process that requires careful consideration of each stage. The survey by Christophides et al. highlights the challenges and opportunities in this field, offering insights into the latest advancements and best practices. As big data continues to grow and become more interconnected, the need for robust and scalable entity resolution solutions will only increase. This survey serves as a valuable resource for researchers, practitioners, and anyone interested in understanding and improving entity resolution in the context of big data.










