An overview of end-to-end entity resolution for big data

An overview of end-to-end entity resolution for big data, Christophides et al., ACM Computing Surveys, Dec. 2020, Article No. 127. The ACM Computing Surveys are always a great way to get a quick orientation in a new subject area, and hot off the press is this survey on the entity resolution (aka record linking) problem.

6 April 2026 at 08:04 pm

In the realm of big data, accurately identifying and linking records that refer to the same real-world entity is a critical task. This process, known as entity resolution (ER) or record linkage, is essential for tasks such as data deduplication and integrating data from multiple sources. A recent survey by Christophides et al., published in ACM Computing Surveys in December 2020, provides an in-depth overview of end-to-end entity resolution for big data.

Entity resolution aims to uncover different descriptions of the same real-world entity, whether within a single data source or across multiple sources, in the absence of unique identifiers. Within a single source the task is usually called deduplication; across sources it is called record linkage. Scaling the task is hard because a naive approach compares every description with every other one, so the number of comparisons grows quadratically with the input size.
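To make the quadratic growth concrete, here is a minimal sketch that counts the unordered pairs a naive all-against-all comparison would have to examine. The function name `naive_pair_count` and the toy descriptions are illustrative assumptions, not something taken from the survey.

```python
from itertools import combinations


def naive_pair_count(n: int) -> int:
    """Number of unordered pairs among n descriptions: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical toy collection of entity descriptions.
descriptions = ["Ann Lee", "A. Lee", "Bob Ray", "Robert Ray"]

# Enumerating the pairs explicitly agrees with the closed-form count.
pairs = list(combinations(descriptions, 2))
assert len(pairs) == naive_pair_count(len(descriptions))

for n in (1_000, 1_000_000):
    print(n, naive_pair_count(n))
# 1000 499500
# 1000000 499999500000
```

A thousand descriptions already require about half a million comparisons, and a million descriptions require roughly 5 × 10^11, which is why the later stages focus on avoiding most of them.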

At the core of an ER pipeline is the concept of entity descriptions. An individual record or document representing an entity is called an entity description. A collection of such descriptions is referred to as an entity collection. Two descriptions that correspond to the same real-world entity are termed matches or duplicates.

The general flow of an ER pipeline involves three main stages: blocking, block processing, and comparison. Blocking takes the input entity descriptions and assigns each of them to one or more blocks based on blocking keys. Its primary goal is to reduce the number of comparisons made later on: only descriptions that share a block are compared in detail, so any two descriptions that have a chance of referring to the same real-world entity should end up in the same block under at least one blocking key. The key to effective blocking is redundancy: each description is placed into several blocks, which increases the likelihood that matching descriptions co-occur in at least one of them.
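One simple way to obtain redundant blocks is to use every token of a description as a blocking key, so that each description lands in as many blocks as it has distinct tokens. The sketch below assumes this token-based scheme; the function name `token_blocking` and the sample data are hypothetical, and the survey covers many other blocking methods.

```python
from collections import defaultdict


def token_blocking(collection: dict[str, str]) -> dict[str, set[str]]:
    """Map each blocking key (a lowercase token) to the ids of descriptions containing it."""
    blocks = defaultdict(set)
    for entity_id, text in collection.items():
        # Each distinct token acts as one blocking key for this description.
        for token in set(text.lower().split()):
            blocks[token].add(entity_id)
    # A block holding a single description produces no comparisons, so drop it.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


# Hypothetical entity collection: id -> textual description.
collection = {
    "e1": "Ann Lee Berlin",
    "e2": "Lee Ann Berlin",
    "e3": "Bob Ray Paris",
    "e4": "Robert Ray Paris",
}

blocks = token_blocking(collection)
for key in sorted(blocks):
    print(key, sorted(blocks[key]))
```

Here "e3" and "e4" share the blocks for "ray" and "paris" even though their first tokens differ, which is the effect redundancy is meant to achieve.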

Following blocking, block processing aims to further reduce the number of comparisons required. Because blocking is redundant, the same pair of descriptions can co-occur in several blocks, and some blocks contain pairs that are very unlikely to match. Block processing removes these repeated and superfluous comparisons, narrowing down the candidates before the final comparison stage.
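As a rough illustration of narrowing down candidates, the sketch below assumes two simple heuristics: skip blocks above a size limit, and keep each pair only once even if it appears in several blocks. The function `candidate_pairs`, the `max_block_size` parameter, and the sample blocks are assumptions made for this example, not the survey's specific algorithms.

```python
from itertools import combinations


def candidate_pairs(blocks, max_block_size=100):
    """Return distinct candidate pairs, skipping blocks larger than max_block_size."""
    pairs = set()
    for ids in blocks.values():
        if len(ids) > max_block_size:
            # Very large blocks (e.g. for a common token) mostly add unlikely pairs.
            continue
        # Sorting gives each pair a canonical order, so the set removes repeats.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical blocks: blocking key -> ids of the descriptions in that block.
blocks = {
    "ann": {"e1", "e2"},
    "lee": {"e1", "e2"},
    "ray": {"e3", "e4"},
    "the": {"e1", "e2", "e3", "e4"},
}

pairs = candidate_pairs(blocks, max_block_size=3)
print(sorted(pairs))  # [('e1', 'e2'), ('e3', 'e4')]
```

Comparing every pair in every block would take 1 + 1 + 1 + 6 = 9 comparisons here; after the two heuristics only 2 remain.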

The comparison stage is where detailed analysis takes place to determine whether pairs of descriptions within the same block are indeed matches. This involves various methods, including rule-based matching, probabilistic models, and machine learning algorithms. The choice of comparison techniques depends on factors such as data quality, the nature of the data, and the desired level of accuracy.
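A rule-based comparison can be as small as a similarity function plus a threshold. The sketch below assumes token-level Jaccard similarity and a threshold of 0.5; both choices, along with the names `jaccard` and `is_match` and the sample data, are illustrative assumptions rather than a recommendation from the survey.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity: |intersection| / |union| of the token sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # Treat two empty descriptions as dissimilar.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_match(a: str, b: str, threshold: float = 0.5) -> bool:
    """Rule: declare a match when the similarity reaches the threshold."""
    return jaccard(a, b) >= threshold


# Hypothetical entity collection and candidate pairs left after block processing.
collection = {
    "e1": "Ann Lee Berlin",
    "e2": "Lee Ann Berlin",
    "e3": "Bob Ray Paris",
    "e4": "Robert Ray Paris",
}
candidates = [("e1", "e2"), ("e1", "e3"), ("e3", "e4")]

matches = [(a, b) for a, b in candidates if is_match(collection[a], collection[b])]
print(matches)  # [('e1', 'e2'), ('e3', 'e4')]
```

The threshold trades precision against recall: raising it above 0.5 would reject the "Bob Ray" / "Robert Ray" pair, which scores exactly 0.5 in this toy data.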

End-to-end entity resolution for big data is a complex process that requires careful consideration of each stage. The survey by Christophides et al. highlights the challenges and opportunities in this field, offering insights into the latest advancements and best practices. As big data continues to grow and become more interconnected, the need for robust and scalable entity resolution solutions will only increase. This survey serves as a valuable resource for researchers, practitioners, and anyone interested in understanding and improving entity resolution in the context of big data.
