Data Consolidation Using Large Language Models
From Rule-Based Matching to Semantic Intelligence
The consolidation of data from different sources is one of the classic challenges of data-driven systems. It becomes particularly demanding when two (or more) data providers supply similar, but not identical, information about the same entities, such as products, companies, events, locations, or people.
In our project, we faced exactly this situation: two external APIs provided structured event data from the entertainment domain with substantial content overlap. Many events appeared in both sources but differed in their exact naming, spelling variations, and additional metadata. Our goal was to consolidate this information automatically: accurately, scalably, and in a maintainable way.
For privacy reasons, we have modified the specific use case for this blog article and illustrate it using the following product-matching example from two suppliers: Supplier A lists the product "Smart LED Ceiling Lamp 30 × 120 cm". Supplier B offers a product called "Smart LED Panel rectangular 0.3 × 1.2 m warm white". If additional product attributes are available — such as the package weight, the manufacturer, or other specifications — these may indicate with high probability that both entries refer to the same lamp.
What initially started as a proof of concept is now running in production. After a successful evaluation phase, the system was integrated into an existing web application. From there, the customer triggers the consolidation process several times a day: the APIs of both providers are queried, the data is semantically matched using a Large Language Model (LLM), and potential duplicates are identified.
Initial Situation: Rule-Based Matching
Initially, the data matching was performed using linguistically based rules across different semantic levels, such as product descriptions or manufacturer names. This approach combined several techniques, including string normalization (e.g., lowercasing, stemming, and removing special characters), curated synonym lists, fuzzy matching with defined thresholds, and weighted similarity metrics.
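To make these techniques concrete, here is a minimal sketch of such a rule-based matcher using only Python's standard library. The synonym list, normalization steps, and threshold are illustrative placeholders, not our production rules:

```python
import difflib
import re

# Hypothetical synonym substitutions; real lists were curated per domain.
SYNONYMS = {"lamp": "panel", "rectangular": ""}

def normalize(name: str) -> str:
    """Lowercase, strip special characters, and apply synonym substitutions."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    tokens = [SYNONYMS.get(t, t) for t in name.split()]
    return " ".join(t for t in tokens if t)

def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy string comparison against a fixed similarity threshold."""
    ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold

# The unit difference (cm vs. m) defeats the syntactic comparison,
# even though both strings describe the same product:
print(fuzzy_match("Smart LED Ceiling Lamp 30 x 120 cm",
                  "Smart LED Panel rectangular 0.3 x 1.2 m warm white"))
```

For the product example from the introduction, this matcher fails precisely because "30 × 120 cm" and "0.3 × 1.2 m" are syntactically dissimilar, which is the kind of semantic relationship the next section addresses.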
While such approaches are generally well established, they quickly reach their limits — and in our specific use case, the matching rate fell short of our requirements, resulting in substantial manual post-processing effort.
As data diversity increases, maintenance overhead rises significantly. Rule sets grow more complex and harder to maintain, while new edge cases require ongoing adjustments. At the same time, semantic relationships that extend beyond purely syntactic similarity remain largely unaddressed.
Semantic Matching with the OpenAI API
Large Language Models now make it possible to capture semantic similarity, independent of exact word matches.
To implement the matching process, we chose to use the OpenAI API. It enables the deployment of high-performance LLMs without requiring the integration of additional infrastructure into the client’s tech stack or having to handle model training and maintenance. At the same time, it allows for fast and scalable integration into existing systems. Thanks to structured outputs and a pay-per-use pricing model, the solution is well suited for production environments, offering high flexibility and predictable operating costs.
Following a preliminary data preparation step, the data is transmitted to the OpenAI API. Along with the processed data, we send a carefully designed prompt that instructs the LLM to analyze the data semantically and determine which records correspond to each other and which do not. We also define the desired output format within the prompt to ensure structured downstream processing of the results.
One of our key learnings is that the quality of the matching process depends heavily on the precision of the prompt. Within the prompt, we define primary, secondary, and tertiary matching criteria as well as constraints and blockers to reduce false positives while keeping false negatives as low as possible. More specifically, the matching logic prioritizes the similarity of (product) names as the primary criterion and uses metadata — such as manufacturers or product attributes — as supporting signals (secondary and tertiary matching criteria). It also ignores certain differences or tolerable deviations in naming or technical details, some of which result from incorrect specifications. In addition, we enrich the prompt with few-shot examples that illustrate the desired decision-making behavior and define clear output instructions via Structured Outputs to ensure consistent, machine-readable results based on a JSON schema.
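In condensed form, the prompt scaffold and the JSON schema for Structured Outputs look roughly like this. This is a simplified sketch: the criteria wording, few-shot examples, and field names of the real prompt differ and are shown here as placeholders:

```python
import json

# Hypothetical JSON schema for Structured Outputs; field names are illustrative.
MATCH_SCHEMA = {
    "name": "match_result",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "matches": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "id_provider_a": {"type": "string"},
                        "id_provider_b": {"type": "string"},
                        "reason": {"type": "string"},
                    },
                    "required": ["id_provider_a", "id_provider_b", "reason"],
                    "additionalProperties": False,
                },
            }
        },
        "required": ["matches"],
        "additionalProperties": False,
    },
}

SYSTEM_PROMPT = """You match product records from two suppliers.
Primary criterion: similarity of the product names.
Secondary/tertiary criteria: manufacturer and product attributes as supporting signals.
Tolerate unit conversions and minor spelling differences; never match across manufacturers.
Return only pairs that refer to the same physical product."""

def build_messages(batch_a: list, batch_b: list) -> list:
    """Assemble the chat messages sent to the OpenAI API for one batch."""
    user = json.dumps({"supplier_a": batch_a, "supplier_b": batch_b},
                      ensure_ascii=False)
    return [{"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user}]
```

The schema is then passed to the Chat Completions endpoint via `response_format={"type": "json_schema", "json_schema": MATCH_SCHEMA}`, which constrains the model to emit exactly this structure.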
Since the volume of data to be checked is very large, we partition the dataset into batches based on clearly defined codes. These codes (e.g., ISO country codes) are highly standardized and therefore unlikely to contain errors, allowing discrepancies between providers to be ruled out at this stage. This reduces the comparison space and ensures that only plausible records are matched with each other.
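The partitioning step itself is straightforward; a minimal sketch (the field name `country_code` is a hypothetical stand-in for the actual specification codes):

```python
from collections import defaultdict

def partition_by_code(records, code_field="country_code"):
    """Group records by a standardized code (e.g., an ISO 3166 country code)
    so that only plausible candidates share a comparison batch."""
    batches = defaultdict(list)
    for record in records:
        batches[record[code_field]].append(record)
    return dict(batches)
```

Because only records sharing a code are compared, the quadratic comparison space shrinks to the sum of much smaller per-batch spaces, which also keeps each LLM request within a manageable context size.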
When evaluating the matching results, we observed that despite a tightly defined prompt, some assignments were clearly incorrect. These cases can likely be attributed to the fact that LLMs, in rare situations, tend to draw inaccurate conclusions — a phenomenon often referred to as "hallucination".
By introducing a second OpenAI API call, in which the previously consolidated results are re-evaluated using an "LLM-as-a-judge" approach and supplemented with a confidence score, we were able to reduce this behavior, or at least make it transparent.
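Downstream of the judge call, the confidence scores can be used to route each match. The following sketch shows one plausible triage scheme; the judge prompt wording and the thresholds are illustrative, not our production values:

```python
JUDGE_PROMPT = """You are reviewing proposed product matches.
For each pair, judge whether both entries refer to the same physical
product and return a confidence score between 0.0 and 1.0."""

def triage(judged_matches, accept_at=0.9, review_at=0.6):
    """Route judged matches by confidence score:
    auto-accept, flag for manual review, or reject outright."""
    accepted, review, rejected = [], [], []
    for match in judged_matches:
        if match["confidence"] >= accept_at:
            accepted.append(match)
        elif match["confidence"] >= review_at:
            review.append(match)
        else:
            rejected.append(match)
    return accepted, review, rejected
```

Matches in the middle band are exactly the ones surfaced for manual inspection in the web application, so the score does double duty: it filters hallucinated pairs and prioritizes the remaining review effort.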
Overall, the matching process has evolved from a rule-based procedure to a semantically grounded decision logic that reliably achieves high matching rates with strong accuracy.
Architecture of the Consolidation System
The system is implemented as a FastAPI application and is containerized and deployed on Kubernetes in AWS. In simplified terms, the process consists of the following steps:
- The APIs of Provider 1 and Provider 2 are queried periodically.
- Incoming data is normalized and harmonized.
- The matching process is triggered via a button in the web application.
- The FastAPI application partitions the normalized data into batches based on reliable specification codes and submits them to the OpenAI API.
- Within each batch, the LLM identifies potential matches.
- In a second API call, the identified matches are re-evaluated using an "LLM-as-a-judge" approach and enriched with a confidence score.
- The results are displayed in the web application and augmented with additional information.
- Manual corrections can be made directly within the web interface.
- The fully consolidated and validated data is then made available for further processing.
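The steps above can be sketched as a single orchestration function. This is a deliberately simplified view of the FastAPI application's flow: every step is injected as a callable, so the provider fetches and LLM calls (stubbed here) can be swapped for the real implementations:

```python
def run_consolidation(fetch_a, fetch_b, normalize, partition, match_batch, judge):
    """End-to-end sketch of the consolidation flow:
    fetch -> normalize -> partition into batches -> LLM matching -> LLM judge."""
    records_a = [normalize(r) for r in fetch_a()]
    records_b = [normalize(r) for r in fetch_b()]
    batches_a = partition(records_a)
    batches_b = partition(records_b)

    matches = []
    # Only batches whose specification code exists on both sides are compared.
    for code in batches_a.keys() & batches_b.keys():
        matches.extend(match_batch(batches_a[code], batches_b[code]))

    # Second LLM call: re-evaluate each match and attach a confidence score.
    return [judge(m) for m in matches]
```

In the real system, this function body is spread across the FastAPI endpoints and the periodic provider queries; the sketch only shows the data flow between the stages.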
Final Remarks
Despite its high degree of automation, the system is not designed as a black box. Within the web application, the raw data from both providers remains fully transparent and accessible at all times. This transparency is essential, as even with a very low error rate, the matching system is not perfect.
False-positive matches are relatively easy to detect and correct in practice. The confidence score helps to identify potentially incorrect assignments in the web application and, if necessary, separate them manually. Identifying unmatched data, however, is significantly more demanding. Such "false negatives" require actively searching for potential duplicates and therefore entail a higher level of manual review effort. With the LLM-based approach, the number of undetected duplicates has been reduced significantly compared to the previous rule-based method.
The current implementation has a matching runtime of several minutes. For the specific use case, this performance is sufficient, though there remains potential for optimization. A substantial performance improvement could be achieved through the use of intelligent caching strategies. By caching previously evaluated matches — such as matched manufacturers or technical attributes — one could reduce both the number of LLM calls to the OpenAI API and the amount of data that needs to be transmitted. This would lower latency and costs alike. However, such an approach would introduce additional maintenance overhead for the caching infrastructure. In particular, robust cache invalidation strategies would need to be implemented to ensure consistency and up-to-date results when data changes. Here, the classic trade-off between performance optimization and system complexity becomes evident.
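Such a cache could be as simple as a key-value store keyed by a stable hash of both normalized records. The sketch below is hypothetical and deliberately omits the hard part, invalidation when provider data changes:

```python
import hashlib
import json

class MatchCache:
    """Hypothetical cache for previously judged record pairs.
    In production, an external store with TTLs and explicit invalidation
    on data change would replace this in-memory dict."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(record_a: dict, record_b: dict) -> str:
        """Stable key: hash of both normalized records, order-independent fields."""
        payload = json.dumps([record_a, record_b], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, record_a, record_b):
        """Return the cached verdict, or None on a cache miss."""
        return self._store.get(self.key(record_a, record_b))

    def put(self, record_a, record_b, verdict):
        self._store[self.key(record_a, record_b)] = verdict
```

A cache hit would skip both OpenAI calls for that pair entirely; every miss still goes through the full matching and judging path, which is where the latency and cost savings would come from.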