
AWS Glue FindMatches for Incremental Record Matching

Perform incremental record matching using AWS Glue FindMatches

Published Feb 6, 2024

Record matching is the task of finding records in a dataset that refer to the same entity, such as duplicate products in a product catalog or duplicate customer records. Incremental record matching allows you to find duplicate records between a base set and a new, typically smaller, set.
AWS Glue FindMatches lets customers find duplicates in their datasets, even when there are mismatched fields due to incomplete information, spelling mistakes, or different standardizations of information.
AWS Glue FindMatches enables organizations to match existing data with incremental data in various use cases: matching existing products in a catalog with new products to be added, or matching customer records from various sources even when fields differ due to misspellings or incomplete information. FindMatches can also improve fraud detection by identifying duplicate customer accounts and determining whether a newly created account matches an already existing fraudulent account.
In this blog, you'll learn how to utilize AWS Glue FindMatches to match incremental data in a separate database or table to an existing database or table using a Glue visual ETL job.

This solution involves the following:
  1. Upload the base data into Amazon S3 (in Parquet or CSV format).
  2. Upload the incremental data to Amazon S3 as well.
  3. Create a table in the AWS Glue Data Catalog for the base data by creating a crawler to crawl the S3 bucket where the base data is stored.
  4. Create another table in the AWS Glue Data Catalog for the incremental data by creating a crawler to crawl the S3 bucket where the incremental data is stored.
  5. Create a label set for your data and store it in an S3 bucket.
  6. Create an ML transform using the base data and teach the transform using the label set.
  7. Create and run an initial AWS Glue ETL job that uses the ML transform from the previous step and the base data, and store the results in an S3 bucket. You can name the output Cleaned Base Data.
  8. Create and run an incremental AWS Glue ETL job that uses the same ML transform, the cleaned base data, and the incremental data. Store the results of this incremental job in an S3 bucket, and in an AWS Glue Data Catalog table if you plan on using Amazon Athena to explore the results.
  9. Query the results in Athena.

  1. You have already uploaded the base and incremental data into an S3 bucket and created two tables for them in the AWS Glue Data Catalog. This takes care of steps 1 through 4.
  2. You have an IAM role that allows AWS Glue to access S3. Create a role with at least these two policies: AWSGlueConsoleFullAccess and an S3 policy that grants access to your buckets.

This is a screenshot of the sample data that will be used for this blog. This dataset describes business information obtained by crawling various online sources.
Dataset

A label set can be created manually, or you can allow AWS Glue to create the label set for you. You can use the AWS Glue labeling guide to learn how to create labeled data.
Note: Label data should contain both matched and unmatched examples to fully teach the algorithm.
This is a screenshot of what label data looks like. The first labeling group spans rows 2 to 31; it contains no duplicates, since all of its label values are unique. The second labeling group spans rows 32 to 47, and in this one, rows 39 and 42 share the same label ("4"). Looking at the rest of those records, the "primary_name" column for both businesses is "Go Travel."
labeled data
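As a local sanity check on a label set, the sketch below (plain Python over a hypothetical miniature label file; the `labeling_set_id` and `label` column names follow the FindMatches label format, while `primary_name` and the sample rows are made up for illustration) groups rows by labeling group and label, and reports labels used more than once, i.e. the pairs the label set marks as matches:

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature label file: rows in the same labeling_set_id
# that share a label are treated as matches by FindMatches.
LABELS_CSV = """labeling_set_id,label,primary_name
set1,1,Acme Hardware
set1,2,Beta Travel
set2,4,Go Travel
set2,4,Go Travel Inc
set2,5,Sunrise Cafe
"""

def find_matched_pairs(csv_text):
    """Return {(labeling_set_id, label): [names]} for labels used more than once."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[(row["labeling_set_id"], row["label"])].append(row["primary_name"])
    return {key: names for key, names in groups.items() if len(names) > 1}

print(find_matched_pairs(LABELS_CSV))
# {('set2', '4'): ['Go Travel', 'Go Travel Inc']}
```

A label set where this function returns nothing contains only unique records, which (as noted above) would leave the algorithm with no positive examples to learn from.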

To create an ML transform and teach it, use the following steps:
  1. On the AWS Glue console, under Data Integration and ETL, select Record matching.
  2. Click Create Transform.
  3. Under Set Transform Properties, add a name for the transform and select the IAM role created earlier.
  4. Click Next, and under Choose Table and Primary Key, select the AWS Glue catalog database where your base data table is stored.
  5. Select the base data table.
  6. Select a primary key for the base data table.
  7. Click Next. Under Choose Tuning Options, for Recall vs. Precision, select Balanced (0.5): an even tradeoff between recall and precision. For Lower cost vs. accuracy, select Favor accuracy (0.9): use more resources but potentially find more matches.
  8. Click Next, and create the ML transform.
ML Transform 1
ML Transform 2
ML Transform 3

After creating the ML transform, the next step is to teach the newly created transform.
  1. Select the newly created ML transform and click Train Model.
  2. Under Train the transform using labels, under Labeling, select I have labels to use the labels you previously created.
  3. Under Upload labels from S3, add the path to the S3 bucket where you have your labels.
  4. Select Overwrite my existing labels.
  5. Click Upload labels from S3.
  6. Click Next.
  7. Under Estimate quality metrics, click Estimate transform quality.
  8. After the quality estimate is complete, click Close to see the quality of the estimate. When your labels are uploaded, FindMatches uses them to learn patterns in your data. To do this, FindMatches uses most, but not all, of your data: it keeps some of your labels hidden during training. This is called the "holdout set." After training is complete, FindMatches uses the transform to find duplicates in this holdout set. Since you have already labeled whether each pair is a match, FindMatches can compare its own answers to your labels and estimate a score. FindMatches describes the transform quality using four numbers:
    • Area under the Precision-Recall Curve: A single number summarizing the performance of the transform.
    • Precision: When your transform predicts a match, how often is it correct?
    • Recall upper limit: For an actual match, how often does your transform predict a match?
    • F1: Indicates the transform's accuracy as the harmonic mean of precision and recall.
If recall or precision is close to 0, then this indicates that the transform is either biased to never find any matches or to consider everything a match, respectively. This typically indicates that your labeling set contains only unique records or only matching records, respectively.
Teaching transform 1
Teaching transform 2
Teaching transform 3
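The four quality numbers relate simply: F1 is the harmonic mean of precision and recall, so it collapses toward zero whenever either metric does. A minimal sketch of the calculation:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, as reported by FindMatches."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A transform biased toward calling everything a match might score
# recall 1.0 but precision 0.1 -- the harmonic mean punishes the imbalance.
print(round(f1_score(0.1, 1.0), 3))  # 0.182
print(f1_score(0.0, 0.0))            # 0.0 (degenerate labeling set)
```

This is why a labeling set with only unique records, or only matching records, shows up as a near-zero metric rather than an average-looking score.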

  1. On the AWS Glue console, in ETL jobs, select Visual ETL.
  2. Click on Create a Visual ETL Job.
  3. Under Job Details, add a name for the job and the IAM role previously created. Select Glue version 2.0. Leave every other setting as the default and save.
  4. Under source, select AWS Glue Data Catalog; in the properties, choose the database and table for your base data. Alternatively, select Amazon S3 as the data source and add the path of the S3 bucket. You can name this node Base Data.
  5. Under actions, select Record Matching and add the transform ID of the ML transform created previously. You can name this node Base Data Record Matching.
  6. Under target, select Amazon S3 bucket, and add the path to the S3 bucket where you want to save the output results.
  7. Save and run the job.
  8. Your results will have an additional column, "match_id", used to identify unique matches. You can save this result as "cleaned initial base data." A sample of the matched base data set is attached below.
Initial Job 1
Initial Job 2
Initial Job 3
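To see how the match_id column encodes the results, here is a small sketch (plain Python; the rows and the "primary_name" field are hypothetical sample data) that groups job output records into match clusters, where rows sharing a match_id are the predicted duplicates:

```python
from collections import defaultdict

# Hypothetical rows from the initial job output; FindMatches adds a
# match_id column, and rows sharing a match_id are predicted duplicates.
rows = [
    {"match_id": 1, "primary_name": "Go Travel"},
    {"match_id": 1, "primary_name": "Go Travel Inc"},
    {"match_id": 2, "primary_name": "Sunrise Cafe"},
]

def cluster_by_match_id(records):
    """Group records into match clusters keyed by match_id."""
    clusters = defaultdict(list)
    for record in records:
        clusters[record["match_id"]].append(record["primary_name"])
    return dict(clusters)

print(cluster_by_match_id(rows))
# {1: ['Go Travel', 'Go Travel Inc'], 2: ['Sunrise Cafe']}
```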

  1. On the AWS Glue console, in ETL jobs, select Visual ETL.
  2. Click on Create a Visual ETL Job.
  3. Under Job Details, add a name for the job and the IAM role previously created. Select Glue version 2.0. Leave every other setting as the default and save.
  4. Under source, select AWS Glue Data Catalog; on the properties, choose the database and table for your cleaned base data. You can also just select Amazon S3 as the data source and add the path of the cleaned base data S3 bucket. You can name this "cleaned base data."
  5. Add a second source. Under source, select AWS Glue Data Catalog; on the properties, choose the database and table for your incremental data. You can also just select Amazon S3 as the data source and add the path of the incremental data S3 bucket. You can name this incremental data.
  6. Under action, select Custom Transform. Under Node parents, make sure you select the cleaned base data first, before selecting the incremental data.
  7. Still in the custom transform properties, copy and paste the code snippet below into the code block. (This snippet can also be found in the documentation for the FindMatches incremental transformation.) Replace transformId with the ID of the ML transform used in the previous job. The snippet also enables computation of a match confidence score.
    from awsglueml.transforms import FindIncrementalMatches

    def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
        dfs = list(dfc.values())
        dynf = dfs[0]      # cleaned base data (first node parent)
        inc_dynf = dfs[1]  # incremental data (second node parent)
        findmatches = FindIncrementalMatches.apply(
            existingFrame=dynf,
            incrementalFrame=inc_dynf,
            transformId="<your id>",
            computeMatchConfidenceScores=True,
        )
        return DynamicFrameCollection({"FindMatches": findmatches}, glueContext)
  8. Under actions, select "Select From Collections," and under properties, leave the frame index as 0. Node parents should be the custom transform.
  9. Under actions, select "SQL Query." Under properties, the input source should be "Select From Collections." You can change the SQL alias or leave it as is. Under the SQL query, add a query so the job outputs only records that have matches; in other words, this step outputs only the duplicates that FindMatches has discovered.
    SELECT *
    FROM myDataSource
    WHERE match_id IN (
        SELECT match_id
        FROM myDataSource
        GROUP BY match_id
        HAVING COUNT(*) > 1
    )
  10. Under target, select Amazon S3 bucket, and add the S3 bucket path where you want to store the output. Under data catalog options, you can opt for a table to be created in the Data Catalog.
  11. Save and run the job. The output results will have two additional columns, "match_id" and "match_confidence_score."


The results of the output can be queried using Athena. As the table below illustrates, some records share the same match_id. This indicates that these records are a match. The algorithm assigns a confidence score to each record that shares the same "match_id," indicating the level of certainty it has in the presence of a match.
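Since Athena simply runs SQL over the results table, the duplicate-finding query can be sketched locally with SQLite (Python's stdlib `sqlite3`; the table rows, names, and confidence scores below are hypothetical sample data, not actual job output):

```python
import sqlite3

# In-memory stand-in for the results table that Athena would query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (match_id INTEGER, primary_name TEXT, match_confidence_score REAL)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        (1, "Go Travel", 0.97),
        (1, "Go Travel Inc", 0.97),
        (2, "Sunrise Cafe", 0.55),
    ],
)

# Same shape as the job's SQL step: keep only match_ids that appear
# more than once, i.e. the records FindMatches flagged as duplicates.
dupes = conn.execute(
    """
    SELECT primary_name, match_confidence_score
    FROM results
    WHERE match_id IN (
        SELECT match_id FROM results GROUP BY match_id HAVING COUNT(*) > 1
    )
    ORDER BY primary_name
    """
).fetchall()
print(dupes)  # [('Go Travel', 0.97), ('Go Travel Inc', 0.97)]
```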

In this blog article, we covered how to use AWS Glue FindMatches for incremental record matching. To get the best possible outcomes, it is essential that the input data be relatively clean. AWS also provides another record matching option, AWS Entity Resolution, which can be used for incremental record matching.