Editing article

Title

Summary

Content

It is my dream to accelerate some of the tedious MDM processes with Generative AI, Embeddings, Vector Search and more… Read on.<figure><img alt="" src="https://cdn-images-1.medium.com/max/740/1*zccJwrLqwmYQ38V2yeCtRw.png" /><figcaption>Image showing a bike station with docked bikes to represent the use case</figcaption></figure><h3>Introduction</h3>At its core, Master Data Management (MDM) is about creating a single, reliable source of truth for your organization’s most critical data. Think of MDM as a meticulously curated library where every book (data point) is correctly labeled, up-to-date, and easy to find. Implementing robust MDM has always been a strategic necessity, but it often comes with complexities and resource demands. This is where the transformative power of Generative AI, particularly models like Gemini 1.0 Pro, Gemini 1.0 Pro Vision, Gemini 1.5 Pro, comes into play.<blockquote>In this article, we will demonstrate how Gemini 1.0 Pro simplifies master data management applications like enrichment and deduplication, for the citibike_stations data available in the BigQuery public dataset. For this, we will use: 1. BigQuery public dataset bigquery-public-data.new_york_citibike. 2. Gemini Function Calling (a Java Cloud Function that gets the address information using the reverse Geocoding API for the coordinates available with the citibike_stations data) that we have already created in <a href="https://medium.com/google-cloud/using-gemini-function-calling-in-java-for-deterministic-generative-ai-responses-4c86a5ab80a9">one</a> of our previous articles. 3. Vertex AI Embeddings API and Vector Search in BigQuery to identify duplicates.</blockquote><h3>Master Data Management (MDM): The Foundation for Data-Driven Decisions</h3>The key elements of master data are business entities (like customers, products, suppliers, locations, etc. which are the nouns that your business revolves around), identifiers (Unique identifiers ensure each entity is distinct and traceable across systems), attributes (These are the characteristics that describe each entity e.g. a customer’s address, a product’s price etc.). I am going to take a little space here with it because understanding the importance of MDM is paramount in the generative AI era more than ever. Why MDM Matters:<ul><li>Adds clarity to data:</li></ul>Without MDM, businesses struggle with fragmented data scattered across various systems. This leads to inconsistencies, duplicates, and sometimes ambiguous understanding of relationships between data points (e.g. is “Abirami Sukumaran” in one system the same customer as “Abirami S.” in another?).<ul><li>Helps in decision making:</li></ul>Accurate, unified master data is the foundation of informed business decisions. It answers questions like — Who are our most valuable customers? Which products are underperforming? What suppliers provide the best value?<ul><li>Serves data as a Product:</li></ul>MDM is essential for treating data as a valuable asset. Clean, integrated, and up-to-date master data forms the basis for building insightful data products that drive business outcomes.Let me paint a picture to help you understand master data better by comparing it with transactional data: Transactional data captures individual events (a purchase, a shipment etc.). Master data provides the context for those events by defining the entities involved. For example, a sales transaction links to master data for the customer, product, and salesperson. Without MDM, data tends to remain in silos hindering a holistic view, prone to quality issues and relatively heavy resource-spend on data reconciliation and consolidation on demand.<h3>The powerful synergy of MDM and Generative AI</h3>Below are some of the industry grade applications and advantages of empowering and simplifying MDM with Generative AI:<ul><li>Automated Data Cleansing: Gemini 1.0 Pro can intelligently identify and rectify inconsistencies, duplicates, and errors in master data, significantly reducing manual effort.</li><li>Intelligent Matching: Leveraging its language understanding capabilities, Generative AI capabilities, embeddings and Vector Search, come together to accurately match and merge records from disparate sources, even with variations in formatting or naming conventions.</li><li>Contextual Enrichment: Generative AI models can generate missing attributes or provide additional context to master data, enhancing its value for analysis and decision-making.</li><li>Adaptive Learning: You can leverage fine tuning and reinforcement learning to enable generative AI to continuously improve its understanding of your business entities and relationships, leading to more accurate and efficient MDM over time.</li></ul><h3>High Level Flow Diagram</h3>This diagram represents the flow of data and steps involved in the implementation.<figure><img alt="" src="https://cdn-images-1.medium.com/max/1001/1*pMW7YoxbqYbJjYXgucff9g.png" /><figcaption>High level flow of the use case</figcaption></figure><h3>Demo</h3>The steps loosely are as follows:<ol><li>Create a BigQuery dataset for the use case. Create a landing table with data from the public dataset table bigquery-public-data.new_york_citibike.citibike_stations.</li><li>Make sure the Cloud Function that includes Gemini Function Calling for address standardization is deployed.</li><li>Store the enriched address data in the landing tables (from 2 sources for demo purpose).</li><li>Invoke Vertex AI Embeddings API from BigQuery on the address data.</li><li>Use BigQuery Vector Search to identify duplicate records.</li></ol><h3>Setup</h3><ol><li>In the <a href="https://console.cloud.google.com/">Google Cloud Console</a>, on the project selector page, select or create a Google Cloud <a href="https://cloud.google.com/resource-manager/docs/creating-managing-projects">project</a>.</li><li>Make sure that billing is enabled for your Cloud project. Learn how to <a href="https://cloud.google.com/billing/docs/how-to/verify-billing-enabled">check if billing is enabled on a project</a>.</li><li>You will use Cloud Shell, a command-line environment running in Google Cloud that comes preloaded with bq. From the Cloud Console, click Activate Cloud Shell on the top right corner.</li><li>Enable necessary APIs for this implementation if you haven’t already. Alternative to the gcloud command is through the console using this <a href="https://pantheon.corp.google.com/apis/enableflow?apiid=cloudfunctions.googleapis.com,run.googleapis.com,bigquery.googleapis.com,bigqueryconnection.googleapis.com,aiplatform.googleapis.com">link</a>.</li></ol><h3>Step 1: Create BigQuery Dataset and External Connection</h3>BigQuery <a href="https://cloud.google.com/bigquery/docs/datasets-intro">dataset</a> is a container for all the tables and objects for your application. BigQuery <a href="https://cloud.google.com/bigquery/docs/remote-functions#create_a_connection">connection</a> is used to interact with your Cloud Function. In order to create a remote function, you must create a BigQuery connection. Let’s begin with creating the dataset and the connection.<ol><li>From the Google Cloud Console, go to the <a href="https://console.cloud.google.com/bigquery">BigQuery page</a> and click the 3 vertical dots icon next to your project id. From the list of options, select “Create data set”.</li><li>In Create data set pop up, enter the data set ID “mdm_gemini” with the region set to the default value “US (multiple regions…)”</li><li><a href="https://cloud.google.com/bigquery/docs/remote-functions#create_a_connection">BigLake Connection</a> allows us to connect the external data source while retaining fine-grained BigQuery access control and security, which in our case is the Vertex AI Gemini Pro API. We will use this connection to access the model from BigQuery via the Cloud Function. Follow steps below to create the BigLake Connection:</li></ol>a. Click ADD on the Explorer pane of the BigQuery page:<figure><img alt="" src="https://cdn-images-1.medium.com/max/716/0*mqX5rzEHWjkzEQIE" /><figcaption>ADD button to create connection</figcaption></figure>b. Click Connections to external data sources in the sources page.c. Enter external data source details as below in the pop up that shows up and click CREATE CONNECTION:<figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*b31hiO4ynbDLRrXWEFF4aQ.png" /><figcaption>Create Connection</figcaption></figure>d. Once the connection is created, go to the connection configuration page and copy the Service account ID for access provisioning:<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xh8uToOATp-TmC5O" /><figcaption>Connection Info</figcaption></figure>e. Open <a href="https://console.cloud.google.com/iam-admin/iam">IAM and admin</a> page, click GRANT ACCESS, enter the service account id in the new principals tab and roles as shown below and click SAVE.<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fCExrL0trAtqirew" /><figcaption>Grant Access</figcaption></figure><h3>Step 2: Deploy Function Calling (Java Cloud Function)</h3><ol><li>Clone the github <a href="https://github.com/AbiramiSukumaran/GeminiFunctionCalling">repo</a> from your Cloud Shell Terminal and change the YOUR_API_KEY and YOUR_PROJECT_ID with your values</li></ol><pre>git clone https://github.com/AbiramiSukumaran/GeminiFunctionCalling</pre>2. Go to Cloud Shell terminal, navigate into the newly cloned project directory “GeminiFunctionCalling” and execute the below statement build and deploy the Cloud Function:<pre>gcloud functions deploy gemini-fn-calling --gen2 --region=us-central1 --runtime=java11 --source=. --entry-point=cloudcode.helloworld.HelloWorld --trigger-http</pre>The result for this deploy command would be a REST URL in the format as below :<a href="https://us-central1-abis-345004.cloudfunctions.net/gemini-bq-fn">https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-fn-calling</a>3. Test this Cloud Function by running the following command from the terminal:<pre>gcloud functions call gemini-fn-calling - region=us-central1 - gen2 - data &#39;{&quot;calls&quot;:[[&quot;40.714224,-73.961452&quot;]]}&#39;</pre>Response for a random sample prompt:<pre> &#39;{&quot;replies&quot;:[&quot;{ \&quot;DOOR_NUMBER\&quot;: \&quot;277\&quot;, \&quot;STREET_ADDRESS\&quot;: \&quot;Bedford Ave\&quot;, \&quot;AREA\&quot;: null, \&quot;CITY\&quot;: \&quot;Brooklyn\&quot;, \&quot;TOWN\&quot;: null, \&quot;COUNTY\&quot;: \&quot;Kings County\&quot;, \&quot;STATE\&quot;: \&quot;NY\&quot;, \&quot;COUNTRY\&quot;: \&quot;USA\&quot;, \&quot;ZIPCODE\&quot;: \&quot;11211\&quot;, \&quot;LANDMARK\&quot;: null}}```&quot;]}&#39;</pre>The request and response parameters of this Cloud Function are implemented in a way that is compatible with BigQuery’s remote function invocation. It can be directly consumed from BigQuery data in-place. It means that if your data input (lat and long data) lives in BigQuery then you can call the remote function on the data and get the function response which can be stored or processed within BigQuery directly.4. Run the following DDL from BigQuery to create a remote function that invokes this deployed Cloud Function:<pre> CREATE OR REPLACE FUNCTION `mdm_gemini.MDM_GEMINI` (latlng STRING) RETURNS STRING REMOTE WITH CONNECTION `us.gemini-bq-conn` OPTIONS ( endpoint = &#39;https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-fn-calling&#39;, max_batching_rows = 1 );</pre>WORKAROUNDIf you do not have the necessary key or Cloud Function deployed, feel free to jump to the landing table directly by exporting the data from the csv into your new BigQuery dataset mdm_gemini using the following command in the Cloud Shell Terminal. Remember to download the file <a href="https://github.com/AbiramiSukumaran/MDMwithGemini/blob/main/CITIBIKE_STATIONS.csv">CITIBIKE_STATIONS.csv</a> from the <a href="https://github.com/AbiramiSukumaran/MDMwithGemini">repo</a> into your Cloud Shell project folder and navigate into that folder before executing the following command:<pre>bq load --source_format=CSV --skip_leading_rows=1 mdm_gemini.CITIBIKE_STATIONS ./CITIBIKE_STATIONS.csv \ name:string,latlng:string,capacity:numeric,num_bikes_available:numeric,num_docks_available:numeric,last_reported:timestamp,full_address_string:string</pre><h3>Step 3: Create Table and Enrich Address data</h3>a. If you have used the WORKAROUND approach from the last step, you can skip this step, since you have already created the table there. If NOT, proceed to the running the following DDL in BigQuery SQL Editor:<pre>CREATE TABLE mdm_gemini.CITIBIKE_STATIONS as ( select name, latitude || &#39;,&#39; || longitude as latlong, capacity, num_bikes_available, num_docks_available,last_reported, &#39;&#39; as full_address_string from bigquery-public-data.new_york_citibike.citibike_stations) ;</pre>Let’s enrich the address data by invoking the remote function on the latitude and longitude coordinates available in the table. Please note that we will update this only for data reported for the year 2024 and where number of bikes available &gt; 0 and capacity &gt; 100:<pre>update `mdm_gemini.CITIBIKE_STATIONS` set full_address_string = `mdm_gemini.MDM_GEMINI`(latlong) where EXTRACT(YEAR FROM last_reported) = 2024 and num_bikes_available &gt; 0 and capacity &gt; 100;</pre>b. Do not skip this step even if you used the WORKAROUND approach in the last step. This is where we will create a second source of bike station location data for the purpose of this use case. Afterall, MDM is bringing data from multiple sources together and identifying the golden truth.Run the following DDLs in BigQuery SQL Editor for creating the second source of location data with 2 records in it. Let’s name this table mdm_gemini.CITIBIKE_STATIONS_SOURCE2 and insert 2 records into it:<pre>CREATE TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE2 (name STRING(55), address STRING(1000), embeddings_src ARRAY&lt;FLOAT64&gt;); insert into mdm_gemini.CITIBIKE_STATIONS_SOURCE2 VALUES (&#39;Location broadway and 29&#39;,&#39;{ &quot;DOOR_NUMBER&quot;: &quot;1593&quot;, &quot;STREET_ADDRESS&quot;: &quot;Broadway&quot;, &quot;AREA&quot;: null, &quot;CITY&quot;: &quot;New York&quot;, &quot;TOWN&quot;: null, &quot;COUNTY&quot;: &quot;New York County&quot;, &quot;STATE&quot;: &quot;NY&quot;, &quot;COUNTRY&quot;: &quot;USA&quot;, &quot;ZIPCODE&quot;: &quot;10019&quot;, &quot;LANDMARK&quot;: null}&#39;, null); insert into mdm_gemini.CITIBIKE_STATIONS_SOURCE2 VALUES (&#39;Allen St &amp; Hester&#39;,&#39;{ &quot;DOOR_NUMBER&quot;: &quot;36&quot;, &quot;STREET_ADDRESS&quot;: &quot;Allen St&quot;, &quot;AREA&quot;: null, &quot;CITY&quot;: &quot;New York&quot;, &quot;TOWN&quot;: null, &quot;COUNTY&quot;: &quot;New York County&quot;, &quot;STATE&quot;: &quot;NY&quot;, &quot;COUNTRY&quot;: &quot;USA&quot;, &quot;ZIPCODE&quot;: &quot;10002&quot;, &quot;LANDMARK&quot;: null}&#39;, null);</pre><h3>Step 4: Generate Embeddings for Address Data</h3>Embeddings are high-dimensional numerical vectors that represent a given entity, like a piece of text or an audio file. Machine learning (ML) models use embeddings to encode semantics about such entities to make it easier to reason about and compare them. For example, a common operation in clustering, classification, and recommendation models is to measure the distance between vectors in an embedding space to find items that are most semantically similar. The Vertex AI text-embeddings API lets you create a text embedding using Generative AI on Vertex AI. Text embeddings are numerical representations of text that capture relationships between words and phrases. Read more about Vertex AI Text Embeddings <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings">here</a>.Run the below DDL to create a remote model for Vertex AI text embeddings API:<pre>CREATE OR REPLACE MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB` REMOTE WITH CONNECTION `us.gemini-bq-conn` OPTIONS (ENDPOINT = &#39;textembedding-gecko@latest&#39;);</pre>Now that the remote embeddings model is ready, let’s generate embeddings for the first source and store it in a table. You can store the embeddings result field in the same mdm_gemini.CITIBIKE_STATIONS table as before, but I am choosing to create a new one for clarity:<pre>CREATE TABLE `mdm_gemini.CITIBIKE_STATIONS_SOURCE1` AS ( SELECT * FROM ML.GENERATE_EMBEDDING( MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`, ( select name, full_address_string as content from `mdm_gemini.CITIBIKE_STATIONS` where full_address_string is not null ) ) );</pre>Let’s generate embeddings for address data in table CITIBIKE_STATIONS_SOURCE2:<pre>update `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` a set embeddings_src = ( SELECT ml_generate_embedding_result FROM ML.GENERATE_EMBEDDING( MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`, ( select name, address as content from `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` )) where name = a.name) where name is not null; This should create embeddings for the second source, note that we have created the embeddings field in the same table CITIBIKE_STATIONS_SOURCE2.</pre>To visualize the embeddings are generated for the source data tables 1 and 2, run the below queries:<pre>select name,address,embeddings_src from `mdm_gemini.CITIBIKE_STATIONS_SOURCE2`; select name,content,ml_generate_embedding_result from `mdm_gemini.CITIBIKE_STATIONS_SOURCE1`;</pre>Let’s go ahead and perform vector search to identify duplicates.<h3>Step 5: Vector Search for Flagging Duplicate Addresses</h3>In this step, we will search the address embeddings ml_generate_embedding_result column of the `mdm_gemini.CITIBIKE_STATIONS_SOURCE1` table for the top 2 embeddings that match each row of data in the embeddings_src column of the `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` table:<pre>select query.name name1,base.name name2, /* (select address from mdm_gemini.CITIBIKE_STATIONS_SOURCE2 where name = query.name) content1, base.content content2, */ distance from VECTOR_SEARCH( TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE1, &#39;ml_generate_embedding_result&#39;, (SELECT * FROM mdm_gemini.CITIBIKE_STATIONS_SOURCE2), &#39;embeddings_src&#39;, top_k =&gt; 2 ) where query.name &lt;&gt; base.name order by distance desc;</pre>Table that we are querying: mdm_gemini.CITIBIKE_STATIONS_EMBEDDINGS on the field ‘ml_generate_embedding_result’Table that we use as base: mdm_gemini.CITIBIKE_STATIONS_SOURCE2 on the field ‘embeddings_src’top_k: specifies the number of nearest neighbors to return. The default is 10. A negative value is treated as infinity, meaning that all values are counted as neighbors and returned.distance_type: specifies the type of metric to use to compute the distance between two vectors. Supported distance types are EUCLIDEAN and COSINE. The default is EUCLIDEAN.The result of the query is as follows:<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EQJgLI91pzmTsXnW" /><figcaption>Result Set</figcaption></figure>As you can see, we have listed 2 nearest neighbors (in other words, closest duplicates) for the 2 rows in CITIBIKE_STATIONS_SOURCE2 from CITIBIKE_STATIONS_SOURCE1. Since the distance_type is unspecified, it assumes that it is EUCLIDEAN and the distance is read as the distances in address TEXT values between the two sources, lowest being the most similar address texts.Let’s set distance_type to COSINE:<pre>select query.name name1,base.name name2, /* (select address from mdm_gemini.CITIBIKE_STATIONS_SOURCE2 where name = query.name) content1, base.content content2, */ distance from VECTOR_SEARCH( TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE1, &#39;ml_generate_embedding_result&#39;, (SELECT * FROM mdm_gemini.CITIBIKE_STATIONS_SOURCE2), &#39;embeddings_src&#39;, top_k =&gt; 2,distance_type =&gt; &#39;COSINE&#39; ) where query.name &lt;&gt; base.name order by distance desc;</pre>Result:<figure><img alt="" src="https://cdn-images-1.medium.com/max/905/0*lkeIwXC98QwSoqoq" /><figcaption>Result set</figcaption></figure>Both queries (of both distance types) are ordered by distance DESCENDING which means we want to list the results in the order of decreasing distance. But you will notice that the second query’s distance order is reversed. Guess why?Yes!! You got it right! When we say COSINE, it returns the similarity. So bigger the number, closer the similarity and hence lesser the distance. In EUCLIDEAN, bigger the number, farther the distance of values (in this case text).<blockquote>Tips to understand the difference and applications of EUCLIDEAN and COSINE:</blockquote><blockquote>Euclidean Distance measures the straight-line distance between two points in a multi-dimensional space and Cosine Similarity measures the cosine of the angle between two vectors.</blockquote><blockquote>Scale Invariance: Cosine similarity is not affected by the magnitude (length) of vectors, only their direction. This is useful when comparing documents of different lengths or user preferences with varying intensities.</blockquote><blockquote>Curse of Dimensionality: Euclidean distance can become less informative in very high-dimensional spaces, while cosine similarity tends to hold up better.</blockquote><blockquote>When to Use Which: Similar Scale, Absolute Differences Matter: Euclidean distanceDifferent Scales, Direction Matters More: Cosine similarity</blockquote><blockquote>5. Example: Imagine two users’ movie ratings: User 1: [5, 4, 3] (action, comedy, drama) User 2: [10, 8, 6] (same genres, but rated higher)</blockquote><blockquote>Euclidean distance would be large due to the difference in rating scale. Cosine similarity would be high, indicating similar taste despite the rating difference.</blockquote><blockquote>Please note in our example, we have used location text similarity — it doesn’t mean we are tracing the absolute geocoding distance between the 2 locations by value. We are only finding their similarity by the address text values.</blockquote>I built a UI around the Vector Search data from BigQuery for visualizing nearest neighbors based on their similarity in address string, NOT THE PHYSICAL DISTANCE, for a slightly larger dataset in a simple Java Spring Boot application and deployed it in Cloud Run. I created a <a href="https://github.com/AbiramiSukumaran/mdm_gemini_web">sample app</a>, edit the SQL in the controller class to match your distance results table:<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*X5vJaFouSO6L2e97C-fnoA.gif" /><figcaption>A demo of Bike Station Location UI that lists stations with similar addresses (not actual distance)</figcaption></figure><h3>Conclusion</h3>This project has demonstrated the power of using Gemini 1.0 Pro and Function Calling in transforming a few MDM activities into simplified yet powerful, deterministic and reliable generative AI capabilities. Now that you know, feel free to identify other ways of implementing the same use case or other MDM functionalities. Are there datasets you could validate, information gaps you could fill, or tasks that could be automated with structured calls embedded within your generative AI responses? Here is the link to the <a href="https://github.com/AbiramiSukumaran/MDMwithGemini">repo</a> and for further reading, refer to the documentation for <a href="https://cloud.google.com/vertex-ai">Vertex AI</a>, <a href="https://cloud.google.com/bigquery/docs/remote-functions">BigQuery Remote Functions</a>, and <a href="https://cloud.google.com/functions">Cloud Functions</a>, <a href="https://cloud.google.com/bigquery/docs/vector-index-text-search-tutorial#create_the_remote_model_for_text_embedding_generation">Embeddings</a>, <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search">Vector Search</a> for more in-depth guidance.<img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=35fd35d306c2" width="1" height="1" alt=""><hr><a href="https://medium.com/google-cloud/master-data-management-simplified-match-merge-with-generative-ai-35fd35d306c2">Master Data Management Simplified: Match &amp; Merge with Generative AI!</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.

Author

Link

Published date

Image url

Feed url

Guid

Hidden blurb

--- !ruby/object:Feedjira::Parser::RSSEntry
title: 'Master Data Management Simplified: Match & Merge with Generative AI!'
published: 2024-04-17 01:27:38.000000000 Z
categories:
- mdm
- bigquery
- gemini
- google-cloud-platform
- generative-ai
url: https://medium.com/google-cloud/master-data-management-simplified-match-merge-with-generative-ai-35fd35d306c2?source=rss----e52cf94d98af---4
entry_id: !ruby/object:Feedjira::Parser::GloballyUniqueIdentifier
 is_perma_link: 'false'
 guid: https://medium.com/p/35fd35d306c2
carlessian_info:
 news_filer_version: 2
 newspaper: Google Cloud - Medium
 macro_region: Blogs
content: 'It is my dream to accelerate some of the tedious MDM processes with Generative
 AI, Embeddings, Vector Search and more… Read on.<figure><img alt="" src="https://cdn-images-1.medium.com/max/740/1*zccJwrLqwmYQ38V2yeCtRw.png"
 /><figcaption>Image showing a bike station with docked bikes to represent the use case</figcaption></figure><h3>Introduction</h3>At
 its core, Master Data Management (MDM) is about creating a single, reliable source
 of truth for your organization’s most critical data. Think of MDM as a meticulously
 curated library where every book (data point) is correctly labeled, up-to-date,
 and easy to find. Implementing robust MDM has always been a strategic necessity,
 but it often comes with complexities and resource demands. This is where the transformative
 power of Generative AI, particularly models like Gemini 1.0 Pro, Gemini 1.0 Pro
 Vision, Gemini 1.5 Pro, comes into play.<blockquote>In this article,
 we will demonstrate how Gemini 1.0 Pro simplifies master data management applications
 like enrichment and deduplication, for the citibike_stations data available in the
 BigQuery public dataset. For this, we will use: 1. BigQuery public dataset bigquery-public-data.new_york_citibike. 2.
 Gemini Function Calling (a Java Cloud Function that gets the address information
 using the reverse Geocoding API for the coordinates available with the citibike_stations
 data) that we have already created in <a href="https://medium.com/google-cloud/using-gemini-function-calling-in-java-for-deterministic-generative-ai-responses-4c86a5ab80a9">one</a>
 of our previous articles. 3. Vertex AI Embeddings API and Vector Search in BigQuery
 to identify duplicates.</blockquote><h3>Master Data Management (MDM): The Foundation
 for Data-Driven Decisions</h3>The key elements of master data are business entities
 (like customers, products, suppliers, locations, etc. which are the nouns that your
 business revolves around), identifiers (Unique identifiers ensure each entity is
 distinct and traceable across systems), attributes (These are the characteristics
 that describe each entity e.g. a customer’s address, a product’s price etc.). I
 am going to take a little space here with it because understanding the importance
 of MDM is paramount in the generative AI era more than ever. Why MDM Matters:<ul><li>Adds
 clarity to data:</li></ul>Without MDM, businesses struggle with fragmented
 data scattered across various systems. This leads to inconsistencies, duplicates,
 and sometimes ambiguous understanding of relationships between data points (e.g.
 is “Abirami Sukumaran” in one system the same customer as “Abirami S.” in another?).<ul><li>Helps
 in decision making:</li></ul>Accurate, unified master data is the foundation
 of informed business decisions. It answers questions like — Who are our most valuable
 customers? Which products are underperforming? What suppliers provide the best value?<ul><li>Serves
 data as a Product:</li></ul>MDM is essential for treating data as a
 valuable asset. Clean, integrated, and up-to-date master data forms the basis for
 building insightful data products that drive business outcomes.Let
 me paint a picture to help you understand master data better by comparing it with
 transactional data: Transactional data captures individual events (a purchase,
 a shipment etc.). Master data provides the context for those events by defining
 the entities involved. For example, a sales transaction links to master data for
 the customer, product, and salesperson. Without MDM, data tends to remain in silos
 hindering a holistic view, prone to quality issues and relatively heavy resource-spend
 on data reconciliation and consolidation on demand.<h3>The powerful synergy
 of MDM and Generative AI</h3>Below are some of the industry grade applications
 and advantages of empowering and simplifying MDM with Generative AI:<ul><li>Automated
 Data Cleansing: Gemini 1.0 Pro can intelligently identify and rectify inconsistencies,
 duplicates, and errors in master data, significantly reducing manual effort.</li><li>Intelligent
 Matching: Leveraging its language understanding capabilities, Generative
 AI capabilities, embeddings and Vector Search, come together to accurately match
 and merge records from disparate sources, even with variations in formatting or
 naming conventions.</li><li>Contextual Enrichment: Generative AI
 models can generate missing attributes or provide additional context to master data,
 enhancing its value for analysis and decision-making.</li><li>Adaptive Learning:
 You can leverage fine tuning and reinforcement learning to enable generative AI
 to continuously improve its understanding of your business entities and relationships,
 leading to more accurate and efficient MDM over time.</li></ul><h3>High Level Flow Diagram</h3>This
 diagram represents the flow of data and steps involved in the implementation.<figure><img
 alt="" src="https://cdn-images-1.medium.com/max/1001/1*pMW7YoxbqYbJjYXgucff9g.png"
 /><figcaption>High level flow of the use case</figcaption></figure><h3>Demo</h3>The
 steps loosely are as follows:<ol><li>Create a BigQuery dataset for the use case.
 Create a landing table with data from the public dataset table bigquery-public-data.new_york_citibike.citibike_stations.</li><li>Make
 sure the Cloud Function that includes Gemini Function Calling for address standardization
 is deployed.</li><li>Store the enriched address data in the landing tables (from
 2 sources for demo purpose).</li><li>Invoke Vertex AI Embeddings API from BigQuery
 on the address data.</li><li>Use BigQuery Vector Search to identify duplicate records.</li></ol><h3>Setup</h3><ol><li>In
 the <a href="https://console.cloud.google.com/">Google Cloud Console</a>, on the
 project selector page, select or create a Google Cloud <a href="https://cloud.google.com/resource-manager/docs/creating-managing-projects">project</a>.</li><li>Make
 sure that billing is enabled for your Cloud project. Learn how to <a href="https://cloud.google.com/billing/docs/how-to/verify-billing-enabled">check
 if billing is enabled on a project</a>.</li><li>You will use Cloud Shell, a command-line
 environment running in Google Cloud that comes preloaded with bq. From the Cloud
 Console, click Activate Cloud Shell on the top right corner.</li><li>Enable necessary
 APIs for this implementation if you haven’t already. Alternative to the gcloud command
 is through the console using this <a href="https://pantheon.corp.google.com/apis/enableflow?apiid=cloudfunctions.googleapis.com,run.googleapis.com,bigquery.googleapis.com,bigqueryconnection.googleapis.com,aiplatform.googleapis.com">link</a>.</li></ol><h3>Step
 1: Create BigQuery Dataset and External Connection</h3>BigQuery <a href="https://cloud.google.com/bigquery/docs/datasets-intro">dataset</a>
 is a container for all the tables and objects for your application. BigQuery <a
 href="https://cloud.google.com/bigquery/docs/remote-functions#create_a_connection">connection</a>
 is used to interact with your Cloud Function. In order to create a remote function,
 you must create a BigQuery connection. Let’s begin with creating the dataset and
 the connection.<ol><li>From the Google Cloud Console, go to the <a href="https://console.cloud.google.com/bigquery">BigQuery
 page</a> and click the 3 vertical dots icon next to your project id. From the list
 of options, select “Create data set”.</li><li>In Create data set pop up, enter the
 data set ID “mdm_gemini” with the region set to the default value “US (multiple
 regions…)”</li><li><a href="https://cloud.google.com/bigquery/docs/remote-functions#create_a_connection">BigLake
 Connection</a> allows us to connect the external data source while retaining fine-grained
 BigQuery access control and security, which in our case is the Vertex AI Gemini
 Pro API. We will use this connection to access the model from BigQuery via the Cloud
 Function. Follow steps below to create the BigLake Connection:</li></ol>a. Click
 ADD on the Explorer pane of the BigQuery page:<figure><img alt="" src="https://cdn-images-1.medium.com/max/716/0*mqX5rzEHWjkzEQIE"
 /><figcaption>ADD button to create connection</figcaption></figure>b. Click Connections
 to external data sources in the sources page.c. Enter external data source
 details as below in the pop up that shows up and click CREATE CONNECTION:<figure><img
 alt="" src="https://cdn-images-1.medium.com/proxy/1*b31hiO4ynbDLRrXWEFF4aQ.png"
 /><figcaption>Create Connection</figcaption></figure>d. Once the connection is
 created, go to the connection configuration page and copy the Service account ID
 for access provisioning:<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xh8uToOATp-TmC5O"
 /><figcaption>Connection Info</figcaption></figure>e. Open <a href="https://console.cloud.google.com/iam-admin/iam">IAM
 and admin</a> page, click GRANT ACCESS, enter the service account id in the new
 principals tab and roles as shown below and click SAVE.<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fCExrL0trAtqirew"
 /><figcaption>Grant Access</figcaption></figure><h3>Step 2: Deploy Function Calling
 (Java Cloud Function)</h3><ol><li>Clone the github <a href="https://github.com/AbiramiSukumaran/GeminiFunctionCalling">repo</a>
 from your Cloud Shell Terminal and change the YOUR_API_KEY and YOUR_PROJECT_ID with
 your values</li></ol><pre>git clone https://github.com/AbiramiSukumaran/GeminiFunctionCalling</pre>2.
 Go to Cloud Shell terminal, navigate into the newly cloned project directory “GeminiFunctionCalling”
 and execute the below statement build and deploy the Cloud Function:<pre>gcloud
 functions deploy gemini-fn-calling --gen2 --region=us-central1 --runtime=java11
 --source=. --entry-point=cloudcode.helloworld.HelloWorld --trigger-http</pre>The
 result for this deploy command would be a REST URL in the format as below :<a
 href="https://us-central1-abis-345004.cloudfunctions.net/gemini-bq-fn">https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-fn-calling</a>3.
 Test this Cloud Function by running the following command from the terminal:<pre>gcloud
 functions call gemini-fn-calling - region=us-central1 - gen2 - data &#39;{&quot;calls&quot;:[[&quot;40.714224,-73.961452&quot;]]}&#39;</pre>Response
 for a random sample prompt:<pre> &#39;{&quot;replies&quot;:[&quot;{ \&quot;DOOR_NUMBER\&quot;:
 \&quot;277\&quot;, \&quot;STREET_ADDRESS\&quot;: \&quot;Bedford Ave\&quot;, \&quot;AREA\&quot;: null,
 \&quot;CITY\&quot;: \&quot;Brooklyn\&quot;, \&quot;TOWN\&quot;: null, \&quot;COUNTY\&quot;:
 \&quot;Kings County\&quot;, \&quot;STATE\&quot;: \&quot;NY\&quot;, \&quot;COUNTRY\&quot;:
 \&quot;USA\&quot;, \&quot;ZIPCODE\&quot;: \&quot;11211\&quot;, \&quot;LANDMARK\&quot;:
 null}}```&quot;]}&#39;</pre>The request and response parameters of this Cloud
 Function are implemented in a way that is compatible with BigQuery’s remote function
 invocation. It can be directly consumed from BigQuery data in-place. It means that
 if your data input (lat and long data) lives in BigQuery then you can call the remote
 function on the data and get the function response which can be stored or processed
 within BigQuery directly.4. Run the following DDL from BigQuery to create
 a remote function that invokes this deployed Cloud Function:<pre> CREATE OR
 REPLACE FUNCTION `mdm_gemini.MDM_GEMINI` (latlng STRING) RETURNS STRING REMOTE
 WITH CONNECTION `us.gemini-bq-conn` OPTIONS ( endpoint = &#39;https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-fn-calling&#39;,
 max_batching_rows = 1 );</pre>WORKAROUNDIf you do
 not have the necessary key or Cloud Function deployed, feel free to jump to the
 landing table directly by exporting the data from the csv into your new BigQuery
 dataset mdm_gemini using the following command in the Cloud Shell Terminal. Remember
 to download the file <a href="https://github.com/AbiramiSukumaran/MDMwithGemini/blob/main/CITIBIKE_STATIONS.csv">CITIBIKE_STATIONS.csv</a>
 from the <a href="https://github.com/AbiramiSukumaran/MDMwithGemini">repo</a> into
 your Cloud Shell project folder and navigate into that folder before executing the
 following command:<pre>bq load --source_format=CSV --skip_leading_rows=1 mdm_gemini.CITIBIKE_STATIONS
 ./CITIBIKE_STATIONS.csv \ name:string,latlng:string,capacity:numeric,num_bikes_available:numeric,num_docks_available:numeric,last_reported:timestamp,full_address_string:string</pre><h3>Step
 3: Create Table and Enrich Address data</h3>a. If you have used the WORKAROUND
 approach from the last step, you can skip this step, since you have already created
 the table there. If NOT, proceed to the running the following DDL in BigQuery SQL Editor:<pre>CREATE
 TABLE mdm_gemini.CITIBIKE_STATIONS as ( select name, latitude || &#39;,&#39;
 || longitude as latlong, capacity, num_bikes_available, num_docks_available,last_reported, &#39;&#39;
 as full_address_string from bigquery-public-data.new_york_citibike.citibike_stations)
 ;</pre>Let’s enrich the address data by invoking the remote function on the latitude
 and longitude coordinates available in the table. Please note that we will update
 this only for data reported for the year 2024 and where number of bikes available
 &gt; 0 and capacity &gt; 100:<pre>update `mdm_gemini.CITIBIKE_STATIONS` set
 full_address_string = `mdm_gemini.MDM_GEMINI`(latlong) where EXTRACT(YEAR FROM
 last_reported) = 2024 and num_bikes_available &gt; 0 and capacity &gt; 100;</pre>b.
 Do not skip this step even if you used the WORKAROUND approach in the last step.
 This is where we will create a second source of bike station location data for the
 purpose of this use case. Afterall, MDM is bringing data from multiple sources together
 and identifying the golden truth.Run the following DDLs in BigQuery SQL Editor
 for creating the second source of location data with 2 records in it. Let’s name
 this table mdm_gemini.CITIBIKE_STATIONS_SOURCE2 and insert 2 records into it:<pre>CREATE
 TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE2 (name STRING(55), address STRING(1000),
 embeddings_src ARRAY&lt;FLOAT64&gt;); insert into mdm_gemini.CITIBIKE_STATIONS_SOURCE2
 VALUES (&#39;Location broadway and 29&#39;,&#39;{ &quot;DOOR_NUMBER&quot;: &quot;1593&quot;,
 &quot;STREET_ADDRESS&quot;: &quot;Broadway&quot;, &quot;AREA&quot;: null, &quot;CITY&quot;:
 &quot;New York&quot;, &quot;TOWN&quot;: null, &quot;COUNTY&quot;: &quot;New York
 County&quot;, &quot;STATE&quot;: &quot;NY&quot;, &quot;COUNTRY&quot;: &quot;USA&quot;,
 &quot;ZIPCODE&quot;: &quot;10019&quot;, &quot;LANDMARK&quot;: null}&#39;, null); insert
 into mdm_gemini.CITIBIKE_STATIONS_SOURCE2 VALUES (&#39;Allen St &amp; Hester&#39;,&#39;{
 &quot;DOOR_NUMBER&quot;: &quot;36&quot;, &quot;STREET_ADDRESS&quot;: &quot;Allen
 St&quot;, &quot;AREA&quot;: null, &quot;CITY&quot;: &quot;New York&quot;, &quot;TOWN&quot;:
 null, &quot;COUNTY&quot;: &quot;New York County&quot;, &quot;STATE&quot;: &quot;NY&quot;,
 &quot;COUNTRY&quot;: &quot;USA&quot;, &quot;ZIPCODE&quot;: &quot;10002&quot;, &quot;LANDMARK&quot;:
 null}&#39;, null);</pre><h3>Step 4: Generate Embeddings for Address Data</h3>Embeddings
 are high-dimensional numerical vectors that represent a given entity, like a piece
 of text or an audio file. Machine learning (ML) models use embeddings to encode
 semantics about such entities to make it easier to reason about and compare them.
 For example, a common operation in clustering, classification, and recommendation
 models is to measure the distance between vectors in an embedding space to find
 items that are most semantically similar. The Vertex AI text-embeddings API lets
 you create a text embedding using Generative AI on Vertex AI. Text embeddings are
 numerical representations of text that capture relationships between words and phrases.
 Read more about Vertex AI Text Embeddings <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings">here</a>.Run
 the below DDL to create a remote model for Vertex AI text embeddings API:<pre>CREATE
 OR REPLACE MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB` REMOTE WITH CONNECTION
 `us.gemini-bq-conn` OPTIONS (ENDPOINT = &#39;textembedding-gecko@latest&#39;);</pre>Now
 that the remote embeddings model is ready, let’s generate embeddings for the first
 source and store it in a table. You can store the embeddings result field in the
 same mdm_gemini.CITIBIKE_STATIONS table as before, but I am choosing to create a
 new one for clarity:<pre>CREATE TABLE `mdm_gemini.CITIBIKE_STATIONS_SOURCE1`
 AS ( SELECT * FROM ML.GENERATE_EMBEDDING( MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`, (
 select name, full_address_string as content from `mdm_gemini.CITIBIKE_STATIONS`
 where full_address_string is not null ) ) );</pre>Let’s generate
 embeddings for address data in table CITIBIKE_STATIONS_SOURCE2:<pre>update `mdm_gemini.CITIBIKE_STATIONS_SOURCE2`
 a set embeddings_src = ( SELECT ml_generate_embedding_result FROM ML.GENERATE_EMBEDDING( MODEL
 `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`, ( select name, address as content
 from `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` )) where name = a.name) where name
 is not null; This should create embeddings for the second source, note that we
 have created the embeddings field in the same table CITIBIKE_STATIONS_SOURCE2.</pre>To
 visualize the embeddings are generated for the source data tables 1 and 2, run the
 below queries:<pre>select name,address,embeddings_src from `mdm_gemini.CITIBIKE_STATIONS_SOURCE2`; select
 name,content,ml_generate_embedding_result from `mdm_gemini.CITIBIKE_STATIONS_SOURCE1`;</pre>Let’s
 go ahead and perform vector search to identify duplicates.<h3>Step 5: Vector
 Search for Flagging Duplicate Addresses</h3>In this step, we will search the
 address embeddings ml_generate_embedding_result column of the `mdm_gemini.CITIBIKE_STATIONS_SOURCE1`
 table for the top 2 embeddings that match each row of data in the embeddings_src
 column of the `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` table:<pre>select query.name
 name1,base.name name2, /* (select address from mdm_gemini.CITIBIKE_STATIONS_SOURCE2
 where name = query.name) content1, base.content content2, */ distance from
 VECTOR_SEARCH( TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE1, &#39;ml_generate_embedding_result&#39;, (SELECT
 * FROM mdm_gemini.CITIBIKE_STATIONS_SOURCE2), &#39;embeddings_src&#39;, top_k
 =&gt; 2 ) where query.name &lt;&gt; base.name order by distance desc;</pre>Table
 that we are querying: mdm_gemini.CITIBIKE_STATIONS_EMBEDDINGS on the field
 ‘ml_generate_embedding_result’Table that we use as base:
 mdm_gemini.CITIBIKE_STATIONS_SOURCE2 on the field ‘embeddings_src’top_k:
 specifies the number of nearest neighbors to return. The default is 10. A negative
 value is treated as infinity, meaning that all values are counted as neighbors and
 returned.distance_type: specifies the type of metric to
 use to compute the distance between two vectors. Supported distance types are EUCLIDEAN
 and COSINE. The default is EUCLIDEAN.The result of the query is as follows:<figure><img
 alt="" src="https://cdn-images-1.medium.com/max/1024/0*EQJgLI91pzmTsXnW" /><figcaption>Result
 Set</figcaption></figure>As you can see, we have listed 2 nearest neighbors (in
 other words, closest duplicates) for the 2 rows in CITIBIKE_STATIONS_SOURCE2 from
 CITIBIKE_STATIONS_SOURCE1. Since the distance_type is unspecified, it assumes that
 it is EUCLIDEAN and the distance is read as the distances in address TEXT values
 between the two sources, lowest being the most similar address texts.Let’s
 set distance_type to COSINE:<pre>select query.name name1,base.name name2, /*
 (select address from mdm_gemini.CITIBIKE_STATIONS_SOURCE2 where name = query.name)
 content1, base.content content2, */ distance from VECTOR_SEARCH( TABLE
 mdm_gemini.CITIBIKE_STATIONS_SOURCE1, &#39;ml_generate_embedding_result&#39;, (SELECT
 * FROM mdm_gemini.CITIBIKE_STATIONS_SOURCE2), &#39;embeddings_src&#39;, top_k
 =&gt; 2,distance_type =&gt; &#39;COSINE&#39; ) where query.name &lt;&gt; base.name order
 by distance desc;</pre>Result:<figure><img alt="" src="https://cdn-images-1.medium.com/max/905/0*lkeIwXC98QwSoqoq"
 /><figcaption>Result set</figcaption></figure>Both queries (of both distance
 types) are ordered by distance DESCENDING which means we want to list the results
 in the order of decreasing distance. But you will notice that the second query’s
 distance order is reversed. Guess why?Yes!! You got it right! When we say
 COSINE, it returns the similarity. So bigger the number, closer the similarity and
 hence lesser the distance. In EUCLIDEAN, bigger the number, farther the distance
 of values (in this case text).<blockquote>Tips to understand the difference
 and applications of EUCLIDEAN and COSINE:</blockquote><blockquote>Euclidean
 Distance measures the straight-line distance between two points in a multi-dimensional
 space and Cosine Similarity measures the cosine of the angle between
 two vectors.</blockquote><blockquote>Scale Invariance: Cosine similarity is not
 affected by the magnitude (length) of vectors, only their direction. This is useful
 when comparing documents of different lengths or user preferences with varying intensities.</blockquote><blockquote>Curse
 of Dimensionality: Euclidean distance can become less informative in very high-dimensional
 spaces, while cosine similarity tends to hold up better.</blockquote><blockquote>When
 to Use Which: Similar Scale, Absolute Differences Matter: Euclidean distanceDifferent
 Scales, Direction Matters More: Cosine similarity</blockquote><blockquote>5. Example: Imagine
 two users’ movie ratings: User 1: [5, 4, 3] (action, comedy, drama) User 2:
 [10, 8, 6] (same genres, but rated higher)</blockquote><blockquote>Euclidean distance
 would be large due to the difference in rating scale. Cosine similarity would
 be high, indicating similar taste despite the rating difference.</blockquote><blockquote>Please
 note in our example, we have used location text similarity — it doesn’t mean we
 are tracing the absolute geocoding distance between the 2 locations by value. We
 are only finding their similarity by the address text values.</blockquote>I built
 a UI around the Vector Search data from BigQuery for visualizing nearest neighbors
 based on their similarity in address string, NOT THE PHYSICAL DISTANCE, for a slightly
 larger dataset in a simple Java Spring Boot application and deployed it in Cloud
 Run. I created a <a href="https://github.com/AbiramiSukumaran/mdm_gemini_web">sample
 app</a>, edit the SQL in the controller class to match your distance results table:<figure><img
 alt="" src="https://cdn-images-1.medium.com/max/1024/1*X5vJaFouSO6L2e97C-fnoA.gif"
 /><figcaption>A demo of Bike Station Location UI that lists stations with similar
 addresses (not actual distance)</figcaption></figure><h3>Conclusion</h3>This
 project has demonstrated the power of using Gemini 1.0 Pro and Function Calling
 in transforming a few MDM activities into simplified yet powerful, deterministic
 and reliable generative AI capabilities. Now that you know, feel free to identify
 other ways of implementing the same use case or other MDM functionalities. Are there
 datasets you could validate, information gaps you could fill, or tasks that could
 be automated with structured calls embedded within your generative AI responses?
 Here is the link to the <a href="https://github.com/AbiramiSukumaran/MDMwithGemini">repo</a>
 and for further reading, refer to the documentation for <a href="https://cloud.google.com/vertex-ai">Vertex
 AI</a>, <a href="https://cloud.google.com/bigquery/docs/remote-functions">BigQuery
 Remote Functions</a>, and <a href="https://cloud.google.com/functions">Cloud Functions</a>,
 <a href="https://cloud.google.com/bigquery/docs/vector-index-text-search-tutorial#create_the_remote_model_for_text_embedding_generation">Embeddings</a>,
 <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search">Vector
 Search</a> for more in-depth guidance.<img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=35fd35d306c2"
 width="1" height="1" alt=""><hr><a href="https://medium.com/google-cloud/master-data-management-simplified-match-merge-with-generative-ai-35fd35d306c2">Master
 Data Management Simplified: Match &amp; Merge with Generative AI!</a> was originally
 published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a>
 on Medium, where people are continuing the conversation by highlighting and responding
 to this story.'
rss_fields:
- title
- published
- categories
- url
- entry_id
- content
- author
author: Abirami Sukumaran

Language

Active

Ricc internal notes

Ricc source

Show this article Back to articles