♊️ GemiNews 🗞️
(dev)
🏡
📰 Articles
🏷️ Tags
🧠 Queries
📈 Graphs
☁️ Stats
💁🏻 Assistant
💬
🎙️
Demo 1: Embeddings + Recommendation
Demo 2: Bella RAGa
Demo 3: NewRetriever
Demo 4: Assistant function calling
Editing article
Title
Summary
Content
<p>It is my dream to accelerate some of the tedious MDM processes with Generative AI, Embeddings, Vector Search and more… Read on.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/740/1*zccJwrLqwmYQ38V2yeCtRw.png" /><figcaption>Image showing a bike station with docked bikes to represent the use case</figcaption></figure><h3>Introduction</h3><p>At its core, Master Data Management (MDM) is about creating a single, reliable source of truth for your organization’s most critical data. Think of MDM as a meticulously curated library where every book (data point) is correctly labeled, up-to-date, and easy to find. Implementing robust MDM has always been a strategic necessity, but it often comes with complexities and resource demands. This is where the transformative power of Generative AI, particularly models like Gemini 1.0 Pro, Gemini 1.0 Pro Vision, Gemini 1.5 Pro, comes into play.</p><blockquote><strong>In this article</strong>, we will demonstrate how Gemini 1.0 Pro simplifies master data management applications like enrichment and deduplication, for the citibike_stations data available in the BigQuery public dataset. For this, we will use:<br>1. BigQuery public dataset bigquery-public-data.new_york_citibike.<br>2. Gemini Function Calling (a Java Cloud Function that gets the address information using the reverse Geocoding API for the coordinates available with the citibike_stations data) that we have already created in <a href="https://medium.com/google-cloud/using-gemini-function-calling-in-java-for-deterministic-generative-ai-responses-4c86a5ab80a9">one</a> of our previous articles.<br>3. Vertex AI Embeddings API and Vector Search in BigQuery to identify duplicates.</blockquote><h3>Master Data Management (MDM): The Foundation for Data-Driven Decisions</h3><p>The key elements of master data are business entities (like customers, products, suppliers, locations, etc. which are the nouns that your business revolves around), identifiers (Unique identifiers ensure each entity is distinct and traceable across systems), attributes (These are the characteristics that describe each entity e.g. a customer’s address, a product’s price etc.). I am going to take a little space here with it because understanding the importance of MDM is paramount in the generative AI era more than ever. Why MDM Matters:</p><ul><li><strong>Adds clarity to data:</strong></li></ul><p>Without MDM, businesses struggle with fragmented data scattered across various systems. This leads to inconsistencies, duplicates, and sometimes ambiguous understanding of relationships between data points (e.g. is “Abirami Sukumaran” in one system the same customer as “Abirami S.” in another?).</p><ul><li><strong>Helps in decision making:</strong></li></ul><p>Accurate, unified master data is the foundation of informed business decisions. It answers questions like — Who are our most valuable customers? Which products are underperforming? What suppliers provide the best value?</p><ul><li><strong>Serves data as a Product:</strong></li></ul><p>MDM is essential for treating data as a valuable asset. Clean, integrated, and up-to-date master data forms the basis for building insightful data products that drive business outcomes.</p><p><strong>Let me paint a picture to help you understand master data better by comparing it with transactional data: </strong>Transactional data captures individual events (a purchase, a shipment etc.). Master data provides the context for those events by defining the entities involved. For example, a sales transaction links to master data for the customer, product, and salesperson. Without MDM, data tends to remain in silos hindering a holistic view, prone to quality issues and relatively heavy resource-spend on data reconciliation and consolidation on demand.</p><h3>The powerful synergy of MDM and Generative AI</h3><p>Below are some of the industry grade applications and advantages of empowering and simplifying MDM with Generative AI:</p><ul><li><strong>Automated Data Cleansing:</strong> Gemini 1.0 Pro can intelligently identify and rectify inconsistencies, duplicates, and errors in master data, significantly reducing manual effort.</li><li><strong>Intelligent Matching:</strong> Leveraging its language understanding capabilities, Generative AI capabilities, embeddings and Vector Search, come together to accurately match and merge records from disparate sources, even with variations in formatting or naming conventions.</li><li><strong>Contextual Enrichment:</strong> Generative AI models can generate missing attributes or provide additional context to master data, enhancing its value for analysis and decision-making.</li><li><strong>Adaptive Learning</strong>: You can leverage fine tuning and reinforcement learning to enable generative AI to continuously improve its understanding of your business entities and relationships, leading to more accurate and efficient MDM over time.</li></ul><h3>High Level Flow Diagram</h3><p>This diagram represents the flow of data and steps involved in the implementation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1001/1*pMW7YoxbqYbJjYXgucff9g.png" /><figcaption>High level flow of the use case</figcaption></figure><h3>Demo</h3><p>The steps loosely are as follows:</p><ol><li>Create a BigQuery dataset for the use case. Create a landing table with data from the public dataset table bigquery-public-data.new_york_citibike.citibike_stations.</li><li>Make sure the Cloud Function that includes Gemini Function Calling for address standardization is deployed.</li><li>Store the enriched address data in the landing tables (from 2 sources for demo purpose).</li><li>Invoke Vertex AI Embeddings API from BigQuery on the address data.</li><li>Use BigQuery Vector Search to identify duplicate records.</li></ol><h3>Setup</h3><ol><li>In the <a href="https://console.cloud.google.com/">Google Cloud Console</a>, on the project selector page, select or create a Google Cloud <a href="https://cloud.google.com/resource-manager/docs/creating-managing-projects">project</a>.</li><li>Make sure that billing is enabled for your Cloud project. Learn how to <a href="https://cloud.google.com/billing/docs/how-to/verify-billing-enabled">check if billing is enabled on a project</a>.</li><li>You will use Cloud Shell, a command-line environment running in Google Cloud that comes preloaded with bq. From the Cloud Console, click Activate Cloud Shell on the top right corner.</li><li>Enable necessary APIs for this implementation if you haven’t already. Alternative to the gcloud command is through the console using this <a href="https://pantheon.corp.google.com/apis/enableflow?apiid=cloudfunctions.googleapis.com,run.googleapis.com,bigquery.googleapis.com,bigqueryconnection.googleapis.com,aiplatform.googleapis.com">link</a>.</li></ol><h3>Step 1: Create BigQuery Dataset and External Connection</h3><p>BigQuery <a href="https://cloud.google.com/bigquery/docs/datasets-intro">dataset</a> is a container for all the tables and objects for your application. BigQuery <a href="https://cloud.google.com/bigquery/docs/remote-functions#create_a_connection">connection</a> is used to interact with your Cloud Function. In order to create a remote function, you must create a BigQuery connection. Let’s begin with creating the dataset and the connection.</p><ol><li>From the Google Cloud Console, go to the <a href="https://console.cloud.google.com/bigquery">BigQuery page</a> and click the 3 vertical dots icon next to your project id. From the list of options, select “Create data set”.</li><li>In Create data set pop up, enter the data set ID “mdm_gemini” with the region set to the default value “US (multiple regions…)”</li><li><a href="https://cloud.google.com/bigquery/docs/remote-functions#create_a_connection">BigLake Connection</a> allows us to connect the external data source while retaining fine-grained BigQuery access control and security, which in our case is the Vertex AI Gemini Pro API. We will use this connection to access the model from BigQuery via the Cloud Function. Follow steps below to create the BigLake Connection:</li></ol><p>a. Click ADD on the Explorer pane of the BigQuery page:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/716/0*mqX5rzEHWjkzEQIE" /><figcaption>ADD button to create connection</figcaption></figure><p>b. Click Connections to external data sources in the sources page.</p><p>c. Enter external data source details as below in the pop up that shows up and click CREATE CONNECTION:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*b31hiO4ynbDLRrXWEFF4aQ.png" /><figcaption>Create Connection</figcaption></figure><p>d. Once the connection is created, go to the connection configuration page and copy the Service account ID for access provisioning:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xh8uToOATp-TmC5O" /><figcaption>Connection Info</figcaption></figure><p>e. Open <a href="https://console.cloud.google.com/iam-admin/iam">IAM and admin</a> page, click GRANT ACCESS, enter the service account id in the new principals tab and roles as shown below and click SAVE.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fCExrL0trAtqirew" /><figcaption>Grant Access</figcaption></figure><h3>Step 2: Deploy Function Calling (Java Cloud Function)</h3><ol><li>Clone the github <a href="https://github.com/AbiramiSukumaran/GeminiFunctionCalling">repo</a> from your Cloud Shell Terminal and change the YOUR_API_KEY and YOUR_PROJECT_ID with your values</li></ol><pre>git clone https://github.com/AbiramiSukumaran/GeminiFunctionCalling</pre><p>2. Go to Cloud Shell terminal, navigate into the newly cloned project directory “GeminiFunctionCalling” and execute the below statement build and deploy the Cloud Function:</p><pre>gcloud functions deploy gemini-fn-calling --gen2 --region=us-central1 --runtime=java11 --source=. --entry-point=cloudcode.helloworld.HelloWorld --trigger-http</pre><p>The result for this deploy command would be a REST URL in the format as below :</p><p><a href="https://us-central1-abis-345004.cloudfunctions.net/gemini-bq-fn"><strong>https://us-central1-</strong>YOUR_PROJECT_ID<strong>.cloudfunctions.net/gemini-fn-calling</strong></a></p><p>3. Test this Cloud Function by running the following command from the terminal:</p><pre>gcloud functions call gemini-fn-calling - region=us-central1 - gen2 - data '{"calls":[["40.714224,-73.961452"]]}'</pre><p>Response for a random sample prompt:</p><pre> '{"replies":["{ \"DOOR_NUMBER\": \"277\", \"STREET_ADDRESS\": \"Bedford Ave\", \"AREA\":<br> null, \"CITY\": \"Brooklyn\", \"TOWN\": null, \"COUNTY\": \"Kings County\", \"STATE\":<br> \"NY\", \"COUNTRY\": \"USA\", \"ZIPCODE\": \"11211\", \"LANDMARK\": null}}```"]}'</pre><p>The request and response parameters of this Cloud Function are implemented in a way that is compatible with BigQuery’s remote function invocation. It can be directly consumed from BigQuery data in-place. It means that if your data input (lat and long data) lives in BigQuery then you can call the remote function on the data and get the function response which can be stored or processed within BigQuery directly.</p><p>4. Run the following DDL from BigQuery to create a remote function that invokes this deployed Cloud Function:</p><pre> CREATE OR REPLACE FUNCTION <br> `mdm_gemini.MDM_GEMINI` (latlng STRING) RETURNS STRING<br> REMOTE WITH CONNECTION `us.gemini-bq-conn`<br> OPTIONS (<br> endpoint = 'https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-fn-calling', max_batching_rows = 1<br> );</pre><p><strong>WORKAROUND</strong></p><p>If you do not have the necessary key or Cloud Function deployed, feel free to jump to the landing table directly by exporting the data from the csv into your new BigQuery dataset mdm_gemini using the following command in the Cloud Shell Terminal. <strong>Remember</strong> to download the file <a href="https://github.com/AbiramiSukumaran/MDMwithGemini/blob/main/CITIBIKE_STATIONS.csv">CITIBIKE_STATIONS.csv</a> from the <a href="https://github.com/AbiramiSukumaran/MDMwithGemini">repo</a> into your Cloud Shell project folder and navigate into that folder before executing the following command:</p><pre>bq load --source_format=CSV --skip_leading_rows=1 mdm_gemini.CITIBIKE_STATIONS ./CITIBIKE_STATIONS.csv \ name:string,latlng:string,capacity:numeric,num_bikes_available:numeric,num_docks_available:numeric,last_reported:timestamp,full_address_string:string</pre><h3>Step 3: Create Table and Enrich Address data</h3><p>a. If you have used the WORKAROUND approach from the last step, you can skip this step, since you have already created the table there. If NOT, proceed to the running the following DDL in BigQuery SQL Editor:</p><pre>CREATE TABLE mdm_gemini.CITIBIKE_STATIONS as (<br>select name, latitude || ',' || longitude as latlong, capacity, num_bikes_available, num_docks_available,last_reported,<br>'' as full_address_string <br> from bigquery-public-data.new_york_citibike.citibike_stations) ;</pre><p>Let’s enrich the address data by invoking the remote function on the latitude and longitude coordinates available in the table. Please note that we will update this only for data reported for the year 2024 and where number of bikes available > 0 and capacity > 100:</p><pre>update `mdm_gemini.CITIBIKE_STATIONS` <br>set full_address_string = `mdm_gemini.MDM_GEMINI`(latlong) <br>where EXTRACT(YEAR FROM last_reported) = 2024 and num_bikes_available > 0 and capacity > 100;</pre><p>b. Do not skip this step even if you used the WORKAROUND approach in the last step. This is where we will create a second source of bike station location data for the purpose of this use case. Afterall, MDM is bringing data from multiple sources together and identifying the golden truth.</p><p>Run the following DDLs in BigQuery SQL Editor for creating the second source of location data with 2 records in it. Let’s name this table mdm_gemini.CITIBIKE_STATIONS_SOURCE2 and insert 2 records into it:</p><pre>CREATE TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE2 (name STRING(55), address STRING(1000), embeddings_src ARRAY<FLOAT64>);<br><br>insert into mdm_gemini.CITIBIKE_STATIONS_SOURCE2 VALUES ('Location broadway and 29','{ "DOOR_NUMBER": "1593", "STREET_ADDRESS": "Broadway", "AREA": null, "CITY": "New York", "TOWN": null, "COUNTY": "New York County", "STATE": "NY", "COUNTRY": "USA", "ZIPCODE": "10019", "LANDMARK": null}', null);<br><br>insert into mdm_gemini.CITIBIKE_STATIONS_SOURCE2 VALUES ('Allen St & Hester','{ "DOOR_NUMBER": "36", "STREET_ADDRESS": "Allen St", "AREA": null, "CITY": "New York", "TOWN": null, "COUNTY": "New York County", "STATE": "NY", "COUNTRY": "USA", "ZIPCODE": "10002", "LANDMARK": null}', null);</pre><h3>Step 4: Generate Embeddings for Address Data</h3><p>Embeddings are high-dimensional numerical vectors that represent a given entity, like a piece of text or an audio file. Machine learning (ML) models use embeddings to encode semantics about such entities to make it easier to reason about and compare them. For example, a common operation in clustering, classification, and recommendation models is to measure the distance between vectors in an embedding space to find items that are most semantically similar. The Vertex AI text-embeddings API lets you create a text embedding using Generative AI on Vertex AI. Text embeddings are numerical representations of text that capture relationships between words and phrases. Read more about Vertex AI Text Embeddings <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings">here</a>.</p><p>Run the below DDL to create a remote model for Vertex AI text embeddings API:</p><pre>CREATE OR REPLACE MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`<br>REMOTE WITH CONNECTION `us.gemini-bq-conn`<br>OPTIONS (ENDPOINT = 'textembedding-gecko@latest');</pre><p>Now that the remote embeddings model is ready, let’s generate embeddings for the first source and store it in a table. You can store the embeddings result field in the same mdm_gemini.CITIBIKE_STATIONS table as before, but I am choosing to create a new one for clarity:</p><pre>CREATE TABLE `mdm_gemini.CITIBIKE_STATIONS_SOURCE1` AS (<br>SELECT *<br>FROM ML.GENERATE_EMBEDDING(<br> MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`,<br> ( select name, full_address_string as content from `mdm_gemini.CITIBIKE_STATIONS` <br> where full_address_string is not null )<br> )<br>);</pre><p>Let’s generate embeddings for address data in table CITIBIKE_STATIONS_SOURCE2:</p><pre>update `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` a set embeddings_src =<br>(<br>SELECT ml_generate_embedding_result<br>FROM ML.GENERATE_EMBEDDING(<br> MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`,<br> ( select name, address as content from `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` ))<br>where name = a.name) where name is not null;<br>This should create embeddings for the second source, note that we have created the embeddings field in the same table CITIBIKE_STATIONS_SOURCE2.</pre><p>To visualize the embeddings are generated for the source data tables 1 and 2, run the below queries:</p><pre>select name,address,embeddings_src from `mdm_gemini.CITIBIKE_STATIONS_SOURCE2`;<br>select name,content,ml_generate_embedding_result from `mdm_gemini.CITIBIKE_STATIONS_SOURCE1`;</pre><p>Let’s go ahead and perform vector search to identify duplicates.</p><h3>Step 5: Vector Search for Flagging Duplicate Addresses</h3><p>In this step, we will search the address embeddings ml_generate_embedding_result column of the `mdm_gemini.CITIBIKE_STATIONS_SOURCE1` table for the top 2 embeddings that match each row of data in the embeddings_src column of the `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` table:</p><pre>select query.name name1,base.name name2, <br>/* (select address from mdm_gemini.CITIBIKE_STATIONS_SOURCE2 where name = query.name) content1, base.content content2, */<br> distance<br> from VECTOR_SEARCH(<br> TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE1,<br> 'ml_generate_embedding_result',<br> (SELECT * FROM mdm_gemini.CITIBIKE_STATIONS_SOURCE2),<br> 'embeddings_src',<br> top_k => 2 <br>) where query.name <> base.name<br>order by distance desc;</pre><p><strong>Table that we are querying:</strong> mdm_gemini.CITIBIKE_STATIONS_EMBEDDINGS on the field ‘ml_generate_embedding_result’</p><p><strong>Table that we use as base:</strong> mdm_gemini.CITIBIKE_STATIONS_SOURCE2 on the field ‘embeddings_src’</p><p><strong>top_k:</strong> specifies the number of nearest neighbors to return. The default is 10. A negative value is treated as infinity, meaning that all values are counted as neighbors and returned.</p><p><strong>distance_type:</strong> specifies the type of metric to use to compute the distance between two vectors. Supported distance types are EUCLIDEAN and COSINE. The default is EUCLIDEAN.</p><p>The result of the query is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EQJgLI91pzmTsXnW" /><figcaption>Result Set</figcaption></figure><p>As you can see, we have listed 2 nearest neighbors (in other words, closest duplicates) for the 2 rows in CITIBIKE_STATIONS_SOURCE2 from CITIBIKE_STATIONS_SOURCE1. Since the distance_type is unspecified, it assumes that it is EUCLIDEAN and the distance is read as the distances in address TEXT values between the two sources, lowest being the most similar address texts.</p><p>Let’s set distance_type to COSINE:</p><pre>select query.name name1,base.name name2, <br>/* (select address from mdm_gemini.CITIBIKE_STATIONS_SOURCE2 where name = query.name) content1, base.content content2, */<br> distance<br> from VECTOR_SEARCH(<br> TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE1,<br> 'ml_generate_embedding_result',<br> (SELECT * FROM mdm_gemini.CITIBIKE_STATIONS_SOURCE2),<br> 'embeddings_src',<br> top_k => 2,distance_type => 'COSINE'<br>) where query.name <> base.name<br>order by distance desc;</pre><p>Result:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/905/0*lkeIwXC98QwSoqoq" /><figcaption>Result set</figcaption></figure><p>Both queries (of both distance types) are ordered by distance DESCENDING which means we want to list the results in the order of decreasing distance. But you will notice that the second query’s distance order is reversed. Guess why?</p><p>Yes!! You got it right! When we say COSINE, it returns the similarity. So bigger the number, closer the similarity and hence lesser the distance. In EUCLIDEAN, bigger the number, farther the distance of values (in this case text).</p><blockquote><strong>Tips to understand the difference and applications of EUCLIDEAN and COSINE:</strong></blockquote><blockquote><strong>Euclidean Distance</strong> measures the straight-line distance between two points in a multi-dimensional space and<strong> Cosine Similarity</strong> measures the cosine of the angle between two vectors.</blockquote><blockquote>Scale Invariance: Cosine similarity is not affected by the magnitude (length) of vectors, only their direction. This is useful when comparing documents of different lengths or user preferences with varying intensities.</blockquote><blockquote>Curse of Dimensionality: Euclidean distance can become less informative in very high-dimensional spaces, while cosine similarity tends to hold up better.</blockquote><blockquote>When to Use Which:<br>Similar Scale, Absolute Differences Matter: Euclidean distanceDifferent Scales, Direction Matters More: Cosine similarity</blockquote><blockquote>5. Example:<br>Imagine two users’ movie ratings:<br>User 1: [5, 4, 3] (action, comedy, drama)<br>User 2: [10, 8, 6] (same genres, but rated higher)</blockquote><blockquote>Euclidean distance would be large due to the difference in rating scale.<br>Cosine similarity would be high, indicating similar taste despite the rating difference.</blockquote><blockquote>Please note in our example, we have used location text similarity — it doesn’t mean we are tracing the absolute geocoding distance between the 2 locations by value. We are only finding their similarity by the address text values.</blockquote><p>I built a UI around the Vector Search data from BigQuery for visualizing nearest neighbors based on their similarity in address string, NOT THE PHYSICAL DISTANCE, for a slightly larger dataset in a simple Java Spring Boot application and deployed it in Cloud Run. I created a <a href="https://github.com/AbiramiSukumaran/mdm_gemini_web">sample app</a>, edit the SQL in the controller class to match your distance results table:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*X5vJaFouSO6L2e97C-fnoA.gif" /><figcaption>A demo of Bike Station Location UI that lists stations with similar addresses (not actual distance)</figcaption></figure><h3>Conclusion</h3><p>This project has demonstrated the power of using Gemini 1.0 Pro and Function Calling in transforming a few MDM activities into simplified yet powerful, deterministic and reliable generative AI capabilities. Now that you know, feel free to identify other ways of implementing the same use case or other MDM functionalities. Are there datasets you could validate, information gaps you could fill, or tasks that could be automated with structured calls embedded within your generative AI responses? Here is the link to the <a href="https://github.com/AbiramiSukumaran/MDMwithGemini">repo</a> and for further reading, refer to the documentation for <a href="https://cloud.google.com/vertex-ai">Vertex AI</a>, <a href="https://cloud.google.com/bigquery/docs/remote-functions">BigQuery Remote Functions</a>, and <a href="https://cloud.google.com/functions">Cloud Functions</a>, <a href="https://cloud.google.com/bigquery/docs/vector-index-text-search-tutorial#create_the_remote_model_for_text_embedding_generation">Embeddings</a>, <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search">Vector Search</a> for more in-depth guidance.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=35fd35d306c2" width="1" height="1" alt=""><hr><p><a href="https://medium.com/google-cloud/master-data-management-simplified-match-merge-with-generative-ai-35fd35d306c2">Master Data Management Simplified: Match & Merge with Generative AI!</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>
Author
Link
Published date
Image url
Feed url
Guid
Hidden blurb
--- !ruby/object:Feedjira::Parser::RSSEntry title: 'Master Data Management Simplified: Match & Merge with Generative AI!' published: 2024-04-17 01:27:38.000000000 Z categories: - mdm - bigquery - gemini - google-cloud-platform - generative-ai url: https://medium.com/google-cloud/master-data-management-simplified-match-merge-with-generative-ai-35fd35d306c2?source=rss----e52cf94d98af---4 entry_id: !ruby/object:Feedjira::Parser::GloballyUniqueIdentifier is_perma_link: 'false' guid: https://medium.com/p/35fd35d306c2 carlessian_info: news_filer_version: 2 newspaper: Google Cloud - Medium macro_region: Blogs content: '<p>It is my dream to accelerate some of the tedious MDM processes with Generative AI, Embeddings, Vector Search and more… Read on.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/740/1*zccJwrLqwmYQ38V2yeCtRw.png" /><figcaption>Image showing a bike station with docked bikes to represent the use case</figcaption></figure><h3>Introduction</h3><p>At its core, Master Data Management (MDM) is about creating a single, reliable source of truth for your organization’s most critical data. Think of MDM as a meticulously curated library where every book (data point) is correctly labeled, up-to-date, and easy to find. Implementing robust MDM has always been a strategic necessity, but it often comes with complexities and resource demands. This is where the transformative power of Generative AI, particularly models like Gemini 1.0 Pro, Gemini 1.0 Pro Vision, Gemini 1.5 Pro, comes into play.</p><blockquote><strong>In this article</strong>, we will demonstrate how Gemini 1.0 Pro simplifies master data management applications like enrichment and deduplication, for the citibike_stations data available in the BigQuery public dataset. For this, we will use:<br>1. BigQuery public dataset bigquery-public-data.new_york_citibike.<br>2. Gemini Function Calling (a Java Cloud Function that gets the address information using the reverse Geocoding API for the coordinates available with the citibike_stations data) that we have already created in <a href="https://medium.com/google-cloud/using-gemini-function-calling-in-java-for-deterministic-generative-ai-responses-4c86a5ab80a9">one</a> of our previous articles.<br>3. Vertex AI Embeddings API and Vector Search in BigQuery to identify duplicates.</blockquote><h3>Master Data Management (MDM): The Foundation for Data-Driven Decisions</h3><p>The key elements of master data are business entities (like customers, products, suppliers, locations, etc. which are the nouns that your business revolves around), identifiers (Unique identifiers ensure each entity is distinct and traceable across systems), attributes (These are the characteristics that describe each entity e.g. a customer’s address, a product’s price etc.). I am going to take a little space here with it because understanding the importance of MDM is paramount in the generative AI era more than ever. Why MDM Matters:</p><ul><li><strong>Adds clarity to data:</strong></li></ul><p>Without MDM, businesses struggle with fragmented data scattered across various systems. This leads to inconsistencies, duplicates, and sometimes ambiguous understanding of relationships between data points (e.g. is “Abirami Sukumaran” in one system the same customer as “Abirami S.” in another?).</p><ul><li><strong>Helps in decision making:</strong></li></ul><p>Accurate, unified master data is the foundation of informed business decisions. It answers questions like — Who are our most valuable customers? Which products are underperforming? What suppliers provide the best value?</p><ul><li><strong>Serves data as a Product:</strong></li></ul><p>MDM is essential for treating data as a valuable asset. Clean, integrated, and up-to-date master data forms the basis for building insightful data products that drive business outcomes.</p><p><strong>Let me paint a picture to help you understand master data better by comparing it with transactional data: </strong>Transactional data captures individual events (a purchase, a shipment etc.). Master data provides the context for those events by defining the entities involved. For example, a sales transaction links to master data for the customer, product, and salesperson. Without MDM, data tends to remain in silos hindering a holistic view, prone to quality issues and relatively heavy resource-spend on data reconciliation and consolidation on demand.</p><h3>The powerful synergy of MDM and Generative AI</h3><p>Below are some of the industry grade applications and advantages of empowering and simplifying MDM with Generative AI:</p><ul><li><strong>Automated Data Cleansing:</strong> Gemini 1.0 Pro can intelligently identify and rectify inconsistencies, duplicates, and errors in master data, significantly reducing manual effort.</li><li><strong>Intelligent Matching:</strong> Leveraging its language understanding capabilities, Generative AI capabilities, embeddings and Vector Search, come together to accurately match and merge records from disparate sources, even with variations in formatting or naming conventions.</li><li><strong>Contextual Enrichment:</strong> Generative AI models can generate missing attributes or provide additional context to master data, enhancing its value for analysis and decision-making.</li><li><strong>Adaptive Learning</strong>: You can leverage fine tuning and reinforcement learning to enable generative AI to continuously improve its understanding of your business entities and relationships, leading to more accurate and efficient MDM over time.</li></ul><h3>High Level Flow Diagram</h3><p>This diagram represents the flow of data and steps involved in the implementation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1001/1*pMW7YoxbqYbJjYXgucff9g.png" /><figcaption>High level flow of the use case</figcaption></figure><h3>Demo</h3><p>The steps loosely are as follows:</p><ol><li>Create a BigQuery dataset for the use case. Create a landing table with data from the public dataset table bigquery-public-data.new_york_citibike.citibike_stations.</li><li>Make sure the Cloud Function that includes Gemini Function Calling for address standardization is deployed.</li><li>Store the enriched address data in the landing tables (from 2 sources for demo purpose).</li><li>Invoke Vertex AI Embeddings API from BigQuery on the address data.</li><li>Use BigQuery Vector Search to identify duplicate records.</li></ol><h3>Setup</h3><ol><li>In the <a href="https://console.cloud.google.com/">Google Cloud Console</a>, on the project selector page, select or create a Google Cloud <a href="https://cloud.google.com/resource-manager/docs/creating-managing-projects">project</a>.</li><li>Make sure that billing is enabled for your Cloud project. Learn how to <a href="https://cloud.google.com/billing/docs/how-to/verify-billing-enabled">check if billing is enabled on a project</a>.</li><li>You will use Cloud Shell, a command-line environment running in Google Cloud that comes preloaded with bq. From the Cloud Console, click Activate Cloud Shell on the top right corner.</li><li>Enable necessary APIs for this implementation if you haven’t already. Alternative to the gcloud command is through the console using this <a href="https://pantheon.corp.google.com/apis/enableflow?apiid=cloudfunctions.googleapis.com,run.googleapis.com,bigquery.googleapis.com,bigqueryconnection.googleapis.com,aiplatform.googleapis.com">link</a>.</li></ol><h3>Step 1: Create BigQuery Dataset and External Connection</h3><p>BigQuery <a href="https://cloud.google.com/bigquery/docs/datasets-intro">dataset</a> is a container for all the tables and objects for your application. BigQuery <a href="https://cloud.google.com/bigquery/docs/remote-functions#create_a_connection">connection</a> is used to interact with your Cloud Function. In order to create a remote function, you must create a BigQuery connection. Let’s begin with creating the dataset and the connection.</p><ol><li>From the Google Cloud Console, go to the <a href="https://console.cloud.google.com/bigquery">BigQuery page</a> and click the 3 vertical dots icon next to your project id. From the list of options, select “Create data set”.</li><li>In Create data set pop up, enter the data set ID “mdm_gemini” with the region set to the default value “US (multiple regions…)”</li><li><a href="https://cloud.google.com/bigquery/docs/remote-functions#create_a_connection">BigLake Connection</a> allows us to connect the external data source while retaining fine-grained BigQuery access control and security, which in our case is the Vertex AI Gemini Pro API. We will use this connection to access the model from BigQuery via the Cloud Function. Follow steps below to create the BigLake Connection:</li></ol><p>a. Click ADD on the Explorer pane of the BigQuery page:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/716/0*mqX5rzEHWjkzEQIE" /><figcaption>ADD button to create connection</figcaption></figure><p>b. Click Connections to external data sources in the sources page.</p><p>c. Enter external data source details as below in the pop up that shows up and click CREATE CONNECTION:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*b31hiO4ynbDLRrXWEFF4aQ.png" /><figcaption>Create Connection</figcaption></figure><p>d. Once the connection is created, go to the connection configuration page and copy the Service account ID for access provisioning:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xh8uToOATp-TmC5O" /><figcaption>Connection Info</figcaption></figure><p>e. Open <a href="https://console.cloud.google.com/iam-admin/iam">IAM and admin</a> page, click GRANT ACCESS, enter the service account id in the new principals tab and roles as shown below and click SAVE.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fCExrL0trAtqirew" /><figcaption>Grant Access</figcaption></figure><h3>Step 2: Deploy Function Calling (Java Cloud Function)</h3><ol><li>Clone the github <a href="https://github.com/AbiramiSukumaran/GeminiFunctionCalling">repo</a> from your Cloud Shell Terminal and change the YOUR_API_KEY and YOUR_PROJECT_ID with your values</li></ol><pre>git clone https://github.com/AbiramiSukumaran/GeminiFunctionCalling</pre><p>2. Go to Cloud Shell terminal, navigate into the newly cloned project directory “GeminiFunctionCalling” and execute the below statement build and deploy the Cloud Function:</p><pre>gcloud functions deploy gemini-fn-calling --gen2 --region=us-central1 --runtime=java11 --source=. --entry-point=cloudcode.helloworld.HelloWorld --trigger-http</pre><p>The result for this deploy command would be a REST URL in the format as below :</p><p><a href="https://us-central1-abis-345004.cloudfunctions.net/gemini-bq-fn"><strong>https://us-central1-</strong>YOUR_PROJECT_ID<strong>.cloudfunctions.net/gemini-fn-calling</strong></a></p><p>3. Test this Cloud Function by running the following command from the terminal:</p><pre>gcloud functions call gemini-fn-calling - region=us-central1 - gen2 - data '{"calls":[["40.714224,-73.961452"]]}'</pre><p>Response for a random sample prompt:</p><pre> '{"replies":["{ \"DOOR_NUMBER\": \"277\", \"STREET_ADDRESS\": \"Bedford Ave\", \"AREA\":<br> null, \"CITY\": \"Brooklyn\", \"TOWN\": null, \"COUNTY\": \"Kings County\", \"STATE\":<br> \"NY\", \"COUNTRY\": \"USA\", \"ZIPCODE\": \"11211\", \"LANDMARK\": null}}```"]}'</pre><p>The request and response parameters of this Cloud Function are implemented in a way that is compatible with BigQuery’s remote function invocation. It can be directly consumed from BigQuery data in-place. It means that if your data input (lat and long data) lives in BigQuery then you can call the remote function on the data and get the function response which can be stored or processed within BigQuery directly.</p><p>4. Run the following DDL from BigQuery to create a remote function that invokes this deployed Cloud Function:</p><pre> CREATE OR REPLACE FUNCTION <br> `mdm_gemini.MDM_GEMINI` (latlng STRING) RETURNS STRING<br> REMOTE WITH CONNECTION `us.gemini-bq-conn`<br> OPTIONS (<br> endpoint = 'https://us-central1-YOUR_PROJECT_ID.cloudfunctions.net/gemini-fn-calling', max_batching_rows = 1<br> );</pre><p><strong>WORKAROUND</strong></p><p>If you do not have the necessary key or Cloud Function deployed, feel free to jump to the landing table directly by exporting the data from the csv into your new BigQuery dataset mdm_gemini using the following command in the Cloud Shell Terminal. <strong>Remember</strong> to download the file <a href="https://github.com/AbiramiSukumaran/MDMwithGemini/blob/main/CITIBIKE_STATIONS.csv">CITIBIKE_STATIONS.csv</a> from the <a href="https://github.com/AbiramiSukumaran/MDMwithGemini">repo</a> into your Cloud Shell project folder and navigate into that folder before executing the following command:</p><pre>bq load --source_format=CSV --skip_leading_rows=1 mdm_gemini.CITIBIKE_STATIONS ./CITIBIKE_STATIONS.csv \ name:string,latlng:string,capacity:numeric,num_bikes_available:numeric,num_docks_available:numeric,last_reported:timestamp,full_address_string:string</pre><h3>Step 3: Create Table and Enrich Address data</h3><p>a. If you have used the WORKAROUND approach from the last step, you can skip this step, since you have already created the table there. If NOT, proceed to the running the following DDL in BigQuery SQL Editor:</p><pre>CREATE TABLE mdm_gemini.CITIBIKE_STATIONS as (<br>select name, latitude || ',' || longitude as latlong, capacity, num_bikes_available, num_docks_available,last_reported,<br>'' as full_address_string <br> from bigquery-public-data.new_york_citibike.citibike_stations) ;</pre><p>Let’s enrich the address data by invoking the remote function on the latitude and longitude coordinates available in the table. Please note that we will update this only for data reported for the year 2024 and where number of bikes available > 0 and capacity > 100:</p><pre>update `mdm_gemini.CITIBIKE_STATIONS` <br>set full_address_string = `mdm_gemini.MDM_GEMINI`(latlong) <br>where EXTRACT(YEAR FROM last_reported) = 2024 and num_bikes_available > 0 and capacity > 100;</pre><p>b. Do not skip this step even if you used the WORKAROUND approach in the last step. This is where we will create a second source of bike station location data for the purpose of this use case. Afterall, MDM is bringing data from multiple sources together and identifying the golden truth.</p><p>Run the following DDLs in BigQuery SQL Editor for creating the second source of location data with 2 records in it. Let’s name this table mdm_gemini.CITIBIKE_STATIONS_SOURCE2 and insert 2 records into it:</p><pre>CREATE TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE2 (name STRING(55), address STRING(1000), embeddings_src ARRAY<FLOAT64>);<br><br>insert into mdm_gemini.CITIBIKE_STATIONS_SOURCE2 VALUES ('Location broadway and 29','{ "DOOR_NUMBER": "1593", "STREET_ADDRESS": "Broadway", "AREA": null, "CITY": "New York", "TOWN": null, "COUNTY": "New York County", "STATE": "NY", "COUNTRY": "USA", "ZIPCODE": "10019", "LANDMARK": null}', null);<br><br>insert into mdm_gemini.CITIBIKE_STATIONS_SOURCE2 VALUES ('Allen St & Hester','{ "DOOR_NUMBER": "36", "STREET_ADDRESS": "Allen St", "AREA": null, "CITY": "New York", "TOWN": null, "COUNTY": "New York County", "STATE": "NY", "COUNTRY": "USA", "ZIPCODE": "10002", "LANDMARK": null}', null);</pre><h3>Step 4: Generate Embeddings for Address Data</h3><p>Embeddings are high-dimensional numerical vectors that represent a given entity, like a piece of text or an audio file. Machine learning (ML) models use embeddings to encode semantics about such entities to make it easier to reason about and compare them. For example, a common operation in clustering, classification, and recommendation models is to measure the distance between vectors in an embedding space to find items that are most semantically similar. The Vertex AI text-embeddings API lets you create a text embedding using Generative AI on Vertex AI. Text embeddings are numerical representations of text that capture relationships between words and phrases. Read more about Vertex AI Text Embeddings <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings">here</a>.</p><p>Run the below DDL to create a remote model for Vertex AI text embeddings API:</p><pre>CREATE OR REPLACE MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`<br>REMOTE WITH CONNECTION `us.gemini-bq-conn`<br>OPTIONS (ENDPOINT = 'textembedding-gecko@latest');</pre><p>Now that the remote embeddings model is ready, let’s generate embeddings for the first source and store it in a table. You can store the embeddings result field in the same mdm_gemini.CITIBIKE_STATIONS table as before, but I am choosing to create a new one for clarity:</p><pre>CREATE TABLE `mdm_gemini.CITIBIKE_STATIONS_SOURCE1` AS (<br>SELECT *<br>FROM ML.GENERATE_EMBEDDING(<br> MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`,<br> ( select name, full_address_string as content from `mdm_gemini.CITIBIKE_STATIONS` <br> where full_address_string is not null )<br> )<br>);</pre><p>Let’s generate embeddings for address data in table CITIBIKE_STATIONS_SOURCE2:</p><pre>update `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` a set embeddings_src =<br>(<br>SELECT ml_generate_embedding_result<br>FROM ML.GENERATE_EMBEDDING(<br> MODEL `mdm_gemini.CITIBIKE_STATIONS_ADDRESS_EMB`,<br> ( select name, address as content from `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` ))<br>where name = a.name) where name is not null;<br>This should create embeddings for the second source, note that we have created the embeddings field in the same table CITIBIKE_STATIONS_SOURCE2.</pre><p>To visualize the embeddings are generated for the source data tables 1 and 2, run the below queries:</p><pre>select name,address,embeddings_src from `mdm_gemini.CITIBIKE_STATIONS_SOURCE2`;<br>select name,content,ml_generate_embedding_result from `mdm_gemini.CITIBIKE_STATIONS_SOURCE1`;</pre><p>Let’s go ahead and perform vector search to identify duplicates.</p><h3>Step 5: Vector Search for Flagging Duplicate Addresses</h3><p>In this step, we will search the address embeddings ml_generate_embedding_result column of the `mdm_gemini.CITIBIKE_STATIONS_SOURCE1` table for the top 2 embeddings that match each row of data in the embeddings_src column of the `mdm_gemini.CITIBIKE_STATIONS_SOURCE2` table:</p><pre>select query.name name1,base.name name2, <br>/* (select address from mdm_gemini.CITIBIKE_STATIONS_SOURCE2 where name = query.name) content1, base.content content2, */<br> distance<br> from VECTOR_SEARCH(<br> TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE1,<br> 'ml_generate_embedding_result',<br> (SELECT * FROM mdm_gemini.CITIBIKE_STATIONS_SOURCE2),<br> 'embeddings_src',<br> top_k => 2 <br>) where query.name <> base.name<br>order by distance desc;</pre><p><strong>Table that we are querying:</strong> mdm_gemini.CITIBIKE_STATIONS_EMBEDDINGS on the field ‘ml_generate_embedding_result’</p><p><strong>Table that we use as base:</strong> mdm_gemini.CITIBIKE_STATIONS_SOURCE2 on the field ‘embeddings_src’</p><p><strong>top_k:</strong> specifies the number of nearest neighbors to return. The default is 10. A negative value is treated as infinity, meaning that all values are counted as neighbors and returned.</p><p><strong>distance_type:</strong> specifies the type of metric to use to compute the distance between two vectors. Supported distance types are EUCLIDEAN and COSINE. The default is EUCLIDEAN.</p><p>The result of the query is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EQJgLI91pzmTsXnW" /><figcaption>Result Set</figcaption></figure><p>As you can see, we have listed 2 nearest neighbors (in other words, closest duplicates) for the 2 rows in CITIBIKE_STATIONS_SOURCE2 from CITIBIKE_STATIONS_SOURCE1. Since the distance_type is unspecified, it assumes that it is EUCLIDEAN and the distance is read as the distances in address TEXT values between the two sources, lowest being the most similar address texts.</p><p>Let’s set distance_type to COSINE:</p><pre>select query.name name1,base.name name2, <br>/* (select address from mdm_gemini.CITIBIKE_STATIONS_SOURCE2 where name = query.name) content1, base.content content2, */<br> distance<br> from VECTOR_SEARCH(<br> TABLE mdm_gemini.CITIBIKE_STATIONS_SOURCE1,<br> 'ml_generate_embedding_result',<br> (SELECT * FROM mdm_gemini.CITIBIKE_STATIONS_SOURCE2),<br> 'embeddings_src',<br> top_k => 2,distance_type => 'COSINE'<br>) where query.name <> base.name<br>order by distance desc;</pre><p>Result:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/905/0*lkeIwXC98QwSoqoq" /><figcaption>Result set</figcaption></figure><p>Both queries (of both distance types) are ordered by distance DESCENDING which means we want to list the results in the order of decreasing distance. But you will notice that the second query’s distance order is reversed. Guess why?</p><p>Yes!! You got it right! When we say COSINE, it returns the similarity. So bigger the number, closer the similarity and hence lesser the distance. In EUCLIDEAN, bigger the number, farther the distance of values (in this case text).</p><blockquote><strong>Tips to understand the difference and applications of EUCLIDEAN and COSINE:</strong></blockquote><blockquote><strong>Euclidean Distance</strong> measures the straight-line distance between two points in a multi-dimensional space and<strong> Cosine Similarity</strong> measures the cosine of the angle between two vectors.</blockquote><blockquote>Scale Invariance: Cosine similarity is not affected by the magnitude (length) of vectors, only their direction. This is useful when comparing documents of different lengths or user preferences with varying intensities.</blockquote><blockquote>Curse of Dimensionality: Euclidean distance can become less informative in very high-dimensional spaces, while cosine similarity tends to hold up better.</blockquote><blockquote>When to Use Which:<br>Similar Scale, Absolute Differences Matter: Euclidean distanceDifferent Scales, Direction Matters More: Cosine similarity</blockquote><blockquote>5. Example:<br>Imagine two users’ movie ratings:<br>User 1: [5, 4, 3] (action, comedy, drama)<br>User 2: [10, 8, 6] (same genres, but rated higher)</blockquote><blockquote>Euclidean distance would be large due to the difference in rating scale.<br>Cosine similarity would be high, indicating similar taste despite the rating difference.</blockquote><blockquote>Please note in our example, we have used location text similarity — it doesn’t mean we are tracing the absolute geocoding distance between the 2 locations by value. We are only finding their similarity by the address text values.</blockquote><p>I built a UI around the Vector Search data from BigQuery for visualizing nearest neighbors based on their similarity in address string, NOT THE PHYSICAL DISTANCE, for a slightly larger dataset in a simple Java Spring Boot application and deployed it in Cloud Run. I created a <a href="https://github.com/AbiramiSukumaran/mdm_gemini_web">sample app</a>, edit the SQL in the controller class to match your distance results table:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*X5vJaFouSO6L2e97C-fnoA.gif" /><figcaption>A demo of Bike Station Location UI that lists stations with similar addresses (not actual distance)</figcaption></figure><h3>Conclusion</h3><p>This project has demonstrated the power of using Gemini 1.0 Pro and Function Calling in transforming a few MDM activities into simplified yet powerful, deterministic and reliable generative AI capabilities. Now that you know, feel free to identify other ways of implementing the same use case or other MDM functionalities. Are there datasets you could validate, information gaps you could fill, or tasks that could be automated with structured calls embedded within your generative AI responses? Here is the link to the <a href="https://github.com/AbiramiSukumaran/MDMwithGemini">repo</a> and for further reading, refer to the documentation for <a href="https://cloud.google.com/vertex-ai">Vertex AI</a>, <a href="https://cloud.google.com/bigquery/docs/remote-functions">BigQuery Remote Functions</a>, and <a href="https://cloud.google.com/functions">Cloud Functions</a>, <a href="https://cloud.google.com/bigquery/docs/vector-index-text-search-tutorial#create_the_remote_model_for_text_embedding_generation">Embeddings</a>, <a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search">Vector Search</a> for more in-depth guidance.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=35fd35d306c2" width="1" height="1" alt=""><hr><p><a href="https://medium.com/google-cloud/master-data-management-simplified-match-merge-with-generative-ai-35fd35d306c2">Master Data Management Simplified: Match & Merge with Generative AI!</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>' rss_fields: - title - published - categories - url - entry_id - content - author author: Abirami Sukumaran
Language
Active
Ricc internal notes
Imported via /Users/ricc/git/gemini-news-crawler/webapp/db/seeds.d/import-feedjira.rb on 2024-04-18 09:13:11 +0200. Content is EMPTY here. Entried: title,published,categories,url,entry_id,content,author. TODO add Newspaper: filename = /Users/ricc/git/gemini-news-crawler/webapp/db/seeds.d/../../../crawler/out/feedjira/Blogs/Google Cloud - Medium/2024-04-17-Master_Data_Management_Simplified:_Match_&_Merge_with_Generative-v2.yaml
Ricc source
Show this article
Back to articles