♊️ GemiNews 🗞️
(dev)
Editing article
Title
Summary
Content
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JxjbysaQHi7-vJ5eQFbiJA.jpeg" /></figure><p>Data augmentation is a technique used in machine learning to increase the size of a dataset by creating new data out of existing data. This technique can help models <a href="https://developers.google.com/machine-learning/crash-course/generalization/video-lecture">generalize</a> better, avoiding <a href="https://developers.google.com/machine-learning/crash-course/generalization/peril-of-overfitting">overfitting</a> on the data they were trained on.</p><p>You might typically think of doing this with visual data: rotating, flipping, or cropping images, and so forth. PyTorch has a very useful <a href="https://pytorch.org/vision/stable/auto_examples/transforms/plot_transforms_getting_started.html#sphx-glr-auto-examples-transforms-plot-transforms-getting-started-py">transforms</a> package that allows you to apply random transformations to your dataset with just a few lines of code. While this may reduce accuracy on the training set, it often results in <a href="https://www.sciencedirect.com/science/article/abs/pii/S0957417420305200">improved accuracy</a> on the test set of unseen data — which is what really matters!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/0*wWTbJJ62O5oO4lMZ" /><figcaption><strong>Example: Data Augmentations of Rock Images, </strong>Credit: TseKiChun, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via Wikimedia Commons</figcaption></figure><p>We can apply this same technique to text data. It provides the same benefits: stretching your existing dataset and making your model more robust to noise and outliers. 
There are proven <a href="https://link.springer.com/article/10.1186/s40537-021-00492-0/figures/1">benefits</a> to data augmentation across datasets of all sizes, and the gains are particularly pronounced for small datasets.</p><p>Let’s explore a few examples using popular techniques:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/697/1*2sbkuAY-tFZgsIZ6H7SWDA.png" /></figure><p>There are a number of ways you can apply these techniques manually. Let’s say you want to apply the random deletion technique with <em>p</em>=0.1 of tokens deleted. You can tokenize the text and then keep each token with probability (1-<em>p</em>). Or, for back-translation, you can call the <a href="https://cloud.google.com/translate">Translation API</a> once for the target language, and then a second time to translate back to the original language. For synonyms, you could use a <a href="https://github.com/goodmami/wn">WordNet API</a> on random tokens.</p><p>With a powerful LLM like <a href="https://ai.google.dev/docs/migrate_to_cloud">Gemini</a>, you have a bag of tricks at your fingertips. You can easily make these modifications, and much more, in one toolset, with no need to cobble together multiple tools.</p><p>Let’s look at how to apply these techniques on a real-world dataset of <a href="https://console.cloud.google.com/marketplace/product/stack-exchange/stack-overflow">Stack Overflow questions and answers</a>. All of the details are provided in this <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/data-augmentation/data_augmentation_for_text.ipynb">notebook</a>, and I’ll point out the highlights here.</p><p>You can use <a href="https://cloud.google.com/python/docs/reference/bigframes/latest">BigQuery DataFrames</a> for all kinds of problems, and it makes text augmentation on our BigQuery dataset particularly straightforward. It provides a pandas-compatible DataFrame and a scikit-learn-like ML API that enable us to query Gemini directly. 
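The random deletion technique described earlier can be done in a few lines of plain Python. This is a minimal sketch, not code from the notebook; the function name and whitespace tokenization are illustrative choices:

```python
import random

def random_deletion(text, p=0.1, seed=None):
    """Keep each whitespace token independently with probability (1 - p)."""
    rng = random.Random(seed)
    tokens = text.split()
    kept = [tok for tok in tokens if rng.random() > p]
    # Guard against deleting every token from a very short input.
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return " ".join(kept)
```

Passing a seed makes the augmentation reproducible, which is useful when you want to regenerate the same augmented dataset later.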
It can handle batch jobs on massive datasets, as all DataFrame storage is in BigQuery.</p><p>So, let’s get started with one of these techniques, synonym replacement. First, we can query for accepted Stack Overflow Python answers since 2020, and load the results into a BigQuery DataFrame:</p><pre>import bigframes.pandas as bpd<br><br>stack_overflow_df = bpd.read_gbq_query(<br> """SELECT<br> CONCAT(q.title, q.body) AS input_text,<br> a.body AS output_text<br> FROM `bigquery-public-data.stackoverflow.posts_questions` q<br> JOIN `bigquery-public-data.stackoverflow.posts_answers` a<br> ON q.accepted_answer_id = a.id<br> WHERE q.accepted_answer_id IS NOT NULL<br> AND REGEXP_CONTAINS(q.tags, "python")<br> AND a.creation_date >= "2020-01-01"<br> LIMIT 550<br> """)</pre><p>Here’s a sneak peek of the Q&A DataFrame:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/704/0*wxaiojCKMWB4qMCY" /></figure><p>Let’s now randomly sample a number of rows from the DataFrame. Set <em>n_rows</em> to the number of new samples you’d like:</p><pre>df = stack_overflow_df.sample(n_rows)</pre><p>We can then define a Gemini text generator <a href="https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.llm.GeminiTextGenerator">model</a> like this:</p><pre>from bigframes.ml.llm import GeminiTextGenerator<br><br>model = GeminiTextGenerator()</pre><p>Next, let’s create two columns: a <strong>prompt column</strong> with synonym replacement instructions concatenated with the input text, and a <strong>result column</strong> with the synonym replacement applied.</p><pre># Create a prompt with the synonym replacement instructions and the input text<br>df["synonym_prompt"] = (<br>f"Replace {n_replacement_words} words from the input text with synonyms, "<br>+ "keeping the overall meaning as close to the original text as possible. "<br>+ "Only provide the synonymized text, with no additional explanation. "<br>+ "Preserve the original formatting.\n\nInput text: "<br>+ df["input_text"])<br><br># Run batch job and assign to a new column<br>df["input_text_with_synonyms"] = 
model.predict(<br>df["synonym_prompt"]<br>).ml_generate_text_llm_result<br><br># Compare the original and new columns<br>df.peek()[["input_text", "input_text_with_synonyms"]]</pre><p>Here are the results! Notice the subtle changes in the text with synonym replacement.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/717/0*aifrCJkADKM8zuaO" /></figure><p>Using this framework, it is simple to apply all kinds of batch transformations to augment your data. In the <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/data-augmentation/data_augmentation_for_text.ipynb">notebook</a>, you’ll see more prompts you can use for back-translation and noise injection. You’ve also seen how easy it is to enhance datasets with <a href="https://cloud.google.com/bigquery/docs/dataframes-quickstart">BigQuery DataFrames</a>. We hope this helps you in your data science journey using <a href="https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-pro">Gemini on Google Cloud</a>!</p><hr><p><a href="https://medium.com/google-cloud/how-to-augment-text-data-with-gemini-through-bigquery-dataframes-347bc6378413">How to Augment Text Data with Gemini through BigQuery DataFrames</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>
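To give a feel for what a noise-injection transformation produces before you prompt Gemini for it, here is a small plain-Python sketch. The function name and the adjacent-character-swap approach are illustrative assumptions on my part, not taken from the notebook:

```python
import random

def inject_noise(text, n_swaps=2, seed=None):
    """Introduce typo-like noise by swapping n_swaps pairs of adjacent characters."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_swaps):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        # Swap the character at position i with its neighbor.
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

Because only adjacent characters are swapped, the output keeps the same length and character multiset as the input, mimicking common typing mistakes without destroying the text.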
Author
Link
Published date
Image url
Feed url
Guid
Hidden blurb
--- !ruby/object:Feedjira::Parser::RSSEntry
title: How to Augment Text Data with Gemini through BigQuery DataFrames
url: https://medium.com/google-cloud/how-to-augment-text-data-with-gemini-through-bigquery-dataframes-347bc6378413?source=rss----e52cf94d98af---4
author: Karl Weinmeister
categories:
- bigquery
- gemini
- machine-learning
- google-cloud-platform
- generative-ai
published: 2024-04-08 06:26:44.000000000 Z
entry_id: !ruby/object:Feedjira::Parser::GloballyUniqueIdentifier
  is_perma_link: 'false'
  guid: https://medium.com/p/347bc6378413
carlessian_info:
  news_filer_version: 2
  newspaper: Google Cloud - Medium
  macro_region: Blogs
rss_fields:
- title
- url
- author
- categories
- published
- entry_id
- content
Language
Active
Ricc internal notes
Imported via /Users/ricc/git/gemini-news-crawler/webapp/db/seeds.d/import-feedjira.rb on 2024-04-09 05:41:39 -0700. Content is EMPTY here. Entries: title,url,author,categories,published,entry_id,content. TODO add Newspaper: filename = /Users/ricc/git/gemini-news-crawler/webapp/db/seeds.d/../../../crawler/out/feedjira/Blogs/Google Cloud - Medium/2024-04-08-How_to_Augment_Text_Data_with_Gemini_through_BigQuery_DataFrames-v2.yaml
Ricc source