♊️ GemiNews 🗞️
(dev)
Editing article
Title
Summary
Content
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JxjbysaQHi7-vJ5eQFbiJA.jpeg" /></figure><p>Data augmentation is a technique used in machine learning to increase the size of a dataset by creating new data out of existing data. This technique can help models <a href="https://developers.google.com/machine-learning/crash-course/generalization/video-lecture">generalize</a> better, avoiding <a href="https://developers.google.com/machine-learning/crash-course/generalization/peril-of-overfitting">overfitting</a> on the data they were trained on.</p><p>You might typically think of doing this with visual data: rotating, flipping, or cropping images, and so forth. PyTorch has a very useful <a href="https://pytorch.org/vision/stable/auto_examples/transforms/plot_transforms_getting_started.html#sphx-glr-auto-examples-transforms-plot-transforms-getting-started-py">transforms</a> package that allows you to apply random transformations to your dataset with just a few lines of code. While this may reduce accuracy on the training set, it often results in <a href="https://www.sciencedirect.com/science/article/abs/pii/S0957417420305200">improved accuracy</a> on the test set of unseen data — which is what really matters!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/0*wWTbJJ62O5oO4lMZ" /><figcaption><strong>Example: Data Augmentations of Rock Images, </strong>Credit: TseKiChun, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via Wikimedia Commons</figcaption></figure><p>We can apply this same technique to text data. It provides the same benefits: stretching your existing dataset and making your model more robust to noise and outliers. 
There are proven <a href="https://link.springer.com/article/10.1186/s40537-021-00492-0/figures/1">benefits</a> to data augmentation across datasets of all sizes, and the gains are particularly pronounced for small datasets.</p><p>Let’s explore a few examples using popular techniques:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/697/1*2sbkuAY-tFZgsIZ6H7SWDA.png" /></figure><p>There are a number of ways you can apply these techniques manually. Let’s say you want to apply the random deletion technique with <em>p</em>=0.1 of tokens deleted. You can tokenize the text and then keep each token with probability (1-<em>p</em>). Or, for back-translation, you can call the <a href="https://cloud.google.com/translate">Translation API</a> once for the target language, and then a second time to translate back to the original language. For synonyms, you could use a <a href="https://github.com/goodmami/wn">WordNet API</a> on random tokens.</p><p>With a powerful LLM like <a href="https://ai.google.dev/docs/migrate_to_cloud">Gemini</a>, you have a bag of tricks at your fingertips. You can easily make these modifications, and much more, in one toolset, with no need to cobble together multiple tools.</p><p>Let’s look at how to apply these techniques on a real-world dataset of <a href="https://console.cloud.google.com/marketplace/product/stack-exchange/stack-overflow">Stack Overflow questions and answers</a>. All of the details are provided in this <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/data-augmentation/data_augmentation_for_text.ipynb">notebook</a>, and I’ll point out the highlights here.</p><p>You can use <a href="https://cloud.google.com/python/docs/reference/bigframes/latest">BigQuery DataFrames</a> for all kinds of problems, and it makes text augmentation on our BigQuery dataset particularly straightforward. It provides a pandas-compatible DataFrame and a scikit-learn-like ML API that enable us to query Gemini directly. 
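The random deletion technique described earlier can be done in a few lines of plain Python. This is a minimal sketch, not code from the notebook; the function name and whitespace tokenization are illustrative choices:

```python
import random

def random_deletion(text, p=0.1, seed=None):
    """Keep each whitespace token independently with probability (1 - p)."""
    rng = random.Random(seed)
    tokens = text.split()
    kept = [tok for tok in tokens if rng.random() > p]
    # Guard against deleting every token from a very short input.
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return " ".join(kept)
```

Passing a seed makes the augmentation reproducible, which is useful when you want to regenerate the same augmented dataset later.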
It can handle batch jobs on massive datasets, as all DataFrame storage is in BigQuery.</p><p>So, let’s get started with one of these techniques, synonym replacement. First, we can query for accepted Stack Overflow Python answers since 2020, and load the results into a BigQuery DataFrame:</p><pre>import bigframes.pandas as bpd<br><br>stack_overflow_df = bpd.read_gbq_query(<br> """SELECT<br> CONCAT(q.title, q.body) AS input_text,<br> a.body AS output_text<br> FROM `bigquery-public-data.stackoverflow.posts_questions` q<br> JOIN `bigquery-public-data.stackoverflow.posts_answers` a<br> ON q.accepted_answer_id = a.id<br> WHERE q.accepted_answer_id IS NOT NULL<br> AND REGEXP_CONTAINS(q.tags, "python")<br> AND a.creation_date >= "2020-01-01"<br> LIMIT 550<br> """)</pre><p>Here’s a sneak peek of the Q&A DataFrame:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/704/0*wxaiojCKMWB4qMCY" /></figure><p>Let’s now randomly sample a number of rows from the DataFrame. Set <em>n_rows</em> to the number of new samples you’d like:</p><pre>df = stack_overflow_df.sample(n_rows)</pre><p>We can then define a Gemini text generator <a href="https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.llm.GeminiTextGenerator">model</a> like this:</p><pre>from bigframes.ml.llm import GeminiTextGenerator<br><br>model = GeminiTextGenerator()</pre><p>Next, let’s create two columns: a <strong>prompt column</strong> with synonym replacement instructions concatenated with the input text, and a <strong>result column</strong> with the synonym replacement applied.</p><pre># Create a prompt with the synonym replacement instructions and the input text<br>df["synonym_prompt"] = (<br>f"Replace {n_replacement_words} words from the input text with synonyms, "<br>+ "keeping the overall meaning as close to the original text as possible. "<br>+ "Only provide the synonymized text, with no additional explanation. "<br>+ "Preserve the original formatting.\n\nInput text: "<br>+ df["input_text"])<br><br># Run batch job and assign to a new column<br>df["input_text_with_synonyms"] = 
model.predict(<br>df["synonym_prompt"]<br>).ml_generate_text_llm_result<br><br># Compare the original and new columns<br>df.peek()[["input_text", "input_text_with_synonyms"]]</pre><p>Here are the results! Notice the subtle changes in the text with synonym replacement.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/717/0*aifrCJkADKM8zuaO" /></figure><p>Using this framework, it is simple to apply all kinds of batch transformations to augment your data. In the <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/data-augmentation/data_augmentation_for_text.ipynb">notebook</a>, you’ll see more prompts you can use for back-translation and noise injection. You’ve also seen how easy it is to enhance datasets with <a href="https://cloud.google.com/bigquery/docs/dataframes-quickstart">BigQuery DataFrames</a>. We hope this helps you in your data science journey using <a href="https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-pro">Gemini on Google Cloud</a>!</p><hr><p><a href="https://medium.com/google-cloud/how-to-augment-text-data-with-gemini-through-bigquery-dataframes-347bc6378413">How to Augment Text Data with Gemini through BigQuery DataFrames</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>
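To give a feel for what a noise-injection transformation produces before you prompt Gemini for it, here is a small plain-Python sketch. The function name and the adjacent-character-swap approach are illustrative assumptions on my part, not taken from the notebook:

```python
import random

def inject_noise(text, n_swaps=2, seed=None):
    """Introduce typo-like noise by swapping n_swaps pairs of adjacent characters."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_swaps):
        if len(chars) < 2:
            break
        i = rng.randrange(len(chars) - 1)
        # Swap the character at position i with its neighbor.
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```

Because only adjacent characters are swapped, the output keeps the same length and character multiset as the input, mimicking common typing mistakes without destroying the text.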
Author
Link
Published date
Image url
Feed url
Guid
Hidden blurb
--- !ruby/object:Feedjira::Parser::RSSEntry
title: How to Augment Text Data with Gemini through BigQuery DataFrames
url: https://medium.com/google-cloud/how-to-augment-text-data-with-gemini-through-bigquery-dataframes-347bc6378413?source=rss----e52cf94d98af---4
author: Karl Weinmeister
categories:
- bigquery
- gemini
- machine-learning
- google-cloud-platform
- generative-ai
published: 2024-04-08 06:26:44.000000000 Z
entry_id: !ruby/object:Feedjira::Parser::GloballyUniqueIdentifier
  is_perma_link: 'false'
  guid: https://medium.com/p/347bc6378413
carlessian_info:
  news_filer_version: 2
  newspaper: Google Cloud - Medium
  macro_region: Blogs
rss_fields:
- title
- url
- author
- categories
- published
- entry_id
- content
Language
Active
Ricc internal notes
Imported via /Users/ricc/git/gemini-news-crawler/webapp/db/seeds.d/import-feedjira.rb on 2024-04-09 05:41:39 -0700. Content is EMPTY here. Entries: title,url,author,categories,published,entry_id,content. TODO add Newspaper: filename = /Users/ricc/git/gemini-news-crawler/webapp/db/seeds.d/../../../crawler/out/feedjira/Blogs/Google Cloud - Medium/2024-04-08-How_to_Augment_Text_Data_with_Gemini_through_BigQuery_DataFrames-v2.yaml
Ricc source