♊️ GemiNews 🗞️
Editing article
Title
Summary
Content
<p>In this brief article, I am going to show you how to use the LangChain framework with OpenAI (GPT-4) to work with Google Cloud's BigQuery vector search offering.</p><p>We are going to use a PDF file which provides a comprehensive overview of trends in AI research and development as of 2023. It covers various aspects of AI advancements, including the growth in AI publications, the evolution of machine learning systems, and significant trends in AI conference attendance and open-source AI software. Key highlights include detailed statistics on AI journal, conference, and repository publications categorized by type, field of study, and geographic area.</p><p>This PDF will be converted to text embeddings, after which I will show you how to retrieve them using LangChain's <strong>ConversationalRetrievalChain with memory </strong>by creating a retriever object that points to the embeddings, and eventually talk to the PDF using simple search queries.</p><p>So let's begin.</p><p><strong>Note</strong>: You need an active GCP account for this tutorial; even a trial account will do.</p><h4>Step-1: Install the necessary modules in your local environment</h4><blockquote>pip3 install --upgrade langchain langchain_google_vertexai</blockquote><blockquote>pip3 install --upgrade --quiet google-cloud-storage</blockquote><blockquote>pip3 install pypdf</blockquote><h4>Step-2: Create a BigQuery Schema and download credentials file from GCP Account</h4><p>Head over to BigQuery, open up an editor and create a schema. Call it <strong>bq_vectordb</strong>; this is the schema in which the table that stores our vector embeddings will be created.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-LkhEbUf5INjqNH7uA5jjQ.png" /></figure><p>Now, navigate to <strong>IAM</strong> from the GCP console and select <strong>Service Accounts </strong>from the left navigation. 
Here we will create and download the permissions JSON file containing the private key which we will use in the Python script. This JSON file grants our local environment access to the services in our GCP account at the project level.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*prwgQH8LWrlaiDLv5rO76g.png" /></figure><p>Click on <strong>Manage keys</strong> and then select <strong>ADD KEY</strong> followed by <strong>Create new key</strong>. Select JSON as the key type and a file will be automatically downloaded to your system.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tMZn_SLrTwt2BRYUpQieEQ.png" /></figure><p>Rename this file and copy it to your current working directory.</p><p>That is all for the environment setup; now we can get to the execution part.</p><h4>Step-3: Create and Ingest Embeddings using VertexAIEmbeddings, GCSFileLoader & BigQueryVectorSearch</h4><p>First, we need to create embeddings from the PDF file <strong>example.pdf</strong> using <strong>VertexAIEmbeddings</strong>. 
To do that, we load this PDF file from a GCS bucket using <strong>GCSFileLoader</strong> from <strong>langchain</strong> and use the <strong>RecursiveCharacterTextSplitter</strong> to split it into several chunks, with the chunk overlap set to 100 characters.</p><p><strong>NOTE: </strong>Before you execute the code below, make sure to upload example.pdf to a GCS bucket and change the path values accordingly.</p><blockquote>from langchain_google_vertexai import VertexAIEmbeddings</blockquote><blockquote>from langchain_community.vectorstores import BigQueryVectorSearch</blockquote><blockquote>from langchain_community.document_loaders import GCSFileLoader</blockquote><blockquote>from langchain_community.document_loaders import PyPDFLoader</blockquote><blockquote>from langchain.text_splitter import RecursiveCharacterTextSplitter</blockquote><blockquote>import os</blockquote><blockquote># Point the Google client libraries at the service account key from Step-2</blockquote><blockquote>os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'your-json-filename.json'</blockquote><blockquote>PROJECT_ID = "{project-id}"</blockquote><blockquote>embedding = VertexAIEmbeddings(</blockquote><blockquote>    model_name="textembedding-gecko@latest", project=PROJECT_ID</blockquote><blockquote>)</blockquote><blockquote>gcs_bucket_name = "your-bucket-name"</blockquote><blockquote>pdf_filename = "test_data/example.pdf"</blockquote><blockquote># GCSFileLoader downloads the blob to a temporary file and passes its path here</blockquote><blockquote>def load_pdf(file_path):</blockquote><blockquote>    return PyPDFLoader(file_path)</blockquote><blockquote>loader = GCSFileLoader(</blockquote><blockquote>    project_name=PROJECT_ID, bucket=gcs_bucket_name, blob=pdf_filename, loader_func=load_pdf</blockquote><blockquote>)</blockquote><blockquote>documents = loader.load()</blockquote><blockquote>text_splitter = RecursiveCharacterTextSplitter(</blockquote><blockquote>    chunk_size=10000,</blockquote><blockquote>    chunk_overlap=100,</blockquote><blockquote>    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],</blockquote><blockquote>)</blockquote><blockquote>doc_splits = 
text_splitter.split_documents(documents)</blockquote><blockquote># Record each chunk's position so it can be traced back later</blockquote><blockquote>for idx, split in enumerate(doc_splits):</blockquote><blockquote>    split.metadata["chunk"] = idx</blockquote><blockquote>print(f"# of documents = {len(doc_splits)}")</blockquote><p>Once you have chunked your PDF data, it's time to ingest it into BigQuery vector search.</p><p>Define your dataset (created in Step-2) and table name. The table will be created at run time. Next, create a <strong>BigQueryVectorSearch </strong>object and use it to invoke the <strong>add_documents</strong> method.</p><blockquote>DATASET = "bq_vectordb"</blockquote><blockquote>TABLE = "bq_vectors"  # You can come up with a more innovative name here</blockquote><blockquote>bq_object = BigQueryVectorSearch(</blockquote><blockquote>    project_id=PROJECT_ID,</blockquote><blockquote>    dataset_name=DATASET,</blockquote><blockquote>    table_name=TABLE,</blockquote><blockquote>    location="US",</blockquote><blockquote>    embedding=embedding,</blockquote><blockquote>)</blockquote><blockquote>bq_object.add_documents(doc_splits)</blockquote><p>You can run the entire <strong>bq_ingest_data.py </strong>file<strong> </strong>as a single Python script.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ck70GOSwdEcUkUsXMHZHSA.png" /></figure><p>Once the execution is complete, head back to BigQuery and refresh your schema. You should see a table <strong>bq_vectors </strong>with the columns and data shown below. This means your embeddings have been created and are now stored in a BigQuery table.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iBOif5duBgOT-W3BcKMaPw.png" /></figure><h4>Step-4: Retrieve embeddings & use langchain with OpenAI to chat with your data</h4><p>Most of the code below is self-explanatory. 
We import the necessary libraries and use LangChain's <strong>ConversationBufferMemory</strong>, which retains the chat history across subsequent messages; this is quite important if you are building a chatbot.</p><p>Make sure to fill in the actual values in the script below before executing it.</p><blockquote>from langchain_community.vectorstores import BigQueryVectorSearch<br>from langchain_google_vertexai import VertexAIEmbeddings<br>from langchain.chains import ConversationalRetrievalChain<br>from langchain.memory import ConversationBufferMemory<br>from langchain.chat_models import ChatOpenAI<br>import os</blockquote><blockquote>api_key = "your-openai-api-key"<br>os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'json-filename.json'</blockquote><blockquote>DATASET = "bq_vectordb"<br>TABLE = "bq_vectors"<br>PROJECT_ID = "project-id"</blockquote><blockquote>embedding = VertexAIEmbeddings(<br>    model_name="textembedding-gecko@latest", project=PROJECT_ID<br>)</blockquote><blockquote>memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True, output_key="answer")</blockquote><blockquote>bq_object = BigQueryVectorSearch(<br>    project_id=PROJECT_ID,<br>    dataset_name=DATASET,<br>    table_name=TABLE,<br>    location="US",<br>    embedding=embedding,<br>)</blockquote><p>You can execute this code inside a Jupyter notebook.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KaswBhP0_5ocyFZi7GdrOA.png" /></figure><p>We now define our <strong>LLM </strong>and create a <strong>retriever</strong> object which will point to the embeddings stored in the BigQuery table.</p><blockquote>llm_openai = ChatOpenAI(model="gpt-4-turbo-2024-04-09", api_key=api_key)<br>retriever = bq_object.as_retriever()</blockquote><blockquote>conversational_retrieval = ConversationalRetrievalChain.from_llm(<br>    llm=llm_openai, retriever=retriever, 
memory=memory, verbose=False<br>)</blockquote><p>Define a function that simply accepts a user query and returns the answer generated from the chunks retrieved from the BigQuery vector table.</p><blockquote>def QaWithMemory(query):<br>    return conversational_retrieval.invoke(query)["answer"]</blockquote><p>Now let's ask a question: “<strong>What was the rate of growth in AI research publications from 2010 to 2021, and which type of AI publication saw the most significant increase in this period?</strong>”</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*m3ZA7ojEsmKHbYq11NL84Q.png" /></figure><p>You can see the response. It's quite accurate if you compare it with the PDF content. You can now ask a follow-up question without repeating the details, such as <strong>“and how might this growth impact the future of AI research priorities?”</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Zivc9_q_VhoaA9U154SFUg.png" /></figure><p>Alright, that was it for this tutorial. Hope you enjoyed it :-). Stay tuned for more. Cheers!</p><p><strong>Full Source code: </strong><a href="https://github.com/sidoncloud/gcp-use-cases/tree/main/langchain-bq-openai">https://github.com/sidoncloud/gcp-use-cases/tree/main/langchain-bq-openai</a></p>
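<p>If you are curious what the chunking in Step-3 and the retriever in Step-4 do conceptually, here is a minimal sketch in plain Python. It is not how LangChain or BigQuery implement these steps. The names split_with_overlap, toy_embed and top_k are hypothetical; toy_embed is a word-count stand-in for VertexAIEmbeddings, the splitter ignores the separators that RecursiveCharacterTextSplitter honors, and cosine similarity is assumed as the ranking metric purely for illustration.</p>

```python
import math
from collections import Counter


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into character windows; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose toy embeddings are closest to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    stored_chunks = [
        "AI publications grew quickly",
        "conference attendance rose",
        "open-source software repositories",
    ]
    print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
    print(top_k("growth in AI publications", stored_chunks, k=1))
```

<p>In the real pipeline the ranked chunks are not returned to the user directly: the chain passes them to the LLM as context, together with the chat history, to produce the answer.</p>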
Author
Link
Published date
Image url
Feed url
Guid
Hidden blurb
--- !ruby/object:Feedjira::Parser::RSSEntry title: Talk to pdf using Bigquery Vectors, GPT4-Turbo & langchain url: https://siddoncloud.medium.com/talk-to-pdf-using-bigquery-vectors-gpt4-turbo-langchain-4eac63140211?source=rss----e52cf94d98af---4 author: Sid categories: - langchain - chatgpt - generative-ai - bigquery - google-cloud-platform published: 2024-05-12 20:52:04.000000000 Z entry_id: !ruby/object:Feedjira::Parser::GloballyUniqueIdentifier is_perma_link: 'false' guid: https://medium.com/p/4eac63140211 carlessian_info: news_filer_version: 2 newspaper: Google Cloud - Medium macro_region: Blogs rss_fields: - title - url - author - categories - published - entry_id - content
Language
Active
Ricc internal notes
Imported via /Users/ricc/git/gemini-news-crawler/webapp/db/seeds.d/import-feedjira.rb on 2024-05-13 20:10:24 +0200. Content is EMPTY here. Entries: title,url,author,categories,published,entry_id,content. TODO add Newspaper: filename = /Users/ricc/git/gemini-news-crawler/webapp/db/seeds.d/../../../crawler/out/feedjira/Blogs/Google Cloud - Medium/2024-05-12-Talk_to_pdf_using_Bigquery_Vectors,_GPT4-Turbo_&_langchain-v2.yaml
Ricc source