In this brief article, I am going to show you how to use the LangChain framework with OpenAI (GPT-4) to work with Google Cloud's BigQuery vector search offering.

We are going to use a PDF file that provides a comprehensive overview of trends in AI research and development as of 2023. It covers various aspects of AI advancements, including the growth in AI publications, the evolution of machine learning systems, and significant trends in AI conference attendance and open-source AI software. Key highlights include detailed statistics on AI journal, conference, and repository publications categorized by type, field of study, and geographic area.

This PDF will be converted to text embeddings. I will then show you how to retrieve them using LangChain's ConversationalRetrievalChain with memory, by creating a retriever object that points to the embeddings, and finally how to talk to the PDF using simple search queries.

So let's begin.

Note: You need an active GCP account for this tutorial; even a trial account will do.
<h4>Step-1: Install the necessary modules in your local environment</h4>
<blockquote>pip3 install --upgrade langchain langchain_google_vertexai</blockquote>
<blockquote>pip3 install --upgrade --quiet google-cloud-storage</blockquote>
<blockquote>pip3 install pypdf</blockquote>
<h4>Step-2: Create a BigQuery Schema and download credentials file from GCP Account</h4>
Head over to BigQuery, open an editor and create a schema (dataset). Call it bq_vectordb; this is the schema in which the table storing our vector embeddings will be created.
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-LkhEbUf5INjqNH7uA5jjQ.png" /></figure>
Now, navigate to IAM from the GCP console and select Service Accounts from the left navigation. Here we will create and download the credentials JSON file containing the private key, which we will use in the Python script.
This JSON file grants our local environment access to the services in our GCP account at the project level.
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*prwgQH8LWrlaiDLv5rO76g.png" /></figure>
Click on Manage keys, then select ADD KEY followed by Create new key. Select JSON as the key type, and a file will be downloaded to your system automatically.
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tMZn_SLrTwt2BRYUpQieEQ.png" /></figure>
Rename this file and copy it to your current working directory.

That is all for the environment setup; now we can get to the execution part.
<h4>Step-3: Create and Ingest Embeddings using VertexAIEmbeddings, GCSFileLoader &amp; BigQueryVectorSearch</h4>
First, we need to create embeddings from the PDF file example.pdf using VertexAIEmbeddings. To do that, we load the PDF from a GCS bucket using GCSFileLoader from LangChain and use RecursiveCharacterTextSplitter to split it into several chunks with the overlap size set to 100.

NOTE: Before you execute the code below, make sure to upload example.pdf to a GCS bucket and change the path values accordingly.
<blockquote>from langchain_google_vertexai import VertexAIEmbeddings</blockquote>
<blockquote>from langchain_community.vectorstores import BigQueryVectorSearch</blockquote>
<blockquote>from langchain_community.document_loaders import GCSFileLoader</blockquote>
<blockquote>from langchain_community.document_loaders import PyPDFLoader</blockquote>
<blockquote>from langchain.text_splitter import RecursiveCharacterTextSplitter</blockquote>
<blockquote>import os</blockquote>
<blockquote>os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'your-json-filename.json'</blockquote>
<blockquote>PROJECT_ID = "{project-id}"</blockquote>
<blockquote>embedding = VertexAIEmbeddings(</blockquote>
<blockquote>    model_name="textembedding-gecko@latest", project=PROJECT_ID</blockquote>
<blockquote>)</blockquote>
<blockquote>gcs_bucket_name = 
"your-bucket-name"</blockquote>
<blockquote>pdf_filename = "test_data/example.pdf"</blockquote>
<blockquote>def load_pdf(file_path):</blockquote>
<blockquote>    return PyPDFLoader(file_path)</blockquote>
<blockquote>loader = GCSFileLoader(</blockquote>
<blockquote>    project_name=PROJECT_ID, bucket=gcs_bucket_name, blob=pdf_filename, loader_func=load_pdf</blockquote>
<blockquote>)</blockquote>
<blockquote>documents = loader.load()</blockquote>
<blockquote>text_splitter = RecursiveCharacterTextSplitter(</blockquote>
<blockquote>    chunk_size=10000,</blockquote>
<blockquote>    chunk_overlap=100,</blockquote>
<blockquote>    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],</blockquote>
<blockquote>)</blockquote>
<blockquote>doc_splits = text_splitter.split_documents(documents)</blockquote>
<blockquote>for idx, split in enumerate(doc_splits):</blockquote>
<blockquote>    split.metadata["chunk"] = idx</blockquote>
<blockquote>print(f"# of documents = {len(doc_splits)}")</blockquote>
Once you have chunked your PDF data, it is time to ingest it into BigQuery vector search.

Define your dataset (created in Step 2) and table name. The table will be created at run time.
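To make the chunk_size and chunk_overlap settings used above more concrete, here is a minimal sketch of overlapping character windows in plain Python. It is not how RecursiveCharacterTextSplitter is implemented (the real splitter also tries the separators in order); the function name split_with_overlap and the sample string are assumptions made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical sample input, small enough to inspect by eye.
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one. The overlap serves the same purpose in the tutorial: a sentence cut at a chunk boundary still appears whole in at least one chunk.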
Next, create a BigQueryVectorSearch object and use it to invoke the add_documents method.
<blockquote>DATASET = "bq_vectordb"</blockquote>
<blockquote>TABLE = "bq_vectors"  # You can come up with a more innovative name here</blockquote>
<blockquote>bq_object = BigQueryVectorSearch(</blockquote>
<blockquote>    project_id=PROJECT_ID,</blockquote>
<blockquote>    dataset_name=DATASET,</blockquote>
<blockquote>    table_name=TABLE,</blockquote>
<blockquote>    location="US",</blockquote>
<blockquote>    embedding=embedding,</blockquote>
<blockquote>)</blockquote>
<blockquote>bq_object.add_documents(doc_splits)</blockquote>
You can run the entire bq_ingest_data.py file as a single Python script.
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ck70GOSwdEcUkUsXMHZHSA.png" /></figure>
Once the execution is complete, head back to BigQuery and refresh your schema. You should see a table bq_vectors with the columns and data shown below. This means your embeddings have been created and are now stored in a BigQuery table.
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iBOif5duBgOT-W3BcKMaPw.png" /></figure>
<h4>Step-4: Retrieve embeddings &amp; use langchain with OpenAI to chat with your data</h4>
Most of the code below is self-explanatory.
We import the necessary libraries and use LangChain's ConversationBufferMemory, which retains the chat history across subsequent messages; this is quite important if you are building a chatbot.

Make sure to replace the placeholders in the script below with your actual values before executing it.
<blockquote>from langchain_community.vectorstores import BigQueryVectorSearch</blockquote>
<blockquote>from langchain_google_vertexai import VertexAIEmbeddings</blockquote>
<blockquote>from langchain_google_vertexai import VertexAI</blockquote>
<blockquote>from langchain.chains import RetrievalQA</blockquote>
<blockquote>from langchain.chains import ConversationalRetrievalChain</blockquote>
<blockquote>from langchain.memory import ConversationBufferMemory</blockquote>
<blockquote>from langchain.chat_models import ChatOpenAI</blockquote>
<blockquote>import pandas as pd</blockquote>
<blockquote>import os</blockquote>
<blockquote>api_key = "your-openai-api-key"</blockquote>
<blockquote>os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'json-filename.json'</blockquote>
<blockquote>DATASET = "bq_vectordb"</blockquote>
<blockquote>TABLE = "bq_vectors"</blockquote>
<blockquote>PROJECT_ID = "project-id"</blockquote>
<blockquote>embedding = VertexAIEmbeddings(model_name="textembedding-gecko@latest", project=PROJECT_ID)</blockquote>
<blockquote>memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True, output_key='answer')</blockquote>
<blockquote>bq_object = BigQueryVectorSearch(</blockquote>
<blockquote>    project_id=PROJECT_ID, dataset_name=DATASET, table_name=TABLE, location="US", embedding=embedding,</blockquote>
<blockquote>)</blockquote>
You can execute this code inside a Jupyter notebook.
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KaswBhP0_5ocyFZi7GdrOA.png" /></figure>
We now define our LLM and create a retriever object which points to the embeddings stored in the BigQuery table.
<blockquote>llm_openai = ChatOpenAI(model="gpt-4-turbo-2024-04-09", api_key=api_key)</blockquote>
<blockquote>retriever = bq_object.as_retriever()</blockquote>
<blockquote>conversational_retrieval = ConversationalRetrievalChain.from_llm(</blockquote>
<blockquote>    llm=llm_openai, retriever=retriever, memory=memory, verbose=False</blockquote>
<blockquote>)</blockquote>
Define a function which simply accepts a user query and returns the answer from the BigQuery
vector table.
<blockquote>def QaWithMemory(query):</blockquote>
<blockquote>    return conversational_retrieval.invoke(query)["answer"]</blockquote>
Now let's ask a question: "What was the rate of growth in AI research publications from 2010 to 2021, and which type of AI publication saw the most significant increase in this period?"
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*m3ZA7ojEsmKHbYq11NL84Q.png" /></figure>
You can see the response. It is quite accurate if you compare it with the PDF content. You can now ask a follow-up question without repeating the details, such as "and how might this growth impact the future of AI research priorities?"
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Zivc9_q_VhoaA9U154SFUg.png" /></figure>
Alright, that was it for this tutorial. Hope you enjoyed it :-). Stay tuned for more. Cheers!

Full source code: <a href="https://github.com/sidoncloud/gcp-use-cases/tree/main/langchain-bq-openai">https://github.com/sidoncloud/gcp-use-cases/tree/main/langchain-bq-openai</a>
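For readers wondering why the short follow-up question works, here is a rough sketch of the idea behind chat memory. The conversational chain keeps earlier turns and combines them with the new question so that it can be rephrased as a standalone question before retrieval. The class BufferMemory, the function build_condense_prompt, the prompt wording, and the placeholder answer are all hypothetical stand-ins written for illustration; they are not LangChain's classes or its actual prompt.

```python
from dataclasses import dataclass, field


@dataclass
class BufferMemory:
    """Hypothetical stand-in for a chat-history buffer (not LangChain's class)."""
    turns: list[tuple[str, str]] = field(default_factory=list)

    def save(self, question: str, answer: str) -> None:
        # Store one completed question/answer exchange.
        self.turns.append((question, answer))

    def as_text(self) -> str:
        # Render the stored exchanges as a plain-text transcript.
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


def build_condense_prompt(memory: BufferMemory, follow_up: str) -> str:
    """Combine history and follow-up so a model could rewrite it as a standalone question."""
    return (
        "Rephrase the follow-up as a standalone question.\n\n"
        f"Chat history:\n{memory.as_text()}\n\n"
        f"Follow-up: {follow_up}"
    )


memory = BufferMemory()
memory.save(
    "What was the rate of growth in AI research publications from 2010 to 2021?",
    "<answer returned by the chain>",  # placeholder, not a real model response
)
prompt = build_condense_prompt(
    memory, "and how might this growth impact the future of AI research priorities?"
)
print(prompt)
```

Without the stored history, "this growth" in the follow-up would have nothing to refer to, which is why the memory object is passed to the chain in Step-4.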

--- !ruby/object:Feedjira::Parser::RSSEntry
title: Talk to pdf using Bigquery Vectors, GPT4-Turbo & langchain
url: https://siddoncloud.medium.com/talk-to-pdf-using-bigquery-vectors-gpt4-turbo-langchain-4eac63140211?source=rss----e52cf94d98af---4
author: Sid
categories:
- langchain
- chatgpt
- generative-ai
- bigquery
- google-cloud-platform
published: 2024-05-12 20:52:04.000000000 Z
entry_id: !ruby/object:Feedjira::Parser::GloballyUniqueIdentifier
 is_perma_link: 'false'
 guid: https://medium.com/p/4eac63140211
carlessian_info:
 news_filer_version: 2
 newspaper: Google Cloud - Medium
 macro_region: Blogs
rss_fields:
- title
- url
- author
- categories
- published
- entry_id
- content
