♊️ GemiNews 🗞️ (dev)


🗞️Automating data extraction from SEC 10-K forms using Document AI and Generative AI


2024-04-18 - Harish Verma (from Google Cloud - Medium)


[Blogs] 🌎 https://medium.com/google-cloud/automating-data-extraction-from-sec-10-k-forms-using-document-ai-and-generative-ai-6b2a086d6167?source=rss----e52cf94d98af---4 [🧠] [v2] article_embedding_description: {:llm_project_id=>"Unavailable", :llm_dimensions=>nil, :article_size=>7214, :llm_embeddings_model_name=>"textembedding-gecko"}
[🧠] [v1/3] title_embedding_description: {:ricc_notes=>"[embed-v3] Fixed on 9oct24. Only seems incompatible at first glance with embed v1.", :llm_project_id=>"unavailable possibly not using Vertex", :llm_dimensions=>nil, :article_size=>7214, :poly_field=>"title", :llm_embeddings_model_name=>"textembedding-gecko"}
[🧠] [v1/3] summary_embedding_description:
[🧠] As per bug https://github.com/palladius/gemini-news-crawler/issues/4 we can state this article belongs to title/summary version: v3 (very few articles updated on 9oct24)

🗿article.to_s

------------------------------
Title: Automating data extraction from SEC 10-K forms using Document AI and Generative AI

Author: Harish Verma
PublishedDate: 2024-04-18
Category: Blogs
NewsPaper: Google Cloud - Medium
Tags: genai, google-cloud-platform, automation, document-ai, machine-learning
{"id"=>10165,
"title"=>"Automating data extraction from SEC 10-K forms using Document AI and Generative AI",
"summary"=>nil,
"content"=>"

SEC 10-K forms are comprehensive financial reports that public companies file with the U.S. Securities and Exchange Commission (SEC) to disclose their financial performance. These forms can be very large, ranging from 50 to over 200 pages, and their size and complex format make extracting data from them time-consuming and challenging.

In this blog post, we will show you how to use Google Cloud’s Document AI and Generative AI to parse SEC 10-K forms and extract key information. This solution can save you time and effort, and it can help you to make more informed investment decisions quickly.

Solution Architecture

The architecture of the SEC 10-K form parser built with Document AI and Generative AI is shown below. The solution takes a PDF document as input and extracts predefined fields.

[Figure: Solution Architecture]

The solution consists of the following components:

  • Document AI Custom Document Splitter (CDS): Splits an SEC 10-K form into individual sections.
  • Document AI Custom Document Extractor (CDE): Extracts key information presented in tabular form in different sections of the SEC 10-K form.
  • Generative AI: Extracts text-based information from the SEC 10-K form.
  • BigQuery: Stores the extracted data.
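
The way these four components hand data to one another can be sketched as a small pipeline. This is only an illustrative outline, not the author's implementation: `split_sections`, `extract_tables`, `extract_text_fields`, and `store` are hypothetical callables standing in for the Custom Document Splitter, the Custom Document Extractor, the Generative AI model, and BigQuery, and the section names are assumed labels.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Section:
    """A page range that the splitter assigned to one section type."""
    name: str
    pages: List[int]


def run_pipeline(
    pdf_bytes: bytes,
    split_sections: Callable[[bytes], List[Section]],
    extract_tables: Callable[[bytes, Section], Dict[str, dict]],
    extract_text_fields: Callable[[bytes, Section], Dict[str, str]],
    store: Callable[[dict], None],
    # Assumed labels for the sections that contain financial tables.
    table_sections: frozenset = frozenset(
        {"consolidated_balance_sheet", "statement_of_operations"}
    ),
) -> dict:
    """Split the PDF, route each section to the right extractor, store the result."""
    record: dict = {}
    for section in split_sections(pdf_bytes):
        if section.name in table_sections:
            # Tabular sections go to the (hypothetical) table extractor.
            record.update(extract_tables(pdf_bytes, section))
        else:
            # Everything else goes to the (hypothetical) text extractor.
            record.update(extract_text_fields(pdf_bytes, section))
    store(record)
    return record
```

Passing the components in as callables keeps the routing logic testable without any cloud calls; each one can be replaced by a real client wrapper later.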

Data and Model Training

The solution was trained on a dataset of SEC 10-K forms. A dataset of such forms is available on Kaggle: SEC Edgar Annual Financial Filings - 2021.

For Generative AI, fields such as company name, address, and year-end date are extracted by passing the relevant content to the text-bison model.
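
As a rough illustration of this step, the sketch below builds an extraction prompt and parses a JSON answer. The prompt wording, the `FIELDS` list, and the `generate` callable are assumptions made for the example; the post does not show the actual prompt. With the Vertex AI SDK, `generate` might wrap `TextGenerationModel.from_pretrained("text-bison").predict(prompt).text`, but here it is injected so the code runs without any cloud access.

```python
import json
from typing import Callable, Dict, Optional, Sequence

# Hypothetical field names, modelled on the keys in the sample output.
FIELDS = ("company_name", "company_address", "fiscal_year")


def build_prompt(section_text: str, fields: Sequence[str]) -> str:
    """Ask the model for one JSON object containing exactly the requested keys."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the SEC filing text below and "
        f"answer with a single JSON object whose keys are: {field_list}. "
        "Use null for any field that is not present.\n\n"
        f"Text:\n{section_text}\n\nJSON:"
    )


def extract_text_fields(
    section_text: str,
    generate: Callable[[str], str],
    fields: Sequence[str] = FIELDS,
) -> Dict[str, Optional[str]]:
    """Call the model and return one value (or None) per requested field."""
    raw = generate(build_prompt(section_text, fields))
    # Models often wrap JSON in extra prose, so keep only the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        return {name: None for name in fields}
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return {name: None for name in fields}
    return {name: parsed.get(name) for name in fields}
```

Returning `None` for anything missing or unparseable keeps downstream code from having to distinguish between a malformed model answer and an absent field.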

For the Custom Document Splitter, we divided the document into sections such as Introduction and Signature, and also identified important tables such as the Consolidated Balance Sheet and the Statement of Operations. We labeled 50+ documents and trained the splitter on them.

[Figure: Snapshot of the Custom Document Splitter]
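
One way to consume splitter predictions is to turn them into a page range per section. The sketch below assumes each prediction has been reduced to a section label, a list of zero-based page indices, and a confidence score; `SplitEntity`, the label names, and the 0.5 threshold are hypothetical and are not taken from the post.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class SplitEntity(NamedTuple):
    """Simplified stand-in for one splitter prediction."""
    section_type: str
    pages: List[int]  # zero-based page indices
    confidence: float


def to_page_ranges(
    entities: Iterable[SplitEntity],
    min_confidence: float = 0.5,
) -> Dict[str, Tuple[int, int]]:
    """Return (first_page, last_page) per section, skipping weak predictions."""
    ranges: Dict[str, Tuple[int, int]] = {}
    for entity in entities:
        if entity.confidence < min_confidence or not entity.pages:
            continue
        first, last = min(entity.pages), max(entity.pages)
        if entity.section_type in ranges:
            # Merge repeated predictions of the same section into one span.
            prev_first, prev_last = ranges[entity.section_type]
            first, last = min(first, prev_first), max(last, prev_last)
        ranges[entity.section_type] = (first, last)
    return ranges
```

Merging repeated predictions into a single span is a simplification; a section that genuinely appears in two separate places would need a list of ranges instead.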

For the Custom Document Extractor, the documents were labeled to identify the relevant fields. Example labels from the Consolidated Balance Sheet and Statement of Operations tables are total current assets, total current liabilities, total net sales, and operating expenses, each mapped to its year. We labeled 50+ documents and trained the extractor on them.

[Figure: Snapshot of the fields defined for the Custom Document Extractor]
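
The year-wise mapping can be pictured as grouping labeled values by field and period. The sketch below assumes the extractor output has already been reduced to `(field, period, value)` triples with `period` being either `current` or `previous`; that triple format and the `DESCRIPTIONS` table are assumptions for illustration, chosen to match the shape of the sample output.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical mapping from field name to the row label in the filing.
DESCRIPTIONS = {
    "current_assets": "Total current assets",
    "current_liabilities": "Total current liabilities",
}

VALID_PERIODS = ("previous", "current")


def group_by_period(
    labeled_values: Iterable[Tuple[str, str, str]],
) -> Dict[str, Dict[str, str]]:
    """Group (field, period, value) triples into one nested dict per field."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for field, period, value in labeled_values:
        if period not in VALID_PERIODS:
            raise ValueError(f"unknown period {period!r} for field {field!r}")
        grouped[field][period] = value.strip()
    for field, periods in grouped.items():
        if field in DESCRIPTIONS:
            periods["description"] = DESCRIPTIONS[field]
    return dict(grouped)
```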

Below is a sample page containing a Consolidated Balance Sheet table from an SEC 10-K form.

[Figure: Consolidated Balance Sheet table in an SEC 10-K form (Source)]

Results

The solution was evaluated on a test set of 20 documents, with the following results:

  • 95%+ accuracy for the Custom Document Splitter in identifying the sections of the forms
  • 90%+ accuracy for field extraction from tabular data using the Custom Document Extractor
  • 99%+ accuracy for field extraction from textual data using Generative AI
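
The post does not say how these accuracy figures were computed. One common choice, shown here purely as an assumption, is exact-match field accuracy: the share of expected fields whose extracted value equals the labeled value after light normalization.

```python
from typing import Dict, List, Optional


def _normalize(value: Optional[str]) -> Optional[str]:
    """Collapse whitespace and ignore case so trivial differences do not count as errors."""
    if value is None:
        return None
    return " ".join(str(value).split()).lower()


def field_accuracy(
    predictions: List[Dict[str, str]],
    ground_truth: List[Dict[str, str]],
) -> float:
    """Fraction of labeled fields that were extracted exactly, across all documents."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    total = 0
    correct = 0
    for predicted, expected in zip(predictions, ground_truth):
        for field, expected_value in expected.items():
            total += 1
            if _normalize(predicted.get(field)) == _normalize(expected_value):
                correct += 1
    return correct / total if total else 0.0
```

A missing field counts as an error under this definition, which is usually what is wanted when the set of fields is predefined.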

We tried the solution on the latest SEC filing by Alphabet Inc., which is publicly available here. Below is a snapshot of the 50-page document.

[Figure: Snapshot of the document (Source)]

Here is the output produced from the solution developed by directly ingesting the pdf shared by Alphabet.

{'company_address': '1600 Amphitheatre Parkway Mountain View, CA 94043',
'company_name': 'Alphabet Inc.',
'company_phone': '(650) 253-0000',
'fiscal_year': 'March 31, 2023',
'form_type': '10-Q',
'chief_financial_officer': 'Ruth M. Porat',
'current_assets': {'previous': '164,795', 'current': '161,985', 'description': 'Total current assets'},
'current_liabilities': {'previous': '69,300', 'current': '68,854', 'description': 'Total current liabilities'},
'net_income': {'previous': '16,436', 'current': '15,051', 'description': 'Net income'},
'total_net_sales': {'previous': '68,011', 'current': '69,787', 'description': 'Revenues'}}
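
Because the amounts in this output are strings with thousands separators, some post-processing is needed before they can be stored in numeric columns or used in calculations. The following is a minimal sketch under that assumption; `parse_amount` and `flatten_record` are hypothetical helpers, and the flat column naming (`<field>_<period>`) is just one possible layout for a BigQuery table.

```python
from typing import Dict


def parse_amount(text: str) -> int:
    """Convert '161,985' to 161985; treat '(1,234)' as the negative -1234."""
    cleaned = text.strip().replace(",", "").replace("$", "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = int(cleaned)
    return -value if negative else value


def flatten_record(record: dict) -> Dict[str, object]:
    """Turn nested {'previous', 'current'} fields into flat numeric columns."""
    row: Dict[str, object] = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for period in ("previous", "current"):
                if period in value:
                    row[f"{key}_{period}"] = parse_amount(value[period])
        else:
            row[key] = value
    return row


# A subset of the extracted output shown above.
record = {
    "company_name": "Alphabet Inc.",
    "form_type": "10-Q",
    "current_assets": {"previous": "164,795", "current": "161,985",
                       "description": "Total current assets"},
    "current_liabilities": {"previous": "69,300", "current": "68,854",
                            "description": "Total current liabilities"},
}

row = flatten_record(record)
current_ratio = row["current_assets_current"] / row["current_liabilities_current"]
print(row)
print(round(current_ratio, 2))  # 2.35
```

A list of such flat rows is the kind of JSON-serializable input that the BigQuery Python client's `insert_rows_json` method accepts, although the post does not describe how the data is actually loaded.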

Conclusion

The integration of Document AI and Generative AI offers a powerful solution for automating and enhancing SEC Form 10-K parsing. By leveraging machine learning and natural language processing capabilities, investors, analysts, and stakeholders can extract structured data with high accuracy, gain contextual understanding, and unlock data insights that are crucial for making informed decisions.

Learn more about the products used in the solution from the links below:

  • Document AI
  • Vertex AI
  • Cloud Storage

Automating data extraction from SEC 10-K forms using Document AI and Generative AI was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

",
"author"=>"Harish Verma",
"link"=>"https://medium.com/google-cloud/automating-data-extraction-from-sec-10-k-forms-using-document-ai-and-generative-ai-6b2a086d6167?source=rss----e52cf94d98af---4",
"published_date"=>Thu, 18 Apr 2024 06:15:18.000000000 UTC +00:00,
"image_url"=>nil,
"feed_url"=>"https://medium.com/google-cloud/automating-data-extraction-from-sec-10-k-forms-using-document-ai-and-generative-ai-6b2a086d6167?source=rss----e52cf94d98af---4",
"language"=>nil,
"active"=>true,
"ricc_source"=>"feedjira::v1",
"created_at"=>Thu, 18 Apr 2024 07:13:15.785029000 UTC +00:00,
"updated_at"=>Mon, 21 Oct 2024 20:24:33.477575000 UTC +00:00,
"newspaper"=>"Google Cloud - Medium",
"macro_region"=>"Blogs"}