Editing article
Title
Summary
Content
<p>SEC 10-K forms are comprehensive financial reports that public companies file with the U.S. Securities and Exchange Commission (SEC) to disclose their financial performance. These forms can be very large, ranging from 50 to over 200 pages, and their size and complex format make extracting data from them time-consuming and challenging.</p><p>In this blog post, we show how to use Google Cloud’s Document AI and Generative AI to parse SEC 10-K forms and extract key information. This solution can save you time and effort and help you make more informed investment decisions quickly.</p><h3>Solution Architecture</h3><p>The architecture of the SEC 10-K form parser built with Document AI and Generative AI is shown below. The solution consumes a PDF document and extracts predefined fields.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*zoXK4BtnBH0mRdUC" /><figcaption><em>Solution Architecture</em></figcaption></figure><p>The solution consists of the following components:</p><ul><li>Document AI <a href="https://cloud.google.com/document-ai/docs/workbench/build-custom-splitter-processor">Custom Document Splitter (CDS)</a>: Splits an SEC 10-K form into individual sections.</li><li>Document AI <a href="https://cloud.google.com/document-ai/docs/workbench/build-custom-processor">Custom Document Extractor (CDE)</a>: Extracts key information presented in tabular form in different sections of the SEC 10-K form.</li><li>Generative AI: Extracts text-based information from the SEC 10-K form.</li><li>BigQuery: Stores the extracted data.</li></ul><h3>Data and Model Training</h3><p>The solution was trained on a dataset of SEC 10-K forms. 
The Kaggle dataset <a href="https://www.kaggle.com/datasets/pranjalverma08/sec-edgar-annual-financial-filings-2021?resource=download">SEC Edgar Annual Financial Filings — 2021</a> is one source of SEC 10-K forms.</p><p>For Generative AI, fields such as company name, address, and fiscal year-end date are extracted by passing the relevant content to the text-bison model.</p><p>For the Custom Document Splitter, we divided each document into sections such as Introduction and Signature and identified important tables such as the Consolidated Balance Sheet and the Statement of Operations. We labeled and trained on more than 50 documents.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*A7xu-j3WS4zCOgxZ" /><figcaption><em>Snapshot of the Custom Document Splitter we developed</em></figcaption></figure><p>For the Custom Document Extractor, the documents were labeled to identify the relevant fields. Example labels from the Consolidated Balance Sheet and Statement of Operations tables are total current liabilities and assets, total net sales, and operating expenses, each mapped to its year. 
We labeled and trained on more than 50 documents.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*IiDpojwfGtVvG5nP" /><figcaption><em>Snapshot of the fields defined for the Custom Document Extractor</em></figcaption></figure><p>Below is a sample page containing a Consolidated Balance Sheet table from an SEC filing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rBJZwtTJjYax9op4" /><figcaption><em>Consolidated Balance Sheet table in an SEC filing (</em><a href="https://www.sec.gov/Archives/edgar/data/1652044/000165204423000045/goog-20230331.htm#i0c9da2ad630a471ab2611da204d142c8_19"><em>Source</em></a><em>)</em></figcaption></figure><h3>Results</h3><p>The solution was evaluated on a test set of 20 documents, with the following results:</p><ul><li><strong>95%+</strong> accuracy for the Custom Document Splitter in identifying the different sections of the forms</li><li><strong>90%+</strong> accuracy for field extraction from tabular data using the Custom Document Extractor</li><li><strong>99%+</strong> accuracy for field extraction from textual data using Generative AI</li></ul><p>We also tried the solution on a recent Alphabet Inc. filing, the Form 10-Q for the quarter ended March 31, 2023, which is publicly available <a href="https://www.sec.gov/Archives/edgar/data/1652044/000165204423000045/goog-20230331.htm">here</a>. 
Below is a snapshot of the 50-page document.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YhZdo6JY_JR1u9CG" /><figcaption><a href="https://www.sec.gov/Archives/edgar/data/1652044/000165204423000045/goog-20230331.htm#i0c9da2ad630a471ab2611da204d142c8_19">Source</a></figcaption></figure><p>Here is the output the solution produced by directly ingesting the PDF published by Alphabet.</p><pre>{'company_address': '1600 Amphitheatre Parkway Mountain View, CA 94043',<br>'company_name': 'Alphabet Inc.',<br>'company_phone': '(650) 253-0000',<br>'fiscal_year': 'March 31, 2023',<br>'form_type': '10-Q',<br>'chief_financial_officer': 'Ruth M. Porat',<br>'current_assets': {'previous': '164,795', 'current': '161,985', 'description': 'Total current assets'},<br>'current_liabilities': {'previous': '69,300', 'current': '68,854', 'description': 'Total current liabilities'},<br>'net_income': {'previous': '16,436', 'current': '15,051', 'description': 'Net income'},<br>'total_net_sales': {'previous': '68,011', 'current': '69,787', 'description': 'Revenues'}}</pre><h3>Conclusion</h3><p>Combining Document AI and Generative AI offers a powerful way to automate SEC Form 10-K parsing. 
With machine learning and natural language processing, investors, analysts, and other stakeholders can extract structured data with high accuracy, gain contextual understanding, and surface insights that are crucial for making informed decisions.</p><p>Learn more about the products used in the solution from the links below:</p><ul><li><a href="https://cloud.google.com/document-ai?hl=en">Document AI</a></li><li><a href="https://cloud.google.com/vertex-ai?hl=en">Vertex AI</a></li><li><a href="https://cloud.google.com/storage?hl=en">Cloud Storage</a></li></ul><hr><p><a href="https://medium.com/google-cloud/automating-data-extraction-from-sec-10-k-forms-using-document-ai-and-generative-ai-6b2a086d6167">Automating data extraction from SEC 10-K forms using Document AI and Generative AI</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>
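<p>The extracted output shown in the Results section stores each amount as a comma-formatted string with a previous and a current value. As a minimal sketch of one possible post-processing step before loading into a table store such as BigQuery, the Python below converts those strings to integers and flattens the nested fields into one row per metric. The helper names <code>parse_amount</code> and <code>to_rows</code>, the row layout, and the handling of parenthesized amounts as negatives are illustrative assumptions, not part of the original solution.</p>

```python
from typing import Any, Dict, List

# Sample values copied from the extracted output shown in the article.
extracted: Dict[str, Any] = {
    "company_name": "Alphabet Inc.",
    "fiscal_year": "March 31, 2023",
    "form_type": "10-Q",
    "current_assets": {"previous": "164,795", "current": "161,985",
                       "description": "Total current assets"},
    "current_liabilities": {"previous": "69,300", "current": "68,854",
                            "description": "Total current liabilities"},
    "net_income": {"previous": "16,436", "current": "15,051",
                   "description": "Net income"},
    "total_net_sales": {"previous": "68,011", "current": "69,787",
                        "description": "Revenues"},
}


def parse_amount(text: str) -> int:
    """Convert a comma-formatted amount such as '164,795' to an int.

    Assumption: an amount wrapped in parentheses, e.g. '(1,234)',
    is treated as negative, a common accounting convention.
    """
    cleaned = text.strip().replace(",", "")
    if cleaned.startswith("(") and cleaned.endswith(")"):
        return -int(cleaned[1:-1])
    return int(cleaned)


def to_rows(record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten every nested metric into one flat row (hypothetical layout)."""
    rows = []
    for field, value in record.items():
        if not isinstance(value, dict):
            continue  # skip scalar fields such as company_name
        previous = parse_amount(value["previous"])
        current = parse_amount(value["current"])
        rows.append({
            "company_name": record["company_name"],
            "form_type": record["form_type"],
            "field": field,
            "description": value["description"],
            "previous": previous,
            "current": current,
            "change": current - previous,
        })
    return rows


rows = to_rows(extracted)
for row in rows:
    print(row["field"], row["previous"], row["current"], row["change"])
```

<p>Keeping one row per metric, rather than one wide row per filing, means new extracted fields can be added without changing the table schema. That is a design choice for this sketch and may not match how the original solution stores its data.</p>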
Author
Link
Published date
Image url
Feed url
Guid
Hidden blurb
--- !ruby/object:Feedjira::Parser::RSSEntry title: Automating data extraction from SEC 10-K forms using Document AI and Generative AI published: 2024-04-18 06:15:18.000000000 Z categories: - genai - google-cloud-platform - automation - document-ai - machine-learning url: https://medium.com/google-cloud/automating-data-extraction-from-sec-10-k-forms-using-document-ai-and-generative-ai-6b2a086d6167?source=rss----e52cf94d98af---4 entry_id: !ruby/object:Feedjira::Parser::GloballyUniqueIdentifier is_perma_link: 'false' guid: https://medium.com/p/6b2a086d6167 carlessian_info: news_filer_version: 2 newspaper: Google Cloud - Medium macro_region: Blogs rss_fields: - title - published - categories - url - entry_id - content - author author: Harish Verma
Language
Active
Ricc internal notes
Imported via /Users/ricc/git/gemini-news-crawler/webapp/db/seeds.d/import-feedjira.rb on 2024-04-18 09:13:14 +0200. Content is EMPTY here. Entried: title,published,categories,url,entry_id,content,author. TODO add Newspaper: filename = /Users/ricc/git/gemini-news-crawler/webapp/db/seeds.d/../../../crawler/out/feedjira/Blogs/Google Cloud - Medium/2024-04-18-Automating_data_extraction_from_SEC_10-K_forms_using_Document_AI-v2.yaml
Ricc source