"title"=>"Automating data extraction from SEC 10-K forms using Document AI and Generative AI",
"summary"=>nil,
"content"=>"
SEC 10-K forms are comprehensive financial reports that public companies file with the U.S. Securities and Exchange Commission (SEC) to disclose their financial performance. These forms can be very large, ranging from 50 to over 200 pages, so extracting data from them is time-consuming and challenging because of their size and complex format.
In this blog post, we will show you how to use Google Cloud’s Document AI and Generative AI to parse SEC 10-K forms and extract key information. This solution can save you time and effort, and it can help you to make more informed investment decisions quickly.
Solution Architecture
The solution architecture for the SEC 10-K form parser using Document AI and Generative AI is shown below. The solution consumes a PDF document and extracts predefined fields.
The solution consists of the following components:
- Document AI Custom Document Splitter (CDS): Splits an SEC 10-K form into individual sections.
- Document AI Custom Document Extractor (CDE): Extracts key information present in tabular form from different sections of the SEC 10-K form.
- Generative AI: Extracts text-based information from the SEC 10-K form.
- BigQuery: Stores the extracted data.
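As a rough illustration of how these four components could fit together, the sketch below wires them up as plain Python callables. Every name in it (`Section`, `run_pipeline`, and the `split`, `extract_tables`, `extract_text`, and `store` parameters) is hypothetical and only stands in for the real Document AI, Generative AI, and BigQuery calls, which this post does not show. Routing table sections to the extractor and all other sections to the language model is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Section:
    # One section produced by the splitter (hypothetical shape).
    label: str
    pages: List[int]
    is_table: bool


def run_pipeline(
    pdf_bytes: bytes,
    split: Callable[[bytes], List[Section]],
    extract_tables: Callable[[bytes, Section], Dict[str, object]],
    extract_text: Callable[[bytes, Section], Dict[str, object]],
    store: Callable[[Dict[str, object]], None],
) -> Dict[str, object]:
    # 1. Splitter: break the filing into labeled sections.
    sections = split(pdf_bytes)
    record: Dict[str, object] = {}
    for section in sections:
        # 2. Assumed routing: tables go to the extractor, the rest to the LLM.
        if section.is_table:
            record.update(extract_tables(pdf_bytes, section))
        else:
            record.update(extract_text(pdf_bytes, section))
    # 3. Persist the merged record (BigQuery in the solution described here).
    store(record)
    return record
```

Passing the stages in as callables keeps the sketch independent of any client library, so each stage can be replaced with a stub while testing.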
Data and Model Training
The solution was trained on a dataset of SEC 10-K forms. The forms are available in the Kaggle dataset SEC Edgar Annual Financial Filings - 2021.
For Generative AI, fields such as company name, address, and fiscal year-end date are extracted by providing the relevant content to the text-bison model.
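The exact prompt is not given in this post, so the following is only a minimal sketch of how relevant content might be packaged for the model and how its reply might be parsed. The helper names `build_prompt` and `parse_reply`, the `FIELDS` list, and the choice to request JSON are all assumptions; the call to the model itself is deliberately left out.

```python
import json
from typing import Dict, List

# Hypothetical field list, based on the fields named in this post.
FIELDS = ['company_name', 'company_address', 'fiscal_year']


def build_prompt(context: str, fields: List[str]) -> str:
    # Ask for a JSON object so the reply is easy to parse.
    field_list = ', '.join(fields)
    return (
        'Extract the following fields from the SEC filing text below '
        'and reply with a single JSON object using exactly these keys: '
        + field_list + '. Use null for any field that is not present.\n\n'
        'Text:\n' + context
    )


def parse_reply(reply: str, fields: List[str]) -> Dict[str, object]:
    # Models may wrap JSON in extra prose, so keep only the outermost braces.
    start, end = reply.find('{'), reply.rfind('}')
    if start == -1 or end < start:
        raise ValueError('no JSON object found in model reply')
    data = json.loads(reply[start:end + 1])
    # Keep only the requested keys; missing ones become None.
    return {name: data.get(name) for name in fields}
```

In use, the string returned by `build_prompt` would be sent to the model and the model's text reply passed to `parse_reply`.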
For the Custom Document Splitter, we divided each document into sections such as Introduction and Signature, and identified important tables such as the Consolidated Balance Sheet and Statement of Operations. We labeled and trained on more than 50 documents.
For the Custom Document Extractor, the documents were labeled to identify the relevant fields. Examples of labels from the Consolidated Balance Sheet and Statement of Operations tables are total current assets and liabilities, total net sales, and operating expenses, each mapped to its year. We labeled and trained on more than 50 documents.
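The year-wise mapping can be pictured with a small sketch. It assumes, hypothetically, that extracted values arrive as `(field, year, value)` tuples, and it groups them so that the most recent year becomes `current` and the year before it becomes `previous`. The `Entity` alias and `map_by_year` function are illustrative names, not part of the solution or of Document AI.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical entity shape: (field name, fiscal year, value as printed).
Entity = Tuple[str, int, str]


def map_by_year(entities: Iterable[Entity]) -> Dict[str, Dict[str, str]]:
    # Group the printed values by field, then by year.
    by_field: Dict[str, Dict[int, str]] = {}
    for field, year, value in entities:
        by_field.setdefault(field, {})[year] = value

    result: Dict[str, Dict[str, str]] = {}
    for field, values in by_field.items():
        years = sorted(values)
        # The latest year is 'current'; the one before it is 'previous'.
        entry = {'current': values[years[-1]]}
        if len(years) > 1:
            entry['previous'] = values[years[-2]]
        result[field] = entry
    return result
```

A field that appears for only one year gets a `current` value and no `previous` key, which keeps missing data visible instead of silently filled.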
Below is a sample page containing a Consolidated Balance Sheet table from an SEC 10-K form.
Results
The solution was evaluated on a test set of 20 documents, with the following results:
- 95%+ accuracy for the Custom Document Splitter in identifying the different sections of the forms
- 90%+ accuracy for field extraction from tabular data using the Custom Document Extractor
- 99%+ accuracy for field extraction from textual data using Generative AI
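This post does not state how accuracy was computed. One common choice for field extraction is exact-match accuracy over all labeled fields, sketched below. The function name `field_accuracy` and the whitespace-trimming rule are assumptions, not the authors' evaluation code.

```python
from typing import Dict, List


def field_accuracy(
    predictions: List[Dict[str, str]],
    labels: List[Dict[str, str]],
) -> float:
    # Fraction of labeled fields whose predicted value matches exactly.
    correct = 0
    total = 0
    for predicted, expected in zip(predictions, labels):
        for name, value in expected.items():
            total += 1
            # Compare after trimming whitespace; a missing field counts as wrong.
            if predicted.get(name, '').strip() == value.strip():
                correct += 1
    return correct / total if total else 0.0
```

Counting missing fields as errors makes the score penalize omissions as well as wrong values.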
We tried the solution on the latest SEC filing by Alphabet Inc., which is publicly available here. Below is a snapshot of the 50-page document.
Here is the output the solution produced by directly ingesting the PDF published by Alphabet.
{'company_address': '1600 Amphitheatre Parkway Mountain View, CA 94043',
 'company_name': 'Alphabet Inc.',
 'company_phone': '(650) 253-0000',
 'fiscal_year': 'March 31, 2023',
 'form_type': '10-Q',
 'chief_financial_officer': 'Ruth M. Porat',
 'current_assets': {'previous': '164,795', 'current': '161,985', 'description': 'Total current assets'},
 'current_liabilities': {'previous': '69,300', 'current': '68,854', 'description': 'Total current liabilities'},
 'net_income': {'previous': '16,436', 'current': '15,051', 'description': 'Net income'},
 'total_net_sales': {'previous': '68,011', 'current': '69,787', 'description': 'Revenues'}}
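Before a record like this is stored, the printed amounts usually need to become numbers. The sketch below shows one possible normalization step: it converts strings such as '164,795' to integers and flattens the nested previous/current values into flat columns. The BigQuery schema is not described in this post, so the flat column names and the `to_number` and `flatten` helpers are assumptions.

```python
from typing import Dict


def to_number(text: str) -> int:
    # '164,795' -> 164795; parentheses, common in financial tables, mean negative.
    cleaned = text.replace(',', '').replace('$', '').strip()
    if cleaned.startswith('(') and cleaned.endswith(')'):
        return -int(cleaned[1:-1])
    return int(cleaned)


def flatten(record: Dict[str, object]) -> Dict[str, object]:
    # Turn nested {'previous': ..., 'current': ...} values into flat columns.
    row: Dict[str, object] = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for period in ('previous', 'current'):
                if period in value:
                    row[key + '_' + period] = to_number(value[period])
        else:
            row[key] = value
    return row
```

With this sketch, the `net_income` entry above would become two integer columns, `net_income_previous` and `net_income_current`, while text fields such as `company_name` pass through unchanged.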
Conclusion
The integration of Document AI and Generative AI offers a powerful solution for automating and enhancing SEC Form 10-K parsing. By leveraging machine learning and natural language processing capabilities, investors, analysts, and stakeholders can extract structured data with high accuracy, gain contextual understanding, and unlock data insights that are crucial for making informed decisions.
Learn more about the products used in the solution from links below:
Automating data extraction from SEC 10-K forms using Document AI and Generative AI was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.
","author"=>"Harish Verma",
"link"=>"https://medium.com/google-cloud/automating-data-extraction-from-sec-10-k-forms-using-document-ai-and-generative-ai-6b2a086d6167?source=rss----e52cf94d98af---4",
"published_date"=>Thu, 18 Apr 2024 06:15:18.000000000 UTC +00:00,
"image_url"=>nil,
"feed_url"=>"https://medium.com/google-cloud/automating-data-extraction-from-sec-10-k-forms-using-document-ai-and-generative-ai-6b2a086d6167?source=rss----e52cf94d98af---4",
"language"=>nil,
"active"=>true,
"ricc_source"=>"feedjira::v1",
"created_at"=>Thu, 18 Apr 2024 07:13:15.785029000 UTC +00:00,
"updated_at"=>Mon, 21 Oct 2024 20:24:33.477575000 UTC +00:00,
"newspaper"=>"Google Cloud - Medium",
"macro_region"=>"Blogs"}