Improve productivity when processing scanned PDFs using Amazon Q Business | Amazon Web Services – AWS Blog
Amazon Q Businessis a generative AI-powered assistant that can answer questions, provide summaries, generate content, and extract insights directly from the content in digital as well as scanned PDF documents in your enterprise data sources without needing to extract the text first.
Customers across industries such as finance, insurance, healthcare life sciences, and more need to derive insights from various document types, such as receipts, healthcare plans, or tax statements, which are frequently in scanned PDF format. These document types often have a semi-structured or unstructured format, which requires processing to extract text before indexing with Amazon Q Business.
The launch of scanned PDF document support with Amazon Q Business can help you seamlessly process a variety of multi-modal document types through the AWS Management Console and APIs, across all supported Amazon Q Business AWS Regions. You can ingest documents, including scanned PDFs, from your data sources using supported connectors, index them, and then use the documents to answer questions, provide summaries, and generate content securely and accurately from your enterprise systems. This feature eliminates the development effort required to extract text from scanned PDF documents outside of Amazon Q Business, and improves the document processing pipeline for building your generative artificial intelligence (AI) assistant with Amazon Q Business.
In this post, we show how to asynchronously index and run real-time queries with scanned PDF documents using Amazon Q Business.
You can use Amazon Q Business for scanned PDF documents from the console, AWS SDKs, or AWS Command Line Interface (AWS CLI).
Amazon Q Business provides a versatile suite of data connectors that can integrate with a wide range of enterprise data sources, empowering you to develop generative AI solutions with minimal setup and configuration. To learn more, visit Amazon Q Business, now generally available, helps boost workforce productivity with generative AI.
After your Amazon Q Business application is ready to use, you can directly upload the scanned PDFs into an Amazon Q Business index using either the console or the APIs. Amazon Q Business offers multiple data source connectors that can integrate and synchronize data from multiple data repositories into single index. For this post, we demonstrate two scenarios to use documents: one with the direct document upload option, and another using the Amazon Simple Storage Service (Amazon S3) connector. If you need to ingest documents from other data sources, refer to Supported connectors for details on connecting additional data sources.
In this post, we use three scanned PDF documents as examples: an invoice, a health plan summary, and an employment verification form, along with some text documents.
The first step is to index these documents. Complete the following steps to index documents using the direct upload feature of Amazon Q Business. For this example, we upload the scanned PDFs.
You can monitor the uploaded files on the Data sources tab. The Upload status changes from Received to Processing to Indexed or Updated, as which point the file has been successfully indexed into the Amazon Q Business data store. The following screenshot shows the successfully indexed PDFs.
The following steps demonstrate how to integrate and synchronize documents using an Amazon S3 connector with Amazon Q Business. For this example, we index the text documents.
When the sync job is complete, your data source is ready to use. The following screenshot shows all five documents (scanned and digital PDFs, and text files) are successfully indexed.
The following screenshot shows a comprehensive view of the two data sources: the directly uploaded documents and the documents ingested through the Amazon S3 connector.
Now lets run some queries with Amazon Q Business on our data sources.
Your documents might be dense, unstructured, scanned PDF document types. Amazon Q Business can identify and extract the most salient information-dense text from it. In this example, we use the multi-page health plan summary PDF we indexed earlier. The following screenshot shows an example page.
This is an example of a health plan summary document.
In the Amazon Q Business web UI, we ask What is the annual total out-of-pocket maximum, mentioned in the health plan summary?
Amazon Q Business searches the indexed document, retrieves the relevant information, and generates an answer while citing the source for its information. The following screenshot shows the sample output.
Documents might also contain structured data elements in tabular format. Amazon Q Business can automatically identify, extract, and linearize structured data from scanned PDFs to accurately resolve any user queries. In the following example, we use the invoice PDF we indexed earlier. The following screenshot shows an example.
This is an example of an invoice.
In the Amazon Q Business web UI, we ask How much were the headphones charged in the invoice?
Amazon Q Business searches the indexed document and retrieves the answer with reference to the source document. The following screenshot shows that Amazon Q Business is able to extract bill information from the invoice.
Your documents might also contain semi-structured data elements in a form, such as key-value pairs. Amazon Q Business can accurately satisfy queries related to these data elements by extracting specific fields or attributes that are meaningful for the queries. In this example, we use the employment verification PDF. The following screenshot shows an example.
This is an example of an employment verification form.
In the Amazon Q Business web UI, we ask What is the applicants date of employment in the employment verification form? Amazon Q Business searches the indexed employment verification document and retrieves the answer with reference to the source document.
In this section, we show you how to use the AWS CLI to ingest structured and unstructured documents stored in an S3 bucket into an Amazon Q Business index. You can quickly retrieve detailed information about your documents, including their statuses and any errors occurred during indexing. If youre an existing Amazon Q Business user and have indexed documents in various formats, such as scanned PDFs and other supported types, and you now want to reindex the scanned documents, complete the following steps:
"errorMessage": "Document cannot be indexed since it contains no text to index and search on. Document must contain some text."
If youre a new user and havent indexed any documents, you can skip this step.
The following is an example of using the ListDocuments API to filter documents with a specific status and their error messages:
The following screenshot shows the AWS CLI output with a list of failed documents with error messages.
Now you batch-process the documents. Amazon Q Business supports adding one or more documents to an Amazon Q Business index.
The following screenshot shows the AWS CLI output. You should see failed documents as an empty list.
The following screenshot shows that the documents are indexed in the data source.
If you created a new Amazon Q Business application and dont plan to use it further, unsubscribe and remove assigned users from the application and delete it so that your AWS account doesnt accumulate costs. Moreover, if you dont need to use the indexed data sources further, refer to Managing Amazon Q Business data sources for instructions to delete your indexed data sources.
This post demonstrated the support for scanned PDF document types with Amazon Q Business. We highlighted the steps to sync, index, and query supported document typesnow including scanned PDF documentsusing generative AI with Amazon Q Business. We also showed examples of queries on structured, unstructured, or semi-structured multi-modal scanned documents using the Amazon Q Business web UI and AWS CLI.
To learn more about this feature, refer toSupported document formats in Amazon Q Business. Give it a try on theAmazon Q Business consoletoday! For more information, visitAmazon Q Businessand theAmazon Q Business User Guide. You can send feedback toAWS re:Post for Amazon Qor through your usual AWS support contacts.
Sonali Sahu is leading the Generative AI Specialist Solutions Architecture team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.
Chinmayee Rane is a Generative AI Specialist Solutions Architect at AWS. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing and generative AI solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.
Himesh Kumar is a seasoned Senior Software Engineer, currently working at Amazon Q Business in AWS. He is passionate about building distributed systems in the generative AI/ML space. His expertise extends to develop scalable and efficient systems, ensuring high availability, performance, and reliability. Beyond the technical skills, he is dedicated to continuous learning and staying at the forefront of technological advancements in AI and machine learning.
Qing Wei is a Senior Software Developer for Amazon Q Business team in AWS, and passionate about building modern applications using AWS technologies. He loves community-driven learning and sharing of technology especially for machine learning hosting and inference related topics. His main focus right now is on building serverless and event-driven architectures for RAG data ingestion.
- Machine Learning: Your Ticket to a Thriving Career in the Tech World - The Impressive Times - July 14th, 2025 [July 14th, 2025]
- Integrative analysis of multi-omics data and gut microbiota composition reveals prognostic subtypes and predicts immunotherapy response in colorectal... - July 14th, 2025 [July 14th, 2025]
- Comprehensive multi-omics and machine learning framework for glioma subtyping and precision therapeutics - Nature - July 14th, 2025 [July 14th, 2025]
- Development and validation of a machine learning-based nomogram for survival prediction of patients with hilar cholangiocarcinoma after... - July 12th, 2025 [July 12th, 2025]
- Geochemical-integrated machine learning approach predicts the distribution of cadmium speciation in European and Chinese topsoils - Nature - July 12th, 2025 [July 12th, 2025]
- Machine learning-based construction of a programmed cell death-related model reveals prognosis and immune infiltration in pancreatic adenocarcinoma... - July 12th, 2025 [July 12th, 2025]
- Application of supervised machine learning and unsupervised data compression models for pore pressure prediction employing drilling, petrophysical,... - July 12th, 2025 [July 12th, 2025]
- Machine learning identifies lipid-associated genes and constructs diagnostic and prognostic models for idiopathic pulmonary fibrosis - Orphanet... - July 12th, 2025 [July 12th, 2025]
- An evaluation methodology for machine learning-based tandem mass spectra similarity prediction - BMC Bioinformatics - July 12th, 2025 [July 12th, 2025]
- The Rise of AI in Trading: Machine Learning and the Stock Market - Disruption Banking - July 12th, 2025 [July 12th, 2025]
- Integrative analysis identifies IL-6/JUN/MMP-9 pathway destroyed blood-brain-barrier in autism mice via machine learning and bioinformatic analysis -... - July 12th, 2025 [July 12th, 2025]
- Interpretive prediction of hyperuricemia and gout patients via machine learning analysis of human gut microbiome - BMC Microbiology - July 10th, 2025 [July 10th, 2025]
- Machine learning-based identification of key factors and spatial heterogeneity analysis of urban flooding: a case study of the central urban area of... - July 10th, 2025 [July 10th, 2025]
- Developing machine learning frameworks to predict mechanical properties of ultra-high performance concrete mixed with various industrial byproducts -... - July 10th, 2025 [July 10th, 2025]
- Small Drones Market Trend Analysis and Forecast Report 2025-2034 | AI and Machine Learning Revolutionizing Autonomous Operations, Trade Tariffs Push... - July 10th, 2025 [July 10th, 2025]
- When a model touches millions: Hatim Kagalwala on accuracy accountability, and applied machine learning - Dataconomy - July 10th, 2025 [July 10th, 2025]
- New Study Uses Gait Data and Machine Learning for Early Detection of Anxiety and Depression - AZoSensors - July 10th, 2025 [July 10th, 2025]
- Machine Learning and the Evolution of Mobile Apps - CIO Applications - July 10th, 2025 [July 10th, 2025]
- Artificial Intelligence, Machine Learning, and Big Data in Thailand: Legal and Regulatory Developments 2025 - Lexology - July 10th, 2025 [July 10th, 2025]
- Karen Hao on how the AI boom became a new imperial frontier - Machine Learning Week 2025 - July 8th, 2025 [July 8th, 2025]
- Machine Learning and AI in Enhancing Image Analysis of 3D Samples - Drug Target Review - July 8th, 2025 [July 8th, 2025]
- Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027 - Machine Learning Week 2025 - July 8th, 2025 [July 8th, 2025]
- Explainable machine learning model for predicting the transarterial chemoembolization response and subtypes of hepatocellular carcinoma patients - BMC... - July 8th, 2025 [July 8th, 2025]
- Identification and validation of glucocorticoid receptor and programmed cell death-related genes in spinal cord injury using machine learning - Nature - July 8th, 2025 [July 8th, 2025]
- Multiclass leukemia cell classification using hybrid deep learning and machine learning with CNN-based feature extraction - Nature - July 6th, 2025 [July 6th, 2025]
- Predictive modeling and machine learning show poor performance of clinical, morphological, and hemodynamic parameters for small intracranial aneurysm... - July 6th, 2025 [July 6th, 2025]
- A robust machine learning approach to predicting remission and stratifying risk in rheumatoid arthritis patients treated with bDMARDs - Nature - July 6th, 2025 [July 6th, 2025]
- Ultrabroadband and band-selective thermal meta-emitters by machine learning - Nature - July 4th, 2025 [July 4th, 2025]
- Machine Learning is Surprisingly Good at Simulating the Universe - Universe Today - July 4th, 2025 [July 4th, 2025]
- Machine learning-assisted multi-dimensional transcriptomic analysis of cytoskeleton-related molecules and their relationship with prognosis in... - July 4th, 2025 [July 4th, 2025]
- Machine learning combined with multi-omics to identify immune-related LncRNA signature as biomarkers for predicting breast cancer prognosis - Nature - July 4th, 2025 [July 4th, 2025]
- Comprehensive machine learning analysis of PANoptosis signatures in multiple myeloma identifies prognostic and immunotherapy biomarkers - Nature - July 4th, 2025 [July 4th, 2025]
- Enhancing game outcome prediction in the Chinese basketball league through a machine learning framework based on performance data - Nature - July 4th, 2025 [July 4th, 2025]
- A novel double machine learning approach for detecting early breast cancer using advanced feature selection and dimensionality reduction techniques -... - July 4th, 2025 [July 4th, 2025]
- Machine learning for Parkinsons disease: a comprehensive review of datasets, algorithms, and challenges - Nature - July 4th, 2025 [July 4th, 2025]
- Cervical cancer prediction using machine learning models based on routine blood analysis - Nature - July 4th, 2025 [July 4th, 2025]
- Enhancing anomaly detection in IoT-driven factories using Logistic Boosting, Random Forest, and SVM: A comparative machine learning approach - Nature - July 4th, 2025 [July 4th, 2025]
- Predicting car accident severity in Northwest Ethiopia: a machine learning approach leveraging driver, environmental, and road conditions - Nature - July 4th, 2025 [July 4th, 2025]
- Sensormatic Solutions Adds Machine Learning to Shrink Analyzer - Ink World magazine - July 4th, 2025 [July 4th, 2025]
- Exploring the link between the ZJU index and sarcopenia in adults aged 2059 using NHANES and machine learning - Nature - July 4th, 2025 [July 4th, 2025]
- Combining multi-parametric MRI radiomics features with tumor abnormal protein to construct a machine learning-based predictive model for prostate... - July 2nd, 2025 [July 2nd, 2025]
- New insight into viscosity prediction of imidazolium-based ionic liquids and their mixtures with machine learning models - Nature - July 2nd, 2025 [July 2nd, 2025]
- Implementing partial least squares and machine learning regressive models for prediction of drug release in targeted drug delivery application -... - July 2nd, 2025 [July 2nd, 2025]
- Advanced analysis of defect clusters in nuclear reactors using machine learning techniques - Nature - July 2nd, 2025 [July 2nd, 2025]
- Machine learning analysis of kinematic movement features during functional tasks to discriminate chronic neck pain patients from asymptomatic controls... - July 2nd, 2025 [July 2nd, 2025]
- Enhanced machine learning models for predicting three-year mortality in Non-STEMI patients aged 75 and above - BMC Geriatrics - July 2nd, 2025 [July 2nd, 2025]
- Modeling seawater intrusion along the Alabama coastline using physical and machine learning models to evaluate the effects of multiscale natural and... - July 2nd, 2025 [July 2nd, 2025]
- A comprehensive study based on machine learning models for early identification Mycoplasma pneumoniae infection in segmental/lobar pneumonia - Nature - July 2nd, 2025 [July 2nd, 2025]
- Identifying ovarian cancer with machine learning DNA methylation pattern analysis - Nature - July 2nd, 2025 [July 2nd, 2025]
- High-isolation dual-band MIMO antenna for next-generation 5G wireless networks at 28/38 GHz with machine learning-based gain prediction - Nature - July 2nd, 2025 [July 2nd, 2025]
- Sony and AMD want to focus on machine learning for the PS6 - Instant Gaming News - July 2nd, 2025 [July 2nd, 2025]
- How Machine Learning is Reshaping the Future of Sports Betting? - London Daily News - July 2nd, 2025 [July 2nd, 2025]
- An interpretable machine learning model for predicting depression in middle-aged and elderly cancer patients in China: a study based on the CHARLS... - July 2nd, 2025 [July 2nd, 2025]
- These Eight Projects Showcase the Power of Machine Learning on the Edge - Hackster.io - June 29th, 2025 [June 29th, 2025]
- Build Custom AI Tools for Your AI Agents that Combine Machine Learning and Statistical Analysis - MarkTechPost - June 29th, 2025 [June 29th, 2025]
- Check out these essential tips and trends for SEO in 2025 as AI and machine learning loom large - EdTech Innovation Hub - June 29th, 2025 [June 29th, 2025]
- Using machine learning to predict the severity of salmonella infection - Open Access Government - June 28th, 2025 [June 28th, 2025]
- How AI and machine learning are transforming drug discovery - Pharmaceutical Technology - June 28th, 2025 [June 28th, 2025]
- Capturing the complexity of human strategic decision-making with machine learning - Nature - June 26th, 2025 [June 26th, 2025]
- A framework to evaluate machine learning crystal stability predictions - Nature - June 24th, 2025 [June 24th, 2025]
- Machine learning revealed giant thermal conductivity reduction by strong phonon localization in two-angle disordered twisted multilayer graphene -... - June 24th, 2025 [June 24th, 2025]
- How AI and Machine Learning Are Powering the Next Generation of Pump Maintenance - Robotics Tomorrow - June 24th, 2025 [June 24th, 2025]
- Actuate Therapeutics Reports Positive Biomarker and Machine Learning Data from Phase 2 Elraglusib Trial in First-Line Treatment of Metastatic... - June 24th, 2025 [June 24th, 2025]
- Texas A&M Researchers Introduce a Two-Phase Machine Learning Method Named ShockCast for High-Speed Flow Simulation with Neural Temporal Re-Meshing -... - June 22nd, 2025 [June 22nd, 2025]
- Machine learning method helps bring diagnostic testing out of the lab - Medical Xpress - June 22nd, 2025 [June 22nd, 2025]
- Sebi proposes five-point rulebook for responsible use of AI, machine learning - The New Indian Express - June 22nd, 2025 [June 22nd, 2025]
- HAPIR: a refined Hallmark gene set-based machine learning approach for predicting immunotherapy response in cancer patients - Nature - June 20th, 2025 [June 20th, 2025]
- Machine learning boosts accuracy of point-of-care disease detection - News-Medical - June 20th, 2025 [June 20th, 2025]
- How AI and Machine Learning Are Transforming Food Poisoning Outbreak Detection - Food Poisoning News - June 20th, 2025 [June 20th, 2025]
- Evo 2 machine learning model enlists the power of AI in the fight against diseases - Medical Xpress - June 20th, 2025 [June 20th, 2025]
- Machine learning can predict which babies will be born with low birth weights - Medical Xpress - June 20th, 2025 [June 20th, 2025]
- Development and Validation of a Machine Learning Model for Identifying Novel HIV Integrase Inhibitors - Cureus - June 20th, 2025 [June 20th, 2025]
- IIT launches new online certificate programme in data science and machine learning for working profession - Times of India - June 20th, 2025 [June 20th, 2025]
- Calgary startup tackles referee abuse with microphones and machine learning - Yahoo - June 20th, 2025 [June 20th, 2025]
- New machine learning program accurately predicts who will stick with their exercise program - AOL.com - June 20th, 2025 [June 20th, 2025]
- Machine learning and generative AI: What are they good for in 2025? - MIT Sloan - June 4th, 2025 [June 4th, 2025]
- Researchers use machine learning to improve gene therapy - Stanford Report - June 4th, 2025 [June 4th, 2025]
- Machine learning for workpiece mass prediction using real and synthetic acoustic data - Nature - June 4th, 2025 [June 4th, 2025]
- Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Input Representations Matter - Apple Machine Learning Research - June 4th, 2025 [June 4th, 2025]
- Machine learning models for predicting severe acute kidney injury in patients with sepsis-induced myocardial injury - Nature - June 4th, 2025 [June 4th, 2025]