Improve productivity when processing scanned PDFs using Amazon Q Business | Amazon Web Services – AWS Blog
Amazon Q Businessis a generative AI-powered assistant that can answer questions, provide summaries, generate content, and extract insights directly from the content in digital as well as scanned PDF documents in your enterprise data sources without needing to extract the text first.
Customers across industries such as finance, insurance, healthcare life sciences, and more need to derive insights from various document types, such as receipts, healthcare plans, or tax statements, which are frequently in scanned PDF format. These document types often have a semi-structured or unstructured format, which requires processing to extract text before indexing with Amazon Q Business.
The launch of scanned PDF document support with Amazon Q Business can help you seamlessly process a variety of multi-modal document types through the AWS Management Console and APIs, across all supported Amazon Q Business AWS Regions. You can ingest documents, including scanned PDFs, from your data sources using supported connectors, index them, and then use the documents to answer questions, provide summaries, and generate content securely and accurately from your enterprise systems. This feature eliminates the development effort required to extract text from scanned PDF documents outside of Amazon Q Business, and improves the document processing pipeline for building your generative artificial intelligence (AI) assistant with Amazon Q Business.
In this post, we show how to asynchronously index and run real-time queries with scanned PDF documents using Amazon Q Business.
You can use Amazon Q Business for scanned PDF documents from the console, AWS SDKs, or AWS Command Line Interface (AWS CLI).
Amazon Q Business provides a versatile suite of data connectors that can integrate with a wide range of enterprise data sources, empowering you to develop generative AI solutions with minimal setup and configuration. To learn more, visit Amazon Q Business, now generally available, helps boost workforce productivity with generative AI.
After your Amazon Q Business application is ready to use, you can directly upload the scanned PDFs into an Amazon Q Business index using either the console or the APIs. Amazon Q Business offers multiple data source connectors that can integrate and synchronize data from multiple data repositories into single index. For this post, we demonstrate two scenarios to use documents: one with the direct document upload option, and another using the Amazon Simple Storage Service (Amazon S3) connector. If you need to ingest documents from other data sources, refer to Supported connectors for details on connecting additional data sources.
In this post, we use three scanned PDF documents as examples: an invoice, a health plan summary, and an employment verification form, along with some text documents.
The first step is to index these documents. Complete the following steps to index documents using the direct upload feature of Amazon Q Business. For this example, we upload the scanned PDFs.
You can monitor the uploaded files on the Data sources tab. The Upload status changes from Received to Processing to Indexed or Updated, as which point the file has been successfully indexed into the Amazon Q Business data store. The following screenshot shows the successfully indexed PDFs.
The following steps demonstrate how to integrate and synchronize documents using an Amazon S3 connector with Amazon Q Business. For this example, we index the text documents.
When the sync job is complete, your data source is ready to use. The following screenshot shows all five documents (scanned and digital PDFs, and text files) are successfully indexed.
The following screenshot shows a comprehensive view of the two data sources: the directly uploaded documents and the documents ingested through the Amazon S3 connector.
Now lets run some queries with Amazon Q Business on our data sources.
Your documents might be dense, unstructured, scanned PDF document types. Amazon Q Business can identify and extract the most salient information-dense text from it. In this example, we use the multi-page health plan summary PDF we indexed earlier. The following screenshot shows an example page.
This is an example of a health plan summary document.
In the Amazon Q Business web UI, we ask What is the annual total out-of-pocket maximum, mentioned in the health plan summary?
Amazon Q Business searches the indexed document, retrieves the relevant information, and generates an answer while citing the source for its information. The following screenshot shows the sample output.
Documents might also contain structured data elements in tabular format. Amazon Q Business can automatically identify, extract, and linearize structured data from scanned PDFs to accurately resolve any user queries. In the following example, we use the invoice PDF we indexed earlier. The following screenshot shows an example.
This is an example of an invoice.
In the Amazon Q Business web UI, we ask How much were the headphones charged in the invoice?
Amazon Q Business searches the indexed document and retrieves the answer with reference to the source document. The following screenshot shows that Amazon Q Business is able to extract bill information from the invoice.
Your documents might also contain semi-structured data elements in a form, such as key-value pairs. Amazon Q Business can accurately satisfy queries related to these data elements by extracting specific fields or attributes that are meaningful for the queries. In this example, we use the employment verification PDF. The following screenshot shows an example.
This is an example of an employment verification form.
In the Amazon Q Business web UI, we ask What is the applicants date of employment in the employment verification form? Amazon Q Business searches the indexed employment verification document and retrieves the answer with reference to the source document.
In this section, we show you how to use the AWS CLI to ingest structured and unstructured documents stored in an S3 bucket into an Amazon Q Business index. You can quickly retrieve detailed information about your documents, including their statuses and any errors occurred during indexing. If youre an existing Amazon Q Business user and have indexed documents in various formats, such as scanned PDFs and other supported types, and you now want to reindex the scanned documents, complete the following steps:
"errorMessage": "Document cannot be indexed since it contains no text to index and search on. Document must contain some text."
If youre a new user and havent indexed any documents, you can skip this step.
The following is an example of using the ListDocuments API to filter documents with a specific status and their error messages:
The following screenshot shows the AWS CLI output with a list of failed documents with error messages.
Now you batch-process the documents. Amazon Q Business supports adding one or more documents to an Amazon Q Business index.
The following screenshot shows the AWS CLI output. You should see failed documents as an empty list.
The following screenshot shows that the documents are indexed in the data source.
If you created a new Amazon Q Business application and dont plan to use it further, unsubscribe and remove assigned users from the application and delete it so that your AWS account doesnt accumulate costs. Moreover, if you dont need to use the indexed data sources further, refer to Managing Amazon Q Business data sources for instructions to delete your indexed data sources.
This post demonstrated the support for scanned PDF document types with Amazon Q Business. We highlighted the steps to sync, index, and query supported document typesnow including scanned PDF documentsusing generative AI with Amazon Q Business. We also showed examples of queries on structured, unstructured, or semi-structured multi-modal scanned documents using the Amazon Q Business web UI and AWS CLI.
To learn more about this feature, refer toSupported document formats in Amazon Q Business. Give it a try on theAmazon Q Business consoletoday! For more information, visitAmazon Q Businessand theAmazon Q Business User Guide. You can send feedback toAWS re:Post for Amazon Qor through your usual AWS support contacts.
Sonali Sahu is leading the Generative AI Specialist Solutions Architecture team in AWS. She is an author, thought leader, and passionate technologist. Her core area of focus is AI and ML, and she frequently speaks at AI and ML conferences and meetups around the world. She has both breadth and depth of experience in technology and the technology industry, with industry expertise in healthcare, the financial sector, and insurance.
Chinmayee Rane is a Generative AI Specialist Solutions Architect at AWS. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing and generative AI solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.
Himesh Kumar is a seasoned Senior Software Engineer, currently working at Amazon Q Business in AWS. He is passionate about building distributed systems in the generative AI/ML space. His expertise extends to develop scalable and efficient systems, ensuring high availability, performance, and reliability. Beyond the technical skills, he is dedicated to continuous learning and staying at the forefront of technological advancements in AI and machine learning.
Qing Wei is a Senior Software Developer for Amazon Q Business team in AWS, and passionate about building modern applications using AWS technologies. He loves community-driven learning and sharing of technology especially for machine learning hosting and inference related topics. His main focus right now is on building serverless and event-driven architectures for RAG data ingestion.
- Snowflake Supercharges Machine Learning for Enterprises with Native Integration of NVIDIA CUDA-X Libraries - Yahoo Finance - November 18th, 2025 [November 18th, 2025]
- An interpretable machine learning model for predicting 5year survival in breast cancer based on integration of proteomics and clinical data -... - November 18th, 2025 [November 18th, 2025]
- scMFF: a machine learning framework with multiple feature fusion strategies for cell type identification - BMC Bioinformatics - November 18th, 2025 [November 18th, 2025]
- URI professor examines how machine learning can help with depression diagnosis Rhody Today - The University of Rhode Island - November 18th, 2025 [November 18th, 2025]
- Predicting drug solubility in supercritical carbon dioxide green solvent using machine learning models based on thermodynamic properties - Nature - November 18th, 2025 [November 18th, 2025]
- Relationship between C-reactive protein triglyceride glucose index and cardiovascular disease risk: a cross-sectional analysis with machine learning -... - November 18th, 2025 [November 18th, 2025]
- Using machine learning to predict student outcomes for early intervention and formative assessment - Nature - November 18th, 2025 [November 18th, 2025]
- Prevalence, associated factors, and machine learning-based prediction of probable depression among individuals with chronic diseases in Bangladesh -... - November 18th, 2025 [November 18th, 2025]
- Snowflake supercharges machine learning for enterprises with native integration of Nvidia CUDA-X libraries - MarketScreener - November 18th, 2025 [November 18th, 2025]
- Unlocking Cardiovascular Disease Insights Through Machine Learning - BIOENGINEER.ORG - November 18th, 2025 [November 18th, 2025]
- Machine learning boosts solar forecasts in diverse climates of India - researchmatters.in - November 18th, 2025 [November 18th, 2025]
- Big Data Machine Learning In Telecom Market by Type and Application Set for 14.8% CAGR Growth Through 2033 - openPR.com - November 18th, 2025 [November 18th, 2025]
- How Humans Could Soon Understand and Talk to Animals, Thanks to Machine Learning - SYFY - November 10th, 2025 [November 10th, 2025]
- Machine learning based analysis of diesel engine performance using FeO nanoadditive in sterculia foetida biodiesel blend - Nature - November 10th, 2025 [November 10th, 2025]
- Machine Learning in Maternal Care - Johns Hopkins Bloomberg School of Public Health - November 10th, 2025 [November 10th, 2025]
- Machine learning-based differentiation of benign and malignant adrenal lesions using 18F-FDG PET/CT: a two-stage classification and SHAP... - November 10th, 2025 [November 10th, 2025]
- How to Better Use AI and Machine Learning in Dermatology, With Renata Block, MMS, PA-C - HCPLive - November 10th, 2025 [November 10th, 2025]
- Avoiding Catastrophe: The Importance of Privacy when Leveraging AI and Machine Learning for Disaster Management - CSIS | Center for Strategic and... - November 10th, 2025 [November 10th, 2025]
- Efferocytosis-related signatures identified via Single-cell analysis and machine learning predict TNBC outcomes and immunotherapy response - Nature - November 10th, 2025 [November 10th, 2025]
- Arc Raiders' use of AI highlights the tension and confusion over where machine learning ends and generative AI begins - PC Gamer - November 3rd, 2025 [November 3rd, 2025]
- From performance to prediction: extracting aging data from the effects of base load aging on washing machines for a machine learning model - Nature - November 3rd, 2025 [November 3rd, 2025]
- Meet 'kvcached': A Machine Learning Library to Enable Virtualized, Elastic KV Cache for LLM Serving on Shared GPUs - MarkTechPost - October 28th, 2025 [October 28th, 2025]
- Bayesian-optimized machine learning boosts actual evapotranspiration prediction in water-stressed agricultural regions of China - Nature - October 28th, 2025 [October 28th, 2025]
- Using machine learning to shed light on how well the triage systems work - News-Medical - October 28th, 2025 [October 28th, 2025]
- Our Last Hope Before The AI Bubble Detonates: Taming LLMs - Machine Learning Week US - October 28th, 2025 [October 28th, 2025]
- Using multiple machine learning algorithms to predict spinal cord injury in patients with cervical spondylosis: a multicenter study - Nature - October 28th, 2025 [October 28th, 2025]
- The diagnostic potential of proteomics and machine learning in Lyme neuroborreliosis - Nature - October 28th, 2025 [October 28th, 2025]
- Using unsupervised machine learning methods to cluster cardio-metabolic profile of the middle-aged and elderly Chinese with general and central... - October 28th, 2025 [October 28th, 2025]
- The prognostic value of POD24 for multiple myeloma: a comprehensive analysis based on traditional statistics and machine learning - BMC Cancer - October 28th, 2025 [October 28th, 2025]
- Reducing inequalities using an unbiased machine learning approach to identify births with the highest risk of preventable neonatal deaths - Population... - October 28th, 2025 [October 28th, 2025]
- Association between SHR and mortality in critically ill patients with CVD: a retrospective analysis and machine learning approach - Diabetology &... - October 28th, 2025 [October 28th, 2025]
- AI-Powered Visual Storytelling: How Machine Learning Transforms Creative Content Production - About Chromebooks - October 28th, 2025 [October 28th, 2025]
- How beauty brand Shiseido nearly tripled revenue per user with machine learning - Performance Marketing World - October 28th, 2025 [October 28th, 2025]
- Magnite introduces machine learning-powered ad podding for streaming platforms - PPC Land - October 26th, 2025 [October 26th, 2025]
- Krafton is an AI first company and will invest 70M USD on machine learning - Female First - October 26th, 2025 [October 26th, 2025]
- Machine learning prediction of bacterial optimal growth temperature from protein domain signatures reveals thermoadaptation mechanisms - BMC Genomics - October 24th, 2025 [October 24th, 2025]
- Data Proportionality and Its Impact on Machine Learning Predictions of Ground Granulated Blast Furnace Slag Concrete Strength | Newswise - Newswise - October 24th, 2025 [October 24th, 2025]
- The Evolution of Machine Learning and Its Applications in Orthopaedics: A Bibliometric Analysis - Cureus - October 24th, 2025 [October 24th, 2025]
- Sentiment Analysis with Machine Learning Achieves 83.48% Accuracy in Predicting Consumer Behavior Trends - Quantum Zeitgeist - October 24th, 2025 [October 24th, 2025]
- Use of machine learning for risk stratification of chest pain patients in the emergency department - BMC Medical Informatics and Decision Making - October 24th, 2025 [October 24th, 2025]
- Mass spectrometry combined with machine learning identifies novel protein signatures as demonstrated with multisystem inflammatory syndrome in... - October 24th, 2025 [October 24th, 2025]
- How Machine Learning Is Shrinking to Fit the Sensor Node - All About Circuits - October 24th, 2025 [October 24th, 2025]
- Machine learning models for mechanical properties prediction of basalt fiber-reinforced concrete incorporating graphical user interface - Nature - October 24th, 2025 [October 24th, 2025]
- Ohio wins national cybersecurity award for fraud solutions using machine learning - Spectrum News NY1 - October 24th, 2025 [October 24th, 2025]
- Itron Partners with Gordian Technologies to Enhance Grid Edge Intelligence with AI and Machine Learning Solutions - Quiver Quantitative - October 24th, 2025 [October 24th, 2025]
- Wearable sensors and machine learning give leg up on better running data - Medical Xpress - October 23rd, 2025 [October 23rd, 2025]
- Geophysical-machine learning tool developed for continuous subsurface geomaterials characterization - Phys.org - October 23rd, 2025 [October 23rd, 2025]
- Ohio wins national cybersecurity award for fraud solutions using machine learning - Spectrum News 1 - October 23rd, 2025 [October 23rd, 2025]
- Machine learning predictions of climate change effects on nearly threatened bird species ( Crithagra xantholaema) habitat in Ethiopia for conservation... - October 23rd, 2025 [October 23rd, 2025]
- A machine learning tool for predicting newly diagnosed osteoporosis in primary healthcare in the Stockholm Region - Nature - October 23rd, 2025 [October 23rd, 2025]
- ECBs New Perspective on Machine Learning in Banking - KPMG - October 23rd, 2025 [October 23rd, 2025]
- Ensemble Machine Learning for Digital Mapping of Soil pH and Electrical Conductivity in the Andean Agroecosystem of Peru - Frontiers - October 21st, 2025 [October 21st, 2025]
- New UA research develops machine learning to address needs of children with autism - AZPM News - October 21st, 2025 [October 21st, 2025]
- NMDSI Speaker Series on Weather Forecasting: What Machine Learning Can and Can't Do, Oct. 23 - Marquette Today - October 21st, 2025 [October 21st, 2025]
- Polyskill Achieves 1.7x Improved Skill Reuse and 9.4% Higher Success Rates through Polymorphic Abstraction in Machine Learning - Quantum Zeitgeist - October 21st, 2025 [October 21st, 2025]
- University of Strathclyde opens admission for MSc in Machine & Deep Learning for Jan 2026 intake - The Indian Express - October 21st, 2025 [October 21st, 2025]
- Reducing Model Biases with Machine Learning Corrections Derived from Ocean Data Assimilation Increments - ESS Open Archive - October 19th, 2025 [October 19th, 2025]
- Unlocking Obesity: Multi-Omics and Machine Learning Insights - Bioengineer.org - October 19th, 2025 [October 19th, 2025]
- Lockheed Martin advances PAC-3 MSE interceptor using artificial intelligence and machine learning - Defence Industry Europe - October 19th, 2025 [October 19th, 2025]
- Semi-automated surveillance of surgical site infections using machine learning and rule-based classification models - Nature - October 19th, 2025 [October 19th, 2025]
- AI and Machine Learning - City of San Jos to release RFP for generative AI platform - Smart Cities World - October 19th, 2025 [October 19th, 2025]
- Machine learning helps identify 'thermal switch' for next-generation nanomaterials - Phys.org - October 17th, 2025 [October 17th, 2025]
- Machine Learning Makes Wildlife Data Analysis Less of a Trek - Maryland.gov - October 17th, 2025 [October 17th, 2025]
- An interpretable multimodal machine learning model for predicting malignancy of thyroid nodules in low-resource scenarios - BMC Endocrine Disorders - October 17th, 2025 [October 17th, 2025]
- In First-Episode Psychosis Patients, Machine Learning Predicted Illness Trajectories to Potentially Improve Outcomes - Brain and Behavior Research - October 17th, 2025 [October 17th, 2025]
- Novel Machine Learning Model Improves MASLD Detection in Type 2 Diabetes - The American Journal of Managed Care (AJMC) - October 17th, 2025 [October 17th, 2025]
- Hybrid machine learning models for predicting the tensile strength of reinforced concrete incorporating nano-engineered and sustainable supplementary... - October 17th, 2025 [October 17th, 2025]
- Modelling of immune infiltration in prostate cancer treated with HDR-brachytherapy using Raman spectroscopy and machine learning - Nature - October 17th, 2025 [October 17th, 2025]
- Association between atherogenic index of plasma and sepsis in critically ill patients with ischemic stroke: a retrospective cohort study using... - October 17th, 2025 [October 17th, 2025]
- AI enters the nuclear age: Pentagon modernizes warheads with machine learning - Washington Times - October 17th, 2025 [October 17th, 2025]
- AI and Machine Learning - Bentley Systems shares its vision for trustworthy AI - Smart Cities World - October 17th, 2025 [October 17th, 2025]
- Looking back to move forward: can historical clinical trial data and machine learning drive change in participant recruitment in anticipation of... - October 15th, 2025 [October 15th, 2025]
- Physics-Based Machine Learning Paves the Way for Advanced 3D-Printed Materials - Bioengineer.org - October 15th, 2025 [October 15th, 2025]
- Predicting one-year overall survival in patients with AITL using machine learning algorithms: a multicenter study - Nature - October 15th, 2025 [October 15th, 2025]
- Explainable machine learning models for predicting of protein-energy wasting in patients on maintenance haemodialysis - BMC Nephrology - October 15th, 2025 [October 15th, 2025]
- Feasibility of machine learning analysis for the identification of patients with possible primary ciliary dyskinesia - Orphanet Journal of Rare... - October 15th, 2025 [October 15th, 2025]
- Machine learning-based prediction of preeclampsia using first-trimester inflammatory markers and red blood cell indices - BMC Pregnancy and Childbirth - October 15th, 2025 [October 15th, 2025]
- Utilizing AI and machine learning to improve railroad safety: Detecting trespasser hotspots - masstransitmag.com - October 15th, 2025 [October 15th, 2025]
- Precision medicine meets machine learning: AI and oncology biomarkers - pharmaphorum - October 15th, 2025 [October 15th, 2025]
- Aether Pro Exchange Transforms Execution Dynamics with Machine-Learning Optimization - GlobeNewswire - October 15th, 2025 [October 15th, 2025]