AWS Textract is a managed service that extracts text and structured data, such as key-value pairs and tables, from scanned documents. This article shows how to use it, both from the console and from the AWS CLI or API.
- Start by opening the AWS Textract console.
- Upload a document, such as a scanned form, from the console.
- Review the output: the raw text, the key-value pairs, and any tables Textract found.
- Download the results as a CSV file and a text file.
- To automate the process, call Textract from the AWS CLI or API instead of the console.
Many companies use human workers to do manual data entry on forms, applications, and other physical documents. While this is very accurate, it’s slow and costly. AWS Textract uses machine learning to automate this process.
Why Use AWS Textract?
Textract certainly isn’t the only Optical Character Recognition tool—there are plenty of open source solutions available for free, such as Tesseract OCR. You can read our guide to using that to learn more.
Textract, however, is a lot more than simple OCR as it’s meant for analyzing and extracting data from forms, tables, and other documents. It’s able to pull out important key-value pairs, tables, and other key strings, which makes it actually usable as an interface between scanned documents and a database (though you’ll need to set that automation up yourself).
The other allure is that Textract makes OCR available as a fully managed cloud service. You don’t need to set up your own application servers to run OCR and understand the output; just configure Textract and send it some documents, and it will output the results.
For companies still doing manual data entry, Textract can save a lot of money, both by reducing the hours spent typing on a keyboard and by batch processing many documents at once, which speeds up data entry immensely.
In terms of price, Textract is cheapest for straight up text, like scanning pages of books. For that, it only costs $1.50 per 1000 pages. For analyzing tables, it costs $15.00 per 1000 pages. For key-value pairs, it costs $50.00 per 1000 pages. While that’s not exactly free, it sure beats paying a human to do it manually.
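To get a feel for what those rates mean for a real workload, here is a minimal sketch of the arithmetic. The prices are the ones quoted above; the function name and the category labels are made up for illustration, and actual AWS pricing may differ by region and change over time.

```python
# Per-1,000-page prices quoted in the text above (USD). These are
# illustrative; check current AWS pricing before budgeting.
PRICE_PER_1000_PAGES = {
    "text": 1.50,        # plain text detection
    "tables": 15.00,     # table analysis
    "key_value": 50.00,  # key-value pair (form) analysis
}


def estimate_cost(pages: int, kind: str) -> float:
    """Estimate the cost in USD of processing `pages` pages of one kind."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages / 1000 * PRICE_PER_1000_PAGES[kind]


if __name__ == "__main__":
    for kind in PRICE_PER_1000_PAGES:
        print(f"{kind}: ${estimate_cost(20_000, kind):.2f}")
```

At these rates, 20,000 pages of forms would come to about $1,000, while the same number of plain text pages would be about $30.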
Using Textract
Head over to the Textract Management Console, and click “get started.” Using the console manually, you can upload documents with the upload button.
Textract will process it immediately. You’ll quickly see what makes Textract so useful: given a W-2 form, for example, it can tell which pieces of text are important, which ones are part of key-value pairs, which ones are part of tables, and which ones it can throw out.
On the right, you’ll find the output, which displays all the raw strings it found, the key-value pairs, and any tables of data. Note that these aren’t mutually exclusive, as in this case it found key-value pairs that were also parts of tables.
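The same three kinds of output show up in the API response as a flat list of blocks that reference each other by ID. The sketch below is one way you might pull the raw lines and key-value pairs back out. The helper names are hypothetical, and `SAMPLE` is a hand-written stand-in shaped like a Textract response, not real output.

```python
def _text_of(block, blocks_by_id):
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def raw_lines(response):
    """Return the text of every LINE block, in response order."""
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


def key_value_pairs(response):
    """Map each KEY block's text to the text of its linked VALUE block(s)."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    values.append(_text_of(blocks_by_id[value_id], blocks_by_id))
        pairs[_text_of(block, blocks_by_id)] = " ".join(values)
    return pairs


# Hand-written illustrative sample, not actual Textract output.
SAMPLE = {"Blocks": [
    {"Id": "l1", "BlockType": "LINE", "Text": "Employee name: Jane Doe"},
    {"Id": "w1", "BlockType": "WORD", "Text": "Employee"},
    {"Id": "w2", "BlockType": "WORD", "Text": "name:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Doe"},
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
]}

if __name__ == "__main__":
    print(raw_lines(SAMPLE))
    print(key_value_pairs(SAMPLE))
```

This sketch only collects word text, so checkbox-style values would come back as empty strings, and tables would need a similar walk over their own block types.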
You can download the results, and you’ll find a CSV file of all tables and key-value pairs, as well as a text file of the raw text output.
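If you are working from the API rather than the console, you can produce a similar CSV yourself. This is a minimal sketch that assumes you already have the key-value pairs in a Python dict; the function name and the two-column layout are assumptions, not the format the console uses.

```python
import csv
import io


def pairs_to_csv(pairs: dict) -> str:
    """Render key-value pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["key", "value"])
    for key, value in pairs.items():
        # csv.writer quotes any field that contains a comma.
        writer.writerow([key, value])
    return buffer.getvalue()


if __name__ == "__main__":
    print(pairs_to_csv({"Employee name:": "Jane Doe", "Wages": "1,234.00"}))
```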
If you want to automate Textract, you’ll need to use the AWS CLI or API. Textract has its own set of commands for working with it from the command line.
You can either serialize the document to base64-encoded document bytes, or upload it to S3 and give Textract a key for where to find it. Then, you can use analyze-document to run the analysis.
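As a rough sketch of both options in Python, the helpers below build the document argument from either a local file or an S3 location and pass it to boto3's `analyze_document`, the SDK counterpart of the analyze-document command. The helper names, bucket, and key are hypothetical, and the call assumes AWS credentials and a region are already configured.

```python
def build_document(path=None, bucket=None, key=None):
    """Build the Document argument from a local file or an S3 location."""
    if path is not None:
        with open(path, "rb") as handle:
            # boto3 takes the raw bytes and handles the encoding itself.
            return {"Bytes": handle.read()}
    if bucket is not None and key is not None:
        return {"S3Object": {"Bucket": bucket, "Name": key}}
    raise ValueError("pass either path, or both bucket and key")


def analyze(document, client=None):
    """Run a synchronous analysis for tables and key-value pairs (forms)."""
    if client is None:
        import boto3  # imported lazily so build_document works without it
        client = boto3.client("textract")
    return client.analyze_document(
        Document=document,
        FeatureTypes=["TABLES", "FORMS"],
    )


if __name__ == "__main__":
    # Hypothetical bucket and key; printing the request needs no AWS access.
    print(build_document(bucket="my-bucket", key="scans/w2.png"))
```

Passing `client` explicitly is only there to make the function easy to try with a stand-in object; in normal use you would let it create the boto3 client.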
This is a synchronous operation, but you can analyze asynchronously by starting a job and then fetching the results manually.
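The asynchronous route can be sketched with boto3's `start_document_analysis` and `get_document_analysis`. The function name, polling interval, and retry limit below are assumptions for illustration, and simple polling is just one way to wait for the job to finish.

```python
import time


def analyze_async(client, bucket, key, poll_seconds=5.0, max_polls=120):
    """Start an analysis job on an S3 document and return all result blocks."""
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state or we give up.
    for _ in range(max_polls):
        result = client.get_document_analysis(JobId=job_id)
        status = result["JobStatus"]
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            break
        if status == "FAILED":
            raise RuntimeError(result.get("StatusMessage", "Textract job failed"))
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"job {job_id} did not finish in time")

    # Results are paginated; follow NextToken until every page is read.
    blocks = list(result["Blocks"])
    while "NextToken" in result:
        result = client.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result["Blocks"])
    return blocks
```

You would call it with a real client, for example `analyze_async(boto3.client("textract"), "my-bucket", "scans/w2.pdf")`, where the bucket and key are placeholders. Asynchronous jobs read the document from S3, so there is no bytes option here.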