How to Handle Multi-Column Text Sorting with Amazon Textract
Introduction
Amazon Textract is an AWS service that extracts text from PDFs and images. Extraction is simplest when the original document has a single column, as in a book. Things become more complex with multiple columns, as in newspaper articles. This article shows how to use the Amazon Textract output to sort text from multi-column documents into reading order. It is inspired by "AWS Textract: how to detect and sort text from a multi-column document", with some improvements made.
My source material is a newspaper article, with the layout as shown below:
Textract Response Format
The Textract output is structured JSON made up of blocks of different BlockTypes. A PAGE block contains multiple LINE blocks, and each LINE block contains multiple WORD blocks. The response carries no structural information that would let you directly reorder multi-column text into a single column. What we do know is that Textract returns lines from left to right and top to bottom. You can see this order in sections 27 to 48 of the image below: even though they belong to different columns, Textract parses them sequentially from left to right and top to bottom.
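As a rough illustration of that structure, the sketch below pulls the LINE blocks out of a response in the order Textract returns them. The `sample_response` dict is a hand-made stand-in rather than real Textract output, and `extract_lines` is a hypothetical helper name; only the `Blocks`, `BlockType`, `Text`, and `Geometry.BoundingBox` keys come from the Textract response format.

```python
from typing import Any, Dict, List


def extract_lines(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the text and bounding box of every LINE block, in response order."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, and any other block types
        lines.append(
            {
                "text": block.get("Text", ""),
                "bbox": block["Geometry"]["BoundingBox"],
            }
        )
    return lines


# Hand-made sample (an assumption for illustration, not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "page-1"},
        {
            "BlockType": "LINE",
            "Text": "Left column, line 1",
            "Geometry": {
                "BoundingBox": {"Width": 0.40, "Height": 0.02, "Left": 0.05, "Top": 0.10}
            },
        },
        {
            "BlockType": "LINE",
            "Text": "Right column, line 1",
            "Geometry": {
                "BoundingBox": {"Width": 0.40, "Height": 0.02, "Left": 0.55, "Top": 0.10}
            },
        },
        {"BlockType": "WORD", "Text": "Left"},
    ]
}

lines = extract_lines(sample_response)
for line in lines:
    print(line["text"], line["bbox"]["Left"], line["bbox"]["Top"])
```

The two sample lines share the same `Top` value but sit in different columns, which is exactly the situation where response order alone does not give the reading order.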
Solution
The approach we adopt is to utilize the bounding box coordinates provided by Textract to draw boundaries around text blocks. By grouping nearby lines into a block, we can identify lines belonging to the same column. The final result will resemble the image below:
From the above image, the following information can be inferred:
- The longest block is likely the title.
- Other blocks can be sorted from top to bottom and left to right, allowing us to determine the reading order and understand the distribution of columns.
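The two observations above can be sketched as follows. This is only an illustration under stated assumptions: blocks are plain dicts with `left`, `top`, and `width` given as page ratios, "longest" is read as "widest", the title is assumed to come first, and `reading_order` and `column_tolerance` are hypothetical names chosen for this example.

```python
from typing import Dict, List


def reading_order(blocks: List[Dict], column_tolerance: float = 0.05) -> List[Dict]:
    """Order blocks as: widest block (assumed title), then column by column."""
    if not blocks:
        return []

    # Assumption: the widest block is the title and is read first.
    title = max(blocks, key=lambda b: b["width"])
    rest = [b for b in blocks if b is not title]

    # Group the remaining blocks into columns by their left edge.
    columns: List[List[Dict]] = []
    for block in sorted(rest, key=lambda b: b["left"]):
        if columns and abs(block["left"] - columns[-1][0]["left"]) <= column_tolerance:
            columns[-1].append(block)
        else:
            columns.append([block])

    # Columns are already left to right; sort each one top to bottom.
    ordered = [title]
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b["top"]))
    return ordered


sample_blocks = [
    {"name": "d", "left": 0.53, "top": 0.60, "width": 0.40},
    {"name": "b", "left": 0.05, "top": 0.50, "width": 0.40},
    {"name": "title", "left": 0.05, "top": 0.05, "width": 0.90},
    {"name": "c", "left": 0.52, "top": 0.20, "width": 0.40},
    {"name": "a", "left": 0.05, "top": 0.20, "width": 0.40},
]

ordered_names = [b["name"] for b in reading_order(sample_blocks)]
print(ordered_names)
```

A fixed tolerance on the left edge is a simple way to decide column membership; documents with indented paragraphs or irregular layouts would likely need a more careful grouping.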
Step 1: Define the Class
The official documentation describes the location of an item on the document page as follows:
"BoundingBox": {
From the official documentation
After understanding the format, we can define a class to handle this data:
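A minimal sketch of what such a class could look like, assuming only the four `BoundingBox` fields that Textract returns (`Width`, `Height`, `Left`, `Top`, each a ratio of the page size). The class layout, the `from_textract` constructor, and the derived properties are assumptions made for illustration and may differ from the class used in the rest of this article.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class BoundingBox:
    """Axis-aligned box in page-ratio coordinates (0.0 to 1.0)."""

    width: float
    height: float
    left: float
    top: float

    @classmethod
    def from_textract(cls, raw: Dict[str, float]) -> "BoundingBox":
        """Build a box from the Geometry.BoundingBox dict of a Textract block."""
        return cls(
            width=raw["Width"],
            height=raw["Height"],
            left=raw["Left"],
            top=raw["Top"],
        )

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height

    @property
    def center(self) -> Tuple[float, float]:
        return (self.left + self.width / 2, self.top + self.height / 2)


box = BoundingBox.from_textract(
    {"Width": 0.5, "Height": 0.25, "Left": 0.25, "Top": 0.5}
)
print(box.right, box.bottom, box.center)
```

Keeping `right`, `bottom`, and `center` as derived properties avoids storing values that could drift out of sync with the four fields Textract actually provides.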
Step 2: General Function
Next, we need some general functions to handle the data. Here, we will need the following functions:
- read_json_file: Read the JSON file returned by Textract.
- two_point_distance: Calculate the distance between two points. This is used to check whether the centers of two Lines are too far apart.
- two_point_height: Calculate the vertical distance between two Lines to determine if they are close enough.
- pretty_similar: Determine if the difference is within an acceptable range.
- print_blocks: Print information about Blocks.
- get_lines_from_json: Get information about Lines from the JSON returned by Textract.
- find_block_corners: Find the four corner coordinates of a Block to enclose all Lines.
import math
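For orientation, here is a minimal sketch of how the geometric helpers from the list could be written. The function names come from the list above, but the signatures, the tuple-based point and box types, and the default `tolerance` are assumptions for this example, not the article's actual implementation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]              # (x, y) in page ratios
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def two_point_distance(a: Point, b: Point) -> float:
    """Euclidean distance between two points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def two_point_height(a: Point, b: Point) -> float:
    """Vertical distance between two points."""
    return abs(a[1] - b[1])


def pretty_similar(a: float, b: float, tolerance: float = 0.01) -> bool:
    """True when two values differ by no more than the tolerance (assumed default)."""
    return abs(a - b) <= tolerance


def find_block_corners(boxes: List[Box]) -> Box:
    """Smallest box enclosing all given line boxes.

    The four corners follow from the result:
    (left, top), (right, top), (right, bottom), (left, bottom).
    """
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


line_boxes = [(0.10, 0.20, 0.50, 0.25), (0.12, 0.26, 0.48, 0.30)]
enclosing = find_block_corners(line_boxes)
print(enclosing)
```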
Step 3: Define the Block Rules
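Now we need to decide which conditions must be met for Lines to form a Block. The rules we designed are as follows:
The code that defines the rules is as follows: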
def is_two_line_close(block, target_line, cur_line):
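A minimal sketch of one plausible pair of rules, not necessarily the ones used in this article: two lines belong to the same block when their left edges nearly coincide and the vertical gap between them is small relative to the line height. The function name `lines_belong_together`, the dict-based line format, and both thresholds are hypothetical.

```python
from typing import Dict


def lines_belong_together(
    prev_line: Dict[str, float],
    cur_line: Dict[str, float],
    left_tolerance: float = 0.02,
    max_gap_ratio: float = 1.5,
) -> bool:
    """Assumed rule: same left edge, and cur_line sits just below prev_line."""
    # Rule 1 (assumption): left edges are aligned within a tolerance.
    same_left = abs(prev_line["left"] - cur_line["left"]) <= left_tolerance

    # Rule 2 (assumption): cur_line is below prev_line and the gap between
    # them is at most max_gap_ratio times the height of prev_line.
    is_below = cur_line["top"] >= prev_line["top"]
    gap = cur_line["top"] - (prev_line["top"] + prev_line["height"])
    close_vertically = gap <= max_gap_ratio * prev_line["height"]

    return same_left and is_below and close_vertically


first = {"left": 0.05, "top": 0.10, "height": 0.02}
next_in_column = {"left": 0.052, "top": 0.125, "height": 0.02}
other_column = {"left": 0.52, "top": 0.125, "height": 0.02}
far_below = {"left": 0.05, "top": 0.30, "height": 0.02}

print(lines_belong_together(first, next_in_column))
print(lines_belong_together(first, other_column))
print(lines_belong_together(first, far_below))
```

Scaling the allowed gap by the line height keeps the rule independent of font size, since Textract coordinates are ratios of the page rather than pixels.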
Step 4: Iterate over Lines to Form Blocks
Then we can iterate through all the Lines and merge them into Blocks.
def merge_lines_to_block(lines):
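The iteration can be sketched as a simple greedy pass, shown here under stated assumptions: lines are dicts with `left`, `top`, and `height` as page ratios, and both `is_close` and `group_lines_into_blocks` are hypothetical stand-ins with assumed thresholds rather than the article's own functions. Each line is compared with the last line of every open block, because Textract interleaves lines from different columns.

```python
from typing import Dict, List


def is_close(prev_line: Dict, cur_line: Dict,
             left_tolerance: float = 0.02, max_gap_ratio: float = 1.5) -> bool:
    """Assumed closeness rule: aligned left edges and a small vertical gap."""
    same_left = abs(prev_line["left"] - cur_line["left"]) <= left_tolerance
    gap = cur_line["top"] - (prev_line["top"] + prev_line["height"])
    is_below = cur_line["top"] >= prev_line["top"]
    return same_left and is_below and gap <= max_gap_ratio * prev_line["height"]


def group_lines_into_blocks(lines: List[Dict]) -> List[List[Dict]]:
    """Greedily attach each line to the first block whose last line is close."""
    blocks: List[List[Dict]] = []
    for line in lines:
        for block in blocks:
            if is_close(block[-1], line):
                block.append(line)
                break
        else:
            # No existing block accepted the line, so it starts a new one.
            blocks.append([line])
    return blocks


# Lines in the interleaved order Textract tends to return for two columns.
sample_lines = [
    {"text": "L1", "left": 0.05, "top": 0.10, "height": 0.02},
    {"text": "R1", "left": 0.55, "top": 0.10, "height": 0.02},
    {"text": "L2", "left": 0.05, "top": 0.13, "height": 0.02},
    {"text": "R2", "left": 0.55, "top": 0.13, "height": 0.02},
]

grouped = [[line["text"] for line in block]
           for block in group_lines_into_blocks(sample_lines)]
print(grouped)
```

Comparing only against the last line of each block keeps the pass cheap; a stricter variant could compare against the enclosing box of the whole block instead.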
Step 5: Execution
Finally, we can execute the code:
json_path = "./result/test.json"  # Path to the JSON file returned by Textract
The result is shown in the image below:
Advanced: Clean Code + Single Page
This section is a bit more advanced, as we will clean up the code and make it suitable for single-page documents. To manage the code better, I have split the services into modules. Below is the file structure:
File Structure
1 | . |
Entity Class file
1 | │ ├── models |
from typing import List
General Function
src/ocr/util/functions.py
# response
bbox_merger.py: the class that merges Lines into Blocks.
src/ocr/util/bbox_merger.py
import math
Then we can call the method directly.
tests/test_block_merge.ipynb
from src.ocr.util.bbox_merger import (