Introduction

Amazon Textract is an AWS service for extracting text from PDFs and images. Extraction is straightforward when the original document has a single column, as in a book. Things become more complex with multi-column layouts such as newspaper articles, because the extracted lines no longer come out in reading order. This article shares how to use Amazon Textract to sort the text of multi-column documents. It is inspired by the article "AWS Textract: how to detect and sort text from a multi-column document", with some improvements.

My source material is a newspaper article, with the layout as shown below:

Textract Response Format

The Textract output is structured JSON made up of blocks of various BlockTypes. A "PAGE" block contains multiple "LINE" blocks, and each "LINE" block contains multiple "WORD" blocks. Nothing in the response tells you which column a line belongs to, so you cannot directly sort multi-column text into a single reading order. What we do know is that Textract returns text from top to bottom and left to right. You can observe this order in the numbered sections 27 to 48 in the image below: even though they belong to different columns, Textract returns them sequentially, left to right and top to bottom.
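To make the structure concrete, here is a minimal sketch of a Textract-style response and how the LINE blocks can be pulled out in the order they are returned. The IDs, text, and coordinates are made up for illustration; only the key names (`Blocks`, `BlockType`, `Text`, `Page`, `Geometry`, `BoundingBox`, `Relationships`) follow the response format. `line_texts` is a hypothetical helper, not part of any library.

```python
# Hypothetical, heavily trimmed Textract-style response (all values are made up).
# Blocks form a flat list; the PAGE -> LINE -> WORD hierarchy is expressed
# through "Relationships" entries of type CHILD.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1", "Page": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["l1", "l2", "l3"]}]},
        {"BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "Left column, first line",
         "Geometry": {"BoundingBox": {"Left": 0.05, "Top": 0.10, "Width": 0.40, "Height": 0.02}},
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"BlockType": "LINE", "Id": "l2", "Page": 1, "Text": "Right column, first line",
         "Geometry": {"BoundingBox": {"Left": 0.55, "Top": 0.10, "Width": 0.40, "Height": 0.02}}},
        {"BlockType": "LINE", "Id": "l3", "Page": 1, "Text": "Left column, second line",
         "Geometry": {"BoundingBox": {"Left": 0.05, "Top": 0.13, "Width": 0.40, "Height": 0.02}}},
        {"BlockType": "WORD", "Id": "w1", "Page": 1, "Text": "Left",
         "Geometry": {"BoundingBox": {"Left": 0.05, "Top": 0.10, "Width": 0.06, "Height": 0.02}}},
    ]
}


def line_texts(response):
    """Return the text of every LINE block, in the order it appears in the response."""
    return [block["Text"] for block in response["Blocks"] if block["BlockType"] == "LINE"]


# The two columns come back interleaved, which is the problem this article solves.
print(line_texts(sample_response))
```

In this made-up sample the left and right columns alternate, which mirrors the interleaving described above.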

Solution

The approach we adopt is to utilize the bounding box coordinates provided by Textract to draw boundaries around text blocks. By grouping nearby lines into a block, we can identify lines belonging to the same column. The final result will resemble the image below:

From the above image, the following information can be inferred:

The longest block is likely the title.
Other blocks can be sorted from top to bottom and left to right, allowing us to determine the reading order and understand the distribution of columns.
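The two observations above can be turned into a reading order. The following is only a sketch under my own assumptions, not code from the original approach: `Box` is a hypothetical stand-in for a block, the widest box is treated as the title, and boxes whose left edges differ by less than `column_tolerance` (an arbitrary value) are treated as one column.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for a block: normalized top-left corner plus size."""
    name: str
    left: float
    top: float
    width: float
    height: float


def reading_order(boxes, column_tolerance=0.05):
    """Order boxes as: widest box first (assumed title), then column by column."""
    if not boxes:
        return []
    title = max(boxes, key=lambda b: b.width)
    rest = sorted((b for b in boxes if b is not title), key=lambda b: b.left)

    # Group boxes whose left edge is near the first box of the current column.
    columns = []
    for box in rest:
        if columns and abs(box.left - columns[-1][0].left) < column_tolerance:
            columns[-1].append(box)
        else:
            columns.append([box])

    ordered = [title]
    for column in columns:  # columns are already ordered left to right
        ordered.extend(sorted(column, key=lambda b: b.top))
    return ordered


boxes = [
    Box("right-top", 0.55, 0.20, 0.40, 0.30),
    Box("left-bottom", 0.05, 0.55, 0.40, 0.30),
    Box("title", 0.05, 0.05, 0.90, 0.08),
    Box("left-top", 0.05, 0.20, 0.40, 0.30),
    Box("right-bottom", 0.56, 0.55, 0.40, 0.30),
]
print([b.name for b in reading_order(boxes)])
# ['title', 'left-top', 'left-bottom', 'right-top', 'right-bottom']
```

Treating the widest block as the title is a heuristic; a page with a full-width footer or image caption would need an extra check.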

Step 1: Define the Class

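The official documentation describes the location of an item on the page with a BoundingBox. All four values are ratios of the page size: Left and Top give the top-left corner, and Width and Height give the size of the box: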

"BoundingBox": {
    "Width": 0.007353090215474367,
    "Height": 0.0288887619972229,
    "Left": 0.08638829737901688,
    "Top": 0.03477252274751663
}

From the official documentation
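Because the values are ratios, they have to be multiplied by the page size before drawing. Here is a small sketch of that conversion; `bbox_to_pixels` is a hypothetical helper name and the sample numbers are made up so that the arithmetic is easy to follow.

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a normalized BoundingBox into pixel corners (x1, y1, x2, y2)."""
    x1 = bbox["Left"] * page_width
    y1 = bbox["Top"] * page_height
    x2 = (bbox["Left"] + bbox["Width"]) * page_width
    y2 = (bbox["Top"] + bbox["Height"]) * page_height
    # Round to whole pixels so the result can be used directly for drawing.
    return tuple(round(value) for value in (x1, y1, x2, y2))


# Made-up box on a 1000 x 2000 pixel page.
bbox = {"Width": 0.5, "Height": 0.25, "Left": 0.1, "Top": 0.2}
print(bbox_to_pixels(bbox, page_width=1000, page_height=2000))
# (100, 400, 600, 900)
```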

After understanding the format, we can define a class to handle this data:


```python
class Page:
    """
    Used to handle the content returned by Textract. Each Page may have one or more Lines.

    Args:
        page_number (int): The number of this Page.
        lines (list[Line]): The Lines contained in this Page.
    """
    def __init__(self, page_number, lines):
        self.lines = lines
        self.page = page_number

    def __str__(self):
        # Build the text instead of printing inside __str__, so str(page) has no side effects
        line_text = "".join(f"line: {line}\n" for line in self.lines)
        return f"{line_text}Page: {self.page}"

class Block:
    """
    Block: This is the Block we are dealing with. Each Block may have one or more Lines.
    
    Args:
        page (int): The Page where this Block is located.
        id (int): Simply records the index of this Block.
        lines (Line): The Lines contained in this Block.
        left (float): The x-coordinate of the top-left corner of the Block.
        top (float): The y-coordinate of the top-left corner of the Block.
        height (float): The height of the Block.
        width (float): The width of the Block.
    """
    def __init__(self):
        self.lines = []
        self.page = 0
        self.id = 0  # index of the Block, assigned later by find_block_corners
        self.left = 0 
        self.top = 0 
        self.height = 0 
        self.width = 0
        
    def __str__(self):
        return f"Block: page={self.page}, id={self.id}, (x1,y1)=({self.left}, {self.top}), (x2,y2)=({self.left + self.width},{self.top + self.height})"

    def add_line(self, line):
        """
        Add a line to the block.
        """
        self.lines.append(line)

class Line:
    """
    Handles the smallest unit of text.
    
    Args:
        page (int): The Page where this Line is located.
        Id (int): The Line Id returned by Textract for easy lookup of the corresponding text in the original document.
        text (str): The text of the Line.
        top (float): The y-coordinate of the top-left corner of the Line.
        left (float): The x-coordinate of the top-left corner of the Line.
        width (float): The width of the Line.
        height (float): The height of the Line.
    """
    def __init__(self, Id, page, text, top, left, width, height):
        self.top = top
        self.left = left
        self.width = width
        self.height = height
        self.page = page
        self.Id = Id
        self.text = text 
        self.center = self.get_center()
    
    def __str__(self):
        return f"Line: \t page={self.page}, Id={self.Id}, Text={self.text}, \n\t (x1,y1)=({self.left}, top={self.top}); width={self.width}, height={self.height} \n"

    def get_center(self):
        """
        Get the center of the Line.
        """
        x = self.left
        y = self.top
        x1 = self.left + self.width
        y1 = self.top + self.height
        x_center = (x + x1) / 2
        y_center = (y + y1) / 2
        return [x_center, y_center]
```

Step 2: General Functions

Next, we need some general helper functions to process the data:

read_json_file: Reads the JSON file returned by Textract.
two_point_distance: Calculates the Euclidean distance between two points. It is used to check whether the centers of two Lines are too far apart.
- $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
two_point_height: Calculates the vertical distance between two Lines to determine whether they are close enough.
pretty_similar: Checks whether the difference between two values is within a given tolerance.
print_blocks: Prints information about Blocks.
get_lines_from_json: Builds Line objects from the JSON returned by Textract.
find_block_corners: Computes the bounding box of each Block so that it encloses all of its Lines.
show_image_bbox: Draws the Block bounding boxes on top of the rendered PDF pages.

import math
import json 
from pdf2image import convert_from_bytes
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

def read_json_file(file_name):
    with open(file_name, "r") as file:
        return json.load(file)

def two_point_distance(x, y, x1, y1):
    distance = math.sqrt((x - x1) ** 2 + (y - y1) ** 2)
    return distance

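def two_point_height(y, y1):
    # Vertical distance between two y-coordinates
    return abs(y - y1)

# Keep the misspelled name as an alias for code that still calls it
two_point_hight = two_point_height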

def pretty_similar(x, x1, tolerance):
    return abs(x - x1) < tolerance

def print_blocks(blocks):
    for block in blocks:
        print(f"{block.__str__()}")
        last_line = block.lines[0]
        for line in block.lines:
            print(f"Line: \t Page={line.page}, Id={line.Id}, left={line.left}, top={line.top}, width={line.width}, height={line.height}, center={line.center} two_point_distance = {two_point_distance(last_line.center[0], last_line.center[1], line.center[0], line.center[1])} (x1,y1)=({line.left}, {line.top}), (x2,y2)=({line.left + line.width},{line.top + line.height})")
            last_line = line
        print("\n")

def get_lines_from_json(file_path):
    json_response = read_json_file(file_path)
    lines = []
    for item in json_response["Blocks"]:
        if item["BlockType"] == "LINE":
            box = item["Geometry"]["BoundingBox"]
            lines.append(Line(item["Id"], item["Page"], item["Text"], box["Top"], box["Left"], box["Width"], box["Height"]))
    return lines

def find_block_corners(blocks):
    # Compute each Block's bounding box as the union of the bounding boxes of its Lines
    for index, block in enumerate(blocks):
        min_top = min(line.top for line in block.lines)
        min_left = min(line.left for line in block.lines)
        max_bottom = max(line.top + line.height for line in block.lines)
        max_right = max(line.left + line.width for line in block.lines)

        block.height = max_bottom - min_top 
        block.width = max_right - min_left
        block.top = min_top
        block.left = min_left 
        block.id = index
        
    return blocks
  

def show_image_bbox(pdf_file, blocks):
    """
    Use to show the image with bounding box
    """
    with open(pdf_file, 'rb') as file:
        images = convert_from_bytes(file.read())

    for index, image in enumerate(images):
        width, height =image.size  
        page = index + 1
        print(f"Process Page Index: {page}")
        
        plt.figure(figsize=(20,16))
        plt.imshow(image)
        
        # Draw every Block that belongs to the current page
        for block in blocks:
            if block.page == page:
                # Bounding boxes are normalized, so scale them to the image size in pixels
                rect = Rectangle((width * block.left, height * block.top), block.width * width, block.height * height, edgecolor='r', facecolor='none')
                plt.text(width * block.left, height * block.top, str(block.id), fontsize=12, color='red')
                plt.gca().add_patch(rect)
        plt.show()
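
As a quick sanity check of the two comparison helpers, here is a worked example with made-up coordinates. The two functions are repeated so the snippet runs on its own; the 0.04 threshold is only an illustrative value.

```python
import math


def two_point_distance(x, y, x1, y1):
    return math.sqrt((x - x1) ** 2 + (y - y1) ** 2)


def pretty_similar(x, x1, tolerance):
    return abs(x - x1) < tolerance


# Two hypothetical line centers that are 0.03 apart horizontally and 0.04 vertically.
# By the distance formula: sqrt(0.03^2 + 0.04^2) = sqrt(0.0025) = 0.05.
distance = two_point_distance(0.10, 0.10, 0.13, 0.14)
print(round(distance, 4))

# With a tolerance of 0.04 these centers would be considered too far apart.
print(distance < 0.04)

# Left edges 0.005 apart are "pretty similar" with a tolerance of 0.02; 0.5 apart are not.
print(pretty_similar(0.050, 0.055, tolerance=0.02))
print(pretty_similar(0.050, 0.550, tolerance=0.02))
```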

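Step 3: Define the Block Rules

Now we need to decide under which conditions Lines are merged into the same Block. The rules we use are:

The two Lines must be on the same page, because a Block cannot span multiple pages.
Same paragraph: the left edges of the two Lines are similar, and the new Line is vertically close to the last Line already in the Block.
Centered text: the centers of the two Lines are close, and the new Line is vertically close to the last Line already in the Block.

The code that defines these rules is as follows: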

def is_two_line_close(block, target_line, cur_line):
    """
    Check if two lines are close enough, so we can merge them into a block
    """
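    # All tolerances are fractions of the page size (Textract coordinates are normalized)
    left_tolerance = 0.02      # max difference between left edges
    width_tolerance = 0.01     # max difference between widths (not used by the rules below)
    distance_tolerance = 0.04  # max distance between line centers
    height_tolerance = 0.03    # max vertical gap between the tops of two lines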

    def is_left_similar(line1, line2, tolerance = left_tolerance):
        return pretty_similar(line1.left, line2.left, tolerance)
    def is_width_similar(line1, line2, tolerance = width_tolerance): 
        return  pretty_similar(line1.width, line2.width, tolerance)
    def is_height_similar(line1, line2, tolerance = height_tolerance):
        return two_point_hight(line1.top, line2.top) < tolerance 
    def is_on_same_page(line1, line2):
        return line1.page == line2.page
    def is_center_close(line1, line2):
        return two_point_distance(line1.center[0], line1.center[1], line2.center[0], line2.center[1]) < distance_tolerance
    
    def is_same_paragraph():
        """
        Use to handle the same paragraph
        If the starting point on the left is the same and the height is similar, they are in the same Block
        """
        if (is_left_similar(target_line, cur_line) and 
            is_height_similar(block.lines[-1], cur_line)):
            return True 
        return False 
    
    def is_text_center_context():
        """
        Use to handle the text in the center
        If the center is close and the height is similar, they are in the same Block
        """
        return (is_center_close(target_line, cur_line) and is_height_similar(block.lines[-1], cur_line))

    # First check if they are on the same page, as a Block cannot span multiple pages
    if not is_on_same_page(target_line, cur_line):
        return False
    # If in the same paragraph or the text is centered, the line belongs to the block.
    # Return an explicit bool in every case.
    return is_same_paragraph() or is_text_center_context()
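
To see how the rule behaves, here is a simplified, self-contained restatement with a few made-up lines. `TextLine` and `close_enough` are hypothetical names for this sketch; the tolerances copy the values used above, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import namedtuple

# Hypothetical minimal line record; all coordinates are normalized ratios.
TextLine = namedtuple("TextLine", "page left top width height")


def center(line):
    return (line.left + line.width / 2, line.top + line.height / 2)


def close_enough(first, last, cur, left_tol=0.02, top_tol=0.03, dist_tol=0.04):
    """Same page first, then: same paragraph OR centered text."""
    if first.page != cur.page:
        return False
    near_vertically = abs(last.top - cur.top) < top_tol
    same_paragraph = abs(first.left - cur.left) < left_tol and near_vertically
    centered = math.dist(center(first), center(cur)) < dist_tol and near_vertically
    return same_paragraph or centered


first_line = TextLine(page=1, left=0.05, top=0.10, width=0.40, height=0.02)
next_line = TextLine(page=1, left=0.05, top=0.125, width=0.38, height=0.02)
other_column = TextLine(page=1, left=0.55, top=0.10, width=0.40, height=0.02)
centered_heading = TextLine(page=1, left=0.15, top=0.125, width=0.20, height=0.02)
next_page = TextLine(page=2, left=0.05, top=0.125, width=0.38, height=0.02)

for name, line in [("next_line", next_line), ("other_column", other_column),
                   ("centered_heading", centered_heading), ("next_page", next_page)]:
    # The block so far contains only first_line, so it is both first and last.
    print(name, close_enough(first_line, first_line, line))
```

The line directly below and the centered heading are accepted, while the line in the other column and the line on the next page are rejected.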

Step 4: Iterate over Lines to Form Blocks

Then we can start iterating through all the Lines to form Blocks.

def merge_lines_to_block(lines):
    """
    Merge Lines into Blocks.
    """
    ready_blocks = [] 

    # As long as there are still lines, continue to form Blocks
    while lines: 
        block = Block()
        target_line = lines[0] # Take the first Line as the first object to compare for forming a Block
        block.add_line(lines[0]) # Add the target_line to the Block
        block.page = target_line.page # Set the Block's Page
        lines.pop(0) # Remove the target_line from lines
        index = 0  # Start scanning the remaining lines from the beginning
        # Scan the remaining lines repeatedly until none of them can be added to the Block
        while index < len(lines):
            cur_line = lines[index]
            if target_line.page == cur_line.page: 
                # Add the line if it satisfies the closeness rules from Step 3
                if is_two_line_close(block, target_line, cur_line):
                    block.add_line(cur_line)
                    lines.pop(index) # After popping, cur_line needs to start from index 0
                    index = 0  # Reset index to 0
                    continue  # Continue to the next iteration
            index += 1  # Check the next element

        ready_blocks.append(block) # Add the organized Block to the list
    return ready_blocks
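
The loop is easier to follow on a tiny example. The sketch below is a simplified version of the same greedy idea, using plain `(text, left, top)` tuples and only the "same left edge, vertically close to the last line" rule; `group_lines` and the sample values are made up for illustration.

```python
def group_lines(lines, left_tol=0.02, top_tol=0.03):
    """Greedy grouping: seed a block with the first remaining line, then keep
    absorbing any line that is close to the block's last line."""
    remaining = list(lines)  # work on a copy so the caller's list is untouched
    blocks = []
    while remaining:
        block = [remaining.pop(0)]
        index = 0
        while index < len(remaining):
            _, left, top = remaining[index]
            _, seed_left, _ = block[0]
            _, _, last_top = block[-1]
            if abs(left - seed_left) < left_tol and abs(top - last_top) < top_tol:
                block.append(remaining.pop(index))
                index = 0  # rescan, because the block's last line has changed
                continue
            index += 1
        blocks.append(block)
    return blocks


# (text, left, top) in the interleaved order a two-column page would come back in.
lines = [
    ("L1", 0.05, 0.10), ("R1", 0.55, 0.10),
    ("L2", 0.05, 0.12), ("R2", 0.55, 0.12),
    ("L3", 0.05, 0.14), ("R3", 0.55, 0.14),
]
for block in group_lines(lines):
    print([text for text, _, _ in block])
# ['L1', 'L2', 'L3']
# ['R1', 'R2', 'R3']
```

Unlike the function above, this sketch copies the input list first, so the caller's `lines` is not emptied as a side effect.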

Step 5: Execution

Finally, we can execute the code:

json_path = "./result/test.json" # Path to the JSON file returned by Textract
pdf_path = "../../src/test.pdf"

lines = get_lines_from_json(json_path) # Get information about Lines from the JSON returned by Textract
blocks = merge_lines_to_block(lines) # Merge Lines into Blocks 
blocks = find_block_corners(blocks) # Find the four corner coordinates of a Block to enclose all Lines
show_image_bbox(pdf_path, blocks) # Display the image with bounding boxes around Blocks
#print_blocks(blocks) # Print information about Blocks

The result is shown in the image below:

Advanced: Clean Code + Single Column

This section is a bit more advanced: we clean up the code and extend it to handle single-column documents as well. To make the code easier to manage, I have split it into modules. Below is the file structure:

File Structure

.
├── resource
│   ├── pdf # put pdf file 
│   │   ├── single-column.pdf
│   │   └── multi-column.pdf
│   └── result # put textract json file
│       ├── single-column
│       │   └── final-result.json
│       └── multi-column
│           ├── final-result.json
│           ├── result_0.json
│           └── result_1.json
├── src # main source code 
│   ├── models # Class
│   │   ├── __init__.py
│   │   ├── block.py
│   │   ├── line.py
│   │   ├── page.py
│   │   └── process_type.py
│   ├── ocr # OCR related service 
│   │   ├── util
│   │   │   ├── __init__.py
│   │   │   ├── bbox_merger.py
│   │   │   └── functions.py
│   │   └── __init__.py
│   └── __init__.py
├── tests
│   ├── __init__.py
│   └── test_block_merge.ipynb
└── README.md
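
The tree does not say how final-result.json was produced from result_0.json and result_1.json. One plausible reading, and this is purely my assumption, is that the partial files are pages of one Textract job whose `Blocks` lists were concatenated. A sketch of such a merge, with `merge_textract_results` as a hypothetical helper:

```python
import json
import tempfile
from pathlib import Path


def merge_textract_results(part_paths, output_path):
    """Concatenate the "Blocks" lists of several partial responses into one file."""
    merged = {"Blocks": []}
    for path in part_paths:
        with open(path, "r", encoding="utf-8") as file:
            part = json.load(file)
        merged["Blocks"].extend(part.get("Blocks", []))
    with open(output_path, "w", encoding="utf-8") as file:
        json.dump(merged, file)
    return merged


# Demo with two tiny made-up partial results written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    parts = []
    for index, text in enumerate(["page one", "page two"]):
        path = Path(folder) / f"result_{index}.json"
        block = {"BlockType": "LINE", "Page": index + 1, "Text": text}
        path.write_text(json.dumps({"Blocks": [block]}), encoding="utf-8")
        parts.append(path)
    merged = merge_textract_results(parts, Path(folder) / "final-result.json")

print([block["Text"] for block in merged["Blocks"]])
# ['page one', 'page two']
```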

Entity Class Files

│   ├── models
│   │   ├── __pycache__
│   │   ├── __init__.py
│   │   ├── block.py
│   │   ├── line.py
│   │   ├── page.py
│   │   └── process_type.py

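# src/models/block.py
from typing import List
from src.models.line import Line


class Block:
    def __init__(self) -> None:
        self.lines: List[Line] = []
        self.page: int = 0
        self.reason: str = ""
        self.id: int = 0  # index assigned by find_block_corners

        # Normalized coordinates (fractions of the page size), hence float
        self.left: float = 0.0
        self.top: float = 0.0
        self.height: float = 0.0
        self.width: float = 0.0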

    def __str__(self) -> str:
        return f"Block: page={self.page}, id={self.id}, (x1,y1)=({self.left}, {self.top}), (x2,y2)=({self.left + self.width},{self.top + self.height})"

    def add_line(self, line):
        self.lines.append(line)

# src/models/line.py
class Line:
    def __init__(self, id_: str, page: int, text: str, top: float, left: float, width: float, height: float) -> None:
        # Textract coordinates are normalized ratios, so they are floats
        self.top: float = top
        self.left: float = left
        self.width: float = width
        self.height: float = height

        self.page: int = page
        self.id_: str = id_
        self.text: str = text
        self.center: list = self.get_center()

    def __str__(self) -> str:
        return (f"Line: \t page={self.page}, "
                f"Id={self.id_}, "
                f"Text={self.text}, \n"
                f"(left={self.left}, top={self.top}); "
                f"width={self.width}, height={self.height} \n")

    def get_center(self) -> list:
        x = self.left
        y = self.top
        x1 = self.left + self.width
        y1 = self.top + self.height
        x_center = (x + x1) / 2
        y_center = (y + y1) / 2
        return [x_center, y_center]

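# src/models/page.py
class Page:
    def __init__(self, page_number, lines):
        self.lines = lines
        self.page = page_number

    def __str__(self):
        # Build the text instead of printing inside __str__, so str(page) has no side effects
        line_text = "".join(f"line: {line}\n" for line in self.lines)
        return f"{line_text}Page: {self.page}"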

# src/models/process_type.py
class ProcessType:
    LINE = "LINE"
    WORD = "WORD"

General Functions

src/ocr/util/functions.py

# response
import json
from PIL import Image
from pdf2image import convert_from_bytes
from typing import List


def read_json_file(file_path):
    with open(file_path, "r") as file:
        return json.load(file)


def read_file_to_bytes(file_path: str) -> List[Image.Image]:
    with open(file_path, 'rb') as file:
        pdf_binary = file.read()
    return convert_from_bytes(pdf_binary)

bbox_merger.py: the classes that turn Lines into Blocks.

src/ocr/util/bbox_merger.py

import math
import re
from typing import List

from pdf2image import convert_from_bytes
from matplotlib.patches import Rectangle

from src.models.block import Block
from src.models.line import Line
from src.ocr.util.functions import read_json_file
from matplotlib import pyplot as plt


class ColumnType:
    """
    Column type, based on the column type of pdf.
    """
    SINGLE = "SINGLE"
    MULTI = "MULTI"


def get_lines_from_json(file_path: str) -> List[Line]:
    """
    Get all type "LINE" from json file generated by textract.
    :param file_path: json file generated by textract.
    :return: list of Line.
    """
    lines: List[Line] = []
    json_res = read_json_file(file_path)
    for item in json_res["Blocks"]:
        if item["BlockType"] == "LINE":
            box = item["Geometry"]["BoundingBox"]
            lines.append(
                Line(
                    item["Id"],
                    item["Page"],
                    item["Text"],
                    box["Top"],
                    box["Left"],
                    box["Width"],
                    box["Height"]))
    return lines


def print_blocks(blocks: List[Block]) -> None:
    """
    print block and line information
    :param blocks: blocks to print
    """
    for block in blocks:
        print(f"{block.__str__()}")
        for line in block.lines:
            print(f"{line.__str__()}")
        print("\n")


class LineSimilarityChecker:
    """
    This class is used to check the similarity between two lines.
    """

    def __init__(self, column_type: ColumnType,
                 distance_tolerance: float = 0.03,
                 width_tolerance: float = 0.01,
                 left_tolerance: float = 0.02,
                 height_tolerance: float = 0.02,
                 same_line_tolerance: float = 0.005
                 ) -> None:
        self.column_type = column_type

        self.distance_tolerance = distance_tolerance
        self.width_tolerance = width_tolerance
        self.left_tolerance = left_tolerance
        self.height_tolerance = height_tolerance
        self.same_line_tolerance = same_line_tolerance

    def is_left_similar(self, line1, line2, tolerance=None):
        tolerance = tolerance or self.left_tolerance
        return self.pretty_similar(line1.left, line2.left, tolerance)

    def is_width_similar(self, line1, line2, tolerance=None):
        tolerance = tolerance or self.width_tolerance
        return self.pretty_similar(line1.width, line2.width, tolerance)

    def is_height_similar(self, line1, line2, tolerance=None):
        tolerance = tolerance or self.height_tolerance
        return self.two_point_height(line1.top, line2.top) < tolerance

    def is_center_close(self, line1: Line, line2: Line) -> bool:
        return self.two_point_distance(
            line1.center[0],
            line1.center[1],
            line2.center[0],
            line2.center[1]) < self.distance_tolerance

    @staticmethod
    def pretty_similar(x: float, x1: float, tolerance: float):
        return abs(x - x1) < tolerance

    @staticmethod
    def two_point_distance(x: float, y: float, x1: float, y1: float):
        distance = math.sqrt((x - x1) ** 2 + (y - y1) ** 2)
        return distance

    @staticmethod
    def two_point_height(y: float, y1: float):
        return abs(y - y1)


class LineMerger:
    """
    This class is used to turn lines to blocks by compare each line's similarity.
    """

    def __init__(self, lines, column_type: ColumnType = ColumnType.SINGLE):
        self.column_type = column_type
        self.line_check = LineSimilarityChecker(self.column_type)
        self.lines: List[Line] = lines

    def get_blocks(self) -> List[Block]:
        """
        Get all blocks after turning lines into blocks.
        The column type (SINGLE or MULTI) is the one passed to the constructor.
        :return: blocks
        """
        blocks = self.merge_lines_to_block(self.lines)
        return self.find_block_corners(blocks)

    def merge_lines_to_block(self, lines) -> List[Block]:
        blocks: List[Block] = []
        while lines:
            block = Block()
            block.add_line(lines.pop(0))
            block.page = block.lines[0].page
            target_line = block.lines[0]
            index = 0
            while index < len(lines):
                cur_line = lines[index]
                if target_line.page == cur_line.page:
                    # For single-column documents, a line that starts with a
                    # list marker (e.g. "1." or "(a)") ends the current block
                    if self.column_type == ColumnType.SINGLE and self.is_start_special_word(
                            cur_line):
                        print("---Found special word---")
                        print(cur_line.text)
                        print("---End special word: Jump to next block---")
                        break
                    # In every other case, check whether the line is close
                    # enough to the block.
                    else:
                        if self.is_two_line_close(block, cur_line):
                            block.add_line(cur_line)
                            lines.pop(index)
                            index = 0
                            continue
                index += 1
            blocks.append(block)
        return blocks

    def is_start_special_word(self, cur_line: Line) -> bool:
        # Strip leading/trailing whitespace, split on spaces, and take the first token
        cur_start = cur_line.text.strip().split(" ")[0]
        pattern = self._regex_pattern()

        return bool(re.match(pattern, cur_start))

    @staticmethod
    def _regex_pattern() -> str:
        # One letter or digit followed by "." and anything else (e.g. "1.Hello my friend")
        GENERAL_WORD_DOT_PATTERN = r'^[a-zA-Z0-9]\..*'
        # One letter or digit wrapped in non-alphanumeric characters (e.g. "(1) This is ...")
        NON_ALPHANUMERIC_WORD_PATTERN = r'[^a-zA-Z0-9][a-zA-Z0-9][^a-zA-Z0-9].*'

        return '{}|{}'.format(
            GENERAL_WORD_DOT_PATTERN,
            NON_ALPHANUMERIC_WORD_PATTERN)

    def is_two_line_close(self, block, cur_line):
        last_line: Line = block.lines[-1]
        target_line: Line = block.lines[0]

        if self.is_on_same_page(target_line, cur_line):
            # multi column: center text & paragraph block
            if self.column_type == ColumnType.MULTI:
                if (self.is_same_paragraph(last_line, cur_line) or
                        self.is_text_center_context(last_line, cur_line)):
                    return True

            # single column: same line
            elif self.column_type == ColumnType.SINGLE:
                if (self.is_on_same_line(last_line, cur_line) or
                        self.is_same_paragraph(last_line, cur_line)):
                    return True

        return False

    @staticmethod
    def is_on_same_page(line1, line2) -> bool:
        return line1.page == line2.page

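    def is_on_same_line(self, last_line, cur_line) -> bool:
        # Use the tighter same_line_tolerance: two lines count as the same row
        # only when their tops are almost identical
        return self.line_check.is_height_similar(
            last_line, cur_line, self.line_check.same_line_tolerance)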

    def is_same_paragraph(
            self,
            last_line: Line,
            cur_line: Line) -> bool:
        if (self.line_check.is_left_similar(last_line, cur_line)
                and self.line_check.is_height_similar(last_line, cur_line)):
            return True

        return False

    def is_text_center_context(self, last_line: Line, cur_line: Line) -> bool:
        return (self.line_check.is_center_close(last_line, cur_line) and
                self.line_check.is_height_similar(last_line, cur_line))

    @staticmethod
    def find_block_corners(blocks: List[Block]) -> List[Block]:
        for index, block in enumerate(blocks):
            min_top = min(line.top for line in block.lines)
            min_left = min(line.left for line in block.lines)
            max_bottom = max(line.top + line.height for line in block.lines)
            max_right = max(line.left + line.width for line in block.lines)

            block.height = max_bottom - min_top
            block.width = max_right - min_left
            block.top = min_top
            block.left = min_left
            block.id = index

        return blocks


def show_image_bbox(pdf_file, blocks) -> None:
    """
    show image bounding box
    :param pdf_file: the pdf file location
    :param blocks: the list of blocks we want to draw
    """
    with open(pdf_file, 'rb') as file:
        images = convert_from_bytes(file.read())

    for index, image in enumerate(images):
        width, height = image.size
        page = index + 1
        print(f"Process Page Index: {page}")

        plt.figure(figsize=(20, 16))
        plt.imshow(image)

        # iterate over the blocks
        for block in blocks:
            if block.page == page:
                rect = Rectangle(
                    (width * block.left,
                     height * block.top),
                    block.width * width,
                    block.height * height,
                    edgecolor='r',
                    facecolor='none')
                plt.text(
                    width * block.left,
                    height * block.top,
                    block.id,
                    fontsize=12,
                    color='red')
                plt.gca().add_patch(rect)
        plt.show()
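
The list-marker pattern used by is_start_special_word can be checked in isolation. The snippet below copies the two alternatives from _regex_pattern and applies them to the first token of a few made-up lines; `starts_with_marker` is a hypothetical stand-alone version of the method.

```python
import re

# Same two alternatives as in _regex_pattern.
PATTERN = r'^[a-zA-Z0-9]\..*|[^a-zA-Z0-9][a-zA-Z0-9][^a-zA-Z0-9].*'


def starts_with_marker(text):
    """Return True if the first whitespace-separated token looks like a list marker."""
    first_token = text.strip().split(" ")[0]
    return bool(re.match(PATTERN, first_token))


samples = [
    "1. Introduction",
    "(a) first item",
    "Plain sentence",
    "10. Tenth item",
    "U.S. markets rose",
]
for sample in samples:
    print(f"{sample!r}: {starts_with_marker(sample)}")
# '1. Introduction': True
# '(a) first item': True
# 'Plain sentence': False
# '10. Tenth item': False
# 'U.S. markets rose': True
```

Two limitations show up here: markers with more than one character before the dot (such as "10.") are not detected, and abbreviations such as "U.S." are detected even though they are not list markers. If that matters for your documents, the pattern would need to be adjusted.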

Then we can call the method directly.
tests/test_block_merge.ipynb

from src.ocr.util.bbox_merger import (
    show_image_bbox, 
    get_lines_from_json, 
    LineMerger, 
    print_blocks, 
    ColumnType
)

class Value:
    def __init__(self, is_multi_column: bool):
        if is_multi_column:
            self.topic = "multi-column"
            self.column_type = ColumnType.MULTI
        else:
            self.topic = "single-column"
            self.column_type = ColumnType.SINGLE
            
        self.json_path = '../resource/result/{}/final-result.json'.format(self.topic)
        self.pdf_path = '../resource/pdf/{}.pdf'.format(self.topic)

def main(is_multi_column: bool):
    v = Value(is_multi_column = is_multi_column)
    lines = get_lines_from_json(v.json_path)
    blocks = LineMerger(lines, v.column_type).get_blocks()
    print_blocks(blocks)
    show_image_bbox(pdf_file=v.pdf_path, blocks=blocks)

if __name__ == "__main__":
    main(is_multi_column = False)