Introduction

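AWS Textract is an AWS service for extracting text from PDFs (or images). The easiest case is a source document with a single column, such as a book. With multiple columns, such as a newspaper article, things get more complicated. This post shows how to use Amazon Textract to put the text of a multi-column document into reading order. It builds on the article "AWS Textract: how to detect and sort text from a multi-column document", with some improvements.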

My source is a newspaper article with the following layout:

Textract Response format

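Textract's output is a JSON document made of Block objects arranged in a hierarchy by BlockType. A block of type PAGE contains multiple LINE blocks, and each LINE in turn contains multiple WORD blocks. The response carries no layout information, so by itself it cannot turn multi-column text into a single column. What we do know is that Textract parses text from top to bottom, one row at a time. See the items numbered 27 to 48 in the figure below: even though they sit in different columns, Textract parses them left to right and then moves down.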
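To make the row-by-row order concrete, here is a minimal sketch that pulls the LINE blocks out of a response. The `sample_response` dictionary is hand-made for illustration (its values are not from the article's document), and `line_texts` is a hypothetical helper name.

```python
# A hand-made response in the shape Textract returns (illustrative values only).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1", "Page": 1},
        {"BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "Left column, row 1",
         "Geometry": {"BoundingBox": {"Width": 0.4, "Height": 0.02, "Left": 0.05, "Top": 0.10}}},
        {"BlockType": "LINE", "Id": "l2", "Page": 1, "Text": "Right column, row 1",
         "Geometry": {"BoundingBox": {"Width": 0.4, "Height": 0.02, "Left": 0.55, "Top": 0.10}}},
        {"BlockType": "LINE", "Id": "l3", "Page": 1, "Text": "Left column, row 2",
         "Geometry": {"BoundingBox": {"Width": 0.4, "Height": 0.02, "Left": 0.05, "Top": 0.13}}},
        {"BlockType": "WORD", "Id": "w1", "Page": 1, "Text": "Left",
         "Geometry": {"BoundingBox": {"Width": 0.08, "Height": 0.02, "Left": 0.05, "Top": 0.10}}},
    ]
}


def line_texts(response):
    """Return (text, left, top) for every LINE block, in response order."""
    result = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        result.append((block["Text"], box["Left"], box["Top"]))
    return result


for text, left, top in line_texts(sample_response):
    print(f"top={top:.2f} left={left:.2f} {text}")
```

In this toy response the right column's first row comes before the left column's second row, which is the interleaving that the rest of the post untangles.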

Solution

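Our approach is to use the bounding-box coordinates that Textract provides. Lines that are close to each other are merged into a Block, which identifies the Lines that belong to the same column. The final result looks like the figure below: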

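The figure above tells us the following:

The widest Block is probably the title
The other Blocks can be sorted top to bottom and left to right, which gives the reading order and reveals how the columns are laid out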
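The two observations above can be sketched as a small sorting routine. This is only one possible reading of them, not code from the article: `Box`, `reading_order`, and the `left_tolerance` value are assumptions made for illustration, and treating the widest box as the title is a heuristic that can fail on other layouts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    text: str
    left: float
    top: float
    width: float
    height: float


def reading_order(boxes: List[Box], left_tolerance: float = 0.05) -> List[Box]:
    """Widest box first (title guess), then column by column, top to bottom."""
    if not boxes:
        return []
    title = max(boxes, key=lambda b: b.width)
    rest = [b for b in boxes if b is not title]

    # Walk the boxes from left to right and open a new column whenever the
    # left edge moves away from the current column's first box.
    columns: List[List[Box]] = []
    for box in sorted(rest, key=lambda b: b.left):
        if columns and abs(columns[-1][0].left - box.left) < left_tolerance:
            columns[-1].append(box)
        else:
            columns.append([box])

    ordered = [title]
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.top))
    return ordered


boxes = [
    Box("col2-a", 0.55, 0.20, 0.40, 0.30),
    Box("col1-b", 0.05, 0.55, 0.40, 0.30),
    Box("headline", 0.05, 0.05, 0.90, 0.08),
    Box("col1-a", 0.06, 0.20, 0.40, 0.30),
    Box("col2-b", 0.55, 0.55, 0.40, 0.30),
]
print([b.text for b in reading_order(boxes)])
# ['headline', 'col1-a', 'col1-b', 'col2-a', 'col2-b']
```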

Step 1: Define the Class

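The official documentation explains how the location of an item on the page is expressed. Each value is a ratio of the page width or height rather than a pixel count. In short, it looks like this: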

"BoundingBox": {
                    "Width": 0.007353090215474367,
                    "Height": 0.0288887619972229,
                    "Left": 0.08638829737901688,
                    "Top": 0.03477252274751663
                }

Taken from the official documentation
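Because the values are ratios, drawing a box on a rendered page means scaling by the image size. A minimal sketch, where `to_pixels` is a hypothetical helper and the box values are chosen for easy arithmetic:

```python
def to_pixels(bounding_box, image_width, image_height):
    """Convert a ratio-based BoundingBox into (x0, y0, x1, y1) pixel corners."""
    x0 = bounding_box["Left"] * image_width
    y0 = bounding_box["Top"] * image_height
    x1 = x0 + bounding_box["Width"] * image_width
    y1 = y0 + bounding_box["Height"] * image_height
    return x0, y0, x1, y1


box = {"Width": 0.25, "Height": 0.5, "Left": 0.5, "Top": 0.25}
print(to_pixels(box, 1000, 2000))  # (500.0, 500.0, 750.0, 1500.0)
```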

With the format understood, we can define classes to hold this data:

class Page:
    def __init__(self, page_number, lines):
        self.lines = lines
        self.page = page_number

    def __str__(self):
        for line in self.lines:
            print(f"line: {line.__str__()}")
        return f"Page: {self.page}"


from typing import List
from src.models.line import Line


class Block:
    def __init__(self) -> None:
        self.lines: List[Line] = []
        self.page: int = 0
        self.reason: str = ""
        self.id: str = ""

        # Textract coordinates are ratios between 0 and 1, so these are floats.
        self.left: float = 0.0
        self.top: float = 0.0
        self.height: float = 0.0
        self.width: float = 0.0

    def __str__(self) -> str:
        return f"Block: page={self.page}, id={self.id}, (x1,y1)=({self.left}, {self.top}), (x2,y2)=({self.left + self.width},{self.top + self.height})"

    def add_line(self, line):
        self.lines.append(line)

class Line:
    def __init__(self, id_: str, page: int, text: str, top: float, left: float, width: float, height: float) -> None:
        # Textract coordinates are ratios between 0 and 1, so these are floats.
        self.top: float = top
        self.left: float = left
        self.width: float = width
        self.height: float = height

        self.page: int = page
        self.id_: str = id_
        self.text: str = text
        self.center: list = self.get_center()

    def __str__(self) -> str:
        return (f"Line: \t page={self.page}, "
                f"Id={self.id_}, "
                f"Text={self.text}, \n"
                f"(left={self.left}, top={self.top}); "
                f"width={self.width}, height={self.height} \n")

    def get_center(self) -> list:
        x = self.left
        y = self.top
        x1 = self.left + self.width
        y1 = self.top + self.height
        x_center = (x + x1) / 2
        y_center = (y + y1) / 2
        return [x_center, y_center]

Step 2: General Function

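Next we need some general-purpose functions to process the data. We will use the following:

read_json_file: reads the JSON file returned by Textract
read_file_to_bytes: reads the PDF file, mainly so the result can be displayed
print_blocks: prints the information of each Block
get_lines_from_json: extracts the Line information from the JSON returned by Textract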

# response
import json
from PIL import Image
from pdf2image import convert_from_bytes
from typing import List


def read_json_file(file_path):
    with open(file_path, "r") as file:
        return json.load(file)


def read_file_to_bytes(file_path: str) -> List[Image.Image]:
    with open(file_path, 'rb') as file:
        pdf_binary = file.read()
    return convert_from_bytes(pdf_binary)


def get_lines_from_json(file_path: str) -> List[Line]:
    """
    Get all type "LINE" from json file generated by textract.
    :param file_path: json file generated by textract.
    :return: list of Line.
    """
    lines: List[Line] = []
    json_res = read_json_file(file_path)
    for item in json_res["Blocks"]:
        if item["BlockType"] == "LINE":
            box = item["Geometry"]["BoundingBox"]
            lines.append(
                Line(
                    item["Id"],
                    item["Page"],
                    item["Text"],
                    box["Top"],
                    box["Left"],
                    box["Width"],
                    box["Height"]))
    return lines


def print_blocks(blocks: List[Block]) -> None:
    """
    print block and line information
    :param blocks: blocks to print
    """
    for block in blocks:
        print(f"{block.__str__()}")
        for line in block.lines:
            print(f"{line.__str__()}")
        print("\n")

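In addition, whether two Lines should be merged into one Block depends largely on whether they are close to each other, so we define a class that is responsible for that check:

two_point_distance computes the distance between two points. It is used to decide whether the centers of two Lines are too far apart: $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$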


import math


class LineSimilarityChecker:
    """
    This class is used to check the similarity between two lines.
    """

    # ColumnType is defined later in bbox_merger.py, so the annotation is quoted here.
    def __init__(self, column_type: "ColumnType",
                 distance_tolerance: float = 0.03,
                 width_tolerance: float = 0.01,
                 left_tolerance: float = 0.02,
                 height_tolerance: float = 0.02,
                 same_line_tolerance: float = 0.005
                 ) -> None:
        self.column_type = column_type

        self.distance_tolerance = distance_tolerance
        self.width_tolerance = width_tolerance
        self.left_tolerance = left_tolerance
        self.height_tolerance = height_tolerance
        self.same_line_tolerance = same_line_tolerance

    def is_left_similar(self, line1, line2, tolerance=None):
        tolerance = tolerance or self.left_tolerance
        return self.pretty_similar(line1.left, line2.left, tolerance)

    def is_width_similar(self, line1, line2, tolerance=None):
        tolerance = tolerance or self.width_tolerance
        return self.pretty_similar(line1.width, line2.width, tolerance)

    def is_height_similar(self, line1, line2, tolerance=None):
        tolerance = tolerance or self.height_tolerance
        return self.two_point_height(line1.top, line2.top) < tolerance

    def is_center_close(self, line1: Line, line2: Line) -> bool:
        return self.two_point_distance(
            line1.center[0],
            line1.center[1],
            line2.center[0],
            line2.center[1]) < self.distance_tolerance

    @staticmethod
    def pretty_similar(x: float, x1: float, tolerance: float):
        return abs(x - x1) < tolerance

    @staticmethod
    def two_point_distance(x: float, y: float, x1: float, y1: float):
        distance = math.sqrt((x - x1) ** 2 + (y - y1) ** 2)
        return distance

    @staticmethod
    def two_point_height(y: float, y1: float):
        return abs(y - y1)
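As a worked example of the center-distance check, the sketch below applies the same formula to three hand-made boxes. The helper names `center` and `centers_close` and the box values are assumptions for illustration; the 0.03 tolerance matches the class default above.

```python
import math


def center(left, top, width, height):
    """Center point of a box given as (left, top, width, height) ratios."""
    return (left + width / 2, top + height / 2)


def centers_close(box_a, box_b, tolerance=0.03):
    """True when the Euclidean distance between the two centers is below tolerance."""
    (xa, ya), (xb, yb) = center(*box_a), center(*box_b)
    return math.hypot(xa - xb, ya - yb) < tolerance


row_1 = (0.10, 0.100, 0.30, 0.02)          # center (0.25, 0.110)
row_2 = (0.12, 0.125, 0.26, 0.02)          # narrower, centered under row_1
other_column = (0.55, 0.125, 0.30, 0.02)   # same height on the page, far to the right

print(centers_close(row_1, row_2))         # True  (distance is about 0.025)
print(centers_close(row_1, other_column))  # False (distance is about 0.45)
```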

Step 3: Define Rule of Block

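Now we need to decide under which conditions Lines form a Block. The rules we use are:

The two Lines must be on the same page, because a Block must not span pages
Same paragraph: the left edges are similar and the Line is vertically close to the last Line already in the Block
Centered text: the center points are close and the Line is vertically close to the last Line already in the Block

The code that defines these rules is as follows: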

def is_two_line_close(block, target_line, cur_line):
    """
    Check if two lines are close enough, so we can merge them into a block
    """
    left_tolerance = 0.02
    # Alternative considered: max(left_tolerance, abs(target_line.width - cur_line.width) / 2)
    width_tolerance = 0.01
    distance_tolerance = 0.04
    height_tolerance = 0.03

    # The comparison helpers are the static methods of LineSimilarityChecker from Step 2.
    def is_left_similar(line1, line2, tolerance=left_tolerance):
        return LineSimilarityChecker.pretty_similar(line1.left, line2.left, tolerance)

    def is_width_similar(line1, line2, tolerance=width_tolerance):
        # Not used by the rules below; kept for experimenting with width-based rules.
        return LineSimilarityChecker.pretty_similar(line1.width, line2.width, tolerance)

    def is_height_similar(line1, line2, tolerance=height_tolerance):
        return LineSimilarityChecker.two_point_height(line1.top, line2.top) < tolerance

    def is_on_same_page(line1, line2):
        return line1.page == line2.page

    def is_center_close(line1, line2):
        return LineSimilarityChecker.two_point_distance(
            line1.center[0], line1.center[1],
            line2.center[0], line2.center[1]) < distance_tolerance

    def is_same_paragraph():
        """
        Handles Lines of the same paragraph.
        If the left edges match and the tops are close, they belong to the same Block.
        """
        return (is_left_similar(target_line, cur_line) and
                is_height_similar(block.lines[-1], cur_line))

    def is_text_center_context():
        """
        Handles centered text.
        If the center points are close and the tops are close, they belong to the same Block.
        """
        return (is_center_close(target_line, cur_line) and
                is_height_similar(block.lines[-1], cur_line))

    # Check the page first, because a Block must not span pages.
    if not is_on_same_page(target_line, cur_line):
        return False
    # Group Lines of the same paragraph, or centered text, into one Block.
    return is_same_paragraph() or is_text_center_context()
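One detail worth a small illustration: the vertical check compares the candidate with the last Line already in the Block, not with the first one. A minimal sketch with made-up `top` values (the 0.03 tolerance is the one used in the function above; `top_gap` is a hypothetical helper):

```python
def top_gap(a_top, b_top):
    """Vertical distance between the tops of two lines."""
    return abs(a_top - b_top)


height_tolerance = 0.03
tops = [0.10, 0.12, 0.14, 0.16]  # four consecutive rows of one paragraph

# Compared with the first row, the fourth row looks too far away...
close_to_first = top_gap(tops[0], tops[3]) < height_tolerance
# ...but compared with the row just above it, it is adjacent.
close_to_previous = top_gap(tops[2], tops[3]) < height_tolerance

print(close_to_first, close_to_previous)  # False True
```

Comparing against the first Line would cap every Block at one or two rows, so a long paragraph would be split into many small Blocks.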

Step 4: Iterate Line to Form Block

Next we can iterate over all the Lines to build the Blocks

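def merge_lines_to_block(lines):
    """
    Merge Lines into Blocks. Note that this consumes the `lines` list passed in.
    """
    ready_blocks = []  # finished Blocks are collected here

    # Keep forming Blocks as long as there are Lines left
    while lines:
        block = Block()
        target_line = lines[0]  # the first remaining Line is the reference for this Block
        block.add_line(lines[0])  # add target_line to the Block
        block.page = target_line.page  # set the Block's page
        lines.pop(0)  # remove target_line from lines
        index = 0  # start from index 0, because pop shifts the remaining indexes
        # Scan the remaining Lines until there is nothing left to compare
        while index < len(lines):
            cur_line = lines[index]
            if target_line.page == cur_line.page:
                # Merge cur_line when it satisfies the rules from Step 3
                if is_two_line_close(block, target_line, cur_line):
                    block.add_line(cur_line)
                    lines.pop(index)  # remove the merged Line from lines
                    index = 0  # the Block's last Line changed, so rescan from index 0
                    continue  # go to the next round of the loop
            index += 1  # check the next element

        ready_blocks.append(block)  # add the finished Block to the list
    return ready_blocks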
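The loop above can be tried on toy data without Textract. The sketch below is a simplified stand-in, not the article's function: `Row` and `group_rows` are hypothetical names, only the same-paragraph rule is applied, and the input is copied instead of consumed.

```python
from collections import namedtuple

Row = namedtuple("Row", "text left top")


def group_rows(rows, left_tolerance=0.02, top_tolerance=0.03):
    """Greedy grouping: a row joins a block when its left edge matches the
    block's first row and its top is close to the block's last row."""
    remaining = list(rows)  # work on a copy so the caller's list stays intact
    blocks = []
    while remaining:
        block = [remaining.pop(0)]
        index = 0
        while index < len(remaining):
            row = remaining[index]
            same_left = abs(block[0].left - row.left) < left_tolerance
            next_down = abs(block[-1].top - row.top) < top_tolerance
            if same_left and next_down:
                block.append(remaining.pop(index))
                index = 0  # rescan: rows skipped earlier may now be adjacent
                continue
            index += 1
        blocks.append(block)
    return blocks


# Textract-style order: row by row across both columns.
rows = [
    Row("L1", 0.05, 0.10), Row("R1", 0.55, 0.10),
    Row("L2", 0.05, 0.12), Row("R2", 0.55, 0.12),
    Row("L3", 0.05, 0.14), Row("R3", 0.55, 0.14),
]
print([[r.text for r in block] for block in group_rows(rows)])
# [['L1', 'L2', 'L3'], ['R1', 'R2', 'R3']]
```

The interleaved input comes out as one block per column, which is the effect the merge step relies on.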

Step 5: Execute

Finally, we can run the code


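json_path = "./result/test.json"  # JSON file returned by Textract
pdf_path = "../../src/test.pdf"

lines = get_lines_from_json(json_path)  # get the Line information from the Textract JSON
blocks = merge_lines_to_block(lines)  # merge Lines into Blocks
# find_block_corners and show_image_bbox are listed in bbox_merger.py in the next section
# (find_block_corners appears there as a static method of LineMerger).
blocks = find_block_corners(blocks)  # compute each Block's corners so that it encloses all of its Lines
show_image_bbox(pdf_path, blocks)
# print_blocks(blocks)  # optionally print the Block information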

The result is shown in the figure below:
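The `find_block_corners` call used in this step boils down to taking the smallest box that covers every Line in a Block. A minimal sketch of that idea on plain tuples (`enclosing_box` is a hypothetical name, and the values are chosen for easy arithmetic):

```python
def enclosing_box(boxes):
    """Smallest (left, top, width, height) box covering every input box."""
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[0] + b[2] for b in boxes)
    bottom = max(b[1] + b[3] for b in boxes)
    return left, top, right - left, bottom - top


# Two lines of one block, each given as (left, top, width, height).
line_boxes = [(0.25, 0.125, 0.5, 0.125), (0.25, 0.25, 0.25, 0.125)]
print(enclosing_box(line_boxes))  # (0.25, 0.125, 0.5, 0.25)
```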

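Advanced: Clean Code + Single Page

The result above does not work well for single-column pages, so we add some extra settings. To keep things manageable, I split the services into modules. The file structure is shown below.

File structure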

.
├── resource
│   ├── pdf # PDF files
│   │   ├── single-column.pdf
│   │   └── multi-column.pdf
│   └── result # JSON files produced by Textract
│       ├── single-column
│       │   └── final-result.json
│       └── multi-column
│           ├── final-result.json
│           ├── result_0.json
│           └── result_1.json
├── src # source code
│   ├── models # classes
│   │   ├── __init__.py
│   │   ├── block.py
│   │   ├── line.py
│   │   ├── page.py
│   │   └── process_type.py
│   ├── ocr # OCR-related services
│   │   ├── util
│   │   │   ├── __init__.py
│   │   │   ├── bbox_merger.py
│   │   │   └── functions.py
│   │   └── __init__.py
│   └── __init__.py
├── tests
│   ├── __init__.py
│   └── test_block_merge.ipynb
└── README.md

The reorganized code is shown below:

Class files

│   ├── models
│   │   ├── __pycache__
│   │   ├── __init__.py
│   │   ├── block.py
│   │   ├── line.py
│   │   ├── page.py
│   │   └── process_type.py

from typing import List
from src.models.line import Line


class Block:
    def __init__(self) -> None:
        self.lines: List[Line] = []
        self.page: int = 0
        self.reason: str = ""
        self.id: str = ""

        # Textract coordinates are ratios between 0 and 1, so these are floats.
        self.left: float = 0.0
        self.top: float = 0.0
        self.height: float = 0.0
        self.width: float = 0.0

    def __str__(self) -> str:
        return f"Block: page={self.page}, id={self.id}, (x1,y1)=({self.left}, {self.top}), (x2,y2)=({self.left + self.width},{self.top + self.height})"

    def add_line(self, line):
        self.lines.append(line)

class Line:
    def __init__(self, id_: str, page: int, text: str, top: float, left: float, width: float, height: float) -> None:
        # Textract coordinates are ratios between 0 and 1, so these are floats.
        self.top: float = top
        self.left: float = left
        self.width: float = width
        self.height: float = height

        self.page: int = page
        self.id_: str = id_
        self.text: str = text
        self.center: list = self.get_center()

    def __str__(self) -> str:
        return (f"Line: \t page={self.page}, "
                f"Id={self.id_}, "
                f"Text={self.text}, \n"
                f"(left={self.left}, top={self.top}); "
                f"width={self.width}, height={self.height} \n")

    def get_center(self) -> list:
        x = self.left
        y = self.top
        x1 = self.left + self.width
        y1 = self.top + self.height
        x_center = (x + x1) / 2
        y_center = (y + y1) / 2
        return [x_center, y_center]

class Page:
    def __init__(self, page_number, lines):
        self.lines = lines
        self.page = page_number

    def __str__(self):
        for line in self.lines:
            print(f"line: {line.__str__()}")
        return f"Page: {self.page}"

class ProcessType:
    LINE = "LINE"
    WORD = "WORD"

General-purpose functions

src/ocr/util/functions.py

# response
import json
from PIL import Image
from pdf2image import convert_from_bytes
from typing import List


def read_json_file(file_path):
    with open(file_path, "r") as file:
        return json.load(file)


def read_file_to_bytes(file_path: str) -> List[Image.Image]:
    with open(file_path, 'rb') as file:
        pdf_binary = file.read()
    return convert_from_bytes(pdf_binary)

bbox_merger.py is responsible for turning lines into blocks

src/ocr/util/bbox_merger.py

import math
import re
from typing import List

from pdf2image import convert_from_bytes
from matplotlib.patches import Rectangle

from src.models.block import Block
from src.models.line import Line
from src.ocr.util.functions import read_json_file
from matplotlib import pyplot as plt


class ColumnType:
    """
    Column type, based on the column type of pdf.
    """
    SINGLE = "SINGLE"
    MULTI = "MULTI"


def get_lines_from_json(file_path: str) -> List[Line]:
    """
    Get all type "LINE" from json file generated by textract.
    :param file_path: json file generated by textract.
    :return: list of Line.
    """
    lines: List[Line] = []
    json_res = read_json_file(file_path)
    for item in json_res["Blocks"]:
        if item["BlockType"] == "LINE":
            box = item["Geometry"]["BoundingBox"]
            lines.append(
                Line(
                    item["Id"],
                    item["Page"],
                    item["Text"],
                    box["Top"],
                    box["Left"],
                    box["Width"],
                    box["Height"]))
    return lines


def print_blocks(blocks: List[Block]) -> None:
    """
    print block and line information
    :param blocks: blocks to print
    """
    for block in blocks:
        print(f"{block.__str__()}")
        for line in block.lines:
            print(f"{line.__str__()}")
        print("\n")


class LineSimilarityChecker:
    """
    This class is used to check the similarity between two lines.
    """

    def __init__(self, column_type: ColumnType,
                 distance_tolerance: float = 0.03,
                 width_tolerance: float = 0.01,
                 left_tolerance: float = 0.02,
                 height_tolerance: float = 0.02,
                 same_line_tolerance: float = 0.005
                 ) -> None:
        self.column_type = column_type

        self.distance_tolerance = distance_tolerance
        self.width_tolerance = width_tolerance
        self.left_tolerance = left_tolerance
        self.height_tolerance = height_tolerance
        self.same_line_tolerance = same_line_tolerance

    def is_left_similar(self, line1, line2, tolerance=None):
        tolerance = tolerance or self.left_tolerance
        return self.pretty_similar(line1.left, line2.left, tolerance)

    def is_width_similar(self, line1, line2, tolerance=None):
        tolerance = tolerance or self.width_tolerance
        return self.pretty_similar(line1.width, line2.width, tolerance)

    def is_height_similar(self, line1, line2, tolerance=None):
        tolerance = tolerance or self.height_tolerance
        return self.two_point_height(line1.top, line2.top) < tolerance

    def is_center_close(self, line1: Line, line2: Line) -> bool:
        return self.two_point_distance(
            line1.center[0],
            line1.center[1],
            line2.center[0],
            line2.center[1]) < self.distance_tolerance

    @staticmethod
    def pretty_similar(x: float, x1: float, tolerance: float):
        return abs(x - x1) < tolerance

    @staticmethod
    def two_point_distance(x: float, y: float, x1: float, y1: float):
        distance = math.sqrt((x - x1) ** 2 + (y - y1) ** 2)
        return distance

    @staticmethod
    def two_point_height(y: float, y1: float):
        return abs(y - y1)


class LineMerger:
    """
    This class is used to turn lines to blocks by compare each line's similarity.
    """

    def __init__(self, lines, column_type: ColumnType = ColumnType.SINGLE):
        self.column_type = column_type
        self.line_check = LineSimilarityChecker(self.column_type)
        self.lines: List[Line] = lines

    def get_blocks(self) -> List[Block]:
        """
        Get all blocks after turning lines into blocks, using the column type
        (SINGLE or MULTI) that was passed to the constructor.
        :return: blocks
        """
        blocks = self.merge_lines_to_block(self.lines)
        return self.find_block_corners(blocks)

    def merge_lines_to_block(self, lines) -> List[Block]:
        blocks: List[Block] = []
        while lines:
            block = Block()
            block.add_line(lines.pop(0))
            block.page = block.lines[0].page
            target_line = block.lines[0]
            index = 0
            while index < len(lines):
                cur_line = lines[index]
                if target_line.page == cur_line.page:
                    # for single column, when encounter number point, make a
                    # new a block
                    if self.column_type == ColumnType.SINGLE and self.is_start_special_word(
                            cur_line):
                        print("---Found special word---")
                        print(cur_line.text)
                        print("---End special word: Jump to next block---")
                        break
                    # other case, all need to compare the lines are close or
                    # not.
                    else:
                        if self.is_two_line_close(block, cur_line):
                            block.add_line(cur_line)
                            lines.pop(index)
                            index = 0
                            continue
                index += 1
            blocks.append(block)
        return blocks

    def is_start_special_word(self, cur_line: Line) -> bool:
        # Strip surrounding whitespace, split on spaces, and take the first token
        cur_start = cur_line.text.strip().split(" ")[0]
        pattern = self._regex_pattern()

        return re.match(pattern, cur_start) is not None

    @staticmethod
    def _regex_pattern() -> str:
        # general word or number + "."  + any words (e.g. 1.Hello my friend)
        GENERAL_WORD_DOT_PATTERN = r'^[a-zA-Z0-9]\..*'
        # non-general one word or number + general one word or num + any word (e.g (1) This is ...)
        NON_ALPHANUMERIC_WORD_PATTERN = r'[^a-zA-Z0-9][a-zA-Z0-9][^a-zA-Z0-9].*'

        return '{}|{}'.format(
            GENERAL_WORD_DOT_PATTERN,
            NON_ALPHANUMERIC_WORD_PATTERN)

    def is_two_line_close(self, block, cur_line):
        last_line: Line = block.lines[-1]
        target_line: Line = block.lines[0]

        if self.is_on_same_page(target_line, cur_line):
            # multi column: center text & paragraph block
            if self.column_type == ColumnType.MULTI:
                if (self.is_same_paragraph(last_line, cur_line) or
                        self.is_text_center_context(last_line, cur_line)):
                    return True

            # single column: same line
            elif self.column_type == ColumnType.SINGLE:
                if (self.is_on_same_line(last_line, cur_line) or
                        self.is_same_paragraph(last_line, cur_line)):
                    return True

        return False

    @staticmethod
    def is_on_same_page(line1, line2) -> bool:
        return line1.page == line2.page

    def is_on_same_line(self, last_line, cur_line) -> bool:
        return self.line_check.is_height_similar(last_line, cur_line)

    def is_same_paragraph(
            self,
            last_line: Line,
            cur_line: Line) -> bool:
        if (self.line_check.is_left_similar(last_line, cur_line)
                and self.line_check.is_height_similar(last_line, cur_line)):
            return True

        return False

    def is_text_center_context(self, last_line: Line, cur_line: Line) -> bool:
        return (self.line_check.is_center_close(last_line, cur_line) and
                self.line_check.is_height_similar(last_line, cur_line))

    @staticmethod
    def find_block_corners(blocks: List[Block]) -> List[Block]:
        for index, block in enumerate(blocks):
            min_top = min(line.top for line in block.lines)
            min_left = min(line.left for line in block.lines)
            max_bottom = max(line.top + line.height for line in block.lines)
            max_right = max(line.left + line.width for line in block.lines)

            block.height = max_bottom - min_top
            block.width = max_right - min_left
            block.top = min_top
            block.left = min_left
            block.id = index

        return blocks


def show_image_bbox(pdf_file, blocks) -> None:
    """
    show image bounding box
    :param pdf_file: the pdf file location
    :param blocks: the list of blocks we want to draw
    """
    with open(pdf_file, 'rb') as file:
        images = convert_from_bytes(file.read())

    for index, image in enumerate(images):
        width, height = image.size
        page = index + 1
        print(f"Process Page Index: {page}")

        plt.figure(figsize=(20, 16))
        plt.imshow(image)

        # iterate over the blocks
        for i, block in enumerate(blocks):
            if block.page == page:
                rect = Rectangle(
                    (width * block.left,
                     height * block.top),
                    block.width * width,
                    block.height * height,
                    edgecolor='r',
                    facecolor='none')
                plt.text(
                    width * block.left,
                    height * block.top,
                    block.id,
                    fontsize=12,
                    color='red')
                plt.gca().add_patch(rect)
        plt.show()
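The list-marker check in `is_start_special_word` can be tried in isolation. The sketch below copies the two patterns from `_regex_pattern` and wraps them in a standalone function; `starts_with_list_marker` and the sample strings are made up for illustration.

```python
import re

# Same two patterns as in LineMerger._regex_pattern.
GENERAL_WORD_DOT_PATTERN = r'^[a-zA-Z0-9]\..*'
NON_ALPHANUMERIC_WORD_PATTERN = r'[^a-zA-Z0-9][a-zA-Z0-9][^a-zA-Z0-9].*'
PATTERN = '{}|{}'.format(GENERAL_WORD_DOT_PATTERN, NON_ALPHANUMERIC_WORD_PATTERN)


def starts_with_list_marker(text: str) -> bool:
    """True when the first whitespace-separated token looks like '1.' or '(a)'."""
    first_token = text.strip().split(" ")[0]
    return re.match(PATTERN, first_token) is not None


samples = ["1. Introduction", "(a) First item", "10. Tenth item", "Plain sentence."]
for sample in samples:
    print(f"{sample!r}: {starts_with_list_marker(sample)}")
```

With these patterns, "1." and "(a)" are detected, while a two-digit marker such as "10." is not, because each pattern allows exactly one alphanumeric character before the punctuation. Whether that matters depends on the documents being processed.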

After that, it can be called directly
tests/test_block_merge.ipynb

from src.ocr.util.bbox_merger import (
    show_image_bbox, 
    get_lines_from_json, 
    LineMerger, 
    print_blocks, 
    ColumnType
)

class Value:
    def __init__(self, is_multi_column: bool):
        if is_multi_column:
            self.topic = "multi-column"
            self.column_type = ColumnType.MULTI
        else:
            self.topic = "single-column"
            self.column_type = ColumnType.SINGLE
            
        self.json_path = '../resource/result/{}/final-result.json'.format(self.topic)
        self.pdf_path = '../resource/pdf/{}.pdf'.format(self.topic)

def main(is_multi_column: bool):
    v = Value(is_multi_column = is_multi_column)
    lines = get_lines_from_json(v.json_path)
    blocks = LineMerger(lines, v.column_type).get_blocks()
    print_blocks(blocks)
    show_image_bbox(pdf_file=v.pdf_path, blocks=blocks)

if __name__ == "__main__":
    main(is_multi_column = False)