Preface

Amazon Textract is an AWS service for extracting text from PDFs and images. The easiest input is a single-column document, such as a book page. Multi-column layouts, such as newspaper articles, are harder, because the extracted lines have to be put back into reading order. This article shows how to use the Amazon Textract output to sort text from multi-column documents. It is inspired by "AWS Textract: how to detect and sort text from a multi-column document", with some improvements.

My source material is a newspaper article, with the layout as shown below:

Textract Response Format

The Textract output is structured JSON made up of blocks with different BlockTypes. A PAGE block contains multiple LINE blocks, and each LINE block contains multiple WORD blocks. Nothing in the response tells you directly how to rearrange multi-column text into a single column. What is known, however, is that Textract parses text from top to bottom and left to right. You can observe this order in sections 27 to 48 of the image below: even though they belong to different columns, Textract numbers them sequentially from left to right and top to bottom.
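As a rough illustration of that structure, the sketch below walks a Textract-style response and lists each LINE together with the WORD blocks it points to through its CHILD relationships. The `response` dictionary is a hand-made stand-in that only mimics the shape of real output, and `words_per_line` is a hypothetical helper name.

```python
def words_per_line(response):
    """Return [(line_text, [word_text, ...]), ...] in the order the LINE blocks appear."""
    by_id = {block["Id"]: block for block in response["Blocks"]}
    result = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        # Collect the ids referenced by CHILD relationships of this LINE.
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        words = [by_id[i]["Text"] for i in child_ids if by_id[i]["BlockType"] == "WORD"]
        result.append((block["Text"], words))
    return result


# Hand-made sample; not real Textract output.
response = {
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "l1", "BlockType": "LINE", "Text": "Hello world",
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Hello"},
        {"Id": "w2", "BlockType": "WORD", "Text": "world"},
    ]
}
print(words_per_line(response))
```

The rest of this article only needs the LINE blocks and their bounding boxes, so the WORD level is ignored from here on.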

Solution

The approach we adopt is to utilize the bounding box coordinates provided by Textract to draw boundaries around text blocks. By grouping nearby lines into a block, we can identify lines belonging to the same column. The final result will resemble the image below:

From the above image, the following information can be inferred:

  • The longest block is likely the title.
  • Other blocks can be sorted from top to bottom and left to right, allowing us to determine the reading order and understand the distribution of columns.
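One way to turn these two observations into code is sketched below. It assumes each block is a dictionary with `left`, `top`, `width`, and `height` ratios, treats the widest block as the title, and groups the remaining blocks into columns by comparing their left edges. The function name `reading_order` and the `column_tolerance` value are illustrative choices, not part of the original approach.

```python
def reading_order(blocks, column_tolerance=0.05):
    """Order blocks as: widest block (assumed title), then column by column, top to bottom."""
    if not blocks:
        return []
    title = max(blocks, key=lambda b: b["width"])
    rest = [b for b in blocks if b is not title]

    # Cluster by left edge: a block joins the current column when its `left`
    # is within the tolerance of that column's first block.
    rest.sort(key=lambda b: b["left"])
    columns = []
    for block in rest:
        if columns and abs(block["left"] - columns[-1][0]["left"]) < column_tolerance:
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered = [title]
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b["top"]))
    return ordered


# Invented coordinates for a title above two columns.
blocks = [
    {"id": "right-bottom", "left": 0.52, "top": 0.60, "width": 0.40, "height": 0.20},
    {"id": "left-top", "left": 0.05, "top": 0.20, "width": 0.40, "height": 0.30},
    {"id": "title", "left": 0.05, "top": 0.05, "width": 0.90, "height": 0.08},
    {"id": "right-top", "left": 0.51, "top": 0.20, "width": 0.40, "height": 0.30},
    {"id": "left-bottom", "left": 0.06, "top": 0.55, "width": 0.40, "height": 0.30},
]
print([b["id"] for b in reading_order(blocks)])
```

Layouts with several headlines or with blocks that span two of three columns would need a more careful rule than "widest block first".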

Step 1: Define the Class

The official documentation describes how the location of an item on the document page is represented:

```json
"BoundingBox": {
    "Width": 0.007353090215474367,
    "Height": 0.0288887619972229,
    "Left": 0.08638829737901688,
    "Top": 0.03477252274751663
}
```

From the official documentation
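The values are ratios of the page width and height, so they have to be scaled before they can be drawn on an image. The sketch below converts the sample box above to pixel coordinates and computes its center; `to_pixels`, `center`, and the 1000 x 2000 page size are made up for illustration.

```python
def to_pixels(box, page_width, page_height):
    """Scale a ratio-based BoundingBox to pixel (x1, y1, x2, y2) corners."""
    x1 = box["Left"] * page_width
    y1 = box["Top"] * page_height
    x2 = (box["Left"] + box["Width"]) * page_width
    y2 = (box["Top"] + box["Height"]) * page_height
    return x1, y1, x2, y2


def center(box):
    """Center of the box, still expressed as ratios of the page size."""
    return box["Left"] + box["Width"] / 2, box["Top"] + box["Height"] / 2


box = {
    "Width": 0.007353090215474367,
    "Height": 0.0288887619972229,
    "Left": 0.08638829737901688,
    "Top": 0.03477252274751663,
}
print(to_pixels(box, page_width=1000, page_height=2000))
print(center(box))
```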

After understanding the format, we can define a class to handle this data:


```python
class Page:
    """
    Holds the content returned by Textract. Each Page may have one or more Lines.

    Args:
        page_number (int): The number of this Page.
        lines (list[Line]): The Lines contained in this Page.
    """
    def __init__(self, page_number, lines):
        self.lines = lines
        self.page = page_number

    def __str__(self):
        # Build the text instead of printing inside __str__.
        body = "".join(f"line: {line}\n" for line in self.lines)
        return f"{body}Page: {self.page}"


class Block:
    """
    A group of nearby Lines. Each Block may have one or more Lines.

    Attributes:
        page (int): The Page where this Block is located.
        id (int): The index of this Block.
        lines (list[Line]): The Lines contained in this Block.
        left (float): The x-coordinate of the top-left corner of the Block.
        top (float): The y-coordinate of the top-left corner of the Block.
        height (float): The height of the Block.
        width (float): The width of the Block.
    """
    def __init__(self):
        self.lines = []
        self.page = 0
        self.id = 0
        self.left = 0
        self.top = 0
        self.height = 0
        self.width = 0

    def __str__(self):
        return (f"Block: page={self.page}, id={self.id}, "
                f"(x1,y1)=({self.left}, {self.top}), "
                f"(x2,y2)=({self.left + self.width},{self.top + self.height})")

    def add_line(self, line):
        """
        Add a line to the block. The bounding box is computed later,
        once all lines have been collected.
        """
        self.lines.append(line)


class Line:
    """
    Handles the smallest unit of text.

    Args:
        Id (str): The Line Id returned by Textract, for easy lookup of the
            corresponding text in the original response.
        page (int): The Page where this Line is located.
        text (str): The text of the Line.
        top (float): The y-coordinate of the top-left corner of the Line.
        left (float): The x-coordinate of the top-left corner of the Line.
        width (float): The width of the Line.
        height (float): The height of the Line.
    """
    def __init__(self, Id, page, text, top, left, width, height):
        self.top = top
        self.left = left
        self.width = width
        self.height = height
        self.page = page
        self.Id = Id
        self.text = text
        self.center = self.get_center()

    def __str__(self):
        return (f"Line: \t page={self.page}, Id={self.Id}, Text={self.text}, \n"
                f"\t (x1,y1)=({self.left}, top={self.top}); "
                f"width={self.width}, height={self.height} \n")

    def get_center(self):
        """
        Get the center of the Line as [x_center, y_center].
        """
        x_center = self.left + self.width / 2
        y_center = self.top + self.height / 2
        return [x_center, y_center]
```

Step 2: General Function

Next, we need some general functions to handle the data. Here, we will need the following functions:

  • read_json_file: Read the JSON file returned by Textract.
  • two_point_distance: Calculate the distance between two points. This will be used to calculate whether the center distance between Lines is too far.
    • $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
  • two_point_height: Calculate the vertical distance between two Lines to determine if they are close enough.
  • pretty_similar: Determine if the difference is within an acceptable range.
  • print_blocks: Print information about Blocks.
  • get_lines_from_json: Get information about Lines from the JSON returned by Textract.
  • find_block_corners: Find the four corner coordinates of a Block to enclose all Lines.
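To see why the center distance is a useful signal, the sketch below compares two vertically adjacent lines in one column with a line in a neighbouring column. The coordinates are invented, `center_distance` is a name chosen for this example, and the 0.04 threshold is the `distance_tolerance` used later in Step 3.

```python
import math


def center_distance(c1, c2):
    """Euclidean distance between two (x, y) centers, in page-ratio units."""
    return math.hypot(c2[0] - c1[0], c2[1] - c1[1])


line_a = (0.25, 0.300)  # center of a line in the left column
line_b = (0.25, 0.325)  # the next line down in the same column
line_c = (0.75, 0.300)  # a line at the same height in the right column

distance_tolerance = 0.04
print(center_distance(line_a, line_b) < distance_tolerance)  # neighbours in one column
print(center_distance(line_a, line_c) < distance_tolerance)  # different columns
```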
```python
import json
import math

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from pdf2image import convert_from_bytes


def read_json_file(file_name):
    """Load the JSON file returned by Textract."""
    with open(file_name, "r") as file:
        return json.load(file)


def two_point_distance(x, y, x1, y1):
    """Euclidean distance between (x, y) and (x1, y1)."""
    return math.sqrt((x - x1) ** 2 + (y - y1) ** 2)


def two_point_height(y, y1):
    """Vertical distance between two y-coordinates."""
    return abs(y - y1)


# The original (misspelled) name is kept as an alias so existing callers still work.
two_point_hight = two_point_height


def pretty_similar(x, x1, tolerance):
    """True if x and x1 differ by less than tolerance."""
    return abs(x - x1) < tolerance


def print_blocks(blocks):
    for block in blocks:
        print(block)
        last_line = block.lines[0]
        for line in block.lines:
            distance = two_point_distance(
                last_line.center[0], last_line.center[1],
                line.center[0], line.center[1])
            print(f"Line: \t Page={line.page}, Id={line.Id}, left={line.left}, "
                  f"top={line.top}, width={line.width}, height={line.height}, "
                  f"center={line.center} two_point_distance = {distance} "
                  f"(x1,y1)=({line.left}, {line.top}), "
                  f"(x2,y2)=({line.left + line.width},{line.top + line.height})")
            last_line = line
        print("\n")


def get_lines_from_json(file_path):
    """Collect every LINE block of the Textract response as a Line object."""
    json_response = read_json_file(file_path)
    lines = []
    for item in json_response["Blocks"]:
        if item["BlockType"] == "LINE":
            box = item["Geometry"]["BoundingBox"]
            lines.append(Line(item["Id"], item["Page"], item["Text"],
                              box["Top"], box["Left"], box["Width"], box["Height"]))
    return lines


def find_block_corners(blocks):
    """Set each Block's bounding box so that it encloses all of its Lines."""
    for index, block in enumerate(blocks):
        min_top = min(line.top for line in block.lines)
        min_left = min(line.left for line in block.lines)
        max_bottom = max(line.top + line.height for line in block.lines)
        max_right = max(line.left + line.width for line in block.lines)

        block.height = max_bottom - min_top
        block.width = max_right - min_left
        block.top = min_top
        block.left = min_left
        block.id = index

    return blocks


def show_image_bbox(pdf_file, blocks):
    """
    Show each page image with the bounding boxes of its Blocks.
    """
    with open(pdf_file, 'rb') as file:
        images = convert_from_bytes(file.read())

    for index, image in enumerate(images):
        width, height = image.size
        page = index + 1
        print(f"Process Page Index: {page}")

        plt.figure(figsize=(20, 16))
        plt.imshow(image)

        # Draw only the blocks that belong to this page.
        for block in blocks:
            if block.page == page:
                # Block coordinates are ratios, so scale them to pixels.
                rect = Rectangle((width * block.left, height * block.top),
                                 block.width * width, block.height * height,
                                 edgecolor='r', facecolor='none')
                plt.text(width * block.left, height * block.top, str(block.id),
                         fontsize=12, color='red')
                plt.gca().add_patch(rect)
        plt.show()
```
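The idea behind find_block_corners is simply the union of several rectangles. A minimal, self-contained sketch with plain `(left, top, width, height)` tuples (a layout chosen here for brevity, not taken from the code above) looks like this:

```python
def union_box(boxes):
    """Smallest (left, top, width, height) rectangle that encloses all boxes."""
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[0] + b[2] for b in boxes)
    bottom = max(b[1] + b[3] for b in boxes)
    return left, top, right - left, bottom - top


# Three invented lines of one paragraph with ragged right edges.
line_boxes = [
    (0.10, 0.20, 0.30, 0.02),
    (0.10, 0.23, 0.35, 0.02),
    (0.12, 0.26, 0.20, 0.02),
]
print(union_box(line_boxes))
```

The width comes from the longest line and the height spans from the top of the first line to the bottom of the last one.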

Step 3: Define Rule of Block

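Now we need to decide under which conditions lines are merged into a Block. The rules we designed are as follows:

  • Same paragraph: the two lines start at a similar left position, and the current line is vertically close to the last line already in the Block.
  • Centered text: the centers of the two lines are close, and the current line is vertically close to the last line already in the Block.
  • Both lines must be on the same page, because a Block cannot span multiple pages.

The code that defines the rules is as follows: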

```python
def is_two_line_close(block, target_line, cur_line):
    """
    Check if two lines are close enough, so we can merge them into a block
    """
    left_tolerance = 0.02
    # Idea not used yet: max(left_tolerance, abs(target_line.width - cur_line.width) / 2)
    width_tolerance = 0.01
    distance_tolerance = 0.04
    height_tolerance = 0.03

    def is_left_similar(line1, line2, tolerance=left_tolerance):
        return pretty_similar(line1.left, line2.left, tolerance)

    def is_width_similar(line1, line2, tolerance=width_tolerance):
        # Currently unused; kept for experimenting with width-based rules.
        return pretty_similar(line1.width, line2.width, tolerance)

    def is_height_similar(line1, line2, tolerance=height_tolerance):
        # Despite the name, this compares the top coordinates,
        # i.e. how far apart the two lines are vertically.
        return pretty_similar(line1.top, line2.top, tolerance)

    def is_on_same_page(line1, line2):
        return line1.page == line2.page

    def is_center_close(line1, line2):
        return two_point_distance(line1.center[0], line1.center[1],
                                  line2.center[0], line2.center[1]) < distance_tolerance

    def is_same_paragraph():
        """
        Use to handle the same paragraph
        If the starting point on the left is the same and the height is similar, they are in the same Block
        """
        return (is_left_similar(target_line, cur_line) and
                is_height_similar(block.lines[-1], cur_line))

    def is_text_center_context():
        """
        Use to handle the text in the center
        If the center is close and the height is similar, they are in the same Block
        """
        return (is_center_close(target_line, cur_line) and
                is_height_similar(block.lines[-1], cur_line))

    # First check if they are on the same page, as a Block cannot span multiple pages.
    # Then accept the line if it is in the same paragraph or is centered text.
    return is_on_same_page(target_line, cur_line) and (
        is_same_paragraph() or is_text_center_context())
```
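The two rules can be tried in isolation on toy data. The sketch below re-implements them for simple `(left, top, width, height)` tuples with the same tolerances as above. `Box`, `same_paragraph`, and `centered_text` are names invented for this example, and for simplicity the last line of the block is taken to be the same as the target line.

```python
import math
from collections import namedtuple

Box = namedtuple("Box", "left top width height")

LEFT_TOL, DIST_TOL, HEIGHT_TOL = 0.02, 0.04, 0.03


def center(b):
    return b.left + b.width / 2, b.top + b.height / 2


def vertically_close(a, b):
    return abs(a.top - b.top) < HEIGHT_TOL


def same_paragraph(a, b):
    """Similar left edge and vertically close."""
    return abs(a.left - b.left) < LEFT_TOL and vertically_close(a, b)


def centered_text(a, b):
    """Centers close together and vertically close."""
    (ax, ay), (bx, by) = center(a), center(b)
    return math.hypot(ax - bx, ay - by) < DIST_TOL and vertically_close(a, b)


first = Box(0.10, 0.200, 0.35, 0.02)
next_line = Box(0.10, 0.225, 0.30, 0.02)     # same column, one line lower
other_column = Box(0.55, 0.200, 0.35, 0.02)  # same height, next column
centered = Box(0.15, 0.225, 0.25, 0.02)      # narrower line centered under `first`

print(same_paragraph(first, next_line))
print(same_paragraph(first, other_column))
print(same_paragraph(first, centered), centered_text(first, centered))
```

The last case shows why the centered-text rule exists: the narrower line fails the left-edge test but is still picked up through its center.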

Step 4: Iterate Line to Form Block

Then we can start iterating through all the Lines to find the Block.

```python
def merge_lines_to_block(lines):
    """
    Merge Lines into Blocks.

    Note: this consumes `lines`; the list is empty when the function returns.
    """
    ready_blocks = []

    # As long as there are still lines, continue to form Blocks
    while lines:
        block = Block()
        target_line = lines.pop(0)     # The first remaining Line seeds the new Block
        block.add_line(target_line)    # Add the target_line to the Block
        block.page = target_line.page  # Set the Block's Page
        index = 0
        # Scan the remaining Lines until none of them can be added to the Block
        while index < len(lines):
            cur_line = lines[index]
            if target_line.page == cur_line.page:
                # Merge the Line if it satisfies one of the rules from Step 3
                if is_two_line_close(block, target_line, cur_line):
                    block.add_line(cur_line)
                    lines.pop(index)
                    # The Block's last Line changed, so earlier Lines that
                    # were skipped may match now: rescan from the start
                    index = 0
                    continue
            index += 1  # Check the next element

        ready_blocks.append(block)  # Add the organized Block to the list
    return ready_blocks
```
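The control flow is easier to follow on a stripped-down version. The sketch below applies the same greedy grouping to `(name, left, top)` tuples, using a simplified same-paragraph rule only; `group_lines` and `is_close` are hypothetical names, and the coordinates are invented.

```python
def group_lines(lines, is_close):
    """Greedy grouping: seed a block with the first remaining line, then keep
    pulling in any remaining line that is close to the block."""
    remaining = list(lines)  # work on a copy so the caller's list is untouched
    blocks = []
    while remaining:
        block = [remaining.pop(0)]
        index = 0
        while index < len(remaining):
            if is_close(block, remaining[index]):
                block.append(remaining.pop(index))
                index = 0  # the block grew, so rescan from the start
            else:
                index += 1
        blocks.append(block)
    return blocks


def is_close(block, line, left_tol=0.02, height_tol=0.03):
    """Simplified rule: left edge like the first line, top close to the last line."""
    first, last = block[0], block[-1]
    return abs(first[1] - line[1]) < left_tol and abs(last[2] - line[2]) < height_tol


# Lines in the interleaved order described earlier for a two-column page.
lines = [
    ("L1", 0.10, 0.200), ("R1", 0.55, 0.200),
    ("L2", 0.10, 0.225), ("R2", 0.55, 0.225),
    ("L3", 0.10, 0.250), ("R3", 0.55, 0.250),
]
grouped = group_lines(lines, is_close)
print([[name for name, _, _ in block] for block in grouped])
```

Because the scan restarts after every merge, the worst case is quadratic or worse in the number of lines, which is acceptable for a newspaper page but worth keeping in mind for long documents.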

Step 5: Execution

Finally, we can execute the code:

```python
json_path = "./result/test.json"  # Path to the JSON file returned by Textract
pdf_path = "../../src/test.pdf"

lines = get_lines_from_json(json_path)  # Get information about Lines from the JSON returned by Textract
blocks = merge_lines_to_block(lines)  # Merge Lines into Blocks
blocks = find_block_corners(blocks)  # Find the four corner coordinates of a Block to enclose all Lines
show_image_bbox(pdf_path, blocks)  # Display the image with bounding boxes around Blocks
# print_blocks(blocks)  # Print information about Blocks
```

The result is shown in the image below:

Advanced: Clean Code + Single Column

This section is a bit more advanced: we clean up the code and extend it so that it also handles single-column documents. To manage the code better, I have split it into modules. Below is the file structure:

File Structure
```
.
├── resource
│   ├── pdf # put pdf file
│   │   ├── single-column.pdf
│   │   └── multi-column.pdf
│   └── result # put textract json file
│       ├── single-column
│       │   └── final-result.json
│       └── multi-column
│           ├── final-result.json
│           ├── result_0.json
│           └── result_1.json
├── src # main source code
│   ├── models # Class
│   │   ├── __init__.py
│   │   ├── block.py
│   │   ├── line.py
│   │   ├── page.py
│   │   └── process_type.py
│   ├── ocr # OCR related service
│   │   ├── util
│   │   │   ├── __init__.py
│   │   │   ├── bbox_merger.py
│   │   │   └── functions.py
│   │   └── __init__.py
│   └── __init__.py
├── tests
│   ├── __init__.py
│   └── test_block_merge.ipynb
└── README.md
```
Entity Class file
│   ├── models
│   │   ├── __pycache__
│   │   ├── __init__.py
│   │   ├── block.py
│   │   ├── line.py
│   │   ├── page.py
│   │   └── process_type.py
```python
# src/models/block.py
from typing import List
from src.models.line import Line


class Block:
    def __init__(self) -> None:
        self.lines: List[Line] = []
        self.page: int = 0
        self.reason: str = ""
        self.id: int = 0  # set by LineMerger.find_block_corners

        # Coordinates are ratios of the page size, so they are floats.
        self.left: float = 0
        self.top: float = 0
        self.height: float = 0
        self.width: float = 0

    def __str__(self) -> str:
        return (f"Block: page={self.page}, id={self.id}, "
                f"(x1,y1)=({self.left}, {self.top}), "
                f"(x2,y2)=({self.left + self.width},{self.top + self.height})")

    def add_line(self, line: Line) -> None:
        self.lines.append(line)


# src/models/line.py
class Line:
    def __init__(self, id_: str, page: int, text: str, top: float,
                 left: float, width: float, height: float) -> None:
        self.top: float = top
        self.left: float = left
        self.width: float = width
        self.height: float = height

        self.page: int = page
        self.id_: str = id_
        self.text: str = text
        self.center: list = self.get_center()

    def __str__(self) -> str:
        return (f"Line: \t page={self.page}, "
                f"Id={self.id_}, "
                f"Text={self.text}, \n"
                f"(left={self.left}, top={self.top}); "
                f"width={self.width}, height={self.height} \n")

    def get_center(self) -> list:
        x_center = self.left + self.width / 2
        y_center = self.top + self.height / 2
        return [x_center, y_center]


# src/models/page.py
class Page:
    def __init__(self, page_number, lines):
        self.lines = lines
        self.page = page_number

    def __str__(self):
        # Build the text instead of printing inside __str__.
        body = "".join(f"line: {line}\n" for line in self.lines)
        return f"{body}Page: {self.page}"


# src/models/process_type.py
class ProcessType:
    LINE = "LINE"
    WORD = "WORD"
```
General Function

src/ocr/util/functions.py

```python
import json
from typing import List

from PIL import Image
from pdf2image import convert_from_bytes


def read_json_file(file_path):
    """Load the JSON file returned by Textract."""
    with open(file_path, "r") as file:
        return json.load(file)


def read_file_to_bytes(file_path: str) -> List[Image.Image]:
    """Read a PDF file and return one PIL image per page."""
    with open(file_path, 'rb') as file:
        pdf_binary = file.read()
    return convert_from_bytes(pdf_binary)
```
bbox_merger.py: the classes that turn lines into blocks.

src/ocr/util/bbox_merger.py

```python
import math
import re
from typing import List

from matplotlib import pyplot as plt
from matplotlib.patches import Rectangle
from pdf2image import convert_from_bytes

from src.models.block import Block
from src.models.line import Line
from src.ocr.util.functions import read_json_file


class ColumnType:
    """
    Column type, based on the column layout of the PDF.
    """
    SINGLE = "SINGLE"
    MULTI = "MULTI"


def get_lines_from_json(file_path: str) -> List[Line]:
    """
    Get all blocks of type "LINE" from the JSON file generated by Textract.
    :param file_path: json file generated by textract.
    :return: list of Line.
    """
    lines: List[Line] = []
    json_res = read_json_file(file_path)
    for item in json_res["Blocks"]:
        if item["BlockType"] == "LINE":
            box = item["Geometry"]["BoundingBox"]
            lines.append(
                Line(
                    item["Id"],
                    item["Page"],
                    item["Text"],
                    box["Top"],
                    box["Left"],
                    box["Width"],
                    box["Height"]))
    return lines


def print_blocks(blocks: List[Block]) -> None:
    """
    print block and line information
    :param blocks: blocks to print
    """
    for block in blocks:
        print(block)
        for line in block.lines:
            print(line)
        print("\n")


class LineSimilarityChecker:
    """
    This class is used to check the similarity between two lines.
    """

    def __init__(self, column_type: str,
                 distance_tolerance: float = 0.03,
                 width_tolerance: float = 0.01,
                 left_tolerance: float = 0.02,
                 height_tolerance: float = 0.02,
                 same_line_tolerance: float = 0.005
                 ) -> None:
        self.column_type = column_type

        self.distance_tolerance = distance_tolerance
        self.width_tolerance = width_tolerance
        self.left_tolerance = left_tolerance
        self.height_tolerance = height_tolerance
        self.same_line_tolerance = same_line_tolerance

    # `is None` is used instead of `or` so that an explicit tolerance of 0
    # is not silently replaced by the default.
    def is_left_similar(self, line1, line2, tolerance=None):
        tolerance = self.left_tolerance if tolerance is None else tolerance
        return self.pretty_similar(line1.left, line2.left, tolerance)

    def is_width_similar(self, line1, line2, tolerance=None):
        tolerance = self.width_tolerance if tolerance is None else tolerance
        return self.pretty_similar(line1.width, line2.width, tolerance)

    def is_height_similar(self, line1, line2, tolerance=None):
        # Compares the top coordinates, i.e. the vertical distance of the lines.
        tolerance = self.height_tolerance if tolerance is None else tolerance
        return self.two_point_height(line1.top, line2.top) < tolerance

    def is_center_close(self, line1: Line, line2: Line) -> bool:
        return self.two_point_distance(
            line1.center[0],
            line1.center[1],
            line2.center[0],
            line2.center[1]) < self.distance_tolerance

    @staticmethod
    def pretty_similar(x: float, x1: float, tolerance: float) -> bool:
        return abs(x - x1) < tolerance

    @staticmethod
    def two_point_distance(x: float, y: float, x1: float, y1: float) -> float:
        return math.sqrt((x - x1) ** 2 + (y - y1) ** 2)

    @staticmethod
    def two_point_height(y: float, y1: float) -> float:
        return abs(y - y1)


class LineMerger:
    """
    This class turns lines into blocks by comparing the similarity of the lines.
    """

    def __init__(self, lines, column_type: str = ColumnType.SINGLE):
        self.column_type = column_type
        self.line_check = LineSimilarityChecker(self.column_type)
        self.lines: List[Line] = lines

    def get_blocks(self) -> List[Block]:
        """
        Get all blocks after turning lines into blocks.
        Note: this consumes self.lines.
        :return: blocks
        """
        blocks = self.merge_lines_to_block(self.lines)
        return self.find_block_corners(blocks)

    def merge_lines_to_block(self, lines) -> List[Block]:
        blocks: List[Block] = []
        while lines:
            block = Block()
            block.add_line(lines.pop(0))
            block.page = block.lines[0].page
            target_line = block.lines[0]
            index = 0
            while index < len(lines):
                cur_line = lines[index]
                if target_line.page == cur_line.page:
                    # For single column: when a numbered or bulleted item
                    # starts, close the current block.
                    if (self.column_type == ColumnType.SINGLE
                            and self.is_start_special_word(cur_line)):
                        print("---Found special word---")
                        print(cur_line.text)
                        print("---End special word: Jump to next block---")
                        break
                    # In every other case, check whether the lines are close.
                    if self.is_two_line_close(block, cur_line):
                        block.add_line(cur_line)
                        lines.pop(index)
                        index = 0
                        continue
                index += 1
            blocks.append(block)
        return blocks

    def is_start_special_word(self, cur_line: Line) -> bool:
        # Strip leading and trailing whitespace, split on spaces,
        # and take the first token.
        cur_start = cur_line.text.strip().split(" ")[0]
        return re.match(self._regex_pattern(), cur_start) is not None

    @staticmethod
    def _regex_pattern() -> str:
        # One letter or digit + "." + anything (e.g. "1.Hello my friend")
        GENERAL_WORD_DOT_PATTERN = r'^[a-zA-Z0-9]\..*'
        # One non-alphanumeric character + one letter or digit + one
        # non-alphanumeric character + anything (e.g. "(1) This is ...")
        NON_ALPHANUMERIC_WORD_PATTERN = r'[^a-zA-Z0-9][a-zA-Z0-9][^a-zA-Z0-9].*'

        return '{}|{}'.format(
            GENERAL_WORD_DOT_PATTERN,
            NON_ALPHANUMERIC_WORD_PATTERN)

    def is_two_line_close(self, block, cur_line) -> bool:
        last_line: Line = block.lines[-1]
        target_line: Line = block.lines[0]

        if self.is_on_same_page(target_line, cur_line):
            # multi column: center text & paragraph block
            if self.column_type == ColumnType.MULTI:
                if (self.is_same_paragraph(last_line, cur_line) or
                        self.is_text_center_context(last_line, cur_line)):
                    return True

            # single column: same line
            elif self.column_type == ColumnType.SINGLE:
                if (self.is_on_same_line(last_line, cur_line) or
                        self.is_same_paragraph(last_line, cur_line)):
                    return True

        return False

    @staticmethod
    def is_on_same_page(line1, line2) -> bool:
        return line1.page == line2.page

    def is_on_same_line(self, last_line, cur_line) -> bool:
        # Note: this uses height_tolerance; same_line_tolerance is defined
        # on the checker but is not used here.
        return self.line_check.is_height_similar(last_line, cur_line)

    def is_same_paragraph(self, last_line: Line, cur_line: Line) -> bool:
        return (self.line_check.is_left_similar(last_line, cur_line)
                and self.line_check.is_height_similar(last_line, cur_line))

    def is_text_center_context(self, last_line: Line, cur_line: Line) -> bool:
        return (self.line_check.is_center_close(last_line, cur_line) and
                self.line_check.is_height_similar(last_line, cur_line))

    @staticmethod
    def find_block_corners(blocks: List[Block]) -> List[Block]:
        for index, block in enumerate(blocks):
            min_top = min(line.top for line in block.lines)
            min_left = min(line.left for line in block.lines)
            max_bottom = max(line.top + line.height for line in block.lines)
            max_right = max(line.left + line.width for line in block.lines)

            block.height = max_bottom - min_top
            block.width = max_right - min_left
            block.top = min_top
            block.left = min_left
            block.id = index

        return blocks


def show_image_bbox(pdf_file, blocks) -> None:
    """
    show image bounding box
    :param pdf_file: the pdf file location
    :param blocks: the list of blocks we want to draw
    """
    with open(pdf_file, 'rb') as file:
        images = convert_from_bytes(file.read())

    for index, image in enumerate(images):
        width, height = image.size
        page = index + 1
        print(f"Process Page Index: {page}")

        plt.figure(figsize=(20, 16))
        plt.imshow(image)

        # Draw only the blocks that belong to this page.
        for block in blocks:
            if block.page == page:
                # Block coordinates are ratios, so scale them to pixels.
                rect = Rectangle(
                    (width * block.left,
                     height * block.top),
                    block.width * width,
                    block.height * height,
                    edgecolor='r',
                    facecolor='none')
                plt.text(
                    width * block.left,
                    height * block.top,
                    str(block.id),
                    fontsize=12,
                    color='red')
                plt.gca().add_patch(rect)
        plt.show()
```
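It helps to check which line beginnings the special-word pattern actually catches. The sketch below copies the two regular expressions from `_regex_pattern` and applies them to the first token of a few invented sample lines; `starts_with_marker` is a hypothetical stand-alone version of `is_start_special_word`.

```python
import re

GENERAL_WORD_DOT_PATTERN = r'^[a-zA-Z0-9]\..*'
NON_ALPHANUMERIC_WORD_PATTERN = r'[^a-zA-Z0-9][a-zA-Z0-9][^a-zA-Z0-9].*'
PATTERN = re.compile('{}|{}'.format(GENERAL_WORD_DOT_PATTERN,
                                    NON_ALPHANUMERIC_WORD_PATTERN))


def starts_with_marker(text):
    """True if the first space-separated token looks like '1.' or '(a)'."""
    first_token = text.strip().split(" ")[0]
    return PATTERN.match(first_token) is not None


samples = ["1. Introduction", "(a) First case", "Hello world",
           "10. Tenth item", "U.S. markets"]
for sample in samples:
    print(f"{sample!r}: {starts_with_marker(sample)}")
```

Two limits show up in the samples: markers with more than one character before the dot, such as "10.", are not matched, while abbreviations such as "U.S." are. Depending on the documents, the pattern may need to be tightened or extended.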

Then we can call the method directly.
tests/test_block_merge.ipynb

```python
from src.ocr.util.bbox_merger import (
    show_image_bbox,
    get_lines_from_json,
    LineMerger,
    print_blocks,
    ColumnType
)


class Value:
    def __init__(self, is_multi_column: bool):
        if is_multi_column:
            self.topic = "multi-column"
            self.column_type = ColumnType.MULTI
        else:
            self.topic = "single-column"
            self.column_type = ColumnType.SINGLE

        self.json_path = '../resource/result/{}/final-result.json'.format(self.topic)
        self.pdf_path = '../resource/pdf/{}.pdf'.format(self.topic)


def main(is_multi_column: bool):
    v = Value(is_multi_column=is_multi_column)
    lines = get_lines_from_json(v.json_path)
    blocks = LineMerger(lines, v.column_type).get_blocks()
    print_blocks(blocks)
    show_image_bbox(pdf_file=v.pdf_path, blocks=blocks)


if __name__ == "__main__":
    main(is_multi_column=False)
```