Shannon's Blog 🐟 Tech | Life | Travel

How Amazon Textract Handles Text Ordering in Multi-Column Documents

Published 2024-03-20 | Updated 2024-09-10 | Tools > OCR | aws•textract

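Introduction: AWS Textract is an AWS tool for extracting text from PDFs (or images). The best case is a source document with a single column, such as a book. With multiple columns (a newspaper article, for example), things get more complicated. This post shares how to use Amazon Textract to handle text ordering in multi-column documents. It builds on the article AWS Textract: how to detect and sort text from a multi-column document, with some improvements. My source is a newspaper article. Textract Response format: Textract output is a JSON made up of BlockTypes arranged hierarchically. A "Page" block consists of multiple "Line" blocks, and each "Line" consists of multiple "Word" blocks. The response carries no structural information, so you cannot directly reorder multi-column text into a single column. What we do know is that Textract parses text top to bottom, one row at a time. Items 27~48 in the post's figure show this: even though they sit in different columns, Textract parses them left to right and then moves down. Solution: What we ...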
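
The excerpt cuts off before the author's solution, so the following is only a minimal sketch of one common approach, not the post's actual method. It assumes the number of columns is known and that the columns have equal width. It assigns each LINE block to a column by the horizontal center of its bounding box, then reads each column top to bottom. The field names (`BlockType`, `Text`, `Geometry.BoundingBox` with `Left`, `Top`, `Width`) follow the Textract response format; `sort_lines_by_column`, `_line`, and the sample data are hypothetical.

```python
def sort_lines_by_column(blocks, num_columns=2):
    """Return LINE texts ordered column by column, top to bottom.

    Assumption: columns are equally wide and num_columns is known in advance.
    """
    lines = [b for b in blocks if b.get("BlockType") == "LINE"]

    def sort_key(block):
        box = block["Geometry"]["BoundingBox"]
        # Coordinates are ratios of the page size (0.0 to 1.0).
        center_x = box["Left"] + box["Width"] / 2
        column = min(int(center_x * num_columns), num_columns - 1)
        return (column, box["Top"])

    return [b["Text"] for b in sorted(lines, key=sort_key)]


def _line(text, left, top):
    """Build a fake LINE block for the demo (hypothetical helper)."""
    return {
        "BlockType": "LINE",
        "Text": text,
        "Geometry": {
            "BoundingBox": {"Left": left, "Top": top, "Width": 0.4, "Height": 0.02}
        },
    }


# Textract-style reading order: row by row across both columns.
sample = [
    _line("A1", 0.05, 0.10),
    _line("B1", 0.55, 0.10),
    _line("A2", 0.05, 0.15),
    _line("B2", 0.55, 0.15),
]

print(sort_lines_by_column(sample))
```

Documents whose columns have uneven widths would need the column boundaries detected from the data instead of being assumed.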

Iceland Independent Travel, Winter 2024 Guide: Itinerary Planning / Insurance / Self-Driving Tips

Published 2024-03-12 | Updated 2024-09-11 | Study Abroad > Travel | Iceland•Travel

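Introduction: When I finally finished my final exams during my dual-degree studies in Germany, it was early February. We were wondering how to plan our one-month winter break, and when my friends gathered at my place to talk about the countries we wanted to visit, everyone had the same wish: to see the northern lights in Iceland. The best months for viewing the aurora happen to be October through March, so we opened Skyscanner to find the cheapest flight dates. The cheapest combination was departing on 2/26 and returning on 3/7, so we booked right away; with the tickets booked, we had the motivation to start planning an 11-day, 10-night itinerary! This post records our itinerary planning, budget planning, self-driving notes, insurance, accidents, and more. I hope it helps! Because the post is long, I suggest using the table of contents on the side to find the section you want… and jumping straight to it. If any of the following applies to you, this post should be a good reference: I want to drive around the island but don't know how to plan the itinerary. See: 01-Itinerary. Which apps and websites are must-haves? See: 02-Essential Apps and Websites. I'm a broke exchange student in Europe who wants to save money and still have fun (budget under 50,000 including flights). See: 03-Expenses. What should I watch out for when driving in snow and renting a car? See: 04-Self-Driving Notes. How should I buy car insurance? See: 05-Insurance. What should I do if I have an accident while driving? See: 06-Accidents. Glacier hiking and blue ice ...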

IT Project Management: Key Points

Published 2024-01-19 | Updated 2024-09-11 | Tools > Agile, Waterfall | agile•waterfall

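Introduction: These are the key points I compiled while preparing for the IT Project Management final exam. Agile vs Waterfall. Agile Methodology. Ref: 敏捷式開發(Agile)、瀑布式開發(Waterfall) 、敏捷式UX、Lean UX。兜幾？ Agile principles: reduce waste, deliver quickly, iterate continuously, learn fast. Agile is the current trend in software development: software requirements change over time, so teams need a development method that can deliver quickly and adapt quickly. Positive Side of Agile: The customer needs to work closely with the team and gains a strong sense of ownership. If time to market matters more, Agile can ship a basic version sooner. Negative Side of Agile: Later sprints can drift from the original goal, leaving the product inconsistent. The method also suits hardware products poorly, since hardware cannot keep being changed once it is built. The customer may not have that much time to spare. Because Agile focuses on time-boxed delivery and frequent reprioritization, some items planned for delivery may not be finished within the allotted time. Additional sprints (beyond those originally planned) may be needed, which in turn ...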

Spark UI Observation Log

Published 2024-01-08 | Updated 2024-09-10 | Tools > Spark | spark•Debugging

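Job, Stage, Task. Ref: [看图说话] 基于Spark UI性能优化与调试——初级篇 Ref: 理解spark中的job、stage、task This post records observations from using Spark, aiming to answer the following questions: Q: How are stage, task, job, and partition related? Q: When is a Shuffle needed, and how does a Shuffle work? In short, after a Spark Application is submitted, each triggered Action produces a Job; each Job is split into multiple Stages at Shuffle boundaries; and each Stage, depending by default on the number of cores and the data size, covers multiple Partitions, each processed by one Task, to speed up computation. Job: First, data in Spark is made up of RDDs, and an RDD is made up of partitions, each representing one chunk of data. RDDs support two kinds of operations, Transformation and Action. A Transformation does not execute right away; it returns a new RDD, while ...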
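
The Job / Stage / Task relationship described above can be sketched as a toy model in plain Python. This is not the Spark API: `plan_stages` and `WIDE_OPS` are hypothetical names, the set of shuffle-causing operations is a small illustrative subset, and the model assumes the partition count stays the same after a shuffle (in real Spark it can change). It only shows the counting rule: a new Stage starts at each shuffle, and each Stage runs one Task per partition.

```python
# Operations that need a shuffle (illustrative subset, not exhaustive).
WIDE_OPS = {"reduceByKey", "groupByKey", "sortByKey", "repartition"}


def plan_stages(ops, num_partitions):
    """Split the operations of one Job into Stages at shuffle boundaries.

    Toy model: each Stage gets one Task per partition.
    """
    stages = [[]]
    for op in ops:
        # A wide (shuffle) operation closes the current stage and opens a new one.
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return [{"ops": s, "num_tasks": num_partitions} for s in stages if s]


# One Job, triggered by a single Action at the end of this chain.
plan = plan_stages(["map", "filter", "reduceByKey", "map"], num_partitions=4)
for index, stage in enumerate(plan):
    print(f"Stage {index}: {stage['ops']} -> {stage['num_tasks']} tasks")
```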

Setting Up Language Switching in Hexo - Butterfly

Published 2023-11-26 | Updated 2023-12-24 | Tech > hexo | hexo

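Introduction: For job hunting I needed an English version of my site, but I also wanted to keep the Chinese one, so I started looking for a way to switch between the two. I noticed that the official Hexo - Butterfly site has a Chinese/English switch, but none of the sites I searched explained how it works, so I had to read the source code. After two days of struggle I finally found a way; here is my write-up. Step 1. Create a private en repository. Reference: 完美的Hexo多语言解决方案. The idea mainly comes from the link above. It works by building multiple GitHub Pages sites: one repository dedicated to the Chinese site and another dedicated to the English site, with different config.yml and _config.butterfly.yml settings producing the language switch. Here are the two repositories I created (creating language-specific GitHub Pages). Step 2. Set up the [en/zh] config. First copy _config.yml into two files, config-en.yml and config-zh.yml, and apply the following settings. ...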
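
The excerpt stops before listing the actual settings, so the sketch below covers only the copy step from Step 2. It additionally assumes that each copy differs in Hexo's `language` key; that is a guess, not something the excerpt confirms. `write_language_config` is a hypothetical helper, and the language values `en` and `zh-TW` are assumptions. The demo runs in a temporary directory so it does not touch a real site.

```python
import tempfile
from pathlib import Path


def write_language_config(base_config, target, language):
    """Copy base_config to target, setting the top-level `language:` key.

    Assumption: the per-language configs differ in the `language` key.
    """
    lines = Path(base_config).read_text(encoding="utf-8").splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.startswith("language:"):
            lines[i] = f"language: {language}"
            replaced = True
    if not replaced:
        lines.append(f"language: {language}")
    Path(target).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Demo with a throwaway _config.yml.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "_config.yml"
    base.write_text("title: Demo\nlanguage: zh-TW\n", encoding="utf-8")
    write_language_config(base, Path(tmp) / "config-en.yml", "en")
    write_language_config(base, Path(tmp) / "config-zh.yml", "zh-TW")
    english_config = (Path(tmp) / "config-en.yml").read_text(encoding="utf-8")

print(english_config)
```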

Spark and Pyspark Local Mode & Cluster on Mac

Published 2023-11-18 | Updated 2024-09-10 | Tools > Spark | spark

Install Java: Open a terminal and execute java. It should redirect you to a download site (if you haven't installed it already). Python 3: If you installed Python through homebrew or conda, you can skip this step. Browse to https://python.org/downloads, get a 3.x version (latest is 3.12.0). Install the pkg. Spark/Pyspark: Go to https://spark.apache.org/downloads.html and download Spark. Use Spark 3.5.0 for Hadoop 3.3. Run the following commands to move spark under /usr/local; /usr/local usually holds non-system software that users download manually, and the folder is managed by the user. # Untar Archive with: tar xfz ...
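
As a small companion to the install steps, the sketch below checks whether a command actually runs, instead of only checking that it exists on the PATH. This matters for Java because the text above implies that `java` can be present as a launcher that merely points to a download site. `command_works` is a hypothetical helper, and whether `java -version` exits non-zero when no runtime is installed is an assumption about the local system.

```python
import subprocess
import sys


def command_works(command):
    """Return True if the command starts and exits with status 0."""
    try:
        result = subprocess.run(command, capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or hanging.
        return False
    return result.returncode == 0


print("python:", command_works([sys.executable, "--version"]))
print("java:", command_works(["java", "-version"]))
```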

Basic Concepts of PySpark

Published 2023-11-17 | Updated 2024-09-10 | Tools > Spark | spark•Debugging

Introduction: This post organizes the content of the book Spark: The Definitive Guide and adds my own understanding, to get more familiar with Spark's basic concepts. Spark Application (from: Spark: The Definitive Guide): A Spark Application mainly consists of two processes. Driver process: executing the main() function; sitting on a node in the cluster; maintaining information about the Spark Application; responding to a user's program or input; analyzing, distributing, and scheduling work across the executors. Executor process: executing code assigned to it by the driver; reporting the stat ...
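
The division of labor between driver and executors can be illustrated with a toy model in plain Python. This is not how Spark is implemented: real executors are separate processes on cluster nodes, while this sketch uses a thread pool, and `run_application` is a hypothetical name. The driver role splits the data into partitions and schedules one task per partition; the executor role runs the assigned code and reports its result back.

```python
from concurrent.futures import ThreadPoolExecutor


def run_application(data, num_executors, task):
    """Toy driver: partition the data, schedule tasks, collect the results."""
    # Driver side: split the work into one partition per executor.
    partitions = [data[i::num_executors] for i in range(num_executors)]
    # Executor side: each worker runs the code assigned to it by the driver.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(task, partitions))
    # The results flow back to the driver.
    return partial_results


partials = run_application(list(range(10)), num_executors=3, task=sum)
total = sum(partials)
print(partials, total)
```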

Twitter Dataset - Predicting Tweet Emotion with an LSTM

Published 2023-11-16 | Updated 2023-11-17 | Code > Machine Learning | Machine Learning

Introduction: I recently took an AI course; this is the sixth assignment, covering the following topics: learning to use LSTMs; using spaCy. Assignment: Train a text classification on the TweetEval emotion recognition dataset using LSTMs and GRUs. Build the LSTM model: Follow the example described here. Use the same architecture, but: only use the last output of the LSTM in the loss function; use an embedding dim of 128; use a hidden dim of 256. Tokenize with spaCy: Use spaCy to split the tweets into words. Keep the top 5000 words: Limit your vocabulary (i.e. the words that you converted to an index) to the most frequen ...
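
The vocabulary-limiting step can be sketched with the standard library alone. This is a minimal sketch under assumptions: tokenization here is a plain lowercase whitespace split standing in for spaCy, the `<pad>` and `<unk>` entries and the names `build_vocab` and `encode` are hypothetical, and the special tokens are counted on top of the word limit.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_vocab(tokenized_tweets, max_words=5000):
    """Map the max_words most frequent tokens to integer indices."""
    counts = Counter(token for tweet in tokenized_tweets for token in tweet)
    vocab = {PAD: 0, UNK: 1}
    for word, _ in counts.most_common(max_words):
        vocab[word] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Convert tokens to indices; out-of-vocabulary words map to UNK."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]


# Stand-in tokenization: lowercase whitespace split instead of spaCy.
tweets = [t.lower().split() for t in ["I love this", "I hate this", "love wins"]]
vocab = build_vocab(tweets, max_words=3)

print(vocab)
print(encode(["i", "hate", "this"], vocab))
```

Words outside the limit share the single `<unk>` index, which keeps the embedding table at a fixed size.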

COCO Dataset - Object Detection with Faster RCNN + MobileNet

Published 2023-11-03 | Updated 2023-11-09 | Code > Machine Learning | Machine Learning

Introduction: I recently took an AI course; this is the fourth assignment, covering the following topics: download the COCO dataset; use a pre-trained version of Faster R-CNN to predict the bounding box; calculate IoU. Assignment: Download the COCO dataset: Download the file "2017 Val images [5K/1GB]" and "2017 Train/Val annotations [241MB]" from the Coco page. You can use the library pycocotools to load them into your notebook. Randomly pick ten images from the dataset: Randomly select 10 images from this dataset. Predict bboxes with the pre-trained Faster R-CNN model: Use a pre-trained version of Faster R-CNN (Resnet50 backbone) to predict t ...
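
The IoU step can be sketched in a few lines of plain Python. COCO annotations store boxes as `[x, y, width, height]`, while torchvision's Faster R-CNN returns `[x1, y1, x2, y2]`, so a conversion is needed before comparing them. The function names `coco_to_xyxy` and `iou` are hypothetical, and this is a sketch rather than the author's code, which the excerpt does not show.

```python
def coco_to_xyxy(box):
    """Convert a COCO [x, y, width, height] box to [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    # Corners of the overlapping rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = coco_to_xyxy([0, 0, 2, 2])  # COCO format
prediction = [1, 1, 3, 3]                  # already [x1, y1, x2, y2]
print(iou(ground_truth, prediction))
```

In this example the overlap is a 1×1 square and the union covers 7 square units, so the IoU is 1/7.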

Flower102 Dataset - Training with Transfer Learning + Batch Normalization in a CNN

Published 2023-10-31 | Updated 2023-11-02 | Code > Machine Learning | Machine Learning

Introduction: I recently took an AI course; this is the fourth assignment, covering the following topics: Pick a dataset and train a model on it. Transfer Learning - Fine Tuning. Batch Normalization in CNN. Main references: Flower102 Dataset Transfer Learning DataSet of Pytorch Models for transfer learning Shannon's Blog of Transfer Learning Resnet18 Assignment. Task: Choose a dataset: Check out the torchvision DataSet of Pytorch and decide one dataset that you want to use (no CIFAR, no ImageNet, no FashionMNIST). Print images and dataset size: Show some example images of the dataset in the notebook ...
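
To make the Batch Normalization topic concrete, here is a minimal sketch of the training-time computation for a single channel, in plain Python without PyTorch. In a CNN the same statistics are computed per channel over the batch and spatial dimensions; the flat list here stands in for all activations of one channel. `batch_norm` is a hypothetical name, and `gamma`, `beta`, and `eps` are the usual learnable scale, learnable shift, and numerical-stability constant.

```python
def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    # Biased variance (divide by n), as used for normalization during training.
    var = sum((v - mean) ** 2 for v in values) / n
    return [gamma * (v - mean) / (var + eps) ** 0.5 + beta for v in values]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```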