Adjusting the Page Position of Self-Scanned pdf Files with OpenCV-Python and pdfLaTeX

This article is the Day 11 entry of the TeX & LaTeX Advent Calendar 2021.

Day 10 was by t_kemmochi, and Day 12 is by yukishita.

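Motivation

I scan my books with a book-edge scanner, the Avision FB2280E. Because I scan one page at a time without cutting the book apart, the position of the scanned image drifts noticeably between the beginning, middle, and end of a book, depending on factors such as how far the book is opened. I had been correcting the position with pdfLaTeX for some time, and since it looked like OpenCV could automate the work, I gave it a try.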

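What this article does

This article explains how to use OpenCV and pdfLaTeX to semi-automatically compute the type area (used here to mean the part of the page where text is printed) of document images stored in a pdf, and then adjust the position of those images. The procedure is as follows.

Compute the type area with OpenCV, an image recognition library for Python.
Use the template engine library Jinja2 to generate a LaTeX file that describes the bounding box and related values of each image.
Import the pdf with pdfLaTeX and produce a pdf whose page positions have been adjusted.

My working environment is Visual Studio Code together with the Remote - WSL extension. Python and its libraries are installed on Ubuntu under WSL, and I also installed poppler-utils to convert pdf files into other file formats.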

References

For layout analysis of document images with OpenCV, I referred to

the article Document Layout Analysis on the blog Something Like Programming
the GitHub repository rbaguila/document-layout-analysis
the OpenCV-Python Tutorials

I believe I learned that pdfLaTeX can be used to process pdf files, and that the pdfpages package exists, from

the article pdfTeX による見開きPDFの結合・分割 on doraTeX's blog TeX Alchemist Online

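Importing the image pdf with pdfpages

First, prepare the source pdf. I assume that after scanning, the shadowed areas have been cropped off with a tool such as briss and that deskewing and similar corrections have already been applied. The next image shows this pdf simply imported with pdfLaTeX using the LaTeX package pdfpages.

In the image above, the eso-pic package is used to overlay grid lines when importing the pdf. For how to display the grid lines, see

LaTeXの出力pdfにグリッドラインを引く(自炊pdfの画像位置調整)

The black frame shows the size of the imported image. When the image pdf is imported one page at a time with the \includepdf command of pdfpages, each image is placed at the center of the page. You can see that the top, bottom, left, and right margins differ slightly from image to image because of variation during scanning. The goal is to detect the frame enclosing the printed text in each image and fit it inside the blue lines, whose positions were determined by, for example, measuring the physical book.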

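Preparation for image recognition

To run image recognition with OpenCV, convert the prepared pdf file to image files beforehand. Using pdftoppm, which is included in poppler-utils, run the following in a terminal:

pdftoppm -png -r 300 filename.pdf filename

This writes each page of the pdf file out as a 300 dpi png file. The pdftoppm command is also included in TeX Live.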
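
One detail is worth checking at this step. As far as I know, the pdftoppm in poppler-utils zero-pads the page number in each output file name to the number of digits of the document's page count (for example filename-07.png for a 50-page pdf and filename-007.png for a 300-page pdf), whereas the script in the next section always assumes three digits. The following is a minimal sketch, using a hypothetical helper ppm_image_name that is not part of the original scripts, of how the expected name could be built for any page count. Treat the padding rule as an assumption to verify against your poppler version.

```python
def ppm_image_name(docname: str, page: int, total_pages: int, ext: str = "png") -> str:
    """Build the file name pdftoppm is assumed to produce for one page.

    Assumption: the page number is zero-padded to the number of digits
    in total_pages. Hypothetical helper, not part of the original scripts.
    """
    if not 1 <= page <= total_pages:
        raise ValueError(f"page {page} is outside 1..{total_pages}")
    width = len(str(total_pages))  # e.g. 2 for 50 pages, 3 for 300 pages
    return f"{docname}-{page:0{width}d}.{ext}"


if __name__ == "__main__":
    print(ppm_image_name("filename", 7, 50))
    print(ppm_image_name("filename", 7, 300))
```

For a book with fewer than 100 or more than 999 pages, either a helper like this or renaming the png files to three digits would keep the later scripts working unchanged.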

Detecting the type area with OpenCV-Python

Save the following Python script as bbox_calc.py and use it to process the png files exported above.

import os
import cv2
import numpy as np

# Threshold (binarize) one page image
def img_binalizer(docname, page, for_latex):
    imgpath = "images/" + str(docname) + "/" + str(docname) + "-" + str(page).zfill(3) + ".png"
    if for_latex:  # used when generating the LaTeX file
        # The image is already binarized, so read it directly as a single-channel grayscale image
        img = cv2.imread(imgpath, cv2.IMREAD_GRAYSCALE)
        imgGray = img
    else:  # used when creating the check images
        img = cv2.imread(imgpath, cv2.IMREAD_COLOR)  # read as a color image
        imgGray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # convert to grayscale

    if img is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(imgpath)

    imgBlur = cv2.medianBlur(imgGray, 5)  # median blur seems to remove specks better
    # imgBlur = cv2.GaussianBlur(imgMedBlur, (5, 5), 0)

    # When the image contains many specks
    # kernel = np.ones((5, 5), np.uint8)
    # temp_img = cv2.morphologyEx(imgGray,cv2.MORPH_OPEN,kernel,iterations=3)
    # imgBlur = cv2.medianBlur(temp_img, 7)

    # Thresholding: the second return value of cv2.threshold is the binarized image
    _ , thresh = cv2.threshold(imgBlur,0,255,cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU)
    return img, thresh

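# Compute corner coordinates from the bounding rectangle of a contour
def contour_corner(contour):
    # Get the top-left corner (x, y), width w, and height h of the bounding rectangle of the contour
    x, y, w, h = cv2.boundingRect(contour)
    # Store the top-left and bottom-right corners of the contour in an array
    bdcorner = np.array([[x, y], [x+w, y+h]], dtype=np.int16)
    return bdcorner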

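# Detect the contours of the objects (characters) in the image and build an array holding
# the top-left and bottom-right corners of each contour (origin at the top-left)
def calc_contour_corners(thresh):
    # Dilate the shapes in the image so that the contour lines do not overlap the shapes too much
    kernel = np.ones((8, 4), np.uint8)
    dilate_img = cv2.dilate(thresh, kernel, iterations=1)

    # Detect the contours of the shapes with findContours.
    # RETR_EXTERNAL returns only the outermost level of the contour hierarchy
    # CHAIN_APPROX_SIMPLE keeps only the end points of straight segments (the four corners for a rectangular contour)
    contours, _ = cv2.findContours(dilate_img.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    if len(contours) == 0:
        # Fallback when no contour is detected (blank pages and the like): use arbitrary coordinates.
        bdcorners = np.array([[[0, 0], [5, 5]]])
    else:
        # Build an array holding the top-left and bottom-right corners of each contour
        bdcorners = np.array([ contour_corner(cnt) for cnt in contours], dtype=np.int16)
    return bdcorners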


# Compute the corners of the type area from the corners of each contour (origin at the top-left)
def calc_bbox_corners(bdcorners):
    ul_point = bdcorners[:,0].min(axis=0) # minimum x and y over the top-left corners of all contours
    lr_point = bdcorners[:,1].max(axis=0) # maximum x and y over the bottom-right corners of all contours
    return ul_point, lr_point

# Draw the rectangle enclosing each contour
def draw_contour_borders(bdcorners, output):
    for cnt in np.arange(len(bdcorners)):
        # Draw a rectangle for each contour (coordinates are converted to plain Python ints)
        cv2.rectangle(output, tuple(map(int, bdcorners[cnt,0])), tuple(map(int, bdcorners[cnt,1])), (0, 255, 0), 2)

    bbox_corners = calc_bbox_corners(bdcorners)
    ul_point = bbox_corners[0]
    lr_point = bbox_corners[1]
    cv2.rectangle(output, tuple(map(int, ul_point)), tuple(map(int, lr_point)), (255, 255, 0), 2) # draw the rectangle enclosing the type area
    return output



if __name__ == "__main__": # the code below is not executed when this file is imported from another file
    import time
    import fitz # PyMuPDF

    docname = "filename"
    out_dir = "output/" + str(docname)

    os.makedirs(out_dir, exist_ok=True)

    # Get the number of pages in the pdf file.
    pdfpath = "images/" + str(docname) + "/" + str(docname) + ".pdf"
    pdfpages = fitz.open(str(pdfpath)).page_count # called pageCount in older PyMuPDF releases

    # Page numbers to process
    ini = 1
    fin = pdfpages + 1

    for p in range(ini, fin):
        # Threshold the image (the second return value is the binarized image)
        img, th = img_binalizer(docname, p, False)

        # Print to the console
        print(p)
        print(img.shape)

        output_img = img.copy()

        s=time.time()
        bdcorner = calc_contour_corners(th) # array of bounding-rectangle corners of the shapes detected in the binary image
        output_borders = draw_contour_borders(bdcorner, output_img) # draw the bounding rectangles on the image
        t=time.time()-s
        print(t)

        # Write out the image with the borders drawn on it
        cv2.imwrite(str(out_dir) + "/output-%03d.jpg" %p, output_borders)

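Processing the png files from earlier with this script produces jpg files like the following image.

The green frames are the bounding rectangles of the characters or words detected by OpenCV's findContours. The blue frame is drawn by computing, from the corner coordinates of those rectangles, the corners of the region that contains printed text. Based on this blue frame, we compute how far each image should be shifted when it is imported.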
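
The step from the green frames to the blue frame is just a minimum over the top-left corners and a maximum over the bottom-right corners. The sketch below shows this in plain Python so that it can be run without OpenCV; union_bbox is an illustrative name and the box values are made up. In bbox_calc.py the same reduction is done with NumPy's min and max along axis 0.

```python
def union_bbox(boxes):
    """Return ((xmin, ymin), (xmax, ymax)) enclosing all boxes.

    Each box is ((x_left, y_top), (x_right, y_bottom)) in pixel
    coordinates with the origin at the top-left of the image.
    """
    if not boxes:
        raise ValueError("at least one box is required")
    xmin = min(box[0][0] for box in boxes)
    ymin = min(box[0][1] for box in boxes)
    xmax = max(box[1][0] for box in boxes)
    ymax = max(box[1][1] for box in boxes)
    return (xmin, ymin), (xmax, ymax)


if __name__ == "__main__":
    # Three made-up word boxes (illustrative values only)
    words = [
        ((120, 200), (480, 240)),
        ((100, 260), (900, 300)),
        ((130, 1500), (700, 1540)),
    ]
    print(union_bbox(words))
```

Because every detected contour contributes to the minimum and maximum, a single speck near the page edge enlarges the blue frame, which is why the check images are worth inspecting before moving on.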

Check the images produced by bbox_calc.py, and if the type area appears to be captured well, proceed to the next step.

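Computing the position correction for the image pdf

Next, using the functions defined in bbox_calc.py, we compute the coordinates of the two diagonal corners of the rectangle enclosing the type area of each image, derive from them how far to move the image, and write the result out as a LaTeX file with Jinja2. If the pdf file to be processed is named filename.pdf, save the following script as filename.py.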

import os
import sys
# Add the parent folder to the module search path
sys.path.append(os.path.join(os.path.dirname(__file__), '..'))

# OpenCV
import cv2

# NumPy
import numpy as np
import time

# PyMuPDF (used to get the number of pdf pages)
import fitz

# Used to draw the scatter plot
import matplotlib.pyplot as plt

# Load the functions defined in bbox_calc.py
import bbox_calc as bbc

# Jinja2 (generates the LaTeX file that imports the pdf from a template file)
# Raw strings keep the backslashes in the delimiters from being read as escape sequences
import jinja2
latex_jinja_env = jinja2.Environment(
    block_start_string = r'\BLOCK{',
    block_end_string = '}',
    variable_start_string = r'\VAR{',
    variable_end_string = '}',
    comment_start_string = r'\#{',
    comment_end_string = '}',
    line_statement_prefix = '%%',
    line_comment_prefix = '%#',
    trim_blocks = True,
    autoescape = False,
    loader = jinja2.FileSystemLoader(os.path.abspath('.'))
)
template = latex_jinja_env.get_template('jinja_template.tex')

# Get the page count from the pdf file that has the same name as this script
basename = os.path.basename(__file__) # file name of this script
docname = os.path.splitext(basename)[0] # script file name without the extension
pdfpath = "images/" + str(docname) + "/" + str(docname) + ".pdf"
pdfpages = fitz.open(str(pdfpath)).page_count # number of pdf pages (requires PyMuPDF; called pageCount in older releases)

# Set the physical page size
# B5 : 182mm x 257mm
# A5 : 148mm x 210mm
# B6 : 128mm x 182mm
phorizontal = 150 # horizontal size of the page (mm)
pvertical = 220 # vertical size of the page (mm)
tmargin = 17 # top margin (mm)
tmargin2 = 38 # top margin2 (mm)
tmargin3 = 52.5 # top margin3 (mm)
bmargin = 18 # bottom margin (mm)
rmargin = 14.5 # right margin (mm)
lmargin = 14.5  # left margin (mm)
center = "{:.2f}".format(phorizontal / 2) # horizontal center of the page

# First and last pages of the pdf file
ini = 1 # first page
fin = pdfpages + 1 # last page + 1
scale = 1.0 # magnification
angle = 0 # rotation angle


# Unit conversion (convert a pixel value to mm using the dpi)
def px2mm(pix, dpi):
    mm = pix * (25.4 / dpi)
    return mm

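# Compute the bounding box of the type area of one page image
def bbox(thresh, pagenum, xdpi, ydpi):
        img_height, img_width = thresh.shape

        # Note that bbox_calc works with the origin at the top-left
        bdcorners = bbc.calc_contour_corners(thresh)
        bbox_corners = bbc.calc_bbox_corners(bdcorners)
        xmin, ymin = bbox_corners[0] # top-left corner of the type area
        xmax, ymax = bbox_corners[1] # bottom-right corner of the type area

        # Compute the coordinates for the bounding box. Here the origin is at the bottom-left and the unit is mm
        # ll = lower left, ur = upper right
        llx = px2mm(xmin, xdpi) # x coordinate of the lower-left corner of the type area
        lly = px2mm(img_height - ymax, ydpi) # y coordinate of the lower-left corner of the type area
        urx = px2mm(xmax, xdpi) # x coordinate of the upper-right corner of the type area
        ury = px2mm(img_height - ymin, ydpi) # y coordinate of the upper-right corner of the type area
        xtext = px2mm(xmax - xmin, xdpi) # width of the type area
        ytext = px2mm(ymax - ymin, ydpi) # height of the type area
        img_tmargin = px2mm(ymin, ydpi) # top margin in the image file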

        # pdfpages places the image at the center of the page. Compute the margins in that state (mm)
        init_hmargin = 0.5 * (phorizontal - (xtext * scale) )
        init_vmargin = 0.5 * (pvertical - (ytext * scale) )

        # Margins inside the included image (pixels)
        tmargin_px = ymin # top
        bmargin_px = img_height - ymax #bottom
        lmargin_px = xmin #left
        rmargin_px = img_width - xmax #right

        # Margin ratios used for the adjustment
        tbmargin_ratio = tmargin_px / bmargin_px
        img_tmargin_ratio = tmargin_px / img_height
        sidemargin_ratio = (lmargin_px + rmargin_px) / img_width
        lmargin_ratio = lmargin_px / img_width

        # Horizontal distance to move from the initial pdfpages placement
        if (sidemargin_ratio > 0.25) and (lmargin_ratio > 0.10) : # do not move when the side margins are large
            xshift = 0
        elif pagenum % 2 == 0: # decide which side to align to from the parity of the page number
        # else:
            xshift = lmargin - init_hmargin  # align left
        elif sidemargin_ratio < 0.03:
            xshift = lmargin - init_hmargin  # align left
        else:
            xshift = init_hmargin - rmargin # align right
            
        # Thresholds for switching the top margin (chosen with the help of the scatter plot)
        tm_ratio_thresh1 = 0.1
        tm_ratio_thresh2 = 0.2

        # Vertical distance to move from the initial pdfpages placement
        if (sidemargin_ratio > 0.25) and (lmargin_ratio > 0.10): # do not move when the side margins are large
            yshift = 0
        elif img_tmargin_ratio > tm_ratio_thresh2 : # push the image down when the top margin is large
            yshift = -tmargin3 + init_vmargin
        elif (img_tmargin_ratio < tm_ratio_thresh2 ) and (img_tmargin_ratio > tm_ratio_thresh1):
            yshift = -tmargin2 + init_vmargin
        # elif img_tmargin_ratio < 0.005:
            # yshift = 0
        else:
            yshift = -tmargin + init_vmargin
            # yshift = bmargin - init_vmargin

        # Format the values to two (or three) decimal places; note that format rounds rather than truncates
        llx_2f = "{:.2f}".format(llx)
        lly_2f = "{:.2f}".format(lly)
        urx_2f = "{:.2f}".format(urx)
        ury_2f = "{:.2f}".format(ury)
        xshift_2f = "{:.2f}".format(xshift)
        yshift_2f = "{:.2f}".format(yshift)
        tbmargin_ratio_2f = "{:.3f}".format(tbmargin_ratio)
        img_tmargin_ratio_2f = "{:.3f}".format(img_tmargin_ratio)
        sidemargin_ratio_2f = "{:.2f}".format(sidemargin_ratio)
        lmargin_ratio_2f = "{:.2f}".format(lmargin_ratio)
        xtext_2f = "{:.2f}".format(xtext)
        ytext_2f = "{:.2f}".format(ytext)

        return llx_2f, lly_2f, urx_2f, ury_2f, xshift_2f, yshift_2f, tbmargin_ratio_2f, img_tmargin_ratio_2f, sidemargin_ratio_2f, lmargin_ratio_2f, xtext_2f, ytext_2f


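# Initialize the string that will be inserted into the tex file generated by Jinja2
string = ""

# Delete an existing log file
logname = "log.txt"
if os.path.isfile(logname):
    os.remove(logname)

# Array holding the ratio of the top margin to the image height (used for the scatter plot)
margin_ratios = np.empty(shape=pdfpages,dtype=np.float16)

init_time = time.time()
# Main part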
for p in range(ini, fin):
    s=time.time()    
    
    img, th = bbc.img_binalizer(docname, p, True)
    
    # dpi of the image files, needed for the unit conversion
    xdpi = 300
    ydpi = 300
    
    bb = bbox(th, p, xdpi, ydpi)
    t=time.time()-s

    # Store the ratio of the top margin to the image height
    margin_ratios[p-1] = bb[7]

    # Print to the console
    print("page: " + str(p))
    print("shape: " + str(img.shape))
    # print(bb)
    print("ll:" + str((bb[0], bb[1])) + ", ur:" + str((bb[2], bb[3])))
    print("xshift: " + str(bb[4]) + ", yshift: " + str(bb[5]))
    print("tb margin ratio:" + str(bb[6]) + ", image top margin ratio:" + str(bb[7]) + ", side margin ratio:" + str(bb[8]) + ", left margin ratio:" + str(bb[9]) + "\n")
    print("time: " + str(t) + "\n")

    # Write to the log file
    with open(str(logname) , mode = "a", encoding="utf-8") as lg:
        lg.write("page: " + str(p) + "\n")
        lg.write("shape: " + str(img.shape) + "\n")
        lg.write("ll:" + str((bb[0], bb[1])) + ", ur:" + str((bb[2], bb[3])) + "\n")
        lg.write("xshift: " + str(bb[4]) + ", yshift: " + str(bb[5]) + "\n")
        lg.write("tb margin ratio:" + str(bb[6]) + ", image top margin ratio:" + str(bb[7]) + "\n")
        lg.write("side margin ratio:" + str(bb[8]) + ", left margin ratio:" + str(bb[9]) + "\n")
        lg.write("time: " + str(t) + "\n\n")

    # String written to the LaTeX file (the bounding box of the image and the shift of the origin (offset))
    string = string + "\\includepdf[pages={{{0}}},scale={1},angle={2},noautoscale,bb={3}mm {4}mm {5}mm {6}mm,offset={7}mm {8}mm,frame]{{\\target}}\n".format(p, scale, angle, bb[0], bb[1], bb[2], bb[3], bb[4], bb[5])

# Total running time of the script
fin_time = time.time() - init_time
print(fin_time)

# Draw the scatter plot of the margin ratios
x = list(np.arange(1,fin))
fig = plt.figure()
plt.xlabel("page number",fontsize=18)
plt.ylabel("top margin / vertical image size",fontsize=18)
plt.grid(True)
plt.scatter(x,margin_ratios)
fig.set_size_inches(15,10)
plt.show()

# Data passed to Jinja2
data = {
    "executecommand": "%#! pdflatex jinja_output.tex", # for YaTeX
    "firstpagenum": str(ini), 
    "paperwidth" : str(phorizontal), 
    "paperheight": str(pvertical), 
    "leftmargin": str(lmargin), 
    "rightmargin": str(rmargin), 
    "topmargin": str(tmargin), 
    "topmarginii": str(tmargin2), 
    "topmarginiii": str(tmargin3),
    "bottommargin": str(bmargin), 
    "centerposition": str(center), 
    "target": str(pdfpath),
    "main": str(string),
    }

document = template.render(data)
with open("jinja_output.tex", mode = "w", encoding="utf-8") as fd: 
    fd.write(document)
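
Before moving on to the template, the coordinate conversion inside bbox() may be easier to follow in isolation. The sketch below repeats px2mm and wraps the four conversions in a hypothetical helper, bbox_px_to_mm, that is not part of the script above. It takes a pixel box measured from the top-left corner, as OpenCV reports it, and returns the lower-left and upper-right corners in mm measured from the bottom-left corner, which is the form written into the bb option. The numbers in the example are made up.

```python
def px2mm(pix, dpi):
    """Convert a length in pixels to millimetres (1 inch = 25.4 mm)."""
    return pix * 25.4 / dpi


def bbox_px_to_mm(xmin, ymin, xmax, ymax, img_height, dpi=300):
    """Convert a top-left-origin pixel box to a bottom-left-origin mm box.

    Returns (llx, lly, urx, ury): the lower-left and upper-right corners.
    Hypothetical helper mirroring the arithmetic in bbox().
    """
    llx = px2mm(xmin, dpi)
    lly = px2mm(img_height - ymax, dpi)  # distance of the box bottom from the image bottom
    urx = px2mm(xmax, dpi)
    ury = px2mm(img_height - ymin, dpi)  # distance of the box top from the image bottom
    return llx, lly, urx, ury


if __name__ == "__main__":
    # A 3000 px tall image at 300 dpi with the type area at x 300..1500, y 600..2400
    corners = bbox_px_to_mm(300, 600, 1500, 2400, img_height=3000, dpi=300)
    print(["{:.2f}".format(v) for v in corners])
```

The only non-obvious part is the vertical flip: the y values are subtracted from the image height, so the bottom edge of the pixel box (ymax) gives the lower y coordinate and the top edge (ymin) gives the upper one.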

The Jinja2 template file is named jinja_template.tex and has the following contents.

\documentclass{article}

\usepackage[paperwidth=\VAR{paperwidth}mm,paperheight=\VAR{paperheight}mm]{geometry} 

\newif\ifesopic
\esopictrue
%\esopicfalse

\ifesopic
\usepackage[texcoord]{eso-pic} %used to draw the grid in the foreground of every page; texcoord puts the origin at the top-left
\fi

\usepackage{pdfpages} %provides the \includepdf command

\ifesopic
\usepackage{tikz} %used to draw the grid
\usetikzlibrary{calc,math}
\fi

\ifesopic
%define the new length commands used with eso-pic
\newlength{\myleftmargin}
\newlength{\myrightmargin}
\newlength{\mytopmargin}
\newlength{\mytopmarginii}
\newlength{\mytopmarginiii}
\newlength{\mybottommargin}
\newlength{\mycenterposition}

%margins
\setlength{\myleftmargin}{\VAR{leftmargin}mm}
\setlength{\myrightmargin}{\VAR{rightmargin}mm}
\setlength{\mytopmargin}{\VAR{topmargin}mm}
\setlength{\mytopmarginii}{\VAR{topmarginii}mm}
\setlength{\mytopmarginiii}{\VAR{topmarginiii}mm}
\setlength{\mybottommargin}{\VAR{bottommargin}mm}
\setlength{\mycenterposition}{\VAR{centerposition}mm}

%draw the grid and the guide lines for the type area
\AddToShipoutPictureFG{%
\begin{tikzpicture}[remember picture, overlay,
                   help lines/.append style={line width=0.1pt,
                                             color=blue!50},
                   minor divisions/.style={help lines,line width=0.2pt,
                                           color=red!50},
                   major divisions/.style={help lines,line width=0.3pt,
                                           color=red},
                   guide lines/.style={line width=0.5pt,color=blue},
]
 \draw[help lines] (current page.south west) grid[step=1mm]
                   (current page.north east);
 \draw[minor divisions] (current page.south west) grid[step=10mm]
                        (current page.north east);
 \draw[major divisions] (current page.south west) grid[step=50mm]
                        (current page.north east);

\draw[guide lines] ($(current page.north west) + (\myleftmargin,0)$)--($(current page.south west)+ (\myleftmargin,0)$); %left
\draw[guide lines] ($(current page.north east) - (\myrightmargin,0)$)--($(current page.south east)- (\myrightmargin,0)$); %right
\draw[guide lines] ($(current page.north west) - (0,\mytopmargin)$)--($(current page.north east)- (0,\mytopmargin)$); %top
\draw[guide lines] ($(current page.north west) - (0,\mytopmarginii)$)--($(current page.north east)- (0,\mytopmarginii)$); %top2
\draw[guide lines] ($(current page.north west) - (0,\mytopmarginiii)$)--($(current page.north east)- (0,\mytopmarginiii)$); %top3
\draw[guide lines] ($(current page.south west) + (0,\mybottommargin)$)--($(current page.south east)+ (0,\mybottommargin)$); %bottom
\draw[guide lines] ($(current page.north west) + (\mycenterposition,0)$)--($(current page.south west)+ (\mycenterposition,0)$); %center
\end{tikzpicture}%
}
\fi

\def\firstpagenum{\VAR{firstpagenum}} %number of the first page (used when only part of the pdf is imported)
\setcounter{page}{\firstpagenum}
\usepackage[pdfstartpage=\firstpagenum]{hyperref}


\def\target{\VAR{target}} %path of the pdf to import

%for YaTeX
\VAR{executecommand}


\begin{document}

\VAR{main}

\end{document}

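Results

When this script finishes running, it first displays a scatter plot like the following image.

The horizontal axis is the page number, and the vertical axis is the ratio of the top margin of the source image to the image height.

The first page of each chapter often has no page number at the top, so its top-margin ratio is larger. Most of the outliers in the scatter plot above correspond to such pages. Based on this plot, I set the thresholds for switching the top margin by hand, and the calculation switches between the top margins accordingly.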
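
The switching rule can be written as a small standalone function. This is only a sketch of the comparison used in the script: select_top_margin is an illustrative name, and the default margins (17, 38, 52.5 mm) and thresholds (0.1, 0.2) are simply the values that appear in the script for this particular book, so they would need to be re-read from the scatter plot for another book.

```python
def select_top_margin(top_ratio, margins=(17, 38, 52.5), thresholds=(0.1, 0.2)):
    """Pick the target top margin (mm) from the image's top-margin ratio.

    top_ratio is (top margin in pixels) / (image height in pixels).
    margins are the normal, second, and third top margins in mm.
    """
    normal, second, third = margins
    low, high = thresholds
    if top_ratio > high:
        return third   # large top space
    if low < top_ratio < high:
        return second  # intermediate top space
    return normal      # ordinary page


if __name__ == "__main__":
    for ratio in (0.05, 0.15, 0.30):
        print(ratio, select_top_margin(ratio))
```

As in the script, a ratio exactly equal to one of the thresholds falls through to the normal margin.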

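The LaTeX file (part of it) obtained by running the script looks like the following image.

For each page, the bounding box (bb) of the image and the correction of the origin position (offset) are computed. Typesetting this LaTeX file yields the following image.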
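
As a rough illustration of how one such line comes about, the sketch below computes the offset for a single page and formats the corresponding \includepdf line. It follows the arithmetic of the script (the clipped type area is centered by pdfpages, so the shift is the difference between the desired margin and the centered margin), but the helper names centered_margin and includepdf_line and all numeric values are assumptions made for this example.

```python
def centered_margin(paper_mm, text_mm, scale=1.0):
    """Margin on each side when the clipped type area is centered on the page."""
    return 0.5 * (paper_mm - text_mm * scale)


def includepdf_line(page, bb, offset, scale=1.0, angle=0):
    """Format one \\includepdf line; bb and offset are tuples of mm values."""
    bb_str = " ".join(f"{v:.2f}mm" for v in bb)
    off_str = " ".join(f"{v:.2f}mm" for v in offset)
    return (
        f"\\includepdf[pages={{{page}}},scale={scale},angle={angle},noautoscale,"
        f"bb={bb_str},offset={off_str},frame]{{\\target}}"
    )


if __name__ == "__main__":
    # Made-up page: 150 x 220 mm paper, type area 118 x 180 mm
    init_h = centered_margin(150, 118)  # 16.0 mm on each side
    init_v = centered_margin(220, 180)  # 20.0 mm above and below
    xshift = 14.5 - init_h              # left-align to a 14.5 mm left margin
    yshift = init_v - 17                # raise to a 17 mm top margin
    print(includepdf_line(12, (10.0, 25.0, 128.0, 205.0), (xshift, yshift)))
```

A negative horizontal offset moves the image to the left and a positive vertical offset moves it up, so in this example the type area moves 1.5 mm left and 3 mm up from the centered position.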

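The physical page size of the book is measured and written into the script, and the top, bottom, left, and right margins (the blue lines) are likewise set from measured values and refined by running the script several times. Finally, removing the type-area frames and the grid gives a pdf file like the following image.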

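Closing remarks

At present I set the ratio values for switching the margins by hand, based on the scatter plot. I would like to automate this, among other improvements.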

うぶつん
