[Phi-3 Vision] Build an OCR Tool in No Time! How to Use a Lightweight Multimodal SLM

Published: July 17, 2024 / Updated: September 11, 2024

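Phi-3 Vision is a multimodal SLM (small language model) developed by Microsoft that is free to use.

It can read text from images, so you can easily build an OCR tool with Phi-3 Vision.

This article explains how to use Phi-3 Vision in an easy-to-follow way.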


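What is Phi-3 Vision?

Phi-3 Vision is an open-source SLM (small language model) developed by Microsoft.

With Phi-3 Vision, you can easily build OCR that extracts text from image data such as business cards and contracts.

Features of Phi-3 Vision

Strong performance compared with models of the same size or larger
Accepts an image and text as input and generates text
The model can be downloaded and run in a local environment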


How to use Phi-3 Vision

This section explains how to generate text with Phi-3 Vision.

Execution environment

The environment used in this article is as follows.

GPU: NVIDIA A100 80GB
GPU memory (VRAM): 80GB
OS: Ubuntu 22.04
Docker

Setting up the environment with Docker

We use Docker to set up the environment for Phi-3 Vision.


Implementing Phi-3 Vision

We implement Phi-3 Vision in Jupyter Lab running in a Docker container.

STEP 1

Import the libraries

Run the following code in a Jupyter Lab code cell to import the libraries.

from PIL import Image 
import requests 
from transformers import AutoModelForCausalLM 
from transformers import AutoProcessor

STEP 2

Set up the model and processor

Load the Phi-3 Vision model and processor.

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation='flash_attention_2'
)

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True
)

Code explanation

microsoft/Phi-3-vision-128k-instruct: Specifies the model ID of Phi-3 Vision.

AutoModelForCausalLM.from_pretrained(): Loads the model.

flash_attention_2: Selects FlashAttention-2, which speeds up inference.

trust_remote_code=True: Allows the custom model code bundled with the model repository to be loaded.

AutoProcessor.from_pretrained(): Loads the processor, which bundles the image processor and the tokenizer.
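FlashAttention-2 generally needs a recent NVIDIA GPU (roughly Ampere or newer) and the flash-attn package, so it may not be available everywhere. As a minimal sketch, the keyword arguments used above could be built by a small helper that switches the attention implementation. The helper name `build_model_kwargs` and its `use_flash_attention` flag are hypothetical; the `"eager"` value is the fallback mentioned on the model card for disabling flash attention.

```python
def build_model_kwargs(use_flash_attention: bool = True) -> dict:
    """Build keyword arguments for AutoModelForCausalLM.from_pretrained.

    Hypothetical helper for illustration; it only assembles a dictionary.
    """
    return {
        "device_map": "cuda",
        "trust_remote_code": True,
        "torch_dtype": "auto",
        # "eager" falls back to the standard attention implementation
        "_attn_implementation": "flash_attention_2" if use_flash_attention else "eager",
    }


kwargs = build_model_kwargs(use_flash_attention=False)
print(kwargs["_attn_implementation"])
```

The resulting dictionary would then be passed as `AutoModelForCausalLM.from_pretrained(model_id, **kwargs)`.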

Model ID	Parameters	Context length	GPU memory usage
microsoft/Phi-3-vision-128k-instruct	4.2 billion	128k tokens	8GB

Loading the model consumes GPU memory, so make sure your GPU has memory to spare.
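The 8GB figure is roughly what a back-of-the-envelope calculation gives for the weights alone. The sketch below assumes the weights are stored in a 16-bit format (2 bytes per parameter), which is an assumption about what `torch_dtype="auto"` resolves to; it ignores activations and other runtime overhead, so actual usage will be higher. The function name is hypothetical.

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate the memory needed for model weights only, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 4.2 billion parameters at an assumed 2 bytes each
weights_gb = estimate_weight_memory_gb(4.2e9)
print(f"{weights_gb:.1f} GB")
```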

STEP 3

Define a function that generates text from an image

Define a function that takes an image and text as input and generates text.

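def textvision(messages, path_image):
    # Open the input image
    image = Image.open(path_image)

    # Turn the chat messages into a single prompt string
    prompt = processor.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Convert the prompt and image to tensors and move them to the GPU
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

    # Generate up to 256 new tokens with greedy decoding
    generate_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=256,
        do_sample=False,
    )

    # Drop the prompt tokens so that only the newly generated tokens remain
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]

    # Decode the generated token IDs into text
    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]

    # Show the image at half size and print the generated text
    resized_image = image.resize((int(image.width * 0.5), int(image.height * 0.5)))
    resized_image.show()
    print(response)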

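Code explanation

Image.open()

Opens the image at the specified path.

processor.tokenizer.apply_chat_template()

Applies the chat template to the messages.

processor(prompt, [image], return_tensors="pt").to("cuda:0")

The processor converts the prompt text and the image into PyTorch tensors, which are then moved to the GPU.

model.generate()

Generates token IDs from the input data.

processor.batch_decode()

Converts the token IDs generated by the model into text.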
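One step in the function is not covered above: between `model.generate()` and `processor.batch_decode()`, the prompt tokens are sliced off. For a causal language model, `generate()` returns the input tokens followed by the newly generated ones, so decoding without slicing would repeat the prompt in the output. The sketch below illustrates the same idea with plain Python lists instead of tensors; the function name and the token IDs are made up for illustration.

```python
def strip_prompt_tokens(sequences, prompt_length):
    """Drop the first prompt_length token IDs from every generated sequence."""
    return [seq[prompt_length:] for seq in sequences]


# Toy token IDs: the first three stand for the prompt, the rest for the answer.
generated = [[101, 102, 103, 7, 8, 9]]
answer_only = strip_prompt_tokens(generated, prompt_length=3)
print(answer_only)  # [[7, 8, 9]]
```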

Generating text from images with Phi-3 Vision

Let's use Phi-3 Vision to generate text from an input image.

Describing an image in English

We run the prompt "Please describe this image." in English.

prompt_input = [
    {"role": "user", "content": "<|image_1|>\n Please describe this image."}
]

path_image = "testimage1.jpg"

textvision(prompt_input, path_image)

Code explanation

[{"role": "user", "content": "<|image_1|>\n Please describe this image."}]

Passes the prompt "Please describe this image." as the user message.

path_image = "testimage1.jpg"

Specifies the image file.
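To make the role of the chat template more concrete, here is a rough sketch of the string that the message list is turned into. It follows the prompt format shown on the model card (`<|user|>`, `<|end|>`, `<|assistant|>`), but `render_chat` is a hypothetical stand-in and not the real template, which ships with the tokenizer and may differ in detail. The `<|image_1|>` placeholder marks where the image is inserted.

```python
def render_chat(messages, add_generation_prompt=True):
    """Approximate the Phi-3 Vision prompt format (assumed, not the real template)."""
    text = ""
    for message in messages:
        # Each turn is wrapped in a role tag and closed with <|end|>
        text += f"<|{message['role']}|>\n{message['content']}<|end|>\n"
    if add_generation_prompt:
        # Cue the model to start writing the assistant's reply
        text += "<|assistant|>\n"
    return text


prompt_input = [
    {"role": "user", "content": "<|image_1|>\n Please describe this image."}
]
print(render_chat(prompt_input))
```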

Prompt: Please describe this image.

Output:

The image features a cozy indoor setting with a wooden table at the center. On the table, there is a white cup of coffee with a heart-shaped foam on top, accompanied by a small potted plant. To the right of the cup, there is a plush toy of a cat with orange fur and white paws. The background shows a window with a view of a street scene, including a building with a red roof and a car parked on the street. The room is softly lit with warm lighting, creating a welcoming atmosphere.

The model describes the image in fine detail!

Describing an image in Japanese

We run the same request in Japanese: "この画像について説明してください。" ("Please describe this image.").

prompt_input = [
    {"role": "user", "content": "<|image_1|>\n この画像について説明してください。"},
    {"role": "assistant", "content": "あなたは画像を日本語で回答するアシスタントです。"}
]

path_image = "testimage3.jpg"

textvision(prompt_input, path_image)

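Code explanation

{"role": "assistant", "content": "あなたは画像を日本語で回答するアシスタントです。"}

This message is intended to set the model's role; its content roughly means "You are an assistant that answers about images in Japanese." Note that the assistant role normally represents a previous model reply rather than an instruction, so the model may not treat this message as an instruction.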

Prompt: この画像について説明してください。 (Please describe this image.)

Output:

この画像では、キャッシュレスターに準備されたキャッシュレスターのシーンが描かれています。キャッシュレスターの中で、キャッシュレスターのキャッシュレスターの中で、キャッシュレスターのキャッシュレスターの中で、キャッシュレスターのキャッシュレスターの中で、キャッシュレスターのキャッシュレスターの中で、キャッシュレスターのキャッシュレスターの中で、キャッシュレスターのキャッシュレスターの中で、キャッシュレスターのキャッシュレスターの中で、キャッシュレスターのキャッシュレスターの中で、キャッシュレスタ

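The Japanese answer breaks down completely: the model repeats the same meaningless phrase until the output is cut off.

Let's generate the same description in English.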

Prompt: Please describe this image.

Output:

The image features a cat dressed in a white chef's hat and a white jacket with black buttons, sitting on a wooden cutting board in a kitchen setting. The cat has a playful expression and is looking directly at the camera. The kitchen is equipped with various cooking utensils such as pans, a rolling pin, and a pot, as well as a range hood and a stove. There are also shelves with jars and spices, and a tomato is visible on the counter.

With English input, the response is highly accurate!

Reading text from an image (OCR)

We send a prompt asking the model to OCR the text in the image and return selected fields as JSON.

prompt_input = [
    {"role": "user", "content": (
        "<|image_1|>\n OCR the text of the image. "
        "Extract the text of the following fields and put it in a JSON format: "
        "'Company', 'Name', 'Title', 'Phone', 'E-mail', 'URL', 'Address'"
    )}
]

path_image = "testocr.jpg"

textvision(prompt_input, path_image)

Prompt: OCR the text of the image. Extract the text of the following fields and put it in a JSON format: 'Company', 'Name', 'Title', 'Phone', 'E-mail', 'URL', 'Address'

Output:

{
    "Company": "Cat Innovation Inc.",
    "Name": "Scottish Foldman",
    "Title": "Manager",
    "Phone": "03-313-2800",
    "E-mail": "f-scottish@catinnovation.co.jp",
    "URL": "https://catinnovation.co.jp",
    "Address": "3-48-1 Nekogaya, Shinjuku-ku, Tokyo 562-2800, Japan"
}

The text is extracted accurately, following the fields requested in the prompt!
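To use the extracted fields in a program, the JSON text has to be parsed. The sketch below is one possible way to do that, assuming the model's response is available as a string (the `textvision` function above only prints it, so it would need to return `response` instead). The helper name `parse_business_card` and the sample string are hypothetical, and a model may not always return valid JSON, so real code should handle parse errors.

```python
import json

EXPECTED_FIELDS = ["Company", "Name", "Title", "Phone", "E-mail", "URL", "Address"]


def parse_business_card(response: str) -> dict:
    """Parse the text between the first '{' and the last '}' as JSON."""
    start = response.find("{")
    end = response.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in the response")
    data = json.loads(response[start:end + 1])
    # Keep only the expected fields; missing ones become None
    return {field: data.get(field) for field in EXPECTED_FIELDS}


# Made-up response text for illustration, with extra text before the JSON
sample = '''Here is the result:
{"Company": "Cat Innovation Inc.", "Name": "Scottish Foldman", "Phone": "03-313-2800"}'''
card = parse_business_card(sample)
print(card["Company"])
print(card["Title"])
```

Slicing from the first `{` to the last `}` tolerates extra text around the JSON, which models sometimes add.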