Building a Multimodal RAG System with CLIP and an LLM

This article explains how to build a retrieval-augmented generation (RAG) system using open-source large multimodal models. The focus is on doing this without LangChain or LlamaIndex, so that no additional framework dependencies are introduced.

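What Is RAG?

In artificial intelligence, retrieval-augmented generation (RAG) is a technique that extends the capabilities of large language models. In essence, RAG makes AI responses more specific by letting the model dynamically retrieve up-to-date information from external sources.

The architecture combines a generative model with a dynamic retrieval step, so the AI can adapt to constantly changing information across domains. Unlike fine-tuning or retraining, RAG is a cost-effective way to supply the model with current, relevant information without changing the model itself.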
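The retrieve-then-generate idea can be sketched in a few lines. This is a minimal illustration, not the pipeline built later in the article: the bag-of-words `embed` function is a toy stand-in for a real encoder such as CLIP, and the names `retrieve` and `build_rag_prompt` are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, documents: list[str]) -> str:
    """Retrieval followed by prompt assembly; the model weights are never touched."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = [
    "Roses are woody perennial plants of the genus Rosa.",
    "Tulips are spring-blooming bulbs of the genus Tulipa.",
]
prompt = build_rag_prompt("Which genus do tulips belong to?", docs)
print(prompt)
```

Updating the system's knowledge then amounts to editing `docs`, which is the property that makes RAG cheaper than retraining.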

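The Role of RAG

1. Improved accuracy and reliability

RAG addresses the unpredictability of large language models (LLMs) by directing them to trusted knowledge sources. This lowers the risk of wrong or outdated information and leads to more accurate, reliable answers.

2. Greater transparency and trust

Generative AI models such as LLMs often lack transparency, which makes their output hard to trust. RAG addresses concerns about bias, reliability, and compliance by giving organizations finer control over the generated text.

3. Fewer hallucinations

LLMs are prone to hallucination: responses that are coherent but inaccurate or fabricated. By grounding answers in trusted sources, RAG reduces the risk of misleading advice in critical sectors.

4. Cost-effective adaptability

RAG offers a cost-effective way to improve AI output without extensive retraining or fine-tuning. Retrieving specific details on demand keeps information current and relevant, so the AI can adapt as information changes.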

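Multimodal Models

Multimodality means taking several kinds of input and combining them to produce a single output. Take CLIP as an example: its training data consists of text-image pairs, and through contrastive learning the model learns which texts and images match.

For different inputs that represent the same thing, the model produces the same (or very similar) embedding vectors.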
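To make the contrastive objective concrete, here is a small pure-Python sketch of a symmetric CLIP-style loss over a batch in which image i is paired with text i. It is an illustration under assumptions, not CLIP's training code: the function names are hypothetical, the embeddings are hand-written two-dimensional vectors, and the temperature of 0.07 is fixed here for simplicity.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image i should match text i and vice versa."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Similarity matrix: row i holds image i scored against every text.
    sims = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    n = len(sims)
    image_to_text = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy([sims[r][j] for r in range(n)], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Matching pairs point in similar directions; swapped pairs do not.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)
```

Minimizing this loss pulls the embeddings of matching pairs together and pushes mismatched pairs apart, which is why a text query can later retrieve images from the same vector space.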

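Multimodal Large Language Models

GPT-4V and Gemini Vision are multimodal large language models (MLLMs), which integrate several data types (images, text, language, audio, and so on). Large language models (LLMs) such as GPT-3, BERT, and RoBERTa excel at text-based tasks but struggle to understand and process other data types. Multimodal models address this limitation by combining modalities, which gives them a more comprehensive understanding of varied data.

Multimodal large language models go beyond traditional text-based methods. Taking GPT-4 as an example, these models can process different data types, including images and text, and so understand information more completely.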

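Combining with RAG

Here we use CLIP to embed images and text and store the embeddings in a ChromaDB vector database. The large model then takes part in the user's chat session, answering on the basis of the retrieved information.

We use images from Kaggle and information from Wikipedia to build a flower-expert chatbot.

First, install the packages: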

    !pip install -q timm einops wikipedia chromadb open_clip_torch
    !pip install -q transformers==4.36.0
    !pip install -q bitsandbytes==0.41.3 accelerate==0.25.0

Preprocessing the data is simple: just put the images and the text files into one folder.

You are free to use any vector database; here we use ChromaDB.

    import chromadb
    from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
    from chromadb.utils.data_loaders import ImageLoader
    from chromadb.config import Settings

    # Persist the database on disk in the "DB" directory
    client = chromadb.PersistentClient(path="DB")

    embedding_function = OpenCLIPEmbeddingFunction()
    image_loader = ImageLoader()  # required if you read images from URIs

ChromaDB can also take a custom embedding function, which follows this skeleton:

    from chromadb import Documents, EmbeddingFunction, Embeddings

    class MyEmbeddingFunction(EmbeddingFunction):
        def __call__(self, input: Documents) -> Embeddings:
            # Embed the documents (or images) with a model of your choice;
            # `embeddings` stands for the vectors that model produces.
            return embeddings

Here we create two collections, one for text and one for images.

    import os

    collection_images = client.create_collection(
        name='multimodal_collection_images',
        embedding_function=embedding_function,
        data_loader=image_loader)

    collection_text = client.create_collection(
        name='multimodal_collection_text',
        embedding_function=embedding_function,
    )

    # Get the images: every file in the folder that is not a .txt file
    IMAGE_FOLDER = '/kaggle/working/all_data'

    image_uris = sorted([os.path.join(IMAGE_FOLDER, image_name)
                         for image_name in os.listdir(IMAGE_FOLDER)
                         if not image_name.endswith('.txt')])
    ids = [str(i) for i in range(len(image_uris))]

    collection_images.add(ids=ids, uris=image_uris)  # now we have the images collection

With CLIP, you can retrieve images using a text query like this:

    from matplotlib import pyplot as plt

    # Retrieve the three images closest to the text "tulip" and display them
    retrieved = collection_images.query(query_texts=["tulip"], include=['data'], n_results=3)
    for img in retrieved['data'][0]:
        plt.imshow(img)
        plt.axis("off")
        plt.show()

You can also use an image to retrieve related images.

The text collection is built as follows:

    # Now the text DB
    from chromadb.utils import embedding_functions

    # Not used below: collection_text was created with the OpenCLIP embedding function
    default_ef = embedding_functions.DefaultEmbeddingFunction()

    text_pth = sorted([os.path.join(IMAGE_FOLDER, image_name)
                       for image_name in os.listdir(IMAGE_FOLDER)
                       if image_name.endswith('.txt')])

    # Read every text file into memory
    list_of_text = []
    for path in text_pth:
        with open(path, 'r') as f:
            list_of_text.append(f.read())

    ids_txt_list = ['id' + str(i) for i in range(len(list_of_text))]

    collection_text.add(
        documents=list_of_text,
        ids=ids_txt_list,
    )

Next, query the text collection above with a text prompt:

    results = collection_text.query(
        query_texts=["What is the bellflower?"],
        n_results=1,
    )
    results

The result is:

 {'ids': [['id0']], 'distances': [[0.6072186183744086]], 'metadatas': [[None]], 'embeddings': None, 'documents': [['Campanula () is the type genus of the Campanulaceae family of flowering plants. Campanula are commonly known as bellflowers and take both their common and scientific names from the bell-shaped flowers—campanula is Latin for "little bell".\nThe genus includes over 500 species and several subspecies, distributed across the temperate and subtropical regions of the Northern Hemisphere, with centers of diversity in the Mediterranean region, Balkans, Caucasus and mountains of western Asia. The range also extends into mountains in tropical regions of Asia and Africa.\nThe species include annual, biennial and perennial plants, and vary in habit from dwarf arctic and alpine species under 5 cm high, to large temperate grassland and woodland species growing to 2 metres (6 ft 7 in) tall.']], 'uris': None, 'data': None}
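Every field in this result is nested one level per query, which is why the later code indexes it with `['documents'][0][0]`. As a sketch, a small helper (hypothetical name `flatten_query_result`, not part of ChromaDB) can turn one query's slice into a flat list of hits. The example dictionary below is an abbreviated copy of the output above, with the document text shortened.

```python
def flatten_query_result(result, query_index=0):
    """Turn one query's slice of a Chroma-style result dict into a list of hits."""
    ids = result["ids"][query_index]
    distances = result["distances"][query_index]
    documents = result["documents"][query_index]
    return [
        {"id": i, "distance": d, "document": doc}
        for i, d, doc in zip(ids, distances, documents)
    ]


# Abbreviated copy of the printed result (document text shortened here).
example = {
    "ids": [["id0"]],
    "distances": [[0.6072186183744086]],
    "metadatas": [[None]],
    "embeddings": None,
    "documents": [["Campanula () is the type genus of the Campanulaceae family ..."]],
    "uris": None,
    "data": None,
}

hits = flatten_query_result(example)
print(hits[0]["id"], round(hits[0]["distance"], 3))
```

The outer index selects the query (only one was sent here) and the inner index selects the ranked hit, so `hits[0]` is the best match for the first query.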

Alternatively, use an image to retrieve text:

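    import numpy as np
    from PIL import Image

    query_image = '/kaggle/input/flowers/flowers/rose/00f6e89a2f949f8165d5222955a5a37d.jpg'
    raw_image = Image.open(query_image)

    # Embed the image itself: the embedding function expects a list of arrays,
    # whereas a bare path string would be treated as text.
    doc = collection_text.query(
        query_embeddings=embedding_function([np.array(raw_image)]),
        n_results=1,
    )['documents'][0][0]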

For the rose image queried above, the retrieved text is:

 A rose is either a woody perennial flowering plant of the genus Rosa (), in the family Rosaceae (), or the flower it bears. There are over three hundred species and tens of thousands of cultivars. They form a group of plants that can be erect shrubs, climbing, or trailing, with stems that are often armed with sharp prickles. Their flowers vary in size and shape and are usually large and showy, in colours ranging from white through yellows and reds. Most species are native to Asia, with smaller numbers native to Europe, North America, and northwestern Africa. Species, cultivars and hybrids are all widely grown for their beauty and often are fragrant. Roses have acquired cultural significance in many societies. Rose plants range in size from compact, miniature roses, to climbers that can reach seven meters in height. Different species hybridize easily, and this has been used in the development of the wide range of garden roses.

Text-image matching is now done. In fact, this is all the work CLIP does here. Next, we add the LLM.

    from huggingface_hub import hf_hub_download

    # Download the custom model code that ships with the checkpoint
    for filename in [
        "configuration_llava.py",
        "configuration_phi.py",
        "modeling_llava.py",
        "modeling_phi.py",
        "processing_llava.py",
    ]:
        hf_hub_download(
            repo_id="visheratin/LLaVA-3b",
            filename=filename,
            local_dir="./",
            force_download=True,
        )

We use the visheratin/LLaVA-3b model.

    from modeling_llava import LlavaForConditionalGeneration
    import torch

    # Load the weights and move the model to the GPU
    model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b")
    model = model.to("cuda")

Load the tokenizer.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")

Next, define the processor so that it can be called later.

    from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

    image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
    processor = LlavaProcessor(image_processor, tokenizer)

The pieces can now be used together directly:

    import numpy as np
    from PIL import Image
    from matplotlib import pyplot as plt

    question = 'Answer with organized answers: What type of rose is in the picture? Mention some of its characteristics and how to take care of it ?'
    query_image = '/kaggle/input/flowers/flowers/rose/00f6e89a2f949f8165d5222955a5a37d.jpg'
    raw_image = Image.open(query_image)

    # Retrieve the article closest to the query image (embed the image array,
    # since a bare path string would be treated as text).
    doc = collection_text.query(
        query_embeddings=embedding_function([np.array(raw_image)]),
        n_results=1,
    )['documents'][0][0]

    plt.imshow(raw_image)
    plt.show()

    # Retrieve similar images; the first hit is skipped
    imgs = collection_images.query(query_uris=query_image, include=['data'], n_results=3)
    for img in imgs['data'][0][1:]:
        plt.imshow(img)
        plt.axis("off")
        plt.show()

The result is:

The result also contains most of the information we need.

The integration is now complete. The last step is to create the chat template.

    # `question` and `doc` come from the previous step
    prompt = """<|im_start|>system
    A chat between a curious human and an artificial intelligence assistant. The assistant is an expert in flowers, and gives helpful, detailed, and polite answers to the human's questions. The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
    <|im_start|>user
    <image>
    {question} Use the following article as an answer source. Do not write outside its scope unless you find your answer better {article} if you think your answer is better add it after document.<|im_end|>
    <|im_start|>assistant
    """.format(question=question, article=doc)
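Because the same template is filled in again for every new question and retrieved article, it can help to wrap it in a function. The sketch below is one way to do that, not the author's code: `build_flower_prompt` and `SYSTEM_MESSAGE` are hypothetical names, and the example question and article are shortened placeholders.

```python
SYSTEM_MESSAGE = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant is an expert in flowers, and gives helpful, detailed, and polite "
    "answers to the human's questions. The assistant does not hallucinate and pays "
    "very close attention to the details."
)


def build_flower_prompt(question: str, article: str) -> str:
    """Fill the ChatML-style template with one question and one retrieved article."""
    user_turn = (
        f"{question} Use the following article as an answer source. "
        "Do not write outside its scope unless you find your answer better "
        f"{article} if you think your answer is better add it after document."
    )
    return (
        f"<|im_start|>system\n{SYSTEM_MESSAGE}<|im_end|>\n"
        f"<|im_start|>user\n<image>\n{user_turn}<|im_end|>\n"
        "<|im_start|>assistant\n"  # left open so the model writes the reply
    )


example_prompt = build_flower_prompt(
    question="What type of rose is in the picture?",
    article="A rose is a woody perennial flowering plant of the genus Rosa.",
)
print(example_prompt)
```

The prompt ends with an open assistant turn, so whatever the model generates next is read as its answer.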

We will not go into detail here about how to build the chat process. The complete code is available at:

https://github.com/nadsoft-opensource/RAG-with-open-source-multi-modal
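Independently of the repository's implementation, one possible shape for a chat loop is sketched below: a small class that keeps the conversation history and renders it as a ChatML-style prompt on each turn. Everything here is an assumption made for illustration. `ChatSession` is a hypothetical name, and the `generate` argument is a hook where the real call to the processor and model would go; a stand-in function is used so the sketch runs without a GPU.

```python
from typing import Callable, List, Tuple


class ChatSession:
    """Keeps the conversation history and renders it as a ChatML-style prompt."""

    def __init__(self, system_message: str, generate: Callable[[str], str]):
        self.system_message = system_message
        self.generate = generate  # hypothetical hook: prompt in, reply text out
        self.turns: List[Tuple[str, str]] = []  # (role, content) pairs

    def render(self) -> str:
        """Build the full prompt from the system message and all turns so far."""
        parts = [f"<|im_start|>system\n{self.system_message}<|im_end|>\n"]
        for role, content in self.turns:
            parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
        parts.append("<|im_start|>assistant\n")  # leave the assistant turn open
        return "".join(parts)

    def ask(self, user_message: str) -> str:
        """Record the user turn, generate a reply, and record that too."""
        self.turns.append(("user", user_message))
        reply = self.generate(self.render())
        self.turns.append(("assistant", reply))
        return reply


# Stand-in generator so the sketch runs anywhere; swap in the real model call.
def echo_model(prompt: str) -> str:
    return f"(reply to a prompt of {prompt.count('<|im_start|>user')} user turn(s))"


session = ChatSession("The assistant is an expert in flowers.", echo_model)
session.ask("<image>\nWhat type of rose is in the picture?")
print(session.ask("How should I water it?"))
```

Passing the generator in as a function keeps the history handling separate from the model, so the same loop works whether the reply comes from LLaVA-3b or from a test stub.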
