ディープラーニングモデルの圧縮と加速モデル推論

導入

機械学習モデルを本番環境にデプロイする場合、モデルのプロトタイプ作成フェーズでは考慮されていなかった要件を満たす必要があることがよくあります。たとえば、本番環境で使用されるモデルでは、さまざまなユーザーからの多数のリクエストを処理する必要があります。したがって、レイテンシやスループットを低く抑えるように最適化する必要があります。

レイテンシ: リンクをクリックしてから Web ページが読み込まれるまでにかかる時間など、タスクが完了するまでにかかる時間です。タスクを開始してから結果が表示されるまでの待機時間です。
スループット: システムが一定期間内に処理できるリクエストの数。

これは、機械学習モデルが予測を行う際に非常に高速でなければならないことを意味します。この目的のために、モデル推論の速度を上げるためのさまざまな手法があります。この記事では、最も重要なもののいくつかを紹介します。

モデル圧縮

モデルを小さくすることを目的とした技術もあり、それらはモデル圧縮技術と呼ばれていますが、推論段階でモデルを高速化することに重点を置いているものもあり、それらはモデル最適化の分野に分類されます。しかし、モデルを小さくすると推論速度が向上する場合が多いため、これら 2 つの研究領域の境界線は非常に曖昧です。

1. 低ランク分解

これは私たちが初めて目にしたアプローチであり、広範囲に研究されており、実際、最近これに関する論文が数多く発表されています。

基本的な考え方は、ニューラルネットワークの行列 (ネットワークの層を表す行列) を低次元行列 (ただし、2 次元を超える行列がよく使用されるため、テンソルと言った方が正確です) に置き換えることです。この方法では、ネットワークパラメータの数を減らし、推論速度を向上させます。

簡単な例としては、CNN ネットワークで 3x3 畳み込みを 1x1 畳み込みに置き換えることが挙げられます。この技術は、SqueezeNet などのネットワーク構造で使用されます。

最近では、同様のアイデアが、限られたリソースで大規模な言語モデルを微調整できるようにするなど、他の目的にも適用されています。事前トレーニング済みモデルを下流のタスク用に微調整する場合でも、事前トレーニング済みモデルのすべてのパラメータでモデルをトレーニングする必要があり、コストが非常に高くなる可能性があります。

そのため、「大規模言語モデルの低ランク適応」（または LoRA）と呼ばれる方法のアイデアは、元のモデルを、より次元の小さい小さな行列（行列分解を使用）に置き換えることです。このように、事前トレーニング済みモデルをより下流のタスクに適応させるには、これらの新しいマトリックスのみを再トレーニングする必要があります。

写真

LoRA における行列分解

それでは、Hugging Face の PEFT ライブラリを使用して LoRA を微調整する方法を見てみましょう。 LoRA を使用して bigscience/mt0-large を微調整するとします。まず、必要なものをインポートしていることを確認する必要があります。

 !pip install peft !pip install transformers

 from transformers import AutoModelForSeq2SeqLM from peft import get_peft_model, LoraConfig, TaskType model_name_or_path = "bigscience/mt0-large" tokenizer_name_or_path = "bigscience/mt0-large"

次の手順では、微調整中に LoRA に適用される構成を作成します。

 peft_config = LoraConfig( task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1 )

次に、Transformers ライブラリの基本モデルと LoRA 用に作成した構成オブジェクトを使用してモデルをインスタンス化します。

 model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path) model = get_peft_model(model, peft_config) model.print_trainable_parameters()

2. 知識の蒸留

これは、「小さな」モデルを生産に投入できるようにするもう 1 つのアプローチです。アイデアとしては、教師と呼ばれる大きなモデルと、生徒と呼ばれる小さなモデルを用意し、教師の知識を使用して生徒に予測を行う方法を教えるというものです。この方法により、学生だけを本番環境に配置できるようになります。

このアプローチの典型的な例は、このようにして開発されたモデル、つまり BERT の学生モデルである DistillBERT です。 DistilBERT は BERT より 40% 小さくなりますが、言語理解は 97% 維持され、推論速度は 60% 高速になります。このアプローチの欠点の 1 つは、生徒をトレーニングするために依然として大規模な教師モデルが必要であり、教師のようなモデルをトレーニングするためのリソースが不足する可能性があることです。

Python で知識蒸留を行う方法の簡単な例を見てみましょう。理解すべき重要な概念は、カルバック・ライブラー・ダイバージェンスです。これは、2 つの分布の違いを理解するために使用される数学的概念です。実際、私たちのケースでは、2 つのモデルの予測の違いを理解したいので、トレーニングの損失関数はこの数学的概念に基づきます。

 import tensorflow as tf from tensorflow.keras import layers, models from tensorflow.keras.datasets import mnist from tensorflow.keras.utils import to_categorical import numpy as np # Load the MNIST dataset (train_images, train_labels), (test_images, test_labels) = mnist.load_data() # Preprocess the data train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255 test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255 train_labels = to_categorical(train_labels) test_labels = to_categorical(test_labels) # Define the teacher model (a larger model) teacher_model = models.Sequential([ layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)), layers.MaxPooling2D((2, 2)), layers.Conv2D(64, (3, 3), activation='relu'), layers.MaxPooling2D((2, 2)), layers.Conv2D(64, (3, 3), activation='relu'), layers.Flatten(), layers.Dense(64, activation='relu'), layers.Dense(10, activation='softmax') ]) teacher_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) # Train the teacher model teacher_model.fit(train_images, train_labels, epochs=5, batch_size=64, validation_split=0.2) # Define the student model (a smaller model) student_model = models.Sequential([ layers.Flatten(input_shape=(28, 28, 1)), layers.Dense(64, activation='relu'), layers.Dense(10, activation='softmax') ]) student_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) # Knowledge distillation step: Transfer knowledge from the teacher to the student def distillation_loss(y_true, y_pred): alpha = 0.1 # Temperature parameter (adjust as needed) return tf.keras.losses.KLDivergence()(tf.nn.softmax(y_true / alpha, axis=1), tf.nn.softmax(y_pred / alpha, axis=1)) # Train the student model using knowledge distillation student_model.fit(train_images, train_labels, epochs=10, batch_size=64, validation_split=0.2, loss=distillation_loss) # Evaluate the student model test_loss, test_acc = student_model.evaluate(test_images, test_labels) print(f'Test accuracy: {test_acc * 100:.2f}%')

3. 剪定

プルーニングは私が大学院の論文で研究したモデル圧縮手法であり、実際、私は以前、Julia でプルーニングを実装する方法について「Julia での人工ニューラルネットワークの反復プルーニング手法」という記事を公開しました。

剪定は、決定木の過剰適合問題を解決するために生まれました。実際、剪定は木の枝を切り落とすことで木の深さを減らします。この概念は後にニューラルネットワークで使用され、ネットワーク内のエッジやノードが削除されます (非構造化プルーニングまたは構造化プルーニングのどちらを実行するかによって異なります)。

ネットワークからノード全体を削除すると、レイヤーを表すマトリックスが小さくなるため、モデルも小さくなり、速度も速くなります。代わりに、単一のエッジを削除すると、行列のサイズは同じままですが、削除されたエッジの位置にゼロが配置されるため、非常にスパースな行列が得られます。したがって、非構造化プルーニングの利点は速度の向上ではなくメモリにあります。これは、スパースマトリックスをメモリに格納すると、密なマトリックスを格納する場合よりもスペースが少なくなるためです。

しかし、どのノードまたはエッジを削除すればよいのでしょうか?これらは通常、最も不要なノードまたはエッジです。次の 2 つの論文を研究することをお勧めします: 「Optimal Brain Damage」および「Optimal Brain Surgeon and general network pruning」。

単純な MNIST モデルでプルーニングを実装する方法を示す Python スクリプトを見てみましょう。

 import tensorflow as tf from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense from tensorflow.keras.datasets import mnist from tensorflow.keras.utils import to_categorical from tensorflow_model_optimization.sparsity import keras as sparsity import numpy as np # Load the MNIST dataset (train_images, train_labels), (test_images, test_labels) = mnist.load_data() # Preprocess the data train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255 test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255 train_labels = to_categorical(train_labels) test_labels = to_categorical(test_labels) # Create a simple neural network model def create_model(): model = Sequential([ tf.keras.layers.Flatten(input_shape=(28, 28, 1)), tf.keras.layers.Dense(128, activation='relu'), tf.keras.layers.Dropout(0.2), tf.keras.layers.Dense(64, activation='relu'), tf.keras.layers.Dropout(0.2), tf.keras.layers.Dense(10, activation='softmax') ]) return model # Create and compile the original model model = create_model() model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) # Train the original model model.fit(train_images, train_labels, epochs=5, batch_size=64, validation_split=0.2) # Prune the model # Specify the pruning parameters pruning_params = { 'pruning_schedule': sparsity.PolynomialDecay(initial_sparsity=0.50, final_sparsity=0.90, begin_step=0, end_step=2000, frequency=100) } # Create a pruned model pruned_model = sparsity.prune_low_magnitude(create_model(), **pruning_params) # Compile the pruned model pruned_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) # Train the pruned model (fine-tuning) pruned_model.fit(train_images, train_labels, epochs=2, batch_size=64, validation_split=0.2) # Strip pruning wrappers to create a smaller and faster model final_model = sparsity.strip_pruning(pruned_model) # Evaluate the final pruned model test_loss, test_acc = final_model.evaluate(test_images, test_labels) print(f'Test accuracy after pruning: {test_acc * 100:.2f}%')

定量化

量子化はおそらく現在最も広く使用されている圧縮技術であると言っても過言ではないと思います。繰り返しますが、基本的な考え方は単純です。通常、ニューラルネットワークのパラメータを表すには 32 ビットの浮動小数点数を使用します。しかし、より低い精度の数値を使用した場合はどうなるでしょうか? 16 ビット、8 ビット、4 ビット、または 1 ビットを使用してバイナリネットワークを構築できます。

それはどういう意味ですか？精度の低い数値を使用すると、モデルは軽量かつ小型になりますが、精度も低下し、元のモデルよりも近似した結果が得られます。これは、ネットワークのサイズを大幅に縮小できるため、特にスマートフォンなどの特殊なハードウェア上のエッジデバイスに展開する必要がある場合によく使用される手法です。 TensorFlow Lite、PyTorch、TensorRT など、多くのフレームワークでは量子化を簡単に適用できます。

量子化はトレーニング前に適用できるため、パラメータが特定の範囲内の値しか取れないネットワークを直接切り捨てることも、トレーニング後に適用できるため、パラメータの値を丸めることになります。ここでは、Python で量子化を適用する方法についてもう一度簡単に説明します。

 import tensorflow as tf from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense, Flatten, Dropout from tensorflow.keras.datasets import mnist from tensorflow.keras.utils import to_categorical import numpy as np # Load the MNIST dataset (train_images, train_labels), (test_images, test_labels) = mnist.load_data() # Preprocess the data train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255 test_images = test_images.reshape((10000, 28, 28, 1)).astype('float32') / 255 train_labels = to_categorical(train_labels) test_labels = to_categorical(test_labels) # Create a simple neural network model def create_model(): model = Sequential([ Flatten(input_shape=(28, 28, 1)), Dense(128, activation='relu'), Dropout(0.2), Dense(64, activation='relu'), Dropout(0.2), Dense(10, activation='softmax') ]) return model # Create and compile the original model model = create_model() model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) # Train the original model model.fit(train_images, train_labels, epochs=5, batch_size=64, validation_split=0.2) # Quantize the model to 8-bit integers converter = tf.lite.TFLiteConverter.from_keras_model(model) converter.optimizations = [tf.lite.Optimize.DEFAULT] quantized_model = converter.convert() # Save the quantized model to a file with open('quantized_model.tflite', 'wb') as f: f.write(quantized_model) # Load the quantized model for inference interpreter = tf.lite.Interpreter(model_path='quantized_model.tflite') interpreter.allocate_tensors() # Evaluate the quantized model test_loss, test_acc = 0.0, 0.0 for i in range(len(test_images)): input_data = np.array([test_images[i]], dtype=np.float32) interpreter.set_tensor(interpreter.get_input_details()[0]['index'], input_data) interpreter.invoke() output_data = interpreter.get_tensor(interpreter.get_output_details()[0]['index']) test_loss += tf.keras.losses.categorical_crossentropy(test_labels[i], output_data).numpy() test_acc += np.argmax(test_labels[i]) == np.argmax(output_data) test_loss /= len(test_images) test_acc /= len(test_images) print(f'Test accuracy after quantization: {test_acc * 100:.2f}%')

結論は

この記事では、実稼働中のモデルにとって重要な要件となる可能性があるモデル推論フェーズを高速化するためのいくつかのモデル圧縮方法について説明しました。特に、低ランク分解、知識蒸留、プルーニング、量子化などの手法に焦点を当て、基本的な考え方を説明し、Python での簡単な実装を示します。モデル圧縮は、スマートフォンなど、リソース (RAM、GPU など) が限られた特定のハードウェアにモデルを展開する場合にも非常に役立ちます。

<<: ついに誰かがユーザー分析の方法論を徹底的に説明しました

>>: 業界関係者が語るウルトラマン解雇：業界にとっては大激震だが、AI開発の全体的な動向には影響しない