nb.o's Diary: Post-training quantization with Keras in TF2.0

Corrections

(2019.11.25) Fixed mistakes in the representative_data_gen code and its explanation for Full Integer quant, as pointed out by @PINTO03091 (thank you).

Purpose

Now that TensorFlow 2.0 has been released, I want to:

Convert a Keras model to a TF-Lite model with post-training quantization.
Try converting to the various kinds of TF-Lite quantization models.
Check the characteristics of each TF-Lite quantization model.

Motivation

Earlier, on TF-2.0rc1, I made a notebook that fine-tunes tf.keras's MobileNet v2 and applies post-training quantization.
Now that TF2.0 is out, I decided to use that notebook as a base, convert the models, and compare the various TF-Lite models.

I made a notebook that fine-tunes tf.keras's MobileNet v2 on TF2.0rc1 and applies post-training quantization, so I'm publishing it.
Runs on Google Colab.
・Weight quantization
・Float16 quantization
・Integer quantization
・Full integer quantization -> Edge TPU Model https://t.co/18htw5SgFs
— nb.o (@Nextremer_nb_o) September 22, 2019

References

TensorFlow Lite documentation

TF2.0 binaries (@PINTO03091's TensorFlowLite-bin repository)

PINTO0309/TensorflowLite-bin (GitHub)

Code for training and inference

keras-post-training-quantization.ipynb

Version information

TensorFlow: 2.0.0
Edge TPU Compiler version 2.0.267685300
Edge TPU runtime and Python API library: 2.12.1 (September 2019)
Raspbian: 10.1
Jetson Nano: JetPack 4.2.2

Environment

Models are built and trained on Google Colab.
Inference is run on Google Colab (CPU, GPU), Raspberry Pi 3 B+, and Jetson Nano.
For TF Lite 2.0 on the Raspberry Pi 3 B+ and Jetson Nano, I use @PINTO03091's TensorflowLite-bin.
Note that Python 3 on the Jetson Nano is version 3.6, so I set up 3.7 with pyenv.

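Creating and training the Keras models

The source is based on keras-post-training-quantization.ipynb.
The training data is the tf_flowers dataset from tensorflow_datasets (tfds).

Keras models tried

This time I tried the following four image classification models. There is no deep meaning behind the choice.

Original CNN
MobileNet v2 1.0
Inception v3
DenseNet121

(ResNet is missing because val_accuracy stayed close to random even after fine-tuning, so I gave up... I wonder why. Too little data?)

Original CNN model

A model built simply with Sequential (a Full Integer quant model and an Edge TPU model can be generated even with Dropout included). IMG_SIZE is 112.
(With the size set to 224, it could no longer be converted to an Edge TPU model... Is the Edge TPU Compiler sensitive to the model size?)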

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), padding='same',
                           input_shape=(IMG_SIZE, IMG_SIZE, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2), padding='same'),
    tf.keras.layers.Dropout(0.25),

    tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2), padding='same'),
    tf.keras.layers.Dropout(0.25),

    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(info.features['label'].num_classes, activation='softmax')
])

MobileNet v2

Fine-tune the pre-trained model. IMG_SIZE is 224.

base_model = MobileNetV2(include_top=False,
                         weights='imagenet',
                         input_shape=(IMG_SIZE, IMG_SIZE, 3))

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(info.features['label'].num_classes, activation='softmax')
])

Inception v3

Same as above. IMG_SIZE is 229.

base_model = InceptionV3(include_top=False,
                         weights='imagenet',
                         input_shape=(IMG_SIZE, IMG_SIZE, 3))

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(info.features['label'].num_classes, activation='softmax')
])

DenseNet121

I chose DenseNet because I had never run it on the Edge TPU. IMG_SIZE is 224.

base_model = DenseNet121(include_top=False,
                         weights='imagenet',
                         input_shape=(IMG_SIZE, IMG_SIZE, 3))

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(info.features['label'].num_classes, activation='softmax')
])

Saving the model

The trained model is saved in h5 format.

# Save keras model
model.save(os.path.join(models_dir, 'xxx.h5'))

Converting a Keras model to a TF-Lite model

There are two kinds of TF-Lite model:

A TF-Lite model that is not quantized
A quantized model

Quantization shrinks the model further than the plain TF-Lite model.

A quantized model can be created either by post-training quantization or by quantization-aware training. TF2.0 supports only post-training quantization, so this post deals with post-training quantization.

Post-training quantization can in turn produce the following four kinds of model:

Weight quantization model
Float16 quantization model
Integer quantization model
Full integer quantization model

In addition, an Edge TPU model that runs on the Edge TPU can be created from a Full integer quant model (or an Integer quant model).

Converting to a TF-Lite model with the Python API

The saved Keras model is converted to a TF-Lite model with the TensorFlow Lite converter Python API.

In TF2.0,
use from_keras_model to get a tf.lite.TFLiteConverter from a Keras model (loaded from the h5 file), then convert.
In TF1.x,
use from_keras_model_file to load the Keras model from a file and get a tf.lite.TFLiteConverter, then convert.

TF-Lite Model

Load the saved Keras model with load_model, get the converter with from_keras_model, and convert with convert.
No particular parameters are needed for the conversion.

loaded_model = tf.keras.models.load_model(os.path.join(models_dir, 'mobilenet_v2.h5'))
converter = tf.lite.TFLiteConverter.from_keras_model(loaded_model)

tflite_model = converter.convert()

tflite_file = models_dir/'mobilenet_v2.tflite'
tflite_file.write_bytes(tflite_model)

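Weight quantization Model

Also known as "hybrid" quantization.

Only the weights are quantized; inference is performed with floating-point arithmetic.

Quantization reduces the model size to about 1/4.
The loss of inference accuracy is small, roughly on par with the original model.
Inference also gets faster.
Intended for mobile (Android, iOS) or server (cloud) execution?

At conversion time, set the optimizations flag to OPTIMIZE_FOR_SIZE.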

loaded_model = tf.keras.models.load_model(os.path.join(models_dir, 'mobilenet_v2.h5'))
converter = tf.lite.TFLiteConverter.from_keras_model(loaded_model)

converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_weight_quant_model = converter.convert()

tflite_weight_model_quant_file = models_dir/'mobilenet_v2_weight_quant.tflite'
tflite_weight_model_quant_file.write_bytes(tflite_weight_quant_model)
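
To see where the "about 1/4" comes from, here is a minimal NumPy sketch of 8-bit weight quantization. It is only an illustration under my own assumptions (a symmetric per-tensor scale and random stand-in weights), not the algorithm the TF-Lite converter actually uses; the function names quantize_weights and dequantize_weights are hypothetical.

```python
import numpy as np

def quantize_weights(w):
    """Symmetric per-tensor int8 quantization (illustrative only)."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_weights(q, scale):
    """Restore approximate float32 weights for floating-point inference."""
    return q.astype(np.float32) * scale

# Random stand-in for a Dense layer's float32 weights.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(512, 256)).astype(np.float32)

q_weights, scale = quantize_weights(weights)
restored = dequantize_weights(q_weights, scale)

size_ratio = q_weights.nbytes / weights.nbytes
max_error = float(np.max(np.abs(weights - restored)))
print(f"size ratio: {size_ratio:.2f}, max abs error: {max_error:.6f}")
```

Each float32 value (4 bytes) becomes one int8 value (1 byte), and the rounding error per weight is bounded by half the scale, which is why the accuracy loss is normally expected to be small.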

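Float16 quantization Model

This method quantizes the weights to Float16.

The model size becomes about 1/2.
Can be used with the GPU delegate.
Inference gets faster on GPUs such as those in Android and iOS devices.
Can also run on the CPU.
Intended for mobile execution.

At conversion time, set the supported_types flag to tf.float16.

※ In TF1.x, supported_types is set to tf.lite.constants.FLOAT16. tf.lite.constants.FLOAT16 was removed in TF2.0, so specifying tf.float16 should be fine, I think?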

loaded_model = tf.keras.models.load_model(os.path.join(models_dir, 'mobilenet_v2.h5'))
converter = tf.lite.TFLiteConverter.from_keras_model(loaded_model)

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

tflite_fp16_quant_model = converter.convert()

tflite_fp16_model_quant_file = models_dir/'mobilenet_v2_fp16_quant.tflite'
tflite_fp16_model_quant_file.write_bytes(tflite_fp16_quant_model)
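
As a rough illustration of the "about 1/2" figure, the following NumPy sketch casts float32 weights to float16 and back. The weights are random stand-ins, and this only mimics the storage format, not what the converter does internally.

```python
import numpy as np

# Random stand-in for a layer's float32 weights.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(512, 256)).astype(np.float32)

# Store the weights as float16; cast back to float32 for CPU arithmetic.
weights_fp16 = weights.astype(np.float16)
restored = weights_fp16.astype(np.float32)

size_ratio = weights_fp16.nbytes / weights.nbytes
max_abs_error = float(np.max(np.abs(weights - restored)))
print(f"size ratio: {size_ratio:.2f}, max abs error: {max_abs_error:.6f}")
```

The error introduced by the cast is tiny compared with typical weight magnitudes, which fits the observation that Float16 quant models keep almost the same accuracy as the Keras model.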

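Integer quantization Model

Full integer quantization of the weights and activations.

The model size becomes about 1/4.
Lower memory usage and faster inference.
The input and output are Float.
(It can be run through the same interface as the TF-Lite, Weight quant, and Float16 quant models.)
Intended for mobile and edge (ARM CPU) execution.

At conversion time, set the optimizations flag to DEFAULT.

Calibration of the input (working out what range the input data can take) is required. Set representative_dataset to a generator that returns input data.

In the code below, representative_data_gen is the generator that returns input data.
Here, ~~note that because the data is scaled by 1/255 during training, it is converted back to the original pixel range of 0-255.~~ feed data in the same range as during training (0.0-1.0). This determines the range the values can take when mapped back to UINT8 (0-255).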

def representative_data_gen():
  for batch in test.take(255):
    yield [batch[0]]

loaded_model = tf.keras.models.load_model(os.path.join(models_dir, 'mobilenet_v2.h5'))
converter = tf.lite.TFLiteConverter.from_keras_model(loaded_model)

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen

tflite_full_integer_quant_model = converter.convert()

tflite_full_integer_model_quant_file = models_dir/'mobilenet_v2_integer_quant.tflite'
tflite_full_integer_model_quant_file.write_bytes(tflite_full_integer_quant_model)
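
What the calibration step conceptually does with the representative data can be sketched in NumPy as follows. This is a simplified model under my own assumptions (plain min/max tracking and an asymmetric uint8 mapping); the real TF-Lite calibrator may differ, and calibrate_range and affine_params are hypothetical names.

```python
import numpy as np

def calibrate_range(batches):
    """Track the min/max seen over all representative batches."""
    lo, hi = np.inf, -np.inf
    for batch in batches:
        lo = min(lo, float(np.min(batch)))
        hi = max(hi, float(np.max(batch)))
    # The range must contain 0.0 so that zero is exactly representable.
    return min(lo, 0.0), max(hi, 0.0)

def affine_params(lo, hi, qmin=0, qmax=255):
    """Derive (scale, zero_point) mapping [lo, hi] onto [qmin, qmax]."""
    if hi == lo:
        return 1.0, qmin
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

# Stand-in for the representative dataset: images normalized to 0.0-1.0,
# i.e. the same range the model saw during training.
rng = np.random.default_rng(0)
batches = [rng.uniform(0.0, 1.0, size=(1, 224, 224, 3)).astype(np.float32)
           for _ in range(8)]

lo, hi = calibrate_range(batches)
scale, zero_point = affine_params(lo, hi)
print(f"range: [{lo}, {hi}], scale: {scale:.6f}, zero_point: {zero_point}")
```

With 0.0-1.0 inputs the scale comes out close to 1/255, so a uint8 pixel value maps straight onto the range used in training. Feeding 0-255 data into the generator instead would calibrate the model for a range it was never trained on.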

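Full integer quantization Model

Same as the Integer quant model, but fully integer-quantized.
Both the input and the output of the model are integers. Because of this, the inference interface differs from the other models.

Conversion is basically the same as for the Integer quant model; additionally setting inference_input_type and inference_output_type to tf.uint8 produces a Full integer quant model.

Note that inference_input_type and inference_output_type were removed in TF2.0, so the TF2.0 interface does not produce a Full integer quant model (you end up with an Integer quantization model). For this reason, the model is loaded and converted with tf.compat.v1.lite.TFLiteConverter.from_keras_model_file.

At first I did not notice that the interface had been removed and was stuck on this for a while (I even thought it was a bug).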

def representative_data_gen():
  for batch in test.take(255):
    yield [batch[0]]

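# inference_input_type / inference_output_type only take effect with the TF1.x converter,
# so the h5 file is loaded through tf.compat.v1 as described in the text.
converter = tf.compat.v1.lite.TFLiteConverter.from_keras_model_file(
    os.path.join(models_dir, 'mobilenet_v2.h5'))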

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
converter.representative_dataset = representative_data_gen

tflite_full_integer_quant_model = converter.convert()

tflite_full_integer_model_quant_file = models_dir/'mobilenet_v2_full_integer_quant.tflite'
tflite_full_integer_model_quant_file.write_bytes(tflite_full_integer_quant_model)

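Edge TPU model

A Full Integer quant model can be converted to an Edge TPU model.

Since the September 2019 Updates, conversion from an Integer quant model is also possible.

The release notes say that conversion from post-training quantized Keras models was improved, so I assume this is part of that.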

"We've released a minor update to the Edge TPU Compiler (version 2.0.2xx) with improved support for post-training quantization—especially those built with Keras"

(Excerpt from Coral's September 2019 Updates)

With Edge TPU Compiler version 2.0.267685300, conversion from an Integer quant model works. However, the edgetpu.classification.engine interface cannot be used with the result. For this reason, this post only uses models converted from the Full Integer quant model.

What?
I just looked and the edgetpu_compiler version has gone up slightly, and a model from TFLiteConverter.from_keras_model() actually converted? Even though it is not full integer quantization?
And this model is being offloaded to the CPU?
(With full integer, MobileNet v2 has no CPU offload) pic.twitter.com/MatiI34Pgx
— nb.o (@Nextremer_nb_o) September 22, 2019

When an Integer quant model is converted, two ops are offloaded to the CPU.
These appear to be the ops that convert the Float input and output to and from Int.

QUANTIZE
DEQUANTIZE
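
To make the role of these two ops concrete, here is a small NumPy sketch of the float-to-uint8 and uint8-to-float conversions. The scale and zero point are assumed values for a 0.0-1.0 input, not read from the actual model, and the function names are hypothetical.

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Float32 -> uint8, the role of a QUANTIZE op at the model input."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """uint8 -> float32, the role of a DEQUANTIZE op at the model output."""
    return (q.astype(np.float32) - zero_point) * scale

scale, zero_point = 1.0 / 255.0, 0   # assumed parameters for a 0.0-1.0 input
x = np.array([0.0, 0.2, 0.6, 1.0], dtype=np.float32)

q = quantize(x, scale, zero_point)
x_restored = dequantize(q, scale, zero_point)
print(q, x_restored)
```

In the Integer quant model these conversions sit at the float input and output, which would explain why exactly those two ops stay on the CPU while everything in between is mapped to the Edge TPU.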

edgetpu_compiler -s --out_dir /content/models /content/models/mobilenet_v2_integer_quant.tflite
Edge TPU Compiler version 2.0.267685300

Model compiled successfully in 307 ms.

Input model: /content/models/mobilenet_v2_integer_quant.tflite
Input size: 2.58MiB
Output model: /content/models/mobilenet_v2_integer_quant_edgetpu.tflite
Output size: 2.77MiB
On-chip memory available for caching model parameters: 6.91MiB
On-chip memory used for caching model parameters: 2.71MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 72
Operation log: /content/models/mobilenet_v2_integer_quant_edgetpu.log

Model successfully compiled but not all operations are supported by the Edge TPU. A percentage of the model will instead run on the CPU, which is slower. If possible, consider updating your model to use only operations supported by the Edge TPU. For details, visit g.co/coral/model-reqs.
Number of operations that will run on Edge TPU: 70
Number of operations that will run on CPU: 2

Operator                       Count      Status

ADD                            10         Mapped to Edge TPU
QUANTIZE                       1          Operation is otherwise supported, but not mapped due to some unspecified limitation
PAD                            5          Mapped to Edge TPU
CONV_2D                        35         Mapped to Edge TPU
DEPTHWISE_CONV_2D              17         Mapped to Edge TPU
DEQUANTIZE                     1          Operation is working on an unsupported data type
MEAN                           1          Mapped to Edge TPU
FULLY_CONNECTED                1          Mapped to Edge TPU
SOFTMAX                        1          Mapped to Edge TPU

By contrast, when a Full Integer quant model is converted, no ops are offloaded to the CPU.

edgetpu_compiler -s --out_dir /content/models /content/models/mobilenet_v2_full_integer_quant1.tflite
Edge TPU Compiler version 2.0.267685300

Model compiled successfully in 300 ms.

Input model: /content/models/mobilenet_v2_full_integer_quant1.tflite
Input size: 2.58MiB
Output model: /content/models/mobilenet_v2_full_integer_quant1_edgetpu.tflite
Output size: 2.77MiB
On-chip memory available for caching model parameters: 6.91MiB
On-chip memory used for caching model parameters: 2.71MiB
Off-chip memory used for streaming uncached model parameters: 0.00B
Number of Edge TPU subgraphs: 1
Total number of operations: 72
Operation log: /content/models/mobilenet_v2_full_integer_quant1_edgetpu.log

Operator                       Count      Status

MEAN                           1          Mapped to Edge TPU
SOFTMAX                        1          Mapped to Edge TPU
FULLY_CONNECTED                1          Mapped to Edge TPU
ADD                            10         Mapped to Edge TPU
QUANTIZE                       2          Mapped to Edge TPU
PAD                            5          Mapped to Edge TPU
CONV_2D                        35         Mapped to Edge TPU
DEPTHWISE_CONV_2D              17         Mapped to Edge TPU

Inference

The Python inference code is as follows.

# Load the model and get an interpreter
interpreter = tf.lite.Interpreter(model_path=path_to_model_file)
interpreter.allocate_tensors()

# Set the input image with set_tensor and run inference with invoke
interpreter.set_tensor(interpreter.get_input_details()[0]["index"], image)
interpreter.invoke()

# Fetch the result; predictions holds the predicted score for each label
predictions = interpreter.get_tensor(interpreter.get_output_details()[0]["index"])

For the TF-Lite, Weight, Float16, and Integer quant models, the input has the same size as during training and is normalized.
For MobileNet v2, the input is Float32 with shape [1, 224, 224, 3], normalized to 0.0 - 1.0.

im = Image.open(path_to_image_file)
im = im.resize((224, 224))
im = np.array(im, np.float32)
im = im / 255.0
image = im[np.newaxis, :, :] # add a batch axis: 1, 224, 224, 3

For the Full Integer quant model, the input is UINT8 (0 - 255) without normalization.

im = Image.open(path_to_image_file)
im = im.resize((224, 224))
im = np.array(im, np.uint8)
image = im[np.newaxis, :, :] # add a batch axis: 1, 224, 224, 3
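
The two preprocessing variants can be folded into one helper that looks at the dtype the model expects. This is a sketch: prepare_input is a hypothetical name, and the dictionaries passed in below are stand-ins for one entry of interpreter.get_input_details(), of which only the "dtype" key is used.

```python
import numpy as np

def prepare_input(rgb_image, input_details):
    """Return a [1, H, W, 3] tensor matching the model's input dtype.

    rgb_image: uint8 array of shape (H, W, 3), already resized.
    input_details: one entry of interpreter.get_input_details().
    """
    if input_details["dtype"] == np.uint8:
        # Full integer quant model: raw pixel values 0-255.
        tensor = rgb_image.astype(np.uint8)
    else:
        # Other models: float32 normalized to 0.0-1.0, as in training.
        tensor = rgb_image.astype(np.float32) / 255.0
    return tensor[np.newaxis, ...]

# Stand-ins for a decoded image and for get_input_details() entries.
dummy_image = np.full((224, 224, 3), 255, dtype=np.uint8)
float_input = prepare_input(dummy_image, {"dtype": np.float32})
uint8_input = prepare_input(dummy_image, {"dtype": np.uint8})
print(float_input.shape, float_input.dtype, uint8_input.dtype)
```

This way the same inference script can be pointed at any of the converted models without editing the preprocessing by hand.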

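Comparing the models

For each of the models created, I compare:

File size
Accuracy
Inference time (Google Colab, Raspberry Pi 3 B+, Jetson Nano, Edge TPU)

Note that I could not create an Edge TPU model for DenseNet.

※ It seems it can no longer be compiled since the September 2019 Updates. The notes say to use the old compiler...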

Comparing model file sizes

In terms of file size:
TF-Lite model > Float16 quant model > Weight, Integer, Full integer quant model

Compared with the TF-Lite model:

The Weight, Integer, and Full Integer quantization models are about 1/2 to 1/4 the size.
The Float16 quantization model is the same size or about 1/2.
How much smaller a model gets depends on the model (but the trend is the same).

File size [MB] (Weight quant through Full integer quant are post-training quantization models)
Model	Keras model	TF-Lite model	Weight quant	Float16 quant	Integer quant	Full integer quant	Edge TPU model
Original CNN	197	50	25	50	25	25	25
MobileNet v2 1.0	16	9	2	4	3	3	3
Inception v3	127	84	21	42	22	22	23
DenseNet121	36	27	7	14	7	7	-
Figure: comparison of model sizes
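
A size comparison like this can be collected with a few lines of standard-library Python. The sketch below writes dummy files so that it runs anywhere; the file names and byte counts are made up, and size_report is a hypothetical helper rather than code from the notebook.

```python
import os
import tempfile

def size_report(paths, baseline):
    """Return {name: (size_in_MB, ratio_to_baseline)} for model files."""
    base = os.path.getsize(paths[baseline])
    return {name: (os.path.getsize(p) / 1e6, os.path.getsize(p) / base)
            for name, p in paths.items()}

# Dummy files stand in for real .tflite models.
with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, n_bytes in [("tflite", 4000), ("fp16_quant", 2000), ("weight_quant", 1000)]:
        paths[name] = os.path.join(tmp, name + ".tflite")
        with open(paths[name], "wb") as f:
            f.write(b"\0" * n_bytes)
    report = size_report(paths, baseline="tflite")

for name, (mb, ratio) in report.items():
    print(f"{name}: {mb:.3f} MB ({ratio:.2f}x of the TF-Lite model)")
```

Pointing paths at the real files in models_dir would give the ratios relative to the unquantized TF-Lite model.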

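Comparing accuracy

Here I compare the accuracy of each model.
Everything is run on the Google Colab CPU.

The dataset was split, and the evaluation uses the remaining test data that was not used for training or validation.
The point here is not how good the accuracy is, but how much it drops relative to the Keras model. The Edge TPU model is omitted because it is equivalent to the Integer and Full Integer quant models.

The TF-Lite and Float16 quant models are almost identical to the Keras model, with no loss of accuracy.
The Float16 quant model has not been checked with the GPU delegate.
The Weight quant model should in principle lose little accuracy, but accuracy dropped sharply for MobileNet v2 (0.7583 to 0.4583) and also dropped for Inception v3 (0.8500 to 0.8278).
The original CNN and DenseNet show no drop, so I suspect the model structure plays a role.
The Integer quant and Full Integer quant models show a slight loss of accuracy.
However (this probably depends on the task), the drop is only about 0.01 to 0.02.
When actually deploying a model, always verify the accuracy on test data.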
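
For reference, Top-1 accuracy as used in the table can be computed as in the sketch below. The scores and labels are made up (5 classes, matching tf_flowers), and top1_accuracy is a hypothetical helper, not necessarily how the notebook computes it.

```python
import numpy as np

def top1_accuracy(predictions, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    predicted = np.argmax(predictions, axis=1)
    return float(np.mean(predicted == labels))

# Made-up scores for 4 samples over 5 classes.
predictions = np.array([
    [0.7, 0.1, 0.1, 0.05, 0.05],
    [0.1, 0.6, 0.1, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.05, 0.05],
    [0.3, 0.4, 0.1, 0.1, 0.1],
])
labels = np.array([0, 1, 2, 0])

accuracy = top1_accuracy(predictions, labels)
print(f"Top-1 accuracy: {accuracy:.4f}")
```

Because only the position of the largest score matters, the same function should work on the uint8 outputs of a Full integer quant model without dequantizing them first.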

Top-1 Accuracy (Weight quant through Full integer quant are post-training quantization models)
Model	Keras model	TF-Lite model	Weight quant	Float16 quant	Integer quant	Full integer quant
Original CNN	0.5639	0.5639	0.5694	0.5611	0.5556	0.5556
MobileNet v2 1.0	0.7583	0.7583	0.4583	0.7556	0.7389	0.7417
Inception v3	0.8500	0.8500	0.8278	0.8500	0.8417	0.8444
DenseNet121	0.8528	0.8528	0.8556	0.8500	0.8778	0.8778

Figure: comparison of model accuracy

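Comparing processing time

Here I compare the inference time of each model.

For the Keras model, the time is measured around predict.
For each TF-Lite model, the time is measured over the span from set_tensor through invoke to get_tensor.
On the Raspberry Pi and Jetson Nano, I also check the effect of multithreading via set_num_threads.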
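
A measurement of this kind can be sketched with time.perf_counter as below. Whether the notebook averages in exactly this way is my assumption; time_per_inference_ms is a hypothetical helper, and fake_inference is a stand-in for the set_tensor / invoke / get_tensor sequence on a tf.lite.Interpreter.

```python
import time

def time_per_inference_ms(run_once, iterations=20):
    """Average wall-clock time of run_once() in milliseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iterations

# Stand-in workload; a real measurement would call set_tensor, invoke
# and get_tensor on the interpreter here instead of sleeping.
def fake_inference():
    time.sleep(0.002)

avg_ms = time_per_inference_ms(fake_inference, iterations=20)
print(f"{avg_ms:.1f} ms per inference")
```

Averaging over 20 iterations matches the "iterations = 20" noted in the tables; running one untimed warm-up inference first may give steadier numbers.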

Running on Google Colab

The Keras model runs on a Tesla K80; each TF-Lite model runs on the CPU (Intel(R) Xeon(R) CPU @ 2.20GHz).

The TF-Lite model and the Float16 quant model take about the same time.
The Weight quant model takes longer than the TF-Lite model.
The Integer and Full Integer quant models take considerably longer.
This is because they are not optimized for x86_64.
For server (cloud) execution, the TF-Lite model or the Weight quant model is the better fit.

Time per inference, in milliseconds (iterations = 20; Weight quant through Full integer quant are post-training quantization models)
Model	Keras model	TF-Lite model	Weight quant	Float16 quant	Integer quant	Full integer quant
Original CNN	54	52	83	86	1201	1199
MobileNet v2 1.0	78	40	86	42	1228	1229
Inception v3	127	379	1228	376	22313	11450
DenseNet121	131	244	677	244	11503	11293

Figure: inference time (Google Colab CPU, GPU)

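Running on Raspberry Pi 3 B+

For the Integer and Full Integer quant models, the benefit of integer quantization is visible.
For the Weight quant model, the difference from the TF-Lite model depends on the model.
Only the original CNN model gets faster; the other models get slower. I suspect this depends on the model structure.
The Weight quant model also shows almost no benefit from multithreading.
This is because multithreading is not supported for it.
The TF-Lite, Float16 quant, Integer, and Full integer quant models do benefit from multithreading. However, as on the Jetson Nano, the gain shrinks from 3 threads upward. So on many-core CPUs such as those in mobile devices, blindly raising the thread count may bring no benefit.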

Time per inference, in milliseconds (iterations = 20; Weight quant through Full integer quant are post-training quantization models)
Model	set_num_threads	TF-Lite model	Weight quant	Float16 quant	Integer quant	Full integer quant
Original CNN	1	216	94	214	58	60
Original CNN	2	181	83	182	36	36
Original CNN	3	172	81	169	31	31
Original CNN	4	169	80	169	29	28
MobileNet v2 1.0	1	366	452	361	272	274
MobileNet v2 1.0	2	239	418	240	154	153
MobileNet v2 1.0	3	251	422	240	118	116
MobileNet v2 1.0	4	218	415	217	99	98
Inception v3	1	2129	2462	2169	841	845
Inception v3	2	1175	2458	1168	456	458
Inception v3	3	824	2457	824	327	329
Inception v3	4	704	2452	702	263	266
DenseNet121	1	2584	2670	2534	1165	1187
DenseNet121	2	1481	2653	1493	715	719
DenseNet121	3	1164	2653	1176	571	568
DenseNet121	4	1130	2653	1156	509	498

Figure: inference time (Raspberry Pi 3 B+, Original CNN)

Figure: inference time (Raspberry Pi 3 B+, MobileNet v2 1.0)

Figure: inference time (Raspberry Pi 3 B+, Inception v3)

Figure: inference time (Raspberry Pi 3 B+, DenseNet121)
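
The diminishing return from extra threads is easier to see as speedup and per-thread efficiency. The sketch below uses the MobileNet v2 1.0 Integer quant timings from the Raspberry Pi 3 B+ table above; the speedup and efficiency definitions are the usual ones, chosen by me for illustration.

```python
# Times in ms for MobileNet v2 1.0, Integer quant model, Raspberry Pi 3 B+,
# taken from the table above (set_num_threads = 1 to 4).
times_ms = {1: 272, 2: 154, 3: 118, 4: 99}

# Speedup relative to a single thread, and speedup per thread used.
speedup = {n: times_ms[1] / t for n, t in times_ms.items()}
efficiency = {n: speedup[n] / n for n in times_ms}

for n in times_ms:
    print(f"{n} threads: speedup {speedup[n]:.2f}x, efficiency {efficiency[n]:.2f}")
```

Each added thread buys less than the previous one, which is the reason for the caution about setting a high thread count on many-core devices.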

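Running on Jetson Nano

The same characteristics as on the Raspberry Pi can be seen here.
Compared with the Raspberry Pi 3 B+, the processing time is roughly 1/2 to 1/3.
This is due to CPU and memory performance and the OS (32 or 64 bit).
Optimization may also play a part (a lot of ARM 32-bit optimization seems to be going in from TF2.0 onward, so the gap may narrow).
On the edge (ARM CPU), the Integer quant and Full integer quant models look like the best fit (as a trade-off against accuracy).
If accuracy is the priority, the TF-Lite model is the better fit.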

Time per inference, in milliseconds (iterations = 20; Weight quant through Full integer quant are post-training quantization models)
Model	set_num_threads	TF-Lite model	Weight quant	Float16 quant	Integer quant	Full integer quant
Original CNN	1	62	50	78	30	30
Original CNN	2	45	43	60	16	16
Original CNN	3	41	42	55	12	12
Original CNN	4	39	41	54	10	10
MobileNet v2 1.0	1	152	233	151	98	98
MobileNet v2 1.0	2	89	214	89	60	60
MobileNet v2 1.0	3	78	213	78	47	47
MobileNet v2 1.0	4	83	212	82	41	41
Inception v3	1	4643	4390	4615	1906	1887
Inception v3	2	2678	4381	2711	1034	1031
Inception v3	3	2005	4364	2030	754	760
Inception v3	4	1820	4360	1836	608	683
DenseNet121	1	1286	1521	1301	566	574
DenseNet121	2	730	1525	729	360	362
DenseNet121	3	564	1520	570	295	293
DenseNet121	4	520	1523	517	262	261

Figure: inference time (Jetson Nano, Original CNN)

Figure: inference time (Jetson Nano, MobileNet v2 1.0)

Figure: inference time (Jetson Nano, Inception v3)

Figure: inference time (Jetson Nano, DenseNet121)

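Running on the Edge TPU (USB Accelerator)

Here I compare execution on the Raspberry Pi and the Jetson Nano.

The difference between USB2.0 and USB3.0 is visible (the Raspberry Pi 3 B+ has USB2.0, the Jetson Nano USB3.0).

Time per inference, in milliseconds (iterations = 20)

Model	Device	Edge TPU model
Original CNN	Raspberry Pi 3 B+	706
Original CNN	Jetson Nano	63
MobileNet v2 1.0	Raspberry Pi 3 B+	13
MobileNet v2 1.0	Jetson Nano	3
Inception v3	Raspberry Pi 3 B+	488
Inception v3	Jetson Nano	46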

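Other notes

I want to check the Float16 quant model with the GPU delegate on Android.
(Can I do it with the phone I own?)
I want to confirm that the DenseNet Edge TPU model runs.
(Will the next update support it?)
Why does the Weight quant model lose accuracy?
(Should I report it in an Issue?)
How about 64-bit on the Raspberry Pi 3 B+?
I want to try an object detection model.
What happened to tflite_convert?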

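Closing

I have summarized post-training quantization end to end, from training to inference.

Accuracy and processing time (and file size) are in a trade-off, so the choice has to be made to suit the hardware and the purpose.

As the TensorFlow Lite 2019 Roadmap shows, major features are still being added, so be aware that this content may stop being useful in the future (though the overall flow should not change, I think?).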

nb.o's Diary

Wednesday, October 16, 2019
