
TensorFlow training stops with "InvalidArgumentError: Assign requires shapes of both tensors to match." or "Incompatible shapes"

I've been playing around with the TensorFlow Object Detection API and kept running into the same kind of error, so I'm leaving a note here.

models/running_pets.md at master · tensorflow/models
All I was doing was taking the pet-breed detection dataset from the tutorial and swapping in the images I actually wanted to classify.

Symptom

Creating the TFRecords went fine, but during training the following error appeared and training stopped.
The environment was ML Engine on GCP.

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
[...]
groundtruth_classes_with_background_list))
File "/root/.local/lib/python2.7/site-packages/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1421, in _loss_box_classifier
batch_reg_targets, weights=batch_reg_weights) / normalizer
File "/root/.local/lib/python2.7/site-packages/object_detection/core/losses.py", line 71, in __call__
return self._compute_loss(prediction_tensor, target_tensor, **params)
File "/root/.local/lib/python2.7/site-packages/object_detection/core/losses.py", line 157, in _compute_loss
diff = prediction_tensor - target_tensor
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 794, in binary_op_wrapper
return func(x, y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2775, in _sub
result = _op_def_lib.apply_op("Sub", x=x, y=y, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Incompatible shapes: [1,63,4] vs. [1,64,4]
[[Node: Loss/BoxClassifierLoss/Loss/sub = Sub[T=DT_FLOAT, _device="/job:master/replica:0/task:0/gpu:0"](Loss/BoxClassifierLoss/Reshape_9, Loss/BoxClassifierLoss/stack_4)]]
[[Node: total_loss_1_G1426 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/cpu:0", send_device="/job:master/replica:0/task:0/gpu:0", send_device_incarnation=-5926190012419481980, tensor_name="edge_6736_total_loss_1", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/cpu:0"]()]]
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 296, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 776, in train
master, start_standard_services=False, config=session_config) as sess:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 949, in managed_session
start_standard_services=start_standard_services)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 706, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 256, in prepare_session
config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 188, in _restore_checkpoint
saver.restore(sess, ckpt.model_checkpoint_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1428, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [2048,3] rhs shape= [2048,2]
[[Node: save/Assign_820 = Assign[T=DT_FLOAT, _class=["loc:@SecondStageBoxPredictor/ClassPredictor/weights"], use_locking=true, validate_shape=true, _device="/job:ps/replica:0/task:0/cpu:0"](SecondStageBoxPredictor/ClassPredictor/weights, save/RestoreV2_820)]]
[[Node: save/restore_all/NoOp_1_S8 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/gpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=-4427910840146810295, tensor_name="edge_6_save/restore_all/NoOp_1", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/gpu:0"]()]]

Caused by op u'save/Assign_820', defined at:
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 281, in train
keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1040, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1070, in build
restore_sequentially=self._restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 675, in build
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 414, in _AddRestoreOps
assign_ops.append(saveable.restore(tensors, shapes))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 155, in restore
self.op.get_shape().is_fully_defined())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
use_locking=use_locking, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [2048,3] rhs shape= [2048,2]
[[Node: save/Assign_820 = Assign[T=DT_FLOAT, _class=["loc:@SecondStageBoxPredictor/ClassPredictor/weights"], use_locking=true, validate_shape=true, _device="/job:ps/replica:0/task:0/cpu:0"](SecondStageBoxPredictor/ClassPredictor/weights, save/RestoreV2_820)]]
[[Node: save/restore_all/NoOp_1_S8 = _Recv[client_terminated=false, recv_device="/job:master/replica:0/task:0/gpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=-4427910840146810295, tensor_name="edge_6_save/restore_all/NoOp_1", tensor_type=DT_FLOAT, _device="/job:master/replica:0/task:0/gpu:0"]()]]

Cause

It seems the sizes of the images I supplied for training were inappropriate.

In the training conf I had defined the minimum and maximum dimensions:

model {
  faster_rcnn {
    num_classes: 2
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 240
        max_dimension: 1280
      }
    }
  }
}
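As far as I can tell, keep_aspect_ratio_resizer scales an image so its shorter side becomes min_dimension, falling back to scaling the longer side down to max_dimension when that would overshoot. A rough sketch of that logic (my own reconstruction for illustration, not the actual TensorFlow code):

```python
def resize_to_range(width, height, min_dimension, max_dimension):
    """Compute the output size of keep_aspect_ratio_resizer (sketch).

    Scale so the shorter side reaches min_dimension; if that would push
    the longer side past max_dimension, scale the longer side down to
    max_dimension instead. Aspect ratio is preserved either way.
    """
    scale = min_dimension / min(width, height)
    if scale * max(width, height) > max_dimension:
        scale = max_dimension / max(width, height)
    return round(width * scale), round(height * scale)

# A 100x200 image with min=240, max=1280: shorter side 100 -> 240.
print(resize_to_range(100, 200, 240, 1280))   # (240, 480)
# A 400x2400 image: scaling to min would make the long side 1440 > 1280,
# so the long side is clamped to 1280 instead.
print(resize_to_range(400, 2400, 240, 1280))  # (213, 1280)
```

Either way the aspect ratio is kept, so the resize alone shouldn't distort normalized box coordinates.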

I figured this meant images that were too small or too large would get resized into this range, and that as a side effect the bounding-box coordinates I had specified were thrown off and ended up pointing outside the image?
what the field "keep_aspect_ratio_resizer" means in the .config file? · Issue #1794 · tensorflow/models

But no, that guess was wrong. Apparently every image, whatever its size, is resized to this size before being fed into the CNN.

Paper overview: Fast R-CNN & Faster R-CNN

But that approach belongs to R-CNN; Faster R-CNN apparently handles it with something called an RoI pooling layer instead.
I'm getting lost. I'll dig into it properly later.

Workaround

Decide on the size of the input images beforehand, and write that size into the conf as well:

model {
  faster_rcnn {
    num_classes: 2
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 480
        max_dimension: 640
      }
    }
  }
}

As long as you store the bounding-box coordinates as ratios (relative to the image size) when creating the TFRecords, I don't think anything else needs changing.
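That normalization is just dividing pixel coordinates by the image dimensions. A minimal sketch (the helper name is my own, not part of the API):

```python
def normalize_box(xmin, ymin, xmax, ymax, width, height):
    """Convert pixel box coordinates into the 0..1 fractions that the
    Object Detection API expects in a TFRecord example."""
    return (xmin / width, ymin / height, xmax / width, ymax / height)

# A 640x480 image with a box from (64, 48) to (320, 240):
print(normalize_box(64, 48, 320, 240, 640, 480))  # (0.1, 0.1, 0.5, 0.5)
```

Because the values are fractions, they stay valid no matter what size the resizer scales the image to.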

For now, the error seems to go away as long as no image falls short of the specified minimum or exceeds the maximum. It still bothers me that I don't understand the root cause, though.
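Since the only pattern that avoided the error was keeping every image within the configured range, a simple pre-flight check on the dataset might help. A sketch of such a check (my own helper, assuming the min/max semantics above):

```python
def fits_resizer_range(width, height, min_dimension, max_dimension):
    """True if an image already satisfies the configured size range:
    its shorter side is at least min_dimension and its longer side is
    at most max_dimension."""
    return (min(width, height) >= min_dimension
            and max(width, height) <= max_dimension)

# With the conf above (min 480, max 640):
print(fits_resizer_range(640, 480, 480, 640))   # True
print(fits_resizer_range(1280, 720, 480, 640))  # False: long side too big
print(fits_resizer_range(320, 240, 480, 640))   # False: short side too small
```

Running every training image through a check like this before building the TFRecords would flag offenders up front instead of failing mid-training on ML Engine.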