
# 1 Prerequisites

Basic concepts: https://www.yuque.com/docs/share/04b60c4c-90ec-49c7-8a47-0dae7d3c78c7?# (definitions of some of the symbols used below are given there).

To understand PPO you first have to understand Actor-Critic. The Actor outputs the policy, i.e. the probability distribution over actions in a given state. The Critic outputs the value of a state. The tacit agreement between the two: the Actor trusts that the state values the Critic reports are real, and the Critic trusts that the action $a$ in every $(s,a)$ pair the Actor sends over is the best action. Through repeated iteration, the correctness of this mutual trust is reinforced. (This embodies our motto, "trust makes things simple", haha~)

With that in mind, the Critic's loss is easy to derive. The Critic's output is the value of a state, so we train it to make the following hold:

$$V_s = r_{s,a} + \gamma V_{s'}$$

Here $r_{s,a}, s, a, s'$ are the data used to train the Critic: $s'$ is the new state reached by taking action $a$ in state $s$, $r_{s,a}$ is the reward, and $\gamma$ is the discount factor. Compared with the basic definitions, this assumes that taking action $a$ can only lead to $s'$ and does not model the fact that $a$ may lead to different states. (In practice that stochasticity can still show up in the training data, because $(s,a,r_{s,a})$ and $s'$ are not necessarily one-to-one; the transition probabilities are reflected in the distribution of the sampled data.) So the Critic's loss is the L1 (or L2) norm of $r_{s,a}+\gamma V_{s'}-V_s$, i.e. of the so-called TD (Temporal Difference) error.

How is the Actor's loss computed? First we need the notion of Advantage, which is exactly the TD error:

$$Adv = r_{s,a} + \gamma V_{s'} - V_s$$

It is called the Advantage because when Advantage > 0, the value observed after actually executing $a$ turns out larger than the Critic's current estimate, so this is a good action: it makes $V_s$ larger and the Actor should increase its probability. Conversely, when Advantage < 0 the action is bad and its probability should be decreased. The Actor's loss is therefore

$$-\log(\pi(a|s)) \cdot Adv$$

The minus sign is there because we minimize the loss; the sign of Adv says whether the probability of the action should be increased or decreased, and its magnitude says by how much.

# 2 Proximal Policy Optimization (PPO)

## 2.1 What problem does PPO solve?

It lets us take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse. In other words, take as large an update step as you can, but not so large that the model parameters blow up.

## 2.2 How does PPO solve it?

In short: with the same network structure, maintain two sets of parameters, Old and New. In each iteration update the New parameters while keeping the policy produced by New from drifting too far from Old's policy; when the iteration ends, overwrite Old with New.

How is the drift kept small? The authors give two ways: PPO-Penalty and PPO-Clip.

### 2.2.1 PPO-Clip

PPO-Clip updates the policy via

$$\theta_{k+1}=\arg\max_{\theta} E_{s,a \sim \pi_{\theta_k}}\left[L(s,a,\theta_k,\theta)\right]$$

i.e. it maximizes $L(s,a,\theta_k,\theta)$, where

$$L(s,a,\theta_k,\theta)=\min \left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a) ,\ \mathrm{clip} \left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, 1-\epsilon, 1+\epsilon \right)A^{\pi_{\theta_k}}(s,a) \right)$$

This form mainly exists to make it clear why the method is called PPO-Clip (I feel the clip term alone would already be enough; the expression is a bit redundant). $\theta_k$ are the current Old parameters and $\theta$ are the New parameters. $\pi_{\theta}(a|s)$ is the probability of executing $a$ in state $s$ under the New Actor's policy, $\pi_{\theta_k}(a|s)$ is that probability under the Old Actor's policy, and $A^{\pi_{\theta_k}}(s,a)$ is the Advantage computed with the Old Critic. Rewriting the objective makes its real intent easier to see:

$$L(s,a,\theta_k,\theta)=\min \left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a) ,\ g \left( \epsilon, A^{\pi_{\theta_k}}(s,a) \right) \right)$$

where

$$g \left( \epsilon, A \right)=\left\{ \begin{aligned} &(1+\epsilon)A & A\ge 0 \\ &(1-\epsilon)A & A< 0 \end{aligned} \right.$$

When Advantage >= 0,

$$L(s,a,\theta_k,\theta)=\min \left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)},\ 1+\epsilon \right)A^{\pi_{\theta_k}}(s,a)$$

This makes it clear that we should increase $\pi_{\theta}(a|s)$, i.e. treat the action as good and raise the probability of choosing $a$, but $1+\epsilon$ caps how far that increase can help. Similarly, when Advantage < 0 (note that a negative $A$ turns the min over the objective into a max over the ratio),

$$L(s,a,\theta_k,\theta)=\max \left( \frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)},\ 1-\epsilon \right)A^{\pi_{\theta_k}}(s,a)$$

so we should shrink $\pi_{\theta}(a|s)$, but shrinking the ratio below $1-\epsilon$ brings no further gain. A small code sketch of this loss is given below.

One more point: as I understand it, the gradient through $\pi_{\theta_k}(a|s)$ should be stopped, i.e. backprop has no business updating the Old Actor's parameters. The OpenAI Spinning Up code ([https://github.com/openai/spinningup/blob/master/spinup/algos/pytorch/ppo/ppo.py](https://github.com/openai/spinningup/blob/master/spinup/algos/pytorch/ppo/ppo.py)) indeed does this, while the Tianshou code ([https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/ppo.py](https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/ppo.py)) does not stop the gradient and still works. Presumably, from the perspective of $\pi_{\theta}(a|s)$, $\pi_{\theta_k}(a|s)$ is just a scale factor, and whether that factor is a variable or a constant may not matter much; besides, $\theta_k$ is overwritten at the end of the iteration anyway, so updating it through backprop changes nothing in the end.
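To make the clipped objective concrete, here is a minimal sketch of the Actor loss in PyTorch (the framework used by the Spinning Up and Tianshou code linked above). The tensor names `logp_new`, `logp_old`, `adv` and the value of `eps` are assumptions for illustration, not taken from any particular codebase:

```python
import torch

def ppo_clip_actor_loss(logp_new, logp_old, adv, eps=0.2):
    """PPO-Clip surrogate objective, negated so it can be minimized.

    logp_new: log pi_theta(a|s) from the New Actor (carries gradients)
    logp_old: log pi_theta_k(a|s) from the Old Actor (treated as a constant)
    adv:      advantage estimates A^{pi_theta_k}(s, a)
    """
    ratio = torch.exp(logp_new - logp_old.detach())      # pi_new / pi_old, gradient stopped on old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # clip(ratio, 1-eps, 1+eps)
    surrogate = torch.min(ratio * adv, clipped * adv)    # elementwise min of the two terms
    return -surrogate.mean()                             # maximize L  ==  minimize -L
```

Note that the single `torch.min` already covers both the $A \ge 0$ and $A < 0$ cases discussed above.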
Up to this point we have only talked about how to update the Actor. How are the Critic's parameters updated?

$$L_c(s,a,r_{s,a},s')=\left|r_{s,a}+\gamma V^{\pi_{\theta_k}}_{s'}-V^{\pi_{\theta}}_{s}\right|$$

The only difference from before is that the target value is computed with the Old Critic, which is standard practice in DRL.

To sum up, PPO-Clip prevents the update step from being too large via the clip operation. The authors' experiments show that Clip works better than Penalty.

### 2.2.2 PPO-Penalty

$$L^{KLPEN}(\theta)=\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a) -\beta\, \mathrm{KL}\left( \pi_{\theta_k}(\cdot|s) \,\|\, \pi_{\theta}(\cdot|s) \right)$$

Once you understand the objective above, this one is easy: it simply adds a penalty on the difference between the new and old policies, measured by the KL divergence.

(PS: if I got anything wrong, corrections are welcome~)
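For completeness, here is a hedged sketch of the Critic loss and of the PPO-Penalty objective, again in PyTorch; all names (`v_new_s`, `logits_old`, `beta`, ...) are illustrative assumptions, and the KL term is written for a discrete (categorical) policy:

```python
import torch
import torch.nn.functional as F

def critic_loss(v_new_s, v_old_next, reward, gamma=0.99):
    """L1 TD error; the target uses the Old Critic's value of s' (L2 also works)."""
    target = reward + gamma * v_old_next.detach()
    return (target - v_new_s).abs().mean()

def ppo_penalty_actor_loss(logp_new, logp_old, adv, logits_new, logits_old, beta=1.0):
    """PPO-Penalty surrogate, negated for minimization.

    The penalty is beta * KL(pi_old || pi_new) as in the PPO paper;
    logits_new / logits_old are the unnormalized action scores of the two Actors.
    """
    ratio = torch.exp(logp_new - logp_old.detach())
    surrogate = ratio * adv
    # F.kl_div(input=log q, target=p) computes KL(p || q); here p = old, q = new
    kl = F.kl_div(F.log_softmax(logits_new, dim=-1),
                  F.softmax(logits_old, dim=-1).detach(),
                  reduction="batchmean")
    return -(surrogate.mean() - beta * kl)
```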
Links: model quantization acceleration: https://www.atatech.org/articles/132554?spm=ata.13261165.0.0.453bd67dhgqUkdTF ; quantization-aware training: https://www.atatech.org/articles/127543?spm=ata.13261165.0.0.453bd67dhgqUkd

Quantization is practically a must for on-device deep-learning inference. Earlier articles have already covered how to do Quantization-aware training several times; this one is a short introduction to Post Training quantization.

What is the difference between the two? Quantization-aware training quantizes during training, which better preserves the accuracy of the quantized model. Post training quantization takes an already trained model and quantizes it directly, running inference over a set of sample data to collect the [min, max] statistics needed for quantization. Post training quantization usually loses more accuracy than Quantization-aware training, so in past work we mainly recommended Quantization-aware training.

Does that make Post training quantization useless? Clearly not:

- On relatively "heavy" models, the accuracy loss of Post training quantization is very small.
- Quantization-aware training is slower to train.
- Post training quantization only needs the model itself. For example, if the model was converted from another training framework, Post training quantization is the only option.

Based on the official tutorial https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/lite/tutorials/full_integer_post_training_quant.ipynb I adapted a script that is easier to apply to different input models and reads the sample images directly from the file system:

```python
import tensorflow.compat.v1 as tf
# Eager Mode is essential!
tf.enable_eager_execution()
import sys
import glob
if sys.version_info.major >= 3:
    import pathlib
else:
    import pathlib2 as pathlib
import random
import cv2
import numpy as np

tf.logging.set_verbosity(tf.logging.DEBUG)

pb_file = 'model.pb'
input_arrays = ['net_input']
output_arrays = ['net_output']
input_shapes = [1, 128, 128, 3]
sample_img_dir = '/tmp'

converter = tf.lite.TFLiteConverter.from_frozen_graph(
    graph_def_file=pb_file,
    input_arrays=input_arrays,
    output_arrays=output_arrays,
    input_shapes={input_arrays[0]: input_shapes})
converter.allow_custom_ops = True
tflite_model = converter.convert()
tflite_models_dir = pathlib.Path("./tmp/tflite_models/")
tflite_models_dir.mkdir(exist_ok=True, parents=True)
tflite_model_file = tflite_models_dir/"model.tflite"
tflite_model_file.write_bytes(tflite_model)

tf.logging.set_verbosity(tf.logging.DEBUG)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def preprocess_img(img):
    # preprocess
    img = cv2.resize(img, (128, 128))
    # FLOAT32
    img = img.astype(np.float32)
    # img = (img/128.) - 1.
    img = img - 128.
    return img

def create_datastream_from_imgs(img_dir):
    img_path_list = glob.glob(img_dir + '/*.jpg')
    random.shuffle(img_path_list)
    img_path_list = img_path_list[:200]
    imgs_list = []
    for path in img_path_list:
        img = cv2.imread(path)
        img = preprocess_img(img)
        imgs_list.append(img)
    imgs = np.stack(imgs_list, axis=0)
    return tf.data.Dataset.from_tensor_slices((imgs)).batch(1)

ds = create_datastream_from_imgs(sample_img_dir)

def representative_data_gen():
    for input_value in ds.take(20):
        yield [input_value]

converter.representative_dataset = representative_data_gen
tflite_quant_model = converter.convert()
tflite_model_quant_file = tflite_models_dir/"model_quant.tflite"
tflite_model_quant_file.write_bytes(tflite_quant_model)
```

When I first tested this on tf 1.14 I hit a segmentation fault, which some of you may run into as well:

```
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
```

Simply upgrading to the tf-nightly build solves it.

On success you get a model like this: slightly different from a Quantization-aware-training model, its inputs and outputs are still float, whereas a Quantization-aware-training model has int8 inputs and outputs; accordingly, a Quantize node is added right after the input node and a DeQuantize node right before the output node.

On a simple classification task the accuracy dropped by 1.4%, which is just about acceptable.
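For a quick sanity check of the quantized model, something like the following can be used; this is only a sketch with the TFLite Python interpreter, assuming the paths and the 128x128x3 float input of the script above (the random input is a placeholder for a real preprocessed image):

```python
import numpy as np
import tensorflow.compat.v1 as tf

# Load the post-training-quantized model produced by the script above.
interpreter = tf.lite.Interpreter(model_path="./tmp/tflite_models/model_quant.tflite")
interpreter.allocate_tensors()
input_detail = interpreter.get_input_details()[0]
output_detail = interpreter.get_output_details()[0]

# Inputs/outputs of this model are still float32 (see the note above).
dummy = np.random.rand(1, 128, 128, 3).astype(np.float32) * 255. - 128.
interpreter.set_tensor(input_detail['index'], dummy)
interpreter.invoke()
print(interpreter.get_tensor(output_detail['index']))
```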
On earlier versions, e.g. 1.12, following the official tutorial basically builds without trouble (just change the path to ./tensorflow/contrib/lite/).

Native compilation

This has been tested on a Raspberry Pi 3b, Raspbian GNU/Linux 9.1 (stretch), gcc version 6.3.0 20170516 (Raspbian 6.3.0-18+rpi1).

Log in to the Raspberry Pi and install the toolchain:

```shell
sudo apt-get install build-essential
```

First, clone the TensorFlow repository. From the repository root, run:

```shell
./tensorflow/lite/tools/make/download_dependencies.sh
```

Note that you only need to do this once. Then you can build:

```shell
./tensorflow/lite/tools/make/build_rpi_lib.sh
```

This compiles a static library located at tensorflow/lite/tools/make/gen/lib/rpi_armv7/libtensorflow-lite.a.

On version 1.14, however, the build runs into various problems.

1) download_dependencies.sh fails to download eigen

Edit tensorflow/workspace.bzl. The link "http://mirror.tensorflow.org/bitbucket.org/eigen/eigen/get/a0d250e79c79.tar.gz" is dead; swap its position with the other link:

```python
tf_http_archive(
    name = "eigen_archive",
    build_file = clean_dep("//third_party:eigen.BUILD"),
    patch_file = clean_dep("//third_party/eigen3:gpu_packet_math.patch"),
    sha256 = "0dde8fb87f5dad2e409c9f4ea1bebc54e694cf4f3b633081b0d51a55c00f9c9f",
    strip_prefix = "eigen-eigen-a0d250e79c79",
    urls = [
        "https://bitbucket.org/eigen/eigen/get/a0d250e79c79.tar.gz",
        "http://mirror.tensorflow.org/bitbucket.org/eigen/eigen/get/a0d250e79c79.tar.gz",
    ],
)
```

2) build_rpi_lib.sh error: headers not found

Add the following to the include configuration in ./tensorflow/lite/tools/make/Makefile:

```makefile
-I$(MAKEFILE_DIR)/downloads/googletest/googletest/include \
-I$(MAKEFILE_DIR)/downloads/googletest/googlemock/include \
```

3) build_rpi_lib.sh error: undefined reference to `flatbuffers::ClassicLocale::instance_'

Edit ./tensorflow/lite/tools/make/Makefile:
- add $(wildcard tensorflow/lite/tools/make/downloads/flatbuffers/src/util.cpp) at the end of CORE_CC_ALL_SRCS;
- change LIB_OBJS and BENCHMARK_OBJS so that %.cpp files are also turned into %.o;
- add an $(OBJDIR)%.o: %.cpp rule near the other OBJDIR rules:

```makefile
#LIB_OBJS := $(addprefix $(OBJDIR), \
#$(patsubst %.cc,%.o,$(patsubst %.c,%.o,$(TF_LITE_CC_SRCS))))
LIB_OBJS := $(addprefix $(OBJDIR), \
$(patsubst %.cc,%.o,$(patsubst %.c,%.o,$(patsubst %.cpp,%.o,$(TF_LITE_CC_SRCS)))))

#BENCHMARK_OBJS := $(addprefix $(OBJDIR), \
#$(patsubst %.cc,%.o,$(patsubst %.c,%.o,$(BENCHMARK_SRCS))))
BENCHMARK_OBJS := $(addprefix $(OBJDIR),\
$(patsubst %.cc,%.o,$(patsubst %.c,%.o,$(patsubst %.cpp,%.o,$(BENCHMARK_SRCS)))))

# For normal manually-created TensorFlow Lite CPP source files.
$(OBJDIR)%.o: %.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CXXFLAGS) $(INCLUDES) -c $< -o $@
```

4) build_rpi_lib.sh error: undefined references when linking the benchmark binary

```
arm-linux-gnueabihf-g++ -O3 -DNDEBUG -fPIC --std=c++11 -march=armv7-a -mfpu=neon-vfpv4 -funsafe-math-optimizations -ftree-vectorize -fPIC -I. -I/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/../../../../../ -I/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/../../../../../../ -I/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/downloads/ -I/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/downloads/eigen -I/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/downloads/absl -I/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/downloads/gemmlowp -I/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/downloads/neon_2_sse -I/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/downloads/farmhash/src -I/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/downloads/flatbuffers/include -I/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/downloads/googletest/googletest/include -I/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/downloads/googletest/googlemock/include -I -I/usr/local/include \
-o /home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/gen/rpi_armv7l/bin/benchmark_model \
/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/gen/rpi_armv7l/lib/benchmark-lib.a -Wl,--no-export-dynamic -Wl,--exclude-libs,ALL -Wl,--gc-sections -Wl,--as-needed -lstdc++ -lpthread -lm -ldl
/home/pi/Develop/tensorflow-r1.14/tensorflow/lite/tools/make/gen/rpi_armv7l/lib/benchmark-lib.a(test_main.o): In function `main':
test_main.cc:(.text.startup+0x20): undefined reference to `testing::InitGoogleTest(int*, char**)'
test_main.cc:(.text.startup+0x2b4): undefined reference to `tflite::SingleOpModel::SetForceUseNnapi(bool)'
test_main.cc:(.text.startup+0x2c0): undefined reference to `testing::UnitTest::GetInstance()'
test_main.cc:(.text.startup+0x2c4): undefined reference to `testing::UnitTest::Run()'
```

Drop the compilation of test_main.cc from the Makefile by adding $(wildcard tensorflow/lite/kernels/test_main.cc) to CORE_CC_EXCLUDE_SRCS:

```makefile
# Remove any duplicates.
CORE_CC_ALL_SRCS := $(sort $(CORE_CC_ALL_SRCS))
CORE_CC_EXCLUDE_SRCS := \
$(wildcard tensorflow/lite/*test.cc) \
$(wildcard tensorflow/lite/*/*test.cc) \
$(wildcard tensorflow/lite/*/*/*test.cc) \
$(wildcard tensorflow/lite/*/*/*/*test.cc) \
$(wildcard tensorflow/lite/kernels/*test_util.cc) \
$(wildcard tensorflow/lite/kernels/test_main.cc) \
$(MINIMAL_SRCS)
```
TensorFlow 2.0 is a major slim-down of the 1.x line: Eager Execution is enabled by default and Keras is the default high-level API, which together make TensorFlow much easier to use.

This article records a rather tortuous debugging session with BatchNormalization in Keras + TensorFlow 2.0. This pitfall nearly wiped out all of TF 2.0's new features for us; if you are working through the official TF 2.0 tutorials, it may be worth a read.

How the problem arose

It starts from the tutorial [1] https://www.tensorflow.org/alpha/tutorials/images/transfer_learning?hl=zh-cn (on transfer learning):

```python
IMG_SHAPE = (IMG_SIZE, IMG_SIZE, 3)

# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES)
])
```

With a few lines we reuse the MobileNetV2 architecture to build a classifier, and then train it through the Keras API:

```python
model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=base_learning_rate),
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])
model.summary()
history = model.fit(train_batches.repeat(),
                    epochs=20,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=validation_batches.repeat(),
                    validation_steps=validation_steps)
```

Judging by the output, everything looks perfect:

```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
mobilenetv2_1.00_160 (Model) (None, 5, 5, 1280)        2257984
_________________________________________________________________
global_average_pooling2d (Gl (None, 1280)              0
_________________________________________________________________
dense (Dense)                (None, 2)                 1281
=================================================================
Total params: 2,259,265
Trainable params: 1,281
Non-trainable params: 2,257,984
_________________________________________________________________
Epoch 11/20
581/581 [==============================] - 134s 231ms/step - loss: 0.4208 - accuracy: 0.9484 - val_loss: 0.1907 - val_accuracy: 0.9812
Epoch 12/20
581/581 [==============================] - 114s 197ms/step - loss: 0.3359 - accuracy: 0.9570 - val_loss: 0.1835 - val_accuracy: 0.9844
Epoch 13/20
581/581 [==============================] - 116s 200ms/step - loss: 0.2930 - accuracy: 0.9650 - val_loss: 0.1505 - val_accuracy: 0.9844
Epoch 14/20
581/581 [==============================] - 114s 196ms/step - loss: 0.2561 - accuracy: 0.9701 - val_loss: 0.1575 - val_accuracy: 0.9859
Epoch 15/20
581/581 [==============================] - 119s 206ms/step - loss: 0.2302 - accuracy: 0.9715 - val_loss: 0.1600 - val_accuracy: 0.9812
Epoch 16/20
581/581 [==============================] - 115s 197ms/step - loss: 0.2134 - accuracy: 0.9747 - val_loss: 0.1407 - val_accuracy: 0.9828
Epoch 17/20
581/581 [==============================] - 115s 197ms/step - loss: 0.1546 - accuracy: 0.9813 - val_loss: 0.0944 - val_accuracy: 0.9828
Epoch 18/20
581/581 [==============================] - 116s 200ms/step - loss: 0.1636 - accuracy: 0.9794 - val_loss: 0.0947 - val_accuracy: 0.9844
Epoch 19/20
581/581 [==============================] - 115s 198ms/step - loss: 0.1356 - accuracy: 0.9823 - val_loss: 0.1169 - val_accuracy: 0.9828
Epoch 20/20
581/581 [==============================] - 116s 199ms/step - loss: 0.1243 - accuracy: 0.9849 - val_loss: 0.1121 - val_accuracy: 0.9875
```

This style, however, is not convenient for debugging. We wanted finer control over the iteration and access to intermediate results, so the training loop was rewritten like this:

```python
optimizer = tf.keras.optimizers.RMSprop(lr=base_learning_rate)
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

@tf.function
def train_cls_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image)
        loss = tf.keras.losses.SparseCategoricalCrossentropy()(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_accuracy(label, predictions)

for images, labels in train_batches:
    train_cls_step(images, labels)
```

After retraining, the results were still perfect!

But then we wanted to compare fine-tuning against training from scratch, so the model construction was changed to:

```python
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights=None)
```

so that the weights are randomly initialized. At this point training went haywire: the loss does not decrease and the accuracy hovers around 50%:

```
Step #10: loss=0.6937199831008911 acc=46.5625%
Step #20: loss=0.6932525634765625 acc=47.8125%
Step #30: loss=0.699873685836792 acc=49.16666793823242%
Step #40: loss=0.6910845041275024 acc=49.6875%
Step #50: loss=0.6935917139053345 acc=50.0625%
Step #60: loss=0.6965731382369995 acc=49.6875%
Step #70: loss=0.6949992179870605 acc=49.19642639160156%
Step #80: loss=0.6942993402481079 acc=49.84375%
Step #90: loss=0.6933775544166565 acc=49.65277862548828%
Step #100: loss=0.6928421258926392 acc=49.5%
Step #110: loss=0.6883170008659363 acc=49.54545593261719%
Step #120: loss=0.695658802986145 acc=49.453125%
Step #130: loss=0.6875559091567993 acc=49.61538314819336%
Step #140: loss=0.6851695775985718 acc=49.86606979370117%
Step #150: loss=0.6978713274002075 acc=49.875%
Step #160: loss=0.7165156602859497 acc=50.0%
Step #170: loss=0.6945627331733704 acc=49.797794342041016%
Step #180: loss=0.6936900615692139 acc=49.9305534362793%
Step #190: loss=0.6938323974609375 acc=49.83552551269531%
Step #200: loss=0.7030564546585083 acc=49.828125%
Step #210: loss=0.6926192045211792 acc=49.76190185546875%
Step #220: loss=0.6932414770126343 acc=49.786930084228516%
Step #230: loss=0.6924526691436768 acc=49.82337188720703%
Step #240: loss=0.6882281303405762 acc=49.869789123535156%
Step #250: loss=0.6877702474594116 acc=49.86249923706055%
Step #260: loss=0.6933954954147339 acc=49.77163314819336%
Step #270: loss=0.6944763660430908 acc=49.75694274902344%
Step #280: loss=0.6945018768310547 acc=49.49776840209961%
```

Printing out the predictions, we found that every output within a batch is identical:

```
0 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
1 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
2 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
3 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
4 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
5 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
6 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
7 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
8 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
9 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
```

All we changed was the initial weights; why would that produce such a result?
Troubleshooting

Experiment 1

Is the model simply under-trained, or is the learning rate set badly? After several rounds of adjustment we found that no matter how long we trained, or whether the learning rate went up or down, the result did not change.

Experiment 2

If it is a weight problem, is the random initialization broken? We pulled out the initial weights and checked their statistics: everything was normal.

Experiment 3

In our past experience, this "all outputs in a batch are identical" symptom appears when BatchNormalization is not handled properly while exporting an inference model. But how would that explain the problem showing up during training? And why does fine-tuning not hit it, when all we changed was the initial weights?

Googling in this direction turned up quite a few issues around Keras's BatchNormalization. One is that the moving mean and moving variance of BatchNormalization are not saved when saving the model [6] https://github.com/tensorflow/tensorflow/issues/16455, while two other issues are directly related to ours: [2] https://github.com/tensorflow/tensorflow/issues/19643 [3] https://github.com/tensorflow/tensorflow/issues/23873. In the end that author found the cause and summarized it here: [4] https://pgaleone.eu/tensorflow/keras/2019/01/19/keras-not-yet-interface-to-tensorflow/

Following this hint, we tried the following:

Experiment 3.1

Train with model.fit() instead. In the first few epochs the encouraging part is that training accuracy starts to climb slowly, but validation accuracy still shows the original problem, and the intermediate results obtained via model.predict_on_batch() are still identical within a batch:

```
Epoch 1/20
581/581 [==============================] - 162s 279ms/step - loss: 0.6768 - sparse_categorical_accuracy: 0.6224 - val_loss: 0.6981 - val_sparse_categorical_accuracy: 0.4984
Epoch 2/20
581/581 [==============================] - 133s 228ms/step - loss: 0.4847 - sparse_categorical_accuracy: 0.7684 - val_loss: 0.6931 - val_sparse_categorical_accuracy: 0.5016
Epoch 3/20
581/581 [==============================] - 130s 223ms/step - loss: 0.3905 - sparse_categorical_accuracy: 0.8250 - val_loss: 0.6996 - val_sparse_categorical_accuracy: 0.4984
Epoch 4/20
581/581 [==============================] - 131s 225ms/step - loss: 0.3113 - sparse_categorical_accuracy: 0.8660 - val_loss: 0.6935 - val_sparse_categorical_accuracy: 0.5016
```

As training went on, however, things reversed and became normal (the tf.function-style loop never changes no matter how long you train; good thing we did not give up). (Follow-up: there is actually still a problem here, keep reading; it already felt odd at the time that convergence was this slow.)

```
Epoch 18/20
581/581 [==============================] - 131s 226ms/step - loss: 0.0731 - sparse_categorical_accuracy: 0.9725 - val_loss: 1.4896 - val_sparse_categorical_accuracy: 0.8703
Epoch 19/20
581/581 [==============================] - 130s 225ms/step - loss: 0.0664 - sparse_categorical_accuracy: 0.9748 - val_loss: 0.6890 - val_sparse_categorical_accuracy: 0.9016
Epoch 20/20
581/581 [==============================] - 126s 217ms/step - loss: 0.0631 - sparse_categorical_accuracy: 0.9768 - val_loss: 1.0290 - val_sparse_categorical_accuracy: 0.9031
```

The results obtained via model.predict_on_batch() are consistent with this accuracy.

Experiment 3.2

The previous experiment confirmed that training purely through the Keras API does work. What is the deeper cause? Is it that BatchNormalization never updates its moving mean and moving variance? The answer is yes. We printed the moving mean and moving variance before and after training with both methods:

```python
def get_bn_vars(collection):
    moving_mean, moving_variance = None, None
    for var in collection:
        name = var.name.lower()
        if "variance" in name:
            moving_variance = var
        if "mean" in name:
            moving_mean = var
    if moving_mean is not None and moving_variance is not None:
        return moving_mean, moving_variance
    raise ValueError("Unable to find moving mean and variance")

mean, variance = get_bn_vars(model.variables)
print(mean)
print(variance)
```

We found that with model.fit() the mean and variance are indeed updated (although the update rate looks a bit odd), while with the tf.function-style loop these two values are never updated.

This also explains why fine-tuning does not hit the problem: the mean and variance trained on ImageNet are already good values, so the model works even without updating them.

Experiment 3.3

Is it enough, then, to build the model with a dynamic input shape as suggested in [4]?
```python
class MyModel(Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = Conv2D(32, 3, activation='relu')
        self.batch_norm1 = BatchNormalization()
        self.flatten = Flatten()
        self.d1 = Dense(128, activation='relu')
        self.d2 = Dense(10, activation='softmax')

    def call(self, x):
        x = self.conv1(x)
        x = self.batch_norm1(x)
        x = self.flatten(x)
        x = self.d1(x)
        return self.d2(x)

model = MyModel()
# model.build((None,28,28,1))
model.summary()

@tf.function
def train_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image)
        loss = loss_object(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(label, predictions)
```

The model looks like this:

```
Model: "my_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              multiple                  320
_________________________________________________________________
batch_normalization_v2 (Batc multiple                  128
_________________________________________________________________
flatten (Flatten)            multiple                  0
_________________________________________________________________
dense (Dense)                multiple                  2769024
_________________________________________________________________
dense_1 (Dense)              multiple                  1290
=================================================================
Total params: 2,770,762
Trainable params: 2,770,698
Non-trainable params: 64
```

Judging from the Output Shape, the model is built correctly. Running it once over MNIST, the results were quite good! Just to be safe, we again checked whether the mean and variance get updated, and surprisingly they do not! In other words, the scheme described in [4] does not work in our case.

Experiment 3.4

Since we have located the problem at BatchNormalization, it comes to mind that BatchNormalization behaves differently in training and testing: during testing the moving mean and variance must not be updated. Could it be that the tf.function-style loop does not switch this state automatically? Looking at the source code, BatchNormalization's call() has a training argument, whose default is False:

```
Call arguments:
  inputs: Input tensor (of any rank).
  training: Python boolean indicating whether the layer should behave in
    training mode or in inference mode.
    - `training=True`: The layer will normalize its inputs using the
      mean and variance of the current batch of inputs.
    - `training=False`: The layer will normalize its inputs using the
      mean and variance of its moving statistics, learned during training.
```

So we made the following change:

```python
class MyModel(Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = Conv2D(32, 3, activation='relu')
        self.batch_norm1 = BatchNormalization()
        self.flatten = Flatten()
        self.d1 = Dense(128, activation='relu')
        self.d2 = Dense(10, activation='softmax')

    def call(self, x, training=True):
        x = self.conv1(x)
        x = self.batch_norm1(x, training=training)
        x = self.flatten(x)
        x = self.d1(x)
        return self.d2(x)

model = MyModel()
# model.build((None,28,28,1))
model.summary()

@tf.function
def train_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image, training=True)
        loss = loss_object(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(label, predictions)

@tf.function
def test_step(image, label):
    predictions = model(image, training=False)
    t_loss = loss_object(label, predictions)
    test_loss(t_loss)
    test_accuracy(label, predictions)
```

Now the moving mean and variance do get updated, and the test accuracy matches expectations. So we can pin down the root cause: you have to tell BatchNormalization whether it is in training or in testing!
Experiment 3.5

The method in 3.4 solves our problem, but it builds the model by subclassing Model, whereas our earlier MobileNetV2 was built with the more flexible Keras Functional API; since we cannot control the definition of call(), there is no easy way to switch between training and testing states, and the same applies to models built with Sequential. [5] https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html [7] https://github.com/keras-team/keras/issues/7085 [8] https://github.com/keras-team/keras/issues/6752

From [5] and [8] I learned two things:

- tf.keras.backend.set_learning_phase() can switch the training/testing state;
- model.updates and layer.updates hold the Assign ops from old_value to new_value.

So the first thing I tried was:

```python
tf.keras.backend.set_learning_phase(True)
```

With this, the model built from MobileNetV2 also trains correctly, and it even seems to converge much faster than model.fit(). Recalling the earlier puzzle of slow model.fit() convergence, we added this line to the model.fit() version as well and found that it also converges much faster: one epoch is enough for a decent result! This raises another question: does model.fit() set the learning_phase state at all? And if not, how does it update the moving mean and variance? (A minimal sketch of this workaround is given after the references below.)

The second approach: since the tutorials describe how to do this in 1.x, and under eager execution there seems to be no way to run these Assign operations, it is listed here for reference only:

```python
update_ops = []
for assign_op in model.updates:
    update_ops.append(assign_op)
# but how do we run these update_ops under eager execution?
```

Conclusion

To sum up: [4] gave us the starting point for solving the problem, but in the end the problem and the fix described there do not really apply to our case. The crux is how Keras + TensorFlow 2.0 handles layers whose behavior differs between training and testing, together with the differences between the two training methods, model.fit() and tf.function; in the end model.fit() seems to hide quite a few mysterious behaviors. Our recommendations:

- When training with Keras APIs such as model.fit() or model.train_on_batch(), also set tf.keras.backend.set_learning_phase(True) manually; it can speed up convergence.
- If you train under eager execution:
  1) build the model by subclassing Model and give call() a training argument, handling layers such as BatchNormalization and Dropout differently in the two modes;
  2) or build the model with the Functional API or Sequential, set tf.keras.backend.set_learning_phase(True), and remember to flip the state back at test time.

Finally, why do the TF 2.0 tutorials mention none of this? Do they assume you are already a Keras expert? [facepalm]

Acknowledgements

Thanks to 柏涛, 帆月 and 应知 for their help.

[1] https://www.tensorflow.org/alpha/tutorials/images/transfer_learning?hl=zh-cn
[2] https://github.com/tensorflow/tensorflow/issues/19643
[3] https://github.com/tensorflow/tensorflow/issues/23873
[4] https://pgaleone.eu/tensorflow/keras/2019/01/19/keras-not-yet-interface-to-tensorflow/
[5] https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html
[6] https://github.com/tensorflow/tensorflow/issues/16455
[7] https://github.com/keras-team/keras/issues/7085
[8] https://github.com/keras-team/keras/issues/6752
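As promised above, here is a minimal sketch of the set_learning_phase workaround for Functional/Sequential models under eager execution; it assumes `model`, `optimizer`, `loss_object` and the `train_batches` iterator already exist, as in the training loops shown earlier:

```python
import tensorflow as tf

tf.keras.backend.set_learning_phase(True)   # training phase: BN layers update their moving statistics

@tf.function
def train_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image)
        loss = loss_object(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

for images, labels in train_batches:
    train_step(images, labels)

tf.keras.backend.set_learning_phase(False)  # remember to switch back before evaluation
```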
The TensorFlow Object Detection API provides a way to deploy SSD models to TFLite, but that API is rather rigid. If you have implemented your own SSD training code, how do you deploy the result to TFLite?

First, setting the post-processing aside, your SSD model (whether VGG-SSD or MobileNet-SSD) ultimately outputs class_predictions and bbox_predictions, and both are encoded.

The encoding:

- class_predictions: M feature layers, whose spatial size depends on the network structure; each feature layer has Num_Anchor_Depth_of_this_layer x Num_classes channels.
- box_predictions: M feature layers; each feature layer has Num_Anchor_Depth_of_this_layer x 4 channels. These 4 channels are dy, dx, h, w, i.e. the offsets of the bbox center from the anchor center plus the width and height.

Note: to balance the magnitudes of the losses, the raw dy, dx, w, h are usually not encoded directly. Instead one encodes dy/anchor_h*scale0, dx/anchor_w*scale0, log(h/anchor_h)*scale1, log(w/anchor_w)*scale1: the absolute offsets are divided by the anchor height/width to get relative values and then multiplied by a scale (empirically scale0 is 5 and scale1 is 10). For h and w, the relative value is first passed through log and then scaled; h/anchor_h is around 1, and taking the log moves it to around 0. Decoding therefore needs the corresponding inverse transforms, and the TFLite_Detection_PostProcess op implementation contains exactly such a piece of code. A sketch of the decode step is given below.

What we then need to do is decode the predicted confidence and location for each class.

Post-processing

In the Object Detection API's export_tflite_ssd_graph_lib.py you can see that the only difference from directly freezing a pb is that the post-processing part is replaced at the end:

```python
frozen_graph_def = exporter.freeze_graph_with_def_protos(
    input_graph_def=tf.get_default_graph().as_graph_def(),
    input_saver_def=input_saver_def,
    input_checkpoint=checkpoint_to_use,
    output_node_names=','.join([
        'raw_outputs/box_encodings', 'raw_outputs/class_predictions',
        'anchors'
    ]),
    restore_op_name='save/restore_all',
    filename_tensor_name='save/Const:0',
    clear_devices=True,
    output_graph='',
    initializer_nodes='')

# Add new operation to do post processing in a custom op (TF Lite only)
if add_postprocessing_op:
    transformed_graph_def = append_postprocessing_op(
        frozen_graph_def, max_detections, max_classes_per_detection,
        nms_score_threshold, nms_iou_threshold, num_classes, scale_values)
else:
    # Return frozen without adding post-processing custom op
    transformed_graph_def = frozen_graph_def
```

The post-processing itself is simple to follow in the code: it adds a node named TFLite_Detection_PostProcess that does the decoding and non-maximum suppression.
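Here is a small numpy sketch of the inverse (decode) transform described in the note above; the array layouts and the names `pred`/`anchors` are my own assumptions, and scale0/scale1 follow the empirical values mentioned earlier:

```python
import numpy as np

def decode_boxes(pred, anchors, scale0=5.0, scale1=10.0):
    """Decode SSD box predictions back to (cy, cx, h, w).

    pred:    [N, 4] encoded predictions (dy, dx, dh, dw) as described above
    anchors: [N, 4] anchors as (cy, cx, h, w)
    """
    dy, dx, dh, dw = pred[:, 0], pred[:, 1], pred[:, 2], pred[:, 3]
    acy, acx, ah, aw = anchors[:, 0], anchors[:, 1], anchors[:, 2], anchors[:, 3]

    cy = dy / scale0 * ah + acy       # undo dy / anchor_h * scale0
    cx = dx / scale0 * aw + acx       # undo dx / anchor_w * scale0
    h = np.exp(dh / scale1) * ah      # undo log(h / anchor_h) * scale1
    w = np.exp(dw / scale1) * aw      # undo log(w / anchor_w) * scale1
    return np.stack([cy, cx, h, w], axis=1)
```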
The inputs of this node are the box_predictions and class_predictions mentioned above plus the encoded anchors; the reason for using it is that TFLite does not support the tf.contrib.image.non_max_suppression operation.

Reshape step:

Be aware that the TFLite_Detection_PostProcess op has fixed shape requirements for raw_outputs/box_encodings, raw_outputs/class_predictions and anchors:

- raw_outputs/box_encodings.shape = [1, num_anchors, 4]
- raw_outputs/class_predictions.shape = [1, num_anchors, num_classes+1]
- anchors.shape = [1, num_anchors, 4]

Note: 1) all three must be 3-D tensors; 2) the last dimension of raw_outputs/class_predictions includes the background class, hence num_classes+1. TFLite_Detection_PostProcess also takes a num_classes parameter, and that value does not include background, which is why the class indices output by TFLite_Detection_PostProcess are counted from 0.

```python
with tf.variable_scope('raw_outputs'):
    cls_pred = [tf.reshape(pred, [-1, num_classes]) for pred in cls_pred]
    location_pred = [tf.reshape(pred, [-1, 4]) for pred in location_pred]
    cls_pred = tf.concat(cls_pred, axis=0)
    location_pred = tf.expand_dims(tf.concat(location_pred, axis=0), 0,
                                   name='box_encodings')
    cls_pred = tf.nn.softmax(cls_pred)
    tf.identity(tf.expand_dims(cls_pred, 0), name='class_predictions')
```

This code reshapes the outputs into the required form. Note that class_prediction needs a softmax or sigmoid applied; which one you choose depends on whether you allow one anchor to correspond to multiple classes.

The anchors are in fact a constant:

```python
num_anchors = anchor_cy.get_shape().as_list()
with tf.Session() as sess:
    y_out, x_out, h_out, w_out = sess.run([anchor_cy, anchor_cx, anchor_h, anchor_w])
encoded_anchors = tf.constant(
    np.transpose(np.stack((y_out, x_out, h_out, w_out))),
    dtype=tf.float32,
    shape=[num_anchors[0], 4])
```

Note: when I first used tf.stack to assemble this value I found that TFLite only supports tf.stack with axis=0; otherwise the conversion fails.

Exporting the pb

Once the post-processing is appended, you can export a pb file that contains it. If you skip the post-processing and do it later on the CPU, you can also save yourself quite a bit of trouble.

```python
binary_graph = os.path.join(output_dir, 'tflite_graph.pb')
with tf.gfile.GFile(binary_graph, 'wb') as f:
    f.write(transformed_graph_def.SerializeToString())
txt_graph = os.path.join(output_dir, 'tflite_graph.pbtxt')
with tf.gfile.GFile(txt_graph, 'w') as f:
    f.write(str(transformed_graph_def))
```

Note: if the exported pb contains the post-processing, it cannot be run with regular TF; it must be converted to tflite to run.

Exporting the tflite

```shell
bazel run --config=opt tensorflow/contrib/lite/toco:toco -- \
  --input_file=$OUTPUT_DIR/tflite_graph.pb \
  --output_file=$OUTPUT_DIR/detect.tflite \
  --input_shapes=1,300,300,3 \
  --input_arrays=normalized_input_image_tensor \
  --output_arrays='TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1','TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3' \
  --inference_type=QUANTIZED_UINT8 \
  --mean_values=128 \
  --std_values=128 \
  --change_concat_input_ranges=false \
  --allow_custom_ops
```

or

```shell
bazel run -c opt tensorflow/lite/toco:toco -- \
  --input_file=$OUTPUT_DIR/tflite_graph.pb \
  --output_file=$OUTPUT_DIR/detect.tflite \
  --input_shapes=1,300,300,3 \
  --input_arrays=normalized_input_image_tensor \
  --output_arrays='TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1','TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3' \
  --inference_type=FLOAT \
  --allow_custom_ops
```

During the export you may see the message "Converting unsupported operation: TFLite_Detection_PostProcess"; with TF 1.10 or newer you can simply ignore it.

You can then load the tflite in a Python program to test it first.

Note: at this point you will notice a problem: the NMS in TFLite_Detection_PostProcess ignores class labels if you set max_classes_per_detection=1; but if you set it to a value greater than 1, it also counts the background label, which produces many false bboxes.

Deploying to Android

Finally, you can try deploying it to Android. Everything works as long as NNAPI is not used; with NNAPI you have to implement the relevant operations yourself, otherwise the app will crash.
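For that quick Python test, a minimal sketch along these lines can be used (shown for the FLOAT export above; for the QUANTIZED_UINT8 export the input would be uint8). The output order boxes/classes/scores/num_detections matches the four output_arrays listed in the toco command, and the random input is only a placeholder for a real preprocessed 300x300 image:

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="detect.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
outs = interpreter.get_output_details()

img = np.random.rand(1, 300, 300, 3).astype(np.float32)  # placeholder for a real image
interpreter.set_tensor(inp['index'], img)
interpreter.invoke()

# Four outputs of TFLite_Detection_PostProcess: boxes, classes, scores, num_detections
boxes, classes, scores, num_detections = [interpreter.get_tensor(o['index']) for o in outs]
print(num_detections, scores[0][:5], classes[0][:5])
```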