Training and Inference with the TensorFlow Object Detection API

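Overall workflow (using PASCAL VOC as an example)

1. Download the PASCAL VOC2012 dataset and convert it to TFRecord format.

2. Choose and download a pre-trained model.

3. Write the training configuration file (all training parameters are set through this file).

4. Train the model.

5. Use TensorBoard to inspect how the loss, accuracy, and other metrics change during training.

6. Freeze the model parameters.

7. Load the frozen .pb file to run prediction.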

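File layout

First create the directory structure below, then copy the label_map.pbtxt file from models/research/object_detection/data into the data directory you created.

label_map.pbtxt: defines the mapping between class ids and class names.

The directory structure is as follows: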

.
├── data/
│   ├── eval-00000-of-00001.tfrecord   # file
│   ├── label_map.pbtxt                # file
│   ├── train-00000-of-00002.tfrecord  # file
│   └── train-00001-of-00002.tfrecord  # file
└── models/
    └── my_model_dir/
        ├── eval/                  # Created by evaluation job.
        ├── my_model.config        # pipeline config
        ├── model_ckpt-100-data@1  #
        ├── model_ckpt-100-index   # Created by training job.
        └── checkpoint             #

Copy the label map into the data directory (using PASCAL VOC2012 as an example):

cp /xxx/models/research/object_detection/data/pascal_label_map.pbtxt ./data/
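
To make the id-to-name mapping concrete, here is a minimal sketch that reads the label-map text format with a regular expression. The sample text and the parse_label_map helper are illustrative assumptions, not part of the API; the Object Detection API ships its own label-map utilities, which are the better choice in real code.

```python
import re

# A two-entry sample in the label-map text format (illustrative only).
SAMPLE_LABEL_MAP = """
item {
  id: 1
  name: 'aeroplane'
}
item {
  id: 2
  name: 'bicycle'
}
"""


def parse_label_map(text):
    """Return {class_id: class_name} parsed from label-map text."""
    mapping = {}
    # Each entry is an `item { ... }` block holding an id and a name.
    for block in re.findall(r"item\s*\{(.*?)\}", text, flags=re.DOTALL):
        id_match = re.search(r"\bid\s*:\s*(\d+)", block)
        name_match = re.search(r"\bname\s*:\s*['\"]([^'\"]+)['\"]", block)
        if id_match and name_match:
            mapping[int(id_match.group(1))] = name_match.group(1)
    return mapping


label_map = parse_label_map(SAMPLE_LABEL_MAP)
print(label_map)
```

The number of entries in this mapping is the value that num_classes in the pipeline config has to match.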

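Preparing the input data

The TensorFlow Object Detection API reads data in TFRecord format. It provides two scripts, create_pascal_tf_record.py and create_pet_tf_record.py, for converting the PASCAL VOC and Oxford-IIIT Pet datasets to TFRecord.

Generating the PASCAL VOC TFRecord files

If you do not have the dataset locally, download and unpack it with the following commands: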

wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
tar -xvf VOCtrainval_11-May-2012.tar

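Use the following commands to convert PASCAL VOC to TFRecord format.

Example (change data_dir to the path of your own dataset; this example uses a local VOC2007 copy, so set --year=VOC2012 for the archive downloaded above):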

# From tensorflow/models/research/
python object_detection/dataset_tools/create_pascal_tf_record.py \
    --label_map_path=/root/data/pascal_label_map.pbtxt \
    --data_dir=/data2/VOC2007/VOCdevkit --year=VOC2007 --set=train \
    --output_path=/root/data/pascal_train.record
python object_detection/dataset_tools/create_pascal_tf_record.py \
    --label_map_path=/root/data/pascal_label_map.pbtxt \
    --data_dir=/data2/VOC2007/VOCdevkit --year=VOC2007 --set=val \
    --output_path=/root/data/pascal_val.record

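data_dir: path to the PASCAL VOC dataset
output_path: where the TFRecord file is written

After the commands finish, pascal_train.record and pascal_val.record appear at the locations given by --output_path (/root/data in the example above).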
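
Since the two conversion commands differ only in the split name, they can be generated from one template. The sketch below assembles the argument lists in Python; build_voc_command is a hypothetical helper, and the paths are simply the ones from the example above. It only prints the commands rather than running them.

```python
import shlex


def build_voc_command(split, data_dir, label_map_path, output_dir, year="VOC2007"):
    """Assemble the argv list for one dataset split (hypothetical helper)."""
    return [
        "python",
        "object_detection/dataset_tools/create_pascal_tf_record.py",
        f"--label_map_path={label_map_path}",
        f"--data_dir={data_dir}",
        f"--year={year}",
        f"--set={split}",
        f"--output_path={output_dir}/pascal_{split}.record",
    ]


commands = [
    build_voc_command(
        split,
        data_dir="/data2/VOC2007/VOCdevkit",
        label_map_path="/root/data/pascal_label_map.pbtxt",
        output_dir="/root/data",
    )
    for split in ("train", "val")
]

for argv in commands:
    # Print only. To run the conversion, pass argv to
    # subprocess.run(argv, check=True) from tensorflow/models/research/.
    print(shlex.join(argv))
```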

Generating the COCO TFRecord files.

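The COCO dataset can be downloaded from the official COCO dataset website.
Use the following command to convert COCO to TFRecord format.

Example (change the paths to your own):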

# From tensorflow/models/research/
python object_detection/dataset_tools/create_coco_tf_record.py --logtostderr \
  --train_image_dir=/data2/datasets/coco/train2017 \
  --val_image_dir=/data2/datasets/coco/val2017 \
  --test_image_dir=/data2/datasets/coco/unlabeled2017 \
  --train_annotations_file=/data2/datasets/coco/annotations/instances_train2017.json \
  --val_annotations_file=/data2/datasets/coco/annotations/instances_val2017.json \
  --testdev_annotations_file=/data2/datasets/coco/annotations/image_info_test-dev2017.json \
  --output_dir=/root/data

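After the command finishes, a number of files whose names start with coco appear in output_dir (/root/data in the example above).

Also copy the COCO label map (.pbtxt) into output_dir.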

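Training and evaluation with TensorFlow 1

Configuring the training pipeline

The TensorFlow Object Detection API uses protobuf files to configure training and evaluation. The schema for the training pipeline is defined in object_detection/protos/pipeline.proto, and the object_detection/samples/configs folder provides simple, ready-to-use configurations.

The rest of this section describes what goes into the configuration.

The config file is split into five parts:

model: the type of model to train
train_config: the parameters used to train the model
eval_config: the metrics reported during evaluation
train_input_reader: the dataset the model is trained on
eval_input_reader: the dataset the model is evaluated on

The overall structure is: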

model {
(... Add model config here...)
}

train_config : {
(... Add train_config here...)
}

train_input_reader: {
(... Add train_input configuration here...)
}

eval_config: {
}

eval_input_reader: {
(... Add eval_input configuration here...)
}
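
A quick sanity check on a hand-edited config is to confirm that all five top-level sections are present. The sketch below does this with a regular expression; missing_sections is a hypothetical helper, and it assumes top-level blocks start at the beginning of a line while nested blocks are indented. It does not validate the file against pipeline.proto.

```python
import re

REQUIRED_SECTIONS = (
    "model",
    "train_config",
    "train_input_reader",
    "eval_config",
    "eval_input_reader",
)


def missing_sections(config_text):
    """Return the required top-level sections absent from the config text."""
    # Match `name {`, `name: {` or `name : {` at the start of a line.
    found = set(re.findall(r"^(\w+)\s*:?\s*\{", config_text, flags=re.MULTILINE))
    return [name for name in REQUIRED_SECTIONS if name not in found]


# A deliberately incomplete sample: eval_input_reader is left out.
SAMPLE_CONFIG = """
model {
  faster_rcnn {
    num_classes: 20
  }
}
train_config : {
  batch_size: 1
}
train_input_reader: {
  label_map_path: "/usr/home/username/data/label_map.pbtxt"
}
eval_config: {
}
"""

print(missing_sections(SAMPLE_CONFIG))
```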

Choosing model parameters

Remember to change num_classes to match your own task.
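
When a sample config is reused for a new dataset, num_classes can be patched in the config text before training. This is a minimal sketch with a hypothetical helper, set_num_classes, working on the text alone; the sample string is illustrative, and 20 is the number of classes in PASCAL VOC.

```python
import re


def set_num_classes(config_text, num_classes):
    """Rewrite every `num_classes: N` entry in the config text."""
    if num_classes < 1:
        raise ValueError("num_classes must be positive")
    new_text, count = re.subn(
        r"(num_classes\s*:\s*)\d+",
        lambda match: match.group(1) + str(num_classes),
        config_text,
    )
    if count == 0:
        raise ValueError("no num_classes field found in the config")
    return new_text


sample = "model {\n  faster_rcnn {\n    num_classes: 90\n  }\n}\n"
updated = set_num_classes(sample, 20)
print(updated)
```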

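Defining inputs

The API accepts inputs in TFRecord format. You need to specify the locations of the training and evaluation files and of the label map. The training and evaluation datasets should use the same label map.

Example: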

tf_record_input_reader {
  input_path: "/usr/home/username/data/train.record"
}
label_map_path: "/usr/home/username/data/label_map.pbtxt"

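Configuring the trainer

train_config defines three parts of the training process:

Model parameter initialization
Input preprocessing (optional)
SGD parameters

Example: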

batch_size: 1
optimizer {
  momentum_optimizer: {
    learning_rate: {
      manual_step_learning_rate {
        initial_learning_rate: 0.0002
        schedule {
          step: 0
          learning_rate: .0002
        }
        schedule {
          step: 900000
          learning_rate: .00002
        }
        schedule {
          step: 1200000
          learning_rate: .000002
        }
      }
    }
    momentum_optimizer_value: 0.9
  }
  use_moving_average: false
}
fine_tune_checkpoint: "/usr/home/username/tmp/model.ckpt-#####"
from_detection_checkpoint: true
load_all_detection_checkpoint_vars: true
gradient_clipping_by_norm: 10.0
data_augmentation_options {
  random_horizontal_flip {
  }
}
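
The manual_step_learning_rate block above describes a piecewise-constant schedule. As a rough illustration, the sketch below looks up the rate for a given step, assuming each rate takes effect from its own step onward; learning_rate_at is a hypothetical helper, not the API's implementation, and it ignores options such as warmup.

```python
import bisect

# (step, learning_rate) pairs copied from the schedule above.
SCHEDULE = [(0, 0.0002), (900000, 0.00002), (1200000, 0.000002)]


def learning_rate_at(step, schedule=SCHEDULE, initial_learning_rate=0.0002):
    """Return the learning rate in effect at the given global step."""
    boundaries = [boundary for boundary, _ in schedule]
    # Index of the last boundary that is <= step, or -1 if there is none.
    index = bisect.bisect_right(boundaries, step) - 1
    if index < 0:
        return initial_learning_rate
    return schedule[index][1]


for step in (0, 899999, 900000, 1500000):
    print(step, learning_rate_at(step))
```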

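Configuring the evaluator

The main settings in eval_config are num_examples and metrics_set.

num_examples: the number of batches (currently of batch size 1) used for one evaluation cycle
metrics_set: which metrics to compute during evaluation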

Model Parameter Initialization

The train_config section provides two fields for specifying an existing checkpoint:

fine_tune_checkpoint: a path prefix (e.g. "/usr/home/username/checkpoint/model.ckpt-#####").
fine_tune_checkpoint_type: either classification or detection.

A list of classification checkpoints can be found in the TF-Slim pre-trained models list.

A list of detection checkpoints can be found in the TensorFlow detection model zoo.

Training

Single machine, single GPU

Template:

# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
NUM_TRAIN_STEPS=50000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python object_detection/model_main.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --num_train_steps=${NUM_TRAIN_STEPS} \
    --sample_1_of_n_eval_examples=${SAMPLE_1_OF_N_EVAL_EXAMPLES} \
    --alsologtostderr

Example:

python object_detection/model_main.py \
    --pipeline_config_path=/root/my_models/faster_rcnn_resnet101_voc07.config \
    --model_dir=/root/my_models/checkpoint \
    --num_train_steps=1

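${PIPELINE_CONFIG_PATH}: path to the pipeline config
${MODEL_DIR}: directory where the checkpoints produced by training are saved
num_train_steps: number of training steps
num_workers (a flag of model_main_tf2.py; model_main.py does not define it):
- = 1: MirroredStrategy
- > 1: MultiWorkerMirroredStrategy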
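
After training, it is often useful to know which step the newest checkpoint in the model directory corresponds to. The sketch below extracts it from a list of file names, assuming the model.ckpt-<step>.index naming used by TF1 checkpoints; latest_checkpoint_step and the sample listing are illustrative assumptions.

```python
import re


def latest_checkpoint_step(filenames):
    """Return the highest step among model.ckpt-<step>.index files, or None."""
    steps = []
    for name in filenames:
        match = re.fullmatch(r"model\.ckpt-(\d+)\.index", name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


# A made-up directory listing; in practice pass os.listdir(model_dir).
files = [
    "checkpoint",
    "graph.pbtxt",
    "model.ckpt-0.index",
    "model.ckpt-100.data-00000-of-00001",
    "model.ckpt-100.index",
    "model.ckpt-100.meta",
]
print(latest_checkpoint_step(files))
```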

Single machine, multiple GPUs

Multi-GPU training on a single machine uses a different launch script from single-GPU training.

Example:

CUDA_VISIBLE_DEVICES=0,1 python object_detection/legacy/train.py \
    --pipeline_config_path=/root/my_models/faster_rcnn_resnet101_voc07.config \
    --train_dir=/root/my_models/checkpoint \
    --num_clones=2 \
    --ps_tasks=1

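train_dir: directory where the checkpoints produced by training are saved
num_clones: usually equal to the number of GPUs
ps_tasks: number of parameter servers. Default: 0, meaning no parameter server is used

Multiple machines, multiple GPUs

The official documentation does not describe multi-machine, multi-GPU training. The one approach found through a Google search implements distributed training on a Hadoop cluster.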

Evaluation

Single machine, single GPU

Template:

# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
CHECKPOINT_DIR=${MODEL_DIR}
python object_detection/model_main.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --checkpoint_dir=${CHECKPOINT_DIR} \
    --alsologtostderr

Examples:

python object_detection/model_main.py \
    --pipeline_config_path=/root/my_models/faster_rcnn_resnet101_voc07.config \
    --model_dir=/root/my_models \
    --checkpoint_dir=/root/my_models/checkpoint

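${CHECKPOINT_DIR}: directory containing the checkpoints produced by training. When this flag is set, the script runs in eval-only mode and writes the evaluation metrics under model_dir.
${MODEL_DIR}/eval: where the evaluation event files are written

Single machine, multiple GPUs

Example: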

CUDA_VISIBLE_DEVICES=0,1 python object_detection/legacy/eval.py \
    --checkpoint_dir=/root/my_models/checkpoint \
    --eval_dir=/root/my_models/eval \
    --pipeline_config_path=/root/my_models/faster_rcnn_resnet101_voc07.config

Training and evaluation with TensorFlow 2

Training

Template:

# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
python object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --alsologtostderr

Example:

python object_detection/model_main_tf2.py \
    --pipeline_config_path=/root/my_models/faster_rcnn_resnet101_voc07.config \
    --model_dir=/root/my_models/checkpoint

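${PIPELINE_CONFIG_PATH}: path to the pipeline config
${MODEL_DIR}: directory where the checkpoints produced by training are saved

Note: under TF2 the script uses MirroredStrategy() by default, which trains on all GPUs of the current machine. To use only some of the GPUs, specify them explicitly, e.g. strategy = tf.compat.v2.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"]) uses GPUs 0 and 1.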
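
The devices argument is just a list of device strings, so it can be built from a list of GPU indices. The sketch below does only that and needs no TensorFlow installation; gpu_device_names is a hypothetical helper, and the commented line shows where the result would be used.

```python
def gpu_device_names(gpu_ids):
    """Turn GPU indices into device strings such as "/gpu:0"."""
    if not gpu_ids:
        raise ValueError("at least one GPU index is required")
    if len(set(gpu_ids)) != len(gpu_ids):
        raise ValueError("GPU indices must be unique")
    return [f"/gpu:{index}" for index in gpu_ids]


devices = gpu_device_names([0, 1])
print(devices)
# With TensorFlow 2 installed, the list would be passed on as:
#   strategy = tf.compat.v2.distribute.MirroredStrategy(devices=devices)
```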

Evaluation

Template:

# From the tensorflow/models/research/ directory
PIPELINE_CONFIG_PATH={path to pipeline config file}
MODEL_DIR={path to model directory}
CHECKPOINT_DIR=${MODEL_DIR}
python object_detection/model_main_tf2.py \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
    --model_dir=${MODEL_DIR} \
    --checkpoint_dir=${CHECKPOINT_DIR} \
    --alsologtostderr

Example:

python object_detection/model_main_tf2.py \
    --pipeline_config_path=/root/my_models/faster_rcnn_resnet101_voc07.config \
    --model_dir=/root/my_models/checkpoint \
    --checkpoint_dir=/root/my_models/checkpoint

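${CHECKPOINT_DIR}: directory containing the checkpoints produced by training
${MODEL_DIR}/eval: where the evaluation event files are saved

Multiple machines, multiple GPUs

See the multi-machine, multi-GPU part of the TensorFlow 1.x section.

Common problems

Multi-GPU training on a single machine fails with: ValueError: not enough values to unpack (expected 7, got 0)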

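This happens when batch_size in the config file is set to 1. Set batch_size to the same value as num_clones (or a multiple of it), since the legacy trainer divides each batch among the clones.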
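
The constraint can be checked before launching a job. The sketch below assumes the batch is split evenly across clones, which is why a batch_size of 1 with two clones leaves nothing for each clone; per_clone_batch_size is a hypothetical helper, not a function of the API.

```python
def per_clone_batch_size(batch_size, num_clones):
    """Return the batch size each clone receives, or raise if it cannot be split."""
    if num_clones < 1:
        raise ValueError("num_clones must be at least 1")
    if batch_size < num_clones or batch_size % num_clones != 0:
        raise ValueError(
            f"batch_size={batch_size} cannot be split evenly "
            f"across num_clones={num_clones}"
        )
    return batch_size // num_clones


for batch_size, num_clones in [(2, 2), (8, 2), (1, 2)]:
    try:
        print(batch_size, num_clones, per_clone_batch_size(batch_size, num_clones))
    except ValueError as error:
        print("invalid:", error)
```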

Using a Faster R-CNN model under TensorFlow 2.x fails with: RuntimeError: Groundtruth tensor boxes has not been provided

This bug was introduced by an update to the TensorFlow Object Detection API some time after February 2021. As a workaround, check out an older commit (31e86e8) and then reinstall the Object Detection API.

Original post: https://www.cnblogs.com/jyroy/p/14704964.html