Customize Datasets

一般来说，如果要创建自己自定义的数据集，可以参考现有的 config 借鉴一下。但是我们自定义的数据集可以分两种情况：

使用现有的数据集格式（COCO, Pascal VOC, etc...）处理自己的数据集
使用一种全新的数据集格式处理自己的数据集

下面以 COCO 格式举例：

'images': [
    {
        'file_name': 'COCO_val2014_000000001268.jpg',
        'height': 427,
        'width': 640,
        'id': 1268
    },
    ...
],

'annotations': [
    {
        'segmentation': [[192.81,
            247.09,
            ...
            219.03,
            249.06]],  # if you have mask labels
        'area': 1035.749,
        'iscrowd': 0,
        'image_id': 1268,
        'bbox': [192.81, 224.8, 74.73, 33.43],
        'category_id': 16,
        'id': 42986
    },
    ...
],

'categories': [
    {'id': 0, 'name': 'car'},
 ]

其中必要的三个键对应的值：images，annotations，categories

使用现有的数据集格式

当我们想使用一种不是官方指定的数据集训练时，有以下两种预处理的方式：

修改配置数据集的配置文件
检查当前新数据集的格式是不是现有的数据集格式 (COCO, Psacal VOC, etc...)

tip

比较推荐的一种方式是先把你自己的数据集的格式转换到现有的数据集格式，因为这样还可以使用 COCO 格式的配置文件，而你只需要修改对应的路径和训练的类别即可，这样比较方便。

修改数据集的配置文件

configs/my_custom_config.py

# the new config inherits the base configs to highlight the necessary modification
_base_ = './cascade_mask_rcnn_r50_fpn_1x_coco.py'

# 1. dataset settings
dataset_type = 'CocoDataset'
classes = ('a', 'b', 'c', 'd', 'e')
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        # explicitly add your class names to the field `classes`
        classes=classes,
        ann_file='path/to/your/train/annotation_data',
        img_prefix='path/to/your/train/image_data'),
    val=dict(
        type=dataset_type,
        # explicitly add your class names to the field `classes`
        classes=classes,
        ann_file='path/to/your/val/annotation_data',
        img_prefix='path/to/your/val/image_data'),
    test=dict(
        type=dataset_type,
        # explicitly add your class names to the field `classes`
        classes=classes,
        ann_file='path/to/your/test/annotation_data',
        img_prefix='path/to/your/test/image_data'))

# 2. model settings

# explicitly over-write all the `num_classes` field from default 80 to 5.
model = dict(
    roi_head=dict(
        bbox_head=[
            dict(
                type='Shared2FCBBoxHead',
                # explicitly over-write all the `num_classes` field from default 80 to 5.
                num_classes=5),
            dict(
                type='Shared2FCBBoxHead',
                # explicitly over-write all the `num_classes` field from default 80 to 5.
                num_classes=5),
            dict(
                type='Shared2FCBBoxHead',
                # explicitly over-write all the `num_classes` field from default 80 to 5.
                num_classes=5)],
    # explicitly over-write all the `num_classes` field from default 80 to 5.
    mask_head=dict(num_classes=5)))

主要需要修改 data 和 model 两个字典。

data: 修改 data 中 train，val，test 里面的内容

model: 修改 model 中 num_classes

检查新的数据集格式是否满足现有的数据集格式

如果新的数据集是 COCO 格式，那么确保你的标签是满足 COCO 格式的。

检查标签中 categories 键对应值的长度是否等于你 config 当中 classes 元组的长度
检查 classes 元组中的类别和标签中 categories 键对应值名字是否对上··
标签中的 category_id 要合法，所有的值要属于 categories 当中的 id

有效的样例：

'annotations': [
    {
        'segmentation': [[192.81,
            247.09,
            ...
            219.03,
            249.06]],  # if you have mask labels
        'area': 1035.749,
        'iscrowd': 0,
        'image_id': 1268,
        'bbox': [192.81, 224.8, 74.73, 33.43],
        'category_id': 16,
        'id': 42986
    },
    ...
],

# MMDetection automatically maps the uncontinuous `id` to the continuous label indices.
'categories': [
    {'id': 1, 'name': 'a'}, {'id': 3, 'name': 'b'}, {'id': 4, 'name': 'c'}, {'id': 16, 'name': 'd'}, {'id': 17, 'name': 'e'},
 ]

使用一种全新的数据格式

如果以上方式不能解决你的问题，那么我们就给 mmdetection 创建一种新的数据集格式来读取自己的数据集。首先，数据集的标签要是一个由字典组成的 list，list 当中每一个字典对应一个图片。一些数据集可能会提供像人群/差异/忽略 bboxes 的注释，我们使用 bboxes_ignore 和 labels_ignore 来覆盖它们。

样例如下：

[
    {
        'filename': 'a.jpg',
        'width': 1280,
        'height': 720,
        'ann': {
            'bboxes': <np.ndarray, float32> (n, 4),
            'labels': <np.ndarray, int64> (n, ),
            'bboxes_ignore': <np.ndarray, float32> (k, 4),
            'labels_ignore': <np.ndarray, int64> (k, ) (optional field)
        }
    },
    ...
]

你有两种方式来自定义你的数据集：

你可以写一个继承自 CustomDataset 的类，重写 load_annotations(self, ann_file) 和 get_ann_info(self, idx)，可以参考 COCO 和 VOC 的数据集类。
你可以将上面的格式转换为你想要的数据格式，比如 pascal_voc.py 然后你可以简单用 CustomDataset 实现。

举个例子

假设现在你的 annotation.txt 长这样：

000001.jpg
1280 720
2
10 20 40 60 1
20 40 50 60 2
#
000002.jpg
1280 720
3
50 20 40 60 2
20 40 30 45 2
30 40 50 60 3

创建一个新数据集，在 mmdet/datasets/my_dataset.py 来加载数据。

import mmcv
import numpy as np

from .builder import DATASETS
from .custom import CustomDataset


@DATASETS.register_module()
class MyDataset(CustomDataset):

    CLASSES = ('person', 'bicycle', 'car', 'motorcycle')

    def load_annotations(self, ann_file):
        ann_list = mmcv.list_from_file(ann_file)

        data_infos = []
        for i, ann_line in enumerate(ann_list):
            if ann_line != '#':
                continue

            img_shape = ann_list[i + 2].split(' ')
            width = int(img_shape[0])
            height = int(img_shape[1])
            bbox_number = int(ann_list[i + 3])

            anns = ann_line.split(' ')
            bboxes = []
            labels = []
            for anns in ann_list[i + 4:i + 4 + bbox_number]:
                bboxes.append([float(ann) for ann in anns[:4]])
                labels.append(int(anns[4]))

            data_infos.append(
                dict(
                    filename=ann_list[i + 1],
                    width=width,
                    height=height,
                    ann=dict(
                        bboxes=np.array(bboxes).astype(np.float32),
                        labels=np.array(labels).astype(np.int64))
                ))

        return data_infos

    def get_ann_info(self, idx):
        return self.data_infos[idx]['ann']

然后在配置文件当中使用 MyDataset 类：

dataset_A_train = dict(
    type='MyDataset',
    ann_file = 'image_list.txt',
    pipeline=train_pipeline
)

数据集 wrapper

mmdection 也支持几个方式的数据集操作：

RepeatDataset

dataset_A_train = dict(
        type='RepeatDataset',
        times=N,
        dataset=dict(  # This is the original config of Dataset_A
            type='Dataset_A',
            ...
            pipeline=train_pipeline
        )
    )

ClassBalancedDataset

dataset_A_train = dict(
        type='ClassBalancedDataset',
        oversample_thr=1e-3,
        dataset=dict(  # This is the original config of Dataset_A
            type='Dataset_A',
            ...
            pipeline=train_pipeline
        )
    )

ConcatDataset

dataset_A_train = dict(
    type='Dataset_A',
    ann_file = ['anno_file_1', 'anno_file_2'],
    pipeline=train_pipeline
)

如果混合的两种数据集都有验证集和测试集，那么可以使用 seperate_eval 来设置是否是单独测试还是混合测试

dataset_A_train = dict(
    type='Dataset_A',
    ann_file = ['anno_file_1', 'anno_file_2'],
    separate_eval=False,
    pipeline=train_pipeline
)

下面这种情况是分别测试设置：

dataset_A_train = dict()
dataset_B_train = dict()

data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=2,
    train = [
        dataset_A_train,
        dataset_B_train
    ],
    val = dataset_A_val,
    test = dataset_A_test
    )

同时，也可以支持下面的用法：

dataset_A_val = dict()
dataset_B_val = dict()

data = dict(
    imgs_per_gpu=2,
    workers_per_gpu=2,
    train=dataset_A_train,
    val=dict(
        type='ConcatDataset',
        datasets=[dataset_A_val, dataset_B_val],
        separate_eval=False))

这种方式允许用户通过设置 separate_eval=False 将所有数据集作为一个整体进行评估。

Customize Datasets

使用现有的数据集格式​

修改数据集的配置文件​

检查新的数据集格式是否满足现有的数据集格式​

使用一种全新的数据格式​

举个例子​

数据集 wrapper​