Downloading, preprocessing, and uploading the COCO dataset
COCO is a large-scale object detection, segmentation, and captioning dataset. Machine learning models that use the COCO dataset include:
- Mask R-CNN
- RetinaNet
- ShapeMask
Before you can train a model on a Cloud TPU, you must prepare the training data.
This topic describes how to prepare the COCO dataset for models that run on Cloud TPU. The COCO dataset can only be prepared after you have created a Compute Engine VM. The script used to prepare the data, download_and_preprocess_coco.sh, is installed on the VM and must be run on the VM.
After preparing the data by running the download_and_preprocess_coco.sh script, you can bring up the Cloud TPU and run the training.
Downloading, preprocessing, and uploading the full COCO dataset to a Cloud Storage bucket takes approximately 2 hours.
In your Cloud Shell, configure gcloud with your project ID.

export PROJECT_ID=project-id
gcloud config set project ${PROJECT_ID}
In your Cloud Shell, create a Cloud Storage bucket using the following command:
gsutil mb -p ${PROJECT_ID} -c standard -l europe-west4 gs://bucket-name
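Bucket names must be globally unique, and Cloud Storage restricts them to lowercase letters, digits, dashes, underscores, and dots, 3 to 63 characters long, starting and ending with a letter or digit. As a rough local sanity check before running gsutil mb, you could validate a candidate name first. The pattern below is a simplified sketch and does not cover every official rule (for example, the extra restrictions on dotted names and IP-address-like names):

```shell
# Simplified sketch of a bucket-name check; not the complete naming rules.
BUCKET=my-coco-bucket   # hypothetical name; substitute your own
if echo "${BUCKET}" | grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'; then
  echo "bucket name looks valid"
else
  echo "bucket name looks invalid"
fi
```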
Launch a Compute Engine VM instance.
This VM instance will only be used to download and preprocess the COCO dataset. Replace instance-name with a name of your choosing.

$ gcloud compute tpus execution-groups create \
  --vm-only \
  --name=instance-name \
  --zone=europe-west4-a \
  --disk-size=300 \
  --machine-type=n1-standard-16 \
  --tf-version=2.12.0
Command flag descriptions
vm-only: Create a VM only. By default the gcloud compute tpus execution-groups command creates a VM and a Cloud TPU.
name: The name of the Cloud TPU to create.
zone: The zone where you plan to create your Cloud TPU.
disk-size: The size, in GB, of the hard disk of the VM created by the gcloud compute tpus execution-groups command.
machine-type: The machine type of the Compute Engine VM to create.
tf-version: The version of TensorFlow that gcloud compute tpus execution-groups installs on the VM.
If you are not automatically logged in to the Compute Engine instance, log in by running the following ssh command. When you are logged into the VM, your shell prompt changes from username@projectname to username@vm-name:

$ gcloud compute ssh instance-name --zone=europe-west4-a
Set up two variables, one for the storage bucket you created earlier and one for the directory that holds the training data (DATA_DIR) on the storage bucket.
(vm)$ export STORAGE_BUCKET=gs://bucket-name
(vm)$ export DATA_DIR=${STORAGE_BUCKET}/coco
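Because the download and conversion steps are long-running, it can help to fail fast if either variable is missing. A minimal optional guard, assuming the two exports above (bucket-name is still the placeholder from the earlier step):

```shell
# Optional guard sketch: abort early if either variable is unset.
export STORAGE_BUCKET=gs://bucket-name
export DATA_DIR=${STORAGE_BUCKET}/coco
: "${STORAGE_BUCKET:?STORAGE_BUCKET must be set}"
: "${DATA_DIR:?DATA_DIR must be set}"
echo "Training data will be uploaded to ${DATA_DIR}"
```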
Install the packages needed to preprocess the data.

(vm)$ sudo apt-get install -y python3-tk && \
  pip3 install --user Cython matplotlib opencv-python-headless pyyaml Pillow && \
  pip3 install --user "git+https://github.com/cocodataset/cocoapi#egg=pycocotools&subdirectory=PythonAPI"
Run the download_and_preprocess_coco.sh script to convert the COCO dataset into a set of TFRecords (*.tfrecord) that the training application expects.

(vm)$ git clone https://github.com/tensorflow/tpu.git
(vm)$ sudo bash tpu/tools/datasets/download_and_preprocess_coco.sh ./data/dir/coco
This installs the required libraries and then runs the preprocessing script. It outputs a number of *.tfrecord files in your local data directory. The COCO download and conversion script takes approximately 1 hour to complete.

Copy the data to your Cloud Storage bucket
After you convert the data into TFRecords, copy them from local storage to your Cloud Storage bucket using the gsutil command. You must also copy the annotation files. These files help validate the model's performance.

(vm)$ gsutil -m cp ./data/dir/coco/*.tfrecord ${DATA_DIR}
(vm)$ gsutil cp ./data/dir/coco/raw-data/annotations/*.json ${DATA_DIR}
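Before copying, a quick way to confirm that preprocessing produced the shard files you expect is to count them. The snippet below is a self-contained sketch of that check; it uses a temporary directory with placeholder files so it can run anywhere, whereas a real check would list ./data/dir/coco instead:

```shell
# Sketch of a local shard-count check. Placeholder files stand in for the
# real TFRecords; in practice, point OUT_DIR at ./data/dir/coco.
OUT_DIR=$(mktemp -d)
touch "${OUT_DIR}/train-00000-of-00003.tfrecord" \
      "${OUT_DIR}/train-00001-of-00003.tfrecord" \
      "${OUT_DIR}/train-00002-of-00003.tfrecord"
COUNT=$(ls "${OUT_DIR}"/*.tfrecord | wc -l)
echo "Found ${COUNT} TFRecord shards"
rm -r "${OUT_DIR}"
```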
Clean up the VM resources
Once the COCO dataset has been converted to TFRecords and copied to the DATA_DIR on your Cloud Storage bucket, you can delete the Compute Engine instance.
Disconnect from the Compute Engine instance:
(vm)$ exit
Your prompt should now be username@projectname, showing you are in the Cloud Shell.

Delete your Compute Engine instance:
$ gcloud compute instances delete instance-name --zone=europe-west4-a