Open Images Dataset Downloader

This program downloads, verifies, and resizes the images and metadata of the Open Images dataset (https://github.com/openimages/dataset). It is designed to run as fast as the available hardware and bandwidth allow by using asynchronous I/O and parallelism. Each image's size and MD5 sum are validated against the values in the dataset metadata, and the download results are stored in CSV files with the same format as the original images.csv, so that subsequent use (training, etc.) can proceed knowing that all the images are available. Many of the original images (over 2%) are no longer available or have changed; these and any other failed downloads are recorded in a separate results file.

If you use a naive script or a tool such as curl or wget to download the images, you will end up with a lot of "unavailable" placeholder PNG images, XML files, and some images that don't match the originals, which is why it's important to validate the size and MD5 sum when downloading. In addition, some of Flickr's servers may randomly be down, so it is important to handle the resulting failures properly.
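
To make the check concrete, here is a minimal, self-contained Scala sketch of the validation idea (illustrative only, not this program's actual code). It assumes the expected size and MD5 come from the dataset's images.csv, where the MD5 is recorded base64-encoded; adjust the encoding if your metadata differs:

import java.nio.file.{Files, Paths}
import java.security.MessageDigest
import java.util.Base64

object ValidateImage {
  // MD5 digest of the file contents, base64-encoded to match images.csv
  def md5Base64(bytes: Array[Byte]): String =
    Base64.getEncoder.encodeToString(
      MessageDigest.getInstance("MD5").digest(bytes))

  // A download is good only if both the byte count and the digest match
  def isValid(path: String, expectedSize: Long, expectedMd5: String): Boolean = {
    val bytes = Files.readAllBytes(Paths.get(path))
    bytes.length.toLong == expectedSize && md5Base64(bytes) == expectedMd5
  }

  // expects: <path> <expected-size> <expected-md5-base64>
  def main(args: Array[String]): Unit = {
    val Array(path, size, md5) = args
    println(if (isValid(path, size.toLong, md5)) "OK" else "MISMATCH")
  }
}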

Changelog

  • 2017-11-20 Version 3.0 released. Add support for downloading Open Images Dataset V3 and V1 (in addition to V2).
  • 2017-09-08 Version 1.0 released.

Installation

The application is written in Scala and requires a Java 8 JRE to run. If you don't have a Java JRE installed, see https://www.java.com/en/download/help/download_options.xml for instructions on how to install one.
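
To check whether a JRE is already installed:

$ java -version

If that prints a 1.8.x version (Java 8), you're set.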

Open Images Downloader is distributed with a shell script and batch file generated by sbt-native-packager (http://www.scala-sbt.org/sbt-native-packager/index.html), so all you need to do is download and extract the distribution, then execute open_images_downloader (or the .bat file on Windows).
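
For example (the archive and directory names here assume a hypothetical 3.0 release; sbt-native-packager places the launcher scripts in bin/):

$ unzip open_images_downloader-3.0.zip
$ cd open_images_downloader-3.0
$ bin/open_images_downloader --help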

The resizing functionality depends on the ImageMagick program convert, so if you want to do resizing, convert must be on the PATH. ImageMagick provides excellent resizing quality and very fast performance. It's easy to install on a Linux distribution using a package manager (e.g. apt or yum), and not too hard on most other OSes. See https://www.imagemagick.org/script/binary-releases.php.
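
For example, on Debian or Ubuntu (package names vary by distribution):

$ sudo apt-get install imagemagick
$ convert -version    # verify that convert is on the PATH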

Usage

This is a command line application. If you're running it on a server, I'd recommend using screen or tmux so that it keeps running if the SSH connection is interrupted.
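
For example, with tmux:

$ tmux new -s openimages     # start a named session and run the downloader in it
$ tmux attach -t openimages  # reattach later if the connection drops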

The code is written in a portable manner, but I haven't tested it on any OS besides Ubuntu Linux, so if you use a different OS and run into issues, let me know by opening an issue on GitHub and I'll do my best to help you out.

This program is flexible and supports a number of use cases, depending on how much storage you want to use and how you plan to use the data. Storing the original images is optional, as is resizing and storing the resized images, and the metadata download and extraction is optional as well. If the original images are found locally because you previously downloaded them, they are used as the source for resizing, and a resize is skipped when a file with size > 0 already exists; this lets you interrupt the program and restart it where it left off. It also means that if you have the original images stored locally, you can resize them all with different parameters without re-downloading any images, as in the sketch below.
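
For example, if a previous run used --save-original-images, a second pass along these lines should produce an additional resized set from the local originals without re-downloading anything (a sketch assembled from the documented flags; the subdirectory name is arbitrary):

$ open_images_downloader --nodownload-metadata --save-original-images \
    --resize-box-size 299 --resized-images-subdirectory images-resized-299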

Example usages

If you want to minimize the amount of space used by storing only small 224x224 images compressed at JPEG quality 50, and use less bandwidth by downloading the 300K URLs, use the following command line options:

$ open_images_downloader --nodownload-metadata --download-300k \
    --resize-mode FillCrop --resize-compression-quality 50  

If you want to save the images with a maximum side of 640 at the original aspect ratio and original JPEG quality, and use less bandwidth by downloading the 300K URLs, use the following command line options. Note that the 300K images don't look as nice as the originals resized by ImageMagick. The 300K URLs return images that are 640 pixels on the largest side, so the resize step only changes images that are larger than 640. Not all images have a 300K URL; in that case the original URL is used and those images are resized.

$ open_images_downloader --nodownload-metadata --download-300k \
    --resize-box-size 640

If you want to download and save all the original images and metadata, and also resize them to a maximum side of 1024, saving the resized copies in a subdirectory named images-resized-1024:

$ open_images_downloader --save-original-images --resize-box-size 1024 \
    --resized-images-subdirectory images-resized-1024

Command Line Options

There are also options for controlling how many concurrent HTTP connections are made, which you may want to use to reduce the impact on your local network. (Don't worry about Flickr: its servers can easily handle a few hundred connections from a single system downloading as fast as possible, and you won't be blocked for "abuse".)
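
For example, to be gentler on a shared connection you might lower the limits (values are illustrative; note that --max-total-connections must be a power of 2):

$ open_images_downloader --max-total-connections 32 --max-host-connections 2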

Here is the complete command line help:

open_images_downloader 3.0 by Dan Nuffer
Usage: open_images_downloader[.bat] [OPTION]...

Options:

      --check-md5-if-exists                   If an image already exists locally
                                              in <image dir> and is the same
                                              size as the original, check the
                                              md5 sum of the file to determine
                                              whether to download it. Default is
                                              on
      --nocheck-md5-if-exists
      --dataset-version  <arg>                The version of the dataset to
                                              download. 1, 2, or 3. 3 was
                                              released 2017-11-16, 2 was
                                              released 2017-07-20, and 1 was
                                              released 2016-09-28. Default is 3.
      --download-300k                         Download the image from the url in
                                              the Thumbnail300KURL field. This
                                              disables verifying the size and
                                              md5 hash and results in lower
                                              quality images, but may be much
                                              faster and use less bandwidth and
                                              storage space. These are resized
                                              to a max dim of 640, so if you use
                                              --resize-mode=ShrinkToFit and
                                              --resize-box-size=640 you can get
                                              a full consistently sized set of
                                              images. For the few images that
                                              don't have a 300K url the original
                                              is downloaded and needs to be
                                              resized. Default is off
      --nodownload-300k
      --download-images                       Download and extract
                                              images_2017_07.tar.gz and all
                                              images. Default is on
      --nodownload-images
      --download-metadata                     Download and extract the metadata
                                              files (annotations and classes).
                                              Default is on
      --nodownload-metadata
      --http-pipelining-limit  <arg>          The maximum number of parallel
                                              pipelined http requests per
                                              connection. Default is 4
      --log-file  <arg>                       Write a log to <file>. Default is
                                              to not write a log
      --log-to-stdout                         Write the log to stdout. Default
                                              is on
      --nolog-to-stdout
      --max-host-connections  <arg>           The maximum number of parallel
                                              connections to a single host.
                                              Default is 5
      --max-retries  <arg>                    Number of times to retry failed
                                              downloads. Default is 15.
      --max-total-connections  <arg>          The maximum number of parallel
                                              connections to all hosts. Must be
                                              a power of 2 and > 0. Default is
                                              128
      --original-images-subdirectory  <arg>   name of the subdirectory where the
                                              original images are stored.
                                              Default is images-original
      --resize-box-size  <arg>                The number of pixels used by
                                              resizing for the side of the
                                              bounding box. Default is 224
      --resize-compression-quality  <arg>     The compression quality. If
                                              specified, it will be passed with
                                              the -quality option to imagemagick
                                              convert. See
                                              https://www.imagemagick.org/script/command-line-options.php#quality
                                              for the meaning of different
                                              values and defaults for various
                                              output formats. If unspecified,
                                              -quality will not be passed and
                                              imagemagick will use its default
      --resize-images                         Resize images. Default is on
      --noresize-images
      --resize-mode  <arg>                    One of ShrinkToFit, FillCrop, or
                                              FillDistort. ShrinkToFit will
                                              resize images larger than the
                                              specified size of bounding box,
                                              preserving aspect ratio. Smaller
                                              images are unchanged. FillCrop
                                              will fill the bounding box, by
                                              first either shrinking or growing
                                              the image and then doing a
                                              center-crop on the larger
                                              dimension. FillDistort will fill
                                              the bounding box, by either
                                              shrinking or growing the image,
                                              modifying the aspect ratio as
                                              necessary to fit. Default is
                                              ShrinkToFit
      --resize-output-format  <arg>           The format (and extension) to use
                                              for the resized images. Valid
                                              values are those supported by
                                              ImageMagick. See
                                              https://www.imagemagick.org/script/formats.php
                                              and/or run identify -list format.
                                              Default is jpg
      --resized-images-subdirectory  <arg>    name of the subdirectory where the
                                              resized images are stored. Default
                                              is images-resized
      --root-dir  <arg>                       top-level directory for storing
                                              the Open Images dataset. Default
                                              is . (current working directory)
      --save-original-images                  Save full-size original images.
                                              This will use over 18 TB of space.
                                              Default is off
      --nosave-original-images
      --save-tar-balls                        Save the downloaded .tar.gz and
                                              .tar files. This uses more space
                                              but can save time when resuming
                                              from an interrupted execution.
                                              Default is off
      --nosave-tar-balls
      --help                                  Show help message
      --version                               Show version of this program

Building from source

Install sbt from http://www.scala-sbt.org/

Run sbt compile to build.

Run sbt test to run unit tests.

Run sbt universal:packageBin to create the distribution .zip.

Run sbt universal:packageZipTarball to create the distribution .tgz. The output is stored in target/universal.
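
A typical build session, in other words:

$ sbt compile                      # build
$ sbt test                         # run the unit tests
$ sbt universal:packageBin         # create the distribution .zip
$ sbt universal:packageZipTarball  # create the distribution .tgz in target/universal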

Future enhancements

  • UI
    • Show # of images/sec
    • Show # of bytes/sec read from internet
    • Show # of bytes/sec written to disk
    • Show last 100(?) successful downloads
    • Show last 100(?) failed downloads
    • Show downloads in progress w/percentage complete
    • Show # of resizes in progress
    • Show overall number of downloads w/% of total completed
    • ncurses based progress UI
    • graphical progress UI
      • display images as they are downloaded
  • optionally convert from non-standard CMYK format
  • optionally validate the image is a proper jpeg
    • I made a script to do this before: https://github.com/dnuffer/detect_corrupt_jpeg. It uses magic mime-type detection and PIL open(). I also tested using jpeginfo and ImageMagick identify.
    • There are results from running my detect_corrupt_jpeg script at:
      • partial with jpeginfo and identify: /storage/data/pub/ml_datasets/openimages/images_2016_08/train/detect_corrupt_jpeg.out
      • on oi v1 with magic and PIL: /storage/data/pub/ml_datasets/openimages/images_2016_08/train/detect_corrupt_jpeg2.out
    • Don't want to be too strict. If it's got some weird format, but can still be decoded into an image, that's fine.
    • Don't want to be too lenient. If it's missing part of the image, that should fail.
    • ImageMagick identify -verbose https://www.imagemagick.org/discourse-server/viewtopic.php?t=20045
    • The end goal is to check that the image can be decoded by TensorFlow into a valid image. TensorFlow uses libjpeg-turbo under the covers, so I could use the djpeg program from the libjpeg-turbo-progs package. I'm not sure how good it is at detecting corruption, however; I'll need to experiment with it.
  • option to enable/disable 3-letter image subdirs
  • option to do multiple image processing passes: resize to multiple sizes, multiple resize modes, convert to other formats.
  • output to TFRecord
  • optionally save metadata to a db (jpa?, shiny?)
  • optionally save annotations to a db (jpa?, shiny?)
  • distribute across machines using Akka
  • resume downloading partially downloaded files? Seems a bit pointless since none of the files are that big, but it would be interesting to implement anyway! The file images_2017_07.tar.gz is the largest.
  • check the license (scrape the website?)