
Retail Object Detection

Description: EfficientDet-based object detection network that detects retail objects on a checkout counter.
Publisher: -
Latest Version: deployable_100_onnx_v1.0
Modified: March 13, 2024
Size: 129.82 MB

Retail Object Detection

Model Overview

The models described in this card detect retail items within an image and return a bounding box around each detected item. The retail items are generally packaged commercial goods with barcodes and ingredient labels on them.

Two detection models are provided here. The 100-class Retail Item Detection model detects 100 specific retail items. The binary-class Retail Item Detection model detects general retail items and returns a single category.

Model Architecture

These models are based on EfficientDet-D5. EfficientDet is a one-stage detector with the following architecture components:

  • NvImageNet pretrained EfficientNet-B5 backbone
  • Weighted bi-directional feature pyramid network (BiFPN)
  • Bounding-box regression and classification heads
  • A compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time
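
For reference, the compound scaling heuristics from the original EfficientDet paper (Tan et al., 2020) tie the BiFPN width and depth, the head depth, and the input resolution to a single compound coefficient φ (φ = 5 for the D5 variant used here). A sketch of those relations is shown below; note that the published D5 configuration deviates slightly from these idealized formulas, and the models in this card use a 416x416 input rather than the resolution the formula would suggest.

$$W_{\text{BiFPN}} = 64 \cdot 1.35^{\phi}, \qquad D_{\text{BiFPN}} = 3 + \phi$$
$$D_{\text{box}} = D_{\text{class}} = 3 + \lfloor \phi / 3 \rfloor$$
$$R_{\text{input}} = 512 + 128 \cdot \phi$$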

Training

These models are trained using the EfficientDet-TF2 entrypoint in TAO. Training is carried out in two phases. In the first phase, the networks are trained on 1.5 million synthetic images. In the second phase, the networks are fine-tuned on 600 real samples.

Training Data

These models are trained on 1.5 million proprietary synthetic images. The synthetic data randomizes several simulation domains including:

  • Light types and light intensities
  • Object sizes, orientations, and locations
  • Camera locations
  • Background textures and flying distractors

The background textures are real images sampled from:

  • Proprietary real images
  • Images taken from a retail checkout counter
  • HDRI texture maps created with NVIDIA Omniverse

Each synthetic image contains one target retail item. The dataset is set up to simulate the diverse environments found in the real world and to have the detector learn to extract retail items from noisy backgrounds. The logos on the retail items were smudged.

| dataset | total #images | train #images | val #images |
|---|---|---|---|
| Synthetic data | 1,500,000 | 1,425,000 | 75,000 |
| Real data - checkout counter 45-degree overhead | 107 | 85 | 23 |
| Real data - shelf | 107 | 85 | 22 |
| Real data - conveyor belt | 106 | 84 | 22 |
| Real data - basket | 106 | 84 | 22 |
| Real data - checkout counter barcode scanner view | 125 | 100 | 25 |
| Real data - checkout counter overhead | 98 | 80 | 18 |

Fine-tuning Data

These models are fine-tuned on about 600 proprietary real images from 6 different real environments. In each environment, only one image per item is collected.

The fine-tuning data are captured at random camera heights and fields of view. All fine-tuning data were collected indoors, with retail items placed on a checkout counter, shelf, basket, or conveyor belt. The camera is typically set up at a height of approximately 10 feet, at a 45-degree angle off the vertical axis, with a close field of view. This content was chosen to decrease the simulation-to-reality gap of the model trained on synthetic data and to improve the accuracy and robustness of the model. The logos on the retail items were smudged.

Fine-tuning Data Ground-truth Labeling Guidelines

The fine-tuning data are created by human labelers annotating ground-truth bounding boxes and categories. The following guidelines were used while labeling the training data for the NVIDIA Retail Detection models. If you are looking to transfer-learn or fine-tune the models to adapt them to your target environment and classes, please follow the guidelines below for better model accuracy.

  1. All objects that fall under the definition of retail items and are larger than the smallest bounding-box limit for the corresponding class (height >= 10px OR width >= 10px) are labeled with the appropriate class label.
  2. Occlusion: Partially occluded objects that are approximately 60% or more visible are labeled with a bounding box around the visible part of the object and are marked as partially occluded. Objects under 60% visibility are not annotated.
  3. Truncation: An object at the edge of the frame that is 60% or more visible is marked with the truncation flag.
  4. Each frame is not required to have an object.
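
As a concrete illustration of rules 1 and 2, here is a minimal Python sketch that filters candidate annotations by the 10 px size limit and the 60% visibility threshold. The annotation schema and field names are hypothetical; adapt them to your own labeling format.

# Sketch of the size and visibility rules above (hypothetical annotation schema).
# Each annotation is assumed to look like:
#   {"bbox": [x, y, width, height], "visibility": 0.8, "category": "item_042"}
def keep_annotation(ann, min_side_px=10, min_visibility=0.6):
    _, _, w, h = ann["bbox"]
    # Rule 1: label objects with height >= 10 px OR width >= 10 px.
    large_enough = (h >= min_side_px) or (w >= min_side_px)
    # Rule 2: objects under ~60% visibility are not annotated.
    visible_enough = ann.get("visibility", 1.0) >= min_visibility
    return large_enough and visible_enough

def filter_annotations(annotations):
    return [ann for ann in annotations if keep_annotation(ann)]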

Performance

Evaluation Data

The evaluation of the Retail Detection models was carried out on more than 5,000 proprietary real images across a variety of environments. The frames were cropped from 1080p videos and resized to 416x416 pixels before being passed to the Retail Detection model.

Methodology and KPI

AP50 is calculated using an intersection-over-union (IoU) threshold of 0.5. The KPIs for the evaluation data are reported in the tables below. The models are evaluated on AP50 and AR0.5:0.95; both AR and AP numbers are based on a maximum of 100 detections per image. Please note that the "unseen items" measurements do not apply to the 100-class detection model.
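
These are the standard COCO box metrics; a minimal sketch of how they can be computed with pycocotools, assuming COCO-format ground-truth and detection JSON files (the file paths below are placeholders):

# Sketch: compute AP50 and AR (IoU 0.5:0.95, maxDets=100) with pycocotools.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("ground_truth_coco.json")           # placeholder path
coco_dt = coco_gt.loadRes("detections_coco.json")  # placeholder path

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

ap50 = evaluator.stats[1]    # AP at IoU=0.50
ar100 = evaluator.stats[8]   # AR averaged over IoU=0.50:0.95, maxDets=100
print(f"AP50={ap50:.3f}  AR(0.5:0.95, maxDets=100)={ar100:.3f}")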

Binary-class Retail Item Detection Model

| scene | seen items AP50 | seen items AR (maxDets=100) | unseen items AP50 | unseen items AR (maxDets=100) |
|---|---|---|---|---|
| checkout counter 45-degree overhead | 0.960 | 0.791 | 0.959 | 0.753 |
| shelf | 0.983 | 0.888 | 0.978 | 0.841 |
| conveyor belt | 1.000 | 0.921 | 0.995 | 0.887 |
| basket | 0.956 | 0.851 | 0.959 | 0.861 |
| checkout counter barcode scanner view | 0.858 | 0.789 | 0.744 | 0.655 |
| checkout counter overhead | 0.990 | 0.915 | 0.993 | 0.910 |
| overall (mean of all scenes) | 0.959 | 0.859 | 0.938 | 0.818 |

100-class Retail Item Detection Model

| scene | seen items AP50 | seen items AR (maxDets=100) |
|---|---|---|
| checkout counter 45-degree overhead | 0.564 | 0.741 |
| shelf | 0.933 | 0.860 |
| conveyor belt | 0.872 | 0.888 |
| basket | 0.536 | 0.722 |
| checkout counter barcode scanner view | 0.845 | 0.758 |
| checkout counter overhead | 0.926 | 0.859 |
| overall (mean of all scenes) | 0.779 | 0.805 |

Real-time Inference Performance

Inference is run on the provided unpruned models at FP16 precision with a 416x416 model input resolution. Inference performance was measured with trtexec on Jetson AGX Orin 64GB and A10. The numbers shown here are for inference only; end-to-end performance with streaming video data may vary slightly depending on other bottlenecks in the hardware and software.

| model | device | batch size | latency (ms) | images per second |
|---|---|---|---|---|
| Retail Item Detection (binary) | Jetson AGX Orin 64GB | 1 | 10.43 | 96 |
| Retail Item Detection (binary) | Jetson AGX Orin 64GB | 16 | 131.79 | 121 |
| Retail Item Detection (binary) | Jetson AGX Orin 64GB | 32 | 258.44 | 124 |
| Retail Item Detection (binary) | Tesla A10 | 1 | 4.27 | 234 |
| Retail Item Detection (binary) | Tesla A10 | 16 | 44.94 | 356 |
| Retail Item Detection (binary) | Tesla A10 | 64 | 174.46 | 367 |
| Retail Item Detection (100 class) | Jetson AGX Orin 64GB | 1 | 10.94 | 91 |
| Retail Item Detection (100 class) | Jetson AGX Orin 64GB | 16 | 140.94 | 114 |
| Retail Item Detection (100 class) | Jetson AGX Orin 64GB | 32 | 279.59 | 114 |
| Retail Item Detection (100 class) | Tesla A10 | 1 | 4.46 | 224 |
| Retail Item Detection (100 class) | Tesla A10 | 16 | 47.81 | 335 |
| Retail Item Detection (100 class) | Tesla A10 | 64 | 187.54 | 338 |
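
As a rough guide to reproducing these measurements, a trtexec invocation along the following lines can be used. The file name and input tensor name below are placeholders, and the --shapes flag applies only to ONNX models with a dynamic batch dimension; check the downloaded model for its actual input name and format.

# Sketch: build and time an FP16 TensorRT engine with trtexec (placeholder names).
trtexec --onnx=retail_object_detection_100.onnx \
        --fp16 \
        --shapes=input:16x3x416x416 \
        --saveEngine=retail_object_detection_100_fp16.engine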

How to use this model

Instructions to use unpruned model with TAO

In order to use these models as pretrained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file when training an EfficientDet-TF2 model. For more information on the experiment spec file, please refer to the RetailDetector notebook and the EfficientDet-TF2 section of the TAO documentation.

# spec file
model:
    name: 'efficientdet-d5'
data:
    loader:
      prefetch_size: 4
      shuffle_file: True
    num_classes: 101 # switch to 2 for RetailDetector_binary model
    image_size: '416x416'
    max_instances_per_image: 10
    train_tfrecords:
       - [train tfrecords]
    val_tfrecords:
       - [validation tfrecords]
    val_json_file: [validation annotation json file path]
train: 
    num_examples_per_epoch: 10000 # change to train set size
    ...

evaluate:
    num_samples: 500 # change to test set size
    label_map: # label map file here
    ...
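
With a complete spec file along these lines, training is typically launched through the TAO launcher. The exact entrypoint and flags differ between TAO Toolkit releases, so treat the following as a sketch and consult the TAO documentation for your version:

# Sketch: launch EfficientDet-TF2 training with the TAO launcher.
# TAO Toolkit 4.x uses `tao efficientdet_tf2 ...`;
# TAO Toolkit 5.x uses `tao model efficientdet_tf2 ...`.
tao efficientdet_tf2 train -e /path/to/spec.yaml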

Instructions to deploy these models with DeepStream

Input

RGB image of dimensions 416 x 416 x 3 (W x H x C).
Channel ordering of the input: NCHW, where N = batch size, C = number of channels (3), H = height of the images (416), W = width of the images (416).
Input scale: 1.0
Mean subtraction: 0
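
A minimal preprocessing sketch matching the specification above (NumPy and Pillow; the exact resizing/letterboxing behavior inside TAO or DeepStream may differ, so treat this as illustrative):

# Sketch: prepare a 1x3x416x416 FP32 input tensor (input scale 1.0, mean 0).
import numpy as np
from PIL import Image

def preprocess(image_path, size=416):
    img = Image.open(image_path).convert("RGB").resize((size, size))
    chw = np.asarray(img, dtype=np.float32).transpose(2, 0, 1)  # HWC -> CHW
    # Input scale is 1.0 and mean subtraction is 0, so no further normalization.
    return chw[np.newaxis, ...]  # add batch dimension -> NCHW

batch = preprocess("sample.jpg")  # placeholder image path
print(batch.shape)  # (1, 3, 416, 416)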

Output

Category labels and bounding-box coordinates for each detected retail item in the input image.

The Retail Item Detector can be used together with the Retail Item Embedder [TODO: add Retail Item Embedder url here] to build an end-to-end video analytics application. To do so, deploy these models with the DeepStream SDK, a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of these models into the DeepStream sample apps.

To deploy these models with DeepStream 6.2, please follow the instructions below:

Download and install the DeepStream SDK. The installation instructions for DeepStream are provided in the DeepStream development guide. The config files for the purpose-built models are located under the DeepStream installation directory; /opt/nvidia/deepstream is the default installation directory, and this path will be different if you installed DeepStream elsewhere.

The sample config files are provided in the NVIDIA-AI-IOT deepstream_tao_apps repository (TODO: update the URL when deepstream_tao_apps are merged with ???). Assuming the repository is cloned under $DS_TAO_APPS_HOME, the following config files are located in $DS_TAO_APPS_HOME/configs/retailDetector_tao:

# 100-class detector (the primary GIE) inference setting 
pgie_retailDetector_100_config.yml
pgie_retailDetector_100_config.txt
# Binary-class detector (the primary GIE) inference setting 
pgie_retailDetector_binary_config.yml
pgie_retailDetector_binary_config.txt

Key Parameters in pgie_retailDetector_100_config.yml

property:
  gpu-id: 0
  net-scale-factor: 1
  offsets: 0;0;0
  model-color-format: 0
  tlt-model-key: nvidia_tlt
  tlt-encoded-model: ../../models/retailDetector/retailDetector_100.etlt
  model-engine-file: ../../models/retailDetector/retailDetector_100.etlt_b1_gpu0_fp16.engine
  labelfile-path:    ../../models/retailDetector/retailDetector_100_labels.txt
  network-input-order: 1
  infer-dims: 3;416;416
  maintain-aspect-ratio: 1
  batch-size: 1
  ## 0=FP32, 1=INT8, 2=FP16 mode
  network-mode: 2
  num-detected-classes: 100
  interval: 0
  cluster-mode: 3
  output-blob-names: num_detections;detection_boxes;detection_scores;detection_classes
  parse-bbox-func-name: NvDsInferParseCustomEfficientDetTAO
  custom-lib-path: ../../post_processor/libnvds_infercustomparser_tao.so
# Use the config params below for NMS clustering mode
class-attrs-all:
  pre-cluster-threshold: 0.5

In order to decode the bounding box information from the EfficientDet output tensors, the custom parser function and library have to be specified. To run inference with the model, run:

cd $DS_TAO_APPS_HOME/configs/retailDetector_tao
$DS_TAO_APPS_HOME/apps/tao_detection/ds-tao-detection -c pgie_retailDetector_100_config.txt -i file://$DS_TAO_APPS_HOME/samples/streams/retailDetector_h264.mp4

The "Deploying to DeepStream" chapter of TAO User Guide provides more details.
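
For reference, the custom parser iterates over the four output tensors named in output-blob-names (num_detections, detection_boxes, detection_scores, detection_classes). The Python sketch below illustrates that decoding logic; the tensor shapes and the box coordinate order are assumptions, so verify them against the actual parser source in the deepstream_tao_apps repository.

# Sketch: decode EfficientDet-style output tensors (assumed shapes and box order).
def decode_detections(num_detections, boxes, scores, classes, score_threshold=0.5):
    # Assumed per-image shapes: num_detections (1,), boxes (max_det, 4),
    # scores (max_det,), classes (max_det,).
    results = []
    for i in range(int(num_detections[0])):
        if scores[i] < score_threshold:
            continue
        # Assumption: boxes are [ymin, xmin, ymax, xmax] in input-pixel coordinates.
        ymin, xmin, ymax, xmax = boxes[i]
        results.append({
            "class_id": int(classes[i]),
            "score": float(scores[i]),
            "bbox_xywh": [float(xmin), float(ymin), float(xmax - xmin), float(ymax - ymin)],
        })
    return results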

Input Image

[Sample input image. The logos on retail items were smudged.]

Output Image

[Sample output image. The logos on retail items were smudged.]

Limitations

Very Small and Crowded Objects

The NVIDIA Retail Item Detection models were trained to detect objects larger than 10x10 pixels, so they may not be able to detect objects smaller than 10x10 pixels. Beyond that minimum, we suggest having the target object take up more than 10% of the frame so the model is less likely to be distracted by noise in the background.

The Retail Item Detection models were trained on images with one target item per frame, mimicking real-world retail checkout scenes, so we suggest not challenging the models with multiple target objects in one frame.

Occluded Objects

When objects are occluded or truncated such that less than 40% of the object is visible, they may not be detected by the Retail Item Detection model. Partial occlusion by hand is acceptable as the model was trained with hand occlusion examples.

Monochrome or Infrared Camera Images

The Retail Item Detection models were trained on RGB images. Therefore, images captured by monochrome or IR cameras may not provide good detection results.

Warped and Blurry Images

The Retail Item Detection models were not trained on images from fisheye-lens or moving cameras. Therefore, the models may not perform well on warped images or on images with blur, such as motion blur.

Noisy Backgrounds

Although the Retail Item Detection models were trained in diverse environments, without fine-tuning on the target environment the models perform best on images with a clean background, such as a checkout plate or a conveyor belt. We therefore recommend reducing complex background texture as much as possible.

Model versions

  • deployable_100_onnx_v1.0
  • trainable_100_v1.0
  • deployable_100_v1.0
  • trainable_binary_v1.0
  • deployable_binary_v1.0

References

Citations

  • Tobin, Josh, et al. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2017.

  • Morrical, Nathan, et al. "NViSII: A scriptable tool for photorealistic image generation." arXiv preprint arXiv:2105.13962 (2021).

License

License to use these models is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of these licenses.

Ethical Considerations

The NVIDIA Retail Item Detection models detect retail items; no additional information, such as information about people or other distractors in the background, is inferred. The training and evaluation datasets mostly consist of North American content. An ideal training and evaluation dataset would additionally include content from other geographies. NVIDIA's platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model's developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.