
Referee Gesture Detection

This site documents the ready network, which detects the "standby to ready" gesture (ready or none_ready), and the kick network, which classifies the kick-in gesture (left, right, or none). Both networks share a similar data pipeline.

Dataset Structure

The dataset used in this project is organized into several class directories within a common dataset folder:

dataset/
├── left
├── right
├── none_kick
├── ready
└── none_ready

The folders left, right, and none_kick are used for training the kick network, while ready and none_ready are used for the ready network. Each image filename encodes three values (a parsing sketch follows the list):

  1. Frame time - the timestamp of the image in the log.
  2. Distance - the distance between the NAO and the referee (in millimeters).
  3. Log name - a string or numerical identifier.
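
As an illustration, these values could be recovered with a helper like the sketch below. The exact naming scheme (delimiters and field order) is not documented here, so the <frame_time>_<distance>_<log_name>.png format is a hypothetical example.

    from dataclasses import dataclass
    from pathlib import Path

    @dataclass
    class SampleMeta:
        frame_time: int     # timestamp of the image in the log
        distance_mm: float  # NAO-to-referee distance in millimeters
        log_name: str       # string or numerical log identifier

    def parse_filename(path: str) -> SampleMeta:
        # Hypothetical scheme "<frame_time>_<distance>_<log_name>.png";
        # the real delimiter and field order may differ.
        frame_time, distance, log_name = Path(path).stem.split("_", 2)
        return SampleMeta(int(frame_time), float(distance), log_name)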

All images are captured by the top camera of a NAO at a resolution of 480×640 pixels (height × width, RGB format). The dataset includes original and segmented images from tests and RoboCup games. Since data collection is complex and resources are limited, the dataset was augmented by mirroring samples and adjusting the corresponding labels. Unfortunately, we are unable to share this dataset publicly, as it contains images of individuals who did not consent to a public release.

Data Pipeline

From the original dataset, smaller image patches are extracted to reduce model complexity.

  • Kick network: dynamic patch extraction, resized to 256×256 pixels; distances normalized to [0, 1]
  • Ready network: static patch extraction of 256×200 pixels; distances normalized to [0, 1]

Each dataset entry consists of the cropped image, the normalized distance, and the corresponding label:

((Image, Norm Distance), Label)
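
A minimal sketch of how one such entry could be assembled is shown below. The dynamic patch extraction of the kick network is not documented here and is therefore omitted (only the resize is shown); both the crop offsets and the MAX_DISTANCE_MM normalization constant are assumptions.

    import tensorflow as tf

    MAX_DISTANCE_MM = 5000.0  # assumed normalization constant, not given here

    def make_entry(image, distance_mm, label, ready_net=False):
        # Ready network: static 256×200 crop (offsets are placeholders).
        # Kick network: dynamically extracted patch resized to 256×256;
        # the dynamic extraction itself is omitted, only the resize is shown.
        if ready_net:
            patch = tf.image.crop_to_bounding_box(image, 112, 220, 256, 200)
        else:
            patch = tf.image.resize(image, (256, 256))
        norm_distance = distance_mm / MAX_DISTANCE_MM  # scaled to [0, 1]
        return (patch, norm_distance), label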

Labels are stored in one-hot format, e.g., (1, 0) for the ready network or (1, 0, 0) for the kick network. Per class, the data is split into 80% for training and 20% for validation.
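
The per-class split might look like this sketch; the shuffling and the seed are assumptions:

    import random

    def split_per_class(entries_by_class, train_fraction=0.8, seed=42):
        # entries_by_class maps a class name to its list of dataset entries.
        rng = random.Random(seed)
        train, val = [], []
        for entries in entries_by_class.values():
            entries = list(entries)
            rng.shuffle(entries)
            cut = int(len(entries) * train_fraction)
            train += entries[:cut]
            val += entries[cut:]
        return train, val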

Before training, data augmentation is applied. Pixel values are scaled from [0, 255] to [0, 1], images are converted to the YUV color space, and the dataset is shuffled and batched.
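
With tf.data, these preprocessing steps could be expressed as follows; the buffer and batch sizes are assumptions, and dataset is assumed to yield ((image, norm_distance), one_hot_label) entries with 8-bit RGB images:

    import tensorflow as tf

    def preprocess(inputs, label):
        image, distance = inputs
        image = tf.cast(image, tf.float32) / 255.0  # scale [0, 255] to [0, 1]
        image = tf.image.rgb_to_yuv(image)          # convert RGB to YUV
        return (image, distance), label

    dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.shuffle(buffer_size=1000).batch(32)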

Data Augmentation

To improve robustness, random samples are slightly modified through:

  • Brightness variation
  • Small pixel shifts
  • Noise

An example looks like this:

(Figure: an original image next to its augmented counterpart.)

This helps reduce overfitting and improves the model's ability to generalize to real-world variations. The same augmentation strategy is applied to both networks.
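
A sketch of such an augmentation step is shown below; the concrete magnitudes (brightness delta, shift range, noise level) are assumptions, and the images are expected to be already scaled to [0, 1]:

    import tensorflow as tf

    def augment(inputs, label):
        image, distance = inputs
        image = tf.image.random_brightness(image, max_delta=0.1)  # brightness variation
        shift = tf.random.uniform([2], minval=-4, maxval=5, dtype=tf.int32)
        image = tf.roll(image, shift=shift, axis=[0, 1])          # small pixel shift
        image += tf.random.normal(tf.shape(image), stddev=0.02)   # additive noise
        image = tf.clip_by_value(image, 0.0, 1.0)
        return (image, distance), label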

Network Architecture

Layer                                    Kick Output Shape   Ready Output Shape
---------------------------------------  ------------------  ------------------
Input Image                              256 × 256 × 3       256 × 200 × 3
Conv2D (5×5, stride 2) + BN + ReLU       128 × 128 × 8       128 × 100 × 8
SeparableConv2D (3×3) + BN + ReLU        128 × 128 × 16      128 × 100 × 16
SeparableConv2D (3×3, stride 2) + ReLU   64 × 64 × 16        64 × 50 × 16
SeparableConv2D (3×3) + BN + ReLU        64 × 64 × 32        64 × 50 × 32
SeparableConv2D (3×3, stride 2) + ReLU   32 × 32 × 32        32 × 25 × 32
SeparableConv2D (3×3) + BN + ReLU        32 × 32 × 32        32 × 25 × 32
SeparableConv2D (3×3) + BN + ReLU        32 × 32 × 32        32 × 25 × 32
SeparableConv2D (3×3, stride 2) + ReLU   16 × 16 × 32        16 × 13 × 32
SeparableConv2D (3×3) + BN + ReLU        16 × 16 × 32        16 × 13 × 32
SeparableConv2D (3×3) + BN + ReLU        16 × 16 × 32        16 × 13 × 32
SeparableConv2D (3×3, stride 2) + ReLU   8 × 8 × 32          8 × 7 × 32
SeparableConv2D (3×3) + BN + ReLU        8 × 8 × 32          8 × 7 × 32
SeparableConv2D (3×3) + BN + ReLU        8 × 8 × 16          8 × 7 × 16
MaxPooling2D                             4 × 4 × 16          4 × 3 × 16
Flatten                                  256                 192
Input Distance                           1                   1
Dense + ReLU                             8                   8
Concatenate (Image + Distance)           264                 200
Dense + ReLU                             24                  24
Dropout (0.3)                            24                  24
Dense (Output Layer)                     3                   2
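
The table translates almost directly into Keras. The sketch below is one possible reading of it, assuming "same" padding throughout (which reproduces the listed shapes, e.g., 25 → 13 at stride 2) and omitting biases where batch normalization follows; everything beyond the table is an assumption:

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_gesture_net(image_shape=(256, 256, 3), num_classes=3):
        # Defaults match the kick network; use image_shape=(256, 200, 3)
        # and num_classes=2 for the ready network.
        image_in = layers.Input(shape=image_shape, name="image")
        x = layers.Conv2D(8, 5, strides=2, padding="same", use_bias=False)(image_in)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)

        # (filters, stride, batch norm) per SeparableConv2D row of the table
        blocks = [
            (16, 1, True), (16, 2, False),
            (32, 1, True), (32, 2, False),
            (32, 1, True), (32, 1, True), (32, 2, False),
            (32, 1, True), (32, 1, True), (32, 2, False),
            (32, 1, True), (16, 1, True),
        ]
        for filters, stride, bn in blocks:
            x = layers.SeparableConv2D(filters, 3, strides=stride,
                                       padding="same", use_bias=not bn)(x)
            if bn:
                x = layers.BatchNormalization()(x)
            x = layers.ReLU()(x)

        x = layers.MaxPooling2D(2)(x)
        x = layers.Flatten()(x)

        distance_in = layers.Input(shape=(1,), name="distance")
        d = layers.Dense(8, activation="relu")(distance_in)

        x = layers.Concatenate()([x, d])
        x = layers.Dense(24, activation="relu")(x)
        x = layers.Dropout(0.3)(x)
        out = layers.Dense(num_classes)(x)  # raw logits; see the softmax note below
        return tf.keras.Model([image_in, distance_in], out)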

The Softmax layer is omitted since CompiledNN does not support it natively. Instead, Softmax is applied manually after the model prediction.
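
Applied to the raw output, this amounts to a numerically stable softmax such as:

    import numpy as np

    def softmax(logits):
        # Subtract the maximum for numerical stability before exponentiating.
        z = logits - np.max(logits, axis=-1, keepdims=True)
        e = np.exp(z)
        return e / np.sum(e, axis=-1, keepdims=True)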

