
Referee Gesture Detection

This site documents the ready network, which detects the "standby to ready" gesture (ready or none_ready), and the kick network, which classifies the kick-in gesture (left, right, or none). Both networks share a similar data pipeline.

Dataset Structure

The dataset used in this project is organized into several class directories within a common dataset folder:

dataset/
├── left
├── right
├── none_kick
├── ready
└── none_ready

The folders left, right, and none_kick are used for training the kick network, while ready and none_ready are used for the ready network. Each image filename encodes three values (a parsing sketch follows the list):

  1. Frame time - the timestamp of the image in the log.
  2. Distance - the distance between the NAO and the referee (in millimeters).
  3. Log name - a string or numerical identifier.
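
As an illustration, these values could be recovered with a helper like the sketch below. The exact naming scheme (delimiters and field order) is not documented here, so the <frame_time>_<distance>_<log_name>.png format is a hypothetical example.

    from dataclasses import dataclass
    from pathlib import Path

    @dataclass
    class SampleMeta:
        frame_time: int     # timestamp of the image in the log
        distance_mm: float  # NAO-to-referee distance in millimeters
        log_name: str       # string or numerical log identifier

    def parse_filename(path: str) -> SampleMeta:
        # Hypothetical scheme "<frame_time>_<distance>_<log_name>.png";
        # the real delimiter and field order may differ.
        frame_time, distance, log_name = Path(path).stem.split("_", 2)
        return SampleMeta(int(frame_time), float(distance), log_name)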

All images are captured by the top camera of a NAO at a resolution of 480×640 pixels (height × width, RGB format). The dataset includes original and segmented images from tests and RoboCup games. Since data collection is complex and resources are limited, the dataset was augmented by mirroring samples and adjusting the corresponding labels. Unfortunately, we are unable to share this dataset publicly, as it contains images of individuals who did not consent to a public release.

Data Pipeline

From the original dataset, smaller image patches are extracted to reduce model complexity.

  • Kick network: dynamic patch extraction, resized to 256×256 pixels; distances normalized to [0, 1]
  • Ready network: static patch extraction of 256×200 pixels; distances normalized to [0, 1]

Each dataset entry consists of the cropped image, the normalized distance, and the corresponding label:

((Image, Norm Distance), Label)
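
A minimal sketch of how one such entry could be assembled is shown below. The dynamic patch extraction of the kick network is not documented here and is therefore omitted (only the resize is shown); both the crop offsets and the MAX_DISTANCE_MM normalization constant are assumptions.

    import tensorflow as tf

    MAX_DISTANCE_MM = 5000.0  # assumed normalization constant, not given here

    def make_entry(image, distance_mm, label, ready_net=False):
        # Ready network: static 256×200 crop (offsets are placeholders).
        # Kick network: dynamically extracted patch resized to 256×256;
        # the dynamic extraction itself is omitted, only the resize is shown.
        if ready_net:
            patch = tf.image.crop_to_bounding_box(image, 112, 220, 256, 200)
        else:
            patch = tf.image.resize(image, (256, 256))
        norm_distance = distance_mm / MAX_DISTANCE_MM  # scaled to [0, 1]
        return (patch, norm_distance), label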

Labels are stored in one-hot format, e.g., (1, 0) for the ready network or (1, 0, 0) for the kick network. Per class, the data is split into 80% for training and 20% for validation.
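
The per-class split might look like this sketch; the shuffling and the seed are assumptions:

    import random

    def split_per_class(entries_by_class, train_fraction=0.8, seed=42):
        # entries_by_class maps a class name to its list of dataset entries.
        rng = random.Random(seed)
        train, val = [], []
        for entries in entries_by_class.values():
            entries = list(entries)
            rng.shuffle(entries)
            cut = int(len(entries) * train_fraction)
            train += entries[:cut]
            val += entries[cut:]
        return train, val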

Before training, data augmentation is applied. Pixel values are scaled from [0, 255] to [0, 1], images are converted to the YUV color space, and the dataset is shuffled and batched.
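
With tf.data, these preprocessing steps could be expressed as follows; the buffer and batch sizes are assumptions, and dataset is assumed to yield ((image, norm_distance), one_hot_label) entries with 8-bit RGB images:

    import tensorflow as tf

    def preprocess(inputs, label):
        image, distance = inputs
        image = tf.cast(image, tf.float32) / 255.0  # scale [0, 255] to [0, 1]
        image = tf.image.rgb_to_yuv(image)          # convert RGB to YUV
        return (image, distance), label

    dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.shuffle(buffer_size=1000).batch(32)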

Data Augmentation

To improve robustness, random samples are slightly modified through:

  • Brightness variation
  • Small pixel shifts
  • Noise

An example looks like this:

(Figure: an original image next to its augmented counterpart.)

This helps reduce overfitting and improves the model's ability to generalize to real-world variations. The same augmentation strategy is applied to both networks.
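
A sketch of such an augmentation step is shown below; the concrete magnitudes (brightness delta, shift range, noise level) are assumptions, and the images are expected to be already scaled to [0, 1]:

    import tensorflow as tf

    def augment(inputs, label):
        image, distance = inputs
        image = tf.image.random_brightness(image, max_delta=0.1)  # brightness variation
        shift = tf.random.uniform([2], minval=-4, maxval=5, dtype=tf.int32)
        image = tf.roll(image, shift=shift, axis=[0, 1])          # small pixel shift
        image += tf.random.normal(tf.shape(image), stddev=0.02)   # additive noise
        image = tf.clip_by_value(image, 0.0, 1.0)
        return (image, distance), label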

Network Architecture

Layer                                    Kick Output Shape   Ready Output Shape
---------------------------------------  ------------------  ------------------
Input Image                              256 × 256 × 3       256 × 200 × 3
Conv2D (5×5, stride 2) + BN + ReLU       128 × 128 × 8       128 × 100 × 8
SeparableConv2D (3×3) + BN + ReLU        128 × 128 × 16      128 × 100 × 16
SeparableConv2D (3×3, stride 2) + ReLU   64 × 64 × 16        64 × 50 × 16
SeparableConv2D (3×3) + BN + ReLU        64 × 64 × 32        64 × 50 × 32
SeparableConv2D (3×3, stride 2) + ReLU   32 × 32 × 32        32 × 25 × 32
SeparableConv2D (3×3) + BN + ReLU        32 × 32 × 32        32 × 25 × 32
SeparableConv2D (3×3) + BN + ReLU        32 × 32 × 32        32 × 25 × 32
SeparableConv2D (3×3, stride 2) + ReLU   16 × 16 × 32        16 × 13 × 32
SeparableConv2D (3×3) + BN + ReLU        16 × 16 × 32        16 × 13 × 32
SeparableConv2D (3×3) + BN + ReLU        16 × 16 × 32        16 × 13 × 32
SeparableConv2D (3×3, stride 2) + ReLU   8 × 8 × 32          8 × 7 × 32
SeparableConv2D (3×3) + BN + ReLU        8 × 8 × 32          8 × 7 × 32
SeparableConv2D (3×3) + BN + ReLU        8 × 8 × 16          8 × 7 × 16
MaxPooling2D                             4 × 4 × 16          4 × 3 × 16
Flatten                                  256                 192
Input Distance                           1                   1
Dense + ReLU                             8                   8
Concatenate (Image + Distance)           264                 200
Dense + ReLU                             24                  24
Dropout (0.3)                            24                  24
Dense (Output Layer)                     3                   2
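
The table translates almost directly into Keras. The sketch below is one possible reading of it, assuming "same" padding throughout (which reproduces the listed shapes, e.g., 25 → 13 at stride 2) and omitting biases where batch normalization follows; everything beyond the table is an assumption:

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_gesture_net(image_shape=(256, 256, 3), num_classes=3):
        # Defaults match the kick network; use image_shape=(256, 200, 3)
        # and num_classes=2 for the ready network.
        image_in = layers.Input(shape=image_shape, name="image")
        x = layers.Conv2D(8, 5, strides=2, padding="same", use_bias=False)(image_in)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)

        # (filters, stride, batch norm) per SeparableConv2D row of the table
        blocks = [
            (16, 1, True), (16, 2, False),
            (32, 1, True), (32, 2, False),
            (32, 1, True), (32, 1, True), (32, 2, False),
            (32, 1, True), (32, 1, True), (32, 2, False),
            (32, 1, True), (16, 1, True),
        ]
        for filters, stride, bn in blocks:
            x = layers.SeparableConv2D(filters, 3, strides=stride,
                                       padding="same", use_bias=not bn)(x)
            if bn:
                x = layers.BatchNormalization()(x)
            x = layers.ReLU()(x)

        x = layers.MaxPooling2D(2)(x)
        x = layers.Flatten()(x)

        distance_in = layers.Input(shape=(1,), name="distance")
        d = layers.Dense(8, activation="relu")(distance_in)

        x = layers.Concatenate()([x, d])
        x = layers.Dense(24, activation="relu")(x)
        x = layers.Dropout(0.3)(x)
        out = layers.Dense(num_classes)(x)  # raw logits; see the softmax note below
        return tf.keras.Model([image_in, distance_in], out)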

The Softmax layer is omitted since CompiledNN does not support it natively. Instead, Softmax is applied manually after the model prediction.
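
Applied to the raw output, this amounts to a numerically stable softmax such as:

    import numpy as np

    def softmax(logits):
        # Subtract the maximum for numerical stability before exponentiating.
        z = logits - np.max(logits, axis=-1, keepdims=True)
        e = np.exp(z)
        return e / np.sum(e, axis=-1, keepdims=True)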

