# Referee Gesture Detection
This page documents the ready network, which detects the "standby to ready" gesture (ready or none_ready), and the kick network, which classifies the kick-in gesture (left, right, or none_kick). Both networks share a similar data pipeline.
## Dataset Structure
The dataset used in this project is organized into several class directories within a common dataset
folder:
```
dataset/
├── left
├── right
├── none_kick
├── ready
└── none_ready
```
The folders left, right, and none_kick are used for training the kick
network, while ready and none_ready are used for the ready
network.
Each image filename contains three important values (a parsing sketch follows the list):
- Frame time - the timestamp of the image in the log.
- Distance - the distance between the NAO and the referee (in millimeters).
- Log name - a string or numerical identifier.
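The exact naming scheme is not documented here, so the following is only a minimal sketch that assumes a hypothetical pattern such as `<frame_time>_<distance>_<log_name>.png`; the real dataset may encode these values differently.

```python
from pathlib import Path

def parse_filename(path: str) -> tuple[int, float, str]:
    """Extract frame time, distance (mm), and log name from a filename.

    Assumes the hypothetical scheme '<frame_time>_<distance>_<log_name>.png';
    adjust the split logic to the actual naming convention of the dataset.
    """
    frame_time, distance, log_name = Path(path).stem.split("_", maxsplit=2)
    return int(frame_time), float(distance), log_name

# parse_filename("dataset/ready/1234567_2450_gamelog01.png")
# -> (1234567, 2450.0, "gamelog01")
```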
All images are captured by the top camera of a NAO with a resolution of 480×640 pixels (RGB format). The dataset includes original and segmented images from tests and RoboCup games. Since data collection is complex and resources are limited, the dataset was augmented by mirroring samples and adjusting the corresponding labels. Unfortunately, we are unable to share this dataset publicly, as it contains images of individuals who did not consent to a public release.
## Data Pipeline
From the original dataset, smaller image patches are extracted to reduce model complexity.
- Kick Network: dynamic patch extraction; resized to 256×256 pixels, with distances normalized to [0, 1]
- Ready Network: static extraction of 256×200 pixels, with distances normalized to [0, 1]
Each dataset entry consists of the cropped image, the normalized distance, and the corresponding label:
```
((Image, Norm Distance), Label)
```
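As a sketch of how such an entry could be assembled (TensorFlow assumed; the crop coordinates and the normalization constant are illustrative placeholders, not values from the project):

```python
import tensorflow as tf

MAX_DISTANCE_MM = 6000.0  # assumed normalization constant, not from the project

def make_entry(image, crop_box, distance_mm, label, size=(256, 256)):
    """Build one dataset entry of the form ((image, norm_distance), label).

    crop_box is (y, x, height, width). For the kick network the box is
    chosen dynamically and the patch is resized to 256x256; the ready
    network crops a fixed 256x200 region, so no resize is needed there.
    """
    y, x, h, w = crop_box
    patch = tf.image.crop_to_bounding_box(image, y, x, h, w)
    if (h, w) != size:
        patch = tf.image.resize(patch, size)
    norm_distance = distance_mm / MAX_DISTANCE_MM  # scale to [0, 1]
    return (patch, norm_distance), label
```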
Labels are stored in one-hot format (e.g., (1, 0) for Ready-Net or (1, 0, 0) for Kick-Net).
Data is split 80% for training and 20% for validation per class.
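A minimal sketch of such a per-class split (hypothetical helper, plain Python):

```python
import random

def split_per_class(entries_by_class, train_fraction=0.8, seed=42):
    """Split each class list 80/20 so the class balance is preserved."""
    train, val = [], []
    rng = random.Random(seed)
    for entries in entries_by_class.values():
        entries = entries[:]  # copy before shuffling
        rng.shuffle(entries)
        cut = int(len(entries) * train_fraction)
        train += entries[:cut]
        val += entries[cut:]
    return train, val
```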
Before training, data augmentation is applied. Pixel values are normalized to [0, 1] (instead of [0, 255]), images are converted to the YUV color space, and the dataset is shuffled and batched.
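These preprocessing steps could look as follows with tf.data (the framework is an assumption; batch and buffer sizes are illustrative):

```python
import tensorflow as tf

def preprocess(inputs, label):
    """Normalize pixels from [0, 255] to [0, 1] and convert RGB to YUV."""
    image, distance = inputs
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.rgb_to_yuv(image)
    return (image, distance), label

def build_pipeline(dataset: tf.data.Dataset, batch_size: int = 32) -> tf.data.Dataset:
    """Apply normalization, YUV conversion, shuffling, and batching.

    `dataset` is expected to yield ((image, norm_distance), label) entries.
    """
    return (dataset
            .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(buffer_size=1000)
            .batch(batch_size))
```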
## Data Augmentation
To improve robustness, random samples are slightly modified through the following operations (a sketch follows the list):
- Brightness variation
- Small pixel shifts
- Noise
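A minimal sketch of these operations (TensorFlow assumed; the magnitudes are illustrative, not the project's actual parameters):

```python
import tensorflow as tf

def augment(inputs, label):
    """Randomly perturb a sample: brightness, small shift, additive noise.

    Assumes pixel values are already scaled to [0, 1].
    """
    image, distance = inputs
    # Brightness variation
    image = tf.image.random_brightness(image, max_delta=0.1)
    # Small pixel shift (roll the image a few pixels in each direction)
    shift = tf.random.uniform([2], minval=-4, maxval=5, dtype=tf.int32)
    image = tf.roll(image, shift=shift, axis=[0, 1])
    # Additive Gaussian noise
    image += tf.random.normal(tf.shape(image), stddev=0.02)
    image = tf.clip_by_value(image, 0.0, 1.0)
    return (image, distance), label
```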
This helps reduce overfitting and improves the model's ability to generalize to real-world variations. The same augmentation strategy is applied to both networks.
## Network Architecture
Layer | Kick Output Shape | Ready Output Shape |
---|---|---|
Input Image | 256 × 256 × 3 | 256 × 200 × 3 |
Conv2D (5×5, stride 2) + BN + ReLU | 128 × 128 × 8 | 128 × 100 × 8 |
SeparableConv2D (3×3) + BN + ReLU | 128 × 128 × 16 | 128 × 100 × 16 |
SeparableConv2D (3×3, stride 2) + ReLU | 64 × 64 × 16 | 64 × 50 × 16 |
SeparableConv2D (3×3) + BN + ReLU | 64 × 64 × 32 | 64 × 50 × 32 |
SeparableConv2D (3×3, stride 2) + ReLU | 32 × 32 × 32 | 32 × 25 × 32 |
SeparableConv2D (3×3) + BN + ReLU | 32 × 32 × 32 | 32 × 25 × 32 |
SeparableConv2D (3×3) + BN + ReLU | 32 × 32 × 32 | 32 × 25 × 32 |
SeparableConv2D (3×3, stride 2) + ReLU | 16 × 16 × 32 | 16 × 13 × 32 |
SeparableConv2D (3×3) + BN + ReLU | 16 × 16 × 32 | 16 × 13 × 32 |
SeparableConv2D (3×3) + BN + ReLU | 16 × 16 × 32 | 16 × 13 × 32 |
SeparableConv2D (3×3, stride 2) + ReLU | 8 × 8 × 32 | 8 × 7 × 32 |
SeparableConv2D (3×3) + BN + ReLU | 8 × 8 × 32 | 8 × 7 × 32 |
SeparableConv2D (3×3) + BN + ReLU | 8 × 8 × 16 | 8 × 7 × 16 |
MaxPooling2D | 4 × 4 × 16 | 4 × 3 × 16 |
Flatten | 256 | 192 |
Input Distance | 1 | 1 |
Dense + ReLU | 8 | 8 |
Concatenate (Image + Distance) | 264 | 200 |
Dense + ReLU | 24 | 24 |
Dropout (0.3) | 24 | 24 |
Dense (Output Layer) | 3 | 2 |
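The table can be translated into a Keras model as sketched below (the framework and details such as padding are assumptions; the layer sequence, filter counts, and strides follow the table, kick variant by default):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(input_shape=(256, 256, 3), num_classes=3):
    """Two-input model following the layer table (kick variant by default).

    For the ready network, use input_shape=(256, 200, 3) and num_classes=2.
    """
    image_in = layers.Input(shape=input_shape, name="image")
    x = layers.Conv2D(8, 5, strides=2, padding="same")(image_in)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)

    # (filters, strides, use_batch_norm) for each SeparableConv2D block
    blocks = [(16, 1, True), (16, 2, False),
              (32, 1, True), (32, 2, False),
              (32, 1, True), (32, 1, True), (32, 2, False),
              (32, 1, True), (32, 1, True), (32, 2, False),
              (32, 1, True), (16, 1, True)]
    for filters, strides, use_bn in blocks:
        x = layers.SeparableConv2D(filters, 3, strides=strides, padding="same")(x)
        if use_bn:
            x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)

    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)

    distance_in = layers.Input(shape=(1,), name="distance")
    d = layers.Dense(8, activation="relu")(distance_in)

    x = layers.Concatenate()([x, d])
    x = layers.Dense(24, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    out = layers.Dense(num_classes)(x)  # logits; Softmax applied after prediction
    return tf.keras.Model([image_in, distance_in], out)
```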
The Softmax layer is omitted because CompiledNN does not support it natively; instead, Softmax is applied manually after the model prediction.
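A numerically stable Softmax applied to the raw model outputs could look like this:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# logits = model.predict([image_batch, distance_batch])  # raw scores
# probabilities = softmax(logits)
```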