Overview
Unmanned Aerial Vehicles (UAVs) have become essential tools across surveillance, healthcare, security, and commercial sectors. However, the unique characteristics of UAV imagery — such as bird's-eye perspectives, varying resolutions, and small object sizes — pose significant challenges for accurate labeling and object detection.
While manual annotation is precise, it is impractical for large-scale applications. Existing automated labeling tools and detection models, primarily trained on ground-level imagery, struggle to generalize to aerial contexts, particularly for small objects in cluttered backgrounds.
UAV Imagery: Overview
Unmanned Aerial Vehicles (UAVs) have emerged as a critical technology across domains including surveillance, healthcare, security, and commercial applications [12]. The imagery they capture is an essential data source, but its overhead, eagle-eye perspective gives it distinctive characteristics.
Resolution, altitude, and field of view introduce challenging variability, and accurate object detection and image labeling are among the major concerns. Many current studies rely on pre-labeled datasets and pre-trained models; however, these approaches struggle to generalize.
Low resolution, small-scale targets, and cluttered or complex backgrounds degrade detection accuracy and often force costly model retraining. Because UAV imagery is typically captured from an eagle-eye viewpoint, models trained on non-UAV datasets face significant challenges detecting small objects.
Key Challenges
- Hard: Low resolution & small targets
- Hard: Cluttered & complex backgrounds
- Hard: Domain shift (ground → aerial)
- Medium: Variable altitude & field of view
- Medium: Costly model retraining
Major UAV Datasets
In supervised machine learning, labeling is essential to train any model. Several benchmark datasets have been captured by UAV platforms, each with distinct characteristics and purposes.
- UAVDT: Benchmark for UAV object detection and tracking.
- AI-TOD: Specialized for detecting tiny objects in aerial imagery.
- VisDrone: Known for small and tiny objects; a leading benchmark in drone vision research.
- DIOR: Large-scale remote sensing dataset with richly diverse categories.
- EAGLE: Large-scale dataset for vehicle detection in real-world aerial scenarios.
Labeling & Detection Models
Labeling models can be categorized into two primary architectures: Transformers and Convolutional Neural Networks (CNNs). Most are pre-trained on non-UAV (ground-level) datasets, creating a domain gap for aerial tasks.
- SAM: Mask-based automatic segmentation. Trained on the SA-V dataset using a transformer with point and mask prompts. Pre-trained on non-UAV images.
- Grounding DINO: Text-prompt-based bounding box generation. Pre-trained on non-UAV datasets.
- Grounded SAM: Combines open-world models for diverse visual tasks (detection + segmentation).
- YOLO11: Detection, classification, segmentation, pose & oriented detection. Pre-trained on non-UAV data.
- DETR: End-to-end object detection with Transformers (CNN backbone). Segmentation & detection.
- Detectron2: Bounding boxes, segmentation, DensePose, keypoint & semantic segmentation. Multi-dataset pre-training.
Challenges & Potential Solutions
- Challenge: Manual annotation is precise but impractical at scale (time-consuming and expensive). Solution: semi-automatic pipelines combining human-in-the-loop review with automated pre-labeling.
- Challenge: Models trained on ground-level images fail to generalize to eagle-eye perspectives. Solution: domain adaptation and fine-tuning on aerial-specific datasets such as VisDrone or UAVDT.
- Challenge: Small object detection suffers at altitude due to limited pixel representation. Solution: super-resolution preprocessing and feature pyramid networks (FPN) for multi-scale detection.
- Challenge: Cluttered and complex backgrounds cause high false-positive rates. Solution: context-aware detection architectures and background-aware attention mechanisms.
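Alongside super-resolution and FPNs, tile-based (sliced) inference is a common way to keep small objects at native pixel resolution: the large UAV frame is cut into overlapping tiles that are fed to the detector individually. A minimal sketch of the tile-grid computation, not tied to any particular detector (the tile size and overlap values are illustrative):

```python
def tile_grid(width, height, tile=640, overlap=0.2):
    """Top-left/bottom-right corners of overlapping tiles covering an image.

    Overlap reduces the chance that a small object is cut at a tile
    boundary; the right and bottom edges are always covered.
    """
    stride = int(tile * (1 - overlap))
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    # Add edge-aligned tiles so no pixels are missed on the far borders.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y, x + tile, y + tile) for y in ys for x in xs]

# Example: a 4000x3000 UAV frame, 640px tiles, 20% overlap.
boxes = tile_grid(4000, 3000, tile=640, overlap=0.2)
```

Each box can then be cropped, run through the detector, and the per-tile detections merged back into frame coordinates with non-maximum suppression.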
Future Directions
Several directions are planned to improve auto-labeling performance on UAV imagery.
Incremental Dataset Training
Rather than training on the full dataset at once, the strategy is to train the model
in sequential 10% portions. Each portion produces a checkpoint (.pth)
that is fed into the next training stage, building a progressively refined model.
[Pipeline diagram: dataset subset → model training → checkpoint (.pth) → next training stage]
Repeat across all dataset portions until the full dataset has been used.
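The loop above can be sketched in Python. This is an illustrative skeleton only: the dictionary "model state" and the stage-training helper stand in for a real model and optimizer step, and pickle files stand in for `.pth` checkpoints.

```python
import os
import pickle
import tempfile

def split_into_portions(dataset, num_portions=10):
    """Split the dataset into sequential, roughly equal portions (10% each)."""
    size = max(1, len(dataset) // num_portions)
    return [dataset[i:i + size] for i in range(0, len(dataset), size)]

def train_one_stage(model_state, portion):
    """Placeholder 'training': just record what the model has seen so far."""
    model_state = dict(model_state)
    model_state["samples_seen"] += len(portion)
    model_state["stages"] += 1
    return model_state

def incremental_training(dataset, num_portions=10, ckpt_dir=None):
    ckpt_dir = ckpt_dir or tempfile.mkdtemp()
    state = {"samples_seen": 0, "stages": 0}
    for i, portion in enumerate(split_into_portions(dataset, num_portions)):
        state = train_one_stage(state, portion)            # train on this portion
        path = os.path.join(ckpt_dir, f"stage_{i}.pth")    # checkpoint this stage
        with open(path, "wb") as f:
            pickle.dump(state, f)
        with open(path, "rb") as f:                        # next stage resumes from it
            state = pickle.load(f)
    return state

final = incremental_training(list(range(100)))
```

With a real detector, `train_one_stage` would load the previous `.pth` weights, fine-tune on the new 10% portion, and save updated weights for the next stage.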
UAV-Specific Fine-Tuning
Fine-tune models on UAV-like datasets (e.g., VisDrone, UAVDT) to close the domain gap between ground-level pre-training and aerial imagery.
Address Class Imbalance
Balancing strategies such as oversampling minority classes or weighted loss functions can improve per-class detection performance.
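One way to realize the weighted-loss idea is inverse-frequency class weighting, where each class is weighted by `total / (num_classes * count)` so rare classes contribute more to the loss. A minimal sketch (the label names are a hypothetical UAV distribution, not from any dataset above):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class loss weights proportional to inverse class frequency.

    Rare classes (e.g. 'bicycle' in a car-dominated aerial dataset)
    receive weights above 1, common classes below 1.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * cnt) for cls, cnt in counts.items()}

# Toy distribution: cars dominate, bicycles are rare.
labels = ["car"] * 80 + ["person"] * 15 + ["bicycle"] * 5
weights = inverse_frequency_weights(labels)
```

The resulting dictionary can be passed as class weights to a weighted cross-entropy or focal loss, or used to set oversampling rates for minority classes.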
All evaluated models are pre-trained on COCO or non-UAV datasets. This causes a significant domain shift when applied to overhead UAV imagery.
Manual annotation introduces human error and limits scalability for larger datasets.
Summary & Outlook
UAV imagery presents unique challenges that make accurate object detection and labeling difficult. The eagle-eye view, cluttered and complex backgrounds, and high altitude create obstacles that traditional computer vision approaches struggle to overcome.
Manual labeling, while accurate, is extremely time-consuming and expensive. Meanwhile, existing automated models, mostly trained on ground-level images, struggle to generalize to UAV imagery, especially for small objects.
References
📁 Cited in Proposal
- D. Du, Y. Qi, H. Yu, et al., "The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking," arXiv:1804.00518, 2018.
- J. Wang, W. Yang, H. Guo, R. Zhang, G.-S. Xia, "Tiny Object Detection in Aerial Images," ICPR 2021, pp. 3791–3798. doi: 10.1109/ICPR48806.2021.9413340.
- P. Zhu, L. Wen, X. Bian, H. Ling, Q. Hu, "Vision Meets Drones: A Challenge," arXiv:1804.07437, 2018.
- K. Li, G. Wan, G. Cheng, "DIOR," IEEE Dataport, 2025. doi: 10.21227/tq1e-nx82.
- S. M. Azimi, R. Bahmanyar, C. Henry, F. Kurz, "EAGLE: Large-scale Vehicle Detection Dataset in Real-World Scenarios using Aerial Imagery," arXiv:2007.06124, 2020.
- T. Ren et al., "Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks," arXiv:2401.14159, 2024.
- S. Liu et al., "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection," arXiv:2303.05499, 2023.
- A. Kirillov et al., "Segment Anything," ICCV 2023, pp. 3992–4003.
- G. Jocher, J. Qiu, "Ultralytics YOLO11," v11.0.0, 2024.
- N. Carion, F. Massa, G. Synnaeve, et al., "End-to-End Object Detection with Transformers," arXiv:2005.12872, 2020.
- Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, R. Girshick, "Detectron2," 2019.
- P. Zhu et al., "Detection and Tracking Meet Drones Challenge," arXiv:2001.06303, 2021.
📚 Additional Related References
- Z. Cao, H. Wang, W. Li, Z. Zheng, H. Liu, "D2-Det: Towards High Quality Object Detection and Instance Segmentation for UAV Imagery," Remote Sensing, vol. 13, no. 4, 2021. doi: 10.3390/rs13040666.
- S. Hsieh, L. Lin, "Drone-Based Object Counting by Spatially Regularized Regional Proposal Network," ICCV 2017, pp. 4145–4153.
- K. Chen, J. Wang, J. Pang, et al., "MMDetection: Open MMLab Detection Toolbox and Benchmark," arXiv:1906.07155, 2019.
- X. Yang, J. Yang, J. Yan, et al., "SCRDet: Towards More Robust Detection for Small, Cluttered and Rotated Objects," ICCV 2019, pp. 8232–8241.
- N. Carion, F. Massa, G. Synnaeve, N. Usunier, "DETR for Pedestrian Detection," ECCV Workshop, 2020.
- T.-Y. Lin, P. Dollár, R. Girshick, K. He, et al., "Feature Pyramid Networks for Object Detection," CVPR 2017, pp. 2117–2125.
- W. Liu, D. Anguelov, D. Erhan, et al., "SSD: Single Shot MultiBox Detector," ECCV 2016, pp. 21–37.
- M. Everingham, L. Van Gool, C. K. I. Williams, et al., "The PASCAL Visual Object Classes (VOC) Challenge," IJCV, vol. 88, pp. 303–338, 2010.
- S. Razakarivony, F. Jurie, "Vehicle Detection in Aerial Imagery: A Small Target Detection Benchmark," Journal of Visual Communication and Image Representation, vol. 34, 2016.
- Y. Bai, Y. Zhang, M. Ding, B. Ghanem, "SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network," ECCV 2018.
- Z. Zheng, M. Zhong, J. Wang, Y. Gu, X. Zhang, "HyperLi-Net: A Hyper-Light Deep Learning Network for High-Accurate and High-Speed Ship Detection from SAR Images," ISPRS Journal, vol. 167, pp. 123–153, 2020.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," ICLR 2021. arXiv:2010.11929.
- J. Redmon, A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv:1804.02767, 2018.
- H. Caesar, V. Bankiti, A. H. Lang, et al., "nuScenes: A Multimodal Dataset for Autonomous Driving," CVPR 2020, pp. 11621–11631.
- T. Lin, M. Maire, S. Belongie, et al., "Microsoft COCO: Common Objects in Context," ECCV 2014, pp. 740–755.
- P. Zhu, L. Wen, D. Du, et al., "VisDrone-DET2019: The Vision Meets Drones Object Detection Challenge Results," ICCV Workshop, 2019.