Overview
Unmanned Aerial Vehicles (UAVs) have become essential tools across surveillance, healthcare, security, and commercial sectors. However, the unique characteristics of UAV imagery — such as bird's-eye perspectives, varying resolutions, and small object sizes — pose significant challenges for accurate labeling and object detection.
While manual annotation is precise, it is impractical for large-scale applications. Existing automated labeling tools and detection models, primarily trained on ground-level imagery, struggle to generalize to aerial contexts, particularly for small objects in cluttered backgrounds.
UAV Imagery: Overview
Unmanned Aerial Vehicles (UAVs) have emerged as a critical technology across domains including surveillance, healthcare, security, and commercial applications [12]. The imagery they capture is an essential data source, but its overhead, eagle-eye perspective gives it distinctive characteristics.
Resolution, altitude, and field of view introduce challenging variability, and accurate object detection and image labeling are among the major concerns. Many current studies rely on pre-labeled datasets and pre-trained models; however, these approaches struggle to generalize.
Low resolution, small-scale targets, and cluttered or complex backgrounds degrade detection accuracy and often force costly model retraining. Because UAV imagery is typically captured from an eagle-eye viewpoint, models trained on non-UAV datasets face significant challenges detecting small objects.
Key Challenges
- Hard: Low resolution & small targets
- Hard: Cluttered & complex backgrounds
- Hard: Domain shift (ground → aerial)
- Medium: Variable altitude & field of view
- Medium: Costly model retraining
Major UAV Datasets
In supervised machine learning, labeling is essential to train any model. Several benchmark datasets have been captured by UAV platforms, each with distinct characteristics and purposes.
- UAVDT: Benchmark for UAV object detection and tracking.
- AI-TOD: Specialized for detecting tiny objects in aerial imagery.
- VisDrone: Known for small and tiny objects; a leading benchmark in drone vision research.
- DIOR: Large-scale remote sensing dataset with richly diverse categories.
- EAGLE: Large-scale dataset for vehicle detection in real-world aerial scenarios.
Labeling & Detection Models
Labeling models can be categorized into two primary architectures: Transformers and Convolutional Neural Networks (CNNs). Most are pre-trained on non-UAV (ground-level) datasets, creating a domain gap for aerial tasks.
- SAM: Mask-based automatic segmentation. Trained on the SA-V dataset using a transformer with point and mask prompts. Pre-trained on non-UAV images.
- Grounding DINO: Text-prompt-based bounding box generation. Pre-trained on non-UAV datasets.
- Grounded SAM: Combines open-world models for diverse visual tasks (detection + segmentation).
- YOLO11: Detection, classification, segmentation, pose & oriented detection. Pre-trained on non-UAV data.
- DETR: End-to-end object detection with Transformers (CNN backbone). Segmentation & detection.
- Detectron2: Bounding boxes, segmentation, DensePose, keypoint & semantic segmentation. Multi-dataset pre-training.
Challenges & Potential Solutions
- Challenge: Manual annotation is precise but impractical at scale (time-consuming and expensive). Solution: semi-automatic pipelines combining human-in-the-loop review with automated pre-labeling.
- Challenge: Models trained on ground-level images fail to generalize to eagle-eye perspectives. Solution: domain adaptation and fine-tuning on aerial-specific datasets such as VisDrone or UAVDT.
- Challenge: Small object detection suffers at altitude due to limited pixel representation. Solution: super-resolution preprocessing and feature pyramid networks (FPN) for multi-scale detection.
- Challenge: Cluttered and complex backgrounds cause high false-positive rates. Solution: context-aware detection architectures and background-aware attention mechanisms.
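Alongside super-resolution and FPNs, tile-based (sliced) inference is a common way to keep small objects at native pixel resolution: the large UAV frame is cut into overlapping tiles that are fed to the detector individually. A minimal sketch of the tile-grid computation, not tied to any particular detector (the tile size and overlap values are illustrative):

```python
def tile_grid(width, height, tile=640, overlap=0.2):
    """Top-left/bottom-right corners of overlapping tiles covering an image.

    Overlap reduces the chance that a small object is cut at a tile
    boundary; the right and bottom edges are always covered.
    """
    stride = int(tile * (1 - overlap))
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    # Add edge-aligned tiles so no pixels are missed on the far borders.
    if xs[-1] + tile < width:
        xs.append(width - tile)
    if ys[-1] + tile < height:
        ys.append(height - tile)
    return [(x, y, x + tile, y + tile) for y in ys for x in xs]

# Example: a 4000x3000 UAV frame, 640px tiles, 20% overlap.
boxes = tile_grid(4000, 3000, tile=640, overlap=0.2)
```

Each box can then be cropped, run through the detector, and the per-tile detections merged back into frame coordinates with non-maximum suppression.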
Future Directions
Several directions are planned to improve auto-labeling performance on UAV imagery.
Incremental Dataset Training
Rather than training on the full dataset at once, the strategy is to train the model
in sequential 10% portions. Each portion produces a checkpoint (.pth)
that is fed into the next training stage, building a progressively refined model.
[Pipeline diagram: dataset subset → model training → checkpoint (.pth) → next training stage]
Repeat across all dataset portions until the full dataset has been used.
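The loop above can be sketched in Python. This is an illustrative skeleton only: the dictionary "model state" and the stage-training helper stand in for a real model and optimizer step, and pickle files stand in for `.pth` checkpoints.

```python
import os
import pickle
import tempfile

def split_into_portions(dataset, num_portions=10):
    """Split the dataset into sequential, roughly equal portions (10% each)."""
    size = max(1, len(dataset) // num_portions)
    return [dataset[i:i + size] for i in range(0, len(dataset), size)]

def train_one_stage(model_state, portion):
    """Placeholder 'training': just record what the model has seen so far."""
    model_state = dict(model_state)
    model_state["samples_seen"] += len(portion)
    model_state["stages"] += 1
    return model_state

def incremental_training(dataset, num_portions=10, ckpt_dir=None):
    ckpt_dir = ckpt_dir or tempfile.mkdtemp()
    state = {"samples_seen": 0, "stages": 0}
    for i, portion in enumerate(split_into_portions(dataset, num_portions)):
        state = train_one_stage(state, portion)            # train on this portion
        path = os.path.join(ckpt_dir, f"stage_{i}.pth")    # checkpoint this stage
        with open(path, "wb") as f:
            pickle.dump(state, f)
        with open(path, "rb") as f:                        # next stage resumes from it
            state = pickle.load(f)
    return state

final = incremental_training(list(range(100)))
```

With a real detector, `train_one_stage` would load the previous `.pth` weights, fine-tune on the new 10% portion, and save updated weights for the next stage.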
UAV-Specific Fine-Tuning
Fine-tune models on UAV-like datasets (e.g., VisDrone, UAVDT) to close the domain gap between ground-level pre-training and aerial imagery.
Address Class Imbalance
Balancing strategies such as oversampling minority classes or weighted loss functions can improve per-class detection performance.
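One way to realize the weighted-loss idea is inverse-frequency class weighting, where each class is weighted by `total / (num_classes * count)` so rare classes contribute more to the loss. A minimal sketch (the label names are a hypothetical UAV distribution, not from any dataset above):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class loss weights proportional to inverse class frequency.

    Rare classes (e.g. 'bicycle' in a car-dominated aerial dataset)
    receive weights above 1, common classes below 1.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * cnt) for cls, cnt in counts.items()}

# Toy distribution: cars dominate, bicycles are rare.
labels = ["car"] * 80 + ["person"] * 15 + ["bicycle"] * 5
weights = inverse_frequency_weights(labels)
```

The resulting dictionary can be passed as class weights to a weighted cross-entropy or focal loss, or used to set oversampling rates for minority classes.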
All evaluated models are pre-trained on COCO or non-UAV datasets. This causes a significant domain shift when applied to overhead UAV imagery.
Manual annotation introduces human error and limits scalability for larger datasets.
Summary & Outlook
UAV imagery presents unique challenges that make accurate object detection and labeling difficult. The eagle-eye view, cluttered and complex backgrounds, and high altitude create obstacles that traditional computer vision approaches struggle to overcome.
Manual labeling, while accurate, is extremely time-consuming and expensive. Meanwhile, existing automated models, mostly trained on ground-level images, struggle to generalize to UAV imagery, especially for small objects.
References
📁 Cited in Proposal
- D. Du, Y. Qi, H. Yu, et al., "The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking," arXiv:1804.00518, 2018.
- J. Wang, W. Yang, H. Guo, R. Zhang, G.-S. Xia, "Tiny Object Detection in Aerial Images," ICPR 2021, pp. 3791–3798. doi: 10.1109/ICPR48806.2021.9413340.
- P. Zhu, L. Wen, X. Bian, H. Ling, Q. Hu, "Vision Meets Drones: A Challenge," arXiv:1804.07437, 2018.
- K. Li, G. Wan, G. Cheng, "DIOR," IEEE Dataport, 2025. doi: 10.21227/tq1e-nx82.
- S. M. Azimi, R. Bahmanyar, C. Henry, F. Kurz, "EAGLE: Large-scale Vehicle Detection Dataset in Real-World Scenarios using Aerial Imagery," arXiv:2007.06124, 2020.
- T. Ren et al., "Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks," arXiv:2401.14159, 2024.
- S. Liu et al., "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection," arXiv:2303.05499, 2023.
- A. Kirillov et al., "Segment Anything," ICCV 2023, pp. 3992–4003.
- G. Jocher, J. Qiu, "Ultralytics YOLO11," v11.0.0, 2024.
- N. Carion, F. Massa, G. Synnaeve, et al., "End-to-End Object Detection with Transformers," arXiv:2005.12872, 2020.
- Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, R. Girshick, "Detectron2," 2019.
- P. Zhu et al., "Detection and Tracking Meet Drones Challenge," arXiv:2001.06303, 2021.
📚 Additional Related References
- Z. Cao, H. Wang, W. Li, Z. Zheng, H. Liu, "D2-Det: Towards High Quality Object Detection and Instance Segmentation for UAV Imagery," Remote Sensing, vol. 13, no. 4, 2021. doi: 10.3390/rs13040666.
- S. Hsieh, L. Lin, "Drone-Based Object Counting by Spatially Regularized Regional Proposal Network," ICCV 2017, pp. 4145–4153.
- K. Chen, J. Wang, J. Pang, et al., "MMDetection: Open MMLab Detection Toolbox and Benchmark," arXiv:1906.07155, 2019.
- X. Yang, J. Yang, J. Yan, et al., "SCRDet: Towards More Robust Detection for Small, Cluttered and Rotated Objects," ICCV 2019, pp. 8232–8241.
- N. Carion, F. Massa, G. Synnaeve, N. Usunier, "DETR for Pedestrian Detection," ECCV Workshop, 2020.
- T.-Y. Lin, P. Dollár, R. Girshick, K. He, et al., "Feature Pyramid Networks for Object Detection," CVPR 2017, pp. 2117–2125.
- W. Liu, D. Anguelov, D. Erhan, et al., "SSD: Single Shot MultiBox Detector," ECCV 2016, pp. 21–37.
- M. Everingham, L. Van Gool, C. K. I. Williams, et al., "The PASCAL Visual Object Classes (VOC) Challenge," IJCV, vol. 88, pp. 303–338, 2010.
- S. Razakarivony, F. Jurie, "Vehicle Detection in Aerial Imagery: A Small Target Detection Benchmark," Journal of Visual Communication and Image Representation, vol. 34, 2016.
- Y. Bai, Y. Zhang, M. Ding, B. Ghanem, "SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network," ECCV 2018.
- Z. Zheng, M. Zhong, J. Wang, Y. Gu, X. Zhang, "HyperLi-Net: A Hyper-Light Deep Learning Network for High-Accurate and High-Speed Ship Detection from SAR Images," ISPRS Journal, vol. 167, pp. 123–153, 2020.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," ICLR 2021. arXiv:2010.11929.
- J. Redmon, A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv:1804.02767, 2018.
- H. Caesar, V. Bankiti, A. H. Lang, et al., "nuScenes: A Multimodal Dataset for Autonomous Driving," CVPR 2020, pp. 11621–11631.
- T. Lin, M. Maire, S. Belongie, et al., "Microsoft COCO: Common Objects in Context," ECCV 2014, pp. 740–755.
- P. Zhu, L. Wen, D. Du, et al., "VisDrone-DET2019: The Vision Meets Drones Object Detection Challenge Results," ICCV Workshop, 2019.