Overview
Modern combine harvesters collect real-time, geo-located yield data during harvesting. However, these yield maps often contain measurement errors due to sensor noise, positional inaccuracies, and data transmission delays. Cleaning this data is crucial for training accurate machine learning models for sub-field level yield prediction.
Challenges
Raw yield maps often contain systematic errors, such as:
- GPS inaccuracies leading to positional shifts.
- Delayed yield registration, causing mismatched spatial locations.
- Speed fluctuations and cutting width inconsistencies, affecting reported yield values.
These errors add noise to the data used to train machine learning models, which reduces the models' ability to make reliable predictions at the sub-field scale.
Multi-Stage Data Cleaning Pipeline
We developed a multi-stage pipeline for outlier detection and yield data cleaning, consisting of the following stages:
Data Harmonization
Standardized units, translated column headers, and reprojected coordinates so that all datasets share a common schema.
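As a rough illustration of what harmonization can look like, the sketch below renames headers and converts yield to tonnes per hectare. The header names, the `HEADER_MAP` and `TO_T_PER_HA` tables, and the `harmonize` function are hypothetical and are not taken from the paper. Reprojection is left out here; in practice it would typically be handled by a geospatial library.

```python
# Hypothetical header translations; real harvester logs will use other names.
HEADER_MAP = {
    "Ertrag": "yield", "Breite": "lat", "Laenge": "lon",
    "Yield": "yield", "Latitude": "lat", "Longitude": "lon",
}

# Conversion factors to tonnes per hectare (1 dt = 100 kg, 1 t = 1000 kg).
TO_T_PER_HA = {"t/ha": 1.0, "dt/ha": 0.1, "kg/ha": 0.001}


def harmonize(rows, yield_unit):
    """Rename headers to a common schema and convert yield to t/ha."""
    factor = TO_T_PER_HA[yield_unit]
    harmonized = []
    for row in rows:
        # Unknown headers are kept unchanged rather than dropped.
        clean = {HEADER_MAP.get(key, key): float(value) for key, value in row.items()}
        clean["yield"] = clean["yield"] * factor
        harmonized.append(clean)
    return harmonized


# Two made-up source files with different languages and units.
german_rows = [{"Ertrag": "82.5", "Breite": "52.1", "Laenge": "10.3"}]
english_rows = [{"Yield": "7900", "Latitude": "52.2", "Longitude": "10.4"}]

records = harmonize(german_rows, "dt/ha") + harmonize(english_rows, "kg/ha")
```

After this step both records share the keys `yield`, `lat`, and `lon`, with yield expressed in t/ha.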
Regional Thresholding
Applied agronomic domain knowledge to filter out physically unrealistic yield values.
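A minimal sketch of this kind of filter is shown below. The crop name and the limits in `CROP_LIMITS_T_PER_HA` are placeholder assumptions for illustration, not the thresholds used in the study; real limits would come from regional agronomic knowledge.

```python
# Assumed plausible yield range per crop in t/ha (illustrative values only).
CROP_LIMITS_T_PER_HA = {"winter_wheat": (0.5, 15.0)}


def regional_threshold(yields, crop):
    """Keep only yield values inside the plausible range for the crop."""
    lower, upper = CROP_LIMITS_T_PER_HA[crop]
    return [y for y in yields if lower <= y <= upper]


raw_yields = [7.9, 0.0, 8.4, 42.0, 6.1, -1.2]
plausible = regional_threshold(raw_yields, "winter_wheat")
```

With these assumed limits, the zero, negative, and 42 t/ha readings are dropped and the other three values are kept.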
Statistical Outlier Detection
Used the three-sigma rule and the interquartile range (IQR) method to detect anomalous yield values.
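Both rules can be sketched with the Python standard library. The function names, the multipliers `k=3.0` and `k=1.5` (the conventional defaults), and the sample values are assumptions for illustration. How the study combined the two rules is not stated here, so the final step is only one possible choice.

```python
from statistics import mean, quantiles, stdev


def three_sigma_mask(values, k=3.0):
    """Flag values more than k sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [abs(v - mu) > k * sigma for v in values]


def iqr_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v < lower or v > upper for v in values]


# Twenty plausible readings in t/ha plus one spurious spike.
yields = [7.8, 7.9, 8.0, 8.1, 8.2] * 4 + [25.0]

sigma_flags = three_sigma_mask(yields)
iqr_flags = iqr_mask(yields)

# One possible combination: drop a value if either rule flags it.
cleaned = [v for v, s, q in zip(yields, sigma_flags, iqr_flags) if not (s or q)]
```

Both rules flag only the final 25.0 t/ha reading in this example. In general the IQR rule is less sensitive to the outliers themselves, because extreme values inflate the mean and standard deviation that the three-sigma rule depends on.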
Spatio-Temporal Filtering
Applied DBSCAN-based clustering to detect and correct spatial inconsistencies.
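To illustrate the idea, the sketch below runs a small, dependency-free DBSCAN over point positions and treats noise points as candidate spatial inconsistencies. This is an assumption about how the clustering might be used, not the method from the paper: it ignores the time dimension, it only detects and does not correct, and the `eps` and `min_samples` values are arbitrary. A library implementation such as scikit-learn's `DBSCAN` would normally replace the hand-written function.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Minimal O(n^2) DBSCAN; returns one label per point (-1 means noise)."""
    n = len(points)
    # Each neighbourhood includes the point itself, as in the usual definition.
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
        cluster += 1
    return labels


# Two parallel harvester passes (metres, 2 m spacing) plus one GPS-shifted point.
pass_a = [(float(x), 0.0) for x in range(0, 20, 2)]
pass_b = [(float(x), 6.0) for x in range(0, 20, 2)]
points = pass_a + pass_b + [(150.0, 80.0)]

labels = dbscan(points, eps=7.0, min_samples=4)
suspect_indices = [i for i, label in enumerate(labels) if label == NOISE]
```

Here the two passes form a single dense cluster, and the isolated point at index 20 is labelled noise and can be reviewed or removed.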
Key Findings
Our study found that:
- For field-level yield predictions, simple regional thresholding was sufficient.
- For sub-field level yield predictions, more advanced techniques such as IQR-based statistical filtering and spatio-temporal cleaning provided the best results.
- The most computationally expensive methods, such as advanced spatio-temporal filtering, improved prediction accuracy but were not always necessary.
Publication
This research was presented at IGARSS 2023 - IEEE International Geoscience and Remote Sensing Symposium.
📄 Read the Full Paper
Sample Yield Maps