The classification of land cover using eo-learn. Part 1

Hi, Habr! I present to your attention the translation of the article " Land Cover Classification with eo-learn: Part 1 " by Matic Lubej.

Part 2

Foreword

About six months ago, the first commit was made to the eo-learn repository on GitHub. Today, eo-learn has turned into a wonderful open source library, ready for use by anyone who is interested in EO (Earth Observation) data. Everyone in the Sinergise team was waiting for the moment of transition from the stage of building the necessary tools to the stage of their use for machine learning. It's time to introduce you to a series of articles on the classification of land cover using eo-learn

eo-learn is an open source Python library that acts as a bridge connecting Earth Observation / Remote Sensing with the Python machine learning library ecosystem. We have already written a separate post on our blog , which we recommend to read. The library uses primitives from the numpy and shapely libraries to store and manipulate satellite data. At the moment it is available in the GitHub repositories , and documentation is available via the corresponding link to ReadTheDocs .

Image from satellite Sentinel-2 and NDVI-mask of a small territory in Slovenia, in winter

To demonstrate the capabilities of eo-learn , we decided to use our multi-temporal pipeline to classify the cover of the territory of the Republic of Slovenia (the country where we live), using data for 2017. Since the complete procedure may be too complicated for one article, we decided to divide it into three parts. Due to this, there will be no need to skip steps and immediately go to machine learning - first we have to really understand the data with which we work. Each article will be accompanied by an example in the form of a Jupyter Notebook. Also, for those interested, we have already prepared a complete example covering all stages.

In the first article, we will guide you through the procedure for selecting / dividing the zone of interest (hereinafter - AOI, area of interest), and obtaining the necessary information, such as data from satellite sensors and cloud masks. We will also show an example of how to create a raster mask from vector data about the actual area coverage. All this is necessary steps to obtain a reliable result.
In the second part, we dive headlong into preparing the data for the machine learning procedure. This process involves taking random samples for training / validating pixels, removing cloud images, interpolating temporal data to fill holes, and so on.
The third part will be considered training and validation of the classifier, as well as, of course, beautiful graphics!

Image from satellite Sentinel-2 and NDVI-mask of a small territory in Slovenia, in summer

Zone of interest? Choose!

The eo-learn library allows you to split AOI into small chunks that can be processed with limited computing resources. In this example, the border of Slovenia was taken from Natural Earth , however, you can select a zone of any size. We also added a buffer to the border, after which the AOI dimension was approximately 250x170 km. Using geopandas and shapely library magic, we created an AOI break tool. In this case, we divided the territory into 25x17 squares of the same size, with the result that we got ~ 300 fragments of 1000x1000 pixels, in a resolution of 10m. The decision about the division into fragments is made depending on the available computing power. As a result of this step, we get a list of squares covering AOI.

AOI (territory of Slovenia) is divided into small squares approximately 1000x1000 pixels in resolution of 10m.

Receive data from Sentinel satellites

After determining the squares, eo-learn allows you to automatically download data from Sentinel satellites. In this example, we get all the Sentinel-2 L1C images that were taken in 2017. It is worth noting that Sentinel-2 L2A products, as well as additional data sources (Landsat-8, Sentinel-1) can be added to the pipeline in a similar way. It is also worth noting that the use of L2A products can improve the classification results, but we decided to use L1C for the universality of the solution. This was done using sentinelhub-py , a library that works like a wrapper on Sentinel-Hub services. The use of these services is free for research institutes and start-ups, but in other cases it is necessary to subscribe.

Color images of the same fragment on different days. Some images are cloudy, which means that a cloud detector is needed.

In addition to Sentinel data, eo-learn allows you to transparently access cloud data and cloud probabilities through the s2cloudless library. This library provides the means to automatically detect clouds pixel by pixel . Details can be read here .

Cloud probability masks for the images above. The color indicates the probability of cloudiness of a particular pixel (blue - low probability, yellow - high).

Add real data

Teaching with a teacher requires a card with real data, or truth . The latter term should not be taken literally, since in reality the data are only an approximation of what is on the surface. Unfortunately, the behavior of the classifier strongly depends on the quality of this card ( however, as for most other tasks in machine learning ). Marked maps are most often available in the form of vector data in a shapefile format (for example, provided by the state or community ). eo-learn contains tools for rasterizing vector data in the form of a raster mask.

The process of rasterizing data into masks on the example of one square. The left image shows polygons in the vector file, in the middle there are raster masks for each label — black and white indicate the presence and absence of a specific feature, respectively. The right image shows a combined raster mask, in which different colors indicate different labels.

We put everything together

All of these tasks behave like building blocks, which can be combined into a convenient sequence of actions performed for each square. Due to the potentially extremely large number of such fragments, pipeline automation is absolutely necessary.

 #   eo_workflow = eolearn.core.LinearWorkflow( add_sentinel2_data, #     Sentinel-2 add_cloud_mask, #     append_ndvi, #  NDVI append_ndwi, #  NDWI append_norm, #     add_valid_mask, #     add_count_valid, #  -      *reference_task_array, #        save_task #   )

Familiarization with the actual data - the first step in working with tasks of this kind. Using cloud masks paired with data from Sentinel-2, you can determine the number of qualitative observations of all pixels, as well as the average probability of clouds for a particular zone. This allows you to better understand existing data, and use this to debug further problems.

A color image (left), a mask for the quantity of quality measurements for 2017 (center), and the average probability of cloudiness for 2017 (right) for a random fragment from AOI.

Some may be interested in the average NDVI for an arbitrary zone, ignoring clouds. Using cloud masks, you can calculate the average value of any attribute, ignoring pixels without data. Thus, thanks to the masks, we can clear the image of noise for almost any attribute in our data.

The average NDVI of all pixels in a random AOI fragment over the course of a year. The blue line shows the result of the calculation, obtained if you ignore the values inside the clouds. The orange line shows the average when all pixels are taken into account.

"But what about scaling?"

After we set up our pipeline with the example of one fragment, all that remains to be done is to launch a similar procedure for all fragments automatically (if resources allow - in parallel), while you relax with a cup of coffee and think about how the big boss will be pleasantly surprised the results of your work. After the work of the pipeline is finished, you can export the data you are interested in into a single image in GeoTIFF format. The gdal_merge.py script takes pictures and connects them, with the result that we get a picture covering the whole country.

The number of valid shots for AOI in 2017. Regions with a large number of images are located in the territory where the trajectory of the Sentinel-2A and Sentinel-2B satellites intersect. In the middle of this does not happen.

From the image above, we can conclude that the input data is heterogeneous - for some fragments the number of images is two times higher than for others. This means that we need to take measures to normalize the data, such as interpolation along the time axis.

The execution of the specified pipeline takes approximately 140 seconds for one fragment, which in total gives ~ 12 hours at the start of the process throughout the AOI. Most of this time is the download of satellite data. The average uncompressed fragment with the described configuration takes about 3 GB, which in total gives ~ 1TB of space for the entire AOI.

Jupyter Notebook Example

For more simple familiarization with the eo-learn code, we have prepared an example covering the topics discussed in this post. The example is designed as a Jupyter notebook, and you can find it in the examples directory of the eo-learn package.

Source: https://habr.com/ru/post/452284/

All Articles