Imagine being able to ask a satellite, “Show me all the solar panels in this city” or “Highlight areas of deforestation in this region.” This kind of natural language interaction with Earth Observation (EO) data is becoming possible thanks to advances in AI vision language models – but there’s a catch.
The Earth Observation Revolution
Earth Observation satellites are constantly collecting vast amounts of data about our planet. From tracking urban development to monitoring climate change, this wealth of satellite imagery has transformed how we understand global changes. But there’s a growing challenge: how do we efficiently analyse this mountain of data?
Traditional approaches require specialised algorithms designed for specific tasks – one system for detecting buildings, another for tracking forest coverage, and so on. Each new application needs its own custom solution, making it costly and time-consuming to develop new Earth Observation capabilities.
Vision Language Models – A New Paradigm
Vision language models, like Microsoft’s newly released Florence-2, represent a powerful new approach. These AI systems can understand both images and natural language instructions, allowing users to interact with imagery using simple text prompts. Instead of building separate systems for each task, a single model can handle multiple applications based on text instructions.
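To make this concrete, here is a minimal sketch of that single-model, prompt-driven workflow using the Hugging Face `transformers` release of Florence-2. The model name and task-prompt tokens follow Microsoft's model card; the satellite image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Florence-2 ships with custom modelling code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)

# Placeholder path: any RGB satellite or aerial image chip will do.
image = Image.open("satellite_chip.png").convert("RGB")

# The task is selected purely by the text prompt, e.g. "<CAPTION>" or "<OD>".
prompt = "<CAPTION>"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# post_process_generation parses the raw output into a task-specific structure.
caption = processor.post_process_generation(
    raw, task=prompt, image_size=(image.width, image.height)
)
print(caption)  # e.g. {'<CAPTION>': 'an aerial view of a city with ...'}
```

Swapping the prompt token swaps the task: the same loaded model performs detection, grounding, or segmentation with no retraining or task-specific code.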
Florence-2 can handle many different image-analysis tasks, several of which map directly onto EO applications:
- Image captioning: Generate a descriptive text summary of the contents of an image. Useful for classifying images by general land use, types of vegetation, or other contextual features that contribute to classifying an image.
- Unprompted object detection: Detection of any and all objects within an image. Useful for extracting every individual feature in an image, such as buildings, ships, etc.
- Prompt-based object detection/segmentation: Text-based search for features in an image. Useful for extracting specified features of interest, as shown in the sketch after this list.
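As a sketch of that last, prompt-based mode, Florence-2's `<CAPTION_TO_PHRASE_GROUNDING>` task takes a free-text phrase and returns bounding boxes for matching regions. This reuses the `model`, `processor`, `image`, and `device` from the snippet above; the query phrase "solar panels" is a hypothetical example echoing the introduction.

```python
# Prompt-based detection: the task token is followed by the free-text query.
task = "<CAPTION_TO_PHRASE_GROUNDING>"
query = "solar panels"  # hypothetical feature of interest
inputs = processor(text=task + query, images=image, return_tensors="pt").to(device)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

result = processor.post_process_generation(
    raw, task=task, image_size=(image.width, image.height)
)
# result[task] holds parallel lists of pixel-space boxes and matched labels:
# {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['solar panels', ...]}
for box, label in zip(result[task]["bboxes"], result[task]["labels"]):
    print(label, [round(v, 1) for v in box])
```

The returned boxes are in pixel coordinates, so for EO work they would still need to be mapped back to geographic coordinates using the image's georeferencing metadata.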