Imagine being able to ask a satellite, “Show me all the solar panels in this city” or “Highlight areas of deforestation in this region.” This kind of natural language interaction with Earth Observation (EO) data is becoming possible thanks to advances in AI vision language models – but there’s a catch.
The Earth Observation Revolution
Earth Observation satellites are constantly collecting vast amounts of data about our planet. From tracking urban development to monitoring climate change, this wealth of satellite imagery has transformed how we understand global changes. But there’s a growing challenge: how do we efficiently analyse this mountain of data?
Traditional approaches require specialised algorithms designed for specific tasks – one system for detecting buildings, another for tracking forest coverage, and so on. Each new application needs its own custom solution, making it costly and time-consuming to develop new Earth Observation capabilities.
Vision Language Models – A New Paradigm
Vision language models, like Microsoft’s newly released Florence-2, represent a powerful new approach. These AI systems can understand both images and natural language instructions, allowing users to interact with imagery using simple text prompts. Instead of building separate systems for each task, a single model can handle multiple applications based on text instructions.
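To make this concrete, here is a minimal sketch of that single-model, prompt-driven workflow using the Hugging Face `transformers` release of Florence-2. The model name and task-prompt tokens follow Microsoft's model card; the satellite image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Florence-2 ships with custom modelling code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)

# Placeholder path: any RGB satellite or aerial image chip will do.
image = Image.open("satellite_chip.png").convert("RGB")

# The task is selected purely by the text prompt, e.g. "<CAPTION>" or "<OD>".
prompt = "<CAPTION>"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# post_process_generation parses the raw output into a task-specific structure.
caption = processor.post_process_generation(
    raw, task=prompt, image_size=(image.width, image.height)
)
print(caption)  # e.g. {'<CAPTION>': 'an aerial view of a city with ...'}
```

Swapping the prompt token swaps the task: the same loaded model performs detection, grounding, or segmentation with no retraining or task-specific code.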
Florence-2 can handle many different image-analysis tasks, several of which map directly onto EO applications:
- Image captioning: Generate a descriptive text summary of the contents of an image. Useful for classifying images by general land use, types of vegetation, or other contextual features that contribute to classifying an image.
- Unprompted object detection: Detection of any and all objects within an image. Useful for extracting every individual feature in an image, such as buildings, ships, etc.
- Prompt-based object detection/segmentation: Text-based search for features in an image. Useful for extracting specified features of interest, as shown in the sketch after this list.
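As a sketch of that last, prompt-based mode, Florence-2's `<CAPTION_TO_PHRASE_GROUNDING>` task takes a free-text phrase and returns bounding boxes for matching regions. This reuses the `model`, `processor`, `image`, and `device` from the snippet above; the query phrase "solar panels" is a hypothetical example echoing the introduction.

```python
# Prompt-based detection: the task token is followed by the free-text query.
task = "<CAPTION_TO_PHRASE_GROUNDING>"
query = "solar panels"  # hypothetical feature of interest
inputs = processor(text=task + query, images=image, return_tensors="pt").to(device)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

result = processor.post_process_generation(
    raw, task=task, image_size=(image.width, image.height)
)
# result[task] holds parallel lists of pixel-space boxes and matched labels:
# {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['solar panels', ...]}
for box, label in zip(result[task]["bboxes"], result[task]["labels"]):
    print(label, [round(v, 1) for v in box])
```

The returned boxes are in pixel coordinates, so for EO work they would still need to be mapped back to geographic coordinates using the image's georeferencing metadata.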