The Promise of Vision Language Models: AI for Earth Observation

Imagine being able to ask a satellite, “Show me all the solar panels in this city” or “Highlight areas of deforestation in this region.” This kind of natural language interaction with Earth Observation (EO) data is becoming possible thanks to advances in AI vision language models – but there’s a catch.


The Earth Observation Revolution 

Earth Observation satellites are constantly collecting vast amounts of data about our planet. From tracking urban development to monitoring climate change, this wealth of satellite imagery has transformed how we understand global changes. But there’s a growing challenge: how do we efficiently analyse this mountain of data?

Traditional approaches require specialised algorithms designed for specific tasks – one system for detecting buildings, another for tracking forest coverage, and so on. Each new application needs its own custom solution, making it costly and time-consuming to develop new Earth Observation capabilities.

Vision Language Models – A New Paradigm 

Vision language models, like Microsoft’s newly released Florence-2, represent a powerful new approach. These AI systems can understand both images and natural language instructions, allowing users to interact with imagery using simple text prompts. Instead of building separate systems for each task, a single model can handle multiple applications based on text instructions.

Figure 1: Florence-2 takes images and multi-task prompts as input to a standard multi-modality encoder-decoder model.

 

Florence-2 can handle many different image-understanding tasks that are relevant to Earth Observation:

  • Image captioning: Generate a descriptive text summary of the contents of an image. Useful for classifying images by general land use, vegetation type, or other contextual features.
  • Unprompted object detection: Detect any and all objects within an image. Useful for extracting every individual feature in an image, such as buildings, ships, etc.
  • Prompt-based object detection/segmentation: Text-based search for features in an image. Useful for extracting specified features of interest.

Our Experiments with Florence-2

To understand both the potential and current limitations of using general-purpose vision models for Earth Observation, we tested
Florence-2 on satellite imagery. The results revealed both exciting possibilities and important challenges:

What Works Well:

  1. Flexible Object Detection: The model can search for various features based on simple text prompts, without needing pre-defined categories.
  2. General Scene Understanding: It can distinguish between urban and rural environments and identify major landscape features.
  3. Water Body Detection: The model shows impressive accuracy in identifying and outlining water features like rivers.
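To make detection results like these usable in an EO workflow, the model's raw output has to be flattened into structured records. Florence-2's processor post-processes generations into a dict keyed by the task token, with parallel `bboxes` and `labels` lists (per the model card); the helper below is a hypothetical sketch of converting that structure into one record per detected object:

```python
def detections_to_records(parsed: dict, task_token: str = "<OD>") -> list[dict]:
    """Flatten Florence-2's post-processed output, e.g.
    {"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": ["ship", ...]}},
    into one record per detected object (pixel coordinates)."""
    result = parsed.get(task_token, {})
    records = []
    for box, label in zip(result.get("bboxes", []), result.get("labels", [])):
        x1, y1, x2, y2 = box
        records.append({
            "label": label,
            "bbox": [x1, y1, x2, y2],
            "area_px": (x2 - x1) * (y2 - y1),  # rough size filter for EO QA
        })
    return records
```

From here, each record's pixel bounding box can be georeferenced with the image's geotransform and exported to a GIS format.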

 

Current Limitations:

  1. Specialised Features: The model struggles with EO-specific elements like distinguishing between different types of land cover.
  2. Scale Understanding: Common objects such as buildings look very different from a satellite's perspective, which the model struggles to account for.
  3. Technical Accuracy: The model’s use of poetic descriptions can sometimes compromise technical accuracy, limiting its applicability in scientific contexts.

Why We Need EO-Specific Training Data

While general vision language models show promise, our experiments highlight why we need large, specialised datasets for Earth Observation. The view from above is fundamentally different from ground-level imagery in several ways:

  • Perspective: Objects look radically different from a satellite's viewpoint
  • Scale: Images cover vast areas with varying levels of detail
  • Spectral Information: Satellites capture data beyond visible light
  • Feature Types: Many important features (like land use patterns) are abstract rather than discrete objects

Figure 3: Florence-2 detects bounding boxes and segmentation masks for objects in EO images based on a text prompt.

 

The Path Forward

The potential benefits of adapting vision language models for Earth Observation are significant:

  • Faster Development: New applications could be developed through simple prompting rather than building specialised algorithms.
  • Greater Accessibility: Non-technical users could interact with EO data using natural language.
  • Greater Flexibility: A single system could handle multiple types of analysis based on user needs.

However, realising these benefits requires building larger, high-quality datasets specifically for training Earth Observation AI models. This investment in data infrastructure is crucial for unlocking the full potential of AI in environmental monitoring and planetary understanding.


The convergence of Earth Observation and advanced AI models opens exciting possibilities for environmental monitoring, urban planning, and climate action. By developing the right training data and adapting vision language models to the unique challenges of satellite imagery, we can make Earth Observation more powerful and accessible than ever before.

Conclusion

At Compass Informatics, we’re not just thinking about the future of Earth Observation – we’re actively building it. Our team is pushing the envelope of what’s possible by developing vision language models specifically trained on Irish and European landscapes. Working with partners across agriculture, forestry, and urban planning sectors, we are exploring how these AI capabilities can transform the way organisations interact with Earth Observation data.

We’re always looking for forward-thinking organisations to collaborate with as we develop these groundbreaking capabilities. Whether you’re working in environmental monitoring, land management, or sustainable development, we’d love to explore how next-generation Earth Observation tools could support your mission.