Aim
In this project, we aim to develop novel Deep Learning techniques that extend the ability of autonomous agents to track objects in visual scenes in a more robust and generalisable way.
Objectives
- Design challenging simulated environments and collect datasets as a testbed for visual object tracking, critically including hierarchical objects (objects that contain sub-objects).
- Design and implement novel Multimodal Generative AI models that mitigate the weaknesses of previous state-of-the-art systems; specifically, we will investigate whether recognition based on sub-objects or other object fragments can help solve the occlusion problem (recognising a partially hidden object); see the first sketch after this list.
- Design effective ways of deploying such systems in low-latency scenarios, in particular understanding which parts of the computational task are worth implementing on accelerators that already exist within EDI (see the second sketch after this list).
- Work with industrial partners to identify use cases, ideally DSTL for military object recognition (e.g. aeroplanes/drones or tanks).
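To make the sub-object idea in the second objective concrete, the following minimal sketch (in PyTorch; the module names, crop sizes, and dimensions are illustrative assumptions, not part of the proposal) classifies an object from whichever of its part crops remain visible, so a prediction can still be made when some parts are occluded:

```python
# Hypothetical sketch: recognising an object from the sub-object (part)
# crops that remain visible under partial occlusion.
import torch
import torch.nn as nn


class PartBasedRecogniser(nn.Module):
    """Encodes each visible part crop with a shared encoder, then pools
    the part embeddings into a single object-level prediction."""

    def __init__(self, num_classes: int, embed_dim: int = 128):
        super().__init__()
        # Shared CNN encoder applied to every part crop (assumed 3x32x32).
        self.part_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, parts: torch.Tensor, visible: torch.Tensor) -> torch.Tensor:
        # parts:   (batch, num_parts, 3, H, W) part crops
        # visible: (batch, num_parts) boolean mask, False = occluded part
        b, p = parts.shape[:2]
        z = self.part_encoder(parts.flatten(0, 1)).view(b, p, -1)
        # Zero out occluded parts and average over the visible ones only.
        w = visible.float().unsqueeze(-1)
        pooled = (z * w).sum(dim=1) / w.sum(dim=1).clamp(min=1.0)
        return self.classifier(pooled)


if __name__ == "__main__":
    model = PartBasedRecogniser(num_classes=5)
    crops = torch.randn(2, 4, 3, 32, 32)  # 4 candidate parts per object
    mask = torch.tensor([[1, 1, 0, 0], [1, 0, 1, 1]], dtype=torch.bool)
    print(model(crops, mask).shape)       # -> torch.Size([2, 5])
```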
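For the deployment objective, a natural first step is to measure where inference time goes. The second sketch below times each stage of a toy two-stage pipeline on CPU to identify candidates for accelerator offloading; the stage modules, input shape, and iteration count are placeholder assumptions:

```python
# Hypothetical sketch: per-stage latency profiling of a tracking pipeline
# to find the stages worth offloading to an accelerator.
import time
import torch
import torch.nn as nn

stages = {
    "backbone": nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU()),
    "head": nn.Sequential(nn.Flatten(), nn.Linear(64 * 112 * 112, 10)),
}

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    for name, stage in stages.items():
        start = time.perf_counter()
        for _ in range(20):
            y = stage(x)
        elapsed = (time.perf_counter() - start) / 20
        print(f"{name}: {elapsed * 1000:.2f} ms")  # mean latency per stage
        x = y  # feed this stage's output into the next stage
```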
Description
This project is situated within the scope of the “ASP: Autonomous Sensing Platforms” initiative and aims to advance Machine Learning techniques that enhance the capabilities of current Multimodal Generative AI models. Our objective is to develop novel Deep Learning methodologies that extend the ability of autonomous agents to track objects in visual scenes with increased robustness and generalisability. By leveraging advanced neural network architectures for hierarchical object-centric representation learning (e.g., slot attention; Locatello et al., 2020), we aim to improve the accuracy and reliability of object tracking in complex and dynamic environments.
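To make the notion of object-centric representation learning concrete, below is a minimal sketch of the Slot Attention module of Locatello et al. (2020), cited in the references; the hyperparameters, tensor shapes, and implementation details here are illustrative assumptions rather than a specification of the project's models:

```python
# Minimal sketch of Slot Attention (Locatello et al., 2020): a fixed set
# of slots competes, via iterative attention, to explain the input
# features, yielding one representation vector per object.
import torch
import torch.nn as nn


class SlotAttention(nn.Module):
    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        # Learned Gaussian from which the initial slots are sampled.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_inputs, self.norm_slots, self.norm_mlp = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, num_inputs, dim) flattened image features
        b = inputs.size(0)
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            b, self.num_slots, inputs.size(-1), device=inputs.device)
        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            # Softmax over slots, so slots compete for input features.
            attn = torch.softmax(torch.einsum("bnd,bmd->bnm", q, k) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean over inputs
            updates = torch.einsum("bnm,bmd->bnd", attn, v)
            slots = self.gru(updates.reshape(-1, updates.size(-1)),
                             slots_prev.reshape(-1, slots_prev.size(-1))).view_as(slots)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots  # (batch, num_slots, dim): one vector per object


if __name__ == "__main__":
    sa = SlotAttention(num_slots=5, dim=64)
    feats = torch.randn(2, 196, 64)  # e.g. a 14x14 feature map, flattened
    print(sa(feats).shape)           # -> torch.Size([2, 5, 64])
```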
Potential applications of this research include autonomous vehicles, where enhanced object tracking can improve navigation and safety; surveillance systems, which can benefit from more accurate monitoring and anomaly detection; and robotics, where improved visual perception can enhance interaction with and manipulation of objects in various settings.
The intended final output is an end-to-end image recognition system capable of identifying various types of military hardware even when partially occluded.
References
- Newman, K., Wang, S., Zang, Y., Heffren, D., & Sun, C. (2024). Do Pre-trained Vision-Language Models Encode Object States? arXiv preprint arXiv:2409.10488.
- Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., ... & Kipf, T. (2020). Object-centric learning with slot attention. Advances in Neural Information Processing Systems, 33, 11525-11538.
- Pantazopoulos, G., Suglia, A., Lemon, O., & Eshghi, A. (2024, June). Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers) (pp. 540-549).
Research theme:
Principal supervisor:
Dr Alessandro Suglia
Heriot-Watt University, School of Mathematical & Computer Sciences
A.Suglia@hw.ac.uk
Assistant supervisor:
Dr Alex Serb
University of Edinburgh, School of Engineering
aserb@ed.ac.uk