KGE & TVE RAG for LLMs
In the world of AI, combining text, images, and structured knowledge is pushing the boundaries of image understanding. In this post, we'll explore how FAISS-based Knowledge Graph Embeddings (KGE) and Traditional Vector Embeddings (TVE) work alongside a Vision-Language Model (VLM) to improve image analysis and retrieval.
The LLaVA VLM helps generate image captions and CLIP embeddings, providing a bridge between visual and textual modalities.
Vision-Language Models (VLMs) are trained to understand both visual and textual data. The LLaVA (Language and Vision Assistant) VLM used in this module takes an image and generates meaningful textual descriptions. In our case, the LLaVA model processed a test image, generating captions such as:
"A young girl is smiling and showing her teeth"
"She is wearing a colorful shirt and a brown scarf"
This step serves as the foundation for creating CLIP embeddings, which are essential for connecting visual and textual domains. These embeddings are used for the next steps, where they serve as the query for similarity searches.
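As a rough illustration, caption generation with LLaVA via the Hugging Face Transformers API might look like the sketch below. The checkpoint name, prompt template, and file path are assumptions; the original module may wrap LLaVA differently.

```python
# Minimal sketch: generating a caption for the test image with LLaVA.
# The model ID, prompt format, and image path are assumptions.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("test_image.jpg")
prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)

# Note: the decoded text includes the prompt; in practice you would strip it off.
caption = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(caption)
```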
CLIP (Contrastive Language–Image Pretraining) creates a shared latent space for both images and text. This capability allows it to generate embeddings that can be directly compared for cross-modal retrieval, i.e., comparing image embeddings with text descriptions or other images.
In this module, CLIP was used to generate embeddings from the test image. These embeddings act as vectors in the latent space that represent the image's characteristics. This becomes the key data structure that is compared to other embeddings stored in FAISS indices.
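A minimal sketch of extracting the CLIP image embedding that later serves as the FAISS query vector is shown below; the checkpoint name and image path are assumptions.

```python
# Sketch: computing a CLIP image embedding for the test image.
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("test_image.jpg")
inputs = clip_processor(images=image, return_tensors="pt")
image_features = clip_model.get_image_features(**inputs)  # shape: (1, 512)

# FAISS expects contiguous float32 NumPy arrays
query_vector = image_features.detach().numpy().astype(np.float32)
```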
FAISS is an efficient library for similarity search, particularly useful for working with large-scale embeddings. In this project, both KGE and TVE embeddings are stored in FAISS, allowing for rapid querying and retrieval.
Knowledge Graph Embeddings (KGE): These embeddings are trained using triplets extracted from structured data, like knowledge graphs, which contain relationships between entities. The PyKeen library is used here to generate KGE from REBEL triplets (subject, relation, object).
Traditional Vector Embeddings (TVE): TVE, on the other hand, represents image features in a more traditional vector format without incorporating structured knowledge.
These two types of embeddings complement each other. KGEs add a layer of semantic understanding based on real-world entities and their relationships, while TVEs capture raw feature similarities in the image space.
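To make the storage step concrete, here is a minimal sketch of how both sets of vectors might be indexed and persisted with FAISS. The flat L2 index type, dimensionality, file names, and placeholder vectors are all assumptions standing in for the project's real KGE and TVE matrices.

```python
# Sketch: building and persisting FAISS indices for TVE and KGE vectors.
import faiss
import numpy as np

dim = 512  # assumed embedding dimensionality, matching the CLIP image features

# Placeholder matrices standing in for the real TVE / KGE vectors
tve_vectors = np.random.rand(1000, dim).astype(np.float32)
kge_vectors = np.random.rand(1000, dim).astype(np.float32)

tve_index = faiss.IndexFlatL2(dim)  # exact L2 (Euclidean) distance search
kge_index = faiss.IndexFlatL2(dim)
tve_index.add(tve_vectors)
kge_index.add(kge_vectors)

faiss.write_index(tve_index, "tve.index")
faiss.write_index(kge_index, "kge.index")
```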
Once the test image embedding is generated via CLIP, the next step is to perform a similarity search against the FAISS indices containing KGE and TVE.
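The query itself then reduces to a pair of `search` calls, as in the sketch below. The index file names are assumptions, and `query_vector` refers to the CLIP image embedding from the earlier snippet (the KGE vectors are assumed to share the same dimensionality as the query).

```python
# Sketch: querying both FAISS indices with the CLIP image embedding.
import faiss

tve_index = faiss.read_index("tve.index")
kge_index = faiss.read_index("kge.index")

k = 2  # number of nearest neighbours to retrieve

# search() returns (distances, indices), each of shape (num_queries, k)
tve_distances, tve_ids = tve_index.search(query_vector, k)
kge_distances, kge_ids = kge_index.search(query_vector, k)

print("TVE:", list(zip(tve_ids[0], tve_distances[0])))
print("KGE:", list(zip(kge_ids[0], kge_distances[0])))
```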
For the given test image, FAISS produced the following similarity scores:
TVE Similarity:
Image 1: Distance = 81.27
Image 2: Distance = 77.19
KGE Similarity:
Image 1: Distance = 121.69
Image 2: Distance = 121.45
The distance metrics indicate how similar these retrieved images are to the test image, with smaller distances indicating higher similarity. By using both KGE and TVE, we capture a broader spectrum of similarity, from raw pixel-based comparisons to more semantic, relationship-based similarities.
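One simple, hypothetical way to merge the two result lists into a single ranking is to normalise each distance list and take a weighted sum. This fusion scheme (and the equal weighting) is an illustrative assumption, not the project's confirmed strategy.

```python
# Sketch: fusing TVE and KGE distances into one ranking (smaller = more similar).
def normalise(distances):
    lo, hi = min(distances), max(distances)
    return [(d - lo) / (hi - lo + 1e-9) for d in distances]

def fuse(tve_results, kge_results, alpha=0.5):
    """tve_results / kge_results: dicts mapping image id -> raw FAISS distance."""
    ids = set(tve_results) | set(kge_results)
    tve_norm = dict(zip(tve_results, normalise(list(tve_results.values()))))
    kge_norm = dict(zip(kge_results, normalise(list(kge_results.values()))))
    scores = {
        i: alpha * tve_norm.get(i, 1.0) + (1 - alpha) * kge_norm.get(i, 1.0)
        for i in ids
    }
    return sorted(scores, key=scores.get)  # best (smallest fused distance) first

print(fuse({"img1": 81.27, "img2": 77.19}, {"img1": 121.69, "img2": 121.45}))
```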
To summarize, here's the structured workflow:
Image Captioning with VLM (LLaVA): The test image is fed into the VLM, generating textual descriptions. These captions are part of a larger multimodal embedding system.
CLIP Embeddings Generation: The CLIP processor extracts the latent image embeddings from the test image.
FAISS Index Loading: Pre-trained KGE and TVE embeddings are loaded into FAISS indices. These embeddings were trained using PyKeen on datasets like REBEL triplets, adding semantic knowledge to the retrieval process.
Similarity Search: A similarity search is performed by comparing the test image embeddings with those stored in FAISS indices. The result is a list of the most similar images based on both TVE and KGE.
Image Relevance and Ranking: The retrieved images are ranked by similarity score, offering insights into how well each FAISS index can retrieve relevant images based on different embedding strategies.
A crucial element of this project is the use of Knowledge Graph Embeddings (KGE), particularly how they are fine-tuned with triplet extraction. In this context, triplets refer to entities and their relationships, structured as (subject, relation, object). For example, "Paris is the capital of France" is a triplet that could inform KGE. The PyKeen library, which specializes in training and evaluating KGE models, is used to optimize these embeddings by incorporating such structured knowledge.
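As a hedged sketch, training such embeddings with PyKeen could look like the following. The TransE model, epoch count, and toy in-line triplets are assumptions made for illustration; in the project the triplets come from REBEL, and the exact accessor for the learned embeddings may differ between PyKeen versions.

```python
# Sketch: training knowledge graph embeddings from (subject, relation, object) triplets.
import numpy as np
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Toy triplets for illustration; the project would use REBEL-extracted triplets instead.
triples = np.array([
    ["Paris", "capital_of", "France"],
    ["France", "located_in", "Europe"],
    ["Paris", "located_in", "France"],
], dtype=str)

tf = TriplesFactory.from_labeled_triples(triples)
result = pipeline(
    training=tf,
    testing=tf,  # toy setup; use a proper train/test split in practice
    model="TransE",
    training_kwargs=dict(num_epochs=100),
)

# Entity embeddings that can later be added to a FAISS index
entity_embeddings = result.model.entity_representations[0]().detach().numpy()
print(entity_embeddings.shape)
```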
More fine-tuned triplet extraction
PyKeen training methods for embedding generation
Combining entity and relation embeddings
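As one hypothetical illustration of the last point, entity and relation vectors could be combined by simple concatenation into a single triplet-level vector. This is an assumed approach for the sake of example, not the project's confirmed method.

```python
# Sketch: combining entity and relation embeddings by concatenation.
import numpy as np

def triplet_vector(subj_vec, rel_vec, obj_vec):
    """Concatenate subject, relation, and object embeddings into one vector."""
    return np.concatenate([subj_vec, rel_vec, obj_vec])

# Example with random placeholder vectors of dimension 64
subj, rel, obj = (np.random.rand(64).astype(np.float32) for _ in range(3))
print(triplet_vector(subj, rel, obj).shape)  # (192,)
```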
The combination of Vision-Language Models (VLM), CLIP embeddings, Knowledge Graph Embeddings (KGE), and Traditional Vector Embeddings (TVE), using FAISS, is advancing image inference. By leveraging each method's strengths, we achieve more accurate and meaningful image retrieval. KGE adds structured knowledge, enabling context-aware searches beyond basic similarity.
This approach highlights the potential of multimodal AI systems to improve machine perception and decision-making by integrating text, visuals, and structured data. The field is still evolving, with ongoing research focused on improving accuracy, efficiency, and expanding the applications of these technologies.
The focus of this section of the project was to enhance image inference using a hybrid of Knowledge Graph Embeddings (KGE), Traditional Vector Embeddings (TVE), and a Vision-Language Model (VLM). By leveraging FAISS for similarity search, we can quickly and efficiently retrieve images similar to a given test image.