KGE & TVE RAG for LLMs
In the world of AI, combining text, images, and structured knowledge is pushing the boundaries of image understanding. In this post, we'll explore how FAISS-based Knowledge Graph Embeddings (KGE) and Traditional Vector Embeddings (TVE) work alongside a Vision-Language Model (VLM) to improve image analysis and retrieval.
The LLaVA VLM helps generate image captions and CLIP embeddings, providing a bridge between visual and textual modalities.
Vision-Language Models (VLMs) are trained to understand both visual and textual data. The LLaVA (Language and Vision Assistant) VLM used in this module takes an image and generates meaningful textual descriptions. In our case, the LLaVA model processed a test image, generating captions such as:
"A young girl is smiling and showing her teeth"
"She is wearing a colorful shirt and a brown scarf"
This step serves as the foundation for creating CLIP embeddings, which are essential for connecting visual and textual domains. These embeddings are used for the next steps, where they serve as the query for similarity searches.
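As a rough illustration, caption generation with LLaVA via the Hugging Face Transformers API might look like the sketch below. The checkpoint name, prompt template, and file path are assumptions; the original module may wrap LLaVA differently.

```python
# Minimal sketch: generating a caption for the test image with LLaVA.
# The model ID, prompt format, and image path are assumptions.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("test_image.jpg")
prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)

# Note: the decoded text includes the prompt; in practice you would strip it off.
caption = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(caption)
```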
CLIP (Contrastive Language–Image Pretraining) creates a shared latent space for both images and text. This capability allows it to generate embeddings that can be directly compared for cross-modal retrieval, i.e., comparing image embeddings with text descriptions or other images.
In this module, CLIP was used to generate embeddings from the test image. These embeddings act as vectors in the latent space that represent the image's characteristics. This becomes the key data structure that is compared to other embeddings stored in FAISS indices.
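A minimal sketch of extracting the CLIP image embedding that later serves as the FAISS query vector is shown below; the checkpoint name and image path are assumptions.

```python
# Sketch: computing a CLIP image embedding for the test image.
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("test_image.jpg")
inputs = clip_processor(images=image, return_tensors="pt")
image_features = clip_model.get_image_features(**inputs)  # shape: (1, 512)

# FAISS expects contiguous float32 NumPy arrays
query_vector = image_features.detach().numpy().astype(np.float32)
```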
FAISS is an efficient library for similarity search, particularly useful for working with large-scale embeddings. In this project, both KGE and TVE embeddings are stored in FAISS, allowing for rapid querying and retrieval.
Knowledge Graph Embeddings (KGE): These embeddings are trained using triplets extracted from structured data, like knowledge graphs, which contain relationships between entities. The PyKeen library is used here to generate KGE from REBEL triplets (subject, relation, object).
Traditional Vector Embeddings (TVE): TVE, on the other hand, represents image features in a more traditional vector format without incorporating structured knowledge.
These two types of embeddings complement each other. KGEs add a layer of semantic understanding based on real-world entities and their relationships, while TVEs capture raw feature similarities in the image space.
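To make the storage step concrete, here is a minimal sketch of how both sets of vectors might be indexed and persisted with FAISS. The flat L2 index type, dimensionality, file names, and placeholder vectors are all assumptions standing in for the project's real KGE and TVE matrices.

```python
# Sketch: building and persisting FAISS indices for TVE and KGE vectors.
import faiss
import numpy as np

dim = 512  # assumed embedding dimensionality, matching the CLIP image features

# Placeholder matrices standing in for the real TVE / KGE vectors
tve_vectors = np.random.rand(1000, dim).astype(np.float32)
kge_vectors = np.random.rand(1000, dim).astype(np.float32)

tve_index = faiss.IndexFlatL2(dim)  # exact L2 (Euclidean) distance search
kge_index = faiss.IndexFlatL2(dim)
tve_index.add(tve_vectors)
kge_index.add(kge_vectors)

faiss.write_index(tve_index, "tve.index")
faiss.write_index(kge_index, "kge.index")
```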
Once the test image embedding is generated via CLIP, the next step is to perform a similarity search against the FAISS indices containing KGE and TVE.
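The query itself then reduces to a pair of `search` calls, as in the sketch below. The index file names are assumptions, and `query_vector` refers to the CLIP image embedding from the earlier snippet (the KGE vectors are assumed to share the same dimensionality as the query).

```python
# Sketch: querying both FAISS indices with the CLIP image embedding.
import faiss

tve_index = faiss.read_index("tve.index")
kge_index = faiss.read_index("kge.index")

k = 2  # number of nearest neighbours to retrieve

# search() returns (distances, indices), each of shape (num_queries, k)
tve_distances, tve_ids = tve_index.search(query_vector, k)
kge_distances, kge_ids = kge_index.search(query_vector, k)

print("TVE:", list(zip(tve_ids[0], tve_distances[0])))
print("KGE:", list(zip(kge_ids[0], kge_distances[0])))
```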
For the given test image, FAISS produced the following similarity scores:
TVE Similarity:
Image 1: Distance = 81.27
Image 2: Distance = 77.19
KGE Similarity:
Image 1: Distance = 121.69
Image 2: Distance = 121.45
The distance metrics indicate how similar these retrieved images are to the test image, with smaller distances indicating higher similarity. By using both KGE and TVE, we capture a broader spectrum of similarity, from raw pixel-based comparisons to more semantic, relationship-based similarities.
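One simple, hypothetical way to merge the two result lists into a single ranking is to normalise each distance list and take a weighted sum. This fusion scheme (and the equal weighting) is an illustrative assumption, not the project's confirmed strategy.

```python
# Sketch: fusing TVE and KGE distances into one ranking (smaller = more similar).
def normalise(distances):
    lo, hi = min(distances), max(distances)
    return [(d - lo) / (hi - lo + 1e-9) for d in distances]

def fuse(tve_results, kge_results, alpha=0.5):
    """tve_results / kge_results: dicts mapping image id -> raw FAISS distance."""
    ids = set(tve_results) | set(kge_results)
    tve_norm = dict(zip(tve_results, normalise(list(tve_results.values()))))
    kge_norm = dict(zip(kge_results, normalise(list(kge_results.values()))))
    scores = {
        i: alpha * tve_norm.get(i, 1.0) + (1 - alpha) * kge_norm.get(i, 1.0)
        for i in ids
    }
    return sorted(scores, key=scores.get)  # best (smallest fused distance) first

print(fuse({"img1": 81.27, "img2": 77.19}, {"img1": 121.69, "img2": 121.45}))
```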
To summarize, here's the structured workflow:
Image Captioning with VLM (LLaVA): The test image is fed into the VLM, generating textual descriptions. These captions are part of a larger multimodal embedding system.
CLIP Embeddings Generation: The CLIP processor extracts the latent image embeddings from the test image.
FAISS Index Loading: Pre-trained KGE and TVE embeddings are loaded into FAISS indices. These embeddings were trained using PyKeen on datasets like REBEL triplets, adding semantic knowledge to the retrieval process.
Similarity Search: A similarity search is performed by comparing the test image embeddings with those stored in FAISS indices. The result is a list of the most similar images based on both TVE and KGE.
Image Relevance and Ranking: The retrieved images are ranked by similarity score, offering insights into how well each FAISS index can retrieve relevant images based on different embedding strategies.
A crucial element of this project is the use of Knowledge Graph Embeddings (KGE), particularly how they are fine-tuned with triplet extraction. In this context, triplets refer to entities and their relationships, structured as (subject, relation, object). For example, "Paris is the capital of France" is a triplet that could inform KGE. The PyKeen library, which specializes in training and evaluating KGE models, is used to optimize these embeddings by incorporating such structured knowledge.
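As a hedged sketch, training such embeddings with PyKeen could look like the following. The TransE model, epoch count, and toy in-line triplets are assumptions made for illustration; in the project the triplets come from REBEL, and the exact accessor for the learned embeddings may differ between PyKeen versions.

```python
# Sketch: training knowledge graph embeddings from (subject, relation, object) triplets.
import numpy as np
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Toy triplets for illustration; the project would use REBEL-extracted triplets instead.
triples = np.array([
    ["Paris", "capital_of", "France"],
    ["France", "located_in", "Europe"],
    ["Paris", "located_in", "France"],
], dtype=str)

tf = TriplesFactory.from_labeled_triples(triples)
result = pipeline(
    training=tf,
    testing=tf,  # toy setup; use a proper train/test split in practice
    model="TransE",
    training_kwargs=dict(num_epochs=100),
)

# Entity embeddings that can later be added to a FAISS index
entity_embeddings = result.model.entity_representations[0]().detach().numpy()
print(entity_embeddings.shape)
```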
More fine-tuned triplet extraction
PyKeen training methods for embedding generation
Combining entity and relation embeddings
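As one hypothetical illustration of the last point, entity and relation vectors could be combined by simple concatenation into a single triplet-level vector. This is an assumed approach for the sake of example, not the project's confirmed method.

```python
# Sketch: combining entity and relation embeddings by concatenation.
import numpy as np

def triplet_vector(subj_vec, rel_vec, obj_vec):
    """Concatenate subject, relation, and object embeddings into one vector."""
    return np.concatenate([subj_vec, rel_vec, obj_vec])

# Example with random placeholder vectors of dimension 64
subj, rel, obj = (np.random.rand(64).astype(np.float32) for _ in range(3))
print(triplet_vector(subj, rel, obj).shape)  # (192,)
```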
The combination of Vision-Language Models (VLM), CLIP embeddings, Knowledge Graph Embeddings (KGE), and Traditional Vector Embeddings (TVE), using FAISS, is advancing image inference. By leveraging each method's strengths, we achieve more accurate and meaningful image retrieval. KGE adds structured knowledge, enabling context-aware searches beyond basic similarity.
This approach highlights the potential of multimodal AI systems to improve machine perception and decision-making by integrating text, visuals, and structured data. The field is still evolving, with ongoing research focused on improving accuracy, efficiency, and expanding the applications of these technologies.
The focus of this section of the project was to enhance image inference using a hybrid of Knowledge Graph Embeddings (KGE), Traditional Vector Embeddings (TVE), and a Vision-Language Model (VLM). By leveraging FAISS for similarity search, we can quickly and efficiently retrieve images similar to a given test image.