The Traditional Embeddings Problem
For this project, we used a large, diverse dataset to create embeddings across text, images, and knowledge graphs, laying the groundwork for our exploration.
COYO Dataset Overview
COYO-700M Dataset
The COYO-700M dataset is an extensive collection of text-image pairs, offering a vast and diverse resource for embedding tasks. This large-scale dataset was instrumental in generating high-quality embeddings that could be analyzed for their contextual understanding and semantic depth. Link to COYO-700M Dataset
Subset of COYO-700M (1,000 text-image pairs)
To tailor the dataset to the specific needs of our project, we created a focused subset of 1,000 text-image pairs. This carefully selected subset allowed us to conduct detailed analysis, creating embeddings that were not only meaningful but also manageable for visualization and contextual evaluation. Link to COYO-1K Reduced Subset
To access the datasets:
For easy access to the datasets used in our embeddings, the dataset folder contains links to both the Google Drive and Kaggle repositories for the COYO-700M dataset and its subset.
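For reference, here is a minimal sketch of loading the 1K subset once it has been downloaded locally; the file name and column names ("url", "text") are assumptions about the subset's layout, so adjust them to match the actual files in the dataset folder.

```python
# A minimal sketch of loading the 1K subset from a local download.
# The file name and column names are assumptions; adapt them to the real files.
import pandas as pd

subset = pd.read_csv("coyo-1k.csv")      # 1,000 text-image pairs
print(subset.columns.tolist())           # e.g. ["id", "url", "text", ...]
print(subset[["url", "text"]].head())
```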
To dive deep into understanding text embeddings, we employed four Transformer-based models, each offering a unique approach to capturing semantic meaning (a brief encoding sketch follows the list):
CLIP Text Encoder
BERT
Universal Sentence Encoder (USE)
InferSent
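As an illustration, the sketch below encodes text with two of these models (the CLIP text encoder and BERT) through Hugging Face transformers; USE and InferSent have their own loaders but follow the same encode-and-pool pattern. The checkpoints and mean-pooling choice are assumptions for the sake of the example, not necessarily the project's exact pipeline.

```python
# A hedged sketch of producing text embeddings with two of the four models
# (the CLIP text encoder and BERT) via Hugging Face transformers; checkpoints
# and mean pooling are illustrative choices.
import torch
from transformers import AutoModel, AutoTokenizer, CLIPModel, CLIPTokenizer

texts = ["A black fitness mat lies on the ground beside a black kettlebell."]

# CLIP text encoder -> 512-dim embeddings
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    clip_emb = clip.get_text_features(**clip_tok(texts, padding=True, return_tensors="pt"))

# BERT -> 768-dim embeddings via mean pooling of the last hidden state
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    out = bert(**bert_tok(texts, padding=True, return_tensors="pt"))
    bert_emb = out.last_hidden_state.mean(dim=1)

print(clip_emb.shape, bert_emb.shape)  # (1, 512) and (1, 768)
```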
Using a focused subset of 1,000 rows from the COYO dataset, we generated embeddings from each of these models. Given that the embeddings produced were of varying dimensions, we applied three advanced dimensionality reduction techniques to standardize and compare them effectively (see the sketch after this list):
PCA (Principal Component Analysis): A linear method that emphasizes variance and helps in identifying the principal components.
UMAP (Uniform Manifold Approximation and Projection): A non-linear technique that excels in preserving the global structure of data.
t-SNE (t-Distributed Stochastic Neighbor Embedding): A method particularly useful for visualizing clusters in high-dimensional data.
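A minimal sketch of applying the three techniques to one model's (1000, d) embedding matrix is shown below; the hyperparameters and file name are illustrative assumptions rather than the project's exact settings.

```python
# A minimal sketch of reducing a (1000, d) embedding matrix to 2-D with
# PCA, UMAP, and t-SNE; hyperparameters and the file name are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap

embeddings = np.load("bert_embeddings.npy")   # shape (1000, d); illustrative file name

pca_2d  = PCA(n_components=2).fit_transform(embeddings)
umap_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)
```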
These techniques allowed us not only to compare the embeddings across models but also to visualize the relationships and differences in a more interpretable lower-dimensional space. The reduced embedding datasets, which serve as a crucial resource for our analysis, can be accessed here:
InferSent
CLIP Text Encoder
BERT
Universal Sentence Encoder
The coyo-1k-TEXT dataset is one of the datasets used for comparison in this project. It is specifically designed to evaluate how well different dimensionality reduction techniques perform in terms of local and global structure preservation, geometric accuracy, and computational efficiency.
Below are the comparison scores of PCA, UMAP, and t-SNE techniques on the coyo-1k-TEXT dataset:
| Metric | PCA | UMAP | t-SNE |
| --- | --- | --- | --- |
| Trustworthiness | 0.999700 | 0.999543 | 0.999591 |
| Continuity | 0.999799 | 0.999736 | 0.999751 |
| Cosine Similarity | 7.257074e-07 | 0.150968 | 0.002865 |
| Linear Translation | 1.319569e-25 | 22.482631 | 30.238292 |
| Euclidean Correlation | 0.999997 | 0.738646 | 0.981130 |
| Geodesic Distance Preservation | 0.572656 | 0.982638 | 0.948360 |
Trustworthiness: Measures local structure preservation. Higher scores indicate better preservation, with values ranging from 0 to 1.
Linked paper: https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf
Continuity: Measures global structure preservation by embeddings. Higher scores indicate better preservation, with values ranging from 0 to 1.
Linked paper: https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf
Geodesic Distance Preservation (GDP): Focuses on preserving the intrinsic geometry of the manifold. Lower values indicate better preservation.
Understanding GDP: https://groups.csail.mit.edu/gdpgroup/assets/68410_spring_2023/chapter6.pdf
Cosine Similarity: Indicates the preservation of angular relationships. Higher similarity implies better preservation.
Linear Translation (MSE): Measures the mean squared error (MSE) of the mapping. Lower values indicate better mapping.
Euclidean Correlation: Measures the linear correlation between the original and reduced spaces. Higher correlation indicates better technique performance.
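As a rough illustration, the sketch below shows one way three of these metrics could be computed with scikit-learn and SciPy. The neighborhood size, file names, and the shortcut of computing continuity as trustworthiness with the two spaces swapped are assumptions, not necessarily the implementation used for the scores above.

```python
# A hedged sketch of computing trustworthiness, continuity, and Euclidean
# correlation; exact settings in the project may differ.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr
from sklearn.manifold import trustworthiness

def euclidean_correlation(X, X_reduced):
    """Pearson correlation between pairwise Euclidean distances before and after reduction."""
    return pearsonr(pdist(X), pdist(X_reduced))[0]

X = np.load("bert_embeddings.npy")    # original high-dimensional embeddings (illustrative file name)
X_red = np.load("bert_pca_2d.npy")    # reduced embeddings (illustrative file name)

trust = trustworthiness(X, X_red, n_neighbors=5)   # local structure preservation
cont  = trustworthiness(X_red, X, n_neighbors=5)   # continuity as trustworthiness with roles swapped (assumption)
eucl  = euclidean_correlation(X, X_red)
print(f"trustworthiness={trust:.6f} continuity={cont:.6f} euclidean_corr={eucl:.6f}")
```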
To explore image embeddings, we utilized the CLIP model, which is uniquely designed to handle both text and image modalities. CLIP is trained on an extensive collection of {text: image} pairs and maps both modalities into a unified vector space, allowing direct comparisons between the semantic content of text and images.
By generating image embeddings, we were able to align the visual features with textual descriptions, capturing the intricate relationships between the two. This approach facilitates a more nuanced understanding of how visual and textual data correlate, enabling us to analyze them in a common representational space.
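Here is a minimal sketch of this shared text-image space, assuming the Hugging Face CLIP implementation; the checkpoint and image file name are illustrative.

```python
# A minimal sketch of embedding an image and a caption into the same CLIP
# vector space and comparing them; checkpoint and file name are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("kettlebell.jpg")   # illustrative file name
text = "A black fitness mat lies on the ground beside a black kettlebell."

inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Both embeddings live in the same 512-dim space, so cosine similarity is meaningful.
sim = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(sim.item())
```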
With TensorBoard, we visualized the reduced text and image embeddings using PCA/UMAP/t-SNE and examined the relationship between texts and images.
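For reference, a minimal sketch of logging embeddings to the TensorBoard projector with PyTorch's SummaryWriter follows; the log directory, tag, and file names are assumptions.

```python
# A hedged sketch of logging image embeddings and their captions to the
# TensorBoard projector; directory, tag, and file names are assumptions.
import numpy as np
import torch
from torch.utils.tensorboard import SummaryWriter

image_emb = np.load("clip_image_embeddings.npy")      # (1000, 512); illustrative file name
captions = open("captions.txt").read().splitlines()   # one caption per image (illustrative)

writer = SummaryWriter(log_dir="runs/coyo1k")
writer.add_embedding(torch.tensor(image_emb), metadata=captions, tag="clip_image")
writer.close()
# Then run: tensorboard --logdir runs/coyo1k  and pick PCA / t-SNE / UMAP in the Projector tab.
```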
Although many images and their corresponding embeddings did fall into the clusters of similar texts, some were off; here are examples of the TensorBoard views we used. The screenshot below shows PCA image embeddings with their corresponding texts.
TensorBoard visualization of PCA-reduced image embeddings labelled with text.
The image and text embeddings of two electronic items are close together, which looks quite good!
The embeddings look reasonable overall, but on closer inspection of two nearby image embeddings and their corresponding texts, there appears to be a disparity between them.
To delve deeper into the problem using similarity scores, we focused on a few specific examples.
Let's examine the test images, categorized as follows:
People posing for a photo
Home decor and furniture
Motorbikes and cars
Now, what do you think the image embeddings would reveal? Which images would be closer in similarity? You might initially think, "This is easy, I've got it!"—but that’s not the case.
Consider an image that appears to show fitness equipment on a house floor. The corresponding text reads, "A black fitness mat lies on the ground beside a black kettlebell."
We plotted the embeddings of all the images using TensorBoard, with KMeans clustering and PCA for dimensionality reduction.
One might expect that the black fitness mat would be categorized with home equipment, given that it’s something typically found in a house. So, it seems logical that the closest image embeddings would relate to home decor, right?
Actually, no. The top 2 images closest to the "fitness mat and kettlebell" are completely unrelated in appearance. The corresponding texts for the top two images are:
"A black and silver motorcycle parked on a surface" 1.359
"A woman posing in front of a plain background" 1.379
The 3rd and 4th embeddings are related to home decor, but then we have an image of "A handsome property manager posing for a picture" at 5th place. This highlights a noticeable disparity.
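To show how such neighbor distances can be obtained, here is a hedged sketch of querying the nearest image embeddings with Euclidean distance; the query index, metric, and file names are assumptions rather than the project's exact setup.

```python
# A hedged sketch of finding the images whose embeddings are closest to a
# query image; the query index and file name are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

image_emb = np.load("clip_image_embeddings.npy")   # illustrative file name
query_idx = 0                                      # index of the "fitness mat and kettlebell" image (assumed)

nn = NearestNeighbors(n_neighbors=6, metric="euclidean").fit(image_emb)
dists, idxs = nn.kneighbors(image_emb[query_idx:query_idx + 1])
for d, i in zip(dists[0][1:], idxs[0][1:]):        # skip the query itself
    print(f"neighbor {i}: distance {d:.3f}")
```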
This disparity highlights a critical issue when using traditional vector embeddings, especially when dealing with multi-modal data like images and text. The embeddings for images that seem contextually similar, such as home decor and fitness equipment, do not always cluster together as expected. Instead, the embeddings may align with unrelated categories, such as motorbikes or people posing, due to the limitations of vector-based similarity measures.
This brings us to a fundamental problem: traditional vector embeddings often lack the ability to capture the rich, nuanced relationships between different modalities. They treat each piece of data as an isolated point in space, without understanding the underlying context that connects them.
But what if we could represent these relationships in a graph? By connecting images, text, and other data types through nodes and edges, Knowledge Graphs allow us to capture deeper, more meaningful connections. This approach moves beyond simple vector embeddings, providing a structured, relational understanding of our data.
With Knowledge Graphs, we can improve tasks like information retrieval, recommendations, and large language models (LLMs) by making them more context-aware and accurate. In the next section, we'll explore how integrating Knowledge Graphs enhances these capabilities.