# The Traditional Embeddings Problem

## The dataset we used for embeddings

For this project, we worked from a single large-scale dataset to generate embeddings across text, images, and knowledge graphs, laying the groundwork for our exploration.

**COYO Dataset Overview**

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FmDYA8r3Wj8Jrv6FRMUw9%2FCOYO_Preview.png?alt=media&#x26;token=76da7ba8-3db0-4118-89b1-a63ca1d5c756" alt="" width="375"><figcaption></figcaption></figure>

* **COYO-700M Dataset**\
  The COYO-700M dataset is an extensive collection of text-image pairs, offering a vast and diverse resource for embedding tasks. This large-scale dataset was instrumental in generating high-quality embeddings that could be analyzed for their contextual understanding and semantic depth.\
  [Link to COYO-700M Dataset](https://github.com/kakaobrain/coyo-dataset)
* **Subset of COYO-700M (1,000 text-image pairs)**\
  To tailor the dataset to the specific needs of our project, we created a focused subset of 1,000 text-image pairs. This carefully selected subset allowed us to conduct detailed analysis, creating embeddings that were not only meaningful but also manageable for visualization and contextual evaluation.\
  [Link to COYO-1K Reduced Subset](https://www.kaggle.com/datasets/anantjain1223/coyo-1k-reduced)

**To access the datasets:**

For easy access to the datasets used in our embeddings, the [dataset folder](https://github.com/dsgiitr/kge-clip/tree/main/5.dataset) contains links to both the Google Drive and Kaggle repositories for the COYO-700M dataset and its subset.

***

## Text Embeddings

To understand text embeddings in depth, we employed four Transformer-based models, each offering a unique approach to capturing semantic meaning:

1. **CLIP Text Encoder**
2. **BERT**
3. **Universal Sentence Encoder (USE)**
4. **InferSent**

Using a focused subset of 1,000 rows from the COYO dataset, we generated embeddings from each of these models. Given that the embeddings produced were of varying dimensions, we applied three advanced dimensionality reduction techniques to standardize and compare them effectively:

* <mark style="color:blue;">**PCA (Principal Component Analysis):**</mark> A linear method that emphasizes variance and helps in identifying the principal components.
* <mark style="color:blue;">**UMAP (Uniform Manifold Approximation and Projection):**</mark> A non-linear technique that excels in preserving the global structure of data.
* <mark style="color:blue;">**t-SNE (t-Distributed Stochastic Neighbor Embedding):**</mark> A method particularly useful for visualizing clusters in high-dimensional data.

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FY9zwXEZrm2C4EbbPphBT%2Fimage.png?alt=media&#x26;token=960d7eaa-a019-4d2d-8820-d940b1547084" alt="" width="375"><figcaption><p>Screenshot of reduced embeddings</p></figcaption></figure>

These techniques allowed us not only to compare the embeddings across models but also to visualize the relationships and differences in a more interpretable lower-dimensional space. The reduced embedding datasets, which serve as a crucial resource for our analysis, can be accessed here:

| Sentence Encoder           | PCA                                                                                                 | UMAP                                                                                                  | T-SNE                                                                                                  |
| -------------------------- | --------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| Infersent                  | [PCA](https://www.kaggle.com/datasets/anantjain1223/infersent-coyo-1k?select=infersent_pca.csv)     | [UMAP](https://www.kaggle.com/datasets/anantjain1223/infersent-coyo-1k?select=infersent_umap.csv)     | [T-SNE](https://www.kaggle.com/datasets/anantjain1223/infersent-coyo-1k?select=infersent_tsne.csv)     |
| Clip-text                  | [PCA](https://www.kaggle.com/datasets/anantjain1223/clip-text-coyo-1k?select=clip_text_pca.csv)     | [UMAP](https://www.kaggle.com/datasets/anantjain1223/clip-text-coyo-1k?select=clip_text_umap+csv.csv) | [T-SNE](https://www.kaggle.com/datasets/anantjain1223/clip-text-coyo-1k?select=clip_text_tsne.csv)     |
| BERT                       | [PCA](https://www.kaggle.com/datasets/anantjain1223/sentence-transformer-coyo-1k?select=st_pca.csv) | [UMAP](https://www.kaggle.com/datasets/anantjain1223/sentence-transformer-coyo-1k?select=st_umap.csv) | [T-SNE](https://www.kaggle.com/datasets/anantjain1223/sentence-transformer-coyo-1k?select=st_tsne.csv) |
| Universal Sentence Encoder | [PCA](https://www.kaggle.com/datasets/anantjain1223/use-coyo-1k?select=use_pca.csv)                 | [UMAP](https://www.kaggle.com/datasets/anantjain1223/use-coyo-1k?select=use_umap.csv)                 | [T-SNE](https://www.kaggle.com/datasets/anantjain1223/use-coyo-1k?select=use_tsne.csv)                 |
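The reduction step itself takes only a few lines. The sketch below applies scikit-learn's PCA and t-SNE to a stand-in random matrix; in the real pipeline, `emb` would be the `(1000, d)` array produced by one of the four encoders above, and UMAP works analogously through the third-party `umap-learn` package.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for an encoder output: 200 captions x 64-dim embeddings.
# In the real pipeline this is the (1000, d) matrix produced by
# InferSent, CLIP-text, BERT, or USE.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))

# Linear reduction to 3 components.
emb_pca = PCA(n_components=3).fit_transform(emb)

# Non-linear reduction; perplexity must stay below the sample count.
emb_tsne = TSNE(n_components=3, perplexity=30, random_state=0).fit_transform(emb)

# UMAP (not shown) is analogous: umap.UMAP(n_components=3).fit_transform(emb)
print(emb_pca.shape, emb_tsne.shape)
```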

## Reduced Embedding Scores

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2Fvf0NTAvhEN2bmqCeCEEo%2Fimage.png?alt=media&#x26;token=54764a7f-c883-42ed-9c39-9c4cabcef36a" alt="" width="563"><figcaption><p>Two low dimensional data representation</p></figcaption></figure>

The coyo-1k-TEXT dataset is one of the datasets used for comparison in this project. It is specifically designed to evaluate how well different dimensionality reduction techniques perform in terms of local and global structure preservation, geometric accuracy, and computational efficiency.

#### <mark style="color:blue;">Scores for coyo-1k-TEXT Dataset</mark>

Below are the comparison scores of PCA, UMAP, and t-SNE techniques on the coyo-1k-TEXT dataset:

| Comparison                     | PCA          | UMAP      | t-SNE     |
| ------------------------------ | ------------ | --------- | --------- |
| Trustworthiness                | 0.999700     | 0.999543  | 0.999591  |
| Continuity                     | 0.999799     | 0.999736  | 0.999751  |
| Cosine Similarity              | 7.257074e-07 | 0.150968  | 0.002865  |
| Linear Translation             | 1.319569e-25 | 22.482631 | 30.238292 |
| Euclidean Correlation          | 0.999997     | 0.738646  | 0.981130  |
| Geodesic Distance Preservation | 0.572656     | 0.982638  | 0.948360  |

#### Key Metrics Explained

* **Trustworthiness:** Measures local structure preservation. **Higher** scores indicate better preservation, with values ranging from 0 to 1.

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FPnfREkfSDUsW3BIAIY7i%2Fimage.png?alt=media&#x26;token=b9bd67e4-4fd4-4950-b8f5-81410f258e20" alt=""><figcaption><p>Mathematical formula for Trustworthiness</p></figcaption></figure>

Linked paper: <https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf>
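Trustworthiness is available directly in scikit-learn, so a score like the one in the table above can be reproduced as follows (with random stand-in data here rather than the COYO embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))                 # original high-dim embeddings
X_red = PCA(n_components=2).fit_transform(X)   # reduced embeddings

# Penalizes points that are k-nearest neighbours in the reduced space
# but not in the original space; 1.0 means local structure is intact.
score = trustworthiness(X, X_red, n_neighbors=5)
print(round(score, 4))
```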

* **Continuity:** Measures how well the embedding preserves global structure. **Higher** scores indicate better preservation, with values ranging from 0 to 1.

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FvCYfAjFUqiw7b4m54bop%2Fimage.png?alt=media&#x26;token=5fe486bc-09a0-4754-b614-e7e14992cdea" alt=""><figcaption><p>Mathematical formula for Continuity</p></figcaption></figure>

Linked paper: <https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf>
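Continuity has no ready-made scikit-learn function, but it penalizes the mirror-image errors of trustworthiness (original-space neighbours lost in the embedding, rank-weighted in the embedded space), so a common trick is to compute trustworthiness with the two spaces swapped. This is a sketch, not the project's exact implementation; verify the identity against the formula above for your use case.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

def continuity(X, X_red, n_neighbors=5):
    # Continuity penalizes original-space neighbours that are lost in the
    # reduced space -- the converse of trustworthiness, so it can be
    # computed by swapping the roles of the two spaces.
    return trustworthiness(X_red, X, n_neighbors=n_neighbors)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))
X_red = PCA(n_components=2).fit_transform(X)
c = continuity(X, X_red, n_neighbors=5)
print(round(c, 4))
```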

* **Geodesic Distance Preservation (GDP):** Measures how faithfully the intrinsic geometry of the data manifold is preserved. In the correlation-style scores reported in the table above, values closer to 1 indicate better preservation.

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2F9Tc8qrSqoI6DQD2Ex51C%2FApproximate-geodesic-distance-between-A-and-B.png?alt=media&#x26;token=3d63377f-bb05-4373-9bcc-ac2c512f97c2" alt="" width="319"><figcaption><p>Geodesic distance between A and B</p></figcaption></figure>

Understanding GDP: <https://groups.csail.mit.edu/gdpgroup/assets/68410_spring_2023/chapter6.pdf>
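One plausible way to score geodesic preservation (an illustrative sketch, not necessarily the project's exact formulation) is the Isomap construction: approximate geodesic distances with shortest paths over a k-nearest-neighbour graph of the original data, then rank-correlate them with straight-line distances in the reduced space.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.neighbors import kneighbors_graph

def geodesic_preservation(X, X_red, n_neighbors=10):
    # Geodesic distances approximated by shortest paths over a k-NN graph
    # (the Isomap construction).
    graph = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
    geo = shortest_path(graph, method="D", directed=False)
    # Straight-line distances in the reduced space.
    red = squareform(pdist(X_red))
    # Rank correlation over the upper triangle (each pair counted once).
    iu = np.triu_indices_from(geo, k=1)
    rho, _ = spearmanr(geo[iu], red[iu])
    return rho

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))
X_red = PCA(n_components=2).fit_transform(X)
rho = geodesic_preservation(X, X_red)
print(round(rho, 3))
```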

* **Cosine Similarity:** Indicates the preservation of angular relationships. **Higher** similarity implies better preservation.

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FyaeDAVV4Mrt7lobrXmpk%2FCosine%2BSimilarity%2BRadial%2BExamples.png?alt=media&#x26;token=c3b5c4f6-ccae-4d59-9141-5e13d45e6287" alt="" width="322"><figcaption><p>Cosine Similarity between Vectors</p></figcaption></figure>

* **Linear Translation (MSE):** Measures the mean squared error (MSE) of the mapping. **Lower** values indicate better mapping.

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FduOFVoJwgF09Fi6MPem0%2F11111111.png?alt=media&#x26;token=5bad322d-03e0-42a4-b822-06a1284257a4" alt="" width="375"><figcaption><p>Linear Regression</p></figcaption></figure>

* **Euclidean Correlation:** Measures the linear correlation between the original and reduced spaces. **Higher** correlation indicates better technique performance.

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FK8sNPb5cilgvu0kYE99V%2Fimage.png?alt=media&#x26;token=9f4eea2b-a6ac-49c3-87ac-948568a41e15" alt="" width="375"><figcaption><p>Euclidean Distance Formula</p></figcaption></figure>
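The remaining three metrics can be sketched together. These are plausible formulations rather than the project's exact code: cosine-similarity preservation as the correlation of pairwise cosine similarities, linear translation as the MSE of a fitted linear map from the reduced space back to the original, and Euclidean correlation as the Pearson correlation of pairwise distances. A full-rank PCA is just a centering plus rotation, so it makes a good sanity check: distances are preserved exactly and the linear map reconstructs the data almost perfectly, matching the near-zero PCA MSE in the table.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
# Full-rank PCA = centering + rotation: a sanity-check "reduction".
X_red = PCA(n_components=16).fit_transform(X)

# Cosine-similarity preservation: correlation of pairwise cosine
# similarities across the two spaces.
cos_corr, _ = pearsonr(1 - pdist(X, "cosine"), 1 - pdist(X_red, "cosine"))

# Linear translation (MSE): fit a linear map from the reduced space back
# to the original and score its reconstruction error.
X_hat = LinearRegression().fit(X_red, X).predict(X_red)
mse = mean_squared_error(X, X_hat)

# Euclidean correlation: Pearson correlation of pairwise distances.
dist_corr, _ = pearsonr(pdist(X), pdist(X_red))

print(round(cos_corr, 4), mse, round(dist_corr, 6))
```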

## Image Embeddings

To explore image embeddings, we utilized the CLIP model, which is designed to handle both text and image modalities. CLIP is trained on a massive collection of `{text:image}` pairs and maps both modalities into a single shared vector space, allowing direct comparison between the semantic content of texts and images.

By generating image embeddings, we were able to align the visual features with textual descriptions, capturing the intricate relationships between the two. This approach facilitates a more nuanced understanding of how visual and textual data correlate, enabling us to analyze them in a common representational space.
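As a sketch of this step using the Hugging Face `transformers` API (the checkpoint name is an assumption, the weights are downloaded on first use, and the blank stand-in image would be a COYO image in the real pipeline):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint name is an assumption; any CLIP checkpoint with a shared
# projection space works the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Blank stand-in image; the project would load a COYO image here.
image = Image.new("RGB", (224, 224))
texts = ["a black fitness mat", "a motorcycle parked on a surface"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])

# Both modalities land in the same 512-dim space, so cosine similarity
# compares the image directly against each caption.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
print(img_emb @ txt_emb.T)
```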

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FdBHwuaNvXghZ0AYOq2Db%2Fimage.png?alt=media&#x26;token=06e00b2e-3892-4247-9296-aaac763f9458" alt=""><figcaption><p>An overview of CLIP's multimodal embeddings architecture</p></figcaption></figure>

## TensorBoard Visualization and Insights

With TensorBoard, we visualised the PCA-, UMAP-, and t-SNE-reduced text and image embeddings to study the relationships between texts and images.

Although many images and their corresponding embeddings did fall into clusters of semantically similar texts, some were noticeably off. The screenshots below show image PCA embeddings labelled with their corresponding texts.

| <img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FErtnzeFZEKSR4Ym2mTRr%2Fimage.png?alt=media&#x26;token=4f9330e1-9d11-4cd3-8ee1-d896f5926587" alt="" data-size="original"> | <img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2F1xFBNZMKkQxh75lUlClE%2Fimage.png?alt=media&#x26;token=b16ec178-54ef-410a-81ea-2214c670eff5" alt="" data-size="original"> | <img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FKr6LkjrXLRiLHzvOF0aA%2Fimage.png?alt=media&#x26;token=137e7d37-cd07-42ea-b736-93d07f7292cf" alt="" data-size="original"> |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Tensorboard visualization of PCA reduced Image embedding labelled with text.**                                                                                                                                                                           | **Two electronic items image and text are close, quite good!**                                                                                                                                                                                             | **The embeddings look reasonable but on closer look checking 2 close image embeddings and corresponding texts, there seems disparity between them.**                                                                                                       |

To delve deeper into the problem using similarity scores, we focused on a few specific examples.

Let's examine the test images, categorized as follows:

1. People posing for a photo
2. Home decor and furniture
3. Motorbikes and cars

| People Posing                                                                                                                                                                                                                                              | Home Decor                                                                                                                                                                                                                                                 | Motorbike and Car                                                                                                                                                                                                                                          |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FN7TccjMz3FzVMrtwUZRj%2Fimage.png?alt=media&#x26;token=fb398625-d2f6-43a4-bef3-4c2b1f95d0ac" alt="" data-size="original"> | <img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FJSBTQ3pDAChN6Wgl4Qe5%2Fimage.png?alt=media&#x26;token=8c513eda-8c52-40d9-bb16-cf400c683519" alt="" data-size="original"> | <img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FcjNGhIZwnpM9i2f7DAe1%2Fimage.png?alt=media&#x26;token=8c051bf3-0e41-42fa-aca1-df46cb8bd2fc" alt="" data-size="original"> |
| <img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FZz8BDHlSNEPgzcSAlRyr%2Fimage.png?alt=media&#x26;token=faba034c-5e0e-4a8e-be7a-3e0c0f0a12d3" alt="" data-size="original"> | <img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2F59m6cm4wvpgUUi8Rtmrm%2Fimage.png?alt=media&#x26;token=6e230c35-4268-446e-8e2f-1590d90241af" alt="" data-size="original"> | <img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2Fusx2mJyHYewrjUS6PUsq%2Fimage.png?alt=media&#x26;token=313c177c-8b84-48d6-bfe1-6d024a713f8c" alt="" data-size="original"> |
| <img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FzvHYFZDywemtg2lCS5Rv%2Fimage.png?alt=media&#x26;token=967861f2-1776-41f0-8116-1bf566bb738e" alt="" data-size="original"> | <img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FK91LXlXeLIOqrC1LnzE1%2Fimage.png?alt=media&#x26;token=ef9f9399-73a5-48e6-8679-021ebdf8d76d" alt="" data-size="original"> | <img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FgmmdnOkOvfTaL1xpcVyu%2Fimage.png?alt=media&#x26;token=9866cfc2-0e36-4b51-a7d0-b76cd8a5ee3f" alt="" data-size="original"> |

Now, what do you think the image embeddings would reveal? Which images would be closer in similarity? You might initially think, "This is easy, I've got it!"—but that’s not the case.

Consider an image that appears to show fitness equipment on a house floor. The corresponding text reads, <mark style="color:blue;">**"A black fitness mat lies on the ground beside a black kettlebell."**</mark>

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FcrmBx36JCQOMT4lMyI9a%2Fimage.png?alt=media&#x26;token=577d5493-370b-4424-99ca-ee48c6e7aeed" alt="" width="243"><figcaption></figcaption></figure>

We plotted the embeddings of all the images using TensorBoard, with KMeans clustering and PCA for dimensionality reduction.
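The clustering step can be sketched in a few lines of scikit-learn (with random stand-in data; the real input is the CLIP image-embedding matrix, and the 3D points plus per-point metadata are what get written to TensorBoard's embedding projector, e.g. via `torch.utils.tensorboard.SummaryWriter.add_embedding`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
emb = rng.normal(size=(300, 64))   # stand-in for the CLIP image embeddings

# Reduce to 3D for the projector view, then cluster the points.
points = PCA(n_components=3).fit_transform(emb)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

print(points.shape, np.unique(labels))
```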

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2FwxrVz7aMO9tRzUy6hysA%2Fimage.png?alt=media&#x26;token=b52763cb-da23-46fc-a568-092e83a1eca4" alt="" width="373"><figcaption><p>Visual representation of image embeddings in 3D space</p></figcaption></figure>

One might expect that the black fitness mat would be categorized with home equipment, given that it’s something typically found in a house. So, it seems logical that the closest image embeddings would relate to home decor, right?

<figure><img src="https://3558521670-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F8mI5ieugJlfU4ng7eyP9%2Fuploads%2Fp4LXDfsdtyaMHTeEieBQ%2Fimage.png?alt=media&#x26;token=0d67ca28-0b52-4bf9-8fc8-2a9b00ba8c61" alt=""><figcaption><p>Euclidean distance metrics of embeddings wrt. Fitness mat image</p></figcaption></figure>

Actually, no. The two images closest to the "fitness mat and kettlebell" are completely unrelated in appearance. The corresponding texts for the top two images, with their Euclidean distances, are:

* <mark style="color:blue;">"A black and silver motorcycle parked on a surface" (1.359)</mark>
* <mark style="color:blue;">"A woman posing in front of a plain background" (1.379)</mark>

The 3rd- and 4th-closest embeddings are related to home decor, but in 5th place we find an image of "A handsome property manager posing for a picture." This highlights a noticeable disparity.
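The distance ranking above takes only a few lines to reproduce. The toy embeddings below are hand-picked to mimic the surprising result (the "fitness mat" lands nearest the motorcycle); they are illustrative, not derived from the real data.

```python
import numpy as np

def nearest_captions(query_idx, emb, captions, k=5):
    """Rank all other items by Euclidean distance to the query embedding."""
    d = np.linalg.norm(emb - emb[query_idx], axis=1)
    order = [i for i in np.argsort(d) if i != query_idx][:k]
    return [(captions[i], round(float(d[i]), 3)) for i in order]

# Hand-picked 2D toy embeddings mimicking the disparity described above.
emb = np.array([[0.0, 0.0],   # fitness mat (query)
                [1.0, 0.9],   # motorcycle
                [1.0, 1.0],   # woman posing
                [2.0, 2.0]])  # home decor
captions = ["fitness mat", "motorcycle", "woman posing", "home decor"]
print(nearest_captions(0, emb, captions, k=3))
```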

## So, are we discussing drawbacks of traditional embeddings?

This disparity highlights a critical issue when using traditional vector embeddings, especially when dealing with multi-modal data like images and text. The embeddings for images that seem contextually similar, such as home decor and fitness equipment, do not always cluster together as expected. Instead, the embeddings may align with unrelated categories, such as motorbikes or people posing, due to the limitations of vector-based similarity measures.

> This brings us to a fundamental problem: <mark style="color:red;">traditional vector embeddings often lack the ability to capture the rich, nuanced relationships between different modalities.</mark> They treat each piece of data as an isolated point in space, without understanding the underlying context that connects them.

<mark style="color:red;">**But what if we could represent these relationships in a graph?**</mark> By connecting images, text, and other data types through nodes and edges, Knowledge Graphs allow us to capture deeper, more meaningful connections. This approach moves beyond simple vector embeddings, providing a structured, relational understanding of our data.

With <mark style="color:blue;">Knowledge Graphs, we can improve tasks like information retrieval, recommendations, and large language models (LLMs)</mark> by making them more context-aware and accurate. In the next section, we'll explore how integrating Knowledge Graphs enhances these capabilities.

***
