Traditional Vector Embeddings
Embeddings are a way to represent data in a continuous vector space, where each data point (whether it's a word, image, or any other type of information) is mapped to a high-dimensional vector. These vectors are learned representations that capture the intrinsic properties and relationships between the data points.
In natural language processing (NLP), computer vision, and other fields, raw data (like text or images) is often unstructured and complex. For instance, words in a sentence have semantic meanings, and pixels in an image contain spatial information. However, machine learning models require inputs to be in a numerical format. This is where vector embeddings come into play.
Embeddings are typically represented as vectors of real numbers, often in high-dimensional spaces. For example, a word embedding might be a 300-dimensional vector where each dimension captures some aspect of the word's meaning or usage in the context of a large corpus. Visually, embeddings can be represented in lower dimensions (2D or 3D) using techniques like PCA, t-SNE, or UMAP to show clusters of similar data points.
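As a quick illustration, the sketch below projects a handful of 300-dimensional vectors down to 2D with PCA. The vectors here are randomly generated placeholders; in practice they would come from a trained embedding model.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder 300-dimensional embeddings for a tiny vocabulary
# (random values for illustration; real vectors come from a trained model).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 300))   # 5 words x 300 dimensions
words = ["king", "queen", "apple", "banana", "river"]

# Project down to 2 dimensions so the vectors can be plotted and compared visually.
coords_2d = PCA(n_components=2).fit_transform(embeddings)
for word, (x, y) in zip(words, coords_2d):
    print(f"{word}: ({x:.3f}, {y:.3f})")
```

The same idea extends to t-SNE or UMAP when the goal is to reveal clusters rather than preserve global variance.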
While traditional vector embeddings, such as those used in Word2Vec or GloVe, have revolutionized the way we represent data, they are not without limitations:
1. Lack of Contextual Understanding: Traditional embeddings assign a fixed vector to each word or entity, regardless of its context. This means that words with multiple meanings (e.g., "bank" as a financial institution or the side of a river) are represented by a single vector, leading to potential misunderstandings in applications requiring context-aware processing.
2. Limited Semantic Granularity: These embeddings often capture only superficial semantic relationships. For example, synonyms may be close in the vector space, but more nuanced relationships, such as the difference between "happy" and "joyful", might be poorly represented, reducing their effectiveness in tasks requiring fine-grained understanding.
3. Static Representations: These embeddings do not evolve with new data or contexts. Once trained, the vector representations remain static, failing to capture shifts in meaning or emerging trends, which limits their adaptability to dynamic environments.
While traditional vector embeddings have been foundational, their limitations highlight the need for more advanced techniques, such as knowledge graph embeddings, to address contextual nuances and evolving data representations.
The key difference between traditional methods like Word2Vec and Transformer-based models lies in how they generate embeddings. Word2Vec creates embeddings by focusing on a word's local context, using one of two training approaches: CBOW (Continuous Bag of Words) or Skip-gram.
The result is static embeddings, where each word has a single, fixed vector representation, regardless of its context in different sentences. For example, the word "bat" would have the same vector whether it refers to a flying mammal or a piece of sports equipment.
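A minimal sketch of this behavior, assuming the gensim library and a toy two-sentence corpus: Word2Vec stores exactly one vector per word, so "bat" gets a single representation regardless of which sentence it appeared in.

```python
from gensim.models import Word2Vec

# Two toy sentences in which "bat" has different senses.
sentences = [
    ["the", "bat", "flew", "out", "of", "the", "cave"],
    ["he", "swung", "the", "bat", "at", "the", "ball"],
]

# sg=0 trains with CBOW; sg=1 would use Skip-gram instead.
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

# Word2Vec keeps one fixed vector per word, so both senses of "bat"
# share the same 100-dimensional representation.
print(model.wv["bat"].shape)   # (100,)
```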
On the other hand, Transformer-based models like BERT, GPT, and T5 use the Transformer architecture, which leverages self-attention mechanisms to process text. These models produce dynamic embeddings, meaning the vector representation of a word varies depending on its context within a sentence. So, the word "bat" would have different embeddings when used in the context of animals versus sports.
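The sketch below illustrates the contrast, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (chosen here purely for illustration): the same word "bat" yields two different contextual vectors, one per sentence.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bat flew out of the cave.", "He swung the bat at the ball."]
vectors = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    # Locate the position of "bat" (assumed to be a single token in this
    # vocabulary) and keep its contextual vector.
    bat_id = tokenizer.convert_tokens_to_ids("bat")
    idx = inputs["input_ids"][0].tolist().index(bat_id)
    vectors.append(hidden[idx])

# The two vectors differ because each reflects its surrounding context.
cosine = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"cosine similarity between the two 'bat' embeddings: {cosine:.3f}")
```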
In the KGE_Project, we've compared and visualized embeddings using Transformer models, highlighting their ability to capture nuanced, context-specific meanings.
In this project, we leverage CLIP (Contrastive Language–Image Pretraining) to extract rich image embeddings, enabling us to link visual content with textual descriptions in a shared vector space. Here's how the process unfolds:
Raw Image Input:
Each image is initially represented as a tensor with dimensions corresponding to its resolution and color channels. For example, an image with a resolution of 1280x960 and 3 color channels (RGB) is represented as a tensor of shape 3x1280x960.
Extracting CLIP Embeddings:
These image tensors are then passed through the pre-trained CLIP model. The CLIP model processes each image to generate a compact embedding, which is a 1-dimensional vector of fixed size. In this case, each image is transformed into a vector of shape 1x768.
If we have N images, the output will be a matrix of embeddings with dimensions N x 768, where each row corresponds to the embedding of one image.
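A hedged sketch of this step, assuming the Hugging Face transformers implementation of CLIP and the ViT-L/14 variant (whose projected image embeddings are 768-dimensional, matching the shape above); the file names are placeholders, not the project's actual data.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed model variant: ViT-L/14 produces 768-dimensional image embeddings.
# Other CLIP variants use a different embedding size (e.g. 512).
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Placeholder file names; replace with the project's own images.
images = [Image.open(p).convert("RGB") for p in ["img_001.jpg", "img_002.jpg"]]

inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)   # shape: (N, 768)

print(image_embeddings.shape)   # torch.Size([2, 768])
```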
Dimensionality Reduction with t-SNE:
To visualize the high-dimensional embeddings (768 dimensions), we apply a dimensionality reduction technique called t-SNE (t-Distributed Stochastic Neighbor Embedding). This technique reduces the 768-dimensional vectors to a 3-dimensional space for visualization.
The resulting embeddings have dimensions N x 3, making it possible to visually compare and analyze the embeddings in a 3D plot.
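A minimal sketch of the reduction step using scikit-learn's t-SNE; the random matrix stands in for the N x 768 CLIP embeddings so the snippet runs on its own, and the perplexity value is an illustrative assumption rather than a fixed project setting.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the N x 768 matrix of CLIP image embeddings produced above.
embeddings = np.random.default_rng(0).normal(size=(200, 768))

# Reduce to 3 dimensions for a 3D scatter plot. Perplexity must be smaller
# than the number of samples; its exact value is a tunable choice.
tsne = TSNE(n_components=3, perplexity=30, random_state=0)
coords_3d = tsne.fit_transform(embeddings)

print(coords_3d.shape)   # (200, 3)
```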
These embeddings allow us to capture the semantic essence of images in a form that is directly comparable to text embeddings, facilitating tasks like image search, captioning, and cross-modal analysis.
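Because image and text embeddings live in the same space, images can be scored against textual descriptions directly. The sketch below (again assuming the transformers CLIP API, with a placeholder image path and example captions) ranks two candidate descriptions for a single image.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Placeholder image and example captions to score against it.
image = Image.open("img_001.jpg").convert("RGB")
texts = ["a photo of a cat", "a photo of a car"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores (shape 1 x 2 here);
# softmax turns them into a probability over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```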
Next, we'll explore the visualizations, code, and results of the embeddings.