Knowledge Graph Embeddings
Triplet extraction with REBEL set up the base for creating embeddings of both the text and images from the COYO subset dataset: https://www.kaggle.com/datasets/anantjain1223/coyo-1k-reduced.
The triplets extracted with REBEL (https://huggingface.co/Babelscape/rebel-large) from the image URLs were saved as .csv files at https://www.kaggle.com/datasets/agampy/triplets-kg and used to train the embedding model.
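For reference, here is a minimal sketch of triplet extraction with REBEL, following the parsing recipe on the model card; the caption and printed output are illustrative:

```python
from transformers import pipeline

# REBEL linearises triplets as "<triplet> head <subj> tail <obj> relation ...",
# so we decode the raw token ids to keep those special markers.
extractor = pipeline(
    "translation_xx_to_yy",
    model="Babelscape/rebel-large",
    tokenizer="Babelscape/rebel-large",
)

def extract_triplets(text):
    """Parse REBEL's generated string into head/type/tail dicts."""
    triplets, subject, relation, object_, current = [], "", "", "", None
    text = text.replace("<s>", "").replace("</s>", "").replace("<pad>", "")
    for token in text.split():
        if token == "<triplet>":
            if relation:  # close out the previous triplet, if any
                triplets.append({"head": subject.strip(), "type": relation.strip(), "tail": object_.strip()})
            subject, relation, current = "", "", "t"
        elif token == "<subj>":
            if relation:  # same head, new tail/relation pair
                triplets.append({"head": subject.strip(), "type": relation.strip(), "tail": object_.strip()})
            object_, current = "", "s"
        elif token == "<obj>":
            relation, current = "", "o"
        elif current == "t":
            subject += " " + token
        elif current == "s":
            object_ += " " + token
        elif current == "o":
            relation += " " + token
    if subject and relation and object_:
        triplets.append({"head": subject.strip(), "type": relation.strip(), "tail": object_.strip()})
    return triplets

caption = "The Eiffel Tower is located in Paris, France."  # illustrative input
out = extractor(caption, return_tensors=True, return_text=False)
decoded = extractor.tokenizer.batch_decode([out[0]["translation_token_ids"]])
print(extract_triplets(decoded[0]))
```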
Training configuration:

- Model: TransE for the embeddings
- Loss: Softplus loss
- Epochs: 100
We created a .tsv file with 'Head', 'Relation', and 'Tail' columns, and used the PyKEEN library to train the TransE model with softplus loss for 100 epochs on the triplet sets. Since the dataset was quite small, training was quick.
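A minimal sketch of that training setup with PyKEEN (the file path, split, and seed are assumptions):

```python
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Load the tab-separated head/relation/tail file (path is illustrative).
tf = TriplesFactory.from_path("triplets.tsv")
training, testing = tf.split()  # defaults to an 80/20 split

result = pipeline(
    training=training,
    testing=testing,
    model="TransE",
    loss="softplus",                        # PyKEEN's SoftplusLoss
    training_kwargs=dict(num_epochs=100),   # quick on a small dataset
    random_seed=42,                         # assumed, for reproducibility
)
result.save_to_directory("transe_kge")      # persists the trained model
```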
Once the PyKEEN model was trained on the triplets, we took the embeddings of both the entities ('head' and 'tail') and the relations ('type'/'relation'), used them to plot knowledge graphs, reduced their dimensions to 3 with the PCA, UMAP, and t-SNE dimensionality-reduction techniques, and saved them for further investigation and comparison with traditional vector embeddings.
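A sketch of how the embeddings can be pulled out of the trained model and reduced to 3D, assuming `result` and `tf` from the training sketch above (the output file names are hypothetical):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # from the umap-learn package

model = result.model
# In recent PyKEEN versions, calling a representation with indices=None
# returns the full embedding matrix.
entity_emb = model.entity_representations[0](indices=None).detach().cpu().numpy()
relation_emb = model.relation_representations[0](indices=None).detach().cpu().numpy()

# Entity labels in id order, reusing the TriplesFactory from training.
labels = sorted(tf.entity_to_id, key=tf.entity_to_id.get)

reduced = {
    "pca": PCA(n_components=3).fit_transform(entity_emb),
    "tsne": TSNE(n_components=3).fit_transform(entity_emb),
    "umap": umap.UMAP(n_components=3).fit_transform(entity_emb),
}

for name, coords in reduced.items():
    df = pd.DataFrame(coords, columns=["x", "y", "z"])
    df["entity"] = labels
    df.to_csv(f"entity_embeddings_{name}.csv", index=False)
```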
For more insight into the code, refer to the notebook at https://github.com/dsgiitr/kge-clip/blob/main/3.KG_Embeddings/src/pykeen_KGE.ipynb
The text triplets were likewise extracted with Babelscape's REBEL-large model and saved as a .csv file with "text" and "triplet" columns.
The text triplets dataset, used for training the PyKEEN model, can be found here: https://www.kaggle.com/datasets/agampy/text-triplets1k
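As a hedged sketch, converting such a .csv into the head/relation/tail .tsv that PyKEEN expects might look like the following. We assume the "triplet" column stores a stringified list of dicts with 'head', 'type', and 'tail' keys (the format the common REBEL parsers emit); adjust the parsing to the actual file.

```python
import ast
import csv
import pandas as pd

df = pd.read_csv("text_triplets.csv")  # columns: "text", "triplet"; path illustrative

rows = []
for cell in df["triplet"]:
    # Assumed cell format: "[{'head': ..., 'type': ..., 'tail': ...}, ...]"
    for t in ast.literal_eval(cell):
        rows.append((t["head"], t["type"], t["tail"]))

with open("text_triplets.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)
```

The PyKEEN pipeline shown earlier can then be pointed at this .tsv.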
The trained model was then used to create embeddings of the triplets; the notebook is at https://github.com/dsgiitr/kge-clip/blob/main/3.KG_Embeddings/src/pykeen_KGE_text.ipynb.
Okay, we have the text and image KGE, along with the corresponding traditional vector embeddings from Word2Vec and CLIP, reduced to 3D space with PCA/UMAP/t-SNE!
The detailed result files (.csv) can be found in the assets folder at 3.KG_Embeddings/assets/results/reduced_embeddings
Are these vectors static, or do they adapt based on the dataset and context, especially when using different language models? To explore this, we went a step further by plotting these embeddings in TensorBoard, uncovering insights into how context influences vector representation.
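A minimal sketch of pushing the embeddings into TensorBoard's projector, reusing `entity_emb` and `labels` from the extraction sketch above (the log directory and tag are arbitrary):

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/kge_projector")
writer.add_embedding(
    torch.as_tensor(entity_emb),
    metadata=labels,        # one entity name per embedding row
    tag="transe_entities",
)
writer.close()
# Launch with: tensorboard --logdir runs/kge_projector
```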
The Visualisation
We’ve taken a deep dive into visualizing text and image embeddings using KGE, along with traditional vector embeddings from Word2Vec and CLIP. By reducing these embeddings to 3D space through PCA, UMAP, and t-SNE, we aimed to see which ones truly capture the context of our specific dataset.