Deep Neural Networks Based Approaches for Graph Embeddings
By : Firas BEN HASSAN
Supervisors :
Mr. Jörg Schlötterer
Prof. Ferdaous CHAABENE
Dr. Harald Kosch
Graphs, such as social networks, word co-occurrence networks, and communication networks, occur naturally in various real-world applications. Analyzing them yields insight into the structure of society, language, and different patterns of communication. Many approaches have been proposed to perform this analysis. Recently, methods that represent graph nodes in a vector space have gained traction in the research community.
Node Classification :
Often in networks, a fraction of nodes are labeled. In social networks, labels may indicate interests, beliefs, or demographics. In language networks, a document may be labeled with topics or keywords, whereas the labels of entities in biology networks may be based on functionality. Due to various factors, labels may be unknown for large fractions of nodes. For example, in social networks many users do not provide their demographic information due to privacy concerns. Missing labels can be inferred using the labeled nodes and the links in the network. The task of predicting these missing labels is also known as node classification.
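As a rough illustration of the task (not one of the embedding approaches evaluated here), the sketch below infers missing labels by a majority vote over each node's labeled neighbors. The function names `build_adjacency` and `predict_labels` and the toy graph are assumptions made for this example only.

```python
from collections import Counter, defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def predict_labels(edges, known_labels):
    """Give each unlabeled node the most common label among its labeled neighbors."""
    adj = build_adjacency(edges)
    predictions = {}
    for node in adj:
        if node in known_labels:
            continue
        votes = Counter(known_labels[n] for n in adj[node] if n in known_labels)
        if votes:
            # Sort by (-count, label) so that ties break deterministically.
            best = min(votes.items(), key=lambda kv: (-kv[1], str(kv[0])))
            predictions[node] = best[0]
    return predictions


# Toy graph (hypothetical): two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
known = {0: "A", 1: "A", 4: "B", 5: "B"}
predicted = predict_labels(edges, known)
print(predicted)
```

An embedding-based pipeline would typically replace the vote with a classifier trained on the node vectors of the labeled nodes; the inputs (edges plus partial labels) and the output (labels for the remaining nodes) stay the same.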
Link Prediction :
Networks are constructed from the observed interactions between entities, which may be incomplete or inaccurate. The challenge often lies in identifying spurious interactions and predicting missing information. Link prediction refers to the task of predicting either missing interactions or links that may appear in the future in an evolving network.
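A minimal sketch of the task, assuming the simple common-neighbors heuristic rather than any of the embedding methods listed in this document: every pair of nodes that is not yet connected is scored by the number of neighbors the two nodes share, and higher-scoring pairs are treated as more likely missing or future links. The function name `common_neighbor_scores` and the toy graph are hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(edges):
    """Score every non-adjacent node pair by its number of shared neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:  # only candidate (currently missing) links
            scores[(u, v)] = len(adj[u] & adj[v])
    return scores


# Toy graph (hypothetical): nodes 0 and 3 share the neighbors 1 and 2.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
scores = common_neighbor_scores(edges)
# Rank candidates from most to least likely; ties are ordered by node pair.
ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```

With node embeddings, the score of a candidate pair would instead be computed from the two node vectors, for example by a similarity function or a classifier over a combination of the vectors.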
Social Networks Datasets :
-KARATE :
Zachary’s karate club network is a well-known social network of friendships between 34 members of a karate club at a US university in the 1970s.
-BLOGCATALOG :
This is a network of social relationships of the bloggers listed on the BlogCatalog website. The labels represent blogger interests inferred through the metadata provided by the bloggers. The network has 10,312 nodes, 333,983 edges and 39 different labels.
-LiveJournal:
LiveJournal is a free online blogging community where users declare friendship with each other. LiveJournal also allows users to form groups which other members can then join. We consider such user-defined groups as ground-truth communities. We provide the LiveJournal friendship social network and ground-truth communities.
Name | Type | Nodes | Edges | Link |
---|---|---|---|---|
Karate | Undirected,Unweighted,Static | 34 | 78 | click here to download |
BlogCatalog | Undirected,Unweighted | 10312 | 333983 | click here to download |
LiveJournal | Undirected,Unweighted | 3997962 | 34681189 | click here to download |
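When loading one of these datasets, a quick sanity check is to count the distinct nodes and undirected edges and compare them with the table. The sketch below assumes a whitespace-separated edge list with one `u v` pair per line and optional `#` comment lines; the actual file format of each download is not stated here, so this is an assumption, and `read_edge_list` is a hypothetical helper.

```python
import io


def read_edge_list(handle):
    """Parse 'u v' lines into node and undirected-edge sets.

    Blank lines and lines starting with '#' are skipped, duplicate or
    reversed pairs are counted once, and self-loops are not counted as edges.
    """
    nodes, edges = set(), set()
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        u, v = line.split()[:2]
        nodes.update((u, v))
        if u != v:
            edges.add(frozenset((u, v)))
    return nodes, edges


# In-memory stand-in for a downloaded file (assumed format, toy content).
sample = io.StringIO("# toy edge list\n1 2\n2 1\n2 3\n\n3 4\n")
nodes, edges = read_edge_list(sample)
print(len(nodes), len(edges))
```

For a real file, the `io.StringIO` object would be replaced by an opened file handle.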
Collaboration Networks Datasets :
-Cora:
The Cora dataset consists of Machine Learning papers. The papers were selected in such a way that every paper in the final corpus cites or is cited by at least one other paper. There are 2,708 papers in the whole corpus.
-Wiki:
Wiki contains 2,405 documents from 19 classes and 17,981 links between them.
-Citeseer:
Citeseer contains 3,312 publications from six classes and 4,732 links between them. Similar to Cora, the links are citation relationships between the documents, and each paper is described by a binary vector of 3,703 dimensions.
Name | Type | Nodes | Edges | Link |
---|---|---|---|---|
Cora | Undirected,Unweighted | 2708 | 5429 | click here to download |
Wiki | Undirected,Unweighted | 2405 | 17981 | click here to download |
Citeseer | Undirected,Unweighted | 3312 | 4732 | click here to download |
Biology Networks Dataset :
-PROTEIN-PROTEIN INTERACTIONS (PPI) :
This is a network of biological interactions between proteins in humans. This network has 3,890 nodes and 38,739 edges.
Name | Type | Nodes | Edges | Link |
---|---|---|---|---|
PPI | Undirected,Unweighted | 3890 | 38739 | click here to download |
Node2vec
node2vec (Network-only): Scalable Feature Learning for Networks,
[arxiv] [Python] [Python] [Python],
datasets(Cora, Zachary’s Karate Club, BlogCatalog, Wikipedia, PPI)
DeepWalk
DeepWalk (Network-only): Online Learning of Social Representations,
datasets(Cora, Zachary’s Karate Club, BlogCatalog, Wikipedia, PPI)
LINE
LINE (Network-only): Large-scale Information Network Embedding,
datasets(Cora, Zachary’s Karate Club, BlogCatalog, Wikipedia, PPI)
Doc2vec
Doc2vec (content-only): Distributed Representations of Sentences and Documents,
dataset (Cora)
Paper2vec
Paper2vec (combined): Combining Graph and Text Information for Scientific Paper Representation,
dataset (Cora)
GloVe
GloVe (content-only): Global Vectors for Word Representation,
GraRep
GraRep: Learning Graph Representations with Global Structural Information,
TADW
TADW (combined): Network Representation Learning with Rich Text Information,
Datasets (Cora, Citeseer, Wikipedia)
Planetoid
Planetoid (Network-only): Revisiting Semi-supervised Learning with Graph Embeddings,
Datasets (Cora, Citeseer, Wikipedia)
DNGR
DNGR (Network-only): Deep Neural Networks for Learning Graph Representations,
[Matlab] [Python Keras], [Datasets]
ComplEx
ComplEx (Network-only): Complex Embeddings for Simple Link Prediction,
Requirement specification
The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of graph embedding approaches on multiple tasks and multiple datasets.
In this way, we aim to ensure that both tool developers and end users can derive meaningful insights into the extension, integration and use of graph embedding algorithms.
In particular, the evaluation framework gives tool developers comparable results, so that they can easily discover the strengths and weaknesses of their implementations with respect to the state of the art. It also shows in which areas tools should be further refined. Developers can thus create an informed agenda for extensions, and end users can identify the right tools for their purposes.
The evaluation framework should be an open-source and extensible framework that allows evaluating tools against 11 different approaches on 2 different tasks with 7 different datasets.
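One way to read this requirement is as a uniform loop over every combination of approach, task and dataset. The sketch below is only an illustration of that idea, not the framework's actual interface: `run_benchmark`, `degree_embedding` and `coverage` are hypothetical names, and the placeholder approach and task stand in for real embedding methods and for node classification or link prediction.

```python
from itertools import product


def run_benchmark(approaches, tasks, datasets):
    """Evaluate every (approach, task, dataset) combination and collect the scores."""
    results = []
    for (a_name, embed), (t_name, evaluate), (d_name, edges) in product(
        approaches.items(), tasks.items(), datasets.items()
    ):
        embeddings = embed(edges)
        results.append({
            "approach": a_name,
            "task": t_name,
            "dataset": d_name,
            "score": evaluate(edges, embeddings),
        })
    return results


def degree_embedding(edges):
    """Placeholder approach: embed each node as a 1-d vector holding its degree."""
    emb = {}
    for u, v in edges:
        emb[u] = [emb.get(u, [0])[0] + 1]
        emb[v] = [emb.get(v, [0])[0] + 1]
    return emb


def coverage(edges, embeddings):
    """Placeholder task: fraction of nodes that received an embedding."""
    nodes = {n for edge in edges for n in edge}
    return sum(n in embeddings for n in nodes) / len(nodes)


# Toy registries (hypothetical); a full setup would register each real approach,
# task and dataset under its own name.
approaches = {"degree": degree_embedding}
tasks = {"coverage": coverage}
datasets = {"triangle": [(0, 1), (1, 2), (0, 2)], "path": [(0, 1), (1, 2)]}
results = run_benchmark(approaches, tasks, datasets)
for row in results:
    print(row)
```

Keeping approaches, tasks and datasets in separate registries means that adding a new entry to any of them automatically extends the comparison; with 11 approaches, 2 tasks and 7 datasets the loop would cover 11 × 2 × 7 = 154 combinations.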