XAI: Graph Learning & Graph Embedding

By Ricky Sun and Victor Wang

In our previous essays, we talked about graph data structures and database query languages, and we showcased a broad spectrum of real-world applications that are best powered by graph technologies. Social networks dwell on graphs that naturally model how people follow and befriend each other; biotech and pharmaceutical companies leverage graphs to understand protein interactions and the efficacy of chemical compounds; supply chains, telco networks, and power grids are naturally represented as graphs. Many industries are looking to graphs to help with their businesses.

In the big-data era, more and more of these businesses are turning to machine learning to boost their predictive power, and some have been using deep learning, specifically various neural networks, to extract even more of it. Three major problems linger, though:

  • Siloed Systems within AI Eco-system
  • Low-Performance AI
  • Black-Box AI

Graph-0: AI Learning Techniques and Their Performance & Explainability

 

Graph-0 illustrates the 3 major problems well. First, there are quite a few learning techniques, but they are NOT offered in a one-stop-shop fashion; most AI users and programmers are dealing with multiple siloed systems, both software- and hardware-wise, which means a lot of data ETL is needed all the time (we'll give detailed examples to illustrate how serious the problem is shortly). Second, AI (deep/machine) learning seems to face a dilemma in which performance and explainability contradict each other: you can NOT have the best of both, and as we go deeper into the deep-learning and ANN/CNN space, things become so black-box that human decision-makers are NOT okay with the status quo. Last but certainly not least, most of these learning processes are NOT high-performance, and the data preparation, ETL, training, and application against production data can be a very complex and lengthy process! All in all, these problems deserve elevated attention, and ideally, infrastructure-level innovation should be attained to address and solve these challenges.

The graph database is that infrastructure-level innovation, and we'll show you why and how in this essay.

Traditionally, mathematical and statistical operations have been rather limited on graphs. To understand why, we have to look at the evolution of graph data structures, which was covered in our previous essay The Specific Evolution of Graph Data Structures. To recap, with either adjacency-matrix- or adjacency-list-centric data structures, very few math or stat operations can be done, not to mention that these data structures are NOT high-concurrency capable.

Due to these limitations, developers invented the concept of graph embedding, which transforms a property graph into a vector space so that many more mathematical and statistical operations can be performed with higher efficiency. This transformation essentially compresses high-dimensional data into a lower-dimensional representation while still preserving certain properties (such as graph topology and node/edge relationships). The other benefit of such a transformation is that vector operations are much simpler and faster. By vector-space operations we refer to a very generalized and broad selection of data structures, and the extremely high-performance adjacency hash* is certainly one way to offer high-performance graph embedding operations.

A slightly expanded discussion of adjacency hash* is necessary. Ultipa Graph leverages adjacency hash* in its in-memory storage and computing architecture. A major advantage that is highly relevant to graph embedding is that the underlying data structure is vector-space comparable, and therefore naturally embedded in terms of data storage, which enables convenient mathematical and statistical computing operations. Moreover, we have parallelized everything, from infrastructure, data structures, and algorithms to the engineering implementation. Our pursuit of high concurrency has never ceased, and it ensures that otherwise time-consuming operations like graph embedding and learning complete as quickly as possible!
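Ultipa's adjacency hash* itself is proprietary, so the following is only a minimal, hypothetical Python approximation using ordinary hash maps. It is not the actual implementation; it merely illustrates why hash-keyed vertex and neighbor lookups make per-hop traversal an O(1) average-case operation.

```python
# Illustrative stand-in for a hash-map-based adjacency structure.
# NOT Ultipa's actual adjacency hash* -- just a sketch of the idea:
# vertex -> {neighbor: edge weight}, all reachable by hash lookups.
graph = {
    "ants":   {"fire": 0.4, "colony": 0.6},
    "fire":   {"ants": 0.7, "truck": 0.3},
    "colony": {"ants": 1.0},
    "truck":  {"fire": 1.0},
}

# One hash probe fetches all neighbors of a vertex (O(1) average case).
neighbors_of_ants = graph["ants"]          # {'fire': 0.4, 'colony': 0.6}
weight_ants_fire = graph["ants"]["fire"]   # 0.4
```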

We'll use the Word2Vec example to explain why graph embedding can be orders of magnitude simpler, faster, and more lightweight with adjacency hash* within Ultipa's graph platform. The Word2Vec method can be seen as the basis, and the prototype, for node-, edge-, or struct-centric graph embedding methods. It transforms words into an embedding vector space so that, mathematically, in the lower-dimensional vector space, similar words have similar embeddings. As the famous English linguist J.R. Firth put it in 1957, 'You shall know a word by the company it keeps' – that is, words with similar meanings tend to have similar neighboring words.
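To make "similar words have similar embeddings" concrete, here is a tiny sketch with made-up 4-dimensional vectors (real Word2Vec embeddings have hundreds of dimensions and learned values): similarity between two embeddings is commonly measured with cosine similarity.

```python
import numpy as np

# Toy embeddings with invented values, purely for illustration.
emb = {
    "cat":    np.array([0.9, 0.1, 0.4, 0.0]),
    "kitten": np.array([0.8, 0.2, 0.5, 0.1]),
    "truck":  np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["kitten"]))  # close to 1.0 (similar company in text)
print(cosine(emb["cat"], emb["truck"]))   # much smaller (dissimilar)
```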

In the 3 following graphs, a Skip-Gram algorithm and a SoftMax training method implementing Word2Vec are illustrated; other algorithms (e.g., Continuous Bag of Words) and methods (e.g., Negative Sampling) are omitted for simplicity (See Graph-1, Graph-2 & Graph-3).

Graph-1: Word2Vec from Source Text to Training Samples

 

Graph-2: Architecture of Word2Vec Neural Network

Graph-1 shows some of the training samples extracted from a sentence with a window size of 2 (meaning 2 words before and 2 words after the input/focus/highlighted word). Note that in the improved Word2Vec method, techniques like sub-sampling are used to remove samples containing words like 'the', to improve overall performance.
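The sampling shown in Graph-1 is simple to express in code. Below is a minimal sketch (the example sentence is our own, not the one in the figure) that slides a window of 2 over a sentence and emits (input word, context word) pairs for Skip-Gram training.

```python
# Extract (focus word, context word) training pairs with a window size of 2.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, focus in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence)[:6])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox'), ...]
```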

To represent a word and feed it to a neural network, the one-hot vector method is used; as illustrated in Graph-2, the input one-hot vector for the word 'ants' is a vector of 10,000 elements with a single 1 at the position for 'ants' and 0s everywhere else. Similarly, the output vector also contains 10,000 elements, but it is filled with the probability (a floating-point number) of each nearby word.

Assume we have only 10,000 words in our vocabulary and 300 features, or parameters, that you would use to fine-tune this Word2Vec model. This translates into 300 neurons in the hidden layer and 300 features in each word vector (See Graph-3).
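The sizes quoted above multiply out quickly. A small sketch of the shapes involved (the index for 'ants' is arbitrary and purely illustrative) shows where the 3,000,000 weights of Graph-3 come from: 10,000 × 300 per layer.

```python
import numpy as np

vocab_size, n_features = 10_000, 300    # the sizes assumed in the text

# One-hot input: 10,000 elements, a single 1 at the word's vocabulary index.
ants_index = 2793                       # arbitrary illustrative index
one_hot = np.zeros(vocab_size)
one_hot[ants_index] = 1.0

# Weight matrices of the network in Graph-2 / Graph-3.
hidden_weights = np.zeros((vocab_size, n_features))  # 10,000 x 300
output_weights = np.zeros((n_features, vocab_size))  # 300 x 10,000

print(hidden_weights.size)  # 3,000,000 weights in the hidden layer
print(output_weights.size)  # 3,000,000 weights in the output layer
```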

Graph-3: Gigantic Word Vectors (3,000,000 Weights!)

As you can see from Graph-3, the hidden layer and the output layer each carry 3,000,000 weights. This is NOT feasible for real-world applications: in practice, the vocabulary size can be in the range of millions, multiplied by hundreds of features, which pushes the number of weights into the billions, and even with the help of sub-sampling, negative sampling, and word-pair/phrase techniques, this sheer computational challenge cannot be ignored. The one-hot vectors in both the input and output layers are clearly far too sparse a data structure; like the adjacency matrix (refer to The Evolution of Graph Data Structures for more information), they may be partitioned for parallel computing on GPUs when very small in terms of elements per dimension (<32,768, for instance), but for much larger real-world applications, this is overwhelming in terms of resource and time consumption.

Now, recall what we wanted to achieve with Word2Vec: essentially, we are trying to predict, in a search-and-recommendation setting, which other words we would like to recommend when a word (or more) is entered. This is a statistical game and, underneath, a mathematical problem: for one word queried, find a handful of other words with the highest probability of appearing as its close neighbors. This is naturally a graph challenge: every single word is a node (or vertex) in the graph, its neighbors form edges connecting directly with it, and this goes on recursively. Each node and edge can have its own weight, type, label, or attributes (as many as needed).
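Under that framing, the word graph falls straight out of the corpus. Here is a minimal sketch (toy corpus, illustrative window size) that builds exactly such a graph: each word becomes a vertex, and co-occurrence within the window becomes a weighted edge.

```python
from collections import defaultdict

# Build a word graph: vertex per word, weighted edge per co-occurrence
# within the window. Corpus and window size are purely illustrative.
def build_word_graph(sentences, window=2):
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    graph[w][tokens[j]] += 1   # raw co-occurrence count
    return graph

corpus = [["graph", "data", "drives", "explainable", "ai"],
          ["graph", "data", "structures", "matter"]]
g = build_word_graph(corpus)
print(dict(g["graph"]))   # {'data': 2, 'drives': 1, 'structures': 1}
```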

What we just described in the above paragraph can be represented as in Graph-4, with two data structure variations. On the surface, they look similar to big-table-like data structures, but under the hood, it's the high-concurrency core implementation that makes things fly.

Graph-4: Graph Data Structures with Natural Graph Embeddings

Given the adjacency hash* data structure, the Word2Vec problem really boils down to two steps:

  • Assign a probability weight to each directed edge connecting two words, which can be done statistically;
  • Treat search and recommendation as simple, short graph-traversal tasks that pick the top-X vertices connected to the starting vertex; this operation can be applied recursively to traverse deeper in the graph – it is like search and recommendation in real time (similar to real-time suggestion, but potentially more accurate and intelligent if you recall that this exactly resembles how the human brain functions with NLP). A minimal sketch of both steps follows this list.
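The sketch below (toy co-occurrence counts, illustrative names) implements the two steps over a plain dict standing in for adjacency hash*: normalize counts into edge probabilities, then recursively pick the top-k most probable neighbors.

```python
# Toy co-occurrence counts (word -> {neighbor: count}); values are invented.
counts = {
    "graph":      {"data": 4, "drives": 1, "structures": 2},
    "data":       {"graph": 4, "structures": 3},
    "structures": {"graph": 2, "data": 3},
    "drives":     {"graph": 1},
}

# Step 1: turn raw co-occurrence counts into edge probability weights.
def to_probabilities(graph):
    return {w: {n: c / sum(nbrs.values()) for n, c in nbrs.items()}
            for w, nbrs in graph.items()}

# Step 2: recommend the top-k most probable neighbors; recurse to go deeper.
def recommend(probs, word, k=2, depth=1):
    nbrs = sorted(probs.get(word, {}).items(), key=lambda kv: -kv[1])[:k]
    results = [n for n, _ in nbrs]
    if depth > 1:
        for n, _ in nbrs:
            results += [w for w in recommend(probs, n, k, depth - 1)
                        if w != word and w not in results]
    return results

probs = to_probabilities(counts)
print(recommend(probs, "graph", k=2, depth=2))
# ['data', 'structures'] (in this tiny graph, deeper hops add nothing new)
```

Every intermediate value here, edge probabilities and ranked neighbors alike, can be inspected directly, which is the white-box property discussed next.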

What we just described is a completely white-box process: every single step is specific, explainable, and crystal clear. It's very much unlike the black-box hidden layer(s) that you have to deal with cluelessly in Graphs 2 & 3. This is why graphs will drive XAI (short for Explainable Artificial Intelligence) forward!

More specifically, adjacency hash* is naturally graph-embedded – no further computation or operation is needed; once the data are presented in the graph in the adjacency hash* format, the embeddings are done.

Graph-6: Breakdown of DeepWalk

DeepWalk is a second example worth looking into. It is a typical vertex-centric embedding approach and uses random walks to produce embeddings. With the help of adjacency hash*, a random walk is as easy as selecting a specific node, moving on to a random neighboring node, and traversing the graph deeply for a certain number of steps (e.g., ~40 steps). Pay attention to the depth/length of the walk: with every hop deeper into the graph, the computational challenge grows exponentially; the simple math here is that, if an average node has 20 neighbors, going 40 steps deep means looking through as many as 20^40 neighbors, and no production system today can handle that in a brute-force way. However, with the help of the highly concurrent adjacency hash*, each step of traversing (sampling) in the graph takes O(1), so going 40 steps is basically O(40 × 20) in terms of overall latency (compute-wise, the concurrent tasks still deal with all encountered neighbors in parallel and effectively filter out any duplicated or revisited nodes).
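A single DeepWalk-style walk is only a few lines over a hash-map-based graph; the sketch below (a simplified, sequential illustration, not Ultipa's concurrent implementation) makes the per-hop O(1) lookup explicit.

```python
import random

# One truncated random walk over a dict-based graph (vertex -> neighbors):
# at each step, hop to a uniformly random neighbor. Each hop is a single
# O(1) average-case hash lookup in this structure.
def random_walk(graph, start, length=40):
    walk = [start]
    for _ in range(length - 1):
        nbrs = list(graph.get(walk[-1], {}))
        if not nbrs:
            break                      # dead end: stop the walk early
        walk.append(random.choice(nbrs))
    return walk

# In DeepWalk, many such walks per vertex are collected and then fed to
# Word2Vec (Skip-Gram) exactly as if each walk were a sentence.
```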

The otherwise typical DeepWalk steps are illustrated in Graph-6. With the highly concurrent adjacency hash*, the random-walking, training, and embedding processes reduce to native graph operations like K-hopping and path-finding, which are straightforward and explainable, therefore XAI-friendly!

DeepWalk has been criticized for its inability to generalize, meaning it cannot handle highly dynamic graphs well (every newly arriving node must be re-trained) and cannot preserve the local neighborhood (because deep random walks do NOT preserve neighborhood information).

The Node2Vec method was invented to address at least the second shortcoming of DeepWalk. Graph-7 illustrates the many steps taken to make the whole learning process complete. Please note that there are 10 steps (subprocesses) involved, and each step by itself can break down further into a lengthy computing process. A graph embedding method like this usually takes a long time to complete. This is NOT the case with Ultipa's graph embedding: we converted all these subprocesses to run concurrently so as to maximize the leverage of the underlying hardware and minimize turnaround time. Specifically, all of the following subprocesses run in parallel:

  • Node2Vec’s random walks
  • Preparation of precomputed Exp. Table
  • Learning of Vocabulary (from Training File)
  • Removal of Low-Frequency Vocabularies
  • Sub-sampling of Frequent Words (Vectors)
  • Initialization of Network (Hidden Layer, Input and Output Layers)
  • Initialization of Unigram Table (Negative Sampling)
  • Training of Skip-Gram

Graph-7: Outer Steps of Node2Vec Relying on Word2Vec
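What distinguishes Node2Vec's random walks from DeepWalk's is the biased second-order transition controlled by the return parameter p and the in-out parameter q. The sketch below is our own simplified reading of the published algorithm (unnormalized, without alias sampling), showing how one biased step is chosen.

```python
import random

# One biased Node2Vec step: bias depends on where the walk just came from.
# p discourages immediately returning; q trades off BFS-like (local, q > 1)
# versus DFS-like (exploratory, q < 1) behaviour.
def node2vec_step(graph, prev, cur, p=1.0, q=1.0):
    nbrs = list(graph.get(cur, {}))
    if not nbrs:
        return None
    weights = []
    for nxt in nbrs:
        if nxt == prev:                                   # going straight back
            weights.append(1.0 / p)
        elif prev is not None and nxt in graph.get(prev, {}):
            weights.append(1.0)                           # shared neighbor of prev
        else:
            weights.append(1.0 / q)                       # moving further away
    return random.choices(nbrs, weights=weights, k=1)[0]
```

The walks produced this way are then consumed by the Skip-Gram training listed above, just like sentences in Word2Vec.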

The parallelization of these steps yields much better latency and performance; always remember that every single second wasted on not achieving high concurrency is a criminal waste of resources and therefore not environmentally friendly – and we are not kidding.

There are other graph embedding techniques that bear similar concepts and steps:

  • Sampling and labeling
  • Training of certain models (e.g., Skip-Gram, n-Gram, etc.)
  • Computing embeddings

Some are done locally at the vertex/subgraph level; others are done against the entire graph and are therefore mathematically more time-consuming. Whichever approach we take, being able to achieve the following would be meaningfully fantastic:

  1. Attain high performance via high concurrency at the data structure, architecture, and algorithm level;
  2. Explain every step of operations in a definite, specific, white-box way;
  3. Ideally, complete the operations in a one-stop-shop fashion.

If we recall Graph-0 at the beginning of this essay, the explainability-vs-performance chart is intriguing. Decision trees have better explainability because we understand what they do step by step. Trees are essentially simple graphs: they contain no cycles, which are considered mathematically perplexing and therefore harder to explain. Meanwhile, most so-called graph algorithm or embedding implementations today are based on low-performance adjacency lists or academically oriented but not industry-ready adjacency matrices. On the other hand, overly sophisticated neural-network and deep-learning techniques are going in a direction that's black-box-centric and difficult to explain. In real-world applications, AI can NOT get away with its human owners not understanding what exactly is going on under the hood – that runs against the XAI rule of thumb: explainability!

Graph-8: Integrated Ultipa Real-Time Graph Database & All-In-One Manager

The breakthrough in graph data structures and the implementation of a highly parallel graph computing architecture make the 3 goals above highly attainable. This is what Ultipa Graph has been invented for and is geared toward. In our next essay, we'll introduce more cool XAI features offered by Ultipa Graph.

The World’s Fastest & Most-Intuitive Graph System – Ultipa Graph.

Graph-9: Ultipa Manager 3D Mode
