Big graphs are ubiquitously used for data modeling in modern applications, such as online social networks, knowledge graphs and biological networks. Unlike relational databases where queries can be formulated in a uniform manner with relational algebra, operations on graphs are highly diversified. For example, PageRank and SimRank that are based on random walks, shortest distance and reachability that are based on path calculation, subgraph matching and dense subgraph mining that are based on pattern matching, frequent subgraph pattern mining that are based on Apriori-based data mining, graph neural networks and graph transformers that are based on backpropagation. Currently, the mainstream frameworks for big graph processing are vertex-centric (i.e., following a think-like-a-vertex computing model). While vertex-centric systems are easy to program, they are dominantly designed for IO-bound execution and are only suitable for graph problems with a low time complexity. However, a lot of graph mining problems such as subgraph matching and finding dense/frequent subgraph structures usually have a very high time complexity, and when IO-bound systems are applied, the performance is a catastrophe with CPU cores severely underutilized.
To address the above problem, I proposed a new parallel programming framework called T-thinker, or think like a task. T-thinker is able to effectively utilize the available CPU resources in a computing cluster to achieve near ideal speedup, and a number of graph processing systems have been developed on top such as (1) G-thinker for finding qualified subgraphs from a big data graph with applications such as dense subgraph mining and subgraph matching; (2) PrefixFPM for mining general frequent patterns from a transaction database, such as subsequences, subtrees, subgraphs and submatrices; (3) T-FSM for mining frequent patterns from a big data graph. We have also developed and are currently developing many other T-thinker based systems beyond the domain of graph processing, towards broader applications including spatial data processing and general data mining. Finally, I will briefly summarize my other research directions such as Data Science and Deep Learning.
Da Yan is currently an Assistant Professor of the Department of Computer Sciences at the University of Alabama at Birmingham (UAB). His research interests include parallel and distributed systems for big data analytics, graph data management, geo-spatial data management, data mining and machine learning. Dr. Yan regularly publishes in first-tier venues such as SIGMOD, VLDB, KDD, ICDE, ACM TODS, VLDB Journal, TKDE, TPDS, etc., where he also regularly serves as reviewers. At UAB, he designed and is currently teaching many Data Science courses including Foundations of Data Science, Machine Learning, and Deep Learning, which have been popularly taken by students of all backgrounds. Dr. Yan was the sole winner of Hong Kong 2015 Young Scientist Award in Physical/Mathematical Science, and a senior member of ACM and IEEE.
See upcoming and previous presentations at CS Seminar Series