
1 Introduction

Smart applications and infrastructures are increasingly relying on graph computations. We are witnessing a continuous increase in the use of graphs to model real-world problems [1]. The emergence of graph-based software, programming languages, graph databases, and benchmarks—such as ArangoDB, Neo4j, Sparksee, Gremlin, and Graph 500—provides evidence of the increasing popularity of graph-based computing. Graph analytics plays an important role in information discovery and problem solving. A graph can model any real-life application in which relations, routes, or paths need to be found. Graphs have many applications such as image analysis [2], social network analysis [3, 4], smart cities [5,6,7], communication networks [8,9,10,11,12,13,14], scientific and high performance computing [15,16,17,18,19,20], transportation systems [21], Web analyses [22], healthcare [23,24,25], and biological analyses [26]. In these applications, a large amount of data is generated every second, commonly referred to as big data.

Big Data refers to the “emerging technologies that are designed to extract value from data having four V’s characteristics: volume, variety, velocity and veracity” [27, 28]. Volume refers to the generation and collection of vast amounts of data. Variety defines the type of the data stored or generated, including structured, semi-structured, and unstructured data. Velocity describes the timeline related to the generation and processing of big data. Veracity refers to the challenges related to uncertainty in data. The Big Data V’s and graphs have a close relationship. For example, volume could represent the number of edges and nodes, and velocity could be considered as the graph’s streaming edges. A graph could be uncertain (veracity) and exhibit the variety characteristic because its data sources may vary.

The processing of graphs in a distributed environment is a great challenge due to the size of the graph. Typically, a large graph is partitioned for processing; partitioning balances the load across the machines in a cluster, and the partitions are processed in a parallel distributed environment. Scalability and efficiency are the two key elements for achieving good performance in distributed graph computation. We also need to bring data and computation closer together to minimize the overhead of data transfer among the nodes in the cluster. Load balancing and data locality play a major role in achieving this purpose and allow the system's resources to be fully utilized during processing. Moreover, as mentioned earlier, big data cannot be processed by traditional tools and technologies, and many existing graph processing platforms have performance issues; parallel computation of large graphs is a common problem. Parallel distributed platforms are therefore suitable for processing large graphs. In this work, we use GraphX [29,30,31], a widely used framework for parallel distributed graph processing. The big data platform that we use for distributed graph computing of shortest paths is Apache Spark [32].

This chapter extends our earlier work on single source shortest path computations of big data road network graphs using Apache Spark. In our earlier work [33], we had used the US road network data, modelled as graphs, and calculated shortest paths between two vertices over a varying number of up to 368 compute cores. The experiments were performed on the Aziz supercomputer (a former Top500 machine [34]). We had analyzed Spark’s parallelization behavior by solving problems of varying graph sizes, i.e., various states of the USA with up to over 23 million vertices and 58 million edges.

In this chapter, we focus on computing large, varying numbers of shortest path queries, each over a (source, destination) vertex pair. The query set sizes used are 10, 100, 1 K, 10 K, 100 K, and 1 M queries, executed over up to 230 CPU cores. We achieve good performance, and as expected, the speedup depends on both the size of the data and the number of parallel nodes. In addition to the extended results, this chapter provides a detailed literature review on shortest path graph computations. The system architecture for graph computing in Spark is explained in additional detail using an architecture depiction and elaborated algorithms. We call our system the Big Data Shortest Path Graph Computing (BDSPG) system.

The rest of the chapter is organized as follows. Section 8.2 gives background and a literature review. Section 8.3 describes the design and methodology of the BDSPG system. Section 8.4 presents the analysis of results. The conclusions and future directions are given in Sect. 8.5.

2 Literature Review

Smart urban infrastructure greatly relies on smart mobility designs. Many approaches have been proposed to address smart mobility-related challenges [35]. These include, among many others, modelling and simulation-based approaches [36, 37], location-based services [38], telematics [39], social media-based approaches [40,41,42], approaches based on vehicular networks (VANETs) and systems [43,44,45], autonomic mobility management [46, 47], autonomous driving [48], mobility in emergency situations [49,50,51,52,53,54], approaches to improve urban logistics [40, 55], and big data-based approaches [40,41,42, 56]. A recent book discusses several smart society proposals on infrastructure and applications including smart mobility [7]. Many mobility problems naturally map to graph-based computations; shortest path computations are one of them and are of great significance in smart mobility infrastructure designs. In this section, we discuss state-of-the-art work from the literature on graph-based road network shortest path computations.

Quddus and Washington developed an algorithm, called weight-based shortest path and vehicle trajectory aided map-matching (stMM), to find the shortest path between two points [57]. It improves the map-matching of low-frequency positioning data onto a roadmap and exploits the well-known A* search algorithm. They tested the performance of the proposed approach with data collected from rural, suburban, and urban areas in Nottingham and Birmingham, UK. Szucs designed and implemented a model and an algorithm for route planning in road networks [58]. The proposed solution also aims to find the equilibrium in the path optimization problem, taking into account the uncertainty of road state information and its influencing factors. The system is based on the Dempster-Shafer theory, which helps to model the uncertainty, and Dijkstra’s algorithm, which finds the best route. Feng et al. proposed an improvement of alternative route calculation based on alternative graphs [59]. They exploit a bidirectional Dijkstra algorithm to explore routes and introduced three metrics to measure the quality of an alternative graph (AG). They introduce the concept of pheromones into the Plateau method, enhancing its ability to find meaningful alternative roads.

Zeng and Church demonstrated the relative value of the A* algorithm for solving simple point-to-point shortest path problems on real road networks [60], applying it to road networks from two counties of California, USA. They state that Dijkstra’s algorithm can be improved by taking advantage of network properties associated with GIS-source data: Dijkstra’s algorithm does not exploit the spatial attributes available in a GIS setting, while A* can use spatial coordinates to trim the search for the shortest path. Malewicz et al. proposed Pregel, a framework for large-scale graph processing [61]. The framework is similar in concept to MapReduce. It provides users with a natural API for programming graph algorithms while managing the details of distribution invisibly, including messaging and fault tolerance. Its contribution is providing a suitable system for large-scale graph computing. Dozens of Pregel applications have been deployed, and users report that the API is intuitive, easy to use, and flexible. The experiments show that the performance, scalability, and fault tolerance of the proposed framework are satisfactory for computing graph jobs with billions of vertices.

Yan et al. proposed Graphine, a framework for graph-parallel computation in multicore clusters [62]. It addresses the problem that existing distributed graph-parallel frameworks cannot scale well with the increasing number of cores per node. They implemented and evaluated the proposed framework; the experimental results show that it achieves sublinear scalability with the number of nodes, the number of cores per node, and graph sizes up to one billion vertices, and that it is 2–15 times faster than the state-of-the-art PowerGraph on a cluster with 16 multicore nodes. Selim and Zhan proposed an algorithm and a data reduction technique based on data nodes in large network datasets [63]. It works by computing similarity, finding the maximum similarity clique (MSC), and then finding the shortest path on the reduced graph. The technique aims to reduce the network, which has a significant performance impact (shorter time and faster analysis) on calculating the shortest path between two nodes in a large undirected network graph. The results show that the proposed technique outperforms Dijkstra’s shortest path algorithm on large datasets with respect to execution time. Zhou et al. presented P++, a new graph processing framework based on Google’s Pregel [64]. The proposed framework aims to reduce the system overhead for algorithms that require many iterations in Pregel, extending Pregel with a new data structure, internal compute, super-vertices, and new APIs. The approach was evaluated using real datasets on shortest path and PageRank cases, and the results demonstrate its superior performance.

Cao et al. proposed an approach for solving the stochastic shortest path problem in vehicle routing [65]. It aims to find the optimal path that maximizes the probability of arriving at a specified destination before a given deadline. Their approach is data-driven and explores big data generated in traffic. They evaluated its performance using real traffic data extracted from GPS trajectories of vehicles in the road network of Munich, which consists of 170 nodes and 277 edges. The experimental results show that the proposed approach outperforms traditional methods. Hou U et al. developed a framework called live traffic index (LTI) to solve the online shortest path problem [66]. The proposed framework computes the shortest path according to live traffic conditions and enables drivers to effectively and quickly get live traffic information over a broadcasting channel. No existing solution offered affordable costs for online shortest path computation at both client and server sides, and the conventional architecture scales poorly with the number of clients. In their approach, the server collects live traffic information and distributes it over radio or a wireless network. They evaluated the approach with four different road maps: New York City, the San Francisco bay area, San Joaquin, and Oldenburg. The results show that the proposed method reaches an optimal solution in terms of four performance factors for online path computation.

Strehler et al. developed a fully polynomial-time approximation scheme (FPTAS) for finding the shortest energy-efficient routes for electric and hybrid vehicles [67]. It addresses the shortest path and trip planning problems of electric and hybrid vehicles, where recharging an electric car takes longer than refilling a fossil-fuel car. Their contribution is a general model for routing hybrid and electric vehicles with intermediate stops at charging stations and convertible resources. They used Matlab to represent and test their model, with datasets comprising an engine model, topographical information, and road data of Germany. The work is still being improved, as the running time of the proposed algorithms may not be suitable for practical purposes, particularly when running on a mobile phone or an in-car device. Hong et al. developed a multicore computing approach to find the shortest route from a single source to a single destination while avoiding obstacles [68], since existing approaches have limited ability to deal with real-time analysis in big data environments. They use multicore computing to speed up computation and analysis using Python’s official multiprocessing library; the parallelization is thus core-based. The approach itself exploits the notion of a convex hull for evaluating obstacles and constructing pathways iteratively. The experimental results show that their parallel approach yields significant improvements over sequential computing for wayfinding and navigation tasks with a large number of obstacles in complex urban areas. Mozes et al. developed an algorithm for computing shortest paths in directed planar graphs by combining two techniques, presented at STOC’94 and FOCS’01 [69]. It aims to remove the log n dependency in the running time of the shortest path algorithm in order to achieve better, optimal performance. Theoretical analysis shows that the proposed technique obtains a speedup over previous algorithms for the shortest path problem.

Abraham et al. [70] worked on point-to-point shortest path computations on road network data. They modelled the road network as a graph of highways with low dimension, and named their shortest path computation algorithm Hub Labels. The authors claim that it works faster for all types of queries. However, they did not use a parallel implementation of the algorithm, so performance might suffer under significant data computation. They did not use the US road network dataset; the work uses a general algorithm for the computation of road network graph data. Sanders et al. [71] presented real-world road network processing algorithms and claim that their algorithm takes less time compared to Dijkstra’s. They did not use a parallel implementation of the algorithm, nor big data computation. Peng et al. [72] presented a framework for computing road network distances for a single source-target pair. In the presented algorithms, they mapped the distances into a distributed hash structure. For the implementation, they used Apache Spark and in-memory computation of road network distances, and experimented with their algorithm using the US and NYC road network datasets. Zhu et al. [73] proposed an index structure called Arterial Hierarchy (AH) for shortest path and distance queries in a road network. They argue that existing work concentrates on either practical or asymptotic performance, and that the state of the art was worst-case regarding space and time. The primary objective of their work was to minimize the gap between theory and practice for shortest path queries on road networks. For the evaluation, they used a road network with 20 million nodes. The proposed technique performs better than existing approaches for road network datasets. However, they did not use weighted road network graph data.

Zheng et al. [74] presented an all pair shortest path algorithm as an alternative to Floyd-Warshall. They implemented their algorithm using Apache Spark and analyzed its performance, arguing that the Floyd-Warshall algorithm suffers on Apache Spark due to a large number of global updates. To solve this issue, they used fewer global update steps based on the computation done in each iteration. They showed that their algorithm performs better than the Floyd-Warshall algorithm. However, their work differs from ours: we parallelize shortest path computations between two vertices, source and target. Djidjev et al. [75] presented an all pair shortest path algorithm using a GPU cluster. They used both centralized and decentralized computation and presented two algorithms that use the Floyd-Warshall method, implemented on a multi-GPU cluster. They used the California road network dataset, which consists of 1.9 million vertices and five million edges. Aridhi et al. [76] presented a MapReduce-based shortest path algorithm. To solve the shortest path problem efficiently, they partitioned the graph into subgraphs and processed them in parallel. The proposed algorithm is iterative, so its performance will suffer on large input data. For the experiments, they used the French road network dataset from OpenStreetMap, with Hadoop and MapReduce for the computation. Faro et al. [77] presented an all pair shortest path algorithm with and without traffic congestion on road networks. The main objective of their work was to find the fastest shortest path on a road network. They implemented the proposed all pair shortest path algorithm in parallel on a GPU: first finding the shortest path, then finding an alternate shortest path in case of traffic congestion. They did not use any road network dataset, nor Spark or Hadoop.

Kajdanowicz et al. [78] used Bulk Synchronous Parallel (BSP), map-side join, and MapReduce for graph computation. They applied these approaches to single source shortest path (SSSP) and relational influence propagation (RIP) for collective classification of graph vertices, and stated that BSP iterative graph processing performs better than MapReduce. Liu et al. [79] proposed a framework for parallel processing of large graphs to address the issues of communication between partitions, unbalanced partitions, and replication of vertices. The framework uses three different greedy graph partitioning algorithms. They ran these algorithms on various datasets and observed whether the algorithms can solve graph partitioning issues based on specific needs. The major objectives of the framework were to balance the load and reduce the bandwidth. Wang et al. [80] proposed a technique for k-plex enumeration and a maximal clique approach, which finds dense subgraphs using binary graph partitioning. The graph is divided and each partition is processed in parallel; MapReduce was used for the implementation. Braun et al. [81] presented a new social network analysis approach for knowledge-based systems. Its major objective is to mine the interests of a social network and represent them as a graph: a directed graph is used for relationship analysis and an undirected graph to capture mutual friends. They used Facebook and Twitter datasets to analyze the performance of the proposed approach.

Laboshin et al. [82] proposed a new MapReduce-based framework to analyze Web traffic. Its major objective is to scale the storage and computing resources for extensive networks. Liu et al. [83] proposed a distributed density clustering algorithm. It addresses the issue of distance-based algorithms, which calculate the distance among all pairs of vertices; the authors claim the algorithm reduces this computational cost. They implemented their algorithm using Apache GraphX [29, 30]. Aridhi et al. [84] investigated different frameworks for big graph mining. The major focus was on pattern mining algorithms, which discover useful information from huge graph datasets; they comprehensively analyzed different mining techniques for large graphs. Drosou et al. [85] proposed an enhanced Graph Analytical Platform (GAP) framework for processing large graph datasets. The framework uses a top-down approach for mining huge graph datasets and strengthens features like HR clustering. It is an effective framework for getting useful insights from big data.

Zhao et al. [86] evaluated various graph computation platforms, comparing graph-parallel and data-parallel platforms for processing large datasets. They found that graph-parallel platforms perform better in resource utilization and graph computation, while data-parallel platforms are superior in handling larger data sizes. Mohan et al. [87] compared graph computation platforms for large data processing in terms of key features and performance. Miller et al. [88] investigated graph analytics from the perspective of query processing; finding interesting information in a graph, whether a shortest path or a pattern match, raises many issues. They also introduced algorithms showing that vertex-centric and graph-centric algorithms are easily parallelizable, and stated that MapReduce is not an ideal platform for iterative algorithms.

Chakaravarthy et al. [89] proposed an algorithm derived from the Delta-stepping and Bellman-Ford algorithms. The primary objectives were to categorize the edges, minimize inner-vertex traffic, and optimize directions. They applied single source shortest path (SSSP) to get the shortest paths between vertices. Yinglong et al. [90] stated that big data analytics is essential for entities that can be represented as graphs, and that computing graph-based patterns is the main challenge. They presented a new architecture that allows users to organize data for parallel computation. The architecture has three components: graph storage, analytics, and visualization. They evaluated data locality for processing and the effects of a processor’s cache memory on performance. Zhang et al. [91] presented an algorithm for fast graph search. The algorithm converts graphs into vectorial representations based on prototypes in the database, accelerating query efficiency in a Euclidean space by using locality-sensitive hashing. They examined the proposed approach using real datasets and obtained higher accuracy and efficiency. Pollard et al. [92] proposed a new technique for analyzing parallel graph processing platforms based on performance and scalability. They used breadth first search, PageRank, and single source shortest path to analyze power consumption and performance with graph processing packages on various datasets.

Table 8.1 provides a comparison of various shortest path graph computation approaches. The table includes, for each work, the aim and objectives, the approach used, the dataset sources, the type of datasets, the platforms used, the research gap, and comments. We would have preferred to include the names of the authors of the respective works in a separate column, but these were omitted to save space and fit the table in as few pages as possible.

Table 8.1 Comparison of various shortest path graph computation approaches

3 Methodology and Design

This section details the methodology and design of our Big Data Shortest Path Graph Computing (BDSPG) System.

Figure 8.1 shows the architecture for shortest path computations. First, the system takes graph data as input for the computation of shortest paths. The data can represent a directed or an undirected graph; in our work, we use an undirected graph dataset. The data is then uploaded to a distributed file system so that the nodes in the cluster can easily access it. Any distributed file system can be used, such as FEFS, NEFS, or HDFS; in our work, we use HDFS. After staging the input graph data, we build the graph and perform one pair shortest path (OPSP) computations using GraphX [31]. After the OPSP computation, we obtain the shortest path, with its total distance and the list of vertices between the source and the target vertex. A minimal code sketch of this staging and graph-building step follows Fig. 8.1.

Fig. 8.1

The Big Data Shortest Path Graph Computing (BDSPG) System Architecture
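To make the data-staging and graph-building step concrete, the following is a minimal sketch in Scala (the language of Spark/GraphX). The HDFS path and the per-line edge format (source id, target id, weight) are illustrative assumptions, not the exact code of the BDSPG system.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, Graph}

object BuildRoadGraph {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("BDSPG-BuildGraph").getOrCreate()
    val sc = spark.sparkContext

    // Assumed input format: one edge per line, "<srcId> <dstId> <weight>",
    // staged on HDFS so every cluster node can read it.
    val edges = sc.textFile("hdfs:///data/usa-road/RI.edges")
      .map(_.trim.split("\\s+"))
      .map(t => Edge(t(0).toLong, t(1).toLong, t(2).toDouble))

    // GraphX edges are directed; an undirected road network is typically
    // stored with both directions of each road segment present.
    val graph: Graph[Int, Double] = Graph.fromEdges(edges, defaultValue = 0)
    println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")

    spark.stop()
  }
}
```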

We propose an approach for parallel shortest path computation over multiple queries of vertex pairs using Apache Spark. The approach has two functions: the One Pair Shortest Path (OPSP) algorithm, which finds the best route between a pair of vertices, and the main driver program, which builds the graph, constructs and parallelizes the queries, and invokes the OPSP function. Algorithm 1 (Fig. 8.2) shows the OPSP algorithm, which employs the concept of the well-known Dijkstra algorithm to find the optimal route between a source and a destination in a graph. The algorithm first explores the neighbor vertices of the current vertex taken from distPaths[0] (the path of minimum distance from src to dest so far); inserts each neighbor’s vertex id into the set of explored vertices exp[] if it has not been explored before; keeps track of the explored paths from the source to the neighbors (the list of vertices to reach each neighbor) and their distances (neboursPath.concat(distPathRest)); picks the path with minimum distance to be explored further (sortByDist() ascending); and calls the OPSP function recursively until the path with minimum distance reaches the destination, at which point it returns the minimum distance and the path to the destination (dist, paths.reverse()). A sketch of this procedure is given after Fig. 8.2.

Fig. 8.2

The One Pair Shortest Path (OPSP) Algorithm
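As an illustration of Algorithm 1’s logic, the following Scala sketch implements the same Dijkstra-style best-first search over a local adjacency map. The names (opsp, adj) and the adjacency-map representation are assumptions made for readability; this does not reproduce the authors’ exact GraphX-based implementation.

```scala
import scala.collection.mutable

// adj: vertex id -> sequence of (neighbor id, edge weight)
def opsp(adj: Map[Long, Seq[(Long, Double)]],
         src: Long, dst: Long): Option[(Double, List[Long])] = {
  // Candidate paths ordered by total distance, ascending -- this plays the
  // role of distPaths sorted with sortByDist() in Algorithm 1.
  val byDist = Ordering.by[(Double, List[Long]), Double](_._1).reverse
  val frontier = mutable.PriorityQueue.empty[(Double, List[Long])](byDist)
  val explored = mutable.Set.empty[Long]        // exp[] in Algorithm 1
  frontier.enqueue((0.0, List(src)))            // initial step: distance 0, path [src]

  while (frontier.nonEmpty) {
    val (dist, path) = frontier.dequeue()       // path with minimum distance
    val v = path.head                           // current vertex (paths kept reversed)
    if (v == dst)
      return Some((dist, path.reverse))         // (dist, paths.reverse())
    if (explored.add(v)) {                      // skip vertices explored before
      for ((n, w) <- adj.getOrElse(v, Seq.empty) if !explored(n))
        frontier.enqueue((dist + w, n :: path)) // extend the path to each neighbor
    }
  }
  None                                          // no route from src to dst
}
```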

Algorithm 2 (Fig. 8.3) shows the main driver program. It builds the graph and the queries, and executes the queries with the OPSP algorithm. First, the program builds a graph G(V,E) from the given input of vertices and edges. It then constructs queries q from the given input list of queries (src, dst), which contains multiple pairs of src and dest. Furthermore, q is partitioned into np partitions, becoming Q(src,dst). Afterwards, Q(src,dst), G(V,E), the initialization variable exp[] (the set of explored vertices), and distPaths(list(k,v[])) (the initial step with distance 0 and the source vertex) are passed to the OPSP function of Algorithm 1. The multiple queries of Q(src,dst) are executed in parallel by multiple executors on the Spark cluster nodes; thus, each executor computes different queries at the same time t. A minimal sketch of the driver follows Fig. 8.3.

Fig. 8.3

The Master Algorithm
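The sketch below illustrates the driver idea of Algorithm 2: the query set is parallelized into np partitions so that Spark executors answer different (src, dst) pairs concurrently. Broadcasting a driver-side adjacency map and reusing the opsp function and graph value from the previous sketches are illustrative assumptions; for a graph the size of Rhode Island this fits comfortably in executor memory, while larger graphs would need a distributed representation. In a real job, opsp would live in a serializable object.

```scala
// Build the adjacency map once on the driver (here, from the GraphX graph)
// and broadcast it so every executor holds a read-only copy.
val adj: Map[Long, Seq[(Long, Double)]] =
  graph.edges.map(e => (e.srcId, (e.dstId, e.attr)))
    .groupByKey().mapValues(_.toSeq).collect().toMap
val adjB = sc.broadcast(adj)

// Hypothetical query list; in the experiments the sizes range from 10 to 1 M.
val queries: Seq[(Long, Long)] = Seq((1L, 42L), (7L, 99L))
val np = 230   // number of partitions, matching the number of cores used

// Q(src,dst): distribute the queries and run OPSP on each pair in parallel.
val results = sc.parallelize(queries, np)
  .map { case (s, d) => ((s, d), opsp(adjB.value, s, d)) }
  .collect()

results.foreach { case ((s, d), r) => println(s"$s -> $d : $r") }
```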

3.1 Dataset

We have used the DIMACS [94] dataset, a collection of various datasets that includes road networks covering the 50 states of the USA and various districts. It is an undirected weighted graph that consists of millions of edges and nodes. We considered the entire US dataset in our experiments, and we have also investigated in this chapter results for five different states of the USA: District of Columbia (DC), Rhode Island (RI), Colorado (CO), Florida (FL), and California (CA). Each node has a node id, latitude, and longitude. Each edge has a source node id, target node id, travel time, distance, and road category. Table 8.2 shows the numbers of edges and vertices for the different states as well as for the complete US road network. Figure 8.4 graphically displays the degrees of vertices for selected states and the whole US road network. A sketch of loading a DIMACS-format file follows Fig. 8.4.

Table 8.2 USA road network dataset
Fig. 8.4

Visualization of (a) District of Columbia road network (b) Florida road network (c) Colorado road network (d) Whole US road network
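The DIMACS road network files follow a simple line-oriented format: comment lines start with c, a problem line p sp <n> <m> gives the numbers of nodes and arcs, and each arc line reads a <src> <dst> <weight>. The following is a minimal Scala sketch of loading such a file into GraphX; the HDFS path and file name are assumptions for illustration.

```scala
// Keep only arc lines ("a <src> <dst> <weight>") and build a GraphX graph.
val dimacsEdges = sc.textFile("hdfs:///data/usa-road/USA-road-d.RI.gr")
  .filter(_.startsWith("a "))
  .map(_.split("\\s+"))
  .map(t => Edge(t(1).toLong, t(2).toLong, t(3).toDouble))

val roadGraph = Graph.fromEdges(dimacsEdges, defaultValue = 0)
```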

We have also visualized the road network dataset using Gephi [95]. We visualized only the DC and RI state datasets, shown in Figs. 8.5 and 8.6, respectively; we could not visualize the other states due to their large sizes, which cannot be handled on a single PC. The two states were visualized to perceive the structure of the road network datasets. We will look into visualizing larger datasets using Spark in the future.

Fig. 8.5

District of Columbia road network

Fig. 8.6

Rhode Island road network

4 Results and Discussion

For the experimental setup, we built a Spark cluster on the Aziz supercomputer [34]. In this configuration, we used different numbers of nodes, varying from one to sixteen. We used Apache Hadoop HDFS to store input and output data, and Apache Spark for the data processing. The Master and Slave Spark nodes on the Aziz supercomputer have the following configuration.

  • Linux CentOS, JDK 1.7, dual-socket Intel Xeon E5-2695v2 12-core processors, 2.4 GHz, 24 cores in total, 96 GB RAM, Apache Spark 2.0.1, GraphX, Apache Hadoop HDFS.

4.1 Single Shortest Path Query Results

In our earlier work [33], we had presented results for a single shortest path query on up to 16 nodes (368 cores) for the USA states DC, RI, CO, FL, CA, and the whole US road network with up to over 23 million vertices and 58 million edges (see Table8.2). See [33] for the detailed results and analysis.

4.2 Multiple Shortest Path Query Results

The aim here is to investigate and achieve high performance in computing the shortest paths of multiple queries between source and target vertices with our proposed parallel shortest path algorithm. Using Spark, we run in parallel a varying number of queries, each computing the shortest path between a (source, destination) pair; see Sect. 8.3 for details. In these experiments, we use the Rhode Island (RI) road network, USA, which consists of 53,658 vertices and 69,213 edges.

The results in Fig. 8.7 show that parallelization does not have a significant impact when executing a small number of queries, because the job is too small compared to the number of cores: job distribution is ineffective, and the I/O overhead among the cores, which are distributed over up to 10 nodes (with 24 cores each), takes a long time. Three query set sizes are used in the figure: 10, 100, and 1000 queries. The horizontal axis shows results for varying numbers of cores: 23, 46, up to 230. Each Aziz node contains 24 cores; however, we reserve one core per node for the operating system, and thus utilize 23 cores on each node. The vertical axis gives the total runtime to compute the whole set of queries.

Fig. 8.7

Parallel execution time of varying number of cores

Larger numbers of queries (10 K, 100 K, and 1 M) show a clear reduction and advantage in execution time when parallelizing the whole sets of queries, as shown in Figs. 8.8 and 8.9. As usual, the letter K denotes a thousand and M denotes a million.

Fig. 8.8

Parallel execution time of varying number of nodes

Fig. 8.9

Parallel execution time of varying number of nodes

4.3 Speedup

Based on the experimental results in Sect. 8.4.2, we have calculated the achieved speedup. Figure 8.10 shows that the achieved speedup increases with the increasing number of cores, from 46 to 230. The figure depicts the speedups for six query set sizes: 10, 100, 1 K, 10 K, 100 K, and 1 M. Note that the speedups for smaller computations saturate at a smaller number of nodes than those for larger query sets such as 1 M. The speedup is measured using the following well-known formula.

Fig. 8.10

Achieved speedup with different number of cores

$$ Sp=\frac{T_s}{T_p} $$

Sp denotes the achieved speedup, while Ts denotes the execution time of the sequential computation, and Tp denotes the execution time of parallel computation.
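For illustration with hypothetical numbers (not measured values): if a query set takes $T_s = 4600$ s to compute sequentially and $T_p = 46$ s in parallel, the achieved speedup is $Sp = 4600/46 = 100$.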

4.4 Relative Speedup

To further elaborate the speedup saturation with increasing query set sizes and numbers of cores, we now investigate the relative speedup, i.e., the per-core speedup. The gained relative speedup is quite stable for a large number of queries (1 M) and fluctuates for 100 K queries, as shown in Fig. 8.11, whereas for small query sets below 10 K the relative speedup decreases. The following formula is used to calculate the relative speedup.

Fig. 8.11

Achieved relative speedup with different number of Aziz nodes

$$ \mathrm{Relative}\kern0.17em \mathrm{speedup}=\frac{Sp}{NC} $$

Sp and NC indicate the achieved speedup and the number of used cores, respectively.
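Continuing the hypothetical example above: a speedup of $Sp = 100$ obtained on $NC = 230$ cores gives a relative speedup of $100/230 \approx 0.43$, i.e., each core delivers about 43% of the ideal linear speedup.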

5 Conclusion

Smart applications and infrastructures are increasingly relying on graph computations to model real-life problems and process big data. The emergence of graph-based software, programming languages, graph databases, and benchmarks, and their use in application domains, provides evidence of the increasing popularity of graph-based computing. In this chapter, we have extended our earlier work on single source shortest path computations of big data road network graphs using Apache Spark. In our earlier work [33], we had used the US road network data modelled as graphs and calculated shortest paths between two vertices over a varying number of up to 368 compute cores. The experiments were performed on the Aziz supercomputer (a former Top500 machine [34]). We had analyzed Spark’s parallelization behavior by solving problems of varying graph sizes, i.e., various states of the USA with up to over 23 million vertices and 58 million edges. We call our system the Big Data Shortest Path Graph Computing (BDSPG) system.

In this chapter, we have focused on computing large, varying numbers of shortest path queries, each over a (source, destination) vertex pair. The query set sizes used were 10, 100, 1 K, 10 K, 100 K, and 1 M, executed over up to 230 CPU cores. We achieved good performance, and as expected, the speedup depends on both the size of the data and the number of parallel nodes. In addition to the extended results, we have provided a detailed literature review on shortest path graph computations. The system architecture for graph computing in Spark was explained in additional detail using the architecture depiction and elaborated algorithms.

Future work will look into improving the sequential shortest path algorithm and its parallelization, including data locality. Further performance analysis of the proposed system is also needed. We wish to apply the BDSPG system to the smart city case studies developed in [5, 6, 55].