CONDOR: Dynamic Planning for Link Discovery

Source code:

The provision of links between knowledge bases is one of the core principles of Linked Data. With the growth of the number and the size of RDF datasets comes an increasing need for scalable solutions to support the linking of resources. Hence, the growth of knowledge bases on the Linked Data Web in size and number has led to a significant body of work which addresses the two key challenges of Link Discovery (LD): efficiency and accuracy (see [1] for a survey). In CONDOR, we focus on the first challenge, i.e., on the efficient computation of links between knowledge bases. We address the scalability of the execution of link specifications by presenting the first dynamic planning approach for Link Discovery.

Most LD frameworks use combinations of atomic similarity measures by means of specification operators and thresholds to compute link candidates. The combinations are often called linkage rules [2] or link specifications to compute links [1]. So far, most approaches for improving the execution of LSs have focused on reducing the runtime of the atomic similarity measures used in LSs (see, e.g., [3],[4],[5]). While these algorithms have led to significant runtime improvements, they fail to exploit global knowledge about the LSs to be executed. In CONDOR, we build upon these solutions and tackle the problem of executing link specifications efficiently.

CONDOR makes used of a minute but significant change in the planning and execution of LSs. So far, the execution of LSs has been modeled as a linear process (see [1]), where a LS is commonly rewritten, planned and finally executed. While this architecture has its merits, it fails to use a critical piece of information: the execution engine knows more about runtimes than the planner once it has executed a portion of the specification.

The core idea behind CONDOR is to make use of the information generated by the execution engine at runtime to re-evaluate the plans generated by the planner. To this end, we introduce an architectural change to LD frameworks by enabling a flow of information from the execution engine back to the planner. While this change might appear negligible, it has a significant effect on the performance of LD systems as shown by our evaluation.

The goal of CONDOR is to improve the overall execution time of LSs. To this end, CONDOR aims to derive a time-efficient execution plan for a given input LS L. The basic idea behind state-of-the-art planners for LD (see [6]) is to approximate the costs of possible plans for L, and to simply select the least costly (i.e., the presumably fastest) plan so as to improve the execution costs. The selected plan is then forwarded to the execution engine and executed. We call this type of planning static planning because the plan selected is never changed.

CONDOR addresses the planning and execution of LSs differently: Given an input LS L, CONDOR's planner uses an initial cost function to generate initial plans P, of which each consists of a sequence of steps that are to be executed by CONDOR's execution engine to compute L. The planner chooses the least costly plan and forwards it to the engine. After the execution of each step, the execution engine overwrites the planner's cost function by replacing the estimated costs of the executed step with its real costs. The planner then re-evaluates the alternative plans generated afore and alters the remaining steps to be executed if the updated cost function suggests better expected runtimes for this alteration of the remaining steps.

Based on our evaluation, CONDOR achieves significantly better runtimes than existing planning solutions while retaining an F-measure of 100%. We quantified our improvement by evaluating our approach on 7 datasets and 700 link specifications. Our results suggest that CONDOR is up to 2 orders of magnitude faster than the state of the art and requires less than 0.1% of the total runtime of a given specification to generate the corresponding plan.

CONDOR was accepted at the Research Track of the Extended Semantic Web Conference (ESWC) 2018.



[1] Nentwig, M., Hartung, M., Ngomo, A.C.N., Rahm, E.: A Survey of Current Link Discovery
Frameworks. Semantic Web (Preprint) (2015) 1–18

[2] Isele, R., Jentzsch, A., Bizer, C.: Efficient Multidimensional Blocking for Link Discovery

without losing Recall. In Marian, A., Vassalos, V., eds.: WebDB. (2011)

[3] Ngonga Ngomo, A.C., Auer, S.: LIMES - A Time-Efficient Approach for Large-Scale Link

Discovery on the Web of Data. In: Proceedings of IJCAI. (2011)

[4] Wang, J., Feng, J., Li, G.: Trie-join: Efficient Trie-based String Similarity Joins with Edit-

distance Constraints. Proc. VLDB Endow. 3(1-2) (September 2010) 1219–1230

[5] Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient Similarity Joins for Near Duplicate Detection.

In: Proceedings of the 17th International Conference on World Wide Web. WWW ’08, New

York, NY, USA, ACM (2008) 131–140

[6] Ngonga Ngomo, A.C.: HELIOS – Execution Optimization for Link Discovery. In: Proceed-

ings of ISWC. (2014)