Scalable Keyword Search on Large RDF Data : IEEE 2014
Keyword search is a useful tool for exploring large RDF datasets. Existing techniques either rely on constructing a distance matrix for pruning the search space or build summaries from the RDF graphs for query processing. In this work, we show that existing techniques have serious limitations in dealing with realistic, large RDF data with tens of millions of triples. Furthermore, the existing summarization techniques may lead to incorrect or incomplete results. To address these issues, we propose an effective summarization algorithm to summarize the RDF data. Given a keyword query, the summaries lend significant pruning power to exploratory keyword search and result in much better efficiency compared to previous works. Unlike existing techniques, our search algorithms always return correct results. In addition, the summaries we build can be updated incrementally and efficiently. Experiments on both benchmark and large real RDF data sets show that our techniques are scalable and efficient.
The RDF (Resource Description Framework) is the de facto standard for data representation on the Web. So, it is no surprise that we are inundated with large amounts of rapidly growing RDF data from disparate domains. For instance, the Linked Open Data (LOD) initiative integrates billions of entities from hundreds of sources. Just one of these sources, the DBpedia dataset, describes more than 3.64 million things using more than 1 billion RDF triples; and it contains numerous keywords, as shown in Figure 1. Keyword search is an important tool for exploring and searching large data corpora whose structure is either unknown or constantly changing. So, keyword search has already been studied in the context of relational databases, XML documents, and more recently over graphs and RDF data.
We studied the problem of scalable keyword search on big RDF data and proposed a new summary-based solution: (i) we construct a concise summary at the type level from RDF data; (ii) during query evaluation, we leverage the summary to prune away a significant portion of RDF data from the search space, and formulate SPARQL queries for efficiently accessing data. Furthermore, the proposed summary can be incrementally updated as the data get updated. Experiments on both RDF benchmark and real RDF datasets showed that our solution is efficient, scalable, and portable across RDF engines. An interesting future direction is to leverage the summary for optimizing generic SPARQL queries on large RDF datasets.
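The two steps above can be illustrated with a deliberately simplified sketch. It is not the paper's summarization algorithm; it only shows the general idea of collapsing entities to their types and using the resulting type-level graph to discard keyword combinations that cannot be connected. The function names, the `ex:` prefix, and the sample triples are all hypothetical.

```python
from collections import defaultdict, deque

RDF_TYPE = "rdf:type"  # assumed prefixed form of the rdf:type predicate


def build_type_summary(triples):
    """Collapse an RDF graph to type-level edges (subject type, predicate, object type)."""
    types = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
    summary = set()
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        # Objects without a known type (e.g. literals) contribute no edge.
        for ts in types.get(s, ()):
            for to in types.get(o, ()):
                summary.add((ts, p, to))
    return summary


def types_connected(summary, type_a, type_b):
    """Breadth-first search over the summary, ignoring edge direction."""
    adj = defaultdict(set)
    for ts, _, to in summary:
        adj[ts].add(to)
        adj[to].add(ts)
    seen, queue = {type_a}, deque([type_a])
    while queue:
        cur = queue.popleft()
        if cur == type_b:
            return True
        for nxt in adj[cur] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False


def sparql_for_edge(edge):
    """Format a SPARQL query for one summary edge (PREFIX declarations omitted)."""
    ts, p, to = edge
    return f"SELECT ?s ?o WHERE {{ ?s a {ts} . ?o a {to} . ?s {p} ?o . }}"


# Hypothetical sample data, for illustration only.
triples = [
    ("ex:apollo11", RDF_TYPE, "ex:SpaceMission"),
    ("ex:armstrong", RDF_TYPE, "ex:Astronaut"),
    ("ex:apollo11", "ex:crew", "ex:armstrong"),
    ("ex:mit", RDF_TYPE, "ex:University"),
    ("ex:mit", "ex:label", "MIT"),
]
summary = build_type_summary(triples)
```

In this toy setting, a query whose keywords map to `ex:SpaceMission` and `ex:University` can be pruned without touching the data, because `types_connected` finds no path between those types in the summary. A query over `ex:SpaceMission` and `ex:Astronaut` survives, and `sparql_for_edge` shows how a surviving summary edge might be turned into a SPARQL query for the underlying RDF engine.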
Processor : Intel Pentium IV
RAM : 512 MB
Hard Disk : 80 GB HDD
Operating System : Windows XP / Windows 7
Front End : Java
Back End : MySQL 5
- Abiteboul, R. Hull, and V. Vianu. Foundations Of Databases. Addison-Wesley, 1995.
- Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. CACM, 31(9):1116-1127, 1988.
- Agrawal, S. Chaudhuri, and G. Das. DBXplorer: enabling keyword search over relational databases. In SIGMOD, 2002.
- Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, 2002.
- Bizer and A. Schultz. The Berlin SPARQL benchmark. International Journal on Semantic Web and Information Systems, 2009.
- Broekstra et al. Sesame: A generic architecture for storing and querying RDF and RDF schema. In ISWC, 2002.
- Chen, W. Wang, and Z. Liu. Keyword-based search and exploration on databases. In ICDE, 2011.
- Chen, W. Wang, Z. Liu, and X. Lin. Keyword search on structured and semi-structured data. In SIGMOD, 2009.
- B. Dalvi, M. Kshirsagar, and S. Sudarshan. Keyword search on external memory data graphs. In VLDB, 2008.
- Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and oranges: a comparison of RDF benchmarks and real RDF datasets. In SIGMOD, 2011.