A search engine for peer-to-peer information retrieval in XML documents
The quantity of digitally available documents continues to increase, which makes adequate methods for searching very large document collections all the more important. In contrast to exact searches, in which users look for documents with known file names, we use techniques of Information Retrieval (IR) to find relevant results for a query. For some years now, searches have also been performed on collections of structured documents, in particular since the eXtensible Markup Language (XML) was established as an official standard of the World Wide Web Consortium (W3C). A number of research approaches now apply IR methods to XML documents. XML Information Retrieval (XML-IR) uses the structure of the documents to make searching for and within these documents more effective, i.e. to improve the quality of the search results, e.g. by focusing on particularly relevant document parts. However, all previous solutions are centralized stand-alone search engines built for research purposes. They cannot search very large data collections distributed over a number of computers. In practice, techniques for distributed XML-IR are also needed wherever the system to be searched consists of a number of local, heterogeneous XML collections whose users do not want to, or cannot, store their documents on a central server; such users frequently join together in the form of a distributed peer-to-peer (P2P) network.
In the SPIRIX research project we study, for the first time and using P2P networks as an example, to what extent XML-IR can be effective and efficient in distributed systems. For this purpose, we design a generalized architecture model for the development of P2P search engines for XML retrieval, in which functionality from the areas of XML-IR and P2P is arranged in abstract layers. The model serves as the basis for the design of a specific P2P search engine for XML-IR. We develop different techniques for distributed XML-IR to implement the individual phases of the search: indexing of documents, routing of queries, ranking of suitable documents, and retrieval of results. In particular, we consider the problem of multi-term queries and aspects of distribution. In addition to the search quality to be achieved, the focus also lies on the necessary communication effort.
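To illustrate why multi-term queries matter for the communication effort, here is a minimal sketch of a term-partitioned index. It is not the SPIRIX implementation: the class `ToyNetwork`, the placement rule in `peer_for`, and the raw term-frequency scoring are all assumptions made for illustration. It shows only that a query with several terms has to contact one peer per distinct term.

```python
from collections import defaultdict


class ToyNetwork:
    """Hypothetical term-partitioned index: each term's postings live on one peer."""

    def __init__(self, num_peers):
        self.num_peers = num_peers
        # One posting store per peer: term -> {doc_id: term frequency}
        self.postings = [defaultdict(dict) for _ in range(num_peers)]
        self.messages = 0  # number of lookup messages sent so far

    def peer_for(self, term):
        # Placeholder placement rule (assumption); a DHT would hash the term.
        return sum(term.encode("utf-8")) % self.num_peers

    def index(self, doc_id, text):
        # Indexing phase: store each term's posting on the responsible peer.
        for term in text.lower().split():
            plist = self.postings[self.peer_for(term)][term]
            plist[doc_id] = plist.get(doc_id, 0) + 1

    def search(self, query):
        # Routing phase: one message per distinct query term.
        scores = defaultdict(int)
        for term in set(query.lower().split()):
            self.messages += 1
            found = self.postings[self.peer_for(term)].get(term, {})
            # Ranking phase: sum raw term frequencies (assumed toy scoring).
            for doc_id, tf in found.items():
                scores[doc_id] += tf
        # Highest score first; ties are broken by document id.
        return sorted(scores, key=lambda d: (-scores[d], d))


net = ToyNetwork(num_peers=4)
net.index("d1", "xml retrieval in peer networks")
net.index("d2", "xml databases")
ranking = net.search("xml retrieval")
```

In this toy setting the two-term query costs two lookup messages, so the number of messages grows with the number of query terms. That growth is the kind of effort the routing techniques aim to reduce.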
The developed methods are implemented in the form of a P2P search engine for distributed XML retrieval. This search engine, named SPIRIX, can conduct a fully functional search for XML documents in a P2P network and evaluate their relevance based on content. For the communication between peers, we design a P2P protocol named SpirixDHT, which is based on Chord and specifically adapted for use in XML-IR.
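As background on the Chord basis, the following minimal sketch shows plain Chord-style key placement: peers and keys are hashed onto the same identifier ring, and each key is assigned to the first peer whose identifier is equal to or follows it. It does not show the XML-specific adaptations of SpirixDHT, which are not described here. The names `ring_id`, `Ring`, and `responsible_peer`, as well as the peer names, are hypothetical.

```python
import hashlib
from bisect import bisect_left

BITS = 160  # identifier length; SHA-1 yields 160 bits, as in the original Chord design


def ring_id(key, bits=BITS):
    """Map a string (peer name or index term) to a position on the identifier ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class Ring:
    """Global view of a Chord-style ring (real peers only know part of it)."""

    def __init__(self, peer_names):
        self.peers = sorted((ring_id(name), name) for name in peer_names)
        self.ids = [pid for pid, _ in self.peers]

    def successor(self, key_id):
        # First peer whose id is >= key_id, wrapping around past the largest id.
        i = bisect_left(self.ids, key_id)
        return self.peers[i % len(self.peers)][1]

    def responsible_peer(self, term):
        # The peer that would store the index entry for this term.
        return self.successor(ring_id(term))


ring = Ring(["peer-a", "peer-b", "peer-c", "peer-d"])
owner = ring.responsible_peer("retrieval")
```

The sketch resolves a key from a global view for brevity. In Chord itself each peer keeps only a small routing table (the finger table) and forwards a lookup over roughly a logarithmic number of hops.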
To evaluate the designed techniques, we first demonstrate the search quality of SPIRIX. This is done through participation in INEX, the international Initiative for the Evaluation of XML Retrieval, in which XML-IR solutions from around the world are compared every year. In 2008, SPIRIX achieved a search precision comparable to that of the top 10 XML-IR solutions.
In further experiments, we evaluate the designed methods for distributed XML retrieval with INEX tools; for this purpose, we compare the search quality achieved by each method with the effort it requires. The insights gained are applied to the routing process; of particular interest here is the question of how XML structure can be used to improve the efficiency of a distributed system. The evaluation of the designed routing techniques shows a significant reduction in the number of messages sent and in their size, and thus in the network load, while at the same time improving the search quality.
The SPIRIX project thus shows that distributed XML-IR can be effective and efficient. At the same time, we show how the use of XML-IR techniques in the routing of queries can further reduce the search effort, in particular for the communication between peers, to the point that the system can scale to a large number of participating peers and still maintain a high search quality.