Research: A novel model of extraction of data from the deep web
With the advent of the internet, people's demand for information exchange has surged. Through the world wide web, information disseminates quickly, and the forms in which it can be shared keep multiplying, including documents, pictures, audio clips, videos, hyperlinks, forms, and many more.
The surge in demand for web information extraction technologies, together with in-depth study of the associated research problems, has fueled the development of deep web information extraction technology. A variety of deep web information extraction methods and tools have now been developed.
Although most of these tools and systems use web page wrappers to obtain structured data from a data source, the methodologies and research fields involved differ. Based on how a user's target data is identified and located within web pages, web information extraction systems and related technologies can be classified into several main types: ontology-based extraction, NLP-based extraction, location-based extraction, web-query-based extraction, and wrapper-modeling-based extraction.
Nevertheless, most current methods for extracting deep web data do not consider domain requirements. At the same time, a considerable amount of useful field data is stored in background databases; this hidden deep data requires continuous querying and extraction. Motivated by this, a recently published paper analyzes methods for extracting web data entities based on domain knowledge and proposes a topic-oriented extraction model for obtaining in-depth information from domain web pages. The authors also design a sorting classification extraction algorithm for numerical data. This article gives an overview of the model proposed in that paper.
Domain Topic Extraction Model and search approach:
The domain topic extraction model presented in the paper improves on a generic deep web crawler model. The flowchart of the extraction model is shown in Figure (1). Compared with the general deep web crawler model, the domain topic model adds two modules: a candidate URL priority calculation module and a page topic relevance calculation module.
The page topic relevance calculation module filters saved pages according to the relevance between the queried pages and the target topics. Whenever a page's relevance to the topic exceeds a predefined threshold, the page's candidate URLs are extracted and fed into the candidate URL priority calculation module. The queueing rule is as follows: a candidate URL that is strongly related to the topic is inserted at the front of the queue, while a weakly related one is inserted at the back of the queue or is otherwise discarded.
Figure (1): Flowchart of the topic extraction model
If a page's relevance to the queried topic is below the predefined threshold, the page is discarded, and the candidate URLs it contains do not need to be extracted or prioritized. These two modules therefore directly affect the quality of the crawled pages.
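The filtering and queueing rules above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the relevance function, the threshold value, and the function names are all assumptions for the sake of the example.

```python
from collections import deque

RELEVANCE_THRESHOLD = 0.5  # assumed cutoff; the paper treats this as a configurable threshold


def topic_relevance(page_text, topic_terms):
    """Toy relevance score: fraction of topic terms appearing in the page.
    The paper's actual relevance computation is not specified here."""
    words = set(page_text.lower().split())
    return len(set(topic_terms) & words) / len(topic_terms) if topic_terms else 0.0


def enqueue_candidates(page_text, candidate_urls, topic_terms, queue):
    """Apply the model's rules: a page below the threshold is discarded
    outright; otherwise each candidate URL goes to the front or back of
    the crawl queue according to its estimated relevance."""
    if topic_relevance(page_text, topic_terms) < RELEVANCE_THRESHOLD:
        return  # page discarded; its candidate URLs are never extracted
    for url, url_relevance in candidate_urls:
        if url_relevance >= RELEVANCE_THRESHOLD:
            queue.appendleft(url)   # strongly related: front of the queue
        elif url_relevance > 0:
            queue.append(url)       # weakly related: back of the queue
        # irrelevant candidates are discarded
```

A `deque` is used here because the rules require insertion at both ends of the queue; a strict priority queue, as in the best-first strategy below, is the other natural choice.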
The best-first extraction approach:
Best-first search can be considered an improvement over breadth-first search (BFS). Its essential principle is to continuously pursue the path with the lowest cost according to the result of an evaluation function. During the search, the lowest-cost path is eventually identified by repeatedly abandoning costlier paths.
The topic crawling model, using the best-first strategy, maintains a priority queue of URLs to be crawled throughout the crawling process. High-priority URLs are popped from the queue, and the corresponding pages are downloaded and analyzed. Links found in those pages are prioritized and inserted into the URL priority queue to be crawled according to their priority levels. This process repeats until the priority queue is empty or a termination condition is reached. A link's priority depends on the relevance of its web page to the target theme, so pages with high topic relevance are usually crawled first; topic crawling models are therefore usually driven by best-first strategies. Nevertheless, this strategy has a limitation: the space of the URL priority queue is relatively limited, and the highest-ranked URL at any moment is only locally best, not necessarily globally best, so some deeper web pages that are relevant to the topic may be discarded. The bottom line is that the best-first policy is a simple and efficient search strategy.
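The crawl loop described above can be sketched with a standard priority queue. Everything here is illustrative: `score`, `fetch`, and `extract_links` are assumed callables standing in for the paper's relevance evaluation, page download, and link extraction steps.

```python
import heapq


def best_first_crawl(seed_urls, score, fetch, extract_links, max_pages=100):
    """Best-first crawl sketch: always expand the highest-scoring URL
    currently in the priority queue, until the queue is empty or the
    page budget (the termination condition) is reached."""
    # heapq is a min-heap, so scores are negated to pop the best URL first
    frontier = [(-score(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    visited, order = set(), []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)           # download and analyze this page
        page = fetch(url)
        for link in extract_links(page):
            if link not in visited:
                heapq.heappush(frontier, (-score(link), link))
    return order
```

Note how the pop is only locally greedy: a deep page reachable only through a low-scoring link can be starved out of a bounded queue, which is exactly the limitation discussed above.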
Compared with the general deep web crawler model, the domain topic extraction model has the following advantages:
– The domain topic extraction model uses the best-first search strategy rather than the common BFS strategy, because BFS does not consider domain requirements.
– The general model does not care about a web page's content, while the domain topic model extracts the useful data associated with the target theme.
– The general model collects a large number of web pages regardless of whether they are used, whereas the domain topic extraction model avoids downloading unrelated pages.
This research introduces a domain topic model for extracting domain-specific deep web data. The model adds two modules designed to filter saved pages based on their relevance to the target topics. The paper also proposes a best-first search strategy shown to outperform the commonly used BFS strategy. Further studies are needed to test the proposed model and confirm its efficiency in extracting deep web data.