International Journal of Science and Research (IJSR)


ISSN: 2319-7064



M.Tech / M.E / PhD Thesis | Aerospace Engineering | India | Volume 8 Issue 8, August 2019


DUST Removal Framework Based on Improved Multiple Sequence Alignment Technique

Pulagam Sai Nandana | K N Brahmaji Rao


Abstract: A large number of URLs collected by web crawlers correspond to pages with duplicate or near-duplicate content. These duplicate URLs, generically known as DUST (Different URLs with Similar Text), adversely affect search engines, since crawling, storing and using such data wastes resources, produces low-quality rankings and degrades the user experience. To deal with this problem, several methods have been proposed to detect and remove duplicate documents without fetching their contents. To accomplish this, these methods learn normalization rules that transform all duplicate URLs into the same canonical form. Crawlers can use this information to avoid fetching DUST. A challenging aspect of this strategy is to efficiently derive the minimum set of rules that achieves the largest reduction with the smallest false-positive rate. As most methods are based on pairwise analysis, the quality of the rules is affected by the criterion used to select the examples and by the availability of representative examples in the training sets. To avoid processing large numbers of URLs, these methods employ techniques such as random sampling or looking for DUST only within sites, which prevents the generation of rules involving multiple DNS names. As a consequence, current methods are very susceptible to noise and, in many cases, derive rules that are overly specific. In this thesis, we present a new approach to deriving high-quality rules that takes advantage of a multi-sequence alignment strategy. We demonstrate that a full multi-sequence alignment of URLs with duplicated content, performed before the rules are generated, can lead to very effective rules. Experimental results show that our approach achieved larger reductions in the number of duplicate URLs than our best baseline on two different web collections, while also being much faster.
We also present a distributed version of our method, built on the MapReduce framework, and demonstrate its scalability by evaluating it on a set of 7.37 million URLs.
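
The idea of aligning a whole cluster of duplicate URLs before deriving a rule can be illustrated with a minimal Python sketch. It is not the method of the thesis: it assumes a simple tokenizer, uses the standard-library difflib.SequenceMatcher as a stand-in for the alignment step, and folds the URLs one by one into a running consensus instead of computing a full multiple alignment. The names tokenize, merge, consensus_pattern and to_regex, and the example.com URLs, are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Sentinel marking a position where the aligned URLs disagree (assumption of this sketch).
WILDCARD = object()


def tokenize(url):
    """Split a URL into alphanumeric runs and single punctuation characters."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url)


def merge(consensus, tokens):
    """Align tokens against the running consensus; any disagreement becomes a wildcard."""
    matcher = SequenceMatcher(a=consensus, b=tokens, autojunk=False)
    merged = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(consensus[i1:i2])
        elif not merged or merged[-1] is not WILDCARD:
            # Collapse adjacent replace/insert/delete regions into one wildcard.
            merged.append(WILDCARD)
    return merged


def consensus_pattern(urls):
    """Fold every URL of a duplicate cluster into one consensus token sequence."""
    consensus = tokenize(urls[0])
    for url in urls[1:]:
        consensus = merge(consensus, tokenize(url))
    return consensus


def to_regex(consensus):
    """Turn the consensus into a regular expression usable as a candidate rule."""
    parts = [".*?" if tok is WILDCARD else re.escape(tok) for tok in consensus]
    return re.compile("^" + "".join(parts) + "$")


if __name__ == "__main__":
    cluster = [
        "http://example.com/a/index.html",
        "http://example.com/b/index.html",
        "http://example.com/c/index.html",
    ]
    rule = to_regex(consensus_pattern(cluster))
    print(rule.pattern)
    print(all(rule.match(u) for u in cluster))
```

Because the consensus is built from all URLs in the cluster, a single noisy pair has less influence on the resulting pattern than it would in a purely pairwise scheme. A real system would still need to validate each candidate rule against held-out URLs to control false positives.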


Keywords: Search engines, Crawling, De-duplication, URL Normalization, Rewrite rules


Edition: Volume 8 Issue 8, August 2019


Pages: 975 - 979




How to Cite this Article?

Pulagam Sai Nandana, K N Brahmaji Rao, "DUST Removal Framework Based on Improved Multiple Sequence Alignment Technique", International Journal of Science and Research (IJSR), Volume 8 Issue 8, August 2019, pp. 975-979, https://www.ijsr.net/get_abstract.php?paper_id=3081902
