Role of URLs in Objectionable Web Content Categorization

Presentation Transcript

A Seminar Presentation on “The Role of URLs in Objectionable Web Content Categorization” : 

Presented by: Sheetal.P and Ranjana Shivanna

Objectives : 

An Introduction
URL Content Categorization
Framework of Objectionable Content Categorization
N-gram Representation of URLs
Feature Selection & Feature Selection Measure
Integration of URL and Content-Based Classifier
Conclusion and Future Scope

URL-Based & Content-Based Categorization : 

Content categorization is a technique used to prevent users from accessing untrusted and objectionable websites. With the growth of Internet traffic, it has become difficult to manage web content manually, so automatic content categorization has been developed. Content-based classification alone is not enough to filter objectionable websites.

Approach Used for Identifying Objectionable Websites : 

The approach combines URL-based and content-based categorization. URL-based content classification: the URL is broken into a sequence of n-grams for a range of values of n, and a subset of those n-grams is selected to represent the URL. N-gram: a sequence of n consecutive characters. Counting n-grams tells us how frequently each group of characters occurs.

contd.. : 

Example of n-grams: consider the word “example”.
N = 3: the 3-grams are “exa”, “xam”, “amp”, “mpl”, “ple”
N = 4: the 4-grams are “exam”, “xamp”, “ampl”, “mple”
N = 5: the 5-grams are “examp”, “xampl”, “ample”
N = 6: the 6-grams are “exampl”, “xample”
N = 7: the only 7-gram is the whole word, “example”
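The sliding-window rule behind this example can be sketched in a few lines of Python. This is only an illustration; the function name `char_ngrams` is hypothetical and does not come from the presentation. A word of length L yields L - n + 1 n-grams, and none at all when n exceeds L.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters in text, left to right."""
    if n <= 0:
        raise ValueError("n must be positive")
    # The window starts at each index that still leaves room for n characters.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Print the n-grams of "example" for each n from 3 up to the word length.
for n in range(3, 8):
    print(n, char_ngrams("example", n))
```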

Framework of Objectionable Content Categorization : 

For each objectionable category, a content categorization classifier is developed to identify websites of that category. The classification problem involves two categories: the target category and its complementary non-target category.

Contd… : 

[Figure: Framework of the Integrated Objectionable Content Categorization]

N-gram Representation of the URLs : 

A URL is simply viewed as a string of characters. For example, the URL of an image search done through Google looks like: http://images.google.co.in/images?hl=en&um=1&q=computer&btnG=Search+Images The 4-grams that can be generated for the word “computer” are “comp”, “ompu”, “mput”, “pute”, “uter”. It is difficult to determine the optimal n for the n-gram method. The solution is to use a wider range of n’s.
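Generating n-grams over a range of n’s for a whole URL might look like the Python sketch below. The function name `url_ngrams`, the default range of 3 to 6, and the lowercasing step are all assumptions made for illustration; the presentation does not specify them.

```python
def url_ngrams(url, n_min=3, n_max=6):
    """Collect the distinct character n-grams of url for every n in [n_min, n_max]."""
    text = url.lower()  # assumption: case is ignored when comparing n-grams
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams


url = "http://images.google.co.in/images?hl=en&um=1&q=computer&btnG=Search+Images"
grams = url_ngrams(url, n_min=4, n_max=8)
# Both a short fragment and the full query word are captured by the range.
print("comp" in grams, "computer" in grams)
```

Because the whole URL is treated as one string, many of the generated n-grams span separators such as "/" or "&" and carry no meaning by themselves, which is why a selection step is needed afterwards.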

Feature Selection & Feature Selection Measure : 

Feature selection is the most important stage in classification. A feature selection measure known as the R-measure has been developed, in which the terms of the target and non-target categories are ranked separately.

contd.. : 

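The R-measure is calculated from the following quantities: t = term, C = target category, $\bar{C}$ = non-target category, r = a parameter that takes a value between 0 and 1, $P(t \mid C)$ = probability that t occurs in documents of the target class, $P(t \mid \bar{C})$ = probability that t occurs in documents of the non-target class.

Slide 11: 

$P(t \mid C)$ and $P(t \mid \bar{C})$ are computed as $n_C(t) / N_C$ and $n_{\bar{C}}(t) / N_{\bar{C}}$ respectively, where $n_C(t)$ and $n_{\bar{C}}(t)$ are respectively the number of documents in C and $\bar{C}$ in which t occurs, and $N_C$ and $N_{\bar{C}}$ are the number of documents in C and $\bar{C}$.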
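The two probabilities can be estimated as simple document-frequency ratios: the number of documents in a category that contain the term, divided by the number of documents in that category. The Python sketch below covers only this step and does not attempt the R-measure formula itself. The function name, the representation of a document as a set of terms, and the toy data are all hypothetical.

```python
def term_probabilities(term, target_docs, non_target_docs):
    """Estimate the probability of term in the target and non-target categories.

    Each document is a set of terms (an assumption of this sketch).
    Returns (probability in target docs, probability in non-target docs).
    """
    if not target_docs or not non_target_docs:
        raise ValueError("both categories need at least one document")
    in_target = sum(1 for doc in target_docs if term in doc)
    in_non_target = sum(1 for doc in non_target_docs if term in doc)
    return in_target / len(target_docs), in_non_target / len(non_target_docs)


# Toy data, invented purely for illustration.
target = [{"casino", "poker"}, {"poker", "bet"}, {"news", "bet"}, {"poker"}]
non_target = [{"news", "sport"}, {"poker", "news"}, {"weather"}, {"news"}, {"sport"}]
print(term_probabilities("poker", target, non_target))
```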

Feature Selection : 

Generating n-grams for a range of n’s may result in a large number of redundant n-grams. A simple way of removing the redundant n-grams is to use a dictionary of terms. The R-measure is applied to rank all terms of the objectionable category, and a number of top-ranked terms are selected to form the dictionary for the category.
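A minimal Python sketch of the dictionary step, under stated assumptions: the scores below are made-up placeholders standing in for R-measure values, and the function names, the cut-off `top_k`, and the example URL are hypothetical.

```python
def build_dictionary(term_scores, top_k):
    """Keep the top_k highest-scoring terms as the category dictionary."""
    ranked = sorted(term_scores, key=term_scores.get, reverse=True)
    return set(ranked[:top_k])


def filter_ngrams(url, dictionary, n_min, n_max):
    """Return the n-grams of url (n in [n_min, n_max]) that are dictionary terms."""
    text = url.lower()
    kept = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            if gram in dictionary:
                kept.add(gram)
    return kept


# Placeholder scores, not real R-measure values.
scores = {"poker": 0.9, "casino": 0.8, "bet": 0.7, "news": 0.1, "image": 0.05}
dictionary = build_dictionary(scores, top_k=3)
print(filter_ngrams("http://www.example.com/poker-news/bet", dictionary, 3, 6))
```

Only n-grams that match a dictionary term survive, so fragments that straddle word boundaries are discarded without any explicit tokenization of the URL.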

Integration of URL and Content-Based Classifier : 

The URL-based and content-based classifiers work independently. The score returned by the URL-based classifier is lower than the score returned by the content-based classifier.

Conclusion : 

The approach used in this paper helps in categorizing and filtering sites that fall under an objectionable category. A combination of the URL-based and content-based approaches is used for optimal results.

Slide 15: 

Thank You