Filtron: A Learning-Based Anti-Spam Filter : 

Eirinaios Michelakis (ernani@iit.demokritos.gr), Ion Androutsopoulos (ion@aueb.gr), George Paliouras (paliourg@iit.demokritos.gr), George Sakkis (gsakkis@rutgers.edu), Panagiotis Stamatopoulos (takis@di.uoa.gr)

First Conference on Email and Anti-Spam (CEAS), Mountain View, CA, July 30th and 31st 2004

Outline : 

- Spam Filtering: past, present and future
- Anti-spam filtering with Filtron
- In Vitro Evaluation
- In Vivo Evaluation
- Conclusions

Spam Filtering: past, present and future : 

Past:
- Black-lists and white-lists of e-mail addresses
- Handcrafted rules looking for suspicious keywords and patterns in headers

Present:
- Machine learning-based filters, mostly using the Naïve Bayes classifier (examples: Mozilla’s spam filter, POPFile, K9)
- Signature-based filtering (Vipul’s Razor)

Future:
- Combination of several techniques (SpamAssassin)

Filtron: An overview : 

A multi-platform learning-based anti-spam filter.

Features for the simple user:
- Personalized: based on her legitimate messages
- Automatically updating black/white lists
- Efficient: server-side filtering and interception rules

Features for the advanced user and the researcher:
- Customizable learning component, through the Weka open source machine learning platform
- Support for creating publicly available message collections: privacy-preserving encoding of messages and user profiles
- Portable: implemented in Java and Tcl/Tk; currently supported under POSIX-compatible mail servers (MS Exchange Server port efforts under way)

Filtron’s Architecture : 

[Slide content not captured in the transcript.]

Preprocessing : 

- Break down mailbox(es) into distinct messages
- Remove from every message: mail headers, HTML tags, attached files
- Remove messages with no textual content
- Store at most 5 messages per sender (avoids bias towards regular correspondents)
- Remove duplicates
- Encode messages (optional)
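The preprocessing steps above could be sketched in Python roughly as follows. This is an illustration only, not Filtron's code (which, per the overview slide, is implemented in Java and Tcl/Tk). The function names `body_text` and `preprocess`, the regex-based tag stripping, and the exact-match duplicate test are all assumptions made for the example; the optional encoding step is left out.

```python
import re
from collections import defaultdict
from email import message_from_string
from email.utils import parseaddr

MAX_PER_SENDER = 5                 # cap stated on the slide
TAG_RE = re.compile(r"<[^>]+>")    # naive HTML tag stripper (assumption)


def body_text(msg):
    """Return only the textual body: no headers, attachments or HTML tags."""
    chunks = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_content_disposition() == "attachment":
            continue
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        chunks.append(TAG_RE.sub(" ", text))
    # Collapse runs of whitespace left behind by tag removal.
    return " ".join(" ".join(chunks).split())


def preprocess(raw_messages):
    """Filter a list of raw RFC 822 message strings.

    Returns (sender, text) pairs after dropping empty messages,
    exact duplicates, and anything beyond MAX_PER_SENDER per sender.
    """
    kept, seen, per_sender = [], set(), defaultdict(int)
    for raw in raw_messages:
        msg = message_from_string(raw)
        sender = parseaddr(msg.get("From", ""))[1].lower()
        text = body_text(msg)
        if not text:                              # no textual content
            continue
        if text in seen:                          # duplicate body
            continue
        if per_sender[sender] >= MAX_PER_SENDER:  # regular correspondent
            continue
        seen.add(text)
        per_sender[sender] += 1
        kept.append((sender, text))
    return kept
```

The first step on the slide, splitting a mailbox into messages, is assumed to have happened already; for mbox files the standard library's `mailbox.mbox` class can supply the individual messages.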

Message Classification : 

[Slide content not captured in the transcript.]

In Vitro Evaluation : 

We investigated the effect of:
- Single-token versus multi-token attributes (n-grams for n = 1, 2, 3)
- Number of attributes (40-3000)
- Learning algorithm (Naïve Bayes, Flexible Bayes, SVMs, LogitBoost)
- Training corpus size (~10%-100% of full training corpus)

Cost-sensitive learning formulation:
- Misclassifying a legitimate message as spam (LS) is λ times more serious an error than misclassifying a spam message as legitimate (SL)
- Two usage scenarios (λ = 1, 9)
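One common way to turn the λ cost ratio into a decision rule and an evaluation measure is sketched below. The slides do not spell out either formula, so both are assumptions here: a message is labelled spam only if the classifier's estimated spam probability exceeds λ/(1+λ), and weighted accuracy counts each legitimate message λ times. All function and parameter names are illustrative.

```python
def spam_threshold(lam):
    """Probability above which a message is labelled spam (assumed rule)."""
    return lam / (1.0 + lam)


def is_spam(p_spam, lam):
    """Cost-sensitive decision given an estimated P(spam | message)."""
    return p_spam > spam_threshold(lam)


def weighted_accuracy(n_ll, n_ls, n_ss, n_sl, lam):
    """Accuracy with every legitimate message weighted lam times.

    n_ll: legitimate classified as legitimate
    n_ls: legitimate classified as spam (the costly LS error)
    n_ss: spam classified as spam
    n_sl: spam classified as legitimate (the SL error)
    """
    weighted_correct = lam * n_ll + n_ss
    weighted_total = lam * (n_ll + n_ls) + (n_ss + n_sl)
    return weighted_correct / weighted_total


# The two scenarios from the slide: thresholds 0.5 and 0.9.
for lam in (1, 9):
    print(lam, spam_threshold(lam), is_spam(0.8, lam))
```

Under this rule, λ = 1 treats both error types equally, while λ = 9 lets a message with an estimated spam probability of 0.8 through as legitimate.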

In Vitro Evaluation (cont.) : 

Evaluation:
- Four message collections (PU1, PU2, PU3, PUA)
- Stratified 10-fold cross validation

Results:
- No clear winner among learning algorithms with respect to accuracy, so efficiency (or other criteria) is more important for real usage
- Nevertheless, SVMs were consistently among the two best
- No substantial improvement with n-grams (for n > 1)

Refer to the TR for more details: Learning to filter unsolicited commercial e-mail, TRN 2004/2, NCSR “Demokritos” (http://www.iit.demokritos.gr/skel/i-config/)
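For readers unfamiliar with stratified cross validation, the sketch below shows one way to build the folds so that each keeps roughly the corpus-wide spam/legitimate ratio. It is a generic illustration, not the procedure used in the experiments (the overview slide mentions Weka for the learning component); the function names and the round-robin assignment are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split indices 0..len(labels)-1 into k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each message is used for testing exactly once and for training in the other k-1 iterations; the reported score is then averaged over the folds.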

Summary of in Vitro Evaluation : 

[Slide content not captured in the transcript.]

In Vivo Evaluation : 

- Seven-month live evaluation by the third author
- Training collection: PU3 (2313 legitimate / 1826 spam)
- Learning algorithm: SVM
- Cost scenario: λ = 1
- Retained attributes: 520 1-grams, with numeric values (term frequency)
- No black-list was used
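To make the attribute representation concrete, the following sketch maps a message to a vector of term frequencies over a fixed attribute list, such as the 520 retained 1-grams. The tokenizer, the use of raw counts rather than length-normalized frequencies, and the names `ngrams` and `tf_vector` are assumptions for illustration; the slides do not give these details.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")   # simplistic tokenizer (assumption)


def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_vector(text, attributes, n=1):
    """Count how often each retained attribute occurs in the message."""
    tokens = TOKEN_RE.findall(text.lower())
    counts = Counter(ngrams(tokens, n))
    return [counts[attr] for attr in attributes]


# Three hypothetical attributes instead of the 520 used in the evaluation.
print(tf_vector("Free money! Free offer", ["free", "money", "meeting"]))
```

The resulting numeric vector is what a learner such as an SVM would receive as input, one dimension per retained attribute.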

Summary of in Vivo Evaluation : 

[Slide content not captured in the transcript.]

Post-Mortem Analysis: False Positives : 

52 false positives (out of 6732):
- 52%: Automatically generated messages (subscription verifications, virus warnings, etc.)
- 22%: Very short messages (3-5 words in the message body), along with attachments and hyperlinks
- 26%: Short messages (1-2 lines), written in casual style, often exploited by spammers, with no attachments or hyperlinks

Post-Mortem Analysis: False Negatives : 

173 false negatives (out of 6732):
- 30%: “Hard spam”: little textual information, avoiding common suspicious word patterns; many images and hyperlinks; tricks to confuse tokenizers
- 8%: Advertisements of pornographic sites with very casual and well-chosen vocabulary
- 23%: Non-English messages, under-represented in the training corpus
- 30%: Encoded messages in BASE64 format, which Filtron could not process at that time
- 6%: Hoax letters: long formal letters (“tremendous business opportunity!”) with many occurrences of the receiver’s full name
- 3%: Short messages with unusual content
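As a quick sanity check on the two post-mortem totals, the overall error rate can be worked out as below. This assumes that 6732 is the total number of messages received during the live evaluation, which the "out of 6732" wording suggests but the transcript does not state outright; per-class rates cannot be derived because the legitimate/spam split of the incoming mail is not given here.

```python
TOTAL_MESSAGES = 6732    # assumed to be all messages seen in vivo
FALSE_POSITIVES = 52     # legitimate messages flagged as spam
FALSE_NEGATIVES = 173    # spam messages that passed the filter

errors = FALSE_POSITIVES + FALSE_NEGATIVES
error_rate = errors / TOTAL_MESSAGES
accuracy = 1.0 - error_rate

print(f"errors: {errors}")
print(f"error rate: {error_rate:.2%}")
print(f"accuracy: {accuracy:.2%}")
```

With λ = 1 both error types weigh the same, so under this assumption the plain accuracy is also the weighted accuracy for the scenario used in the live evaluation.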

Conclusions : 

- Signs of an arms race between spammers and content-based filters
- Filtron’s performance deemed satisfactory, though it can be improved with:
  - More elaborate preprocessing to tackle usual countermeasures of spammers (misspellings, uncommon words, text on images)
  - Regular retraining
- Currently most promising approach: combination of different filtering approaches along with machine learning:
  - Collaborative filtering
  - Filtering at the transport layer level
  - …