2007 Fiscal Year Final Research Report Summary
Development of a Filter of Unsolicited Bulk E-mail based on language independent method
Project/Area Number |
18500072
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Media informatics/Database
|
Research Institution | University of Tsukuba |
Principal Investigator |
SAKAGUCHI Tetsuo University of Tsukuba, Graduate School of Library, Information and Media Studies, Associate Professor (10225790)
|
Co-Investigator(Kenkyū-buntansha) |
SUGIMOTO Shigeo University of Tsukuba, Graduate School of Library, Information and Media Studies, Professor (40154489)
NAGAMORI Mitsuharu University of Tsukuba, Graduate School of Library, Information and Media Studies, Associate Professor (60272209)
|
Project Period (FY) |
2006 – 2007
|
Keywords | electronic mail / spam / multilingual text processing / automatic classification / unsolicited bulk e-mail / Unicode |
Research Abstract |
In recent years, the growth of unsolicited bulk e-mail (UBE) has become a major problem on the Internet. One of the main ways to reduce UBE is the spam filter, which classifies e-mail automatically by learning the characteristics of messages. However, such filters are usually language dependent, because they rely on morphological analyzers for specific languages to extract words from messages. As a result, their classification accuracy is weak for e-mail written in languages that their morphological analyzers do not support. This research develops a spam filter based on a language-independent method. Instead of using morphological analyzers for specific languages, the filter extracts message characteristics in a way that does not depend on the language. In 2006, we developed methods that extract fixed-length character strings from messages. Evaluating their accuracy revealed a disadvantage for languages written with phonetic characters, such as English. In 2007, we therefore developed a method that extracts variable-length character strings from messages based on the character properties defined in the Unicode Standard. This method is more accurate than the previous methods, especially on an English e-mail corpus. Through this research, we also identified a further problem in building corpora for evaluating spam filters. A corpus must contain both UBE and non-UBE, but non-UBE messages are hard to collect because they usually contain private information. This problem affects the evaluation of spam filters in the anti-spam research community.
|
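The two extraction approaches named in the abstract can be illustrated with a minimal Python sketch. This is not the project's implementation: the function names (`char_ngrams`, `char_class`, `property_tokens`), the n-gram length of 2, and the rule of cutting a string wherever the Unicode general category or the first word of the character name changes are all assumptions made for illustration. The report does not state which character properties were used or how strings were delimited.

```python
import unicodedata


def char_ngrams(text, n=2):
    """Fixed-length extraction: overlapping character n-grams (n is an assumed value)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_class(ch):
    """Coarse character class derived from Unicode data (an assumed heuristic).

    Letters are labelled with the first word of their Unicode name
    (e.g. LATIN, CJK, HIRAGANA) as a rough stand-in for the script;
    other characters use the first letter of their general category.
    """
    category = unicodedata.category(ch)
    if category.startswith("L"):
        name = unicodedata.name(ch, "")
        return "L:" + name.split(" ")[0]
    return category[0]


def property_tokens(text):
    """Variable-length extraction: maximal runs of characters sharing one class.

    Only runs of letters or numbers are kept; whitespace, punctuation,
    and symbols act as separators and are dropped.
    """
    tokens = []
    run = ""
    run_class = None
    for ch in text:
        cls = char_class(ch)
        if cls == run_class:
            run += ch
            continue
        if run and (run_class.startswith("L:") or run_class == "N"):
            tokens.append(run)
        run = ch
        run_class = cls
    if run and (run_class.startswith("L:") or run_class == "N"):
        tokens.append(run)
    return tokens


if __name__ == "__main__":
    print(char_ngrams("spam", 2))
    print(property_tokens("Buy now 今すぐ購入"))
```

Under these assumptions, the fixed-length variant splits an English word into fragments that cross no natural boundary, while the property-based variant keeps each run of Latin letters whole and still segments unspaced Japanese text at script changes, without any language-specific analyzer.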