Predicting web spam with HTTP session information

Steve Webb, James Caverlee, Calton Pu. DOI: 10.1145/1458082.1458129

Citations: 7
Web spam is a widely-recognized threat to the quality and security of the Web. Web spam pages pollute search engine indexes, burden Web crawlers and Web mining services, and expose users to dangerous Web-borne malware. To defend against Web spam, most previous research analyzes the contents of Web pages and the link structure of the Web graph. Unfortunately, these heavyweight approaches require full downloads of both legitimate and spam pages to be effective, making real-time deployment of these techniques infeasible for Web browsers, high-performance Web crawlers, and real-time Web applications. In this paper, we present a lightweight, predictive approach to Web spam classification that relies exclusively on HTTP session information (i.e., hosting IP addresses and HTTP session headers). Concretely, we built an HTTP session classifier based on our predictive technique, and by incorporating this classifier into HTTP retrieval operations, we are able to detect Web spam pages before the actual content transfer. As a result, our approach protects Web users from Web-propagated malware, and it generates significant bandwidth and storage savings. By applying our predictive technique to a corpus of almost 350,000 Web spam instances and almost 400,000 legitimate instances, we were able to successfully detect 88.2% of the Web spam pages with a false positive rate of only 0.4%. These classification results are superior to previous evaluation results obtained with traditional link-based and content-based techniques. Additionally, our experiments show that our approach saves an average of 15.4 KB of bandwidth and storage resources for every successfully identified Web spam page, while only adding an average of 101 µs to each HTTP retrieval operation. Therefore, our predictive technique can be successfully deployed in applications that demand real-time spam detection.
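The retrieval flow described in the abstract, where a session is classified before any content is transferred, might be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_unless_spam`, `SessionClassifier`, `toy_classifier`, and `fake_body` are hypothetical names, the IP addresses are documentation-range placeholders, and the toy rule stands in for a trained model; it is not the classifier from the paper.

```python
from typing import Callable, Dict, Optional

Headers = Dict[str, str]
# Hypothetical classifier interface: returns True when a session looks like spam.
SessionClassifier = Callable[[str, Headers], bool]


def fetch_unless_spam(
    ip: str,
    headers: Headers,
    classify: SessionClassifier,
    read_body: Callable[[], bytes],
) -> Optional[bytes]:
    """Return the page body, or None if the session is classified as spam.

    `read_body` is called only for sessions that pass the classifier, so a
    rejected page costs no content transfer.
    """
    if classify(ip, headers):
        return None
    return read_body()


def toy_classifier(ip: str, headers: Headers) -> bool:
    """Placeholder rule, NOT the paper's model: flags one made-up IP prefix."""
    return ip.startswith("203.0.113.")


calls = []  # records how many times the body was actually read


def fake_body() -> bytes:
    """Stand-in for the content transfer that a real client would perform."""
    calls.append(1)
    return b"<html>ok</html>"


blocked = fetch_unless_spam("203.0.113.7", {"Server": "Apache"}, toy_classifier, fake_body)
allowed = fetch_unless_spam("198.51.100.4", {"Server": "nginx"}, toy_classifier, fake_body)
```

Passing the body reader in as a callable keeps the point of the approach visible: the download is deferred until the classifier has seen the hosting IP address and the response headers.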
    • ...Earlier lightweight classifiers include Webb et al. [44] describing a procedure based solely on the HTTP session information...
    • ...Furthermore in [44] the IP address is a key feature that is trivially incorporated in the DC2010 data set by placing all hosts from the same IP into the same training or testing set...

    Miklós Erdélyi et al. Web spam classification: a few features worth more

    • ...As an example of efficient text mining, [27] Webb et al define an effective algorithm for spam detection looking only into mails headers...

    Zinayida Petrushyna et al. Designing during use: Modeling communities of practice

    • ...Security-oriented traffic analysis has caught much attention from both the network and security communities, including malware or botnet characterization [8, 21, 23, 31] and privacy-preserving routing and packet trace anonymization [17‐19]...

    Huijun Xiong et al. User-Assisted Host-Based Detection of Outbound Malware Traffic

    • ...To further assess the performance of our Web spam-detection approach we compare our accuracy ratios with the ones generated by the spam-prediction method introduced in [29] using another nine different subsets of documents...
    • ...approach, which relies on the content and structure of Web pages to detect spam documents, the spam-prediction method in [29] relies on HTTP session information...
    • ...The prediction method analyzes hosting IP addresses, as well as HTTP session headers, such as “Content-Type”, “Server”, “X-Powered-By”, “Content-Language”, or “Pragma”, to train classification algorithms, such as C4.5, HyperPipes, Logistic regression, or Support Vector Machine (SVM), to identify (non-)spam Web pages [29]...
    • ...To perform a compatible evaluation between our spam-detection approach and the one in [29], each of the nine subsets with the corresponding percentage of spam Web pages used for evaluation contains 1,486 Web documents from the WEBSPAM-UK2006 dataset, the same number of pages used for conducting the evaluation measures in [29]...
    • ...As stated in [29], the classifier’s performance is relative consistent for subsets that contain between 30% and 70% spam Web pages, but varies considerably in the extremes...
    • ...Figure 12: Accuracy ratios achieved by our spam-detection approach and the Web spam-prediction method in [29] on different corpus samples extracted from the WEBSPAM-UK2006 dataset...
    • ...Furthermore, the overall accuracy of our spam-detection approach (as shown in Figure 12) is higher than the averaged accuracy of the approach in [29] by 3%. Since the spam-prediction method in [29] performs better than our approach for collections of Web pages with low percentages of spam, i.e., below 40%, we could consider using the HTTP session information, in addition to our content- and structure-based analysis approach in classifying ...

    Maria Soledad Pera et al. A structural, content-similarity measure for detecting spam documents ...
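The citation context above names the inputs to the prediction method: hosting IP addresses and headers such as "Content-Type", "Server", "X-Powered-By", "Content-Language", and "Pragma". One possible way to turn a session into categorical features for a classifier is sketched below. The function name `session_features`, the `<absent>` marker, and the IPv4 prefix features are assumptions made for illustration; the source does not say how the IP address or the header values were encoded.

```python
from typing import Dict

# Header names taken from the quoted citation context.
HEADER_FIELDS = ["Content-Type", "Server", "X-Powered-By", "Content-Language", "Pragma"]


def session_features(ip: str, headers: Dict[str, str]) -> Dict[str, str]:
    """Map a hosting IPv4 address and response headers to categorical features."""
    # HTTP header names are case-insensitive, so compare them in lowercase.
    lowered = {name.lower(): value.strip().lower() for name, value in headers.items()}
    features = {field: lowered.get(field.lower(), "<absent>") for field in HEADER_FIELDS}

    # Assumption: IPv4 prefixes as features; the source only says
    # "hosting IP addresses" without giving an encoding.
    octets = ip.split(".")
    features["ip/8"] = octets[0]
    features["ip/16"] = ".".join(octets[:2])
    features["ip/24"] = ".".join(octets[:3])
    return features


example = session_features(
    "192.0.2.10",
    {"server": "Apache/2.2", "Content-Type": "text/html; charset=UTF-8"},
)
```

A dictionary like `example` could then be one-hot encoded and passed to any of the classifier families the snippet mentions (C4.5, HyperPipes, logistic regression, SVM); that step is left out here because it depends on the chosen library.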
