Barcelona

Introduction to Web Mining

Bettina Berendt
2 and 4 March 2010

UPF course on Information Retrieval and Data Mining


http://www.cs.kuleuven.be/~berendt/WebMining10


Slides

Introduction to Web mining, or: From IR to KD - Transitions 1 and 2 (PPT)

Introduction to Web text mining, focusing specifically on blog mining (PPT 1)

Things that I only covered briefly and/or informally during the discussion:
More on blog mining (PPT 2)
Transition 3: The questions change - opinion mining (PDF - thanks to Mathias Verbeke)
Introduction to Web usage mining (PPT 1, PPT 2)
Transition 4: The material changes (2) - "story tracking" (PPT - pp. 31 ff.)
New challenges (PPT - pp. 41 ff.)

Literature and sources mentioned in the slides

Two excellent general introductions/overviews: 
Baldi, P., Frasconi, P., & Smyth, P. (2003). Modeling the Internet and the Web. Probabilistic Methods and Algorithms. Chichester, UK: John Wiley & Sons. (chapter on text mining, slides: PPT)
Bing Liu (2006). Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). Springer. (Book homepage)

Our clustering / ontology learning tool for literature search:
Berendt, B., Krause, B., & Kolbe-Nusser, S. (in press). Intelligent scientific authoring tools: Interactive data mining for constructive uses of citation networks. To appear in Information Processing & Management. (PDF of last version before proofs)
Verbeke, M., Berendt, B., & Nijssen, S. (2009). Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search. In G. Boato & C. Niederee (Eds.), Proceedings of First International Workshop on Living Web, collocated with the 8th International Semantic Web Conference (ISWC-2009), Washington D.C., USA, October 26, 2009. CEUR Workshop Proceedings Vol-515. (PDF, presentation: PPT)

Two nice and thought-provoking examples of text classification applied to blogs and news:
Mihalcea, R. & Liu, H. (2006). A corpus-based approach to finding happiness. In Proceedings of the AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs. (PDF)
Liu, H. & Mihalcea, R. (2007). Of men, women, and computers: Data-driven gender modeling for improved user interfaces. In Proc. of the International Conference on Weblogs and Social Media. (PDF)

The overview of Opinion Mining is based on Bing Liu's book (see above).

Web usage mining work:
Teltzrow, M., & Berendt, B. (2003). Web-Usage-Based Success Metrics for Multi-Channel Businesses. In Proceedings of the WebKDD 2003 Workshop - Webmining as a Premise to Effective and Intelligent Web Applications. August 27th, 2003, Washington DC, USA. Held in conjunction with The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (PDF)
More details in: Teltzrow, M. (2005). A quantitative analysis of e-commerce - channel conflicts, data mining, and consumer privacy. PhD Dissertation, Institute of Information Systems, Humboldt University Berlin. (HTML, PDF)
Berendt, B. & Spiliopoulou, M. (2000). Analysis of navigation behaviour in web sites integrating multiple information systems. The VLDB Journal, 9, 56-75. (PDF)

Our story tracking work:
Subašić, I. & Berendt, B. (2009). Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowledge and Information Systems. DOI: 10.1007/s10115-009-0227-x (PDF)
and other work by the same authors (see my homepage)

Privacy:
Sweeney, L. (2002). k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10 (5), 557-570.

Frankowski, D., Cosley, D., Sen, S., Terveen, L.G., Riedl, J. (2006). You are what you say: privacy risks of public mentions. In SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, August 6-11, 2006 (pp. 565–572). ACM. (PDF)
Barbaro, M. & Zeller, T. (2006). A face is exposed for AOL searcher no. 4417749. New York Times, 9 August 2006. (HTML)

All other sources should be retrievable from the information in the slides - please let me know if I overlooked something!

Major conferences and workshops

All KDD ("data mining") conferences (SIGKDD, PKDD, SIAM Conf. Data Mining, PAKDD) have interesting papers on Web mining and dedicated workshops. Check


Further resources: Tools (Open source and/or free)

Analog
WUMprep and WUMprep4WEKA (how to use it: Gebhard Dettmar. Logfile Preprocessing Using WUMprep. Talk given at the Web Mining Seminar in Winter semester 2003/04, School of Business and Economics, Humboldt University Berlin, Berlin, 2003. PDF)
WUM
WEKA
TextGarden and DocumentAtlas
optional, for single-session visualization: ISM (for the course; general Web page)
see the directory at www.kdnuggets.com

Questions? Comments?

Please talk or write to me!




last updated on 2010-03-10 by Bettina Berendt; URL of this page: http://www.cs.kuleuven.be/~berendt/WebMining10/index.html