This project aims to build a search engine that beats Google :p
While Google does a great job of returning the most relevant HTML pages for a set of query keywords, it cannot help you when the keywords you searched for are spread across more than one page, even when those pages are closely connected.
In this project, we will look into a few alternatives that can address this problem. One of the challenges is the scalability of the solution, since it must handle millions of web pages.
This project is definitely practical, and requires good programming experience and a lot of passion.
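To make the problem above concrete, here is a minimal Python sketch of one possible approach: report single pages that contain every keyword, plus pairs of directly linked pages that only cover the keywords jointly. The function name `find_information_units`, the in-memory `pages` and `links` structures, and the restriction to pairs are all assumptions made for illustration; they are not taken from the referenced papers, and a real system would need an index that scales to millions of pages.

```python
def tokenize(text):
    """Split text into a set of lowercase words (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def find_information_units(pages, links, keywords):
    """Return page groups that together contain every keyword.

    pages:    dict mapping a page id to its text (hypothetical in-memory corpus)
    links:    list of (source, target) page-id pairs representing hyperlinks
    keywords: iterable of query words
    """
    wanted = {k.lower() for k in keywords}
    words = {page_id: tokenize(text) for page_id, text in pages.items()}
    results = []

    # Single pages that already satisfy the query on their own.
    for page_id, page_words in words.items():
        if wanted <= page_words:
            results.append((page_id,))

    # Linked pairs that satisfy the query only when taken together.
    for src, dst in links:
        if src not in words or dst not in words:
            continue
        a, b = words[src], words[dst]
        if wanted <= (a | b) and not wanted <= a and not wanted <= b:
            results.append((src, dst))

    return results


pages = {
    "a": "cheap flights to sydney",
    "b": "sydney hotel deals",
    "c": "weather report",
}
links = [("a", "b"), ("b", "c")]
units = find_information_units(pages, links, ["flights", "hotel"])
print(units)
```

In this toy corpus no single page mentions both "flights" and "hotel", so a page-at-a-time search returns nothing, while the linked pair `("a", "b")` covers the query.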
References:
Retrieving and Organizing Web Pages by “Information Unit”. (Available at http://wwwconf.ecs.soton.ac.uk/archive/00000025/)
Using micro information units for internet search. (Available at http://www.cs.uic.edu/~liub/publications/cikm-2002.pdf)
This project aims to build a system that provides easy-to-digest search results for e-Commerce web sites.
Background: Consider a search on an online real estate web site. Typically, a large number of results will be returned. Can we process the results so that they become easier to browse? One idea is to “rank” them according to the “preferences” of the users.
In this project, you will learn and implement some of the latest technology in web search and ranking, implement a web site crawler (i.e., a bot), and experiment with various solutions on real datasets.
This project requires good programming skills. Knowledge of databases and data mining/machine learning will be a plus.
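As a rough illustration of ranking results by user preferences, the sketch below scores each listing with a weighted sum of min-max normalised attributes. The attribute names (`price`, `bedrooms`), the weights, and the function `rank_listings` are hypothetical; this is a simple baseline for experimentation, not the method of either referenced paper.

```python
def rank_listings(listings, weights):
    """Order listings by a weighted sum of normalised attribute values.

    listings: list of dicts with numeric attributes (hypothetical schema)
    weights:  dict mapping attribute name to a preference weight; a negative
              weight means smaller values are preferred (e.g. price)
    """
    if not listings:
        return []

    def normalise(attr):
        # Min-max scale the attribute to [0, 1]; constant columns become 0.
        values = [item[attr] for item in listings]
        lo, hi = min(values), max(values)
        span = hi - lo
        return [(v - lo) / span if span else 0.0 for v in values]

    scores = [0.0] * len(listings)
    for attr, weight in weights.items():
        for i, value in enumerate(normalise(attr)):
            scores[i] += weight * value

    # Highest score first; sorted() is stable, so ties keep their input order.
    order = sorted(range(len(listings)), key=lambda i: scores[i], reverse=True)
    return [listings[i] for i in order]


listings = [
    {"id": "A", "price": 500000, "bedrooms": 2},
    {"id": "B", "price": 700000, "bedrooms": 4},
    {"id": "C", "price": 600000, "bedrooms": 3},
]

# A budget-conscious user versus a user who mainly wants space.
budget_first = [item["id"] for item in rank_listings(listings, {"price": -1.0, "bedrooms": 0.5})]
space_first = [item["id"] for item in rank_listings(listings, {"price": -0.2, "bedrooms": 1.0})]
print(budget_first, space_first)
```

The same result set is ordered differently for the two users, which is the basic effect the project is after; learning the weights from user behaviour is left open here.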
References:
Context-Sensitive Ranking. (Available at http://www.cs.helsinki.fi/u/terzi/rank.pdf)
Ordering the Attributes of Query Results. (Available at http://www.cs.fiu.edu/~vagelis/publications/horizontal.pdf)
This project aims to build a system that can detect near-duplicate documents (or tuples) in a large document repository (or database).
Near-duplicate documents cause serious problems in web search and enterprise document/knowledge management, and near-duplicate detection can also act as an indispensable module for automated text document processing. Finding duplicate tuples in databases is important to data integration applications too.
In this project, you will learn and implement some of the latest technology in near-duplicate object detection, implement the system, and experiment with various solutions on real datasets.
This project requires good programming skills and knowledge in data structures and algorithms.
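One common baseline for near-duplicate detection is to compare documents by the Jaccard similarity of their word shingles. The sketch below assumes that technique; the shingle size `k=3`, the `threshold` of 0.5, and the function names are illustrative choices, not part of the project specification.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) of a text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield one shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Return pairs of document ids whose shingle sets are similar enough.

    docs: dict mapping a document id to its text (hypothetical in-memory corpus)
    """
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(sets), 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]


docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumps over the lazy cat",
    "d3": "an entirely different sentence about databases",
}
pairs = near_duplicates(docs)
print(pairs)
```

Here `d1` and `d2` share 6 of their 8 distinct shingles (similarity 0.75), so they are reported as near duplicates. Comparing every pair is quadratic in the number of documents, so a large repository would need something more scalable, for example MinHash signatures with locality-sensitive hashing.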
References: