Disambiguating Identity Web References using Social Data
“I'm sorry, professor, but citizen Sharikov is absolutely correct. He has a right to take part in a discussion about his affairs, especially as it's about his identity document. An identity document is the most important thing in the world.”
Taken from “The Heart of a Dog” by Mikhail Bulgakov
The World Wide Web has evolved into an interactive information network, allowing web users to collaborate and share information on a massive scale. A large amount of the information now available on the World Wide Web is personal information, which has been disseminated either voluntarily (e.g., a personal web page or profile page) or involuntarily (e.g., a telephone directory or electoral register). The sensitive nature of personal information and its widespread visibility have led to a rise in malevolent web practices such as lateral surveillance and identity theft. Given this context, it is important to investigate accurate and efficient automated techniques that allow web users to detect the web resources containing their personal information without the need for manual processing of web content.
The process of detecting identity web references consists of searching the Web for possible web citations and then performing disambiguation to identify the true citations for a given person. This thesis presents three distinct disambiguation techniques, each of which fuses Semantic Web technologies with existing state-of-the-art techniques: Inference Rules built from seed data, Random Walks over a graph-space built from metadata models, and Self-training machine learning classifiers combined with semantic graph matching techniques. The notion of a Random Walk, also referred to as the Drunk's Walk, is contextualised in the figure below. A detailed evaluation of these techniques is presented using a real-world dataset, with respect to several baseline measures including human processing.
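To make the Random Walk idea concrete, the following is a minimal sketch of an unbiased walk over a small graph that links web resources to the metadata values they mention. It is an illustration under stated assumptions, not the technique developed in the thesis: the toy graph, its node names, and the helpers `random_walk` and `visit_frequencies` are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical toy graph (an assumption for illustration only): resource
# nodes are linked to the metadata values they mention, in both directions.
GRAPH = {
    "seed_profile": ["name:J. Smith", "org:Example University"],
    "page_a": ["name:J. Smith", "org:Example University"],
    "page_b": ["name:J. Smith"],
    "name:J. Smith": ["seed_profile", "page_a", "page_b"],
    "org:Example University": ["seed_profile", "page_a"],
}


def random_walk(graph, start, steps, rng):
    """Return the nodes visited by a simple unbiased random walk."""
    path = [start]
    for _ in range(steps):
        # Move to a uniformly chosen neighbour of the current node.
        path.append(rng.choice(graph[path[-1]]))
    return path


def visit_frequencies(graph, start, steps, walks, seed=0):
    """Estimate how often each node is visited over many walks from `start`."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(walks):
        # Skip the starting node itself so that only visited steps are counted.
        counts.update(random_walk(graph, start, steps, rng)[1:])
    total = sum(counts.values())
    return {node: counts[node] / total for node in graph}


if __name__ == "__main__":
    freqs = visit_frequencies(GRAPH, "seed_profile", steps=4, walks=2000)
    for node, freq in sorted(freqs.items(), key=lambda item: -item[1]):
        print(f"{node}: {freq:.3f}")
```

In this toy setting, a page that shares more metadata with the seed profile (`page_a`) tends to be visited more often than one that shares less (`page_b`), which is the intuition behind using walk statistics as a relatedness signal.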
To function effectively, automated disambiguation techniques require seed data, or background knowledge, describing a given person; however, producing this data manually is expensive. This thesis therefore explores how data leveraged from Social Web platforms can be used instead. To assess the suitability of such platforms, a user study is presented that measures the similarity between the identities that users create online and their real-world equivalents. The findings demonstrate a significant overlap and therefore suggest that data leveraged from the Social Web is suitable for use by automated techniques. Techniques are presented to leverage seed data from Social Web platforms such as Facebook, MySpace and Twitter. Semantic Web technologies, such as knowledge representation formats and ontologies, are used to describe the data in a machine-readable manner. Methodologies are also presented for constructing metadata models, again using Semantic Web technologies, from the web resources that are to be analysed.
The findings from this body of work demonstrate that automated disambiguation techniques can feasibly replace the manual processing of web content and facilitate the discovery of identity web references.