When: Thursday April 28th at 3:10pm
Where: 1131 Kemper Hall
PageRank is a well-known and widely applied algorithm originally devised in 1998 by Larry Page and Sergey Brin for document retrieval. In this talk we will show how to adapt this algorithm for record deduplication in databases, for entity disambiguation between a textual document and a knowledge base, and for linking textual entity mentions. The task of record deduplication is to merge databases while correctly identifying matching entities. Scalability becomes the major issue for this task when dealing with industrial-size datasets. We present a scalable technique based on Personalized PageRank that efficiently handles databases of 110M and 203M entities. The goal of named entity disambiguation is to map entity mentions in a document to their correct entries in a knowledge base. Our graph-based approach to this problem combines local and global information for disambiguation and effectively filters out noise introduced by incorrect candidates. Experiments show that our method performs competitively on a dataset of 27.8K named entity mentions. Finally, we will show how to combine entity disambiguation and paraphrase detection techniques for the entity linking task, where the goal is to cluster textual entity mentions that refer to the same real-world object.
Maria received her BSc and MS in Mathematics from Moscow Lomonosov University. She is currently completing her PhD in Natural Language Processing at the Courant Institute at New York University and expects to finish in May 2016. She is advised by Prof. Ralph Grishman. She worked at Google+ and Microsoft Research during her PhD, and has a patent application as a result of the latter internship. Maria taught graduate and undergraduate Calculus at Wentworth Institute of Technology, Boston, in 2007-2008, and was a teaching assistant for NLP and Algorithms courses at NYU in 2014-2015. She participated in the Knowledge Base Population competition (KBP-2014) held by the National Institute of Standards and Technology (NIST), and her disambiguation system was ranked 2nd among academic institutions in the US. Her research interests span different aspects of Information Extraction, such as Relation Extraction, Entity Disambiguation and Linking, and Paraphrase Detection.