Prasenjit Mitra

9TH – 11TH DECEMBER 2019

SEMINAR HALL – IIT DELHI, INDIA

TEXT MINING TECHNIQUES IN DIGITAL LIBRARIES

ASSOCIATE DEAN FOR RESEARCH AND PROFESSOR,
COLLEGE OF INFORMATION SCIENCES AND TECHNOLOGY,
THE PENNSYLVANIA STATE UNIVERSITY

9th Dec

12:15PM-1:00PM

Prasenjit Mitra is the Associate Dean for Research and Professor in the College of Information Sciences and Technology. His current research interests are in the areas of artificial intelligence, health informatics, big data analytics, applied machine learning, and visual analytics. In the past, he has contributed to the areas of data interoperation, data cleaning, and digital libraries, especially tabular data extraction and citation recommendation.

Mitra received his Ph.D. from Stanford University in 2004, his M.S. from the University of Texas at Austin in 1994, and a B.Tech. (Hons.) from the Indian Institute of Technology, Kharagpur in 1993. At Penn State, he has pursued research on a broad range of topics, including data mining on the web and social media, scalable data cleaning, political text mining, chemical formula and name extraction from documents, and the extraction of data and metadata from figures and tables in digital documents.

He was the principal investigator of the DOES project funded by the NSF CAREER Award. He has also been the co-principal investigator of the CiteSeerX, ChemXSeer, and ArchSeer digital library projects, the Regional Visualization and Analytics Center (NEVAC), and the GeoCAM visual analytics projects. Mitra serves as the director of the Cancer Informatics Initiative at Penn State. His research has been supported by the NSF, Microsoft Corporation, DoD, DHS, DoE, NGA, and DTRA. Mitra has co-authored approximately 180 articles in top conferences and journals. He has supervised more than 15 Ph.D. students and several M.S. students.

ABSTRACT

Digitization of libraries around the world has resulted in more and more of our knowledge being stored in digital format. Search engines enable us to find information more rapidly and with more precision than we could in libraries with physical books and catalogs. To enable more efficient and improved search, and eventually question answering, researchers have built text-mining-based tools. After a text document is digitized, text mining can be used to extract information from it and identify objects of interest; these digital objects are then indexed and made searchable. Even though video and audio files are increasingly being stored in digital libraries, very often they have some associated text that describes them. In cases where such metadata describing video and audio data is not available, modern technology allows us to transcribe the digital objects and then perform information extraction and other text mining tasks on the transcribed text. One of the successes of information extraction has been the identification of entities in text. Named entities are unique and are often a part of the queries that we use to refer to and find information. Additionally, we have extracted information from tables, figures, algorithms, etc. in digital libraries such as CiteSeerX and enabled searching for these extracted digital objects. The text in the caption of the tables, figures, and algorithms as well as referring text in the document that refers to ….