How people implicitly and explicitly group, what linguistic patterns they use, and what motivates them are all deeply interesting and long-standing questions. Industry and academic developers and researchers today have access to extensive information on people, but the data often lacks many of the core demographic and psychographic variables that pertain to many research questions and that drive some business functions (e.g., marketing). This is certainly true of social media profiles, which typically lack structured demographic information beyond names and locations, and even these are often incomplete or fabricated. As such, there has been a surge of academic and commercial interest in predicting values for gender, age, race, location, interests, personality, and more, given some portion of the information available in data about individuals, such as social profiles and customer records. In the past, efforts to study people were primarily localized to the researcher and the individuals they interacted with or requested surveys from. But today, these questions can be explored at massive scale, using the public and private digital exhaust we all create. Findings are no longer simply interpretive, but can also be translated into automated programs that analyze gender, personality, and more. Such programs are informed by research in natural language processing, computer vision, psychology, and related fields, and they can be used for positive, negative, and mixed ends. As researchers, we are arguably still waking up to this reality, and we cannot take a neutral stance regarding the potential benefits and harms of our work. We must grapple with hard questions around privacy rights and think actively and creatively about the wider societal implications and impacts of our work.
In my talk, I'll discuss specific practical and ethical aspects of such work in the context of text, graph, and image analysis for understanding demographics and psychographics, with an eye toward the potential for positive impact that reduces or minimizes risk to individuals.
Jason is a research scientist at Google, where he works on semantics, discourse and multilingual processing. He was previously an Associate Professor of Computational Linguistics at the University of Texas at Austin, and he co-founded People Pattern, a startup that delivers audience analytics for major brands. His main research interests include categorial grammars, parsing, semi-supervised learning for NLP, reference resolution and text geolocation. He has long been active in the creation and promotion of open source software for natural language processing: he is one of the co-creators of the Apache OpenNLP Toolkit and OpenCCG, and he has contributed to many others, including ScalaNLP, Junto, and TextGrounder. Jason received his Ph.D. from the University of Edinburgh in 2002, where his doctoral dissertation on Multimodal Combinatory Categorial Grammar was awarded the 2003 Beth Dissertation Prize from the European Association for Logic, Language and Information.