Improving Inference from Non-Random Data for Social Science Research

Project Description

New types of data from digital traces and new forms of access to administrative data make it possible to observe individual and social behavior, as well as changes in behavior, at high frequency and in real time. The caveat with these new data is that their quality is usually unknown. At the same time, traditional survey data collection faces rising costs, and many social science researchers are tempted to replace expensive probability-based surveys with less expensive data collections. These are often cheaper because they rely on volunteer samples drawn from unknown populations, with unknown selection mechanisms and unknown inclusion probabilities. Misrepresentation of societal groups in digital trace data and other alternative data sources can severely limit the utility of such data for deriving valid inferences and accurate predictions about a given target population. The usefulness of alternative data sources thus depends on the effectiveness of bias mitigation techniques that correct for self-selection. This research project combines methodology from social science and computer science to account for misrepresentation in data, and it develops and compares pseudo-weighting and post-processing techniques to improve inference from various data sources.
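
To make the idea of correcting for misrepresentation concrete, the following minimal Python sketch reweights a volunteer sample so that its group shares match known population shares. It uses simple post-stratification as a stand-in and is not the method developed in this project. The group labels, outcome values, population shares, and function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per sample unit: population share / sample share of its group."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_mean(values, weights):
    """Weighted average of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical volunteer sample in which "young" respondents are over-represented.
sample_groups = ["young"] * 8 + ["old"] * 2
outcomes = [1.0] * 6 + [0.0] * 2 + [1.0, 0.0]

# Assumed (hypothetical) group shares in the target population.
population_shares = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(sample_groups, population_shares)
naive = sum(outcomes) / len(outcomes)
adjusted = weighted_mean(outcomes, weights)

print(f"unweighted mean: {naive:.3f}")
print(f"weighted mean:   {adjusted:.3f}")
```

In this toy example the unweighted mean is 0.7, while the weighted mean is 0.625, the value obtained when both groups count equally. An adjustment of this kind only removes bias if selection depends on nothing beyond the grouping variable, which is one reason more flexible weighting and post-processing techniques are of interest.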


  • Kim, M. P., Kern, C., Goldwasser, S., Kreuter, F. and Reingold, O. (2022). Universal Adaptability: Target-Independent Inference that Competes with Propensity Scoring. Proceedings of the National Academy of Sciences of the United States of America (PNAS) 119(4).
  • Kern, C., Li, Y., and Wang, L. (2020). Boosted Kernel Weighting – Using Statistical Learning to Improve Inference From Nonprobability Samples. Journal of Survey Statistics and Methodology.