Data are not only ubiquitous in society, but are increasingly complex in both size and dimensionality. Dimension reduction offers researchers and scholars the ability to make such complex, high dimensional data spaces simpler and more manageable. My book offers readers a suite of modern unsupervised dimension reduction techniques, along with hundreds of lines of R code, to efficiently represent the original high dimensional data space in a simplified, lower dimensional subspace. Launching from the earliest dimension reduction technique, principal components analysis, and using real social science data, I introduce and walk readers through the application of the following techniques: locally linear embedding, t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection, self-organizing maps, and deep autoencoders. The result is a well-stocked toolbox of unsupervised, nonparametric methods for tackling the complexities of high dimensional data so common in modern society. Replication code is here.
In our book, we introduce the Tidy approach to programming in R for social science research to help quantitative researchers develop a modern technical toolbox. We include hundreds of lines of code to demonstrate a suite of techniques for developing and debugging an efficient social science research workflow. To deepen our dedication to teaching Tidy best practices for conducting social science research in R, we include numerous examples using real-world data, including the American National Election Study and the World Indicators Data. Companion site, extra resources, and replication code are here.
In the age of data-driven problem-solving, applying sophisticated computational tools for explaining substantive phenomena is a valuable skill. Yet, application of methods assumes an understanding of the data, structure, and patterns that influence the broader research program. My book offers researchers and teachers an introduction to clustering, which is a prominent class of unsupervised machine learning for exploring and understanding latent, non-random structure in data. A suite of widely used clustering techniques is covered, in addition to R code and real data to facilitate interaction with the concepts. Upon setting the stage for clustering, the following algorithms are detailed: agglomerative hierarchical clustering, k-means clustering, Gaussian mixture models, and, at a higher level, fuzzy C-means clustering, DBSCAN, and partitioning around medoids (k-medoids) clustering.
Comment on Data Fission: Splitting a Single Data Point. Forthcoming, Journal of the American Statistical Association
The Paradox of Algorithms and Blame on Public Decision-makers (with Ryan Kennedy and Adam Ozer). 2024. Business and Politics
A Batch Process for High Dimensional Imputation. 2023. Computational Statistics
Where are we going with statistical computing? From mathematical statistics to collaborative data science (with Dominique Makowski). 2023. Mathematics, 11(8)
Trust in Public Policy Algorithms (with Ryan Kennedy and Matthew Ward). 2022. Journal of Politics, 84(2)
The Role of Personality in Trust in Public Policy Automation (with Ryan Kennedy). 2022. Journal of Behavioral Data Science, 2(1)
Pursuing Open-Source Development of Predictive Algorithms: The Case of Criminal Sentencing Algorithms (with Alec MacMillen). 2022. Journal of Computational Social Science, 5, 89–109
Uncovering the Online Social Structure Surrounding COVID-19 (with Robert Y. Shapiro, Sam Frederick, and Ming Gong). 2021. Journal of Social Computing, 2(2)
see: An R Package for Visualizing Statistical Models (with Daniel Lüdecke, Mattan S. Ben-Shachar, Indrajeet Patil, Brenton M. Wiernik, and Dominique Makowski). 2021. Journal of Open Source Software, 6(64)
A Computational Exploration of the Evolution of Governmental Policy Responses to Epidemics Before and During the Era of COVID-19. 2021. The Year in C-SPAN Archives Research, Vol. 7, Edited Volume from Purdue University Press
Applying Dimension Reduction in Modern Data Science and Quantitative Analysis. 2021. Software Impacts, 8, 100075
performance: An R Package for Assessment, Comparison and Testing of Statistical Models (with Daniel Lüdecke, Mattan Ben-Shachar, Indrajeet Patil, and Dominique Makowski). 2021. Journal of Open Source Software, 6(60)
Pandemic Policymaking. 2021. Journal of Social Computing, 2(1)
Community Detection in Google Searches Related to ‘Coronavirus’. 2021. Journal of Data Science, 19(2)
Are there Racial Disparities in Fatal Police Shootings? Exploration with Uniform Manifold Approximation and Projection. In the Book of Abstracts of the 9th International Conference on Complex Networks and their Applications (COMPLEX NETWORKS 2020)
Exploring Ideological Signals from Cosponsorship (with Carol Ann Downes). 2020. Journal of Mathematical Sociology, 45(4)
Measuring Media Freedom: An Item Response Theory Analysis of Existing Indicators (with Jonathan Solis; replication data here). 2020. British Journal of Political Science, 51(4)
Exploring the Effects of Allegations of Sexual Misconduct on Political Careers (with Andrew Creekmore). 2020. Social Science Journal
The Shape of and Solutions to the MTurk Quality Crisis (with Scott Clifford, Ryan Kennedy, Tyler Burleigh, Ryan Jewell, and Nick Winter). 2020. Political Science Research and Methods, 8(4)
Exploring and Comparing Unsupervised Clustering Algorithms (with Marc Lavielle). 2020. Journal of Open Research Software, 8(21)
A Simple Method for Purging Mediation Effects. 2020. Journal of Statistical Theory and Practice, 14(25)
Big Data and Trust in Public Policy Automation (with Ryan Kennedy, Hayden Le, and Myriam Shiran). 2019. Statistics, Politics, and Policy, 10(2)
insight: A Unified Interface to Access Information from Model Objects in R (with Daniel Lüdecke and Dominique Makowski). 2019. Journal of Open Source Software, 4(38), 1412
Detecting Fraud in Online Surveys by Tracing, Scoring, and Visualizing IP Addresses (with Ryan Kennedy and Scott Clifford). 2019. Journal of Open Source Software, 4(37), 1285
Do Constituents Influence Issue-Specific Bill Sponsorship? 2019. American Politics Research, 47(4)
The hhi Package: Streamlined Calculation and Visualization of Herfindahl-Hirschman Index Scores. 2018. Journal of Open Source Software, 3(28), 828
Exploring Responsiveness in the U.S. House of Representatives. 2018. American Review of Politics, 36(2)
The Cost of Doing Business: Congressional Requests, Cost, and Allocation of Presidential Resources (with Brandon Rottinghaus). 2018. Political Research Quarterly, 71(4)
Informal Politics: The Role of Congressional Letters in Legislative Behavior and Careers. 2017. Journal of Legislative Studies, 23(3)
Assessing the Value of State Legislative Experience and Legislative Professionalism in National Election Performance, 1974-2010. 2017. Social Science Journal, 54(3)
Are Samples Drawn from Mechanical Turk Valid for Research on Political Ideology? (with Scott Clifford and Ryan M. Jewell). 2015. Research & Politics, 2(4)
Democratic Development of Algorithms (with Robert Y. Shapiro)
Accuracy and Fairness in Public Policy Algorithms: A Conjoint Experiment on Policy Preferences (with Myriam Shiran, Ryan Kennedy and Adam Ozer)
Ethical Considerations for Developing and Using Artificial Intelligence in Statistical Practice. (with Members of the ASA Committee on Professional Ethics), Amstat News
Indexing from 0? The Value of R and Python for Data Science. American Statistical Association CCD Portfolio Project Blog
Batched Imputation for High Dimensional Missing Data Problems. R-Bloggers
Measuring Aggregate Policy Priorities of the U.S. House of Representatives. The Political Methodologist, Blog of the Society for Political Methodology
Tidy Visualization of Mixture Models in R. R-Bloggers
The Importance of Mass Trust in Validating Algorithms in a Public Space. eLetter responding to “The accuracy, fairness, and limits of predicting recidivism” by Julia Dressel and Hany Farid (with Ryan Kennedy), Science Advances 4(1)
A New Release of rIP (v1.2.0) for Detecting Fraud in Online Surveys. R-Bloggers
How Venezuela’s Economic Crisis is Undermining Social Science Research – about Everything (with Ryan Kennedy, Scott Clifford, Tyler Burleigh, and Ryan Jewell), The Washington Post, Monkey Cage
Advice to Young (and Old) Programmers: A Conversation with Hadley Wickham. R-Bloggers
Constituents have Minimal Influence on their Legislators’ Policy Priorities. London School of Economics (LSE) USAPP Politics and Policy Blog
A New Package (hhi) for Quick Calculation of Herfindahl-Hirschman Index scores. R-Bloggers
Do constituents influence the work of legislators? R Street Institute’s LegBranch Blog
Introducing purging: An R package for addressing mediation effects. R-Bloggers
Applying Regularization for Feature and Theory Selection (with Joanna Schroeder)
Tidy Tools for Visualizing Mixture Models (with Fong Chan and Lu Zhang)
Extending Uniform Manifold Approximation and Projection for Anomaly Detection (with Leland McInnes)
Bypassing Limitations of Probability Models in Binary Classification Tasks with Support Vector Machines (with Abhishek Pandit and Lynette Dang)