# Research activities

My work is listed in detail below. My Google Scholar page has a more compact list.

## Recent pre-prints

*Submitted, 2020+*

*Wang, Y. S., Lee, C. J., West, J. D., Bergstrom, C. T., Erosheva, E. A.*

We test the hypothesis that the gender homophily observed in the existing scholarly corpus can be explained by the varying gender representation and collaborative norms across intellectual communities. Our analysis explicitly accounts for heterogeneity in gender representation across a very fine-grained partitioning of the scholarly landscape. Ignoring this heterogeneity would inflate the estimated effect of gender on co-authorship formation. We find that, even after accounting for heterogeneous gender representation across intellectual communities, co-authorship formation is not gender-blind across wide swaths of the scientific landscape.

*Submitted, 2020+*

*Zhao, B., Wang, Y. S., Kolar, M.*

We extend the work from our 2019 NeurIPS paper on direct estimation of functional graphical models to the setting where each random function is observed with noise at discrete time points.

*Submitted, 2020+*

*Wang, Y. S., Drton, M.*

We consider data that arise from a linear structural equation model (SEM) in which the idiosyncratic errors are allowed to be dependent in order to capture possible latent confounding. We show that under certain restrictions on the latent confounding and when the errors are non-Gaussian, the exact causal structure, not merely an equivalence class, can be consistently recovered from purely observational data when the graph corresponding to the SEM is bow-free and acyclic.

## Peer reviewed statistics publications

*ICML, 2021*

*Wang, Y. S., Lee, S. K., Toulis, P., Kolar, M.*

We propose a residual randomization procedure designed for robust Lasso-based inference in the high-dimensional setting. Compared to earlier work that focuses on sub-Gaussian errors, the proposed procedure is designed to work robustly in settings that also include heavy-tailed covariates and errors.

*Biometrika, 2020*

*Wang, Y. S., Drton, M.*

It has been previously shown that when the variable-specific error terms are non-Gaussian, the exact causal graph of a linear structural equation model, as opposed to a Markov equivalence class, can be consistently estimated from observational data. We propose an algorithm that yields consistent estimates of the graph even in high-dimensional settings in which the number of variables may grow at a faster rate than the number of observations, but in which the underlying causal structure features suitable sparsity; specifically, the maximum in-degree of the graph is controlled. Our theoretical analysis is couched in the setting of log-concave error distributions.

*Biometrika, 2019*

*Chen, W., Drton, M., Wang, Y. S.*

It was previously shown that causal structure can be identified from purely observational data when the data are generated by a linear structural equation model in which all idiosyncratic errors have the same variance. Under that assumption, we propose a simple method which can consistently identify the causal ordering even in the high-dimensional setting where the number of considered variables is much larger than the number of observed samples.
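The identifiability idea can be illustrated in a few lines. With equal error variances, a source variable has the smallest marginal variance, and each later variable in a causal ordering has the smallest conditional variance given the variables already selected. The Python sketch below applies this greedily to the sample covariance. It is a minimal low-dimensional illustration only, not the high-dimensional procedure from the paper, and the function name `equal_variance_ordering` is hypothetical.

```python
import numpy as np

def equal_variance_ordering(X):
    """Greedy estimate of a causal ordering (sources first).

    Illustrative sketch: assumes a linear SEM with equal error variances
    and more samples than variables, so the sample covariance is invertible.
    """
    S = np.cov(X, rowvar=False)          # p x p sample covariance
    p = S.shape[0]
    order, remaining = [], list(range(p))
    while remaining:
        cond_vars = []
        for j in remaining:
            if order:
                # Residual variance of X_j after linear regression on X_order
                S_oo = S[np.ix_(order, order)]
                s_jo = S[j, order]
                v = S[j, j] - s_jo @ np.linalg.solve(S_oo, s_jo)
            else:
                v = S[j, j]              # marginal variance at the first step
            cond_vars.append(v)
        best = remaining[int(np.argmin(cond_vars))]
        order.append(best)
        remaining.remove(best)
    return order

# Example: chain X0 -> X1 -> X2 with unit-variance errors
rng = np.random.default_rng(0)
n = 5000
e = rng.normal(size=(n, 3))
x0 = e[:, 0]
x1 = x0 + e[:, 1]
x2 = x1 + e[:, 2]
print(equal_variance_ordering(np.column_stack([x0, x1, x2])))
```

In this toy chain the marginal variances are roughly 1, 2, and 3, so the source is picked first. Conditional on it, the middle variable has the smaller residual variance and is picked next.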

*NeurIPS, 2019*

*Zhao, B., Wang, Y. S., Kolar, M.*

We consider the setting where the data consist of multivariate random functions which have been observed from two (possibly) distinct populations (e.g., consider EEG data from a control group and a treatment group). In many cases, the scientific question of interest involves differences in the connectivity patterns (i.e., conditional independence graph) for each population. We propose a method which directly estimates the differences and avoids separately estimating each graph.

*Annals of Statistics, 2019*

*Drton, M., Fox, C., Wang, Y. S.*

General non-linear optimization procedures for calculating the MLE of Gaussian linear structural equation models often suffer poor performance when the models are non-recursive (i.e., contain feedback loops or cycles). Thus, we propose an alternative block-coordinate descent procedure in which each block update can be solved in closed form. Furthermore, we characterize the set of models for which the procedure gives a unique solution at each update.

*Annals of Applied Statistics, 2018*

*Chen, Y. C., Wang, Y. S., Erosheva, E. A.*

Variational inference is a widely used alternative to MCMC due to its relative computational efficiency. However, MCMC automatically quantifies uncertainty for the estimated parameters, whereas naively using the estimated variational distribution often underestimates uncertainty. Thus, we propose a simple bootstrap procedure which produces confidence intervals for the estimated variational parameters with valid frequentist coverage. We also show that in some settings, despite the inherent model misspecification of variational procedures, a two-sample test is still valid.
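The general shape of a bootstrap interval around a fitted parameter can be sketched as follows. This is a generic nonparametric percentile bootstrap in Python, not the specific procedure from the paper. The function `bootstrap_ci` and its `fit` argument are hypothetical, and `fit` stands in for whatever routine returns a scalar variational parameter estimate (the sample mean is used here purely as a placeholder).

```python
import numpy as np

def bootstrap_ci(data, fit, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a scalar estimate fit(data).

    `fit` is a placeholder for an estimation routine; the percentile
    construction is an assumption made for illustration.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        estimates[b] = fit(data[idx])      # refit on the resampled data
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Example with a stand-in estimator (sample mean)
rng = np.random.default_rng(1)
sample = rng.normal(loc=2.0, scale=1.0, size=400)
print(bootstrap_ci(sample, np.mean))
```

Refitting on every resample is the main cost, which is one reason a computationally cheap estimator pairs naturally with this approach.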

*Annals of Applied Statistics, 2017*

*Wang, Y. S., Matsueda, R., Erosheva, E. A.*

We develop a variational EM method for estimating mixed membership models with multivariate rank data. This procedure has many computational advantages over the previously proposed MCMC procedures. We apply the procedure to Eurobarometer data and find interpretable sub-groups which may be defined by public policy preferences.

*Stat, 2017*

*Wang, Y. S., Drton, M.*

We propose estimating the parameters of a linear SEM with dependent errors using an empirical likelihood criterion. The computationally efficient procedure we propose profiles out the covariance of the idiosyncratic errors, resulting in an estimated covariance which is always positive definite. In simulations, we see that the procedure produces estimates with lower MSE than existing procedures in certain settings when the errors are non-Gaussian.