Posts by Collection


Causal discovery with unobserved confounding and non-Gaussian data

Submitted, 2020+
Wang, Y. S., Drton, M.

We consider data that arise from a linear structural equation model (SEM) in which the idiosyncratic errors are allowed to be dependent in order to capture possible latent confounding. We show that under certain restrictions on the latent confounding and when the errors are non-Gaussian, the exact causal structure, not merely an equivalence class, can be consistently recovered from purely observational data when the graph corresponding to the SEM is bow-free and acyclic.

Gender-based homophily in collaborations across a heterogeneous scholarly landscape

Submitted, 2020+
Wang, Y. S., Lee, C. J., West, J. D., Bergstrom, C. T., Erosheva, E. A.

We test the hypothesis that the gender homophily observed in the existing scholarly corpus can be explained by the varying gender representation and collaborative norms across intellectual communities. Our analysis explicitly accounts for heterogeneity in gender representation across a very fine-grained partitioning of the scholarly landscape. Ignoring this heterogeneity would otherwise inflate the estimated effect of gender on co-authorship formation. We find that even when accounting for the heterogeneous gender representation in intellectual communities, co-authorship formation is not gender blind across wide swaths of the scientific landscape.


Empirical likelihood for linear structural equation models with dependent errors

Stat, 2017
Wang, Y. S., Drton, M.

We propose estimating the parameters of a linear SEM with dependent errors using an empirical likelihood criterion. The computationally efficient procedure we propose profiles out the covariance of the idiosyncratic errors, resulting in an estimated covariance which is always positive definite. In simulations, we see that the procedure produces estimates with lower MSE than existing procedures in certain settings when the errors are non-Gaussian.

A variational EM method for mixed membership models with multivariate rank data: An analysis of public policy preferences

Annals of Applied Statistics, 2017
Wang, Y. S., Matsueda, R., Erosheva, E. A.

We develop a variational EM method for estimating mixed membership models with multivariate rank data. This procedure has many computational advantages over the previously proposed MCMC procedures. We apply the procedure to Eurobarometer data and find interpretable sub-groups which may be defined by public policy preferences.

On the use of bootstrap with variational inference: Theory, interpretation, and a two-sample test example

Annals of Applied Statistics, 2018
Chen, Y. C., Wang, Y. S., Erosheva, E. A.

Variational inference is a widely used alternative to MCMC due to its relative computational efficiency. However, MCMC automatically quantifies uncertainty for the estimated parameters, whereas naively using the estimated variational distribution often underestimates uncertainty. Thus, we propose a simple bootstrap procedure which produces confidence intervals for the estimated variational parameters with valid frequentist coverage. We also show that in some settings, despite the inherent model misspecification of variational procedures, a two-sample test is still valid.
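
As a rough illustration of the bootstrap idea, the sketch below computes a percentile bootstrap interval for a generic point estimator. It is not the procedure from the paper: the function name `bootstrap_ci`, the percentile construction, and the use of the sample mean as a stand-in for a fitted variational parameter are all assumptions made for this example.

```python
import random


def bootstrap_ci(data, estimator, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for estimator(data).

    `estimator` is a hypothetical placeholder for any procedure that maps
    a sample to a scalar estimate (for example, refitting a variational
    approximation and reading off one parameter).
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Resample the observations with replacement and re-estimate.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    estimates.sort()
    # Read the interval endpoints off the empirical quantiles.
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def sample_mean(xs):
    # Stand-in estimator; an assumption for this sketch only.
    return sum(xs) / len(xs)


rng = random.Random(1)
data = [rng.gauss(2.0, 1.0) for _ in range(200)]
lower, upper = bootstrap_ci(data, sample_mean)
print(lower, upper)
```

In practice the estimator would refit the variational approximation on each resample, so the cost scales with the number of bootstrap draws.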

Computation of maximum likelihood estimates in cyclic structural equation models

Annals of Statistics, 2019
Drton, M., Fox, C., Wang, Y. S.

General non-linear optimization procedures for calculating the MLE of Gaussian linear structural equation models often suffer poor performance when the models are non-recursive (i.e., contain feedback loops or cycles). Thus, we propose an alternative block-coordinate descent procedure in which each block update can be solved in closed form. Furthermore, we characterize the set of models for which the procedure gives a unique solution at each update.

Direct estimation of differential functional graphical models

NeurIPS, 2019
Zhao, B., Wang, Y. S., Kolar, M.

We consider the setting where the data consist of multivariate random functions which have been observed from two (possibly) distinct populations (e.g., consider EEG data from a control group and a treatment group). In many cases, the scientific question of interest involves differences in the connectivity patterns (i.e., conditional independence graph) for each population. We propose a method which directly estimates the differences and avoids separately estimating each graph.

On causal discovery with an equal-variance assumption

Biometrika, 2019
Chen, W., Drton, M., Wang, Y. S.

It was previously shown that causal structure can be identified from purely observational data when the data are generated by a linear structural equation model where all idiosyncratic errors have the same variance. Under that assumption, we propose a simple method which can consistently identify the causal ordering even in the high-dimensional setting where the number of considered variables is much larger than the number of observed samples.
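
To make the equal-variance idea concrete, here is a minimal low-dimensional sketch: under equal error variances, a source variable has the smallest marginal variance, and each subsequent variable in the ordering has the smallest conditional variance given those already ordered. The function names, the simulated three-variable model, and the sequential-residual implementation are assumptions for illustration; the high-dimensional procedure in the paper is not reproduced here.

```python
import random


def equal_variance_order(columns):
    """Return a causal ordering of column indices, sources first."""
    n = len(columns[0])
    # residuals[j] holds variable j, centered, with the already ordered
    # variables projected out.
    residuals = {}
    for j, col in enumerate(columns):
        mean = sum(col) / n
        residuals[j] = [x - mean for x in col]
    order = []
    while residuals:
        # Pick the variable with the smallest remaining (conditional) variance.
        k = min(residuals, key=lambda j: sum(r * r for r in residuals[j]))
        r_k = residuals.pop(k)
        order.append(k)
        ss_k = sum(r * r for r in r_k)
        # Regress each remaining variable on the chosen residual and keep
        # what is left over (a Gram-Schmidt step).
        for j in list(residuals):
            r_j = residuals[j]
            beta = sum(a * b for a, b in zip(r_j, r_k)) / ss_k
            residuals[j] = [a - beta * b for a, b in zip(r_j, r_k)]
    return order


def simulate(n, seed=0):
    """Hypothetical chain x1 -> x2 -> x3 (plus x1 -> x3), unit error variances."""
    rng = random.Random(seed)
    x1, x2, x3 = [], [], []
    for _ in range(n):
        a = rng.gauss(0, 1)
        b = 0.8 * a + rng.gauss(0, 1)
        c = 0.5 * a - 0.7 * b + rng.gauss(0, 1)
        x1.append(a)
        x2.append(b)
        x3.append(c)
    # Columns are deliberately stored out of causal order: [x3, x1, x2].
    return [x3, x1, x2]


order = equal_variance_order(simulate(2000))
print(order)
```

Because the columns are stored as `[x3, x1, x2]`, a correct ordering lists index 1 first, then 2, then 0.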

High-dimensional causal discovery under non-Gaussianity

Biometrika, 2020
Wang, Y. S., Drton, M.

It has been previously shown that when the variable-specific error terms are non-Gaussian, the exact causal graph of a linear structural equation model, as opposed to a Markov equivalence class, can be consistently estimated from observational data. We propose an algorithm that yields consistent estimates of the graph also in high-dimensional settings in which the number of variables may grow at a faster rate than the number of observations, but in which the underlying causal structure features suitable sparsity; specifically, the maximum in-degree of the graph is controlled. Our theoretical analysis is couched in the setting of log-concave error distributions.

Robust Inference for High-Dimensional Linear Models via Residual Randomization

ICML, 2021
Wang, Y. S., Lee, S. K., Toulis, P., Kolar, M.

We propose a residual randomization procedure designed for robust Lasso-based inference in the high-dimensional setting. Compared to earlier work that focuses on sub-Gaussian errors, the proposed procedure is designed to work robustly in settings that also include heavy-tailed covariates and errors. Note: An error in the statement of Theorem 2 in the ICML version has been fixed in the arXiv version.