FAN Yingying
University of Southern California

Exogenous Randomness Empowering Random Forests

We offer both theoretical and empirical insights into the impact of exogenous randomness on the effectiveness of random forests constructed with tree-building rules that are independent of the training data. We formally introduce the concept of exogenous randomness and identify two types that commonly exist in random forests: Type I, which arises from feature subsampling, and Type II, which results from tie-breaking during the application of tree-building rules. We develop non-asymptotic expansions of the mean squared error (MSE) for both individual trees and the corresponding forest, and establish necessary and sufficient conditions for tree and forest consistency. In the specific context of linear regression with independent features, our MSE expansions become more explicit, allowing us to gain a deeper understanding of the working mechanisms of random forests. They also allow us to derive an upper bound with an explicit rate of convergence for both individual trees and forests. Guided by our theoretical findings, we conduct simulation studies to further explore how exogenous randomness enhances random forest performance. Our theoretical and empirical findings show that feature subsampling reduces both the bias and the variance of random forests relative to individual trees, serving as an adaptive mechanism for balancing bias and variance. Furthermore, our results reveal an intriguing phenomenon: the presence of noise features can act as a “blessing” in enhancing the performance of random forests, thanks to feature subsampling.
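
To make Type I exogenous randomness concrete, here is a minimal sketch (illustrative only; the simulated linear design with noise features and all parameter choices are our own assumptions, not the paper's experiments). It contrasts a single fully grown tree with a forest whose trees differ only through feature subsampling, controlled via scikit-learn's max_features with bootstrap resampling switched off:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical linear model with noise features: only the first two of
# twenty features carry signal; the remaining eighteen are pure noise.
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.standard_normal(n)

# A single fully grown tree: no exogenous randomness.
tree = DecisionTreeRegressor(random_state=0)

# A forest whose trees differ only through Type I exogenous randomness:
# at each split, a random subset of max_features candidate features is
# considered; bootstrap=False removes data resampling across trees.
forest = RandomForestRegressor(
    n_estimators=200, max_features=5, bootstrap=False, random_state=0
)

for name, model in [("single tree", tree), ("forest (mtry = 5)", forest)]:
    mse = -cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error"
    ).mean()
    print(f"{name}: cross-validated MSE = {mse:.2f}")
```

In settings like this, the forest typically attains a markedly lower cross-validated MSE than the single tree, consistent with the bias- and variance-reducing role of feature subsampling described above.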

KOU Samuel
Harvard University

Catalytic Prior Distributions for Bayesian Inference

The prior distribution is an essential part of Bayesian statistics, and yet in practice it is often challenging to translate existing knowledge into pragmatic prior distributions. In this talk we will discuss a general method for constructing prior distributions that stabilize the estimation of complex target models, especially when the sample size is too small for standard statistical analysis, a common situation encountered by practitioners with real data. The key idea of our method is to supplement the observed data with a relatively small amount of “synthetic” data generated, for example, from the predictive distribution of a simpler, stably estimated model. This general class of prior distributions, called “catalytic prior distributions,” is easy to use and allows direct statistical interpretation. In numerical evaluations, the resulting posterior estimates based on catalytic priors outperform the maximum likelihood estimate from the target model and are generally superior to, or comparable in performance with, competing existing methods. We will illustrate the usefulness of the catalytic prior approach through real examples and explore the connection between the catalytic prior approach and several popular regularization methods.
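
As a rough illustration of the key idea (our own sketch under assumed choices of the simpler model, the synthetic-data size M, and the total weight tau; not the talk's examples), the posterior mode under a catalytic prior can be computed as a weighted maximum likelihood estimate on the observed data augmented with down-weighted synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # assumes scikit-learn >= 1.2

# Hypothetical small-sample logistic regression: with n = 30 and p = 10,
# the plain MLE is unstable and may not even exist under separation.
rng = np.random.default_rng(1)
n, p, M, tau = 30, 10, 1000, 5.0
X = rng.standard_normal((n, p))
y = (X[:, 0] + 0.5 * rng.standard_normal(n) > 0).astype(int)

# Simpler, stably estimated model: intercept-only, i.e. the marginal
# success rate; its predictive distribution generates synthetic responses.
p_hat = y.mean()
X_syn = np.column_stack(  # synthetic covariates: resample each observed column
    [rng.choice(X[:, j], size=M) for j in range(p)]
)
y_syn = rng.binomial(1, p_hat, size=M)

# Posterior mode under the catalytic prior = weighted MLE on the augmented
# data, each synthetic observation carrying weight tau / M (total weight tau).
X_aug = np.vstack([X, X_syn])
y_aug = np.concatenate([y, y_syn])
w = np.concatenate([np.ones(n), np.full(M, tau / M)])

fit = LogisticRegression(penalty=None, max_iter=1000)
fit.fit(X_aug, y_aug, sample_weight=w)
print("catalytic-prior estimate:", fit.coef_.round(2))
```

Because the synthetic data carry a fixed total weight tau regardless of M, they gently pull the fit toward the simpler model without overwhelming the observed data, which is also where the connection to regularization methods emerges.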

LI Mingyao
University of Pennsylvania

Unlocking the Power of Spatial Omics with AI

Spatial omics technologies have revolutionized biomedical research by providing detailed, spatially resolved molecular profiles that enhance our understanding of tissue structure and function at an unprecedented level. Histopathology is considered the clinical gold standard for disease diagnosis. However, the integration of histological information into spatial omics data analysis has been limited by a lack of computational pathology expertise among computational biologists. In this talk, I will present several tools that we recently developed to leverage pathology image information, thereby enhancing spatial omics data analysis.

MA Ping
University of Georgia

Statistical Computing Meets Quantum Computing

Recent breakthroughs in quantum computers have demonstrated quantum advantage (also known as quantum supremacy), i.e., quantum computers outperforming classical computers in solving specific problems. These problems, however, are highly physics-oriented. More relevant to statisticians is the fact that general-purpose programmable quantum computing devices are already available to the public. A natural question is whether these computers can benefit statisticians in solving statistics or data science problems, and if so, for what kinds of statistical problems should statisticians turn to quantum computers? Unfortunately, the general answer to this question remains elusive.

In this talk, I will present challenges and opportunities for developing quantum algorithms. I will introduce a novel quantum algorithm for a statistical problem and demonstrate that the intersection of statistical computing and quantum computing is an exciting and promising research area. The development of quantum algorithms for statistical problems will not only advance the field of quantum computing but also provide new tools and insights for solving challenging statistical problems.

YING Zhiliang
Columbia University

Recent Developments in IRT and Related Models: From Classical Theory to Modern Applications

This talk gives an overview of some recent developments in item response theory (IRT) models and related extensions. We will first look at the classical problems and the related statistical framework for educational and psychological measurement from a historical perspective. We will then present some more recent developments that can address challenges arising from modern applications with more complex data. Theoretical results on consistency and asymptotic efficiency will be discussed and compared with classical results.
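
For orientation (a standard textbook formula, not material from the talk), the classical framework can be anchored by the two-parameter logistic (2PL) IRT model, in which respondent i's probability of answering item j correctly is

```latex
\[
  \Pr(Y_{ij} = 1 \mid \theta_i)
  = \frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}},
\]
```

where \theta_i is the respondent's latent trait, a_j the item discrimination, and b_j the item difficulty; the recent extensions discussed in the talk build on this basic structure to accommodate more complex data.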

ZHAO Hongyu
Yale University

Statistical Challenges and Opportunities in Biobank Analysis

Recent years have seen the establishment of biobanks at different scales across the world. For example, the UK Biobank, with more than 500,000 participants, has proved to be a great resource for scientific discovery: more than 11,000 publications in PubMed carry the keyword “UK Biobank,” covering a wide range of topics including traditional epidemiology, genetics, imaging, wearable devices, disease causal pathways, and others. The Million Veteran Program in the US has recruited more than one million participants, and the All of Us Initiative has recruited more than 800,000 people with a focus on diversity. Other programs at the national and regional levels include those in Iceland, Estonia, Finland, Japan, Taiwan, and elsewhere. At the local level, many health care providers have created their own biobanks to address their unique needs, with the goal of providing more personalized and effective health care. Rich information is available on the large numbers of participants in these biobanks, many of which are linked to extensive electronic health records collected over the past decades, allowing longitudinal analysis of thousands of traits. Genetic information has been collected through genotyping arrays, whole exome sequencing, and/or whole genome sequencing. Other data, such as transcriptomic, epigenetic, proteomic, and metabolomic profiles from various tissues, have been collected for many participants. Imaging data and wearable device data are also available. The sheer volume of these data presents great computational and statistical challenges, and beyond size, the complexity of these data is also daunting. In this presentation, I will discuss the many statistical and computational challenges in the analysis and interpretation of biobank data, with a focus on the identification of the genetic basis of common and rare diseases, disease risk prediction, causal inference, and the potential biases in the analysis of these observational data.