Sharing of Department Summer Internship 2022
JIANG, Yunhui, BSc in Statistics
I am very grateful for being given the opportunity to work in the Trade Statistics Processing Section of the Trade Statistics Branch (2) under the guidance of my supervisor, Mr Hinz SHUM, during my summer internship in the Census and Statistics Department (C&SD). I was involved in a project to build models for automatically matching manifests and trade declarations. I first read literature reviews to familiarise myself with text analysis techniques. From the research bulletins produced by statisticians in the C&SD, I learned how to apply these techniques to analyse international merchandise trade statistics. After equipping myself with the necessary background knowledge, I assisted the project by programming functions for data pre-processing, especially for the textual data. I also built models for matching data items of manifests and trade declarations by using a variety of programming techniques and packages, in which I referenced several existing string matching models and borrowed some string similarity checking methods from them. My group eventually implemented a pipeline to test functions and models.
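As a rough illustration of the kind of string similarity check mentioned above, the following minimal sketch uses only Python's standard `difflib` module. The function names, the 0.6 threshold, and the sample descriptions are hypothetical and are not taken from the project itself.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case the text, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalised descriptions."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def best_match(query: str, candidates: list[str], threshold: float = 0.6):
    """Return (candidate, score) for the most similar candidate,
    or (None, score) when even the best score is below the threshold."""
    score, candidate = max((similarity(query, c), c) for c in candidates)
    return (candidate, score) if score >= threshold else (None, score)


# Hypothetical declaration descriptions to match a manifest entry against.
declarations = [
    "Cotton T-shirts, men's",
    "Stainless steel screws",
    "LED light bulbs",
]
match, score = best_match("MENS COTTON T SHIRTS", declarations)
```

A real pipeline would probably combine several similarity measures and tune the threshold on labelled pairs. This sketch only shows the basic pre-process, score, and select pattern.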
I was also honoured to be selected to assist the research work of my supervisor at school, Prof. Han Ruijian. We are working on the convergence analysis of the MM algorithm for computing the maximum likelihood estimator in pairwise comparison models. We wrote a Python package of the existing MM-algorithm-based maximum likelihood solvers designed for several pairwise comparison models. We then extended the algorithm to several other models and added these solvers to the package. The package enabled us to run simulations to test the convergence properties of interest. I am very happy that my part of this work was fruitful: the current phase of the simulation work is almost complete, with satisfactory results, which is especially rewarding as it was my first experience of research work.
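For readers unfamiliar with the MM algorithm, here is a minimal sketch of the classical MM update for the Bradley-Terry model, one well-known pairwise comparison model. It is not the package described above. The function name and the small win matrices are hypothetical, and the sketch assumes every item has won and lost at least once within a connected comparison graph, so that the estimator exists.

```python
def bradley_terry_mm(wins, n_iter=1000, tol=1e-10):
    """Estimate Bradley-Terry strengths by the MM update.

    wins[i][j] is the number of times item i beat item j (diagonal is zero).
    Returns strengths normalised to sum to one.
    """
    m = len(wins)
    p = [1.0 / m] * m
    total_wins = [sum(row) for row in wins]
    for _ in range(n_iter):
        new_p = []
        for i in range(m):
            # Sum over opponents of (number of comparisons) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(m)
                if j != i
            )
            new_p.append(total_wins[i] / denom)
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


# Two items: item 0 beat item 1 three times and lost once.
two_items = bradley_terry_mm([[0, 3], [1, 0]])

# Three items with a fully connected comparison graph.
three_items = bradley_terry_mm([[0, 2, 3], [1, 0, 2], [1, 1, 0]])
```

Each update never decreases the likelihood, which is the property that makes the convergence of such solvers worth studying by simulation.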
The joint programme has been a valuable experience for me, as my exposure supplemented my skill set and helped me identify my field of interest for future studies. I would like to express my gratitude to both my supervisors and colleagues, who inspired and supported me and also gave me valuable advice on career and life.
LEUK, Shi Min, BSc in Statistics
I am honoured to have been given the opportunity to work in the National Income Section (1)2 of the Census and Statistics Department (C&SD) for the past two months. This section is mainly responsible for research and development on GDP-related statistics by economic activity and for providing GDP estimates and revisions for press releases.
It is important to provide GDP estimates on time, and my main task was to perform GDP nowcasts. As the economy was severely affected by the COVID-19 pandemic, models using conventional macroeconomic variables performed poorly, which complicated my task. I am grateful to my supervisor, Brian, who recommended using Google Mobility, a new high-frequency series that has recorded citizens’ mobility since the start of the pandemic, as a variable. As a result, the model was considerably improved. I also learned many time-series-related techniques such as backcasting, temporal disaggregation, and seasonal adjustment using X-13ARIMA-SEATS. I was able to build an ARIMAX model using Python and R, with satisfactory results.
At CUHK, my supervisor, Dr Ho Kwok Wah, introduced me to the projects he was working on. I was asked to analyse psychology and physiology data sets using the linear mixed-effect model from both the frequentist and Bayesian approaches. This was challenging as I was unfamiliar with Bayesian statistics and the model used. Knowing my concerns, Dr Ho patiently guided me through the intricacies of the Bayesian philosophy and framework and provided the references I needed to understand and implement the model in R. The tasks were completed successfully, and the results show clear differences between the two approaches.
Overall, this internship experience consolidated my programming skills and statistical knowledge, allowing me to apply my skills in various fields. To say that it was an enlightening experience would be an understatement. I am grateful for the opportunity offered by the Department of Statistics, and thank my supervisors for their patience and guidance.
LUI, Chak Sum, BSc in Statistics
I am grateful for having been given the opportunity to join this internship programme. During the two-month internship, I was assigned to the Construction and Miscellaneous Services Statistics Section of the Sectoral Economic Statistics Branch (4). This section is mainly responsible for handling data related to both the building, construction, and real estate sectors and social and personal services.
One of the tasks assigned to me was conducting desktop research on the building, construction, and real estate sectors. The data that I collected is used for publications related to the building, construction, and real estate sectors of the Annual Survey of Economic Activities. My supervisor, Henry, explained to me how the C&SD collected the related data by introducing me to the survey that they used. He also explained to me the sampling method the C&SD used to select a sample of contractors. I also learned more about the organisation and career path of the C&SD.
Apart from working at the C&SD, I also worked at CUHK with Prof. Sit. My work at CUHK involved building linear mixed-effect models with data on measurements of patients’ eyes. As a sophomore, I found this task a little difficult at first because I did not have strong background knowledge, but Prof. Sit gave me materials and answered my questions patiently, which helped me understand the model better and finish the task successfully.
I thank my supervisor, Henry, Prof. Sit, and the Department of Statistics. This internship broadened my horizons as I gained working experience and learned more about statistics. I believe that the knowledge and experience that I have gained from this programme will serve as a strong foundation for my future development.
NGAN, Yi Ho, BSc in Statistics
I am honoured to have been given the opportunity to work in the Trade Analysis Section (1) of the C&SD. This section provides high-quality statistical services and publishes trade statistics for the general public.
I was responsible for building and testing different structures of probabilistic deep learning (PDL) models for predicting the distribution of the unit value (UV). If a reported UV has an extremely low probability under the predicted distribution, it may be considered abnormal. This is a novel big data analytics project on detecting UVs misdeclared by traders. I first built a baseline model as a benchmark for model evaluation, then compared the performance of the PDL model for different output distributions by cross-validation.
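The idea of flagging an abnormal UV can be sketched as follows. This is only an illustration under a strong assumption: the predicted distribution of the log UV is taken to be normal with a given mean and standard deviation, standing in for whatever output distribution the PDL model produces. The function names and the 1% cut-off are hypothetical.

```python
from math import log
from statistics import NormalDist


def tail_probability(unit_value: float, mu: float, sigma: float) -> float:
    """Two-sided tail probability of a reported unit value, assuming
    log(unit value) follows a normal distribution N(mu, sigma^2)."""
    cdf = NormalDist(mu, sigma).cdf(log(unit_value))
    return 2 * min(cdf, 1 - cdf)


def is_abnormal(unit_value: float, mu: float, sigma: float,
                alpha: float = 0.01) -> bool:
    """Flag the unit value when its tail probability falls below alpha."""
    return tail_probability(unit_value, mu, sigma) < alpha


# Hypothetical prediction: typical unit value of about 100, moderate spread.
mu, sigma = log(100), 0.2
typical = is_abnormal(105, mu, sigma)   # close to the predicted centre
suspect = is_abnormal(400, mu, sigma)   # far out in the upper tail
```

In practice the mean and spread would differ for each declaration, since the model predicts a distribution conditional on the declared attributes.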
Although I had some modelling experience from my course projects, it was the first time I was processing such a complicated and large-scale dataset. Fortunately, one of my supervisors, Benjamin, hosted an interactive deep learning tutorial series that prepared me to handle the task. Currently, this project is under development. I am looking forward to seeing the final model.
I am also conducting research on statistical inference with incomplete time series under the supervision of Prof. Chan Kin Wai. It is a golden opportunity for me to get research experience as I am planning to pursue postgraduate studies. Prof. Chan provides guidance and many ideas to help me develop the proposal. To be honest, I often feel frustrated before meeting him, but he invariably eases my worries and motivates me to go further. I thank him for his patience and support.
I would like to thank my supervisors (Benjamin, Ian, and Natalie) and Prof. Chan. They not only taught me the subject but also offered many useful suggestions regarding my future plans. They were a constant source of support and encouragement. This programme has enriched my knowledge and experience in both official statistics and academic research. It will undoubtedly help my further studies and future career.
SUN, Zhengyao, BSc in Statistics
During my internship at the C&SD, I was very pleased to be assigned to the Data Collection Systems Section of the Information Technology Services and Infrastructure Branch, supervised by Jordan Mo. The main responsibility of this section is to maintain the Online Questionnaire System and the Computer-Assisted Telephone Interviewing System, which are important data sources of the C&SD in this information age.
My supervisor, Jordan, was caring and considerate. Instead of directly assigning projects to me, he encouraged me to explore fields that matched my interests and would help my career development. In the end, among the many interesting ideas, data security and anonymisation techniques appealed to me most.
Anonymisation comes up in statistical research, irrespective of the industry, whenever you need to publicise your data sources or share data containing sensitive information. There is always the risk that an unintentional data leak will cause a privacy breach. The main task of this project was to review anonymisation methods and techniques and their applicability, advantages, and drawbacks. I also had to provide demo code in Python to show how these techniques could be conveniently implemented in practice.
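To give a flavour of such a demo, the sketch below shows one common technique: generalising quasi-identifiers and checking k-anonymity. It is not the code written during the internship. The records, field names, and generalisation rules are all hypothetical.

```python
from collections import Counter


def generalise(record: dict) -> dict:
    """Coarsen quasi-identifiers: age becomes a 10-year band and the
    area code keeps only its first two characters."""
    low = record["age"] // 10 * 10
    return {
        **record,
        "age": f"{low}-{low + 9}",
        "area_code": record["area_code"][:2] + "***",
    }


def k_anonymity(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Size of the smallest group of records sharing the same
    quasi-identifier values; the data set is k-anonymous for this k."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(groups.values())


# Hypothetical microdata with a sensitive income field.
records = [
    {"age": 31, "area_code": "NT101", "income": 30},
    {"age": 38, "area_code": "NT155", "income": 45},
    {"age": 52, "area_code": "KL201", "income": 28},
    {"age": 57, "area_code": "KL298", "income": 60},
]
quasi = ["age", "area_code"]
k_before = k_anonymity(records, quasi)
k_after = k_anonymity([generalise(r) for r in records], quasi)
```

Generalisation trades detail for privacy: the coarser the bands, the larger the groups, but the less useful the released data become for analysis.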
I was also supervised by Philip YAM at CUHK during the internship and was involved in projects such as optimising word choice in the Wordle game to find the correct answer in the fewest steps, with implementations in R and Java. In another project, I learned some methods of sentiment analysis and tried to use a naïve Bayes classifier for prediction. The most attractive part of my work was learning about generative adversarial networks (GANs), which are an important deep learning concept. Many kinds of GAN models are used to perform different tasks, such as generating your own Pokémon or transforming an image into an artwork in Van Gogh style, which can be considered as NFTs.
The work experience gave me a comprehensive understanding of data anonymisation, information entropy, and GANs, and a taste of what the work of a real statistician is like. Along the way, I noticed that some seemingly unrelated concepts or methods are based on common statistical theories. It was a pleasure to be a part of this internship programme, and I have a clearer insight into what I want to do and the skills on which I need to work to prepare for the future.
ZHAO, Shengchun, BSc in Statistics
During the summer holiday, it was my great honour to work at the Centre for Clinical Research and Biostatistics (CCRB). The time was short, but I gained a lot during this summer internship.
My project was to quantify the relative contributions of government measures, meteorological factors, and viral genetic variants to COVID-19 pandemic control from the regional and global perspectives.
My first task was to collect and manipulate the data for government interventions. Prof. Marc gave me some references and data websites. In the beginning, I sorted and combined the data sets according to the requirements in the proposal, but the results from the data sets processed in this way were not good. Therefore, I consulted Prof. Marc many times regarding the treatment of the data sets. Prof. Marc told me that when collecting data, we should judge the accuracy and reliability of each data source; not all data should be collected. After collecting data, the next step was to combine different data sets, but their forms and formats varied, so I had to make careful comparisons and use my judgment. Prof. Marc suggested that I should focus on one data set and supplement and improve the rest. Finally, I standardised all of the data sets using their codebooks.
My second task was to collect the weather data of more than 180 countries in the world for the past three years. Because each country has many weather bureaus, I had to collect data from all of the bureaus and process them. These tasks taught me that data accuracy and reliability are the cornerstone of research. Only when reliable data are collected can research produce accurate results. Prof. Marc also made me understand what academic rigour means.
My last task was to calculate the reproduction number (Rt). This involved learning a lot of statistics. Prof. Marc gave me many reference materials from which to learn. I learned about the Markov Chain Monte Carlo method and the Metropolis–Hastings algorithm, which benefitted me a lot.
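As a small illustration of the Metropolis-Hastings algorithm mentioned above, the following sketch runs a random-walk sampler on a toy target, a standard normal density. It is not the model used to estimate Rt. The function name, step size, and seed are hypothetical choices.

```python
import math
import random


def metropolis_hastings(log_target, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a Gaussian proposal.

    log_target returns the log of the target density up to a constant.
    """
    rng = random.Random(seed)
    x = start
    log_p = log_target(x)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_target(proposal)
        # The proposal is symmetric, so the acceptance probability
        # reduces to min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples


# Toy target: standard normal, log density -x^2 / 2 up to a constant.
chain = metropolis_hastings(lambda x: -0.5 * x * x, start=0.0, n_samples=20000)
kept = chain[2000:]  # discard an initial burn-in period
mean = sum(kept) / len(kept)
variance = sum((v - mean) ** 2 for v in kept) / len(kept)
```

For a real posterior, such as one over a reproduction number, only `log_target` would change. The accept or reject step stays the same.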
I am very grateful to Prof. Marc for his help during the summer internship. I not only learned a lot of statistics but also understood what scientific research entails.