Sharing of Department Summer Internship 2021
KWOK, Ho Hin, BSc in Statistics
In this programme, I was assigned to the Labour Statistics Branch (1) of the Census and Statistics Department (C&SD). This department is mainly responsible for collecting surveys from companies and stores in Hong Kong to classify them into different categories as defined in a Code Book.
As it is important to have up-to-date information on the stores, the Statistician, Daniel, assigned me the task of building a Web scraping program. The program was required to retrieve the name, status and address of restaurants in Hong Kong from OpenRice. Daniel was very helpful and advised me to use Selenium to complete the task. Through the task, I learnt many Web scraping techniques and the idea of acting like a human when doing Web scraping. The technique and the idea were crucial to overcoming the traps designed by Web developers.
This internship experience inspired me greatly and exposed me to dealing with practical problems. The Web scraping task was very different from those in my coursework. I believe that the techniques I learnt and the overall experience will be helpful in my future career.
LAI, Tsz Chun, BSc in Statistics
I was glad to be assigned to the Development Branch of the C&SD during this two-month internship, supervised by Jessie. My focus was on research on the National Quality Assurance Framework of the United Nations (UN-NQAF) and the microdata products of different national statistical offices (NSOs), especially considering the data confidentiality and statistical disclosure control (SDC) requirements.
Microdata typically contain potentially sensitive information, such as individuals’ names or addresses. NSOs must prevent the identification of individuals when the data are analysed in statistical research. Therefore, traditional SDC methods such as rounding numbers and suppressing and removing outliers are implemented to anonymise the data. However, these methods may affect statistical inferences from the anonymised data. Hence, some NSOs have developed advanced SDC measures to achieve a good balance between data confidentiality and information preservation.
One interesting measure is the personal information factor (PIF) from the Australian Bureau of Statistics. Suppose we wish to enable access to an extract with some records and variables from a dataset. We use the prior knowledge of the variables from the population dataset and the posterior knowledge of the variables from the sample extract to calculate the cell information gain for each cell. The row information gain is obtained by summing each row, and the PIF is thereby obtained. The PIF gives us a way to measure the amount of personal information in microdata to assess the disclosure risk.
I was also supervised by Prof. Han at The Chinese University of Hong Kong (CUHK) during the internship, which was another valuable experience. We focused on ranking tennis players based on their historical performance from 2000 to 2021 by implementing the MM algorithm in Python. At the beginning, it was rather hard for me as I was new to Python. Prof. Han guided me step by step to finish the implementation, which helped me to familiarise myself with Python in a short time.
The first model that we implemented used the winner and loser of each game to rank the players. We calculated the negative log-likelihood of the first model and compared it with the implied probability from a betting website. Our estimation was not as accurate as that of the betting website. Hence, we improved the model by adding more information such as the winning and losing sets.
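The winner-and-loser model described above can be illustrated with a minimal sketch. The text does not name the model, so this assumes a Bradley-Terry model fitted with the standard MM update; the function names, the toy match list and the stopping rule are all hypothetical and are not the code written during the internship.

```python
import numpy as np

def fit_bradley_terry(matches, n_players, n_iter=500, tol=1e-10):
    """Estimate player strengths from (winner, loser) index pairs.

    Assumes a Bradley-Terry model: P(i beats j) = p_i / (p_i + p_j).
    Each MM step sets p_i = W_i / sum_j n_ij / (p_i + p_j), where W_i is
    the number of wins of player i and n_ij the games between i and j.
    """
    wins = np.zeros((n_players, n_players))
    for winner, loser in matches:
        wins[winner, loser] += 1
    total_wins = wins.sum(axis=1)
    games = wins + wins.T  # n_ij, symmetric, zero diagonal

    strength = np.full(n_players, 1.0 / n_players)
    for _ in range(n_iter):
        pair_sum = strength[:, None] + strength[None, :]
        # Only pairs that actually met contribute to the denominator.
        ratio = np.divide(games, pair_sum,
                          out=np.zeros_like(games), where=games > 0)
        updated = total_wins / ratio.sum(axis=1)
        updated /= updated.sum()  # fix the scale: strengths sum to 1
        converged = np.max(np.abs(updated - strength)) < tol
        strength = updated
        if converged:
            break
    return strength

def negative_log_likelihood(strength, matches):
    """Negative log-likelihood of the observed results under the model."""
    return -sum(np.log(strength[w] / (strength[w] + strength[l]))
                for w, l in matches)

# Toy data (hypothetical): three players, every pair has met,
# and every player has at least one win and one loss.
matches = [(0, 1), (0, 1), (0, 1), (1, 0),
           (1, 2), (1, 2), (2, 1),
           (0, 2), (0, 2), (2, 0)]
strength = fit_bradley_terry(matches, n_players=3)
ranking = np.argsort(-strength)  # strongest player first
print(ranking, negative_log_likelihood(strength, matches))
```

The sketch assumes every player has played at least one game and that the comparison graph is connected; real match data would need those conditions checked first.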
Overall, I gained practical and research experience at the C&SD and CUHK. These experiences broadened my horizons on the importance of statistics in different aspects and consolidated my statistical skills for my future career. I would like to thank my supervisors again for this unforgettable internship experience.
LAM, Man Ting, BSc in Statistics
In the internship programme, I was glad to be assigned to the Statistical Processing Systems Branch of the C&SD. This section specialises in IT planning and policy formulation, such as program enhancement. I was responsible for two major tasks: BANFF imputation for missing values in a survey and the generic application of automatic coding.
I accessed the research paper for the BANFF procedure, which inspired me greatly in understanding the general mechanism and workflow of the imputation system. By comparing the overseas practice and the existing imputation procedure, I realised that apart from the general imputation rules, the accuracy could be improved through outlier detection and multiple rounds of imputation. With the guidance of my supervisor, I also learned about the practical considerations and requirements when applying the theory to the program model. This experience equipped me with the ability to handle real-life statistical models.
I would like to thank my supervisor, Prof. Zhu Hui Chen, for her guidance on the data analysis project on COVID-19. I worked on the data collection and analysis of the worldwide COVID-19 data. Handling a massive amount of data was challenging and new for me. With the patience and guidance of my supervisor, I followed the right direction and completed the task. I was glad to acquire concrete and useful knowledge of the analysis and visualisation options available in R. I benefitted immensely from the internship programme, and I would like to acknowledge the Department of Statistics for their help and for providing this opportunity.
LEE, Wai Hin, BSc in Statistics
I am grateful for the precious opportunity to join this internship programme. During my work at the C&SD, I was assigned to the Wages and Labour Costs Statistics Section (1) of the Labour Statistics Branch (3). This section is mainly responsible for publishing the quarterly reports of wage and payroll statistics.
I was assigned the task of conducting desktop research about the gig economy. Through this research, I learned how to construct a study from scratch: from understanding the topic and collecting relevant data to analysing the data and writing a summary article. I also learned more about the organisation and work of the C&SD, in addition to some of its publications such as the general household survey.
Apart from working at the C&SD, I also worked at CUHK with Prof. Yam and his MPhil student, Kaiser. My work was mainly on machine learning, which was a new topic for me. Whenever I encountered any problems, they were willing to guide me patiently, and I acquired new knowledge, for example on support vector machines and the BFGS algorithm, through my work.
This internship was a great opportunity for me to work with colleagues and supervisors at both the C&SD and CUHK, and I believe that the knowledge and experience that I have gained from this programme will be useful in the future.
TSE, Shun Chi, BSc in Statistics
I was very pleased to have the opportunity to work in the C&SD for two months as a summer intern. During this internship period, I was assigned to the Data Collection System Section of the Information Technology Services and Infrastructure Branch. The section is mainly responsible for maintaining the Online Questionnaire System and the Computer-Assisted Telephone Interviewing System.
Over the two months of the internship, I was assigned two tasks. First, I established an online questionnaire template for a new survey, which served as the basis for implementing the survey on the Online Questionnaire System. Second, I conducted research and compiled a research report on the possibility of automating the case allocation process within a field pool to reduce the manual effort involved.
My work experience at the C&SD will be beneficial to my future study and career in multiple ways. As I have learned to set up a sample online questionnaire template based on a real-world economic survey, my programming skills have improved. By researching the questionnaire for the survey, I also learned the basic principles of questionnaire design, which consolidated my statistical knowledge. Moreover, when I was conducting the research on the automation of an optimisation model for work assignment, I learned to formulate a problem in terms of different model approaches and to make concrete recommendations for solving the problem. Lastly, during the two months of being a summer intern, I learned to communicate and cooperate with my colleagues in solving problems together, which improved my interpersonal skills. Overall, the experience has equipped me with both soft skills and technical knowledge.
PENG, Zetao, BSc in Statistics
During the summer holiday, I felt honoured to be selected to intern at the Center of Clinical Research and Biostatistics (CCRB; Health View Bioanalytic Limited). The two-month internship was short but fruitful.
The company mainly deals with health issues using statistical methods. The major topics during my internship were dementia, diabetes, stroke and anaemia. My first task was to collect relevant disease data in the Greater Bay Area. Two obstacles surfaced when I attempted to search for relevant data. The first problem was that I had no idea how to collect reliable data because I had never done similar work before. The other problem was rather more objective: the medical data were usually not publicly accessible, which made the search much more difficult. I searched the Internet tirelessly, assuming that every piece of data would be valuable and trustworthy. As a result, I gathered almost all of my data without considering their sources, and this turned out to be problematic as the final report was less rigorous than it could have been. My leader asked a senior colleague, Sara, to assist me with addressing the challenge. Sara taught me how to search for more rigorously sourced data, which was a precious lesson in how to solve similar problems in my future study and career.
My second task was simple data analysis among several variables. This part resembled what I had learnt in class. I mainly used R to generate images and explore potential relationships. Unlike in the class teaching materials, the data were “ugly” and seemed to have almost no internal relationships. When I discussed it with a senior colleague, Jack, he told me that in most real-world situations, irregular data would be the norm. He remarked, “Only when theory combines with practice can it be real knowledge”.
The last task was to process the original data obtained from different questionnaires. This task posed a serious test to my Excel skills. I needed to input the data and combine them. Then, I needed to combine different Excel files into a single file and select the designated entries. This task required me to be patient and meticulous. It also helped me recognise the value of every piece of data in real life. A statistical project is not only about data analysis; the data collection and selection are also important.
During this period, I also served as an “ARIA” (Automatic Retinal Image Analysis) volunteer in communities. I was responsible for explaining the analysis results to the participants. When I communicated with the older participants, I realised that the fight against geriatric chronic diseases still had a long way to go. This experience also helped me to appreciate how statistics can be applied in real life as well as ways to solve practical problems.
I sincerely express my appreciation to my leader and the people who helped me in my work. I hope this experience will serve as a beacon along my journey of progress.
XIAO, Jian Bo, BSc in Statistics
It was an honour to work as a Junior Research Assistant at the CCRB this summer. During the two-month internship, I gained much valuable experience in applying statistical knowledge as well as instrumental career guidance from my supervisor, Prof. Benny Zee, and other staff at the CCRB.
My main tasks included organising raw data, writing analysis reports, designing course materials for undergraduates majoring in Public Health and interpreting risk predictions for the elderly in the community.
Previously in my statistics courses, I had never worked on organising raw data; hence, I underestimated the importance of this tedious job. Ms Maria Lai and Dr Jack Lee emphasised that every record counted and that I needed to be cautious when cleaning the data. For example, a statistical outlier may reflect the reality in the clinical field and hence should not be removed. I proceeded to write a primary analysis report on a questionnaire survey. It was challenging at first because the survey involved many attributes and a large proportion of missing values. However, Ms Sara Li encouraged me to start by learning from published journals on relevant topics and to build my report up from a basic descriptive summary and bivariate relationships. This was expected to make it easier for me to address the problems of interest.
I was glad to be invited by Prof. Marc Chong to help design course materials for teaching statistical software called PSPP through a flipped classroom. This special task inspired me to rethink how to convey knowledge in a more easily comprehensible way rather than follow a strict and formal framework. I also learned the philosophy of the Feynman technique and found it fulfilling to master new software while teaching others.
Additionally, I participated in some outreach services for the elderly as a health ambassador. Our team helped the elderly assess their stroke and dementia risk levels using our Automatic Retinal Image Analysis technique. My duties were to interpret the prediction results and advise accordingly. It was memorable to witness how an effective application of our technique benefitted the community; this experience has taught me to make good use of data from people for the people. I also came to understand the significance of interpretability in biostatistics.
Thanks to the Statistics Department, I was privileged to enjoy a fruitful internship with a team of esteemed senior colleagues. Not only did I gain professional expertise, I also now aspire to grow into a statistician who cares about society and knows more than merely handling data.
ZHU, Jiaying, BSc in Statistics
I was honoured to serve as an intern at the CCRB this summer. I worked at Beth Bioinformatics Co., Ltd and also analysed data for senior PhD students. Working with Prof. Maggie Wang, along with her team, was an inspiring and memorable experience.
As a junior research assistant, my first task was to download COVID-19 data from websites and perform some alignment. A senior colleague explained that I could only proceed to consider the analysis methods and perform research once I knew what the initial data looked like. I also had to be cautious because compared with other mistakes in coding or calculations, mistakes in data collection are easily concealed and more costly to handle. Furthermore, I created an information sheet summarising the variations of the virus depending on different viral characteristics and the outcomes of common and rare mutations. Although I lacked professional medical knowledge, my technical contributions aided the efficiency of the research.
The other part of my internship involved marketing issues at Beth Bioinformatics Co., Ltd. The company mainly offers technology for vaccine and antigen design, evaluates the risk of complex disease through genomic analysis, and operates an app called “vaccine4u”. I made two original hand-painted posters for advertisement, and I checked the copyright status of photos in articles issued by the company’s Facebook account. This was when I realised how important copyright issues are for a company.
Through this programme, I gained experience and knowledge in research and practice. I was glad to work in such a bioinformatics company that benefits society. I would like to thank Prof. Maggie Wang, Wilson, Siyang and Lirong for their support. I would also like to express my gratitude to the CCRB for offering me such a valuable opportunity.
CHING, Sze Wai, BSc in Statistics
It was my pleasure to work as a Junior Data Scientist (Intern) at Beta Labs this summer. I was honoured to work with Dr Ming Cheung, who provided helpful guidance on the tasks and taught me a lot about data science.
My duty was mainly to assist the data science team. My first task was to write a program to generate the respective Autokeras code from Keras models. It was a difficult task as I was new to deep learning and had to study as much as I could to understand it. With guidance from my supervisor, I was able to learn about how deep learning works, how to train a model and the structure of deep learning models.
Another task of mine was product recommendation. The goal was to introduce a regularisation factor into the recommender system to reduce the bias of the recommendation. I was required to read a research paper on this issue and create a program to perform unbiased recommendations with optimum time complexity. Additionally, I created some Power BI dashboards to visualise and compare the results of different product recommendation algorithms.
I would like to express my gratitude to my supervisor, Dr Ming Cheung, for the guidance and advice. I learnt a lot about deep learning and recommender systems. The Beta Labs team were also very helpful and friendly. I am grateful to the Department of Statistics for nominating me to the internship programme. Not only did I learn technical skills and knowledge about data science, I also came to understand how data scientists, analysts and engineers work as a team to support each other. I believe this internship experience will be very beneficial to my future career.