Theses

We are always looking for motivated students who are interested in writing about a topic connected to our current research projects!

Potential Topics

Strategies for reducing annotation costs by implementing an LLM annotator (Master Thesis)

Training machine learning (ML) models relies on annotated (also called labeled) training data. Large Language Models (LLMs) offer great potential for data annotation. However, human annotations are likely still needed for difficult or ambiguous instances. A reasonable collaboration setup of LLM and human annotators could assign the easier instances to the LLM and the more complicated ones to humans. This approach could reduce annotation costs and let the human annotators focus on the more ambiguous cases. Best practices for allocating annotation tasks between humans and LLMs, in particular for subjective tasks, are yet to be developed. In this thesis you could develop and test algorithms for allocating tasks between the two types of annotators and study their impact on quality and cost. One indicator for routing an instance to the human (expert) annotator could, for example, be the LLM's self-assessed certainty. If interested, please email your CV and a brief explanation of your interest in the topic to jacob.beck@stat.uni-muenchen.de and CC soda@stat.uni-muenchen.de. Please also describe your familiarity with the topic.
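As a rough illustration of the routing idea, the following Python sketch sends every instance whose self-assessed LLM certainty falls below a threshold to a human annotator. The class and function names, the threshold of 0.8, and the per-instance prices are all hypothetical choices made for this example; a thesis would likely compare several routing indicators rather than rely on this one.

```python
from dataclasses import dataclass


@dataclass
class LLMAnnotation:
    """One LLM-produced label with a self-assessed certainty in [0, 1]."""
    instance_id: str
    label: str
    certainty: float


def route_annotations(annotations, threshold=0.8):
    """Split instances into accepted LLM labels and a queue for human review.

    Instances whose certainty is below `threshold` (an assumed cut-off)
    are routed to the human annotator; the rest keep the LLM label.
    """
    accepted, human_queue = {}, []
    for ann in annotations:
        if ann.certainty >= threshold:
            accepted[ann.instance_id] = ann.label
        else:
            human_queue.append(ann.instance_id)
    return accepted, human_queue


def annotation_cost(n_llm, n_human, llm_cost=0.01, human_cost=0.50):
    """Total cost under hypothetical per-instance prices."""
    return n_llm * llm_cost + n_human * human_cost


if __name__ == "__main__":
    batch = [
        LLMAnnotation("t1", "offensive", 0.95),
        LLMAnnotation("t2", "neutral", 0.55),
        LLMAnnotation("t3", "neutral", 0.88),
    ]
    accepted, human_queue = route_annotations(batch, threshold=0.8)
    print("LLM labels kept:", accepted)
    print("Sent to human:", human_queue)
    # The LLM labels every instance; humans only label the routed ones.
    print("Cost:", annotation_cost(len(batch), len(human_queue)))
```

Varying `threshold` traces out the trade-off between cost and the share of instances that receive human attention.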

Annotation sensitivity of LLM annotators (Master Thesis)

The way in which an annotation task is structured affects the annotations that human annotators provide, a phenomenon called annotation sensitivity. For example, the order in which annotations are collected and the number of screens can change whether tweets are annotated as containing hate speech or offensive language (https://dl.acm.org/doi/10.1007/978-3-031-21707-4_19; https://aclanthology.org/2024.uncertainlp-1.8/). With the growing use of LLMs as annotators, we wonder whether LLMs also show annotation sensitivity. Since LLMs are trained on data produced by humans, the models might inherit similar biases. In this thesis, you could replicate findings from the above studies with LLM annotators. If interested, please email your CV and a brief explanation of your interest in the topic to jacob.beck@stat.uni-muenchen.de and CC soda@stat.uni-muenchen.de. Please also describe your familiarity with the topic.

Multiple Imputation of Partially Observed Covariates in Discrete-Time Survival Analysis (Master Thesis)

We are seeking a motivated Master's student to embark on a methodological thesis project aimed at extending the scope of substantive-model compatible (SMC)-FCS multiple imputation (MI) techniques in discrete-time survival analysis (DTSA) to accommodate time-varying variables. Building on our existing work, which has successfully extended SMC-FCS MI for time-invariant covariates, this project will tackle the additional complexities introduced by time-varying variables. The successful candidate will conduct comprehensive Monte Carlo simulations to evaluate the extended methodology, and contribute to refining the practice of discrete-time survival analysis in the presence of missing data. If you are interested, please contact anna-carolina.haensch@stat.uni-muenchen.de with your CV, student records and a short explanation of why you are interested in this topic. In addition, please describe how familiar you are with the R programming language.

Implement MICE in Python (Master Thesis)

Are you a student with medium to strong Python skills looking to enhance your expertise? Consider implementing Multiple Imputation by Chained Equations (MICE) in Python for your master thesis. MICE is a statistical method for handling missing data, critical for reliable data analysis. This project offers the opportunity to gain hands-on experience with advanced data imputation techniques. Prior knowledge of missing data techniques is advantageous but not required; you will have the chance to learn on the job. By participating, you will produce a master thesis with significant practical applications, positioning yourself strongly in the data science field. Please contact anna-carolina.haensch@stat.uni-muenchen.de if you are interested.
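To give a feel for the chained-equations idea, here is a deliberately small Python sketch for two numeric variables. Each variable with missing values is regressed in turn on the other, and missing cells are replaced by the prediction plus random noise; repeating this with different seeds yields several completed datasets. All function names are hypothetical, and the sketch makes simplifying assumptions: it uses only simple linear regression and adds residual noise without drawing the regression parameters from their posterior, which a full MICE implementation would do.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return a, b, sd


def chained_imputation(rows, n_iter=10, seed=0):
    """Return one completed copy of `rows` (list of [x, y], None = missing)."""
    rng = random.Random(seed)
    data = [list(r) for r in rows]
    missing = [[v is None for v in r] for r in rows]
    # Initialise each missing cell with the observed mean of its column.
    for j in range(2):
        observed = [r[j] for r in rows if r[j] is not None]
        for i, r in enumerate(data):
            if missing[i][j]:
                r[j] = statistics.fmean(observed)
    # Cycle through the variables, re-imputing each one from the other.
    for _ in range(n_iter):
        for target in range(2):
            pred = 1 - target
            obs = [i for i in range(len(data)) if not missing[i][target]]
            a, b, sd = fit_line(
                [data[i][pred] for i in obs], [data[i][target] for i in obs]
            )
            for i in range(len(data)):
                if missing[i][target]:
                    data[i][target] = a + b * data[i][pred] + rng.gauss(0, sd)
    return data


def multiply_impute(rows, m=5, n_iter=10):
    """Create m completed datasets, one per random seed."""
    return [chained_imputation(rows, n_iter=n_iter, seed=s) for s in range(m)]


if __name__ == "__main__":
    rows = [[1.0, 2.1], [2.0, None], [3.0, 6.2],
            [None, 8.1], [5.0, 9.8], [6.0, 12.2]]
    for completed in multiply_impute(rows, m=3):
        print(completed[1], completed[3])
```

The analysis of interest would then be run on each completed dataset and the results pooled, for example with Rubin's rules.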

Master's Thesis on Law, Evidence-Based Policy, Sustainable Spatial Planning, and Data Science

This Master's thesis explores how data science can be used to assess the sustainability of land management and urban planning. The project deals with the implementation of regional or national requirements for climate change adaptation and climate protection in urban planning. The central question is how data science can be used to evaluate policy measures and administrative action. The tasks are:

  • Taking stock of and categorizing topics, principles, goals, and measures in administrative documents and laws
  • Structuring the information from the texts with NLP techniques, in particular Large Language Models
  • Developing a framework with domain experts by which categories can be classified, evaluated, and compared
  • Automated analysis and evaluation of the documents
  • Developing an accessible visualization of the data

Available data: a dataset of urban land-use plans (Bauleitpläne) that contain, among other things, detailed building-level information on environmental conditions, environmental risks, and required measures, as well as precise information on building type, height, and building density. A dataset of regional plans (Regionalpläne) that set requirements for urban land-use planning. In addition, flood maps, climate risk maps, and similar sources can be obtained, as can court proceedings and lawsuits relating to the plans. Federal states and regions: NRW, Bavaria, and the Rhine-Main-Neckar region.

You can develop your own research question together with the research team. The thesis requires the ability to work independently, an interest in interdisciplinary work, basic knowledge of sustainability and climate change, and good German language skills. A position as a student assistant may be offered alongside the Master's thesis. If interested, please send an email with your CV and a short cover letter to felicitas.sommer@tum.de and bolei.ma@lmu.de.

GIST: Greenhouse Gas Insights and Sustainability Tracking (Bachelor and Master Theses, you will work with Python)

Financial regulators and central banks are increasingly integrating sustainability aspects into their operations. The Corporate Sustainability Reporting Directive (CSRD) mandates that roughly 50,000 European companies will have to publish sustainability reports in the future, a great source of data for statistical analysis.

One particular challenge is that companies communicate their sustainability information through unstructured PDF reports that contain both numerical and textual data. To make this information amenable to quantitative research, GIST applies Natural Language Processing (NLP) and Large Language Models (LLMs) for data extraction.

Possible tasks include:

  • Implementing additional features in Python in our data extraction pipeline and/or comparing different methodologies.
  • Reviewing, replicating, and extending existing literature that makes use of sustainability reports.

If you are interested, please contact malte.schierholz@stat.uni-muenchen.de with your CV and a short explanation of why you are interested in this topic. In addition, please describe how familiar you are with the topic.

LLM Plugin for Learning Statistics and Programming (Bachelor and Master Theses)

Large language models (LLMs) such as ChatGPT have been a disruptive development in the world of artificial intelligence with many promising opportunities. As part of this thesis you would develop an R package to help students learn statistical programming in R and resolve errors in their code using the OpenAI API. Your package will be made available to hundreds of students in introductory courses for statistics and statistical programming across the LMU and you have the option to evaluate the collected data from these trials. If you are interested, please contact jan.simson@lmu.de and cc anna-carolina.haensch@stat.uni-muenchen.de with your CV and a short explanation of why you are interested in this topic. In addition, please describe how familiar you are with the R programming language.

Cross-Cultural Examination of Algorithmic Fidelity: Comparing GPT-3 and Survey Results (Master Thesis)

In this thesis project, you'll extend the current research on "algorithmic fidelity" (Argyle et al. 2023) in large language models to a new socio-cultural context. Choose a country and an election or a unique survey topic, and compare the model's output to actual survey results.

Your tasks will include:

  • Applying the concept of algorithmic fidelity in the chosen context.
  • Investigating GPT-3's response complexities relating to the interplay of ideas, attitudes, and the socio-cultural context of your chosen setting.
  • Identifying and examining potential biases in GPT-3's algorithm within your chosen context.

This project presents an opportunity to make significant contributions to a novel intersection of AI and social science, providing valuable insights into language models.
Please contact anna-carolina.haensch@stat.uni-muenchen.de with your CV and a proposal regarding the area of application if you are interested.

Dynamic Fairness and Algorithmic Decision-Making (Master Thesis)

Public agencies are increasingly automating the allocation of scarce public resources by making use of risk prediction models. While a wide range of studies focuses on bias in the application of such models, the long-term fairness implications of algorithmically assisted decisions are not fully understood. Building on the emerging literature of dynamic fairness, this project aims at studying feedback loops and the long-term consequences of algorithmic decision-making in social contexts. If you are interested, please contact christoph.kern@stat.uni-muenchen.de and cc anna-carolina.haensch@stat.uni-muenchen.de with your CV and a short explanation of why you are interested in this topic. In addition, please describe how familiar you are with the topic.

Policy Learning for Fair and Effective Interventions (Master Thesis)

ML methods are increasingly used in combination with ideas from the causal inference literature to explore heterogeneous treatment effects. Such approaches are useful, for example, for personalizing treatments in medicine or for selecting optimal treatment regimes in the delivery of welfare state measures. While topics such as explainability and transparency have already been studied in the past (see, e.g., policy trees), the connection of the causal learning literature to the fairML literature is still weak. However, it is well known that there are many biases present in data used for developing personalized treatments in medicine or in access to welfare state measures. Therefore, we seek students interested in exploring the connection between causal learning and fairML. If you are interested, please contact christoph.kern@stat.uni-muenchen.de, r.bach@uni-mannheim.de, and cc anna-carolina.haensch@stat.uni-muenchen.de with your CV and a short explanation of why you are interested in this topic. In addition, please describe how familiar you are with the topic.

Replicate a Meta-Analysis (Master Thesis)

This meta-analysis of 70 studies (Konrath et al. DOI: 10.1177/1088868310377395) claims that US college students' empathy levels have fallen over time. The decline becomes markedly steeper from 2000 onward. The authors speculate that the decline is due to social media use. Could the effect instead be due to changes in survey mode or declines in survey response rates over time? You would replicate the paper and add these methodological variables. If you are interested, please contact steph@umd.edu and cc anna-carolina.haensch@stat.uni-muenchen.de with your CV and a short explanation of why you are interested in this topic. In addition, please describe how familiar you are with the topic.

Harnessing Machine Learning for Early Detection of Cognitive Impairment

Mild cognitive impairment (MCI), affecting over 15% of adults aged 50 and above, often progresses to dementia, underscoring the importance of early detection. This thesis project focuses on developing innovative, machine learning-based diagnostics for MCI using non-invasive data collection methods. Unlike traditional approaches that rely on extensive neuropsychological testing unsuitable for widespread screening, this project proposes the use of machine learning algorithms to analyze computer use behaviors, particularly mouse movement data.
Participants in a large Internet panel, engaging with surveys on various digital devices, will have their mouse movements recorded. These data will serve as the foundation for developing algorithms capable of predicting levels of cognitive functioning and identifying early signs of MCI based on how participants interact with standardized tasks and questionnaires. This project presents a unique opportunity for students to contribute to critical advancements in medical diagnostics, offering a cost-efficient, automated, and unobtrusive method to potentially delay the onset of severe dementia symptoms. If interested, please email your CV and a brief explanation of your interest in the topic to felix.henninger@stat.uni-muenchen.de and CC soda@stat.uni-muenchen.de. Please also describe your familiarity with the topic.

AutoML for Fairness by Abstaining

Automated machine learning (AutoML) and hyperparameter optimization techniques can be used to tune fairness-aware machine learning models that trade off predictive accuracy and a fairness measure (for example, equality of opportunity). However, a recent study challenges the assumption that there is a fairness-accuracy tradeoff and suggests that only a few noisy samples per dataset are responsible for the perceived unfairness. When learning to abstain from such noisy samples using a bagging-based classifier, the study claims that standard models already produce fair predictions. However, this comes at the cost of leaving several data points unclassified, and ideally, one would like to minimize the number of data points the classifier abstains from. The goal of this thesis is to find out if we can:

  • reproduce the findings from the study on a large and cleaned set of datasets for fairness research, and
  • minimize the number of samples from which the classifier abstains via hyperparameter optimization (while maintaining good predictive performance), and
  • research if calibration of the classifier (on out-of-bag data) could be used to further improve performance.

Alternatively, one can try to reproduce findings from other studies, such as the ones from Perrone et al. or Cruz and Hardt in light of these findings. If you are interested, please contact christoph.kern@stat.uni-muenchen.de, matthias.feurer@stat.uni-muenchen.de and cc anna-carolina.haensch@stat.uni-muenchen.de with your CV and a short explanation of why you are interested in this topic.
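The two quantities this topic trades off can be made concrete with a short Python sketch: the coverage of an abstaining classifier and its equality-of-opportunity gap, here taken as the absolute difference in true positive rates between two groups among the samples that were actually classified. The function names, the use of `None` to mark an abstention, and the toy data are assumptions made for illustration; the study referred to above may define its metrics differently.

```python
def coverage(y_pred):
    """Share of samples that received a prediction (None marks abstention)."""
    return sum(p is not None for p in y_pred) / len(y_pred)


def true_positive_rate(y_true, y_pred):
    """TPR among positive samples the classifier did not abstain on."""
    hits = [p == 1 for t, p in zip(y_true, y_pred) if t == 1 and p is not None]
    return sum(hits) / len(hits) if hits else float("nan")


def equal_opportunity_gap(y_true, y_pred, group):
    """Absolute TPR difference between the two groups found in `group`."""
    first, second = sorted(set(group))  # assumes exactly two groups
    rates = []
    for g in (first, second):
        idx = [i for i, gi in enumerate(group) if gi == g]
        rates.append(
            true_positive_rate([y_true[i] for i in idx], [y_pred[i] for i in idx])
        )
    return abs(rates[0] - rates[1])


if __name__ == "__main__":
    # Toy data: binary labels (1 = positive), two groups, two abstentions.
    y_true = [1, 1, 1, 1, 0, 0, 1, 1]
    group = ["a", "a", "a", "a", "a", "b", "b", "b"]
    y_pred = [1, 1, 0, None, 0, 0, 1, None]
    print("coverage:", coverage(y_pred))
    print("equal opportunity gap:", equal_opportunity_gap(y_true, y_pred, group))
```

A hyperparameter search could then treat the gap, predictive performance, and coverage as competing objectives.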

We also welcome a thesis topic of your own! Please do not hesitate to contact us.

Contact Person

Dr. Anna-Carolina Haensch