
Hi, I am Dennis!

Dennis Aumiller

PhD Student at the Database Systems Research Group, Heidelberg University

I am currently a fourth-year PhD student at Heidelberg University, focusing primarily on Text Summarization, with a special interest in generating aspect-focused narratives. I also have a general interest in Natural Language Processing, particularly applied Machine Learning for NLP. I previously interned with Amazon Research and SAP, and helped build a search engine for lawyers at Codefy. As a commitment to open and reproducible research, I publish the code repositories for my publications on GitHub. However, my biggest contributions to open source are my questions and answers on Stack Overflow, which have to date reached more than half a million people worldwide.

Skills

Recent Posts

Professional Experience

Applied Scientist Intern
Amazon Research

Berlin, Germany

Aug 2021 - Dec 2021

Situated within the organization of Amazon Search, I was part of a newly formed team investigating the applicability of multilingual NLP solutions for search scenarios.

Responsibilities:
  • Investigated sequential recommendation systems for customer query suggestions
  • Extracted suitable dataset and implemented a train/test pipeline for neural seque
  • Found flaws for tail queries in the existing live system and reported preliminary improvements in recall with my own sequential recommender tool

Software Engineer (part-time)
Codefy GmbH

Heidelberg, Germany

Sept 2019 - Jan 2021

During my PhD, I was involved in a local start-up, building a document search engine for lawyers.

Responsibilities:
  • Built backend for product prototype, helping the company secure €200,000 in seed funding
  • Primary lead on document processing, developing a pipeline suitable for diverse document types
  • Optimized document database operators, improving ingestion time by over 30%
  • Built prototype for unsupervised keyphrase extraction module on legal documents

Software Engineering Intern
SAP SE

Walldorf, Germany

June 2018 - Sept 2018

At SAP, I was part of the Product & Innovations team, working on cloud-based ML solutions.

Responsibilities:
  • Optimized Machine Learning operations with randomized algorithms
  • Achieved up to 1000x speed-up while maintaining comparable accuracy
  • Extensively benchmarked solutions to highlight performance discrepancies in existing tools

Heidelberg University

Heidelberg, Germany

Feb 2015 - Mar 2019

Throughout my studies, I worked at different groups within the Institute of Computer Science.

Teaching Assistant

Oct 2018 - Mar 2019

  • Teaching Assistant for the graduate lecture “Complex Network Analysis”
  • Additionally, helped design the assignments and final exam
Teaching Assistant

Apr 2016 - Sept 2017

  • Teaching Assistant for the lecture “Databases 1” in Summer 2016 and Summer 2017
  • Lecture Assistant for the graduate course “Computer Graphics” in Winter 2016/17
  • Prepared and held weekly tutorial sessions for students and graded assignments
Student Assistant

Feb 2015 - Mar 2016

  • Tasked with the integration of a group’s website into the new corporate identity template

Education

PhD in Computer Science; Supervised by: Prof. Dr. Michael Gertz
Focus Area: Text Summarization and NLP
Aug 2017 - May 2019
M.Sc. Applied Computer Science
German GPA: 1.0 (with distinction; equiv. GPA: 4.0)
Minor: Computational Linguistics
Focus Area: NLP and Network Analysis
Thesis: "Implementation of a Relational Document Hypergraph for Information Retrieval"; Grade: 1.0 (with distinction)
Sept 2017 - Apr 2018
Exchange Year, Computer Science Program
CGPA: 3.95 out of 4.0
Focus Area: Machine Learning and Algorithmic Game Theory
Extracurricular Activities
  • Executive Member, UofT eSports Club
  • Executive Member, Undergraduate AI Group
Oct 2013 - Aug 2017
B.Sc. Applied Computer Science
German GPA: 1.4 (equiv. GPA: 3.6)
Minor: Computational Linguistics
Focus Area: Computer Graphics and Visualization
Thesis: "Mining Relation Networks from University Websites"; Grade: 1.0 (with distinction)

Publications

Online DATEing: A Web Interface for Temporal Annotations
SIGIR 2022 July 2022

We present an interactive interface that unifies access to several temporal annotation frameworks. Aside from a graphical interface, we allow users to programmatically access the various tools through a streamlined API.

Klexikon: A German Dataset for Joint Summarization and Simplification
LREC 2022 June 2022

We present a high-quality resource of full-text alignments between German Wikipedia and a German children’s encyclopedia. This yields a dataset that we empirically show to be suited for both summarization and simplification tasks.

Time for some German? Pre-Training a Transformer-based Temporal Tagger for German

Following a series of prior experiments in English, we outline a generalized procedure for training a non-English temporal tagger with weakly supervised data. For this purpose, we utilize existing rule-based taggers as a way to scale up existing training resources in low-resource settings by several orders of magnitude.

Deep Learning und Legal Tech - Eine Bestandsaufnahme

In this (German) article, we outline the challenges that are currently preventing mainstream adoption of recent NLP advancements in the legal industry. Primarily, this can be attributed to a lack of proper domain generalization, as well as limited interpretability and scalability of such models.

BERT got a Date: Introducing Transformers to Temporal Tagging
arXiv Sept 2021

We experimented with various transformer-based architectures to see which ones would work best for extracting temporal annotations, such as ‘yesterday’ or ‘every week’. However, we have since found a significant flaw in our evaluation setup for seq2seq-based models, so we decided to retract this article. The resulting tagging-based models are still valid, though, and are available online.

Structural Text Segmentation of Legal Documents
ICAIL 2021 June 2021

Existing segmentation tools, which primarily operate on sentence-level granularity, perform poorly when segmenting long documents, which are prevalent in a legal context. In this work, we address the issue by proposing a weakly-supervised paragraph-based segmenter, whose effectiveness we demonstrate empirically on a novel dataset consisting of web Terms of Service documents.

UniHD @ CL-SciSumm 2020: Citation Extraction as Search
SDP @ EMNLP 2020 Nov 2020

We participated in the workshop’s shared task on extracting relevant paper sections in cited works. Interestingly, we show that our setup based on traditional search heuristics, coupled with improved pre-processing steps, outperforms our BERT-based retrieval setup. Overall, we placed third on the blind shared task test set.

TiCCo: Time-Centric Content Exploration
CIKM 2020 Oct 2020

This demonstration illustrates a time-centric approach to content exploration. Extracting and processing temporal mentions on large document collections allows a temporal expression of events, even when the documents themselves are not ordered chronologically.

A Versatile Hypergraph Model for Document Collections
SSDBM 2020 July 2020

Results from my Master’s thesis contributed to the experimental section of this work. Primarily, we present a theoretical retrieval model based on hypergraphs, and demonstrate that its operations can be utilized to perform common Information Retrieval operations more efficiently on co-occurrence based word networks than traditional dyadic graphs.

Time-centric Exploration of Court Documents

Ordering events in a chronological fashion requires an accurate modeling of the temporal hierarchy, which previously was not well-defined for long-term event horizons spanning several decades. Here, we present a temporal model that is shown to work well, even without explicit temporal ordering in underlying document collections.

DNA accessibility of chromatosomes quantified by automated image analysis of AFM data
Scientific Reports Sept 2019

In a collaboration with molecular biosciences, we developed an automated Image Processing pipeline that was able to speed up the annotation process and improve accuracy when determining lengths of chromatosome strands in AFM images. In this work, it was ultimately shown how mutations in a particular gene can cause different winding patterns.