Academy Assistants 2018 granted
Vossen and his team members Fokkens, Maks and Sommerauer of the Computational Lexicology & Terminology Lab (CLTL) received four grants for the Academy Assistant Projects 2018. With this program the Network Institute aims to interest bright young master's students in conducting scientific research and pursuing an academic career. The program brings together scientists from different disciplines; every project combines methods and themes from informatics, social sciences and/or humanities. For each project, two or three student research assistants work together. The projects result in papers and/or research proposals. For 2018, the following projects were granted.
Analysing large numbers of documents is a common and time-consuming task. For instance, investigating unsavoury business practices (e.g., slavery, fraud, bribery) can involve processing large numbers of contracts, yearly reports and external (news) sources that may reflect on a company’s reputation and relations. Currently, this is a labour-intensive task that relies mainly on text search to identify relevant documents, which are then processed manually.
In this project we will apply methods to extract the relevant concepts (e.g., the names of suppliers, the type of relationship between companies, or executive management) from unstructured (e.g., news) as well as semi-structured (e.g., contracts and financial) documents to populate knowledge graphs and link them to publicly available knowledge graphs. These knowledge graphs should reflect the temporal binding and provenance of the extracted relations and properties. This will enable automated reasoning about companies and their relationships, such as structure of ownership or supply chains and their dynamics. It will also make it possible to use external news sources as well as document collections such as the Panama papers in investigations and due diligence processes to automatically identify suspicious entities that companies interact with, even if this interaction is indirect.
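As a rough illustration of what a relation with temporal binding and provenance could look like, and how an indirect interaction might be traced, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Relation` fields, the company names, the source file names and the breadth-first search are hypothetical and do not describe the project's actual data model or tooling.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One extracted relation with a validity period and its source."""
    subject: str
    predicate: str
    obj: str
    valid_from: int  # first year in which the relation holds (assumed granularity)
    valid_to: int    # last year in which the relation holds
    source: str      # provenance: the document the relation was extracted from


def active_in(relations, year):
    """Keep only the relations that hold in the given year."""
    return [r for r in relations if r.valid_from <= year <= r.valid_to]


def find_path(relations, start, goal, year):
    """Breadth-first search for a chain of relations linking two entities.

    Relations are treated as undirected links. Returns the list of
    relations on a shortest chain, or None if the entities are not connected.
    """
    adjacency = {}
    for r in active_in(relations, year):
        adjacency.setdefault(r.subject, []).append((r.obj, r))
        adjacency.setdefault(r.obj, []).append((r.subject, r))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbour, rel in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [rel]))
    return None


# Hypothetical example data, invented for this sketch.
RELATIONS = [
    Relation("AcmeCorp", "supplied_by", "BetaLtd", 2015, 2018, "contract_017.pdf"),
    Relation("BetaLtd", "owned_by", "ShellCo", 2016, 2017, "news_2016_03.txt"),
]

if __name__ == "__main__":
    chain = find_path(RELATIONS, "AcmeCorp", "ShellCo", year=2016)
    for rel in chain or []:
        print(rel.subject, rel.predicate, rel.obj, "(source:", rel.source + ")")
```

Because every link carries its own source and validity period, an investigator could check each step of such a chain against the original document, and the same query for a different year may yield no connection at all.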
At the outset of the project, we will construct a small corpus of relevant questions for document forensics tasks, together with hand-crafted gold standard answers as a benchmark for project success. The questions will be grouped in sets of increasing difficulty: answerable over a single document, answerable over multiple documents, answerable only with background knowledge.
If successful, this project will open up possibilities to guarantee fair trade practices and substantially reduce the effort to comply with regulations that aim at combating money laundering, financing terrorism, bribery, etc.
Building task-specific sentiment analysis models: an evaluation of the active learning approach
As automatic text analysis has become an established methodological field in the humanities and social sciences, one of the most sought after techniques is the automatic extraction of attitudes, emotions, judgments and opinions. Under the banner of sentiment analysis or opinion mining, these techniques have widely been used in scientific research as well as professional applications. Since sentiment can be defined and operationalized in multiple ways, and the expression of sentiment can differ greatly across domains, there is no single, universal sentiment analysis tool. Rather, dictionaries and models need to be tuned for specific use cases.
In this project we investigate the potential of a semi-supervised approach called active learning as a fast and powerful way to train customized, task-specific sentiment analysis models. The essence of active learning is that a human annotator interactively trains a machine learning model. An algorithm provides the annotator with the texts that are most relevant for improving the model, which greatly reduces the number of texts that require coding and thus enables researchers to supervise the training process themselves.
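The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses uncertainty sampling (one common way of picking the "most relevant" texts) with a toy bag-of-words logistic model, and the function names, example texts and the simulated annotator are all hypothetical. It is not the Prodigy API and not the project's actual model.

```python
import math


def predict(weights, text):
    """Probability that a text is positive, from summed word weights."""
    score = sum(weights.get(tok, 0.0) for tok in text.lower().split())
    return 1.0 / (1.0 + math.exp(-score))


def update(weights, text, label, lr=0.5):
    """One gradient step of logistic regression on a single labelled text."""
    error = label - predict(weights, text)
    for tok in text.lower().split():
        weights[tok] = weights.get(tok, 0.0) + lr * error


def most_uncertain(weights, pool):
    """Uncertainty sampling: pick the text whose prediction is closest to 0.5."""
    return min(pool, key=lambda t: abs(predict(weights, t) - 0.5))


def active_learning(pool, annotate, rounds):
    """Repeatedly ask the annotator for the label of the most uncertain text."""
    weights = {}
    pool = list(pool)
    labelled = []
    for _ in range(min(rounds, len(pool))):
        text = most_uncertain(weights, pool)
        pool.remove(text)
        labelled.append((text, annotate(text)))  # human-in-the-loop step
        for seen_text, seen_label in labelled:   # retrain on all labels so far
            update(weights, seen_text, seen_label)
    return weights, labelled


# Hypothetical pool and simulated annotator (1 = positive, 0 = negative).
POOL = ["great vaccine", "terrible attack", "great news", "terrible outbreak"]
GOLD = {"great vaccine": 1, "terrible attack": 0, "great news": 1, "terrible outbreak": 0}

if __name__ == "__main__":
    model, history = active_learning(POOL, GOLD.get, rounds=3)
    print([text for text, _ in history])
```

In a real setting the `annotate` callback would be a person working in an annotation interface, and the gain comes from stopping after far fewer rounds than there are texts in the pool.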
Previous studies show promising results, but focus mostly on document-level sentiment scores, often in short social media messages. In this project we investigate the application to journalistic texts, incorporating the holder and target of sentiment. We evaluate whether active learning enables us to train new models (RQ1) and retrain existing models (RQ2) for better performance on specific sentiment attribution tasks, using the Prodigy annotation tool. Using two corpora (on terrorism and vaccinations, respectively), we develop two separate models for performing the same task in different domains. Additionally, two gold standard sets will be annotated independently of the active learning annotation process to detect possible bias caused by this particular approach. Based on these analyses, we discuss the potential applications of active learning for sentiment analysis.
Heattweet: Exploring the link between weather and aggression on social media
Recently, meteorological conditions (e.g., temperature) have been linked to expressed sentiment on social media (Baylis et al., 2018). In this project we focus on the influence of meteorological conditions on expressions of interpersonal and intergroup aggression in social media messages, and on a possible explanatory mechanism, i.e. strength of future orientation. Given the importance of social media in interpersonal and intergroup communication nowadays, expressions of aggression in social media messages may threaten societies’ interconnectedness and inclusiveness.
According to the model for CLimate, Aggression, and Self-control (CLASH; Van Lange, Rinderu, & Bushman, 2017), higher temperatures may increase aggression because they result in a weaker future orientation, which is linked to lower levels of self-control (e.g., Baumeister et al., 1994). However, some psychological experiments suggest that higher temperatures may actually inhibit aggression and promote prosocial behavior by enhancing relational mindsets (e.g., IJzerman & Semin, 2009) and affiliative motivation (e.g., Fay & Maner, 2012). To make things even more complicated, other research suggests a curvilinear relationship between temperature and aggression (Van de Vliert et al., 1999).
In the current project, we will explore the link between the daily temperature and other meteorological conditions in the Netherlands (data obtained from KNMI), and expressions of interpersonal and intergroup aggression extracted from social media data (provided by Coosto). Proxies for aggression include terms of abuse (i.e., swear words), and words specifically related to, e.g., racist discourse (e.g., Tulkens, 2016), hate speech, and cyberbullying (e.g., Del Vigna et al., 2017). In addition to existing word lists, dictionaries will be composed semi-automatically, using wordnet propagation, corpus comparison, and pattern extraction (Baccianella et al., 2010; Maks et al., 2014). Degree of future orientation will be assessed by detecting the use of temporal references (e.g., tomorrow, next week; see Basic et al., 2018), and subsequently tested as an explanatory mechanism.
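A dictionary-based measurement of this kind might look roughly like the sketch below. The word lists are tiny hypothetical placeholders (only "tomorrow" and "next week" come from the text above), the tokenizer is deliberately naive, and the two functions are assumptions about how daily scores could be computed, not the project's actual pipeline.

```python
import re

# Hypothetical placeholder lexicon; a real study would use curated word lists.
AGGRESSION_TERMS = {"idiot", "scum"}
# Temporal references signalling future orientation (examples from the text).
FUTURE_TERMS = ("tomorrow", "next week")


def tokenize(text):
    """Lower-case a message and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def aggression_rate(messages):
    """Share of all tokens in a day's messages that are aggression terms."""
    tokens = [tok for message in messages for tok in tokenize(message)]
    if not tokens:
        return 0.0
    return sum(tok in AGGRESSION_TERMS for tok in tokens) / len(tokens)


def future_share(messages):
    """Share of messages containing at least one future-oriented reference."""
    if not messages:
        return 0.0
    hits = sum(
        any(re.search(r"\b" + re.escape(term) + r"\b", message.lower())
            for term in FUTURE_TERMS)
        for message in messages
    )
    return hits / len(messages)


if __name__ == "__main__":
    day = ["You idiot", "See you tomorrow", "next week maybe"]
    print(aggression_rate(day), future_share(day))
```

Computed per day, the two scores could then be set against that day's temperature, with the future-orientation score serving as the candidate explanatory variable.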
The semantics of meaning: distributional approaches for studying philosophical text
Concepts such as schizophrenia, marriage or fact change through time. In philosophy, these changes are studied in a small number of scientific or scholarly texts at a time through very precise, subtle, manual analyses (close reading). In computational linguistics, the changes in question are studied in massive, generic corpora such as the whole of Wikipedia, by computational methods largely based on so-called ‘word embeddings’, representations of word meaning in a semantic space using vectors based purely on their surrounding words. The current challenge in philosophy is to obtain fine-grained analyses at a bigger scale (Betti & van den Berg 2016). The current challenge in computational linguistics is to detect non-trivial shifts of meaning, while increasing reliability through a firm methodological grasp of the real factors influencing the results (Hellrich & Hahn 2016).
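To make the distributional idea concrete, the following Python sketch represents a word by counts of its neighbouring words and compares two such representations with cosine similarity. It is a deliberately simple count-based stand-in, not a trained embedding model and not nonce2vec; the window size, function names and example sentences are assumptions chosen for illustration.

```python
import math
from collections import Counter


def cooccurrence_vector(target, sentences, window=2):
    """Count the words occurring within `window` positions of the target."""
    vector = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                vector.update(left + right)
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical sentences standing in for two periods of a corpus.
    early = ["a fact is a true statement"]
    late = ["the fact seems odd"]
    similarity = cosine(
        cooccurrence_vector("fact", early),
        cooccurrence_vector("fact", late),
    )
    print(similarity)
```

A low similarity between the vectors of the same word in two sub-corpora is one crude signal of a shift in usage; with very little text, such counts are sparse and unstable, which is part of what makes small corpora methodologically demanding.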
In this project philosophers and computational linguists conduct an interdisciplinary pilot study with the aim of combining the strengths of both fields. We will rely on a test case from a corpus comprising the writings of the American philosopher W. V. Quine. The corpus is small from a computational linguistics point of view, but rather big from a philosophical point of view. The philosophers will provide a dataset, a test case and an evaluation set centering around subtle shifts in a number of concepts (such as science, fact, intuition). The computational linguists will apply to this type of text an adaptation of word embedding models along the lines of nonce2vec (Herbelot and Baroni 2017), which is designed to learn embeddings from tiny data. The focus of the project will be methodological. The project will be considered successful if, next to a software release, an adequate evaluation method for this type of data and this type of interdisciplinary project has been developed by the end of the project.