- Project coordinator of NewsReader: a “Recorder of History”, a computer program that “reads” daily streams of news and stores exactly what happened, where and when in the world, and who was involved. The program uses the same strategy as humans: it builds up a story and merges it with information stored previously. The software does not store separate events but a chain of events according to a story line. Like humans, the program thus removes duplicate information and complements incomplete information in the news while reading. In the end, it maintains a single story line for the events. Unlike humans, the recorder will not forget any detail: it will be able to recall the complete and true story as it was told, and it will know who told which part of the story and which sources contradicted each other. The history recorder can be seen as a new way of indexing and retrieving information that helps decision makers handle billions of news items in archives and millions of incoming news items every day. Because of the abundance of information, current solutions simply return long lists of potentially relevant items. It is up to the user to sift through these results: removing duplication, putting pieces together and separating correct from incorrect information. As a result, it is often impossible to make truly well-informed decisions. The history recorder, however, is able to structure these results according to story lines, presenting the information as a single and complete history. In addition to organizing news as stories, the recorder can also abstract from individual stories to find trends and patterns. It can, for example, provide a quantified overview of the types of companies that are involved in take-overs in specific periods or regions, and correlate that with changes in management and profits. Since it keeps track of all the original sources of the information, the recorder can also provide insights into how the story was told. This will tell us about the different perspectives of sources on the news of today and of the past.
- Project coordinator of DF in which we will apply methods to extract the relevant concepts (e.g., the name of suppliers, or the type of relationship between companies, executive management) from unstructured (e.g., news) as well as semi-structured (e.g., contracts and financial) documents to populate knowledge graphs and link them to publicly available knowledge graphs.
- The VU University Research Fellowship (URF) is a programme developed for a select number of internationally renowned scientists at VU University Amsterdam. It is a token of appreciation and a public tribute to the university’s most excellent scientists for their extraordinary research performance. These scientists are entitled to reward the best student of their choice with a University Research Fellowship, which will carry their own name.
- Project coordinator of this project, which develops a model that provides a representation of things in the (real or assumed) world and allows us to indicate the perspective of different sources on them. In other words, we aim to provide a framework that can represent what is said about a topic, a person or an event and how this is said in and by various sources, making it possible to place alternative perspectives next to each other. We develop software to detect these perspectives in texts and represent the output according to our formal model, which is called GRaSP (Grounded Representation and Source Perspective). GRaSP is an overarching model that provides the means to: (1) represent instances (e.g. events, entities) and propositions in the (real or assumed) world, (2) relate them to mentions in text using the Grounded Annotation Framework, and (3) characterize the relation between mentions of sources and targets by means of perspective-related annotations such as attribution, factuality and sentiment.
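The three layers that GRaSP distinguishes (instances, mentions, and perspective annotations) can be illustrated with a small data-structure sketch. This is only an illustration under our own assumptions: the class names, field names, example URI and label values below are hypothetical and are not taken from the GRaSP specification or the project's software.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """An event or entity in the (real or assumed) world (hypothetical shape)."""
    uri: str
    kind: str  # e.g. "event" or "entity"


@dataclass(frozen=True)
class Mention:
    """A text span that denotes an instance; this is the grounding link."""
    document: str
    start: int
    end: int
    denotes: Instance


@dataclass(frozen=True)
class Perspective:
    """How a source presents a mention (labels are illustrative assumptions)."""
    source: str
    mention: Mention
    factuality: str  # e.g. "certain", "possible", "denied"
    sentiment: str   # e.g. "positive", "neutral", "negative"


def perspectives_by_instance(perspectives):
    """Group perspectives per instance so alternatives can be placed side by side."""
    grouped = defaultdict(list)
    for p in perspectives:
        grouped[p.mention.denotes.uri].append((p.source, p.factuality, p.sentiment))
    return dict(grouped)


# Two sources mention the same event but disagree about it.
takeover = Instance(uri="ex:takeover1", kind="event")
m1 = Mention("doc_a.txt", 10, 18, takeover)
m2 = Mention("doc_b.txt", 42, 50, takeover)
views = perspectives_by_instance([
    Perspective("Paper A", m1, "certain", "positive"),
    Perspective("Paper B", m2, "denied", "negative"),
])
```

Keeping the instance separate from its mentions is what allows two sources to disagree about the same event without creating two events; the perspective layer then records who said what, and how.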
- Member of the Kernteam and Technical officer within WP3 of CLARIAH responsible for the theme Interoperability.
- Member of the research team of this project, whose main goal is to study and create adequate linguistic models, using deep mathematical methods, over the semantic knowledge in lexicons and texts. We consider Word Sense Disambiguation a key element of high-quality semantic management, and it will be used as an experimental approach for testing the designed models.
- Project coordinator of this project, whose goal is to design the optimal architecture for processing as many daily news items as fast as possible using the deepest semantic processing currently available in Natural Language Processing (NLP), so-called deep reading. The complex and diverse technology developed in the European NewsReader project will be optimized for the infrastructure provided by EYR, exploiting the optimal capacity in a jungle architecture. The project will result in parallelized NLP pipelines involving a large variety of software. It will provide knowledge on what it takes to process the batch of news that comes in every working day (estimated at 2 million items), and it will also result in knowledge on how rich, complex and dynamic information streams of news really are.
- Prize: two years of access to data storage, computing facilities and visualisation infrastructure provided by SURFsara, to advanced network connections provided by SURFnet, and to support in mapping research solutions onto these e-infrastructure services by the Netherlands eScience Center (NLeSC), plus a cash prize of EUR 20,000.
- Project initiator/manager of this multidisciplinary project that combines expertise from history, computer science and computational linguistics. The aim of BiographyNet is to develop a demonstrator that supports the discovery of interrelations between people, events, places and time periods in biographical descriptions.
Project coordinator of Cornetto-LMF-RDF, which is a combined curation and demonstrator project in which the Dutch Cornetto database is converted to LMF and RDF and made available on a CLARIN Centre for efficient querying. As a semantic resource in which words and concepts are interlinked within the data and to other databases (e.g. wordnets in other languages and ontologies) this project will address many issues on the representation of meaning and user-queries to these data, such as the complex data structure (semantic and structural) and semantic linkage, such as hypernym chains of concepts or semantic typing of words. The project will combine a new release of Cornetto (version 2) with the data from DutchSemCor (a semantic annotation of text corpora) and a Dutch sentiment lexicon. The results are presented in LMF and the wordnet part also in RDF and SKOS. This bridges the standardization and metadata requirements of ISO and W3C.
- Member of the Kernteam and Technical officer within WP3 of CLARIAH responsible for the theme Interoperability.
- Member of the project team for research on a new methodology for programming many-core accelerators.
- Project coordinator of Dasym-valorisation of the NewsReader project.
- Project coordinator of this project which aims to develop a tool that visualizes subjectivity, perspective and uncertainty to make them controllable variables in Humanities research. The tool should allow users to compare information from different sources representing alternative perspectives and visualize subjectivity and uncertainty. Such a visualization enables improved and comprehensive source criticism, provides new directions of research and strengthens the methodology of digital humanities.
- Project coordinator of Modelling Perspectives, in which an MA student in philosophy and an MA student in computational linguistics take the first steps toward developing a sound computational method to extract and interpret information about perspectives as expressed in philosophical texts.
- Project coordinator of this project, which will establish a public-private partnership (PPP) encompassing different fields of knowledge and expertise to develop understandable and useful ways to visualize “big data news media” story lines of illicit trades in humans, wildlife, and drugs. The primary goal is to show the added value of this collaboration by defining a user context for specific forms of visualizing big data patterns in news media. These patterns serve as an investigative tool for purposes of crime monitoring, crime fighting, and policy making. According to Yury Fedotov, Executive Director of the United Nations Office on Drugs and Crime (UNODC), human trafficking, wildlife crime and drug dealing are among the most profitable trafficking flows in the world. These forms of serious transnational organized crime require new forms of intelligence. In recent years, there has been growing evidence of overlap between these forms of illicit trafficking. For law enforcement organizations and NGOs there is a pressing need to learn more about these developments and the organization of international trafficking networks. Much of the data used to estimate the value of the illegal markets is based on seizures or media reports. However, the data analyses are often ad hoc and anecdotal in nature and do not take into account systematic evidence based on state-of-the-art methods for performing big data analyses. The challenge of this proof-of-concept consortium-building project is to develop efficient workflows for tailoring the big data analysis visualizations to the specific needs of users (NGOs, police and other stakeholders).
- Project coordinator of MTN, which will make a first investigation to detect belief system dynamics in online trust networks on medical topics. It aims to study how beliefs converge, collide, and are countered, and how (dis)trust develops within and between trust networks over time. When it comes to health, the online debate can be very intense, involving a range of actors, from government and science institutions to citizens voicing opinions in (organized) patient forums, blogs, and tweets. Actors may refer to highly discrepant information sources, and opinions show strong dynamics. In the Netherlands, online debates recently focused on the human papilloma virus (HPV) and the Swine flu virus vaccination programs. Both led to strong negative opinions about the programs and against vaccinations in general, despite best efforts of the RIVM to provide the public with neutral and evidence-based counterarguments. As a result, the HPV vaccination campaign was only partially effective (56% of the target group in 2012, 58.1% in 2013 and 60% in 2014). What became clear from these debates is that within different communities, different belief systems exist. Interconnected social networks with shared belief systems can be seen as trust networks. Because protecting belief systems is key to the viability of such trust networks, members apply diverse defensive strategies when belief systems are challenged, such as opinion disconfirmation and discounting, or actor exclusion, derogation, and ostracizing. As a result, outside influence on opinions in trust networks is very limited – within-group processes yield convergence of opinions and narrow latitudes of acceptance. The project will use automatic techniques to analyse the dynamics of social communities as well as the content of their beliefs. Two academy assistants from social science and computational linguistics will work for 10 months in an inter-disciplinary set-up on this issue.
- Project partner of OpeNER. Currently there is a plethora of companies offering online Sentiment Analysis (SA) services; the majority are generic and monolingual. SA is a complex field at the edge of the current state of the art in NLP. Two key elements are Lexical Resources and Named Entity Recognition and Classification (NERC). Both elements allow for the measurement of “what” about “whom”. Building tools and resources for Opinion Mining (OM) is expensive: these basic technologies, which are fundamental to market qualification for enterprises offering OM services, are costly to develop. OpeNER proposes the reuse and repurposing of existing lexical resources, Linked Data and the broader Social Internet. OpeNER will focus on ES, NL, FR, IT, DE and EN, and create a generic multilingual graduated sentiment data pool reusing existing language resources (WordNets, Wikipedia) and automatic techniques. The Sentiment Lexicon will supplement popular or proprietary lexicons and will be expressed in a new mark-up format. Multilingualism and cultural skew in OM increase complexity, so the sentiment values will be culturally normalised to allow a “like-for-like” comparison. Tools for extension to other languages and domains will be provided. Fine-grained NERC, which is critical in SA, will be addressed by means of “Wikification” and Linked Data. Extensions to the generic OM system will be created for validation in the Tourism domain with partner SMEs and an End User Advisory Board. OpeNER will also create an online development portal and community to host data, libraries, APIs and services. Tasks focused on implementing models to ensure long-term self-sustainability and options for Open Licensing are included. OpeNER will provide base qualifying technologies and a means for continued development and extension to other languages and domains, freeing SMEs to concentrate their efforts on providing innovative solutions to meet market needs rather than on the expensive development of core technologies.
Associate partner of SIERA, which aims to reinforce closer and sustainable scientific cooperation between Palestinian and EU scientists in the field of multilingual and multicultural knowledge sharing technologies. This objective is attained by integrating the BZU Sina Institute, which is the largest ICT research centre in Palestine and among the few in the Arab world in this field, into the European Research Area. Two EU multilingual knowledge sharing portals (which were developed in previous FP7 and eTen projects) have been selected as a concrete testbed for establishing scientific collaboration and integration. The first, MICHAEL, is a cultural heritage portal which provides a multilingual service to explore digital collections from museums, archives, libraries and other cultural institutions from across Europe. The second, KYOTO (with Vossen as project coordinator), is a wiki portal about environment and ecology. The key idea is to use them to investigate how to enable large-scale knowledge sharing portals with Arabic language and content. Both portals already support multilingual knowledge sharing. Extending such portals to support Arabic content and semantic search is a challenging task because of the complexity of the language and because the Arabic content needs to be semantically interlinked with EU content. The MICHAEL and KYOTO portals were selected carefully, not only because their application domains are important areas of interest for EU and Arab societies and markets, but also because extending them with Arabic is, from a scientific viewpoint, a good case for setting up joint research and cooperation, exchanging knowledge, and tuning in-house methodologies and tools concretely.
- Project coordinator. Cinema is seen as a complex medium, in which production is directed by a multifaceted authorial agency. In this project, a portfolio-management environment will be developed in which film producers can report on the non-linear process of film making by describing steps in the process, intermediate products, networks, timelines and output. This allows both producers and researchers to understand and improve the production processes. The goal is a prototype of a mobile application and a tool with which the creative process can be recorded.
- Project partner of a project which tries to develop methods to analyse and visualize potentially meaningful relationships between artists and intellectuals by combining biographical data with relevant contextual information. A pilot study will bring together three complementary but heterogeneous (meta)datasets: Biographical Reference Works (HI-KNAW), Ecartico (UvA) and Hadrianus (KNIR). It will explore potential relationships in biographical data and cultural networks in the creative industry in Amsterdam and Rome in the Early Modern Period.
- Project coordinator where we develop methods for tracking the spreading of metaphors and key-phrases in different media, such as blogs, newspapers, policy-documents and scientific articles over time.
Project coordinator of KYOTO. The goal of KYOTO was to develop a content enabling system that provides deep semantic search and information access to large quantities of distributed multimedia data for both experts and the general public, covering a broad range of data from wide-spread sources in a number of culturally diverse languages. In this project we targeted the languages English, Dutch, Italian, Spanish, Basque, Chinese and Japanese. This powerful system crucially rests on an ontology linked to wordnets (lexical semantic databases) in a variety of languages. Concept extraction and data mining were applied through a chain of semantic processors that re-used the knowledge for different languages and for particular domains. The shared ontology guaranteed a uniform interpretation for diverse types of information from different sources and languages. The system can be maintained by field specialists using a Wiki platform. KYOTO is a generic system offering knowledge transition for any domain of knowledge and information, across different target groups in society and across linguistic, cultural and geographic borders. KYOTO will be applied to the environmental domain and span global information across European and non-European languages.
The research project Text2Politics combines contemporary theories and methods in linguistics and political science to develop an automated research tool for rich text mining. The transdisciplinary relevance of the project is that a carefully constructed mining tool for language-meaning research can be applied to enhance the Kieskompas (Electoral Compass) and prove useful in the social sciences in general. The research will give new insights into the complexity of language use, the linguistic modeling of subjectivity and the representation of this knowledge in a lexicon. It will also shed new light on the complex dimensionality of competition between political parties. The work is carried out by three AIOs who are based at the Faculty of Social Sciences and the Faculty of Arts, and is funded by the Interfaculty research institute CAMeRA.
The research project Semantics of History develops a historical ontology and a lexicon that are used in a new type of information system that can handle the time-based dynamics and varying perspectives in historical archives. The system will integrate new insights in the ontological and linguistic analysis of the data that will follow from empirical and fundamental research. The work is carried out by two AIOs that are situated at the Faculty of Exact Science and the Faculty of Arts and is funded by the Interfaculty research institute CAMeRA. History is typically a record of different realities in time and specifically focuses on the changes in reality. Even stronger, the perception of history can be different for different participants and for different cultural and linguistic groups. Finally, the reflection on the past can be different based on our different views: history has been and will be re-written many times. Information systems of historical archives should handle the dynamicity in time and represent all realities at an equal level while at the same time they should define the relations, the invariables and changes across the realities. The units of change are events and typically in history events can be organized at different levels of change. The most constant elements are locations, people and dates but nevertheless many different structures are still possible, which need to be related relative to these more constant elements. Such a system should also allow users to classify and structure reality from any possible perspective when accessing the archives. Vast amounts of historical data are available as free text. The text itself can be related in time just as the events. For direct reporting and communication in the same time-frame there will be little distance between the communication date and the event date. Historical documents on the other hand have a large distance between reporting and event date. 
We also expect that the linguistic expressions for naming these events will be different, exhibiting high abstraction and other types of perspectives in historical reports as compared to actual news reports. A historical information system requires an innovative view on the semantics of events and the ways we can conceptualize these through language in different genres of documents.
The core of the project is to support the user in querying cultural heritage data. In layman's terms: to help the user find cultural heritage objects based on cultural heritage data attributes (e.g. style, material, chronological dating). The result will be a working software prototype that supports the user in finding objects and can easily be extended to use different techniques. This project will be carried out by the Rijksdienst voor het Cultureel Erfgoed, the business engineering company Everest and the VU University Amsterdam.
- Project coordinator of DutchSemCor, which aims to deliver a one-million-word Dutch corpus that is fully tagged with senses and domain tags from the Cornetto database (STEVIN project STE05039). 250K words of this corpus will be manually tagged. The remainder will be automatically tagged using three different word-sense-disambiguation (WSD) systems, and will be validated by human annotators. The corpus data will be based on existing corpus material collected in the projects CGN, D-CoI and SoNaR. These corpora have already been automatically annotated with morpho-syntactic tags and structures. The corpora will be extended where necessary to find sufficient examples for meanings of words that are less frequent and do not appear in the above corpora. The resulting corpus, for which we aim to offer the same balance in types of text as these basic resources, will be extremely rich in terms of lexical semantic information. Its availability will enable many new lines of research and technology development for the Dutch language. In particular, it will enable research into the relation between language form and language interpretation, and as such it will be applicable in the fields of cognitive science, (psycho-)linguistics, language learning and language teaching, semantic web applications, information retrieval, machine translation, text mining, and document interpretation (summarization, topic segmentation). We foresee that the corpus will create new directions of research and technology development on a par with current developments for English.
- Project coordinator of the EuroWordNet I and II projects and site manager for the Dutch wordnet. The aim of the EuroWordNet project was to develop a multilingual database with wordnets for Dutch, Italian, Spanish, French, German, Czech, Estonian and English. Each wordnet was also linked to a so-called Inter-Lingual-Index, based on WordNet 1.5. The database has been tested in an Information Retrieval application by Novell Linguistic Development. For more information see:
- Powerpoint presentation by Vossen;
- EuroWordNet General Document;
- Vossen (ed.) 1998: EuroWordNet: a multilingual database with lexical semantic networks, Kluwer Academic Publishers;
- Vossen 2001: Condensed Meaning in EuroWordNet, in: F. Busa, P. Bouillon (eds): The Language of Word Meaning: studies in Natural Language Processing, MIT Cambridge University Press, pp. 363-384.
- Project partner of FLaReNet, which addressed the development and exploitation of Language Resources (LRs) and Language Technology (LT) that will enable easy and natural access to the wealth of information and knowledge encapsulated in (written, spoken, multimodal) digital documents. FLaReNet expects to achieve:
- the largest network of LR and HLT players, ready to discuss information and knowledge on LRs and LT, and with appropriate mechanisms to share such information and disseminate it widely
- an extended picture of LRs and a recasting of its definition in the light of recent scientific, methodological, technological, social developments
- a consolidation of methods and approaches, common practices, frameworks and architectures
- a roadmap identifying areas where consensus has been achieved or is emerging vs. areas where additional discussion and testing is required, together with an indication of priorities
- a set of recommendations in the form of a plan of coherent actions for the EU and national organizations
- a European model for the LRs of the next years.
Bringing together the diverse approaches, efforts and technologies represented by its membership, FLaReNet will enable even greater progress toward community consensus.
- Project coordinator of the extension of the Dutch wordnet with combinatoric and referential relations based on usage of words within domains. Cornetto was an initiative of Vossen and the Free University of Amsterdam to combine the Dutch wordnet and the Referentie Bestand Nederlands (a Dutch database with combinatoric information on Dutch word meanings) into a unique resource for Dutch. Cornetto covers 40K entries, including the most generic and central part of the language. The database goes beyond the structure and content of WordNet and FrameNet. The Cornetto database is available for download: free for non-commercial use and EUR 15,000 for commercial use. A demo is also available.
Project partner of Depression. For the effective treatment of depression, more and more therapies are becoming available that exploit computer technology, for example in the form of therapist-guided self-help modules on the Internet or more advanced support systems such as those being developed in the FP7 ICT4Depression project coordinated by the Computer Science Department of the VU University Amsterdam. In the latter project, a first attempt is made to incorporate predictive models for people suffering from depression, thereby allowing the system to predict what the course of a depression will be for a particular patient and how effective a certain therapy will be. In these attempts, however, the wealth of data about depressed patients that is available nowadays has not yet been fully considered. This wealth includes data about the development of the mental state of patients over time and about their involvement in the therapy, such as their adherence, the homework assignments they performed, their monitoring of mood and activities, and their free-text diaries. Utilizing such data opens new ways to improve computational models for depression. In this project, we propose to: (1) interpret free text in a large dataset from the domain of depression using sentiment analysis techniques, where the self-reported mood ratings that are also part of the dataset can be used as validation; (2) validate an existing predictive computational model for depression using the (interpreted) dataset; and (3) try to generate enhancements to the computational model by applying learning techniques to the dataset (more specifically, Genetic Programming).
Project partner of SPREAD. Research into cross-domain communication and the re-organization of linguistic meanings in complex interaction between different participating parties and public discourses is an emerging research field, triggered by the increase in the number of online databases archiving large numbers of communication documents. New methods are needed for the analysis of the spreading and resonance of metaphors and key-phrases across different discourse domains in online settings. The NI/KNAW Academy assistant project will 1) develop tools for tracking the spreading of metaphors and key-phrases in different media, such as blogs, newspapers, policy documents and scientific articles over time, and 2) chart the resonance of meanings attached to such metaphors and key-phrases when they are put to use in the different contexts of discourse domains. The resulting tool would be useful for social scientists and linguists in their efforts to trace the evolving dynamics of communication networks and conventionalized figurative meaning-creation. The project will focus on climate change communication, embedded in the context of the ongoing NWO-ORA project “Climate change as a complex social issue”, which will provide the NI/KNAW assistants with suitable data sets and established qualitative knowledge about climate change stakeholders, enabling tool development within the one-year time limit of the project. The new semi-automated text analysis tool for tracking the dynamics of metaphors and key-phrases across different discourses will be developed in collaboration with computational linguistics (technical tool development), metaphor research (LET) and organization sciences (FSW).
- Project coordinator of the Pilotgrant by NWO Geesteswetenschappen. The pilotgrant was used to prepare an NWO middel-groot investeringsproject submitted in September 2008. The goal of that final project is to deliver a corpus that is fully sense-tagged with senses, ontology tags, and domain tags from the Cornetto database. This corpus will play a key role in language technology research for Dutch and also in linguistic and cognitive research that relates linguistic form to meaning. Combining the best of both worlds, the corpus will be tagged using a combination of automatic techniques and manual editing. Automatic tagging techniques include on the one hand supervised methods, which can be trained on already tagged subcorpora as training data, enabling them to tag other subcorpora, and on the other hand unsupervised techniques that rely on other sources such as the Cornetto database itself. It is to be expected that the manual editing of the corpus will feed back in the form of adaptations to the semantic database Cornetto.
- Constructing an Arabic Wordnet (AWN) in Parallel with an Ontology. Project partner of this project, which aimed at the development of wordnets for the Arabic languages and was sponsored by the American government and headed by Princeton University. Vossen was responsible for the European development and coordination of building a lexical resource in Standard Arabic. AWN was constructed according to the methods developed for EuroWordNet (EWN; Vossen 1998), which have since been applied to dozens of languages around the world. The EuroWordNet approach maximizes compatibility across wordnets and focuses on manual encoding of the most complicated and important concepts. Arabic WordNet can be mapped straightforwardly onto Princeton WordNet 2.0 and EuroWordNet, enabling translation on the lexical level to English and dozens of other languages. Several tools specific to this task will be developed. AWN will be a linguistic resource with a deep formal semantic foundation. Besides the standard wordnet representation of senses, word meanings are defined with a machine-understandable semantics in first-order logic. The basis for this semantics is the Suggested Upper Merged Ontology (SUMO) and its associated domain ontologies.
- Project partner of MEANING which was concerned with automatically collecting and analysing language data from the WWW on a large scale, and building more comprehensive multilingual lexical knowledge bases to support improved word sense disambiguation (WSD). MEANING used state of the art NLP techniques pioneered by the consortium to enhance EuroWordNet with mainly language-independent lexico-semantic (concept) information. We used a combination of Machine Learning and Knowledge-Based techniques in order to enrich the structure of the wordnets in different domains (subsets of the web) in five European languages: English, Italian, Spanish, Catalan and Basque. The core technology used by MEANING included tools to perform language identification, morphological analysis, part-of-speech tagging, named-entity recognition and classification, sentence boundary detection, shallow parsing and text categorization.
- Project partner of Euroterm. The main objective of this preparatory action was to make public sector information available in different languages, beyond the single linguistic community it originates from. The extension of EuroWordNet and the Inter-Lingual-Index records with public sector terminology aimed at providing the infrastructure for facilitating access of European citizens and enterprises to public sector information. The action's actual purpose was to effectively combine multilingual domain-specific information into a common lexical database through a Terminology Alignment System, thereby lowering the barriers across European languages. Another main goal of the preparatory action was to explore how such a multilingual lexical database could facilitate access to environmental information and contribute to the development and use of European digital content. In the longer term, we expect that the domain-specific terminology incorporated into the EuroWordNet multilingual database will open up a whole new range of services in Europe at a transnational level.
- Project partner of BalkaNet, which aimed at the development of a multilingual lexical database comprising individual WordNets for the Balkan languages. The most ambitious feature of BalkaNet was its attempt to represent semantic relations between words in each Balkan language and link them together in order to develop an online multilingual semantic network. The main objective was the development of a WordNet for each language from available resources, covering the general vocabulary of that language. Semantic relations were classified in the independent WordNets according to a shared ontology. Then, all individual WordNets were organized into a common database providing links across them. Each of the WordNets was structured along the same lines as EuroWordNet through a WordNet Management System. This project was an excellent opportunity to explore the less studied Balkan languages and to combine and compare them cross-linguistically.
- Financially supported by the Dutch government within the framework of the CIC programme (Compete with ICT-competences) of the Ministries of Economic Affairs and of Educational and Cultural Affairs. In the Netherlands, a consortium of five organisations started the project with the aim of improving cross-lingual communication on the Internet. The name PidGin is derived from the pidgin languages that developed between people with different language backgrounds during colonial times. The PidGin project is a unique project that combines IT and linguistics. By using techniques from both fields, PidGin enabled internet users all over the world: 1. to retrieve multilingual information from the internet (human-machine); 2. to communicate with their fellow users in their own native languages (human-human). PidGin combined advanced search techniques with translation strategies. To improve searching, the computer had to be able to “understand” language by way of syntactic and semantic analysis. The further development of these techniques can be stimulated by translation techniques that go beyond the currently available ones, which are based on machine translation or translation memory. PidGin is self-learning and makes use of an enormous semantic network. The idea behind PidGin was not to deliver perfect translations, but to improve cross-lingual communication between man and machine and between people. It is like real-life communication between two persons with different mother tongues: by talking and writing to each other, each of them learns more about the language of the other. PidGin makes use of the same principle: every time it is used, it learns from the user and improves its own knowledge. The project was executed in two phases: at the end of 2002, it yielded the first results for cross-lingual communication between man and machine. The perfecting of cross-lingual communication between people was planned for 2004.
- Project manager of the SIFT project (LRE 62030) for Amsterdam, in charge of a team of 4 researchers and programmers. The aim of the SIFT project was to develop a text retrieval system that makes use of distributive semantic representations of words. Distributive representations consist of simple semantic features with weights indicating the relevance of these features for the concept associated with a word. Using these representations, equivalence can be measured in a flexible and computationally tractable way. The main result for Amsterdam was the development of the first version of the Amsterdam Lexicon System (ALS). This is a very efficient and fast object-oriented lexical database system, developed in C and running on Unix, Windows and Mac.
- Project partner of the sequel to Acquilex-I, in which other resources such as corpora were also considered for building lexicons, and in which we cooperated with the publishers Van Dale in the Netherlands, Cambridge University Press in the UK and Bibliograf in Spain. For more information see: Acquilex 1 & II.
- Senior researcher on the Acquilex-I project (Esprit BRA-3030), a joint enterprise of the Universities of Cambridge, Dublin, Pisa, Barcelona and Amsterdam, in which we examined the feasibility of building a multilingual knowledge base usable in Natural Language Processing on the basis of information extracted from several Machine Readable Dictionaries. His work focused on LDOCE and a Van Dale monolingual Dutch dictionary, which he linked using Van Dale bilingual Dutch-English and English-Dutch dictionaries. He developed definition parsers for Dutch definitions and wrote programs for automatically converting the analyzed information into formal typed feature structure representations, which can be loaded into a lexical knowledge base. Finally, he developed techniques and tools to compare the semantic organization of different dictionaries, including cross-linguistically. For more information see: Acquilex 1 & II.
- Researcher in the Links project, in which he developed a parser for the definitions of nouns, verbs and adjectives in the Longman Dictionary of Contemporary English and stored the output (65,000 parse trees) in the Nijmegen Linguistic Database. The Links project was funded by the Dutch Council of Research (NWO).