Bridging the Gap between Information Science, Information Retrieval and Data Science
An interdisciplinary CHIIR 2021 workshop for students, practitioners and researchers in Data Science, Information Retrieval, Information Science and Human-Computer Interaction.
BIRDS offers a range of invited talks and accepted peer-reviewed papers
YouTube playlist of recorded talks
March 19, 2021
Like last year we ran a workshop called BIRDS - Bridging the Gap between Information Science, Information Retrieval and Data Science - which aims to foster the cross-fertilization of Information Science (IS), Information Retrieval (IR), Data Science (DS) and Human-Computer Interaction (HCI). BIRDS is an interdisciplinary workshop for students, practitioners and researchers in the aforementioned disciplines. Recognising the commonalities and differences between these communities, we brought together experts and researchers in IS, IR, DS and HCI to discuss how they can learn from each other to provide more user-driven data and information exploration and retrieval solutions. Therefore, we welcomed submissions conveying interdisciplinary ideas on how to utilise, for instance, IS concepts and theories in IR and/or DS approaches to support users in data and information access. BIRDS will be online and collocated with the 6th ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR 2021).
The overarching theme of the BIRDS workshop is to look at how Data Science, Information Retrieval, Information Science and Human-Computer Interaction can complement each other by applying a more holistic approach to these disciplines that goes beyond traditional DS or IR or IS alone.
BIRDS aims at extending the scope of current research to provide a view on data and information in all its quantity and variety through investigating user preferences and interaction. The cross-fertilization of DS, IR and IS that we want to address in this workshop goes three ways.
To this end, relevant topics of the workshop include, but are not limited to:
IS models and theory applied to IR and DS
DS models and theory applied to IR and IS
IR models and theory applied to IS and DS
The target audience of the workshop comprises students, practitioners and researchers in DS, IR, IS and HCI, from academia and industry alike.
YouTube playlist of recorded talks and proceedings available.
Since the early 1990s, digital libraries have been devised to support particular communities (societies) engaged in varied activities (scenarios) with a focus on specialized types of content (structured streams of data, locatable and presentable using spaces – vector/probability/topological and 1D/2D/3D). Whether these efforts involve already known items or finding new/additional items (often through searching/browsing/visualizing), digital libraries typically support such discovery and exploration. The 5S (societies/scenarios/spaces/structures/streams) framework facilitates devising, populating, and using such digital libraries, with appropriate mixes of data and information, guided by various types of knowledge. We explain how key personas, such as curators, data scientists, and those with information needs, can work with future digital libraries. We describe a new approach to devising such extensible information systems that integrate information retrieval, information science, and data science approaches and requirements, involving teams of UX, subject matter experts, data scientists, and DevOps personnel. We also discuss technology transfer methods, with customer discovery of the ecosystem of users (societies), to help ensure the relevance and utility of such digital libraries.
Knowledge workers such as information professionals, legal researchers and librarians need to create and execute search strategies that are comprehensive, transparent, and reproducible. The traditional solution is to use command-line query builders offered by proprietary database vendors. However, these are based on a paradigm that dates from the days when databases could be accessed only via text-based terminals and command-line syntax. In this talk, we explore alternative approaches based on a visual paradigm in which users express concepts as objects on an interactive canvas. This offers a more intuitive UX that eliminates error, makes the query semantics more transparent, and offers new ways to collaborate and share best practices.
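As a rough illustration of what "concepts as objects" could mean underneath a visual canvas, the sketch below models a search strategy as nested objects and renders it to a conventional Boolean query string. The class names `Term` and `Group`, the `render` method and the output syntax are assumptions made for this example; they are not taken from the talk or from any vendor's query language.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Term:
    """A single search concept, e.g. a keyword or phrase (hypothetical model)."""
    text: str

    def render(self) -> str:
        # Quote multi-word phrases so they stay together in the query string.
        return f'"{self.text}"' if " " in self.text else self.text


@dataclass(frozen=True)
class Group:
    """Several concepts combined with one Boolean operator."""
    operator: str  # "AND" or "OR"
    children: Tuple[Union["Term", "Group"], ...]

    def render(self) -> str:
        # Parenthesise every group so operator precedence is always explicit.
        joined = f" {self.operator} ".join(c.render() for c in self.children)
        return f"({joined})"


# A synonym block combined with a second concept.
synonyms = Group("OR", (Term("heart attack"), Term("myocardial infarction")))
query = Group("AND", (synonyms, Term("aspirin")))
print(query.render())
```

Because every group is an object that renders its own parentheses, bracket-matching mistakes that are easy to make in hand-typed command-line syntax cannot occur in this representation, and a block such as `synonyms` can be shared and reused as a unit.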
Social Science Research can benefit from the massive amount of digitized content and the heterogeneous online sources available nowadays. An example is research activities in Science and Technology Studies, e.g., those investigating the presence and the perception of Science and Technology issues in the Media. Methodologies rooted in Data Science and Information Access can play a crucial role in supporting these research activities. In this context, users are specialists in Social Sciences, and their task is to investigate research hypotheses. This talk is about some of the challenges arising when supporting social scientists in their investigations on the media's discourse on technoscientific issues. The presentation will focus on a methodology and a system designed to support access and exploration of longitudinal corpora through diverse representations and diverse forms of user-system interaction.
It is not unusual to find that organisations need to search across 200 million+ files in multiple formats and in perhaps nine languages, and yet rarely is any training provided for what is almost always a business-critical search. There is also the need to support both known-item and exploratory search, and often groups of professional searchers within the organisation need to undertake high-recall searches. Very little research has been carried out into what is commonly described as ‘enterprise search’. This paper will examine the reasons behind this lack of research, summarise the emerging appreciation of how search is undertaken inside an organisation and suggest areas where it would be of value to the IR community and to the organisation to undertake research.
Medical data is generated and stored throughout a patient’s life. Exploiting such an amount of heterogeneous information can hardly be done by humans, but recent advances in artificial intelligence now make it possible to model patient trajectories and use them to predict certain factors in the future. In this talk, I will present recent work on prediction from patient trajectories with two use cases: sleep apnea disorders and hospitalization of GPs’ patients.
Text Categorization (TC) is the act of assigning text documents to predefined categories, for instance to distinguish between pro and contra arguments for a specific topic. TC can be automated either with fixed rules or with machine learning. The difference between machine learning and programming is that in machine learning, the machine creates its own program based on sample data. In the context of TC, the sample data are example assignments of documents to categories, called the Target Function.
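The contrast between fixed rules and a learned program can be sketched in a few lines. The abstract does not name a learning algorithm, so the example below uses multinomial Naive Bayes with add-one smoothing purely as a stand-in; the function names, the keyword rule and the four sample assignments are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def rule_based(text):
    """Fixed rule written by hand: one keyword decides the category."""
    return "contra" if "harmful" in text.lower().split() else "pro"


def train(examples):
    """Collect per-category word counts from (text, category) sample assignments."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in examples:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, model):
    """Multinomial Naive Bayes with add-one smoothing (illustrative stand-in)."""
    word_counts, doc_counts = model
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        # Start from the log prior, then add one log likelihood per word.
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            score += math.log(
                (word_counts[category][word] + 1) / (total_words + len(vocabulary))
            )
        scores[category] = score
    return max(scores, key=scores.get)


# Invented sample assignments playing the role of the target function.
examples = [
    ("nuclear power is clean and reliable", "pro"),
    ("nuclear power reduces emissions", "pro"),
    ("nuclear waste is dangerous", "contra"),
    ("reactor accidents are dangerous and costly", "contra"),
]
model = train(examples)
print(classify("waste is dangerous", model))
```

The rule in `rule_based` never changes unless a programmer edits it, whereas the behaviour of `classify` is determined entirely by the sample assignments passed to `train`.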
Machine Learning based Text Categorization (MLTC) can be used for many different applications. One such application is Argument Mining (AM), the finding of pro and contra arguments in large text corpora. Other examples include the assignment of news articles to specific categories, spam filtering, detection of offensive language in internet communications or the detection of user intent when interacting with a voice assistant like Amazon’s Alexa or Apple’s Siri.
MLTC is already widely applied. However, whenever a new application is developed that requires MLTC features, four fundamental problem fields arise. Firstly, the technical integration effort is high. This means that multiple prerequisites must be available, and programmers need to be familiar with details about the MLTC process. Secondly, a high effort is required to collect examples for the MLTC process to learn from and to provide manually crafted resources such as lists of relevant words for specific topics. Thirdly, according to the GDPR, MLTC systems operated in the EU that impact European citizens must be explainable. Generating explanations for the behavior of machine learning is no trivial task and an area of active research. A fourth problem field is semantic shift and the emergence of new knowledge. Previous resources and examples can become obsolete with future developments.
To overcome these problem fields, our previous work combined two research frameworks, the Design-oriented Information Systems Research methodology (DIRS) and the Research Framework for Information Systems Research (RFISR), to gain insight into the problem fields and create artifacts that can overcome them. After assessing the state of the art in relevant areas of science and technology, a formal problem model was constructed. Capitalizing on recent trends in information technology, such as Big Data and Cloud Computing, a microservice-oriented application that quickly provides explainable MLTC was designed and prototypically implemented. This prototype can even function without a target function by using word embeddings and other recently emerged technologies. The created suite of microservices has already been evaluated in five different applications that apply MLTC. Even though the evaluation shows slightly inferior effectiveness compared to technologies that are fine-tuned for their specific problems, the created system can be applied to these different problems in two different natural languages in a matter of minutes. Unlike the most effective existing applications, the created system also generates explanations for its decisions. A qualitative evaluation and subsequent survey have already shown that the explanations are of a high quality and understood by a majority of survey participants. The developed prototype also possesses the ability to create new categories to organize documents when new knowledge emerges.
This capability of requiring few or no examples and other manually provided resources is well suited for scenarios in which such examples are hard to obtain. One such scenario is query-by-example information retrieval. This paper investigates how the developed MLTC microservices can be used for on-the-fly query construction that supports result sets including relevance feedback and faceted browsing. A unique characteristic of this information retrieval approach is its ability to generate explanations stating reasons why certain results are elements of the result set. Its applicability to faceted browsing also means that the approach is well suited for information filtering applications.
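The general idea of categorizing without training examples while still giving a reason for each decision can be illustrated with a toy sketch. It is not the authors' microservice: the two-dimensional vectors in `VECTORS` are invented numbers standing in for pretrained word embeddings, category names double as seed words, and the scoring rule (mean cosine similarity) and the explanation format are assumptions made for this example.

```python
import math

# Toy 2-d "embeddings", invented for illustration only. A real system would
# load pretrained word vectors instead.
VECTORS = {
    "football": (0.9, 0.1), "goal": (0.8, 0.2), "match": (0.7, 0.3),
    "election": (0.1, 0.9), "vote": (0.2, 0.8), "parliament": (0.1, 0.8),
}


def cosine(a, b):
    """Cosine similarity of two 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def categorize(text, categories):
    """Return (category, explanation) using only category names as seed words."""
    known = [w for w in text.lower().split() if w in VECTORS]
    if not known:
        return None, "no known words in the document"
    # Score each category by the mean similarity of the document's words to it.
    scores = {
        c: sum(cosine(VECTORS[w], VECTORS[c]) for w in known) / len(known)
        for c in categories
    }
    best = max(scores, key=scores.get)
    # The explanation names the word closest to the chosen category.
    evidence = max(known, key=lambda w: cosine(VECTORS[w], VECTORS[best]))
    return best, f"assigned to '{best}' mainly because of the word '{evidence}'"


label, why = categorize("the vote in parliament", ["football", "election"])
print(label, "-", why)
```

The same pattern extends to result sets: each retrieved document can carry the evidence words that placed it in a facet, which is the kind of explanation the abstract describes.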
We give an overview of the current state of research regarding the persistence and reproduction of meta-information and artifacts used during a Big Data analysis process. The collection of meta-information and artifacts as well as the business modeling of the analysis process are performed via appropriate user interfaces based on the Business Process Model and Notation and provided as packages via an automation interface. The work is based on requirements for Data Scientists from industry, which were discussed in an expert interview at the EGI Community Forum 2015. These requirements were converted into various user stereotypes and serve as the basis for creating initial use case diagrams that also distinguish the technical perspectives of the respective user stereotypes with the help of personas. The use case diagrams are enriched by the individual phases and tasks of the Cross Industry Standard Process for Big Data Reference Model (CRISP4BigData), which was already presented at the Collaborative European Research Conference 2016 (CERC2016). Based on these use case diagrams and personas, concrete proposals for a software architecture will be derived in further work.
In the second talk we present current research work on the design and implementation of a graphical user interface based on the CRISP4BigData use cases, with respect to the technical requirements of the model-view-controller approach and the Symfony PHP framework. At the same time, the CRISP4BigData UI is to be integrated into the Knowledge Management Ecosystem Portal in order to extend the portal's functionalities and thus add value. The Knowledge Management Ecosystem Portal (KM-EP) is based on the approach of Virtual Research Environments to solve problems around content and knowledge management application scenarios. The KM-EP already provides solutions for Centralized Digital Library and Media Archive, Unified Access Rights, Detailed Metadata, Faceted Search with Categorization, and Open Contribution. The CRISP4BigData UI is intended to take up this idea and provide an automated and guided documentation process for Big Data analyses and their (process) meta-information and artifacts. The collection of meta-information and artifacts as well as the business modeling of the analysis process are performed via appropriate user interfaces based on the Business Process Model and Notation and provided as packages via an automation interface.
BIRDS will take place in two sessions on March 19, 2021:
Please note there might be minor changes to the schedule.
Programme Chairs (Workshop Organisers)
Programme Committee