OpenResearch: Collaborative Management of Scholarly Communication Metadata

Scholars often need to search for matching, high-profile scientific events to publish their research results. Information about topical focus and quality of events is not made sufficiently explicit in the existing communication channels where events are announced. Therefore, scholars have to spend a lot of time on reading and assessing calls for papers but might still not find the right event. Additionally, events might be overlooked because of the large number of events announced every day. We introduce OpenResearch, a crowd sourcing platform that supports researchers in collecting, organizing, sharing and disseminating information about scientific events in a structured way. It enables quality-related queries over a multidisciplinary collection of events according to a broad range of criteria such as acceptance rate, sustainability of event series, and reputation of people and organizations. Events are represented in different views using map extensions, calendar and time-line visualizations. We have systematically evaluated the timeliness, usability and performance of OpenResearch.

Research results are published as scientific papers in journals and at events such as conferences and workshops. Each component of this communication needs to be open and easily accessible. Besides conducting their actual research, scholars often need to search for scientific events to submit their research results to, for projects relevant to their research, for potential project partners and related research schools, for funding possibilities that support their particular research agenda, or for available tools supporting their research methodology. For lack of better support, scholars rely heavily on individual experience, recommendations from colleagues and informal community wisdom; they perform simple Web searches or subscribe to mailing lists, and are stuck with simplistic rankings such as calls for papers (CfPs) sorted by deadline. Domain-specific mailing lists are a medium often used by conference and workshop organizers for posting initial, second and final calls for papers, as well as deadline extensions. This situation leads to discussions on whether to allow calls for papers on the lists or to treat them as spam. It is especially hard for subscribers to filter those calls according to their individual interests, or to explicitly subscribe to important information, such as deadline extensions or subsequent calls, on a specific event or an event series.
On the other hand, the quality of scientific events is directly connected to the research impact and the rankings of the scientific papers published by them. For example, the Research Excellence Framework (REF) for assessing the quality of research in UK higher education institutions classifies publications by the venues they are published in. This facilitates assessing every researcher's impact based on the number of publications in conferences and journals. Providing such information to researchers supports them with a broader range of options and a comprehensive list of criteria while they are searching for events to submit their research contributions. To provide comprehensive information about scientific venues, projects, results etc., we present OpenResearch.org. OpenResearch is a platform for automating and crowd-sourcing the collection and integration of semantically structured metadata about scholarly communication. In particular, with regard to events, OpenResearch
1. reduces the effort for researchers to find 'suitable' events (according to different metrics) to present their research results,
2. supports event organizers in visibly promoting their event,
3. establishes a comprehensive ranking of events by quality,
4. provides a cross-domain service recommending suitable submission targets to authors, and
5. supports easy and flexible data exploration using Linked Data technology: a structured dataset of conferences facilitates selection regarding fields of interest or quality of events.
OpenResearch empowers researchers of any field to collect, organize, share and disseminate information about scientific events, projects, organizations, funding sources and available tools. It enables the community to define views as queries over the collected data; assuming sufficient data, such queries can enable rankings by relevance or quality. Driven by Semantic MediaWiki (SMW), OpenResearch provides a user interface for creating and editing semantically structured event profiles, tool and project descriptions, etc. in a collaborative wiki way. OpenResearch is part of a greater research and development agenda for enabling true open access to all types of scholarly communication metadata (beyond bibliographic ones), not just from a legal but also from a technical perspective. The work on OpenResearch is aligned with OpenAIRE, the Open Access Infrastructure for Research in Europe.
The remainder of this paper is organized as follows: Section 2 states the problem that OpenResearch intends to address. Section 3 presents the state of the art of existing services addressing the same problem. Section 4 establishes requirements for a system that can address the problem in a comprehensive way. Section 5 explains the approach and architecture of the OpenResearch platform. Section 6 presents the services that OpenResearch provides to its end users today. Section 7 discusses how we have assessed the timeliness, usability and performance of OpenResearch. Section 8 concludes and outlines future work.

Problem Statement
Challenge 1: Communication. Research communities use different communication channels to distribute event announcements and CfPs. Announcing CfPs through different mailing lists is the traditional but still most popular way of disseminating information about an event. Exploring the calls for papers posted on mailing lists of the Semantic Web community shows that 500 to 700 event announcements have been posted every year between 2006 and 2016 (approx. 15-30% of the overall traffic). This shows that a large and widely spread amount of unstructured data about scientific events is increasingly being published via communication channels not specifically designed for this purpose. Due to the interdisciplinary nature of research, event organizers easily overlook relevant channels to announce their event. In addition, browsing through the CfPs in several channels to identify events that might be of interest is a time and effort consuming task.
Challenge 2: Structure. There are structural differences across events, for example, events with many co-located events or sub-events, or new events that emerged from multiple smaller ones. One example of the latter is the Conference on Intelligent Computer Mathematics (CICM), which results from the convergence of four conferences that used to be separate but are now tracks of a single conference. Scholars who want to find out whether an event matches their research interests therefore have to understand its structure; if they cannot find the desired information for the super-event, they will have to study the sub-events.
Challenge 3: Series. Most scientific events occur in series, whose individual editions took place in different locations with narrow topical changes. Researchers often need to explore several resources to obtain an overview of the previous editions of an event series to be able to estimate the quality of the next upcoming event in this series.
Challenge 4: Addressing Different Stakeholders. Event organizers aim to attract as many submitters as possible to their events. Publishers want to know whether they should accept a particular event's proceedings in their renowned proceedings series. Potential PC members want to decide whether it is worth spending time in the reviewing process of an event. Similarly, sponsors and invited speakers need to decide whether a certain event is worth sponsoring or attending. Researchers receiving CfP emails have to distinguish whether the event is appropriate for presenting their work. Researchers searching for events through various communication channels assess events based on criteria such as thematic relevance, feasibility of the deadline, close location, low registration fee etc. The organizers of smaller events who plan to organize their event as a sub-event of a bigger event have to decide whether this is the right venue to co-locate with. These examples prove the importance of filtering events by topic and quality from the point of view of different stakeholders. Currently, the space of information around scientific events is organized in a cumbersome way, thus preventing events' stakeholders from making informed decisions, and preventing a competition of events around quality, economy and efficiency.
Strategies. Event organizers employ a number of strategies to cope with the challenges of advertising their event and engaging with the potential audience. They use multiple channels (mailing lists, social networks, homepages) to distribute CfPs. Some organizers plan deadline extensions in advance, as a strategy to attract more submissions. Some communities employ databases on top of mailing lists for announcing scientific events; e.g., researchers in information systems and databases use the DBWorld database (cf. Section 3). The strategies mentioned so far target authors of submissions, whereas event organizers also have to find sponsors, high-profile program committee members and keynote speakers. This is currently done by contacting researchers or companies that the organizers already know. A centralized and holistic infrastructure for managing the information about scientific events has been missing so far.

Related Work
CfP classification and annotation: CFP Manager [4] is an information extraction tool specific to the domain of computer science; it extracts metadata of events from an unstructured text representation of CfPs. Because of the different representations and terminologies of CfPs across research communities, this approach requires domain specific implementations. The extracted data is limited to the keywords used in the content of CfPs. In addition, CFP Manager does not support data curation workflows involving multiple stakeholders. Hurtado Martin et al. proposed an approach based on user profiles, which takes a scholar's recent publication list and recommends related CfPs using content analysis [3]. Xia et al. presented a classification method to filter CfPs by social tagging [10]. Wang et al. proposed another approach to classify CfPs by implementing three different methods but focus on comparing the classification methods rather than services to improve scientific communication [9].
Websites: Google Scholar Metrics (GSM) provides ranked lists of conferences and journals by scientific field based on a 5-year impact analysis over the Google Scholar citation data. The 20 top-ranked conferences and journals are shown for each (sub-)field. The ranking is based on the two metrics h5-index and h5-median. GSM's ranking method only considers the number of citations, whereas we intend to offer a multi-disciplinary service with a flexible search mechanism based on several quality metrics. DBLP, one of the most widely known bibliographic databases in computer science, provides information mainly about publications but also considers related entities such as authors, editors, conference proceedings and journals. Events, deadlines and subjects are out of DBLP's scope. DBLP allows event organizers to upload XML data with bibliographic data for ingestion. The dataset of DBLP is available as an RDF dump. DBWorld collects data about upcoming events and other announcements in the field of databases and information systems. Each record comprises event title, deadline, event homepage and the full-text description. WikiCFP is a popular service for publishing CfPs. Like DBWorld, WikiCFP only supports a limited set of structured event metadata (title, dates, deadlines), which results in limited search and exploration functionality. WikiCFP employs crawlers to track high-profile conferences. Although WikiCFP claims to be a semantic wiki, there is no collaborative authoring or versioning, only minimal structure, and the data is neither downloadable as RDF nor accessible via a SPARQL endpoint. Cfplist works similarly to WikiCFP but focuses on social science related subjects. Data is contributed by the community using an online form. SemanticScholar offers a keyword-based search facility that shows metadata about publications and authors. It uses artificial intelligence methods in the back-end and retrieves results based on highly relevant hits, with the possibility of filtering.
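The two GSM metrics named above can be made concrete with a short Python sketch. It follows the usual definitions (the h5-index is the largest h such that h articles published in the last five complete years have at least h citations each; the h5-median is the median citation count of those h articles). The citation counts are made-up sample values, not data from GSM or OpenResearch.

```python
from statistics import median


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def h_median(citations):
    """Median citation count of the h-core (the h most cited papers)."""
    ranked = sorted(citations, reverse=True)
    core = ranked[:h_index(citations)]
    return median(core) if core else 0


# Hypothetical citation counts for the papers a venue published in the
# last five years; restricting the input to that window gives h5 values.
venue_citations = [25, 18, 12, 7, 6, 5, 3, 1, 0]
print(h_index(venue_citations), h_median(venue_citations))
```

Both values depend on citation counts alone, which is the limitation of GSM's ranking noted above.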
Datasets: ScholarlyData provides RDF dumps for scientific events. Conference-Ontology, a new data model developed for ScholarlyData, improves over already existing ontologies about scientific events such as the Semantic Web Dog Food (SWDF) ontology. Springer LOD is a portal publishing conference metadata collected from the traditional publishing process of Springer as Linked Open Data. All these conferences are related to Computer Science. The data is available through a SPARQL endpoint, which makes it possible to search or browse the data. A graph visualization of the results is also available. For each conference, there is information about its acronym, location and time, and a link to the conference series. The aim of this service is to enrich Springer's own metadata and link them to related datasets in the LOD Cloud.
Other services: Conference.city is a new service launched in 2016 that lists upcoming conferences by location. For each conference, title, date, deadline, location and number of views of its conference.city page are shown. Based on the location of the conference, Google plug-ins are used to recommend flights, accommodation and restaurants. The service collects data mainly from event homepages and from mailing lists. In addition, it allows users to add a conference using a form. PapersInvited focuses on collecting CfPs from event organizers and attracting potential participants who already have access to the ProQuest service. ProQuest acts as a hub between repositories holding rich and diverse scholarly data. The collected data is not made available to the public.

Conclusion:
The comparison of currently available services in Table 1 shows that collaborative management of scholarly communication metadata in particular for events is not yet sufficiently supported.

Requirements
A collaborative and partially decentralized environment is required to enable community-based scientific data curation and extension, and to tap into the 'wisdom of the crowd' for elicitation and representation of metadata associated with scholarly communication. In particular, such a system aims to address the following requirements as services, which we have derived from the challenges C1-C4 pointed out in the problem statement and from the review of related work (R):

Approach
The core of the OpenResearch approach is to balance manual/crowd-sourced contributions and automated methods. OpenResearch uses semantic descriptions of scientific events based on a comprehensive ontology; this enables distributed data collection by embedding markup in conference websites aligned with schema.org, and links to other portals and services. Semantic MediaWiki (SMW) serves as the data curation interface, employing semantic forms, templates, various extensions and semantic annotations in the wiki markup. In the remainder, we describe the data model of OpenResearch and its architecture.

Data Model
The vocabulary used in OpenResearch reuses existing vocabularies from related domains, since reuse increases the value of semantic data. These include the Semantic Web Conference Ontology (SWC), the Semantic Web Portal Ontology (SWPO), and the Funding, Research Administration and Projects Ontology (FRAPO), as well as schema.org. The SWC, SWPO and schema.org vocabularies provide means for modeling general events; SWC and SWPO also cover conferences. FRAPO provides terms to express scientific projects and their relations. Conference Linked Data (COLINDA) contains information about scientific events collected from other systems such as WikiCFP and EventSeer and published as Linked Data, and the CfP ontology provides means for modeling calls. A specific ontology for CfPs has been proposed in [8].
The property alignment is implemented using the SMW mechanism for importing vocabularies. This includes definitions of the reused vocabularies in special vocabulary pages, e.g., the one for SWC, which lists all imported properties and annotates them with SMW data types for the values. Wiki categories and properties are then aligned with the vocabulary terms using special 'imported from' links. For instance, Category:Conference is aligned to swc:ConferenceEvent with [[imported from::swc:ConferenceEvent]]. For modeling the calls and roles for a conference, we defined new properties in our own vocabulary. Fig. 1 provides an example of using the data model. In contrast to the existing data model for calls and roles in the SWC ontology, we follow a flat structure, which allows users, e.g., to directly attach a deadline to an event rather than creating a new instance for a call in addition to the actual event.
Figure 2 depicts the three layers of OpenResearch's architecture: Data gathering, Data processing and Data representation.
Data processing: This layer enables the storing and management of unstructured data (text markup), semi-structured data (annotations and infoboxes), structured data (RDF data adhering to an ontology) and schema data (the underlying ontology). Two database management systems are used in the OpenResearch architecture: one to store the schema-level information, the other to store the generated semantic triples. SMW supports multiple triple stores for storing the RDF graph, e.g., Blazegraph or Virtuoso. We use Blazegraph, as it has been selected by the Wikimedia Foundation based on performance and quality. A MySQL relational database is used to store the templates, properties and form names.

Data Gathering and Scrapers
Data exploring: This layer comprises various means for human- and machine-readable consumption of the data. Several types of data representation are made possible by data exploration. CfPs are represented as individual wiki pages for each event instance, including a semantic representation of their metadata. SMW provides a full-text search facility and supports semantic queries. Queries and the visualization of their results are detailed in Section 6. Furthermore, the RDF triple store can be accessed using a SPARQL endpoint or a downloadable RDF dump.

OpenResearch Services
On top of the basic architectural layers, OpenResearch offers services for different stakeholders of scientific communication. As a semantic wiki, it offers initial LOD services and semantic representation of metadata about events. We address the issues discussed in section 2 by establishing a set of quality metrics for scientific events and implementing them as properties. We adopt the definition of quality as fitness for use, which, here, means the extent to which the specification of an event satisfies its stakeholders [5,6]. In the remainder of this section, the current services are explained in three categories: wiki pages, LOD services and queries.
Semantic Wiki Pages: SMW enables OpenResearch to provide a semantic representation of CfPs as one wiki page per event. In OpenResearch, specific semantic forms have been designed for each type of entity to make content creation and revision as easy as possible for users. Properties of each semantic object are populated via fields in these semantic forms. The following example shows the generated SMW wiki markup containing general information about an event. Further information about committee members, extensions and other important dates can also be provided in other parts of the form. The complete textual representation of the CfPs can also be added as content of the wiki page with embedded semantic annotations.
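As a rough illustration of how form fields might end up as template markup on an event page, the Python sketch below serializes a dictionary of field values into a MediaWiki template call. The template name `Event`, the field names and all values are hypothetical placeholders; they are not taken from the actual OpenResearch forms.

```python
def render_template(name, fields):
    """Serialize field values as a MediaWiki template call, one field per line."""
    lines = ["{{" + name]
    for key, value in fields.items():
        if value in (None, ""):
            continue  # fields left empty in the form are omitted
        lines.append(f"|{key}={value}")
    lines.append("}}")
    return "\n".join(lines)


# Hypothetical form input; field names are illustrative assumptions.
example_fields = {
    "Acronym": "EXAMPLE 2017",
    "Field": "Computer Science",
    "Submission deadline": "2017-01-15",
    "Homepage": "",
}
markup = render_template("Event", example_fields)
print(markup)
```

In a semantic wiki, the template would then map each field to a property, so that the values become queryable metadata.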

LOD Services: All data created within OpenResearch is published as Linked Open Data (LOD). In the following, we describe ways of accessing the OpenResearch LOD. Afterwards, we outline how the LOD approach enables building further services on top by sketching two possible ways of consuming the OpenResearch LOD: interlinking with relevant datasets, and using the OpenResearch LOD as an external plug-in for the Fidus Writer scientific authoring platform.
Accessing OpenResearch LOD: An updated version of the OpenResearch dataset is produced daily and is available for download and querying. The data is also queryable via a SPARQL endpoint. In addition, the semantic representation of the metadata for each event is available as an RDF feed on each page. The RDF feed for the EKAW 2016 resource is available at http://openresearch.org/Special:ExportRDF/EKAW_2016. To expose dereferenceable resources conforming with Linked Data best practices, the URI resolver provides URIs with content negotiation; e.g., for the EKAW 2016 resource the URI is http://openresearch.org/Special:URIResolver/EKAW_2016.
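The access paths described above can be exercised with plain HTTP. The following minimal Python sketch only builds the requests and does not send them. The two page URL patterns are those given in the text; the SPARQL endpoint address is not stated here, so it is left as a caller-supplied parameter (the `example.org` address is a placeholder). Which `Accept` values the resolver honours is an assumption.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE = "http://openresearch.org"


def page_name(title):
    """MediaWiki page names use underscores in place of spaces."""
    return quote(title.replace(" ", "_"))


def export_rdf_url(title):
    """URL of the per-page RDF feed."""
    return f"{BASE}/Special:ExportRDF/{page_name(title)}"


def resolver_request(title, mime="application/rdf+xml"):
    """Request for the resource URI, asking for RDF via content negotiation."""
    url = f"{BASE}/Special:URIResolver/{page_name(title)}"
    return Request(url, headers={"Accept": mime})


def sparql_request(endpoint, query):
    """SPARQL query sent as a URL-encoded POST body (SPARQL 1.1 Protocol)."""
    body = urlencode({"query": query}).encode("utf-8")
    return Request(endpoint, data=body,
                   headers={"Accept": "application/sparql-results+json"})


feed = export_rdf_url("EKAW 2016")
lookup = resolver_request("EKAW 2016")
# Placeholder endpoint; the real address is not given in this text.
query = sparql_request("http://example.org/sparql",
                       "SELECT * WHERE { ?s ?p ?o } LIMIT 10")
print(feed, lookup.full_url, query.get_method())
```

Passing any of the `Request` objects to `urllib.request.urlopen` would perform the actual call.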
Interlinking: To increase the coherence of the data, we interlink the OpenResearch LOD with other relevant datasets. We are applying the same technical framework that we are using for OpenAIRE Interlinking [1]. The following use cases enabled by interlinking show how connecting the linked dataset of OpenResearch with other relevant datasets enhances the services: 1. PC member recommendation: One of the difficult and time-consuming tasks for event organizers is to assemble a group of high-profile researchers as PC members. Interlinking the OpenResearch LOD with datasets including author and person information, such as ORCID, helps in this regard. 2. Sponsoring recommendation: It is often a challenge, especially for smaller events, to find local and international sponsors. On the other hand, organizations and companies who want to gain visibility and decide whether or not to sponsor an event can use OpenResearch.
Integration with an Authoring Platform: In this section, we introduce our approach to improving the authoring workflow [7]. The OpenResearch LOD will be plugged into the Fidus Writer authoring platform to improve the workflow in the following use cases:
1. Venue recommendation: One of the critical aspects in the process of writing and publishing is to find a suitable event to submit the scientific results to. The OpenResearch dataset contains data about events annotated with the corresponding scientific field as :category and keywords. In the OSCOSS project, we also annotate keywords from the content of the scholarly document being written; these could be passed to the OpenResearch search services. For example: find all events in the computer science field that focus on data analysis, big data, knowledge engineering, linked data. The results of such queries can be shown to the authors in a user-friendly interface with filtering metrics such as deadline and location distance.
2. Direct link to submission pages: The OpenResearch data contains a property named submission link that provides a direct link to the paper submission pages of events. The submission page of the targeted event can thus be made easily accessible from the authoring platform.
3. Notification services: There are different deadlines attached to events that authors should consider, such as abstract deadline, submission deadline or registration deadline, as well as deadline extensions. Enabling notification services in the authoring platform will support both organizers and researchers.
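The venue recommendation use case can be sketched in a few lines of Python. This is an illustrative in-memory filter under simplifying assumptions, not the planned Fidus Writer integration: the `Event` record, its field names and the sample events are hypothetical, and keyword matching is plain case-insensitive set overlap.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Event:
    acronym: str
    category: str
    keywords: set
    submission_deadline: date


def recommend(events, category, doc_keywords, today):
    """Events in the given field with an open deadline, best keyword match first."""
    wanted = {k.lower() for k in doc_keywords}
    scored = []
    for ev in events:
        if ev.category.lower() != category.lower():
            continue  # different scientific field
        if ev.submission_deadline < today:
            continue  # deadline has already passed
        overlap = wanted & {k.lower() for k in ev.keywords}
        if overlap:
            scored.append((len(overlap), ev))
    # More shared keywords first; earlier deadline breaks ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1].submission_deadline))
    return [ev for _, ev in scored]


# Hypothetical events and document keywords.
events = [
    Event("CONF-A", "Computer Science", {"linked data", "big data"}, date(2017, 3, 1)),
    Event("CONF-B", "Computer Science", {"knowledge engineering"}, date(2017, 1, 15)),
    Event("CONF-C", "Social Science", {"big data"}, date(2017, 2, 1)),
    Event("CONF-D", "Computer Science", {"data analysis", "big data"}, date(2016, 6, 1)),
]
keywords = ["data analysis", "big data", "knowledge engineering", "linked data"]
hits = recommend(events, "Computer Science", keywords, today=date(2016, 12, 1))
print([ev.acronym for ev in hits])
```

A filter on location distance, as mentioned in the use case, could be added as a further condition in the loop.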

Queries and Visualization of Results
To support the creation of various views, recommendations and ranked lists (by quality indicators), queries can be defined and executed using all defined properties and classes, and the results can be embedded in wiki pages. For example, events can be ranked by acceptance rate using the corresponding properties in queries. Figure 3 shows a map view of the upcoming events using location-based filtering. Similarly, calendar and timeline views show upcoming submission and notification deadlines as well as the events themselves. In addition, taking, for example, participation figures into account enables new indicators for measuring the quality and relevance of research that are not just based on citation counts [2]. Based on semantically enriched indicators, predefined SPARQL queries as well as form-based search facilities will be implemented for recommendation services.
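To illustrate the ranking by acceptance rate mentioned above, the following Python sketch computes the rate as accepted papers divided by submitted papers and orders events by it. The record layout and the numbers are hypothetical; in the wiki itself, such an ordering would be expressed through the sort options of an SMW inline query rather than in Python.

```python
def acceptance_rate(accepted, submitted):
    """Fraction of submissions accepted, or None if no submissions are recorded."""
    if submitted <= 0:
        return None
    return accepted / submitted


def rank_by_acceptance_rate(events):
    """(rate, acronym) pairs, most selective event first."""
    rated = []
    for ev in events:
        rate = acceptance_rate(ev["accepted"], ev["submitted"])
        if rate is not None:
            rated.append((rate, ev["acronym"]))
    return sorted(rated)


# Hypothetical figures for three events.
sample = [
    {"acronym": "CONF-B", "submitted": 120, "accepted": 60},
    {"acronym": "CONF-A", "submitted": 200, "accepted": 50},
    {"acronym": "CONF-C", "submitted": 0, "accepted": 0},
]
ranking = rank_by_acceptance_rate(sample)
print(ranking)
```

Events without recorded submission figures are left out of the ranking instead of being treated as having a rate of zero.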

Evaluation
The main objective of this work is to introduce a comprehensive approach for collaborative management of scholarly communication metadata with a special focus on events. For now, we are mainly interested in collecting data, as this allows us to provide more interesting analysis services. Nevertheless, we evaluated three aspects of OpenResearch by means of two surveys (on timeliness and on usability) and performance measurements of the system.
Timeliness questionnaire: In a survey, we asked 40 researchers from different fields, including Computer Science and Social Science, to explain how they explore scientific events. Over 75% of the participants agreed that having an event recommendation service is very relevant for them. For selecting an event to participate in, all participants confirmed that they consider information that is not served directly by the current communication channels. Some of these criteria are networking possibilities, review quality, high-profile organizers, keynote speakers and sponsors, low acceptance rate, high-quality co-located events, close location, and citation counts for accepted papers of previous years. Participants indicated that they explore scientific events using search engines, mailing lists, social media and personal contacts. Then, they assess the CfPs to find out whether an event satisfies their criteria. Over 85% of the participants supported the idea of using a knowledge base for this purpose.
Usability survey: A second survey addressed the usability of the system. Overall, 12 users participated in the survey; they have had several roles in scientific events (participant, PC member, event organizer and keynote speaker). 75% of the users replied that they had basic knowledge about wikis in general; however, half of them did not know about SMW. 66% familiarized themselves easily with OpenResearch, which suggests its suitability for researchers of different fields. Again, 66% answered that they needed less than 5 minutes to add a single event, which is little compared to the time organizers need to announce their event in several channels. The average number of single events created by individual users is 10. More than half of the participants needed less than 5 minutes for a bulk upload. The participants largely agreed that these times are reasonable.
Performance measurement: Currently, OpenResearch is running on a Debian server at the University of Bonn with 8 GB of RAM allocated. By private invitation (OpenResearch has not yet been publicly announced at a large scale), 70 users have been added during the last two months. More than 300 events have been added by the users during the same period, and the admins perform several bulk uploads of data every week, each creating 100 pages. The measured time for a bulk import varies with the content of the CfPs and decreases when events already exist in the system. The table below shows performance measurements of OpenResearch with respect to the average time and memory usage for several bulk imports and complex queries running over the event query form.

Conclusion and Future Work
With regard to scholarly communication, we are currently at a crossroads: On the one hand, there are commercial publishers and new incumbents such as social networks for researchers (e.g., ResearchGate, Academia.edu), which provide commercial services to the research community. Researchers either pay directly for these services by means of publication and access fees or indirectly (such as in the case of social networks) with their data. Either way, these commercial services strive to create a lock-in effect, which forces researchers to continue using these services without being able to migrate and choose competing services.
In the future, we envision intensifying data flows and service integration between OpenResearch and other open scholarly services. In particular, we are planning to import information from events' web pages, mailing lists and proceedings catalogs. Crawling events' web pages and extracting, e.g., embedded structured information such as schema.org RDFa or microdata, including the Event class and properties such as name, organizer, location, startDate, endDate, subEvent, or superEvent, keeps us up to date with the organizers. Extracting information from unstructured emails is challenging, but some emails have iCalendar attachments. Further information about events and their proceedings could be scraped from semi-structured listings such as the index page of the CEUR-WS.org open access workshop proceedings. Furthermore, we plan to relate events to other entities, e.g., publications, projects and datasets.
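As a hedged illustration of the planned extraction of embedded schema.org markup, the Python sketch below pulls a few Event properties out of HTML microdata using only the standard library. It is a simplified stand-in for a real crawler: it reads only the first Event item, ignores nested items (a location given as a nested Place would need extra handling), and does not cover RDFa. The sample page is invented.

```python
from html.parser import HTMLParser

WANTED = {"name", "location", "startDate", "endDate"}
VOID_TAGS = {"meta", "link", "br", "img", "hr", "input"}


class EventMicrodataParser(HTMLParser):
    """Collects selected itemprop values from the first schema.org Event item."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # open non-void tags inside the Event (0 = outside)
        self.done = False     # True once the first Event item has been closed
        self.current = None   # property whose text content is being collected
        self.props = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            itemtype = attrs.get("itemtype") or ""
            if ("itemscope" in attrs and itemtype.endswith("schema.org/Event")
                    and not self.done):
                self.depth = 1
            return
        if tag not in VOID_TAGS:
            self.depth += 1
        prop = attrs.get("itemprop")
        if prop in WANTED:
            # Prefer machine-readable values (<time datetime>, <meta content>).
            machine_value = attrs.get("datetime") or attrs.get("content")
            if machine_value:
                self.props[prop] = machine_value
            else:
                self.current = prop
                self.props[prop] = ""

    def handle_endtag(self, tag):
        if self.depth == 0 or tag in VOID_TAGS:
            return
        self.current = None
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.current:
            self.props[self.current] += data


def extract_event(html):
    parser = EventMicrodataParser()
    parser.feed(html)
    return {key: value.strip() for key, value in parser.props.items()}


# Invented sample page fragment.
sample_html = """
<div itemscope itemtype="http://schema.org/Event">
  <h1 itemprop="name">Example Workshop 2017</h1>
  <time itemprop="startDate" datetime="2017-05-28">28 May</time>
  <meta itemprop="endDate" content="2017-05-29">
  <span itemprop="location">Example City</span>
</div>
<p itemprop="name">not part of the event</p>
"""
event = extract_event(sample_html)
print(event)
```

The resulting dictionary could then be mapped to the corresponding wiki properties before import.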