An architectural proposal to explore the data of a private community through visual analytic

This document proposes a way to study the data that will be generated in the private and anonymous community of the WYRED project, in order to extract knowledge about how its users interact, both with each other and with the platform. The work starts with the creation of a system that generates a set of test data as close as possible to the real data. With this information, and considering the impact of privacy when dealing with the project's data, a flexible and complete architecture is proposed for developing interactive visualizations of the generated data. Finally, a use case is presented that demonstrates the suitability of visual analytics for analyzing the project's data and extracting knowledge in a simple way.


INTRODUCTION
Today, social networks are among the fastest-growing types of communities, thanks to the wide diffusion of information and communication technologies [1]. However, they still present some problems, such as managing privacy or analyzing the data to increase the knowledge of what is happening within them. In addition, experts find that, due to the volume of information these communities generate, it is currently not possible to analyze manually what occurs in them.
The WYRED project consists of the development of a technological ecosystem [2] intended to understand in greater depth the interests and problems of young people and the way they face them and, ultimately, to be a place where their voice is heard and taken into account [3,4].
A technological ecosystem is a set of technological elements that together cover all the needs of a project. To do so, it must manage the users and the generated information, support the dissemination of these data, integrate with other technological ecosystems, and allow each of these aspects to evolve to fit changes in the project [5][6][7]. In the case of the WYRED project, this ecosystem consists of four distinct parts: a service that is responsible for anonymizing users, a private platform where dialogues with young people take place, a system for dissemination in social networks, and a public website that presents the project.
The architectural proposal presented in this paper is centered on the WYRED community, which is similar to a forum where users organize discussions in communities, threads and comments, but which can also host social dialogues and research projects. However, the project has a number of characteristics that distinguish it from others [8], such as its use in an international context (several languages, different sociological characteristics and very different points of view) and the need to safeguard users' privacy, firstly because they may be minors and secondly because the platform is meant to be a place where they can interact freely, which requires a high degree of anonymity. Due to the large amount of data that the project is going to generate, the use of visual analytics is proposed as an effective technique for representing and extracting knowledge [9].
The main objective of this work is to propose, in these early stages of the WYRED project, the architecture of a system that supports the development of interactive visualizations that help to better understand the data and to anticipate the future needs of the project.
This architecture has to be flexible enough to adapt to the diverse characteristics of the project, and must also allow any type of visualization to be built on it, now or in the future. It must help researchers in two main tasks:
• Know how the community evolves and the content that is being generated.
• Assist in the decision-making process.
Therefore, although the main subject of study of the project is young people, the end users of the architecture will be the project's researchers. In line with the final objective of the project, what is ultimately sought is to influence the decisions of public representatives, so that they develop actions that help improve the lives of young people and take advantage of their contributions.
This article is organized as follows: first, the WYRED project is introduced; then the architectural proposal is described, along with the way in which a testing dataset has been generated; later, a use case is presented, followed by the main conclusions.

WYRED PROJECT
WYRED is a project that aims to provide a framework for research in which children and young people can express and explore their perspectives and interests in relation to digital society, as well as a platform from which they can communicate their perspectives to other stakeholders effectively through innovative engagement processes. It will do this by implementing a generative research cycle involving networking, dialogue, participatory research and interpretation phases centred around and driven by children and young people, out of which a diverse range of outputs, critical perspectives and other insights will emerge to inform policy and decision-making about children and young people's needs in relation to digital society.
WYRED aims to give young people a voice and a space to explore their concerns and interests in relation to digital society, and to share their perspectives and insights with stakeholders and with other strata of society.

ARCHITECTURAL PROPOSAL
An architectural proposal defines each of the elements of a system and the way in which they interact. This type of work becomes necessary when undertaking a project of a certain size, since such a project involves a multitude of requirements that must be fulfilled for users to reach a high degree of satisfaction. Without it, there is a risk that the project will not achieve all the proposed objectives or that the quality of the result will be very low. This project has to support a large number of requirements, the main ones being:
• The ability to work with different data sources.
• Support to manage their privacy.
• Automatic analysis of data (as far as possible).
• The ability to represent data through interactive visualizations.
To support these requirements, it has been decided to use an architecture called microkernel [10]. This architecture provides minimal functionality in the kernel and complements it with a set of components that perform the tasks required by the users. This model is a change of philosophy with respect to the layers pattern, which is characterized by stacking layers horizontally, each having a specific role within the application.
The great advantage of applying this architecture here is that the core is only in charge of obtaining the data and anonymizing them, while each component is in charge of processing those data and producing the corresponding visualization. This yields a very flexible architecture, in which new visualizations can easily be added and existing ones removed if their results are not satisfactory [11].
This proposal has been designed taking the Docker architecture, shown in Fig. 1, as a reference, because it is one of the best-known examples of a microkernel architecture (https://goo.gl/LGk7vj). The proposal, shown in Fig. 2, consists of two layers that form the microkernel and two main layers for each of the components, which lead to the generation of interactive visualizations.
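As an illustration of the microkernel idea described above, the following minimal Python sketch shows a core that only obtains and anonymizes the data, and a registry where visualization modules can be plugged in or removed. All names (Microkernel, load_records, drop_direct_identifiers, messages_per_country) and the sample records are hypothetical and are not taken from the WYRED implementation.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Module = Callable[[List[Record]], object]


class Microkernel:
    """Minimal core: obtains the data, anonymizes them, and hands them to modules."""

    def __init__(self, load: Callable[[], List[Record]],
                 anonymize: Callable[[Record], Record]) -> None:
        self._load = load
        self._anonymize = anonymize
        self._modules: Dict[str, Module] = {}

    def register(self, name: str, module: Module) -> None:
        """Plug in a visualization module."""
        self._modules[name] = module

    def unregister(self, name: str) -> None:
        """Remove a module, for example if its results were not satisfactory."""
        self._modules.pop(name, None)

    def run(self) -> Dict[str, object]:
        """Run every registered module on the anonymized data only."""
        data = [self._anonymize(record) for record in self._load()]
        return {name: module(data) for name, module in self._modules.items()}


# Hypothetical stand-ins for the two kernel layers.
def load_records() -> List[Record]:
    return [
        {"name": "Ana", "country": "ES", "messages": 3},
        {"name": "Ali", "country": "TR", "messages": 5},
    ]


def drop_direct_identifiers(record: Record) -> Record:
    return {key: value for key, value in record.items() if key != "name"}


# Hypothetical module: the data preparation behind a geographic view.
def messages_per_country(data: List[Record]) -> Dict[str, int]:
    totals: Dict[str, int] = {}
    for record in data:
        country = record["country"]
        totals[country] = totals.get(country, 0) + record["messages"]
    return totals


kernel = Microkernel(load_records, drop_direct_identifiers)
kernel.register("geography", messages_per_country)
result = kernel.run()
```

In this sketch the modules never see the raw records, which mirrors the separation between the kernel layers and the components.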

Data sources
An architectural proposal to explore the data of a private community through visual analytic, TEEM 2017, October 2017, Cádiz, Spain
Obtaining data in this project involves more than just querying a database, because the information is distributed among several services running on several machines.
The private information of the users is stored in a CAS, Central Authentication Service (https://goo.gl/xD4Jkg), following the approach of other studies that have faced this problem [12,13].
The public information of the users is part of the WYRED platform and is available in its database. Finally, the information about the users' interaction with the platform is stored in a NoSQL database, in order to deal satisfactorily with scalability problems [14,15].
This layer of the microkernel will therefore be in charge of merging the data from the different sources, in addition to retrieving the information.
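The merging task of this layer can be sketched as a join on a shared user identifier. The example below is only a sketch under assumptions: the three sources are represented as in-memory Python structures, and the field names (user_id, action, email, country) are hypothetical, since the real schemas of the CAS, the platform database and the NoSQL store are not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def merge_sources(private: Dict[str, dict],
                  public: Dict[str, dict],
                  interactions: Iterable[dict]) -> List[dict]:
    """Join the three sources on a shared user identifier."""
    events_by_user = defaultdict(list)
    for event in interactions:
        events_by_user[event["user_id"]].append(event["action"])

    merged = []
    for user_id in sorted(public):
        record = {"user_id": user_id}
        record.update(private.get(user_id, {}))               # private data (CAS)
        record.update(public[user_id])                        # public profile (platform database)
        record["actions"] = events_by_user.get(user_id, [])   # interaction log (NoSQL store)
        merged.append(record)
    return merged


# Hypothetical sample data for the three sources.
private_data = {"u1": {"email": "ana@example.org"}}
public_data = {"u1": {"country": "ES"}, "u2": {"country": "TR"}}
interaction_log = [
    {"user_id": "u1", "action": "comment"},
    {"user_id": "u2", "action": "view"},
    {"user_id": "u1", "action": "view"},
]

merged_records = merge_sources(private_data, public_data, interaction_log)
```

The merged records still contain private fields at this point, so they would have to pass through the anonymization layer before reaching any module.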

Data anonymization
The layer responsible for anonymizing the data is of vital importance in this work, because it handles data that contain personal information about the users. In addition, many of them are minors, so this process is mandatory to comply with current data protection legislation.
Some of these data are simple to handle, since fields such as name, surname or email can be removed without losing representative information. However, this is not enough to ensure that the data are anonymous, since combining the remaining data may make it possible to identify the original user [16]. Attributes of this type, which are not unique identifiers but whose values are rarely repeated in a dataset, are called quasi-identifiers.
The proposal for the anonymization of the data consists of analyzing and detailing the quasi-identifier attributes that will be present in the data, and trying to reduce them:
• In the case of the date of birth, it is proposed to transform this field into the year of birth. In this way, the number of users with a unique value for this field will be very small or zero.
• In the case of the place of residence, a similar process is planned, reducing the information to the province from which the user takes part.
In addition to these transformations, it is proposed that the results always be k-anonymous with a value of k=2. This means that no record can have a unique combination of quasi-identifier values: every combination that appears must be shared by at least 2 users. This value is intended to keep the data anonymous when they are published openly, so that other researchers can use them as a source of information in their research, as stipulated by the European Union for projects funded under the Horizon 2020 programme (https://goo.gl/b24XP9).
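The two generalizations and the k=2 requirement can be illustrated with a short Python sketch. It assumes that the quasi-identifiers are the year of birth and the province, and that records in groups smaller than k are suppressed. Suppression is an assumption made for this example only, since the text does not state how k-anonymity is enforced. The field names and sample users are hypothetical.

```python
from collections import Counter
from datetime import date
from typing import List, Tuple

QUASI_IDENTIFIERS = ("birth_year", "province")


def generalize(user: dict) -> dict:
    """Drop direct identifiers and coarsen the quasi-identifiers."""
    return {
        "birth_year": user["birth_date"].year,      # date of birth -> year of birth
        "province": user["residence"]["province"],  # residence -> province
    }


def group_key(record: dict) -> Tuple:
    return tuple(record[name] for name in QUASI_IDENTIFIERS)


def enforce_k_anonymity(records: List[dict], k: int = 2) -> List[dict]:
    """Keep only records whose quasi-identifier combination has at least k members.

    Suppressing small groups is an assumption of this sketch.
    """
    sizes = Counter(group_key(record) for record in records)
    return [record for record in records if sizes[group_key(record)] >= k]


# Hypothetical users; names are removed by generalize().
sample_users = [
    {"name": "Ana", "birth_date": date(2001, 3, 14),
     "residence": {"city": "Béjar", "province": "Salamanca"}},
    {"name": "Luis", "birth_date": date(2001, 11, 2),
     "residence": {"city": "Salamanca", "province": "Salamanca"}},
    {"name": "Eva", "birth_date": date(1999, 6, 30),
     "residence": {"city": "Jerez", "province": "Cádiz"}},
]

released = enforce_k_anonymity([generalize(u) for u in sample_users], k=2)
```

Here the third user is the only one born in 1999 in Cádiz, so that record forms a group of size 1 and is not released.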

Module for the analysis of the most frequent subjects
The analysis of the most frequent subjects is one of the questions most often raised by the researchers. Some focus only on the temporal evolution of these subjects, whereas others also consider it very important to explore how the themes are used according to the characteristics of individuals (age, gender, country, etc.). Thanks to the architecture proposed above, this module can access the platform's data in order to preprocess them. In this case, it is proposed to analyze the most frequent subjects automatically using LDA (Latent Dirichlet Allocation) [17]. One problem with this method is that it is not designed to work in multilingual systems, which is a very important issue because multilingualism is one of the characteristics of the context of use; however, some authors have proposed different methods to support it [18,19]. Another drawback of this mechanism is that it can group the words that belong to the same theme, but it cannot assign a representative name to each theme.
To design the visualization, the first thing taken into account was its main associated tasks:
• Knowing the evolution of a theme: maximum, minimum, patterns, etc.
• Being able to compare the evolution of several topics.
• Being able to know how users' attributes influence the evolution of the themes.
Considering the above, one type of chart had to be selected from among the many that exist [20]. Because of the importance of the temporal dimension, the first decision was to use a representation with a horizontal axis showing each of the time instants. It was still necessary to decide how the frequency of a subject would be encoded, for which there were several possibilities, such as line charts, area charts or histograms.
At first, a visualization based on the concept of Theme River [21] was considered. This technique has already been used effectively in other research [22,23], since it makes it easy to identify the most important trend changes. However, it has been shown to be unsuitable for detecting minor trend changes and, moreover, it does not scale to a large number of subjects. To reduce these disadvantages, this representation has been combined with another one that represents each topic individually on parallel timelines [24]. The combination takes advantage of the Theme River representation when making comparisons, and of the parallel timelines to examine the temporal evolution of a subject in greater depth and to represent a greater number of subjects.
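To make the stacked representation more concrete, the following Python sketch computes, for each time instant, the vertical band that each theme would occupy in a Theme River style layout. It assumes a baseline centred on the horizontal axis and hypothetical per-day theme frequencies. The function name and the data are illustrative only and do not describe the actual implementation.

```python
from typing import Dict, List, Tuple

Band = Tuple[float, float]


def theme_river_layout(series: Dict[str, List[float]]) -> Dict[str, List[Band]]:
    """Return, per theme, the (lower, upper) band at every time instant.

    Bands are stacked in dictionary order and centred on zero, so the
    overall silhouette is symmetric around the horizontal axis.
    """
    themes = list(series)
    steps = len(series[themes[0]])
    bands: Dict[str, List[Band]] = {theme: [] for theme in themes}
    for i in range(steps):
        total = sum(series[theme][i] for theme in themes)
        lower = -total / 2.0
        for theme in themes:
            upper = lower + series[theme][i]
            bands[theme].append((lower, upper))
            lower = upper
    return bands


# Hypothetical frequencies of two themes over three time instants.
frequencies = {
    "education": [2, 4, 6],
    "employment": [2, 0, 2],
}
layout = theme_river_layout(frequencies)
```

Plotting each theme alone on its own timeline, as in the parallel timelines mentioned above, would simply use the raw frequencies instead of the stacked bands.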
Regarding the interaction and adaptation capabilities of the chart, the following are proposed:
• Ability to select the themes to represent.
• Possibility of making comparisons, choosing the user attribute by which they are compared.
• Support for rearranging themes, as it is easier to compare those that are closest to each other.
• Ability to know the level of relevance of a topic at a specific time.
• Possibility to zoom automatically.
• Ability to restrict the selection to a time period.

Module for the detection of communities
Another of the most important aspects when exploring a community is detecting the sub-communities implicitly created by its users. A large number of techniques have been used for this purpose, such as hierarchical clustering, the detection of central nodes or centrality measures [25], but these require the execution of complex algorithms. It is therefore proposed to address this task through interactive visualization.
The main task to be addressed with this visualization is to discover how users interact, in order to infer the implicit communities that they form. Graph representation has been highlighted as the best way of visualizing this type of data [26]. This representation has two main components, the nodes or vertices and the arcs or links, which represent the users and their relations, respectively. In the visualization proposed here, the relationships refer to the number of comments that users exchange.
Regarding the visualization, it is proposed to encode each node with a size relative to the number of messages that the user has published on the platform. In addition, the length of a link should represent the proximity or distance of one node relative to another, taking into account the interactions they have had. To do this, the concept of relative distance $d_r$ between two nodes $A$ and $B$ is introduced, as a value proportional to the total number of links of both nodes divided by the number of links they share:
$$d_r(A, B) = c \cdot \frac{g(A) + g(B)}{E(A, B)}$$
where $c$ is a constant of proportionality, $g(A)$, the degree of $A$, is the number of links that have $A$ as origin or destination, and $E(A, B)$ is the number of links shared by $A$ and $B$.
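The relative distance can be computed directly from a list of links. The sketch below reads the definition as c · (g(A) + g(B)) / E(A, B), where c is a constant of proportionality whose name and default value are assumptions of this example. It also assumes that two users who share no links are infinitely far apart, a case the definition does not cover. The user names and links are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

Link = Tuple[str, str]  # (author of the comment, recipient of the comment)


def relative_distance(links: Iterable[Link], a: str, b: str, c: float = 1.0) -> float:
    """Relative distance between users a and b, given all comment links."""
    degree = Counter()
    shared = 0
    for source, target in links:
        degree[source] += 1   # link with this user as origin
        degree[target] += 1   # link with this user as destination
        if {source, target} == {a, b}:
            shared += 1
    if shared == 0:
        return float("inf")   # assumption: no shared links means maximally distant
    return c * (degree[a] + degree[b]) / shared


# Hypothetical exchange of comments between three users.
comment_links = [("ana", "ali"), ("ali", "ana"), ("ana", "eva"), ("eva", "ali")]

close_pair = relative_distance(comment_links, "ana", "ali")
far_pair = relative_distance(comment_links, "ana", "eva")
```

Under this reading, users who exchange most of their comments with each other obtain a small distance and are drawn close together, which is what allows communities to be perceived visually.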
Regarding the interaction characteristics implemented to solve the task of sub-community detection in a simple and effective way, the following have been established:
• Possibility of knowing in detail the characteristics of a user when visiting a node.
• Ability to zoom, in order to analyze the graph in greater depth and to keep the visualization useful with a high number of nodes.
• Support for selecting a set of nodes and knowing the average value of their attributes.
• Possibility to move and analyze in detail each of the communities that are formed.

Module for the exploration of users
The problem of representing the attributes of the users of a platform is quite complex, due to the large number of users and features to be shown. It is therefore necessary to use a visualization that scales on demand and is compact and easy to interpret. For this reason, parallel coordinates have been chosen [27,28], since they can represent n dimensions or attributes in a two-dimensional context. With respect to interaction characteristics, the following are proposed:
• Possibility of reordering the attributes to be visualized, in order to detect whether or not there is correlation between them.
• Ability to filter on each of the attributes, supporting multiple simultaneous filters.
• Possibility of restricting the time period to be studied.
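A parallel coordinates view can be prepared with two small operations: scaling every attribute to a common range so that each user becomes one polyline, and applying several range filters at once. The following Python sketch shows both under simplifying assumptions: attributes are numeric, and the attribute names, function names and sample users are hypothetical.

```python
from typing import Dict, List, Tuple


def to_polylines(users: List[Dict[str, float]], axes: List[str]) -> List[List[float]]:
    """Scale each attribute to [0, 1]; the order of `axes` is the order of the axes."""
    lows = {axis: min(user[axis] for user in users) for axis in axes}
    highs = {axis: max(user[axis] for user in users) for axis in axes}

    def scale(axis: str, value: float) -> float:
        span = highs[axis] - lows[axis]
        return 0.5 if span == 0 else (value - lows[axis]) / span

    return [[scale(axis, user[axis]) for axis in axes] for user in users]


def apply_filters(users: List[Dict[str, float]],
                  ranges: Dict[str, Tuple[float, float]]) -> List[Dict[str, float]]:
    """Keep the users that fall inside every selected range (multiple filtering)."""
    return [
        user for user in users
        if all(low <= user[axis] <= high for axis, (low, high) in ranges.items())
    ]


# Hypothetical users with two numeric attributes.
sample = [
    {"age": 14, "messages": 10},
    {"age": 16, "messages": 30},
    {"age": 18, "messages": 20},
]

polylines = to_polylines(sample, ["age", "messages"])
selected = apply_filters(sample, {"age": (15, 18), "messages": (0, 25)})
```

Reordering the attributes then amounts to calling to_polylines with a different order of axes.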

Module for the geographical exploration of the project
This module answers the need to know which countries are the most active and how this dimension affects the analysis of the platform's data. The visualization that best represents this concept is the map. However, there are many types of maps, which differ both in the characteristics they represent and in the projection they use. In this proposal, the Mercator projection has been chosen to represent the countries and regions of the world, because it is the most familiar. In addition, color will be used to represent the number of messages generated by the users of each territory. With respect to interaction characteristics, the following are proposed:
• Possibility to move around the map.
• Ability to have semantic zoom, so that when the zoom level is high, the map stops representing countries and shows their provinces instead.
• Support to refocus the map.
• Ability to know the exact number of messages in each country.
• Possibility to filter data by country.
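The semantic zoom and the color encoding can be sketched as a change in the aggregation level of the message counts. In the Python example below, the zoom threshold PROVINCE_ZOOM, the field names and the linear color intensity are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

PROVINCE_ZOOM = 4.0  # hypothetical zoom level at which provinces replace countries

Territory = Tuple[str, ...]


def territory_counts(messages: Iterable[dict], zoom: float) -> Dict[Territory, int]:
    """Count messages per country, or per province once the zoom is high enough."""
    counts: Counter = Counter()
    for message in messages:
        if zoom >= PROVINCE_ZOOM:
            key = (message["country"], message["province"])
        else:
            key = (message["country"],)
        counts[key] += 1
    return dict(counts)


def color_intensity(counts: Dict[Territory, int]) -> Dict[Territory, float]:
    """Map each count to [0, 1]; 1.0 is the darkest color on the map."""
    top = max(counts.values())
    return {territory: count / top for territory, count in counts.items()}


# Hypothetical messages tagged with the author's territory.
sample_messages = [
    {"country": "ES", "province": "Salamanca"},
    {"country": "ES", "province": "Cádiz"},
    {"country": "ES", "province": "Salamanca"},
    {"country": "TR", "province": "Ankara"},
]

by_country = territory_counts(sample_messages, zoom=1.0)
by_province = territory_counts(sample_messages, zoom=5.0)
```

The exact counts remain available for tooltips, while the normalized intensity drives the color of each territory.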

DATASET GENERATION
One of the problems encountered in this work has been the lack of a dataset with which to develop the architectural proposal, because the WYRED project's community did not yet have enough activity. For this reason, it was decided to generate a testing dataset as close as possible to the real datasets that the project will generate. The main approaches to doing so are the following:
• Use data from a similar community.
• Use other data sources that have common characteristics.
• Generate data artificially.
Extracting data from a community close to the one being studied is the simplest and fastest way of obtaining a dataset. This can be done using the largest social networks (Twitter, Facebook, Flickr, etc.), which have been analysed in depth by many authors who, in most cases, have made their data available to other researchers [29,30]. But this solution is not valid in all cases, because these communities are too generic and the data are usually anonymized.
Other authors [31] have proposed generating the dataset from data that are easier to obtain, such as entries in log files. The characteristics that are present both in the records and in the target dataset are kept, and those that do not appear are generated by combining others that do. This approach has the advantage that part of the data corresponds to real information, so it can be studied to find patterns and verify hypotheses, while the rest of the data can add context.
Other researchers focus on generating the dataset entirely artificially. Within this field, there are those who focus on simulating the interaction and those who, in addition, try to generate the content that would be produced. In the first case, researchers have worked on mathematically modeling the growth and evolution of the interactions in a network [32], which yields a dataset whose behavior is representative. In the second case, the authors face the high complexity involved in generating content, for example textual content, along with assigning representative attributes to each individual and their interactions. The main work that tackles this is the LDBC-SNB Data Generator [33], a program developed to generate community datasets for the LDBC (Linked Data Benchmark Council) [34]. To assign values to attributes in a logical way, the authors rely on S3G2 [35], a framework that defines the correlation that exists between certain attributes. To choose values, the software has a set of dictionaries that hold the different values each attribute can take, and the final value is selected through various functions that model the probability of an event.

Figure 3: Dependency between the attributes of a user
At first, an attempt was made to build the dataset with the LDBC-SNB Data Generator. However, this was not feasible, because it generates a dataset that is not customizable and does not contain some of the necessary attributes. It was therefore decided to build the dataset from scratch, for which the following steps were taken:
1. Analysis of the entities to be simulated.
2. Identification of their main attributes.
3. Creation of the dependency graph between them, according to the model described in S3G2 [35]. Fig. 3 shows an example of how to do this.
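The idea of generating correlated attributes from a dependency graph can be illustrated with a small Python sketch in which each attribute is drawn from a dictionary conditioned on the value of its parent attribute. The attributes, dictionaries, weights and dependencies below are hypothetical and are not those of Fig. 3 or of the generator that was actually built.

```python
import random
from typing import Dict, List

# Hypothetical dictionaries; real attributes and weights are not given in the text.
COUNTRY_WEIGHTS = {"Spain": 0.4, "Italy": 0.3, "Turkey": 0.3}
LANGUAGE_BY_COUNTRY = {
    "Spain": {"es": 0.9, "en": 0.1},
    "Italy": {"it": 0.9, "en": 0.1},
    "Turkey": {"tr": 0.9, "en": 0.1},
}
PROVINCES_BY_COUNTRY = {
    "Spain": ["Salamanca", "Cádiz"],
    "Italy": ["Roma", "Milano"],
    "Turkey": ["Ankara", "Izmir"],
}


def pick(weights: Dict[str, float], rng: random.Random) -> str:
    """Draw one dictionary value according to its probability weight."""
    values = list(weights)
    return rng.choices(values, weights=[weights[v] for v in values], k=1)[0]


def generate_user(user_id: int, rng: random.Random) -> dict:
    """Sample attributes in dependency order: parents before children."""
    country = pick(COUNTRY_WEIGHTS, rng)                   # root attribute
    language = pick(LANGUAGE_BY_COUNTRY[country], rng)     # depends on country
    province = rng.choice(PROVINCES_BY_COUNTRY[country])   # depends on country
    return {"id": user_id, "country": country,
            "language": language, "province": province}


def generate_users(count: int, seed: int = 0) -> List[dict]:
    rng = random.Random(seed)  # a fixed seed keeps the test dataset reproducible
    return [generate_user(i, rng) for i in range(count)]


users = generate_users(100)
```

Because children are sampled after their parents, every generated user is internally consistent, for example a province always belongs to the user's country.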

RESULTS
To develop the proposed architecture, it was decided to use web programming languages and technologies. This decision mainly achieves two things: it makes the developments accessible to a wider audience and gives them a greater degree of interactivity. When developing each module, one aspect took on great importance: the ability to filter the data to be studied. Controls for this purpose have been placed at the top, which helps to comply with the visual analytics mantra enunciated by Ben Shneiderman [38] and expanded by Keim et al. [39]: analyze first, show the important, zoom, filter and analyze further, details on demand.
The use of a modular architecture does not necessarily imply that each component is used separately. For this reason, the components have been combined through the linked views technique [40] to form a monitoring panel that allows all facets of the project to be explored at the same time, as shown in Fig. 4. The dashboard can be seen at the following URL: https://goo.gl/CrBnni.
To demonstrate how the proposed system could be used, a use case is described. Its research question is: what are the main communities on education and employment, and what are their characteristics? The first thing the researcher has to do is identify the themes present in the research question, in this case education and employment, and select them in the theme selector (Fig. 5). The main communities formed around these themes can then be identified in the community explorer. This can be seen in Fig. 6, where each point is a user and the users have been grouped into three communities. To explore a community, the researcher selects the users that form it by dragging the selection rectangle that is shown when clicking on the community explorer. To improve usability, the selected users keep their color while the unselected users turn brown.
If the data of the first community are analyzed (Fig. 7), it can be seen that the main users are Turkish, because this country has the darkest color in the map explorer, and that they talk more about education than employment, as the themes explorer shows. In the second community, whose data can be consulted in Fig. 8, the users are mainly from Spain; on specific days they talked a lot about employment, although their most common theme is education. Finally, the third community (Fig. 9) is made up of Italians, whose behavior is similar to that of the second community.
However, there is an unusually low number of postsecondary students, which can be seen in the user explorer because the majority of the lines that end at postsecondary are grey.
To show the interactive characteristics of the development and how a researcher can use these visualizations to solve this use case, a video has been recorded (https://goo.gl/js3hkp).

CONCLUSIONS
Due to the current lack of data generated by the WYRED project, the automatic generation of data has been analyzed and a proposal has been developed to construct a dataset as similar as possible to the real datasets of the project.
This research has also presented an architectural proposal for building a set of interactive visualizations that allow the data of the WYRED project to be explored. This modular architecture, based on the microkernel architecture, consists of 2 basic layers (data acquisition and anonymization) and 4 modules: exploration of the main themes, representation of the communities, visualization of the characteristics of the users, and geographic exploration. Therefore, it can be affirmed that this work has fulfilled the goals set out at the beginning:
• Showing how to extract knowledge through interactive visualizations of the WYRED project data.
• Keeping the information anonymous.
• Analyzing the problems of working with large and complex datasets.
• Creating a flexible architecture that fits all the requirements of WYRED and can adapt to the evolution of the WYRED ecosystem.
Regarding future lines of research, there are some aspects in which the research could be continued to enhance and expand this work:
• Conducting a usability study of the proposed system with users. The number of users selected for this could be limited to 5, according to the Nielsen study [41].
• Studying and implementing the collaborative use of the visualizations, so that different researchers can cooperate in the analysis, both synchronously and asynchronously [42].
• Addressing the integration of the proposed system with other systems, to support the research work [42].

ACKNOWLEDGMENTS
With the support of the EU Horizon 2020 Programme in its "Europe in a changing world -inclusive, innovative and reflective Societies (HORIZON