Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification

\emph{Funnelling} (Fun) is a recently proposed method for cross-lingual text classification (CLTC) based on a two-tier learning ensemble for heterogeneous transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each working on a different and language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a metaclassifier that uses this vector as its input. The metaclassifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLTC systems in which these correlations cannot be brought to bear. In this paper we describe \emph{Generalized Funnelling} (gFun), a generalization of Fun consisting of an HTL architecture in which 1st-tier components can be arbitrary \emph{view-generating functions}, i.e., language-dependent functions that each produce a language-independent representation ("view") of the (monolingual) document. We describe an instance of gFun in which the metaclassifier receives as input a vector of calibrated posterior probabilities (as in Fun) aggregated with other embedded representations that embody other types of correlations, such as word-class correlations (as encoded by \emph{Word-Class Embeddings}), word-word correlations (as encoded by \emph{Multilingual Unsupervised or Supervised Embeddings}), and word-context correlations (as encoded by \emph{multilingual BERT}). We show that this instance of \textsc{gFun} substantially improves over Fun and over state-of-the-art baselines, by reporting experimental results obtained on two large, standard datasets for multilingual multilabel text classification. Our code that implements gFun is publicly available.


INTRODUCTION
Transfer Learning (TL) [62] is a class of machine learning tasks in which, given a training set of labelled data items sampled from one or more "source" domains, we must issue predictions for unlabelled data items belonging to one or more "target" domains, related to the source domains but different from them. In other words, the goal of TL is to "transfer" (i.e., reuse) the knowledge that has been obtained from the training data in the source domains, to the target domains of interest, for which few labelled data (or no labelled data at all) exist. The rationale of TL is thus to increase the performance of a system on a downstream task (when few labelled data for this task exist), or to make it possible to carry out this task at all (when no training data at all for this task exist), while avoiding the cost of annotating new data items specific to this task.
TL techniques can be grouped into two main categories, according to the characteristics of the feature spaces in which the instances are represented. Homogeneous TL (which is often referred to as domain adaptation [69]) encompasses problems in which the source instances and the target instances are represented in a shared feature space. Conversely, heterogeneous TL [13] denotes the case in which the source data items and the target data items lie in different, generally nonoverlapping feature spaces. This article focuses on the heterogeneous case only; from now on, by HTL we will thus denote heterogeneous transfer learning.
A prominent instance of HTL in the natural language processing and text mining areas is Cross-Lingual Transfer Learning (CLTL), in which data items have a textual nature and the different domains are actually different languages in which the data items are expressed. In turn, an important instance of CLTL is the task of cross-lingual text classification (CLTC), which consists of classifying documents, each written in one of a finite set L = {λ_1, ..., λ_{|L|}} of languages, according to a shared codeframe (a.k.a. classification scheme) Y = {y_1, ..., y_{|Y|}}. The brand of CLTC we will consider in this paper is (cross-lingual) multilabel classification, namely, the case in which any document can belong to zero, one, or several classes at the same time.
The CLTC literature has focused on two main variants of this task. The first variant (sometimes called the many-shot variant) deals with the situation in which the target languages are such that language-specific training data are available for them as well; in this case, the goal of CLTC is to improve the performance of target-language classification with respect to what could be obtained by leveraging the language-specific training data alone. If these latter data are few, the task is often referred to as few-shot learning. (We will deal with the many-shot/few-shot scenario in the experiments of Section 4.4.) The second variant is usually called the zero-shot variant, and deals with the situation in which there are no training data at all for the target languages; in this case, the goal of CLTC is to allow the generation of a classifier for the target languages, which could not be obtained otherwise. (We will deal with the zero-shot scenario in the experiments of Section 4.6.) Many-shot CLTC is important, since in many multinational organisations (e.g., Vodafone, FAO, the European Union) many labelled data may be available in several languages, and there may be a legitimate desire to improve on the classification accuracy that monolingual classifiers are capable of delivering. The importance of few-shot and zero-shot CLTC instead lies in the fact that, while modern learning-based techniques for NLP and text mining have shown impressive performance when trained on huge amounts of data, there are many languages for which data are scarce. According to [29], the amount of (labelled and unlabelled) resources for the more than 7,000 languages spoken around the world follows (somewhat unsurprisingly) a power-law distribution, i.e., while a small set of languages accounts for most of the available data, a very long tail of languages suffers from data scarcity, despite the fact that languages belonging to this long tail may have large speaker bases.
Few-shot / zero-shot CLTL thus represents an appealing solution to dealing with this situation, since it attempts to bridge the gap between the high-resource languages and the low-resource ones.
However, the application of CLTC is not necessarily limited to scenarios in which the set of the source languages and the set of the target languages are disjoint, nor is it necessarily limited to cases in which there are few or no training data for the target domains. CLTC can also be deployed in scenarios where a language can play both the part of a source language (i.e., contribute to performing the task in other languages) and of a target language (i.e., benefit from training data expressed in other languages), and where sizeable quantities of labelled data exist for all languages at once. Such application scenarios, despite having attracted less research attention than their few-shot and zero-shot counterparts, are frequent in the context of multinational organisations, such as the European Union or UNESCO, of multilingual countries, such as India, South Africa, Singapore, and Canada, and of multinational companies (e.g., Amazon, Vodafone). The aim of CLTC, in these latter cases, is to effectively exploit the potential synergies among the different languages in order to allow all languages to contribute to, and to benefit from, each other. Put another way, the raison d'être of CLTC here becomes to deploy classification systems that perform substantially better than the trivial solution (the so-called naïve classifier) consisting of |L| monolingual classifiers trained independently of each other.

Funnelling and Generalized Funnelling
Esuli et al. [20] recently proposed Funnelling (Fun), an HTL method based on a two-tier classifier ensemble, and applied it to CLTC. In Fun, the 1st tier of the ensemble is composed of |L| language-specific classifiers, one for each language in L. For each document x, one of these classifiers (the one specific to the language of x) returns a vector of |Y| calibrated posterior probabilities, where Y is the codeframe. Each such vector, irrespective of which among the |L| classifiers has generated it, is then fed to a 2nd-tier "meta-classifier" which returns the final label predictions.
The |Y|-dimensional vector space to which the vectors of posterior probabilities belong thus forms an "interlingua" among the |L| languages, since all these vectors are homologous, independently of which among the |L| classifiers has generated them. Another way of saying it is that all vectors are aligned across languages, i.e., the i-th dimension of the vector space has the same meaning in every language (namely, the posterior probability that the document belongs to class y_i). During training, the meta-classifier can thus learn from all labelled documents, irrespective of their language. Given that the meta-classifier's prediction for each class in Y depends on the posterior probabilities received in input for all classes in Y, the meta-classifier can exploit class-class correlations, and this (among other things) gives Fun an edge over CLTC systems in which these correlations cannot be brought to bear.
Fun was originally conceived with the many-shot / few-shot setting in mind; in such a setting, Fun proved superior to the naïve classifier and to 6 state-of-the-art baselines [20]. Esuli et al. [20] also sketched some architectural modifications that allow Fun to be applied to the zero-shot setting too.
In this paper we describe Generalized Funnelling (gFun), a generalisation of Fun consisting of an HTL architecture in which 1st-tier components can be arbitrary view-generating functions (VGFs), i.e., language-dependent functions that each produce a language-independent representation ("view") of the (monolingual) document. We describe an instantiation of gFun in which the metaclassifier receives as input, for the same (monolingual) document, a vector of calibrated posterior probabilities (as in Fun) as well as other language-independent vectorial representations, consisting of different types of document embeddings. These additional vectors are aggregated (e.g., via concatenation) with the original vectors of posterior probabilities, and the result is a set of extended, language-aligned, heterogeneous vectors, one for each monolingual document.
The original Fun architecture is thus a particular instance of gFun, in which the 1st tier is equipped with only one VGF. The additional VGFs that characterize gFun each give the meta-classifier access to types of correlation in the data beyond the class-class correlations that the meta-classifier captures itself. In particular, we investigate the impact of word-class correlations (as embodied in Word-Class Embeddings (WCEs) [44]), word-word correlations (as embodied in Multilingual Unsupervised or Supervised Embeddings (MUSEs) [11]), and correlations between contextualized words (as embodied in embeddings generated by multilingual BERT [16]). As we will show, gFun natively caters for both the many-shot/few-shot and the zero-shot settings; we carry out extensive CLTC experiments in order to assess the performance of gFun in both cases. The results of these experiments show that mining additional types of correlations in data does make a difference, and that gFun outperforms Fun as well as other CLTC systems that have recently been proposed.
The rest of this article is structured as follows. In Section 2 we describe the gFun framework, while in Section 3 we formalize the concept of "view-generating function" and present several instances of it. Section 4 reports the experiments (for both the many-shot and the zero-shot variants) that we have performed on two large datasets for multilingual multilabel text classification. In Section 5 we move further and discuss a more advanced, "recurrent" VGF that combines MUSEs and WCEs in a more sophisticated way, and test it in additional experiments. We review related work and methods in Section 6. In Section 7 we conclude by sketching avenues for further research. Our code that implements gFun is publicly available.

GENERALIZED FUNNELLING
In this section, we first briefly summarise the original Fun method, and then move on to present gFun and related concepts.

A brief introduction to Funnelling
Funnelling, as described in [20], comes in two variants, called Fun(tat) and Fun(kfcv). We here disregard Fun(kfcv) and only use Fun(tat), since in all the experiments reported in [20] Fun(tat) clearly outperformed Fun(kfcv); see [20] if interested in a description of Fun(kfcv). For ease of notation, we will simply use Fun to refer to Fun(tat).
In Fun (see Figure 1), in order to train a classifier ensemble, 1st-tier language-specific classifiers h^1_1, ..., h^1_{|L|} (with superscript 1 indicating the 1st tier) are trained from their corresponding language-specific training sets Tr_1, ..., Tr_{|L|}. Training documents x ∈ Tr_i may be represented by means of any desired vectorial representation φ^1(x) = d, such as, e.g., TFIDF-weighted bag-of-words, or character n-grams; in principle, different styles of vectorial representation can be used for the different 1st-tier classifiers, if desired. The classifiers may be trained by any learner, provided the resulting classifier returns, for each language λ_i, document x, and class y_j, a confidence score h^1_i(d, y_j) ∈ R; in principle, different learners can be used for the different 1st-tier classifiers, if desired.
Each 1st-tier classifier h^1_i is then applied to each training document x ∈ Tr_i, thus generating a vector of |Y| confidence scores for each x ∈ Tr_i. (Incidentally, this is the phase in which Fun(tat) and Fun(kfcv) differ, since Fun(kfcv) uses instead a k-fold cross-validation process to classify the training documents.) The next step consists of computing (via a chosen probability calibration method) language- and class-specific calibration functions f_{ij} that map confidence scores h^1_i(d, y_j) into calibrated posterior probabilities Pr(y_j|d). Fun then applies f_{ij} to each confidence score and obtains, for each document, a vector

φ^2(x) = ( Pr(y_1|d), ..., Pr(y_{|Y|}|d) )     (2)

of calibrated posterior probabilities. Note that the index i for language has disappeared, since calibrated posterior probabilities are comparable across different classifiers, which means that we can use a shared, language-independent space of vectors of calibrated posterior probabilities. At this point, the 2nd-tier, language-independent "meta"-classifier h^2 can be trained from all training documents x ∈ ⋃_{i=1}^{|L|} Tr_i, where document x is represented by its φ^2(x) vector. This concludes the training phase.
In order to apply the trained ensemble to a test document x ∈ Te_i from language λ_i, Fun applies classifier h^1_i to φ^1(x) = d and converts the resulting vector of confidence scores into a vector φ^2(x) of calibrated posterior probabilities. Fun then feeds this latter to the meta-classifier h^2, which returns (in the case of multilabel classification) a vector of binary labels representing the predictions of the meta-classifier.
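The training and testing procedures just described can be sketched with scikit-learn. This is an illustrative reconstruction, not the authors' released implementation: `X` and `Y` are assumed inputs (per-language TFIDF matrices and binary document-by-class label matrices), and a sigmoid-calibrated linear SVM stands in for the 1st-tier classifiers.

```python
# Illustrative sketch of the Fun(tat) two-tier ensemble.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multiclass import OneVsRestClassifier

def train_fun(X, Y):
    first_tier, P, L = {}, [], []
    for lang in X:
        # 1st tier: one one-vs-rest SVM per language, with Platt-style
        # sigmoid calibration so that outputs are posterior probabilities
        clf = OneVsRestClassifier(CalibratedClassifierCV(LinearSVC(), cv=2))
        clf.fit(X[lang], Y[lang])
        first_tier[lang] = clf
        # Fun(tat): the training documents themselves are re-classified
        # to produce the meta-classifier's training set
        P.append(clf.predict_proba(X[lang]))
        L.append(Y[lang])
    # 2nd tier: a single meta-classifier trained on the posterior vectors
    # of all languages, which live in a shared |Y|-dimensional space
    meta = OneVsRestClassifier(LinearSVC())
    meta.fit(np.vstack(P), np.vstack(L))
    return first_tier, meta

def predict_fun(first_tier, meta, X_test, lang):
    posteriors = first_tier[lang].predict_proba(X_test)
    return meta.predict(posteriors)  # binary multilabel predictions
```

Here `cv=2` merely keeps the sketch self-contained; the actual choices of learner, weighting, and calibration are those discussed in Section 3.1.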

Introducing heterogeneous correlations through Generalized Funnelling
As explained in [20], the reasons why Fun outperforms the naïve monolingual baseline consisting of |L| independently trained, language-specific classifiers, are essentially two. The first is that Fun learns from heterogeneous data; i.e., while in the naïve monolingual baseline each classifier is trained only on |Tr_i| labelled examples, the meta-classifier in Fun is trained on all the Σ_{i=1}^{|L|} |Tr_i| labelled examples. Put another way, in Fun all training examples contribute to classifying all unlabelled examples, irrespective of the languages of the former and of the latter. The second is that the meta-classifier leverages class-class correlations, i.e., it learns to exploit the stochastic dependencies between classes typical of multiclass settings. In fact, for an unlabelled document x the meta-classifier receives |Y| inputs from the 1st-tier classifier which has classified x, and returns |Y| confidence scores, which means that the input for class y′ has a potential impact on the output for class y″, for every y′ and y″.
In Fun, the key step in allowing the meta-classifier to leverage the different language-specific training sets consists of mapping all the documents onto a space shared among all languages. This is made possible by the fact that the 1st-tier classifiers all return vectors of calibrated posterior probabilities. These vectors are homologous (since the codeframe is the same for all languages), and are also comparable (because the posterior probabilities are calibrated), which means that we can have all vectors share the same vector space irrespective of the language of provenance.
In gFun, we generalize this mapping by allowing a set Ψ of view-generating functions (VGFs) to define this shared vector space. VGFs are language-dependent functions that map (monolingual) documents into language-independent vectorial representations (that we here call views) aligned across languages. Since each view is aligned across languages, it is easy to aggregate (e.g., by concatenation) the different views of the same monolingual document into a single representation that is also aligned across languages, and which can thus be fed to the meta-classifier. Different VGFs are meant to encode different types of information, so that they can all be brought to bear on the training process. In the present paper we will experiment with extending Fun by allowing views consisting of different types of document embeddings, each capturing a different type of correlation within the data.

Fig. 1. The Fun architecture, exemplified with |L|=3 languages (Chinese, Italian, English). Note that the different term-document matrices in the 1st tier may contain different numbers of documents and/or different numbers of terms. The three grey diamonds on the left represent calibrated classifiers that map the original vectors (e.g., TFIDF vectors) into |Y|-dimensional spaces. The resulting vectors are thus aligned and can all be used for training the meta-classifier, which is represented by the grey diamond on the right.
The procedures for training and testing cross-lingual classifiers via gFun are described in Algorithm 1 and Algorithm 2, respectively. The first step of the training phase is the optimisation of the parameters (if any) of the VGFs ψ ∈ Ψ (Algorithm 1, Line 4), which is carried out independently for each language and for each VGF. A VGF produces representations that are aligned across all languages, which means that vectors coming from different languages can be "stacked" (i.e., placed in the same set) to define the view (Algorithm 1, Line 7), which constitutes one portion of the entire (now language-independent) training set of the meta-classifier. Note that the vectors in a given view need not be probabilities; we only assume that they are homologous and comparable across languages. The aggregation function (aggfunc) implements a policy for aggregating the different views for them to be input to the meta-classifier; it is thus used both during training (Algorithm 1, Line 12) and during testing (Algorithm 2, Line 3). In case the aggregation function needs to learn some parameters, these are estimated during training (Algorithm 1, Line 10).
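The training and prediction procedures of Algorithms 1 and 2 can be condensed into a short Python skeleton. This is a sketch under stated assumptions: each VGF is assumed to expose hypothetical `fit(docs, labels, lang)` and `transform(docs, lang)` methods returning language-aligned matrices, aggregation is plain concatenation, and `meta` is any scikit-learn-style multilabel classifier; none of these names are taken from the released code.

```python
import numpy as np

class GFun:
    """Skeleton of gFun: a set of VGFs feeding one meta-classifier."""

    def __init__(self, vgfs, meta):
        self.vgfs, self.meta = vgfs, meta

    def fit(self, docs, labels):  # docs, labels: dicts mapping language -> data
        views = []
        for vgf in self.vgfs:
            for lang in docs:  # VGFs are fit independently per language
                vgf.fit(docs[lang], labels[lang], lang)
            # "stack" the language-aligned vectors produced by this VGF
            views.append(np.vstack([vgf.transform(docs[l], l) for l in docs]))
        Z = np.hstack(views)  # aggregation by concatenation
        y = np.vstack([labels[l] for l in docs])
        self.meta.fit(Z, y)   # one meta-classifier for all languages
        return self

    def predict(self, docs, lang):
        Z = np.hstack([vgf.transform(docs, lang) for vgf in self.vgfs])
        return self.meta.predict(Z)
```

With the identity aggregation and a single posteriors-producing VGF, this skeleton collapses to the original Fun architecture, as noted below.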
Finally, note that both the training phase and the test phase are highly parallelisable, since the (training and/or test) data for language λ′ can be processed independently of the analogous data for language λ″, and since each view within a given language can be generated independently of the other views for the same language. Note that the original formulation of Fun (Section 2.1) thus reduces to an instance of gFun in which there is a single VGF (one that converts documents into calibrated posterior probabilities) and the aggregation function is simply the identity function. In this case, the fit of the VGF (Algorithm 1, Line 4) comes down to computing weighted (e.g., via TFIDF) vectorial representations of the training documents, training the 1st-tier classifiers, and calibrating them. Examples of the parameters obtained as a result of the fitting process include the choice of vocabulary, the IDF scores, the parameters of the separating hyperplane, and those of the calibration function. During the test phase, invoking the VGF (Algorithm 2, Line 3) amounts to computing the weighted vectorial representations and the φ^2(x) representations (Equation 2) of the test documents, using the classifiers and the calibration functions generated during the training phase.
In what follows we describe the VGFs that we have investigated in order to introduce into gFun sources of information additional to the ones that are used in Fun. In particular, we describe in detail each such VGF in Sections 3.1-3.4, we discuss aggregation policies in Section 3.5, and we analyse a few additional modifications concerning data normalisation (Section 3.6) that we have introduced into gFun and that, although subtle, bring about a substantial improvement in the effectiveness of the method.

VIEW-GENERATING FUNCTIONS
In this section we describe the VGFs that we have investigated throughout this research, by also briefly explaining related concepts and works from which they stem.
As already stated, the main idea behind our instantiation of gFun is to learn from heterogeneous information about different kinds of correlations in the data. While the main ingredients of the text classification task are words, documents, and classes, the key to approaching the CLTC setting lies in the ability to model them consistently across all languages. We envision ways for bringing to bear the following stochastic correlations among these elements:

(1) Correlations between different classes: understanding how classes are related to each other in some languages may bring about additional knowledge useful for classifying documents in other languages. These correlations are specific to the particular codeframe used, and are obviously present only in multilabel scenarios. They can be used (in our case: by the meta-classifier) in order to refine an initial classification (in our case: by the 1st-tier classifiers), since they are based on the relationships between posterior probabilities / labels assigned to documents.

(2) Correlations between different words: by virtue of the "distributional hypothesis" (see [52]), words are often modelled according to how they are distributed in corpora of text with respect to other words. Distributed representations of words encode the relationships between words and other words; when properly aligned across languages, they represent an important help for bringing lexical semantics to bear on multilingual text analysis processes, thus helping to bridge the gap among language-specific sources of labelled information.

(3) Correlations between words and classes: profiling words in terms of how they are distributed across the classes in a language is a direct way of devising cross-lingual word embeddings (since translation-equivalent words are expected to exhibit similar class-conditional distributions), which is compliant with the distributional hypothesis (since semantically similar words are expected to be distributed similarly across classes).

(4) Correlations between contextualized words: the meaning of a word occurrence depends on the specific context in which that occurrence is found. Current language models account for this fact, and generate contextualized representations of words, which can in turn be used straightforwardly to obtain contextualized representations of entire documents. Language models trained on multilingual data are known to produce distributed representations that are coherent across the languages they have been trained on.
We recall from Section 2.1 that class-class correlations are exploited in the 2nd tier of Fun. We model the other types of correlations mentioned above by means of dedicated, independently motivated, modular VGFs. Here we provide a brief overview of each of them.
• the Posteriors VGF: it maps documents into the space defined by calibrated posterior probabilities. This is, aside from the modifications discussed in Section 3.6, equivalent to the 1st tier of the original Fun, but we discuss it in detail in Section 3.1.
• the MUSEs VGF (encoding correlations between different words): it uses the (supervised version of) Multilingual Unsupervised or Supervised Embeddings (MUSEs) made available by the authors of [11]. MUSEs have been trained on Wikipedia in 30 languages and have later been aligned using bilingual dictionaries and iterative Procrustes alignment (see Section 3.2 and [11]).
• the WCEs VGF (encoding correlations between words and classes): it uses Word-Class Embeddings (WCEs) [44], a form of supervised word embeddings based on the class-conditional distributions observed in the training set (see Section 3.3).
• the BERT VGF (encoding correlations between different contextualized words): it uses the contextualized word embeddings generated by multilingual BERT [17], a deep pretrained language model based on the transformer architecture (see Section 3.4).
In the following sections we present each VGF in detail.

The Posteriors VGF
This VGF coincides with the 1st-tier of Fun, but we briefly explain it here for the sake of completeness.
Here the idea is to leverage the fact that the classification scheme is common to all languages, in order to define a vector space that is aligned across all languages. Documents, regardless of the language they are written in, can be redefined with respect to their relations to the classes in the codeframe. Using a geometric metaphor, the relation between a document and a class can be defined in terms of the distance between the document and the surface that separates the class from its complement. In other words, while the language-specific vector spaces where the original document vectors lie are not aligned (e.g., they can be characterized by different numbers of dimensions, and the dimensions for one language bear no relations to the dimensions for another language), one can profile each document via a new vector consisting of the distances to the separating surfaces relative to the various classes. By using the binary classifiers as "pivots" [1], documents end up being represented in a shared space, in which the number of dimensions is the same for all languages (since the classes are assumed to be the same for all languages), and the vector values for each dimension are comparable across languages once the distances to the classification surfaces are properly normalized (which is achieved by the calibration process).
Note that this procedure is, in principle, independent of the characteristics of any particular vector space and learning device used across languages, both of which can be different across the languages. For ease of comparability with the results reported by Esuli et al. [20], in this paper we will follow these authors and encode (for all languages in L) documents as bag-of-words vectors weighted via TFIDF, which is computed as

TFIDF(w, x_j) = #(w, x_j) · log ( |Tr_i| / #_{Tr_i}(w) )     (3)

where #_{Tr_i}(w) is the number of documents in Tr_i in which word w occurs at least once and where #(w, x_j) stands for the number of times w appears in document x_j. Weights are then normalized via cosine normalisation, as

w(w, x_j) = TFIDF(w, x_j) / sqrt( Σ_{w′ ∈ x_j} TFIDF(w′, x_j)² )     (4)

For the very same reasons we also follow [20] in adopting (for all languages in L) Support Vector Machines (SVMs) as the learning algorithm, and "Platt calibration" [50] as the probability calibration function.
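The weighting and normalisation steps just described can be sketched directly in NumPy; here `counts` is an assumed (documents × vocabulary) matrix of raw term counts #(w, x) for one language.

```python
import numpy as np

def tfidf_cosine(counts):
    """TFIDF weighting followed by cosine normalisation (one language)."""
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)             # document frequency of each word
    idf = np.log(n_docs / np.maximum(df, 1))  # log(|Tr_i| / #_Tr_i(w))
    W = counts * idf                          # tfidf(w, x) = #(w, x) * idf(w)
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, 1e-12)       # unit-length rows
```

The guards against zero document frequencies and zero norms are a sketch-level convenience; in practice the vocabulary is built from the training set, so every word has df > 0.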

The MUSEs VGF
In CLTL, the need to transfer lexical knowledge across languages has given rise to cross-lingual representations of words in a joint space of embeddings. In our research, in order to encode word-word correlations across different languages we derive document embeddings from (the supervised version of) Multilingual Unsupervised or Supervised Embeddings (MUSEs) [11]. MUSEs are word embeddings generated via a method for aligning unsupervised (originally monolingual) word embeddings in a shared vector space, similar to the method described in [39]. The alignment is obtained via a linear mapping (i.e., a rotation matrix) learned by an adversarial training process in which a generator (in charge of mapping the source embeddings onto the target space) is trained to fool a discriminator whose task is to tell the language of provenance of the embeddings, i.e., to discern whether the embeddings it receives as input originate from the target language or are instead the product of a transformation of embeddings originated from the source language. The mapping is then further refined using a technique called "Procrustes alignment". The qualification "Unsupervised or Supervised" refers to the fact that the method can operate with or without a dictionary of parallel seed words; we use the embeddings generated in supervised fashion. We use the MUSEs that Conneau et al. [11] make publicly available, and that consist of 300-dimensional multilingual word embeddings trained on Wikipedia using fastText. To date, the embeddings have been aligned for 30 languages with the aid of bilingual dictionaries.
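The Procrustes refinement step mentioned above admits a closed-form solution: given matrices S and T whose rows are the embeddings of paired seed words in the source and target languages, the orthogonal map W minimising ‖SW − T‖_F is W = UVᵀ, where UΣVᵀ is the SVD of SᵀT. A minimal NumPy sketch (not the MUSE code itself):

```python
import numpy as np

def procrustes_rotation(S, T):
    """Orthogonal Procrustes: rotation W such that S @ W best matches T."""
    U, _, Vt = np.linalg.svd(S.T @ T)
    return U @ Vt
```

Applying W to every source-language embedding maps the whole source space onto the target space while preserving distances and angles, which is what makes the aligned embeddings comparable across languages.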
Fitting the VGF for MUSEs consists of first allocating in memory the pre-trained MUSE matrices M_i ∈ R^(v_i×300) (where v_i is the vocabulary size for the i-th language), made available by Conneau et al. [11], for each language involved, and then generating document embeddings for all training documents as weighted averages of the embeddings of the words in the document. As the weighting function, we use TFIDF (Equation 3). This computation reduces to performing the projection X_i · M_i, where the matrix X_i ∈ R^(|Tr_i|×v_i) consists of the TFIDF-weighted vectors that represent the training documents (for this we can reuse the matrices X_i computed by the Posteriors VGF, since they are identical to the ones needed here). The process of generating the views of test documents via this VGF is also obtained via a projection X_i · M_i, where in this case the X_i matrix consists of the TFIDF-weighted vectors that represent the test documents.

Fig. 2. Correlation values between a word (row) and a class (column), as from the RCV1/RCV2 dataset. Yellow indicates a high value of correlation while blue indicates a low such value. Words "avvocato" and "avocat" are Italian and French translations, resp., of the English word "lawyer"; words "calcio" and "futbol" are Italian and Spanish translations, resp., of the English word "football"; the Italian word "borsa" instead means "bag".
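Both the fitting and the test-time behaviour of this VGF thus reduce to a matrix product. The sketch below uses a toy embedding matrix in place of the (v_i × 300) MUSE matrix; whether the weighted sum is further divided by the total TFIDF mass (a weighted mean) is an implementation detail we leave behind a flag rather than take from the paper.

```python
import numpy as np

def muse_view(X_tfidf, M, average=True):
    """Document embeddings as TFIDF-weighted combinations of word embeddings.

    X_tfidf: (|docs| x v_i) TFIDF matrix; M: (v_i x d) aligned embedding matrix.
    """
    E = X_tfidf @ M                      # the projection X_i . M_i
    if average:                          # turn the weighted sum into a mean
        w = np.maximum(X_tfidf.sum(axis=1, keepdims=True), 1e-12)
        E = E / w
    return E
```

Since the MUSE matrices of all languages live in the same aligned space, the resulting document embeddings are directly comparable across languages.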

The WCEs VGF
In order to encode word-class correlations we derive document embeddings from Word-Class Embeddings (WCEs) [44]. WCEs are supervised embeddings meant to extend (e.g., by concatenation) other unsupervised pre-trained word embeddings (e.g., those produced by means of word2vec, or GloVe, or any other technique) in order to inject task-specific word meaning into multiclass text classification. The WCE for word w is defined as

E(w) = R( η(w, y_1), ..., η(w, y_{|Y|}) )

where η is a real-valued function that quantifies the correlation between word w and class y as observed in the training set, and where R is any dimensionality reduction function. Here, as the function η we adopt the normalized dot product, as proposed in [44], whose computation is very efficient; as R we use the identity function, which means that our WCEs are |Y|-dimensional vectors. So far, WCEs have been tested exclusively in monolingual settings. However, WCEs are naturally aligned across languages, since WCEs have one dimension for each y ∈ Y, which is the same for all languages λ_i ∈ L. Document embeddings relying on WCEs thus display similar characteristics irrespective of the language in which the document is written. In fact, given a set of documents classified according to a common codeframe, WCEs reflect the intuition that words that are semantically similar across languages (i.e., are translations of each other) tend to exhibit similar correlations to the classes in the codeframe. This is, to the best of our knowledge, the first application of WCEs to a multilingual setting.
The intuition behind this idea is illustrated by the two examples in Figure 2, where two heatmaps display the correlation values of five WCEs each. Each heatmap illustrates the distribution patterns of four terms that are either semantically related or translation equivalents of each other (first four rows of each subfigure), and of a fifth term semantically unrelated to the previous four (last row of each subfigure). Note that not only do semantically related terms within a language obtain similar representations (as is the case of "attorney" and "lawyer" in English), but translation-equivalent terms do so as well (e.g., "avvocato" in Italian and "avocat" in French).
The VGF for WCEs is similar to that for MUSEs, but for the fact that in this case the matrix containing the word embeddings needs to be obtained from our training data, and is not pre-trained on external data. More specifically, fitting the VGF for WCEs comes down to first computing, for each language λ ∈ L, the language-specific WCE matrix W_λ according to the process outlined in [44], and then projecting the TFIDF-weighted matrix X_λ obtained from Tr_λ, as X_λ · W_λ. (Here too, we use the TFIDF variant of Equation 3.) During the testing phase, we simply perform the same projection X_λ · W_λ as above, where X_λ now represents the weighted matrix obtained from the test set.
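As a concrete (simplified) sketch, the fit and transform steps of the WCEs VGF can be written as follows. The normalized-dot-product correlation of [44] is only approximated here, and all function names are our own:

```python
import numpy as np

def fit_wce(X_tfidf, Y):
    """Compute a |V| x |Y| matrix W of word-class correlations.

    X_tfidf: (n_docs, vocab)     TFIDF training matrix for one language
    Y:       (n_docs, n_classes) binary label matrix (shared codeframe)

    Each entry approximates the correlation between a word and a class as a
    normalized dot product between the word's document profile and the
    class's document profile (a simplification of the criterion in [44]).
    """
    Xn = X_tfidf / (np.linalg.norm(X_tfidf, axis=0, keepdims=True) + 1e-9)
    Yn = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-9)
    return Xn.T @ Yn                     # W: (vocab, n_classes)

def wce_vgf_transform(X_tfidf, W):
    """View generation: project TFIDF document vectors onto WCEs (X · W).
    The resulting views have one dimension per class, and are therefore
    naturally aligned across languages."""
    return X_tfidf @ W
```

Since φ is the identity in our setting, no dimensionality reduction is applied after the projection.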
Although alternative ways of exploiting word-class correlations have been proposed in the literature, we adopt WCEs because they are simpler than the alternatives. For example, the GILE system [46] uses label descriptions in order to compute a model of compatibility between a document embedding and a label embedding; differently from that work, in our problem setting we do not assume access to textual descriptions of the semantics of the labels. The LEAM model [64], instead, defines a word-class attention mechanism and can work with or without label descriptions (though the former mode is considered preferable), but has never been tested in multilingual contexts; preliminary experiments we carried out by replacing the GloVe embeddings originally used in LEAM with MUSE embeddings did not produce competitive results.

The BERT VGF
BERT [17] is a bidirectional language model based on the transformer architecture [61], trained on a masked language modelling objective and a next-sentence prediction task. The transformer architecture was initially proposed for the task of sequence transduction, relying solely on the attention mechanism and thus discarding the recurrent components usually deployed in encoder-decoder architectures. BERT's transformer blocks contain two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise, fully connected feed-forward network. Differently from other architectures [49], BERT's attention is set to attend to all the input tokens (i.e., it deploys bidirectional self-attention), thus making it well-suited for sentence-level tasks. Originally, the BERT architecture was trained by Devlin et al. [17] on a monolingual corpus composed of the BookCorpus and English Wikipedia (for a total of roughly 3,300M words). More recently, a multilingual version, called mBERT [16], has been released. The model is no different from the standard BERT model; however, mBERT has been trained on concatenated documents gathered from the Wikipedias of 104 different languages. Its multilingual capabilities emerge from the exposure to different languages during this massive training phase.
In this research, we explore mBERT as a VGF for gFun. At training time, this VGF is first equipped with a fully-connected output layer, so that BERT can be trained end-to-end using binary cross-entropy as the loss function. Nevertheless, as its output (i.e., the one that is eventually fed to the meta-classifier also at testing time) we use the hidden state associated with the document embedding (i.e., the [CLS] token) at its last layer.

Policies for aggregating VGFs
The different views of the same document that are independently generated by the different VGFs need to be somehow merged together before being fed to the meta-classifier. This is undertaken by operators that we call aggregation functions. We explore two different policies for view aggregation: concatenation and averaging.
Concatenation simply consists of juxtaposing, for a given document, the different views of this document, thus resulting in a vector whose dimensionality is the sum of the dimensionalities of the contributing views. This policy is the more straightforward one, and one that does not impose any constraint on the dimensionality of the individual views as generated from different VGFs.
Averaging consists instead of computing, for a given document, a vector which is the average of the different views for this document. In order for this to be possible, though, this policy requires that the views (i) all have the same dimensionality, and (ii) are aligned with each other, i.e., that the i-th dimension of the vector has the same meaning in every view. This is obviously not the case with the views produced by the VGFs we have described up to now. In order to solve this problem, we learn additional mappings onto the space of class-conditional posterior probabilities, i.e., for each VGF ψ (other than the Posteriors VGF of Section 3.1, which already returns vectors of |Y| calibrated posterior probabilities) we train a classifier that maps the view of a document into a vector of |Y| calibrated posterior probabilities. The net result is that each document d is represented by n vectors of |Y| calibrated posterior probabilities (where n is the number of VGFs in our system). These n vectors can be averaged, and the resulting average vector can be fed to the meta-classifier as the only representation of document d. The way we learn the above mappings is the same used in Fun; this also brings about uniformity between the vectors of posterior probabilities generated by the Posteriors VGF and the ones generated by the other VGFs. Note that in this case, though, the classifier for VGF ψ is trained on the views produced by ψ for all training documents, irrespective of their language of provenance; in other words, for performing these mappings we just train (n − 1) (and not (n − 1) × |L|) classifiers, one for each VGF other than the Posteriors VGF.
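The two aggregation policies can be sketched as follows (hypothetical helper names; the averaging variant skips views that are unavailable for a given language):

```python
import numpy as np

def aggregate_concat(views):
    """Concatenation: juxtapose the views of one document; the resulting
    dimensionality is the sum of the dimensionalities of the views."""
    return np.hstack(views)

def aggregate_average(posterior_views):
    """Averaging: all views must first have been mapped to |Y|-dimensional
    vectors of calibrated posterior probabilities; views missing for a
    given language (None entries) are simply skipped."""
    available = [v for v in posterior_views if v is not None]
    return np.mean(available, axis=0)
```

Skipping `None` entries is what later allows gFun to operate with different numbers of VGFs for different languages.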
Each of these two aggregation policies has different pros and cons.
The main advantage of concatenation is that it is very simple to implement. However, it suffers from the fact that the number of dimensions in the resulting dense vector space is high, thus leading to a higher computational cost for the meta-classifier. Above all, since the number of dimensions that the different views contribute is not always the same, this space (and the decisions of the meta-classifier) may eventually be dominated by the VGFs characterized by the largest number of dimensions.
The averaging policy (Figure 3), on the other hand, scales well with the number of VGFs, but requires learning additional mappings aimed at homogenising the different views into a unified representation that allows averaging them. Despite the additional cost, the averaging policy has one appealing characteristic, i.e., the 1st-tier is allowed to operate with different numbers of VGFs for different languages (provided that there is at least one VGF per language); in fact, the meta-level representations are simply computed as the average of the views that are available for that particular language. For reasons that will become clear in Section 4.6, this property allows gFun to natively operate in zero-shot mode.
In Section 4.7 we briefly report on some preliminary experiments that we had carried out in order to assess the relative merits of the two aggregation policies in terms of classification performance. As we will see in Section 4.7 in more detail, the results of those experiments indicate that, while differences in performance are small, they tend to be in favour of the averaging policy. This fact, along with the fact that the averaging policy scales better with the number of VGFs, and along with the fact that it allows different numbers of VGFs for different languages, will eventually lead us to opt for averaging as our aggregation policy of choice.

Normalisation
We have found that applying some routine normalisation techniques to the vectors returned by our VGFs leads to consistent performance improvements. This normalisation consists of the following steps:
(1) Apply (only for the MUSEs VGF and the WCEs VGF) smooth inverse frequency (SIF) [3] to remove the first principal component of the document embeddings obtained as the weighted average of word embeddings. Arora et al. [3] show that removing the first principal component from a matrix of document embeddings defined as a weighted average of word embeddings is generally beneficial; the reason is that the way in which most word embeddings are trained tends to favour the accumulation of large components along semantically meaningless directions. Note, however, that for the MUSEs VGF and the WCEs VGF we use the TFIDF weighting criterion instead of the criterion proposed by Arora et al. [3], since in our case we are modelling (potentially large) documents, instead of sentences as in their case.
(2) Impose unit L2-norm on the vectors before aggregating them by means of concatenation or averaging.
(3) Standardize the columns of the language-independent representations before training the classifiers (this includes (a) the classifiers in charge of homogenising the vector spaces before applying the averaging policy, and (b) the meta-classifier).
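The three normalisation steps can be sketched as follows (a simplified NumPy rendition; in particular, the SIF correction is reduced to its first-principal-component removal step, and all function names are our own):

```python
import numpy as np

def sif_remove_first_pc(E):
    """Step (1), SIF-style correction [3]: remove from each document
    embedding its projection onto the first principal component of the
    embedding matrix E (n_docs x dim)."""
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    u = Vt[0]                            # first right singular vector
    return E - np.outer(E @ u, u)

def l2_normalize(E):
    """Step (2): impose unit L2 norm on each view before aggregation."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.maximum(norms, 1e-9)

def standardize(E, mean=None, std=None):
    """Step (3): column-wise standardization; statistics are fitted on the
    training data and reused as-is at test time."""
    if mean is None:
        mean, std = E.mean(axis=0), E.std(axis=0) + 1e-9
    return (E - mean) / std, mean, std
```

At test time, `standardize` is called with the `mean` and `std` estimated on the training representations.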
The rationale behind these normalisation steps, when dealing with heterogeneous representations, is two-fold. On one side, it is a means for equating the contributions brought to the model by the different sources of information. On the other, it is a way to counter the internal covariate shift across the different sources of information (similar intuitions are well known and routinely applied when training deep neural architectures; see, e.g., [27]). What might come as a surprise is the fact that normalisation helps improve gFun even when we equip gFun only with the Posteriors VGF (which coincides with the original Fun architecture), and that this improvement is statistically significant. We quantify this variation in performance in the experiments of Section 4.

EXPERIMENTS
In order to maximize the comparability with previous results we adopt an experimental setting identical to the one used in [20], which we briefly sketch in this section. We refer the reader to [20] for a more detailed discussion of this experimental setting.

Datasets
The first of our two datasets is a version (created by Esuli et al. [20]) of RCV1/RCV2, a corpus of news stories published by Reuters. This version of RCV1/RCV2 contains documents each written in one of 9 languages (English, Italian, Spanish, French, German, Swedish, Danish, Portuguese, and Dutch) and classified according to a set of 73 classes. The dataset consists of 10 random samples, obtained from the original RCV1/RCV2 corpus, each consisting of 1,000 training documents and 1,000 test documents for each of the 9 languages (Dutch being an exception, since only 1,794 Dutch documents are available; in this case, each sample consists of 1,000 training documents and 794 test documents). Note though that, while each random sample is balanced at the language level (same number of training documents per language and same number of test documents per language), it is not balanced at the class level: at this level the dataset RCV1/RCV2 is highly imbalanced (the number of documents per class ranges from 1 to 3,913; see Table 1), and each of the 10 random samples is too. The fact that each language is equally represented in terms of both training and test data allows the many-shot experiments to be carried out in controlled experimental conditions, i.e., minimizes the possibility that the effects observed for the different languages are the result of different amounts of training data. (Of course, zero-shot experiments will instead be run by excluding the relevant training set(s).) Both the original RCV1/RCV2 corpus and the version we use here are comparable at topic level, as news stories are not direct translations of each other but simply discuss the same or related events in different languages.
The second of our two datasets is a version (created by Esuli et al. [20]) of JRC-Acquis, a corpus of legislative texts published by the European Union. This version of JRC-Acquis contains documents each written in one of 11 languages (the same 9 languages of RCV1/RCV2 plus Finnish and Hungarian) and classified according to a set of 300 classes. The dataset is parallel, i.e., each document is included in 11 translation-equivalent versions, one per language. Similarly to the case of RCV1/RCV2 above, the dataset consists of 10 random samples, obtained from the original JRC-Acquis corpus, each consisting of at least 1,000 training documents for each of the 11 languages. In composing the training sets, [20] included at most one of the 11 language-specific versions of each document, in order to avoid the presence of translation-equivalent content in the training set; this enables one to measure the contribution of training information coming from different languages in a more realistic setting. When a document is included in a test set, instead, all its 11 language-specific versions are also included, in order to allow a perfectly fair evaluation across languages, since each of the 11 languages is thus evaluated on exactly the same content. For both datasets, the results reported in this paper (similarly to those of [20]) are averages across the 10 random selections. Summary characteristics of our two datasets are reported in Table 1; excerpts from sample documents from the two datasets are displayed in Table 2.

Evaluation measures
To assess model performance we employ F1, the standard measure of text classification, and the more recently proposed K [55]. These two functions are defined as

F1 = 2TP / (2TP + FP + FN)

K = TP / (TP + FN) + TN / (TN + FP) − 1

where TP, FP, FN, and TN represent the numbers of true positives, false positives, false negatives, and true negatives generated by a binary classifier (limit cases for degenerate contingency tables are handled as in [55]). F1 ranges between 0 (worst) and 1 (best) and is the harmonic mean of precision and recall, while K ranges between −1 (worst) and 1 (best). To turn F1 and K (whose definitions above are suitable for binary classification) into measures for multilabel classification, we compute their microaverages (F1^μ and K^μ) and their macroaverages (F1^M and K^M). We also test the statistical significance of differences in performance via paired-sample, two-tailed t-tests at the α = 0.05 and α = 0.001 significance levels.
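For concreteness, the two binary measures can be implemented directly from the contingency-table counts. The handling of the degenerate cases below follows our reading of [55] and of the standard F1 convention, and should be treated as an assumption:

```python
def f1(tp, fp, fn, tn):
    """Binary F1: harmonic mean of precision and recall; by convention 1
    when there are no positive predictions and no positive labels."""
    if tp + fp + fn == 0:
        return 1.0
    return 2 * tp / (2 * tp + fp + fn)

def k_measure(tp, fp, fn, tn):
    """K [55]: sensitivity + specificity - 1 in the general case; the
    degenerate cases (no positive, or no negative, test examples) are
    handled as we understand them from [55]. Ranges in [-1, 1]."""
    if tp + fn > 0 and tn + fp > 0:
        return tp / (tp + fn) + tn / (tn + fp) - 1
    if tp + fn == 0:                       # no positive examples
        return 2 * tn / (tn + fp) - 1
    return 2 * tp / (tp + fn) - 1          # no negative examples
```

Micro-averaging pools the four counts over all classes before applying these formulas, while macro-averaging applies them per class and averages the results.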

Learners
Wherever possible, we use the same learner as used in [20], i.e., Support Vector Machines (SVMs) as implemented in the scikit-learn package. For the 2nd-tier classifier of gFun, and for all the baseline methods, we optimize the C parameter, which trades off between training error and margin, by testing all values C = 10^i for i ∈ {−1, ..., 4} by means of 5-fold cross-validation. We use Platt calibration in order to calibrate the 1st-tier classifiers used in the Posteriors VGF and (when using averaging as the aggregation policy) the classifiers that map document views into vectors of posterior probabilities. We employ the linear kernel for the 1st-tier classifiers used in the Posteriors VGF, and the RBF kernel (i) for the classifiers used for implementing the averaging aggregation policy, and (ii) for the 2nd-tier classifier.
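In scikit-learn terms, this setup corresponds roughly to the following sketch (the actual implementation may differ in details such as multilabel handling; function names are our own):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV

def first_tier_classifier():
    """Linear-kernel SVM with Platt (sigmoid) calibration, as used in the
    Posteriors VGF to produce calibrated posterior probabilities."""
    return CalibratedClassifierCV(SVC(kernel='linear'), method='sigmoid', cv=5)

def meta_classifier():
    """RBF-kernel SVM whose C parameter (C = 10^i, i in {-1, ..., 4}) is
    selected by 5-fold cross-validation."""
    param_grid = {'C': [10.0 ** i for i in range(-1, 5)]}
    return GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
```

In the multilabel setting, one such binary classifier is trained per class.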
In order to generate the BERT VGF (see Section 3.4), we rely on the pre-trained model released by Huggingface [66]. For each run, we train the model following the settings suggested by Devlin et al. [17], i.e., we add one classification layer on top of the output of mBERT (the special token [CLS]) and fine-tune the entire model end-to-end by minimising the binary cross-entropy loss function. We use the AdamW optimizer [36] with the learning rate set to 1e-5 and the weight decay set to 0.01. We also set the learning rate to decrease by means of a scheduler (StepLR) with step size equal to 25 and gamma equal to 0.1. We set the training batch size to 4 and the maximum length of the input (in terms of tokens) to 512 (which is the maximum input length of the model). Given that the number of training examples in our datasets is comparatively smaller than that used by Devlin et al. [17], we reduce the maximum number of epochs to 50, and apply an early-stopping criterion that terminates the training after 5 epochs showing no improvement (in terms of F1) on the validation set (a held-out split containing 20% of the training documents), in order to avoid overfitting. After convergence, we perform one last training epoch on the validation set.
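The early-stopping criterion described above can be sketched as a small helper (a hypothetical class, not the actual implementation):

```python
class EarlyStopping:
    """Terminate fine-tuning after `patience` consecutive epochs with no
    improvement of the monitored validation score (F1 in our setting)."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float('-inf')
        self.stale = 0

    def step(self, val_score):
        """Record one epoch's validation score; return True to stop."""
        if val_score > self.best:
            self.best, self.stale = val_score, 0
        else:
            self.stale += 1
        return self.stale >= self.patience
```

Once `step` returns True, the best checkpoint is retained and, as described above, one final training epoch is run on the validation split.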
Each of the experiments we describe is performed 10 times, on 10 different samples extracted from the dataset, in order to assess its statistical significance by means of the paired t-test mentioned in Section 3.6. All the results displayed in the tables included in this paper are averages across these 10 samples and across the |L| languages in the datasets.
We run all the experiments on a machine equipped with a 12-core processor Intel Core i7-4930K at 3.40GHz with 32GB of RAM under Ubuntu 18.04 (LTS) and Nvidia GeForce GTX 1080 equipped with 8GB of RAM.

Baselines
As the baselines against which to compare gFun we use the naïve monolingual baseline (hereafter indicated as Naïve), Funnelling (Fun), plus the four best baselines of [20], namely, Lightweight Random Indexing (LRI [43]), Cross-Lingual Explicit Semantic Analysis (CLESA [59]), Kernel Canonical Correlation Analysis (KCCA [63]), and Distributional Correspondence Indexing (DCI [42]). For all systems but gFun, the results we report are excerpted from [20], so we refer to that paper for the detailed setups of these baselines; the comparison is fair anyway, since our experimental setup is identical to that of [20].
We also include mBERT [17] as an additional baseline. In order to generate the mBERT baseline, we follow exactly the same procedure as described above for the BERT VGF. Note that the difference between mBERT and BERT VGF comes down to the fact that the former leverages a linear transformation of the document embeddings followed by a sigmoid activation in order to compute the prediction scores. On the other hand, BERT as a VGF is used as a feature extractor (or embedder). Once the document representations are computed (by mBERT), we project them to the space of the posterior probabilities via a set of SVMs. We also experiment with an alternative training strategy in which we simply train the classification layer, and leave the pre-trained parameters of mBERT untouched, but omit the results obtained using this strategy because in preliminary experiments it proved inferior to the other strategy by a large margin.
Similarly to [20], we also report an "idealized" baseline (i.e., one whose performance all CLTC methods should strive to reach), called UpperBound, which consists of replacing each non-English training example by its corresponding English version, training a monolingual English classifier, and classifying all the English test documents. UpperBound is present only in the JRC-Acquis experiments, since in RCV1/RCV2 the English versions of the non-English training examples are not available.

Results of many-shot CLTC experiments
In this section we report the results that we have obtained in our many-shot CLTC experiments on the RCV1/RCV2 and JRC-Acquis datasets. These experiments are run in "everybody-helps-everybody" mode, i.e., all training data, from all languages, contribute to the classification of all unlabelled data, from all languages. We denote different setups of gFun by indicating after the hyphen the VGFs that the variant uses: -X denotes the Posteriors VGF, so that gFun-X is equivalent to the original Fun architecture, but with the addition of the normalisation steps discussed in Section 3.6; analogously, -M denotes the MUSEs VGF (Section 3.2), -W the WCEs VGF (Section 3.3), and -B the BERT VGF (Section 3.4). Tables 3 and 4 report the results obtained on RCV1/RCV2 and JRC-Acquis, respectively. For each dataset we report the results for 7 different baselines and 9 different configurations of gFun, as well as for two distinct evaluation measures (F1 and K), aggregated across the |Y| different classes by both micro- and macro-averaging.
The results are grouped in four batches of methods. The first one contains all baseline methods. The remaining batches present results obtained using a selection of meaningful combinations of VGFs: the 2nd batch reports the results obtained by gFun when equipped with one single VGF, the 3rd batch reports ablation results, i.e., results obtained by removing one VGF from a setting containing all VGFs, while in the last batch we report the results obtained by jointly using all the VGFs discussed.
Something that jumps to the eye is that gFun-X yields better results than Fun, from which it differs only in the normalisation steps of Section 3.6. This is a clear indication that these normalisation steps are indeed beneficial.
Combinations relying on WCEs seem to perform comparatively better on JRC-Acquis, and worse on RCV1/RCV2. This can be ascribed to the fact that the amount of information brought about by word-class correlations is higher in the case of JRC-Acquis (since this dataset contains no fewer than 300 classes) than in RCV1/RCV2 (which only contains 73 classes). Notwithstanding this, the WCEs VGF seems to be the weakest among the VGFs that we have tested. Conversely, the strongest VGF seems to be the one based on mBERT, though it is also clear from the results that the other VGFs contribute to further improving the performance of gFun; in particular, the combination gFun-XMB stands out as the top performer overall, since it is always either the best-performing model or a model not different from the best performer in a statistically significant sense.
Upon closer examination of Tables 3 and 4, the 2nd, 3rd, and 4th batches help us in highlighting the contribution of each signal (i.e., information brought about by the VGFs).
Let us start from the 4th batch, where we report the results obtained by the configuration of gFun that exploits all of the available signals (gFun-XWMB). On RCV1/RCV2 this configuration yields results superior to the single-VGF settings (note that, even though the results for gFun-B (.608) are higher than those for gFun-XWMB (.596), this difference is not statistically significant, with a p-value of .680 according to the two-tailed t-test that we have run). This result indicates that there is indeed a synergy among the heterogeneous representations.
In the 3rd batch, we investigate whether all of the signals are mutually beneficial or whether there is some redundancy among them, by removing from the "full stack" (gFun-XWMB) one VGF at a time. The removal of the BERT VGF has the worst impact on F1; this was expected since, in the single-VGF experiments, gFun-B was the top-performing setup. Analogously, removing the representations generated by the Posteriors VGF or those generated by the MUSEs VGF causes a smaller decrease in F1. On the contrary, ditching the WCEs results in a higher F1 score (our top-scoring configuration); the difference between gFun-XWMB and gFun-XMB is not statistically significant on RCV1/RCV2 (with a p-value between 0.001 and 0.05), but it is significant on JRC-Acquis. This is an interesting fact: despite the fact that in the single-VGF setting the WCEs VGF is the worst-performing one, we were not expecting its removal to be beneficial. Such a behaviour suggests that the WCEs are not well-aligned with the other representations, resulting in worse performance across all four metrics. This is also evident if we look at the results reported in [47]: if we remove from gFun-XMW (.558) the Posteriors VGF, thus obtaining gFun-MW, we obtain an F1 score of .536; by removing the MUSEs VGF, thus obtaining gFun-XW, we lower F1 to .523; instead, by discarding the WCEs VGF, thus obtaining gFun-XM, we increase F1 to .575. This behaviour tells us that the information encoded in the Posteriors and WCEs representations is diverging: in other words, it does not help in building more easily separable document embeddings. Results on JRC-Acquis follow the same pattern.
In Figure 4 we show a more in-depth analysis of the results, in which we compare, for each language, the relative improvements obtained in terms of F1 (the other evaluation measures show similar patterns) by mBERT (the top-performing baseline) and by a selection of gFun configurations, with respect to the Naïve solution. These results confirm that the improvements brought about by gFun-X with respect to Fun are consistent across all languages, and not only as an average across them, for both datasets. The only configurations that underperform some monolingual naïve solutions (i.e., that have a negative relative improvement) are gFun-M (for Dutch) and gFun-W (for Dutch and Portuguese) on RCV1/RCV2. These are also the only configurations that sometimes fare worse than the original Fun.
The configurations gFun-B, gFun-XMB, and gFun-XWMB all perform better than the mBERT baseline on almost all languages and on both datasets (the only exception being Portuguese when using gFun-XWMB on RCV1/RCV2), with the improvements with respect to mBERT being markedly higher on JRC-Acquis. Again, we note that, despite the clear evidence that the VGF based on mBERT brings the highest improvements overall, all the other VGFs do contribute to improving the classification performance; the histograms of Figure 4 now reveal that these contributions are consistent across all languages. For example, gFun-XMB outperforms gFun-B for six out of nine languages on RCV1/RCV2, and for all eleven languages on JRC-Acquis.
As a final remark, we should note that the document representations generated by the different VGFs are certainly not entirely independent (although their degree of mutual dependence would be hard to measure precisely), since they are all based on the distributional hypothesis, i.e., on the notion that systematic co-occurrence (of words and other words, of words and classes, of classes and other classes, etc.) is evidence of correlation. However, in data science, mutual independence is not a necessary condition for usefulness; we all know this, e.g., from the fact that the "bag of words" model of representing text works well despite the fact that it makes use of thousands of features that are not independent of each other. Our results show that, in the best-performing setups of gFun, several such VGFs coexist despite the fact that they are probably not mutually independent, which seems to indicate that the lack of independence of these VGFs is not an obstacle.

Results of zero-shot CLTC experiments
Fun was not originally designed for dealing with zero-shot scenarios since, in the absence of training documents for a given language, the corresponding first-tier language-dependent classifier cannot be trained. Nevertheless, Esuli et al. [20] managed to perform zero-shot cross-lingual experiments by plugging in an auxiliary classifier trained on MUSEs representations that is invoked for any target language for which training data are not available, provided that this language is among the 30 languages covered by MUSEs.
Instead, gFun caters for zero-shot cross-lingual classification natively, provided that at least one among the VGFs it uses is able to generate representations for the target language with no training data (for the VGFs described in this paper, this is the case of the MUSEs VGF and the mBERT VGF, for all the languages they cover). To see why, consider the gFun-XWMB instance of gFun using the averaging procedure for aggregation (Section 3.5), and assume that there are training documents for English but no training data for Danish. We train the system in the usual way (Section 2). For a Danish test document, the MUSEs VGF and the mBERT VGF contribute to its representation, since Danish is one of the languages covered by MUSEs and mBERT. The aggregation function averages across all four VGFs (-XWMB) for English test documents, while it only averages across two VGFs (-MB) for Danish test documents. Note that the meta-classifier does not perceive differences between English test documents and Danish test documents since, in both cases, the representations it receives from the first tier come down to averages of calibrated (and normalized) posterior probabilities. Therefore, any language for which there are no training examples can be dealt with by our instantiation of gFun, provided that this language is catered for by MUSEs and/or mBERT.
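A toy sketch of this mechanism (with made-up VGF coverage and posterior vectors) is the following:

```python
import numpy as np

def metalevel_representation(doc, vgfs):
    """Average the posterior views of the VGFs that cover doc's language.
    `vgfs` is a list of (covered_languages, view_fn) pairs, a toy stand-in
    for the real VGF objects."""
    views = [fn(doc) for langs, fn in vgfs if doc['lang'] in langs]
    assert views, 'at least one VGF must cover the language'
    return np.mean(views, axis=0)

# Toy VGFs over a 3-class codeframe: the MUSEs and mBERT VGFs cover Danish
# ('da') with no Danish training data; the Posteriors and WCEs VGFs only
# cover the training language ('en'). All vectors are invented.
vgfs = [({'en'},       lambda d: np.array([0.9, 0.1, 0.0])),   # Posteriors
        ({'en'},       lambda d: np.array([0.8, 0.2, 0.0])),   # WCEs
        ({'en', 'da'}, lambda d: np.array([0.6, 0.4, 0.0])),   # MUSEs
        ({'en', 'da'}, lambda d: np.array([0.4, 0.6, 0.0]))]   # mBERT

print(metalevel_representation({'lang': 'da'}, vgfs))  # averages 2 views
print(metalevel_representation({'lang': 'en'}, vgfs))  # averages 4 views
```

In both cases the meta-classifier receives a |Y|-dimensional average of posterior vectors, which is why it is oblivious to whether the language was seen at training time.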
To obtain results directly comparable with the zero-shot setup employed by Esuli et al. [20], we reproduce their experimental setup. That is, we run experiments in which we start with one single source language (i.e., a language endowed with its own training data), and we add new source languages iteratively, one at a time (in alphabetical order), until all the languages of the given dataset are covered. At each iteration, we train gFun on the available source languages, and test on all the target languages. At the i-th iteration we thus have i source languages and |L| target (test) languages, among which i languages have their own training examples and the remaining (|L| − i) languages do not. For this experiment we choose the configuration involving all the VGFs (gFun-XWMB).
The results are reported in Figure 5 and Figure 6, where we compare the results obtained by Fun and gFun-XWMB on both datasets, for all our evaluation measures. Results are presented in a grid of three columns, in which the first one corresponds to the results of Fun as reported in [20], the second one corresponds to the results obtained by gFun-XWMB, and the third one corresponds to the difference between the two, in terms of absolute improvement of gFun-XWMB w.r.t. Fun. The results are arranged in four rows, one for each evaluation measure. Performance scores are displayed through heat-maps, in which columns represent target languages, and rows represent training iterations (with incrementally added source languages). Colour coding helps interpret and compare the results: we use red for indicating low values of accuracy and green for indicating high values of accuracy (according to the evaluation measure used) for the first and second columns; the third column (absolute improvement) uses a different colour map, ranging from dark blue (low improvement) to light green (high improvement). The tone intensities of the Fun and gFun colour maps for the different evaluation measures are independent of each other, so that the darkest red (resp., the lightest green) always indicates the worst (resp., the best) result obtained by any of the two systems for the specific evaluation measure.
Note that the lower triangular matrix within each heat map reports results for standard (many-shot) cross-lingual experiments, while all entries above the main diagonal report results for zero-shot cross-lingual experiments. As was to be expected, results for many-shot experiments tend to display higher figures (i.e., greener cells), while results for zero-shot experiments generally display lower figures (i.e., redder cells). These figures clearly show the superiority of gFun over Fun, and especially so in the zero-shot setting, for which the magnitude of improvement is decidedly higher: the absolute improvement ranges from 18% to 28% on RCV1/RCV2, and from 35% to 44% on JRC-Acquis, depending on the evaluation measure. In both datasets, the addition of new languages to the training set tends to help gFun improve the classification of test documents also for languages for which a training set was already available anyway. This is witnessed by the fact that the green tonality of the columns in the lower triangular matrix becomes gradually darker; for example, on JRC-Acquis, the classification of test documents in Danish evolves stepwise from 0.52 (when the training set consists only of Danish documents) to 0.62 (when all languages are present in the training set). A direct comparison between the old and new variants of funnelling is conveniently summarized in Figure 7, where we display the average accuracy values (in terms of our four evaluation measures) obtained by each method across all experiments of the same type, i.e., standard cross-lingual (CLTC: values from the lower triangular matrices of Figures 5 and 6) or zero-shot cross-lingual (ZSCLC: values from the upper triangular matrices), as a function of the number of training languages, for both datasets. These histograms reveal that gFun improves over Fun in the zero-shot experiments.
Interestingly enough, the addition of languages to the training set seems to have a positive impact in gFun, both for the zero-shot and for the cross-lingual experiments.

¹³ That the addition of new languages to the training set helps improve the classification of test documents for other languages for which a training set was already available is true also of Fun. However, this does not emerge from Figure 5 and Figure 6 (which are taken from [20]). This has already been noticed by Esuli et al. [20], who argue that this happens only in the zero-shot version of Fun, and is due to the zero-shot classifier's failure to deliver well-calibrated probabilities.

Testing different aggregation policies
In this brief section we summarize the results of extensive preliminary experiments in which we compared the performance of different aggregation policies (concatenation vs. averaging); here we report only the results for the gFun-XM and gFun-XMW models (the complete set of experiments is described in [47]). Table 5 reports the results we obtained for RCV1/RCV2 and JRC-Acquis. The results conclusively indicate that the averaging aggregation policy yields either the best results, or results that are not different (in a statistically significant sense) from the best ones, in all cases. This, along with other motivations discussed in Section 3.5 (scalability, and the fact that it enables zero-shot classification), makes us lean towards adopting averaging as the default aggregation policy.
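To make the comparison concrete, the two aggregation policies can be sketched as follows (a minimal NumPy illustration; function and variable names are ours, and we assume for simplicity that all views have already been mapped to a common dimensionality, which averaging requires while concatenation does not):

```python
import numpy as np

def aggregate(views, policy="averaging"):
    """Aggregate the language-independent views produced by the VGFs
    for one document.

    - "concatenation" stacks the views into one long vector, whose
      dimensionality grows with the number of VGFs;
    - "averaging" returns their component-wise mean, whose dimensionality
      stays fixed regardless of how many views are available, which is
      what makes zero-shot classification possible when some views are
      missing for a language.
    """
    views = [np.asarray(v, dtype=float) for v in views]
    if policy == "concatenation":
        return np.concatenate(views)
    if policy == "averaging":
        return np.mean(views, axis=0)
    raise ValueError(f"unknown policy: {policy}")
```

Note how averaging keeps the metaclassifier's input space independent of the number of contributing VGFs, which is the scalability argument mentioned above.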
Incidentally, Table 5 also seems to indicate that WCEs work better in JRC-Acquis than in RCV1/RCV2. This is likely due to the fact that, as observed in [44], the benefit brought about by WCEs tends to be more substantial when the number of classes is higher, since a higher number of classes means that WCEs have a higher dimensionality, and that they thus bring more information to the process.

Learning-Curve Experiments
In this section we report the results obtained in additional experiments aiming to quantify the impact on accuracy of variable amounts of target-language training documents. Given the supplementary nature of these experiments, we limit them to the RCV1/RCV2 dataset. Furthermore, for computational reasons we carry out these experiments only on a subset of the original languages (namely, English, German, French, and Italian). In Figure 8 we report the results, in terms of F1, obtained on RCV1/RCV2. For each of the 4 languages we work on, we assess the performance of gFun-XMB by varying the amount of target-language training documents; we carry out experiments with 0%, 10%, 20%, 30%, 50%, and 100% of the training documents. For example, the experiments on French (Figure 8, bottom left) are run by testing on 100% of the French test data a classifier trained with 100% of the English, German, and Italian training data and with variable proportions of the French training data. We compare the results with those obtained (using the same experimental setup) by the Naïve approach (see Sections 1 and 4.1) and by Fun [20].
It is immediate to note from the plots that the two baseline systems have a very low performance when there are few target-language training examples, but this is not true for gFun-XMB, which has a very respectable performance even with 0% target-language training examples; indeed, gFun-XMB is able to almost bridge the gap between the zero-shot and many-shot settings, i.e., for gFun-XMB the difference between the F1 values obtained with 0% and with 100% target-language training examples is moderate. On the contrary, for the two baseline systems considered, the inclusion of additional target-language training examples results in a substantial increase in performance; however, both baselines substantially underperform gFun-XMB, for any percentage of target-language training examples, and for each of the 4 target languages.

Figure 8. F1 results obtained by varying the percentage of target-language training examples (i.e., 0%, 10%, 20%, 30%, 50%, 100%) for four languages (i.e., German, English, French, and Italian). The configuration of gFun deployed is gFun-XMB. We compare the performance of gFun-XMB with that displayed by Fun [20] and by the Naïve approach.

LEARNING ALTERNATIVE COMPOSITION FUNCTIONS: THE RECURRENT VGF
The embeddings-based VGFs that we have described in Sections 3.2 and 3.3 implement a simple dot product as a means for deriving document embeddings from the word embeddings and the TFIDF-weighted document vector. However, while such an approach is known to produce document representations that perform reasonably well on short texts [14], there is also evidence that more powerful models are needed for learning more complex "composition functions" for texts [12,58]. In NLP and related disciplines, composition functions are defined as functions that take as input the constituents of a sentence (sometimes already converted into distributed dense representations), and output a single vectorial representation capturing the overall semantics of the given sentence. In this section, we explore alternatives to the dot product for the VGFs based on MUSEs and WCE.
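As a reference point for what the Recurrent VGF is meant to replace, the dot-product composition used by the embeddings-based VGFs can be sketched as follows (a NumPy sketch under our own naming; the actual implementation may additionally normalize the resulting vectors):

```python
import numpy as np

def compose_by_dot_product(tfidf_doc_vectors, word_embedding_matrix):
    """Derive document embeddings as the dot product between the
    TFIDF-weighted document vectors (n_docs x |V|) and a word-embedding
    matrix (|V| x dim): each document embedding is thus the TFIDF-weighted
    sum of the embeddings of the words the document contains.
    """
    X = np.asarray(tfidf_doc_vectors, dtype=float)      # (n_docs, |V|)
    E = np.asarray(word_embedding_matrix, dtype=float)  # (|V|, dim)
    return X @ E                                        # (n_docs, dim)
```

This weighted-sum composition ignores word order entirely, which is precisely the limitation that motivates the recurrent alternative explored in this section.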
For this experiment, we rely on recurrent neural networks (RNNs) for generating document embeddings. In particular, we adopt the gated recurrent unit (GRU) [10], a lightweight variant of the long short-term memory (LSTM) unit [26], as our recurrent cell. GRUs have fewer parameters than LSTMs and do not learn a separate output function (such as the output gate in LSTMs), and are thus more efficient during training. (In preliminary experiments we have carried out, we have found no significant differences in performance between GRUs and LSTMs; the former are much faster to train, though.) This gives rise to what we call the Recurrent VGF.
In the Recurrent VGF we thus infer the composition function at VGF fitting time. During the training phase, we train an RNN to generate good document representations from a set of language-aligned word representations consisting of the concatenation of WCEs and MUSEs. This VGF is trained in an end-to-end fashion. The output representations of the training documents generated by the GRU are projected onto a |Y|-dimensional space of label predictions; the network is trained by minimising the binary cross-entropy loss between the predictions and the true labels. We explore different variants depending on how the parameters of the embedding layer are initialized (see below). We do not freeze the parameters of the embedding layers, so as to allow the optimisation procedure to fine-tune the embeddings. We use the Adam optimizer [32] with the initial learning rate set to 1e-3 and no weight decay. We halve the learning rate every 25 epochs by means of StepLR (gamma = 0.5, step size = 25). We set the training batch size to 256 and compute the maximum length of the documents dynamically at each batch by taking their average length. Documents exceeding the computed length are truncated, whereas shorter ones are padded. Finally, we train the model for a maximum of 250 epochs, with an early-stopping criterion that terminates the training after 25 epochs with no improvement in the validation F1.
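The per-batch dynamic length computation described above can be sketched as follows (a minimal illustration; the padding id and the rounding rule are our assumptions, not details taken from the implementation):

```python
import numpy as np

PAD = 0  # id of the padding token (an assumption of this sketch)

def pad_batch(batch_token_ids):
    """Compute the maximum document length for a batch dynamically, as the
    average length of the documents in the batch: longer documents are
    truncated to that length, shorter ones are padded. Returns a
    (batch_size x max_len) integer matrix of token ids.
    """
    lengths = [len(doc) for doc in batch_token_ids]
    max_len = max(1, int(round(sum(lengths) / len(lengths))))
    out = np.full((len(batch_token_ids), max_len), PAD, dtype=np.int64)
    for i, doc in enumerate(batch_token_ids):
        trunc = doc[:max_len]
        out[i, :len(trunc)] = trunc
    return out
```

Computing the length per batch (rather than globally) keeps the recurrent unrolling short for batches of short documents, which speeds up training.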
There is only one Recurrent VGF in the entire gFun architecture, which processes all documents, independently of the language they belong to. Once trained, the last linear layer is discarded. All training documents are then passed through the GRU and converted into document embeddings, which are eventually used to train a calibrated classifier that returns posterior probabilities for each class in the codeframe.
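The calibration step can be illustrated with a minimal Platt-scaling-style sketch, fitting a sigmoid that maps raw classifier scores to posterior probabilities (this is a stand-in for the library-based calibrated classifiers actually used in gFun; the function name and hyperparameters are arbitrary):

```python
import numpy as np

def platt_calibrate(scores, labels, lr=0.1, epochs=2000):
    """Fit a sigmoid sigma(a*s + b) mapping raw classifier scores s to
    calibrated posterior probabilities, by minimizing the binary
    cross-entropy with plain gradient descent. Returns the fitted
    score-to-posterior function.
    """
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    a, b = 1.0, 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))
        grad = p - y                      # dBCE/dlogit for each example
        a -= lr * np.mean(grad * s)
        b -= lr * np.mean(grad)
    return lambda x: 1.0 / (1.0 + np.exp(-(a * np.asarray(x, float) + b)))
```

Calibrated posteriors are what allows the metaclassifier to meaningfully compare, and aggregate, the outputs coming from classifiers trained on different languages.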

Experiments
We perform many-shot CLTC experiments using the Recurrent VGF trained on MUSEs only (a configuration we denote by gFun-R_M), or trained on the concatenation of MUSEs and WCEs (denoted by gFun-R_MW). We do not explore the case in which the GRU is trained exclusively on WCEs since, as explained in [44], WCEs are meant to be concatenated to general-purpose word embeddings. Similarly, we avoid exploring combinations of VGFs based on redundant sources of information; e.g., we do not attempt to combine the MUSEs VGF with the Recurrent VGF, since the latter already makes use of MUSEs.
Tables 6 and 7 report on the experiments we have carried out using the Recurrent VGF, in terms of all our evaluation measures, for RCV1/RCV2 and JRC-Acquis, respectively. These results indicate that the Recurrent VGF under-performs the dot product criterion (this can be easily seen by comparing each result with its counterpart in Tables 3 and 4). A possible reason for this might be the fact that the amount of training documents available in our experimental setting is insufficient for learning a meaningful composition function. A further possible reason might be the fact that, in classification by topic, the mere presence or absence of certain predictive words captures most of the information useful for determining the correct class labels, while the information conveyed by word order is less useful, or too difficult to capture. In future work it might thus be interesting to test the Recurrent VGF on tasks other than classification by topic.
Another aspect that jumps to the eye is that the relative improvements brought about by the addition of WCEs tend to be larger in JRC-Acquis than in RCV1/RCV2 (in which the presence of WCEs is sometimes detrimental). This is likely due to the fact that JRC-Acquis has more classes, something that ends up enriching the representations of WCEs. Somewhat surprisingly, though, the best configuration is one not equipped with WCEs (and this happens also for JRC-Acquis). This might be due to a redundancy of the information captured by WCEs with respect to the information already captured in the other views. In the future, it might be interesting to devise ways for distilling the novel information that a VGF could contribute to the already existing views, and discarding the rest during the aggregation phase.

Table 7. As Table 6, but using JRC-Acquis instead of RCV1/RCV2.

RELATED WORK
The first published paper on CLTC is [6]; in this work, as well as in [22], the task is tackled by means of a bag-of-words representation approach, whereby the texts are represented as standard vectors of length |V|, with V being the union of the vocabularies of the different languages. Transfer is thus achieved only thanks to features shared across languages, such as proper names. Years later, the field started to focus on methods originating from distributional semantic models (DSMs) [34,52,54]. These models are based on the so-called "distributional hypothesis", which states that similarity in meaning results in similarity of linguistic distribution [25]. Originally, these models [18,41] made use of latent semantic analysis (LSA) [15], which factors a term co-occurrence matrix by means of low-rank approximation techniques such as SVD, resulting in a matrix of principal components in which each dimension is linearly independent of the others. The first examples of cross-lingual representations were proposed during the '90s. Many of these early works relied on abstract linguistic labels, such as those from discourse representation theory (DRT) [30], instead of on purely lexical features [2,53]. Early approaches were based on the construction of high-dimensional context-counting vectors, where each dimension represented the degree of co-occurrence of the word with a specific word in one of the languages of interest. However, these original implementations of DSMs required the explicit computation of the term co-occurrence matrix, which makes these approaches unfeasible for large amounts of data.
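The LSA step mentioned above can be sketched in a few lines (a textbook illustration of the technique, not the implementation of [15,18,41]):

```python
import numpy as np

def lsa(cooccurrence_matrix, k):
    """Latent Semantic Analysis as low-rank approximation: factor the term
    co-occurrence (or term-document) matrix via SVD and keep only the top-k
    singular directions, yielding k linearly independent latent dimensions
    for each term.
    """
    M = np.asarray(cooccurrence_matrix, dtype=float)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * S[:k]  # k-dimensional term representations
```

The cost of building (and factoring) the full co-occurrence matrix is exactly what made these early DSMs hard to scale, as noted above.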
A seminal work is that of Mikolov et al. [39], who first noticed that continuous word embedding spaces exhibit similar topologies across different languages, and proposed to exploit this similarity by learning a linear mapping from a source to a target embedding space, exploiting a parallel vocabulary for providing anchor points for learning the mapping. This has spawned several studies on cross-lingual word embeddings [4,21,67]; however, all these methods relied on external manually generated resources (e.g., multilingual seed dictionaries, parallel corpora, etc.). This is a severe limitation, since the quality of the resulting word embeddings (and the very possibility to generate them) relies on the availability, and the quality, of these external resources [35].
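The recipe of Mikolov et al. [39], together with the orthogonal (Procrustes) constraint adopted in later refinements, can be sketched as follows (our own naming; real systems add normalization and iterative refinement steps):

```python
import numpy as np

def learn_mapping(X_src, Y_tgt, orthogonal=True):
    """Learn a linear map W such that X_src @ W ~ Y_tgt, where the rows of
    X_src and Y_tgt are the embeddings of translation pairs from a seed
    dictionary (the "anchor points" in Mikolov et al.'s approach).

    With orthogonal=True the map is constrained to be a rotation (the
    closed-form Procrustes solution, via SVD); otherwise it is the
    unconstrained least-squares solution of the original proposal.
    """
    X = np.asarray(X_src, dtype=float)
    Y = np.asarray(Y_tgt, dtype=float)
    if orthogonal:
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W
```

Once W is learned, any source-language embedding can be projected into the target space, which is the basis of the mapping-based cross-lingual embeddings discussed in this section.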
Machine Translation (MT) represents a natural direct solution to CLTC tasks. Unfortunately, when it comes to low-resource languages, MT systems may be either not available or not sufficiently effective. Nevertheless, the MT-based approach will presumably become more and more viable as the field of MT progresses: recently, Isbister et al. [28] have shown evidence that relying on MT in order to translate documents from low-resource languages to higher-resource languages (e.g., English) for which state-of-the-art models are available, is indeed preferable to multilingual solutions.
Pre-trained word-embeddings [7,40,48] have been a major breakthrough for NLP and have become a key component of most natural language understanding architectures. As of today, many methods developed for CLTC rely on pre-trained cross-lingual word embeddings [5,11,39,56] (for a more in-depth review on the subject see [51]). These embeddings strive to map representations from one language to the other via different techniques (e.g., Procrustes alignment), thus representing different languages in different, but aligned, vector spaces. For example, [8,68] exploit aligned word embeddings in order to successfully transfer knowledge from one language to another. The approach proposed in [8] is a hybrid parameter-based / feature-based method to CLTC, in which a set of convolutional neural networks is trained on both source and target texts, encoded via aligned word representations (namely, MUSEs [11]) while sharing kernel parameters to better identify the common features across different languages. Furthermore, the authors insert in the loss function a regularisation term based on maximum mean discrepancy [23] in order to encourage representations that are domain-invariant.
Standard word embeddings have recently been called static (or global) representations. This is because they do not take into account the context of usage of a word, thus allowing only a single context-independent representation for each word; in other words, the different meanings of polysemous words are collapsed into a single representation. By contrast, contextual word embeddings [17,37,38,49] associate each word occurrence with a representation that is a function of the entire sequence in which the word appears. Before being processed by the "contextualising" function, tokens are mapped to primary static word representations; the contextualising function itself is typically implemented by a language model based on a transformer architecture previously trained on large quantities of textual data. This has yielded a shift in the way we operate with embedded representations, from a setting in which pre-trained embeddings were used to initialize the embedding layer of a deep architecture that was later fully trained, to one in which the representation of words, phrases, and documents is carried out by the transformer; what is left for training entails nothing more than learning a prediction layer, and possibly fine-tuning the transformer for the task at hand.
Such a paradigm shift has fuelled the appearance of models developed (or adapted) to deal with multilingual scenarios. Current multilingual models are large architectures directly trained on several languages at once, i.e., are models in which multilingualism is imposed by constraining all languages to share the same model parameters [17,19,33]. Given their extensive multilingual pre-training, such models are almost ubiquitous components of CLTC solutions.
For example, Zhang et al. [68] rely on pre-trained multilingual BERT in order to extract word representations aligned between the source and the target language. In a multitask-learning fashion, two identical-output (linear) classifiers are set up: the first is optimized on the source language via the cross-entropy loss, while the second (i.e., the auxiliary classifier) is instead set to maximize the margin disparity discrepancy [70]. This is achieved by driving the auxiliary classifier to maximally differ (in terms of predictions) from the main classifier when applied to the target language, while returning similar predictions on the source language.
Guo et al. [24] tackle mono-lingual TC by exploiting multilingual data. They do so by using a contrastive learning loss as applied to Chinese BERT, a pre-trained (monolingual) language model. Then a unified model, which is composed of two trainable pooling layers and two auto-encoders, is trained on the union of the training data coming from both the source and the target languages. It is important to note that such a parameter-based approach requires parallel training data in order to successfully train the auto-encoders (i.e., so that they are able to create representations shared between the source and the target languages).
Karamanolakis et al. [31] propose a parameter-based approach. They first train a classifier on the source language, and then leverage the learned parameters of a set of "seed" words (i.e., the words that can be translated to the target language by a translation oracle) to initialize the target language model. Subsequently, this model is used as a teacher, in knowledge-distillation fashion, to train a student classifier which is able to generalize beyond the words transferred from the source classifier to the target classifier.
Wang et al. [65] leverage graph convolutional networks (GCNs) to integrate heterogeneous information within the task. They create a graph with the help of external resources such as a machine translation oracle and a POS-tagger. In the constructed graph, documents and words are treated as nodes, and edges are defined according to different relations, such as part-of-speech roles, semantic similarity, and document translations. Documents and words are connected by their co-occurrences, and the edges are labelled with their respective POSs. Document-document edges are also defined according to document-document similarity, as well as between translation equivalents. Once the heterogeneous cross-lingual graph is constructed, GCNs are applied in order to calculate higher-order representations of nodes with aggregated information. Finally, a linear transformation is applied to the document components in order to compute the prediction scores.
van der Heijden et al. [60] demonstrate the effectiveness of meta-learning approaches to cross-lingual text classification. Their goal is to create models that can adapt to new domains rapidly from few training examples. They propose a modification to MAML (Model-Agnostic Meta-Learning) called ProtoMAMLn. MAML is a meta-learning approach that optimises the base learner on the so-called "query set" (i.e., in-domain samples) after it has been updated on the so-called "support set" (that is, out-of-domain samples). ProtoMAMLn is an adaptation of ProtoMAML in which the prototypes (computed by a "Prototypical Network" [57]) are also L2-normalized.
Unlike our system, all the previously discussed approaches are designed to deal with a single source language only. Nevertheless, as we have already specified in Section 1, a solution designed to natively deal with multiple sources would be helpful. A similar idea is presented in [9], where the authors propose a method that relies on an initial multilingual representation of the document constituents. The model focuses on learning, on the one hand, a shared (language-invariant) representation via an adversarial network, and on the other hand, a private (language-specific) representation via a mixture-of-experts model. We do not include the system of [9] as a baseline in our experiments since it was designed for dealing with single-label problems.

CONCLUSIONS
We have presented Generalized Funnelling (gFun), a novel hierarchical learning ensemble method for heterogeneous transfer learning, and we have applied it to the task of cross-lingual text classification. gFun is an extension of Funnelling (Fun), an ensemble method where 1st-tier classifiers, each working on a different and language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and where the final classification decision is taken by a meta-classifier that uses this vector as its input, and that can thus exploit class-class correlations. gFun extends Fun by allowing 1st-tier components to be arbitrary view-generating functions, i.e., language-dependent functions that each produce a language-agnostic representation ("view") of the document. In the instance of gFun that we have described here, for each document the meta-classifier receives as input a vector of calibrated posterior probabilities (as in Fun) aggregated to other embedded representations of the document that embody other types of correlations, such as word-class correlations (as encoded by "word-class embeddings"), word-word correlations (as encoded by "multilingual unsupervised or supervised embeddings"), and correlations between contextualized words (as encoded by multilingual BERT). In experiments carried out on two large, standard datasets for multilingual multilabel text classification, we have shown that this instance of gFun substantially improves over Fun, and over other strong baselines such as multilingual BERT itself. An additional advantage of gFun is that it is much better suited to zero-shot classification than Fun, since in the absence of training examples for a given language, views of the test document different from the one generated by a trained classifier can be brought to bear.
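In schematic form, the inference-time flow of gFun summarized above might look like this (all names are ours; vgfs and metaclassifier stand for already-trained components, and averaging is the default aggregation policy):

```python
import numpy as np

def gfun_predict(doc, lang, vgfs, metaclassifier):
    """High-level sketch of gFun at inference time: each view-generating
    function (VGF) maps the monolingual document into a language-independent
    view; the views are aggregated (here, by averaging); the metaclassifier
    then issues the final multilabel decision on the aggregated vector.
    """
    views = [vgf(doc, lang) for vgf in vgfs]       # language-independent views
    aggregated = np.mean(np.stack(views), axis=0)  # "averaging" aggregation
    return metaclassifier(aggregated)              # vector of class decisions
```

Because averaging keeps the aggregated vector's dimensionality fixed, a view can be dropped (e.g., the posterior-probability view for a language with no training data) without retraining the metaclassifier, which is what enables the zero-shot behaviour discussed above.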
Aside from its very good classification performance, gFun has the advantage of having a "plug-and-play" character, since it allows arbitrary types of view-generating functions to be plugged into the architecture. A common characteristic of recent CLTC solutions is that they leverage some kind of available, pre-trained cross- or multilingual resource; nevertheless, to the best of our knowledge, a solution trying to capitalise on multiple different (i.e., heterogeneous) resources had not yet been proposed. Furthermore, most approaches aim at improving the performance on the target language by exploiting a single source language (i.e., they are single-source approaches). In this, gFun differs from the solutions discussed above, since (i) it fully capitalises on multiple, heterogeneous available resources, (ii) while capable in principle of dealing with single-source settings, it is especially designed to be deployed in multi-source settings, and (iii) it is an "everybody-helps-everybody" solution, meaning that each language-specific training set contributes to the classification of all the documents, irrespective of their language, and that all the languages benefit from the inclusion of other languages in the training phase (in other words, all the languages play the roles of source and target at the same time).
Finally, we note that gFun is a completely general-purpose heterogeneous transfer learning architecture, and its application (once appropriate VGFs are deployed) is not restricted to cross-lingual settings, or even to scenarios where text is involved. Indeed, in our future work we plan to test its adequacy for cross-media applications, i.e., situations in which the domains across which knowledge is transferred are represented by different media (say, text and images).

ACKNOWLEDGMENTS
[...] the H2020 Programme ICT-48-2020. The authors' opinions do not necessarily reflect those of the European Commission.