Lost in Transduction: Transductive Transfer Learning in Text Classification

Obtaining high-quality labelled data for training a classifier in a new application domain is often costly. Transfer Learning (a.k.a. “Inductive Transfer”) tries to alleviate these costs by transferring, to the “target” domain of interest, knowledge available from a different “source” domain. In transfer learning, the lack of labelled information from the target domain is compensated by the availability, at training time, of a set of unlabelled examples from the target distribution. Transductive Transfer Learning denotes the transfer learning setting in which the only set of target documents that we are interested in classifying is known and available at training time. Although this definition is indeed in line with Vapnik’s original definition of “transduction”, current terminology in the field is confused. In this article, we discuss how the term “transduction” has been misused in the transfer learning literature, and propose a clarification consistent with the original characterization of this term given by Vapnik. We go on to observe that the above terminology misuse has brought about misleading experimental comparisons, with inductive transfer learning methods having been incorrectly compared against transductive transfer learning methods. We then give empirical evidence that the difference in performance between the inductive version and the transductive version of a transfer learning method can indeed be statistically significant (i.e., that knowing at training time the only data one needs to classify indeed gives an advantage). Our clarification allows a reassessment of the field, and of the relative merits of the major, state-of-the-art algorithms for transfer learning in text classification.

In machine learning, the term transduction, as introduced by Vapnik, means "inference from particular to particular" [15], i.e., it describes the inference carried out by learning methods that (i) are given access not only to a labelled training set, but also to the only set of unlabelled data we are interested in classifying (in this article we will call the latter the object set), and (ii) do not label the documents in the object set by means of a general-purpose classifier.³ In other words, transduction is meant to be applied to settings in which we know exactly, before any learning has taken place, that we will not be interested in classifying any unlabelled data other than those belonging to a finite, specific set that is already available to us at training time.⁴ Scenarios such as these are common, e.g., in market research [6], in e-discovery [34], or when assisting the production of systematic reviews [25].
The main contributions of this article can be summarized as follows. In Section 2, we propose a clarification of terminology that restores the original sense of the term "transductive inference", as proposed by Vapnik, in the context of transfer learning, while in Section 3, we discuss how the meaning of "transduction" has shifted in recent literature. In Section 4, we then identify cases in which the misuse of terminology has led to confusion and incorrect experimental comparisons. In Section 5, we go on to provide empirical evidence that the differences in performance between the inductive and transductive variants of a given transfer learning method can be statistically significant, which implies that experimental comparisons that confuse the two variants (among which are the ones identified in Section 4) are seriously flawed. To do so, we provide examples of these statistically significant differences, which we obtain by (i) comparing the performance of previously published transfer learning methods belonging to the inductive group and the transductive group, and (ii) comparing the performance of two ITL methods (Structural Correspondence Learning (SCL) [7,41] and Distributional Correspondence Indexing (DCI) [32]) with corresponding transductive variants that we have generated. Section 6 presents some concluding remarks.

A TAXONOMY OF LEARNING METHODS
In this section, we formalize the difference between methods for inductive learning (IL), semi-supervised learning (SSL), transductive learning (TL), inductive transfer learning (ITL), and transductive transfer learning (TTL).
Let us first define some basic concepts. A domain is a triple D = (X, F, φ), where X is a random variable taking values on documents, F is a feature space (e.g., a vector space ℝ^m), and φ is the representation function φ : X → F, which maps documents into the feature space. Note that the image of φ is also a random variable, which we call the domain distribution and denote as P_D.
A sample σ of a domain D is an empirical distribution containing random variates of P_D, i.e., a set σ = {x_i}_{i=1}^{n} ⊂ P_D of feature vectors drawn from the domain distribution. We will use σ_D to indicate that sample σ originates from domain D.
For ease of discussion, in this article we restrict our attention to binary classification; however, everything we say can straightforwardly be extended to other types of classification, such as single-label multiclass classification, multi-label multiclass classification, and ordinal classification.

³ As Vapnik puts it, "The direct estimation of values of a function only at points of interest using a given set of functions forms a new type of inference which can be called transductive inference. In contrast to the inductive solution that derives results in two steps, from particular to general (the inductive step) and then from general to particular (the deductive step), the transductive solution derives results in one step, directly from particular to particular (the transductive step)." [48, p. 12]

⁴ Put another way, should we later become interested in classifying another set of unlabelled data, the learning phase should be carried out anew.
A binary classifier is a function h : F → Y, with Y = {−1, +1} the label space. We use σ^L_D to denote any labelled sample {(x_i, y_i)}_{i=1}^{n} ⊂ P_D × Y, where document x_i has label y_i. The following instantiations of the aforementioned concepts will prove useful in our subsequent definitions: in the rest of the article, S and T will denote the source and target domains, while Tr^L_S, Tr^U_S, Tr^U_T, Ob^U_S, Ob^U_T, Te^U_S, and Te^U_T will denote samples, where Tr^L_S is a labelled training set, Tr^U_S and Tr^U_T are unlabelled training sets, Ob^U_S and Ob^U_T are unlabelled object sets, and Te^U_S and Te^U_T are unlabelled test sets, all drawn from S and T as indicated. As we will see in Definition 2.2, the notion of an "unlabelled training set" is justified, since in SSL unlabelled data also play a role in training a classifier. We recall from Section 1 that an object set is a set of unlabelled documents such that (a) it is available at training time, and (b) the unlabelled data it contains are the only unlabelled data that we are interested in classifying.
Definition 2.1. An Inductive Learning (IL) method is a method that, given a labelled training set Tr^L_S, learns a general hypothesis h : P_S → Y. The adequacy of h must be measured according to an evaluation function that measures the agreement between the predicted labels h(x_i) and the true labels y_i for a test set Te^U_S of documents. (Te^U_S is viewed as "unlabelled" because the true labels y_i are hidden from h.) The purpose of Te^U_S is to support this evaluation, which means that Te^U_S must not be seen at training time (that this practice has not always been adhered to in past transfer learning literature is, as we will see, a central claim of our article). Unlike Ob^U_S in TL (see below), Te^U_S is expected to be sufficiently representative of P_S, since the goal of the evaluation is to estimate the accuracy of h at classifying any possible unlabelled sample from the domain. Note that the training and test documents are assumed to be drawn iid from the same (and only) domain S.

Definition 2.2. A Semi-Supervised Learning (SSL) method is a method that, given a labelled training set Tr^L_S and an unlabelled training set Tr^U_S, learns a general hypothesis h : P_S → Y. This case is also inductive; the only difference is that the learning device has access not only to labelled data Tr^L_S but also to unlabelled data Tr^U_S drawn from the same domain S.

Definition 2.3. A Transductive Learning (TL) method is a method that, given a labelled training set Tr^L_S and an unlabelled object set Ob^U_S, generates predicted labels h(x_i) for all documents x_i in Ob^U_S directly, i.e., without using a general rule h : P_S → Y. Note that in this case there is no requirement that the method also return a general rule h : P_S → Y, i.e., the method might just learn a function h : Ob^U_S → Y that takes binary decisions only for the elements of Ob^U_S.
It is important to distinguish a TL method (or algorithm) from a TL problem (or setting): in a nutshell, a problem is characterized by what we have and by what we want to achieve, while a method is characterized by how we achieve it. A TL problem is a situation in which, given a labelled training set Tr^L_S and an unlabelled object set Ob^U_S, we need to generate predicted labels h(x_i) for all documents x_i in Ob^U_S. In principle, IL methods are also applicable to TL problems, since Ob^U_S is just a specific sample from P_S; in other words, we can generate predicted labels h(x_i) for all documents x_i in Ob^U_S indirectly, i.e., by learning a general rule h : P_S → Y and using it to generate these predicted labels. Adopting such a solution might be called a "TLP-via-ILM approach", solving a TL problem via an IL method. Similarly, a "TLP-via-SSLM approach" would consist of solving a TL problem via an SSL method, and could be achieved by using an additional unlabelled training set Tr^U_S (with Tr^U_S ∩ Ob^U_S = ∅).
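The contrast between the two routes can be made concrete on toy data. The following sketch (ours, not from the article) solves a TL problem first indirectly, via an inductive learner ("TLP-via-ILM"), and then directly, using scikit-learn's LabelSpreading as a stand-in transductive learner that labels only the points it was given. Note that, deviating from the {−1, +1} label space used above, the classes here are coded as {0, 1}, since scikit-learn reserves −1 as the "unlabelled" marker:

```python
# Sketch: two ways of solving a TL problem on synthetic data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelSpreading

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
rng = np.random.default_rng(0)
labelled = rng.choice(len(X), size=20, replace=False)   # the labelled sample
mask = np.zeros(len(X), dtype=bool)
mask[labelled] = True
X_tr, y_tr = X[mask], y[mask]                           # Tr (labelled)
X_ob = X[~mask]                                         # Ob (object set)

# "TLP-via-ILM": learn a general rule h, then apply it to the object set.
h = LogisticRegression().fit(X_tr, y_tr)
pred_inductive = h.predict(X_ob)

# Direct transduction: label the object set in one step; -1 marks Ob items.
y_all = np.where(mask, y, -1)
ls = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_all)
pred_transductive = ls.transduction_[~mask]             # labels only for Ob

print((pred_inductive == y[~mask]).mean(),
      (pred_transductive == y[~mask]).mean())
```

On this easy two-blob dataset both routes perform well; the point of the sketch is only the difference in what each route produces (a reusable rule h versus labels for the object set alone).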
While legitimate, these solutions are suboptimal according to what is now known as "Vapnik's principle" [48], which suggests the following:⁵ "If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem."
In other words, Vapnik suggests that the optimal way (i.e., the one conducive to higher accuracy) of solving a TL problem is directly, via a TL method, and not indirectly, via (fully supervised (FS) or semi-supervised (SS)) IL methods.
Definition 2.4. An Inductive Transfer Learning (ITL) method is a method that, given a labelled training set Tr^L_S (plus, optionally, an unlabelled training set Tr^U_S) and an unlabelled training set Tr^U_T, from two different but related domains S and T, learns a general hypothesis h : P_T → Y.
Note that this approach includes aspects from induction (the requirement that a general hypothesis is generated) and semi-supervision (the optional presence of the unlabelled training set). In this case, the iid assumption no longer holds, since S = (X_S, F_S, φ_S) and T = (X_T, F_T, φ_T) are different. This difference might be of type X_S ≠ X_T (with F_S = F_T), which is usually known as cross-domain adaptation, or of type F_S ≠ F_T (with X_S ∼ X_T),⁶ in which case the problem is typically known as cross-lingual adaptation.⁷ Therefore, in both cases φ_S ≠ φ_T holds.

Definition 2.5. A Transductive Transfer Learning (TTL) method is a method that, given a labelled training set Tr^L_S and an unlabelled object set Ob^U_T (and optionally two unlabelled training sets Tr^U_S and Tr^U_T, with Tr^U_T ∩ Ob^U_T = ∅) from two different but related domains S and T, generates predicted labels h(x_i) for all documents x_i in Ob^U_T directly, i.e., without using a general rule h : P_T → Y.

The main differences of a TTL algorithm with respect to an ITL one thus lie in the facts that in the former, unlike in the latter, (i) there is an object set Ob^U_T which is observed at training time, and (ii) we generate no general hypothesis h : P_T → Y but only predicted labels h(x_i) for documents x_i in Ob^U_T.⁸ The main difference of a TTL algorithm with respect to a TL one is instead that in the former, unlike in the latter, the training set and the object set are not iid, since they originate from two different domains S and T.
Similarly to what we said for TL methods and TL problems, we should distinguish between TTL methods and TTL problems, the latter being the settings in which we need to generate predicted labels h(x_i) for all documents x_i in an object set Ob^U_T, given a labelled training set Tr^L_S (and optionally two unlabelled training sets Tr^U_S and Tr^U_T, with Tr^U_T ∩ Ob^U_T = ∅). A TTL problem may also be solved via an ITL method (which might be called a "TTLP-via-ITLM approach"), i.e., by labelling the documents x_i in Ob^U_T indirectly, by learning a general-purpose rule h : P_T → Y; but this would be yet another violation of Vapnik's principle.

⁵ Vapnik's is a common-sense principle, one of the many "laws of parsimony" that guide scientific development. Another instance of Vapnik's principle in machine learning is represented by "quantification" (a.k.a. "supervised prevalence estimation"; see [18]), the task of predicting the distribution across the classes of a set of unlabelled items: while this can be achieved by classifying each item and counting how many items have been assigned to which class, it is more effective (in keeping with Vapnik's principle) to solve this problem directly, without resorting to classification.

⁶ In set theory, two sets A and B are said to be equivalent, denoted A ∼ B or A ≡ B, if there exists a bijection between the two, i.e., if they have the same cardinality. In cross-lingual adaptation, this comes down to assuming that a one-to-one correspondence between the documents in the source language and the documents in the target language is always possible (using, e.g., a translation oracle), since the documents in X_S and X_T are conceptually equivalent.

⁷ Other instantiations exist, in which the cross-domain and cross-lingual adaptations are tackled simultaneously; see, e.g., [32].

⁸ In this respect, it is worth mentioning that Vapnik and his co-authors [15] suggested that transductive inference might still be attained in scenarios in which the iid assumption is relaxed.

Table 1. An IL / SSL / TL / ITL / TTL problem (or setting) is characterised by the sets indicated in the middle five columns in rows 1 / 2 / 3 / 4 / 5, respectively. An IL / SSL / TL / ITL / TTL method (or algorithm) is characterised by the fact that, in the presence of the sets indicated in the middle five columns, the only output is the one indicated in the last column in rows 1 / 2 / 3 / 4 / 5, respectively.
The definitions above are concisely summarized in Table 1.
It is possible to characterize the learning methods described above with respect to the stance they take on three basic dichotomies. Note that in this section, and in the rest of the article, we have assumed that the learning problem is one of classification. However, everything we say in this article straightforwardly applies to other supervised learning tasks, such as regression.⁹

THE SHIFTING MEANING OF "TRANSDUCTION"
The definition of "transduction" given in Section 2 is the one given by Vapnik [48, p. 12] (see also Footnote 3), and refers, according to the terminology we have introduced in Section 2, to TL methods. However, in the context of transfer learning, that definition would partially clash with that of Arnold et al. [1], where the term TTL appeared for the first time. In these authors' definition, the term TL encompasses all scenarios where all the data we want to classify are already available at training time, and has nothing to do with the type of method used for classifying these data. In other words, Arnold et al. [1] seem to be thinking of TL problems rather than of TL methods: what in Section 2, we have called a "TTLP-via-ITLM" approach would squarely count, according to [1], as a TL method. Indeed, some among the models that [1] proposed solve a TL problem via an IL method. Several works that followed (e.g., [3,42]) adopted this definition of TTL. To the best of our knowledge, the only works about transfer learning which use the term "transduction" in Vapnik's original sense are [44,45] (although they presents no text classification experiments) and [2,22] (which will be discussed in Section 4.2).
Years after [1] was published, the term "transduction" was used in the widely cited survey by Pan and Yang [36], which has henceforth become a standard reference for transfer learning.¹⁰ However, these authors again altered the meaning of TTL, using it to describe the more general setting in which "no labelled data in the target domain are available while a lot of labelled data in the source domain are available" [36], thus removing the constraint that all the unlabelled data we are interested in classifying must be available at training time.¹¹ In their terminology, cross-domain adaptation and cross-lingual adaptation (see Section 2) become two subproblems of TTL, regardless of whether or not there is a set of unlabelled data, available at training time, that is the only data we are interested in classifying (i.e., an object set). Probably, [5,19] were the first works to adopt this altered definition.
The lack, in the field of transfer learning, of a clear distinction between induction and transduction in the terms defined by Vapnik makes it difficult for readers to understand whether a transfer learning method, as applied to a transductive problem, is actually an ITL method (i.e., one that labels the items in the object set by using a classifier that can be applied to any future set of unlabelled data) or is instead a TTL method (i.e., one that labels the unlabelled data seen at training time directly); we show examples of the two types of methods in Section 4. This aspect is worth taking into account since, although a TTL method could well be applied to different unlabelled sets by rerunning the method from scratch every time, this additional cost is avoided in ITL. On the other hand, on a TTL problem one should expect better accuracy from a TTL method than from an ITL method, since the former solves a less general (hence easier) problem than the latter, and might thus be preferred when generalization is not needed (see [11] for a broader discussion).
One consequence of the above-mentioned terminological confusion is the existence of "unfair" comparisons in the field, where some TTL methods have been claimed to be superior to some ITL methods by testing them on ITL problems, i.e., on problems in which the methods were not assumed to be learning from the unlabelled documents used for testing the systems, and in which TTL methods were not meant to be applied at all. This will be the topic of the next two sections.

¹⁰ At the time of writing, this article has 11,989 citations on Google Scholar.

¹¹ This reformulation of the problem was deliberate and acknowledged in their survey, and was thus not due to a mistake. In their own words, "Note that the word "transductive" is used with several meanings. In the traditional machine learning setting, transductive learning (...) refers to the situation where all test data are required to be seen at training time, and that the learned model cannot be reused for future data. (Thus, when some new test data arrive, they must be classified together with all existing data. (...)) In our categorization of transfer learning, in contrast, we use the term transductive to emphasize the concept that in this type of transfer learning, the tasks must be the same and there must be some unlabelled data available in the target domain." [36, p. 1352]

INDUCTIVE AND TRANSDUCTIVE TRANSFER PROBLEMS
In this section, we give a general view of previous efforts in the field on the basis of the distinctions discussed before, i.e., we classify the methods according to whether they have been tested in an inductive setting or in a transductive setting, and according to whether they are actually ITL methods or TTL methods. In doing so, we do not describe each method in detail; we refer the interested reader to [36,37] for a more detailed discussion, or to the original papers.
The goal of this section is not to offer a review of past literature, but rather to show the need for a clear distinction between induction and transduction. On the basis of this, we also identify cases in which the lack of such a clear distinction has led to unfair experimental comparisons and, in turn, to unreliable conclusions on the relative merits of different methods.

Transductive Transfer Problems
In a TTL problem, the learner is given access to the unlabelled object set Ob^U_T right from the beginning. The best-known benchmarks that have been used in order to test solutions to this problem are adaptations of the Reuters-21578, SRAA, and 20Newsgroups datasets (all well-known datasets for TC by topic) proposed by Dai et al. [13,14] for cross-domain adaptation.
The adaptation that [13,14] propose leverages the hierarchical structure of the set of classes that characterise these datasets in order to generate new benchmarks for testing transfer learning systems. This procedure consists of picking two top-level classes, say, A and B, with subclasses A_1, ..., A_x and B_1, ..., B_y, respectively, where the task is defined as a binary classification problem in which one needs to discriminate class A from class B. Then, two disjoint "folds" are extracted to form the source data (S) and target data (T); for instance, A_S = ∪_{i=1}^{α} A_i and A_T = ∪_{i=α+1}^{x} A_i will represent the source and target parts for class A, while B_S = ∪_{i=1}^{β} B_i and B_T = ∪_{i=β+1}^{y} B_i will represent the source and target parts for class B, for some 1 < α < x and 1 < β < y. Note that the documents in S and those in T are indeed related (they belong to the same top-level class) but different (they belong to different subclasses of the same top-level class), as required in transfer learning. The training (source) set and the test (target) set are defined as Tr^L_S = A_S ∪ B_S and Te^U_T = A_T ∪ B_T, respectively. Note that what we have described here is a setup for testing ITL methods; if we want to test TTL methods, A_T ∪ B_T must play the role of the object set Ob^U_T and of the test set Te^U_T at the same time, i.e., the documents in A_T ∪ B_T are available to the algorithm at training time, and the accuracy of the algorithm is measured in terms of how good it is at labelling them. Note also that there is no other unlabelled set, either from the source domain or from the target domain.
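The split procedure above can be sketched in a few lines. The subclass names and documents below are hypothetical placeholders, not taken from the actual Reuters-21578/SRAA/20Newsgroups hierarchies; the sketch only illustrates the source/target construction:

```python
# Sketch: building a Dai-et-al.-style cross-domain benchmark from a
# two-level class hierarchy. `docs[c]` maps a subclass name to its documents.
docs = {
    "A1": ["a1-doc"], "A2": ["a2-doc"], "A3": ["a3-doc"],
    "B1": ["b1-doc"], "B2": ["b2-doc"], "B3": ["b3-doc"],
}

def split(subclasses, alpha):
    """Source = union of the first alpha subclasses; target = the rest."""
    src = [d for c in subclasses[:alpha] for d in docs[c]]
    tgt = [d for c in subclasses[alpha:] for d in docs[c]]
    return src, tgt

A_S, A_T = split(["A1", "A2", "A3"], alpha=2)
B_S, B_T = split(["B1", "B2", "B3"], alpha=2)

# Binary task: discriminate A (+1) from B (-1); only the source is labelled.
Tr_L_S = [(d, +1) for d in A_S] + [(d, -1) for d in B_S]
Te_U_T = A_T + B_T   # in the TTL setup, this also plays the role of Ob_U_T
```

Source and target thus share the same top-level classes (A vs. B) while containing disjoint subclasses, which is exactly the "related but different" property the benchmark relies on.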
Datasets structured like this were first used by Dai et al. [13,14] to test two different approaches: Co-Clustering based Classification (CoCC) [13], which co-clusters domains and words as a means to propagate the class structure from the source domain to the target domain; and TrAdaBoost [14], an extension of AdaBoost that implements transfer learning. Since then, many authors have adopted experimental settings with the same structure, in order to test transfer learning systems based on topic models (e.g., Topic-Bridged PLSA (TPLSA) [56], Topic-Bridged LDA (TLDA) [51], and Partially Supervised Cross-Collection LDA (PSCCLDA) [4]), Non-negative Matrix Factorization (NMF) (e.g., MTrick [61]), probabilistic models (e.g., Topic Correlation Analysis (TCA) [26]), and clustering techniques (e.g., Cross-Domain Spectral Classification (CDSC) [29]). However, although these methods have been tested on transductive transfer problems (i.e., by having A_T ∪ B_T play the role of Ob^U_T and Te^U_T at the same time), not all of them are TTL methods as defined in Section 2. Indeed, TrAdaBoost [14], TLDA [51], and TCA [26] are ITL methods; i.e., when applied to a transductive problem, a "TTLP-via-ITLM approach" must be followed. When ITL methods are tested on an ITL problem, they are meant to be tested on a test set Te^U_T different from the unlabelled set Tr^U_T on which they have been trained, in order to show that they generalize. Analogously, when these methods are tested on a TTL problem, the unlabelled training set Tr^U_T and the object set Ob^U_T must be different too. It is one of the central observations of this article that this caveat has not always been adhered to in comparative experimentations, and this has brought about flawed comparative results that are still being relied upon today.

Inductive Transfer Problems
In an ITL problem, the learner has access to the labelled set Tr^L_S from the source domain and the unlabelled set Tr^U_T from the target domain (an unlabelled set Tr^U_S from the source domain might be available as well). There is no object set Ob^U_T, since the goal is to generate (induce) a general-purpose classifier for the entire target domain. The test set Te^U_T is thus only meant to be used for evaluation purposes, i.e., for estimating the effectiveness of the classifier in classifying any document from the target domain.
The most popular benchmarks for testing solutions to these problems are Multi-Domain Sentiment (MDS) [7], which was proposed for cross-domain adaptation, and its cross-lingual extension Webis-CLS-10 [40]. Both datasets consist of Amazon product reviews for different product categories, and include 2,000 labelled reviews per product category and a number of unlabelled reviews, ranging from 3,586 (DVD reviews in MDS) to more than 50,000 (in Webis-CLS-10). Neutral reviews have been filtered out, and the task is thus defined as a binary sentiment classification problem (Positive vs. Negative).
This has promoted a (somewhat unmotivated) partition of transfer learning methods, according to which most of the methods tested on transductive transfer problems deal with classification by topic, while most of the methods tested on inductive transfer problems deal instead with classification by sentiment. The net result is that inductive transfer problems have received comparatively more attention than their transductive counterparts. In what follows, we give a comprehensive overview of the most important methods in the area, and show that some of them are actually transductive transfer methods, something that was not to be expected given the characteristics of the datasets they have been tested on.
Arguably, the most important methods proposed for the inductive transfer problem are SCL [7] for cross-domain adaptation, and its cross-lingual version (CL-SCL) [41]. SCL bridges the gap between the source and target domains by solving intermediate structural problems defined upon the notion of pivot features (frequent and predictive features that behave approximately similarly in both domains). Pivots are typically discovered by inspecting the supervised source set (e.g., by measuring the mutual information between a feature and the class labels); their distributional properties are mined by inspecting the unlabelled source and target training sets Tr^U_S and Tr^U_T. Other methods that follow similar principles have been described since then, including further pivot-based approaches like Spectral Feature Alignment (SFA) [35] for cross-domain adaptation, and DCI [32] for cross-domain and cross-lingual adaptation. Other methods that similarly rely on mutual information as a means to quantify semantic correlations among words have been described, as, e.g., Sentiment-Sensitive Thesaurus (SST) [10] does in order to expand a sentiment thesaurus.
Although the concept of "pivot" concerns, strictly speaking, pairs of related words, the same concept is still present behind many NMF techniques, though blurred under the notion of "latent topic". Examples of NMF techniques include Topical Correspondence Transfer (TCT) [59] for cross-domain adaptation, and Semi-Supervised Matrix Completion (SSMC) [53], Two-Step Learning (TSL) [52], and the Subspace Learning Framework (CL-SLF) [58] for cross-lingual adaptation. Very recently, [22] proposed the Transductive Kernel Classifier (TKC), a TTL method based on string kernels that was also evaluated on the MDS dataset.
Yet another group of approaches tested on inductive transfer problems has emerged, fostered by the recent upsurge of deep learning. We distinguish between deep architectures and word-embedding-based approaches. The first approach based on deep architectures was Stacked Denoising Autoencoders (SDA) [17], a method that exploited the autoencoding architecture to enforce a consistent representation between source and target in cross-domain adaptation. This was followed by other SDA-based approaches such as Cross-Domain Feature Learning (CDFL) [57]; approaches based on adversarial neural networks, such as Domain Adversarial Neural Network (DANN) [16] and its transductive variant (TransDANN) [2], and Cross-Lingual Distillation with Feature Adaptation (CLDFA) [55]; and combinations of adversarial training with attention mechanisms, such as Adversarial Memory Network (AMN) [28] and Hierarchical Attention Transfer Network (HATN) [27]. Finally, methods have also been proposed for learning (monolingual) word embeddings for cross-domain adaptation (Sentiment-Sensitive Embeddings (SSE) [9]) and, for cross-lingual adaptation, bilingual word embeddings (Bilingual Model (BM) [54]), bilingual phrase embeddings [39], and jointly learned bilingual word and document embeddings (Bilingual Document Representation Learning (BiDRL) [60]).
Some of the aforementioned methods make use of parallel data (generated via automatic translation tools, as in SSMC [53], CL-SLF [58], BiDRL [60], and CLDFA [55], or obtained from already existing parallel resources, as in BM [54]) or rely on a fraction of labelled data from the target domain (as is the case of SSMC [53]). Somewhat surprisingly, it turns out that most of these methods are actually of the transductive transfer type (and this is something the reader might not expect, considering the datasets these methods have been tested on and the baselines they have been compared against); concretely, this affects the methods SSMC [53], CL-SLF [58], BiDRL [60], and CLDFA [55]. The reason is that the parallel data the authors considered in their experiments are the translations that Prettenhofer and Stein made available for the non-English test documents in Webis-CLS-10. This means that, even assuming the approaches could have been trained on a different set of parallel documents (and whether this is possible incidentally remains unclear), the truth is that the results they reported are inevitably optimized for the specific test documents (unfairly taken to be the object set), and can thus not be taken to be representative of the more general inductive transfer problem. TransDANN [2] and TKC [22] also fall in the "transductive group", though in these cases the incursion was deliberate and openly acknowledged.
Methods like SSMC [53], CL-SLF [58], BiDRL [60], and CLDFA [55] thus follow a controversial approach that, in line with the definitions of Section 2, we could call "ITLP-via-TTLM". That is, the authors of these papers have applied a TTL method to a dataset for testing the accuracy of ITL methods, by (unfairly) assuming the test set Te^U_T to be an object set Ob^U_T. From a methodological point of view, the comparison against ITL methods is unfair, since the performance of a TTL method is tailored to (i.e., optimized for) the object set Ob^U_T, which is assumed to be unavailable for a proper ITL method. From a conceptual point of view, the goals that ITL and TTL methods pursue are not comparable either, since a TTL method does not necessarily learn a general hypothesis, as a true ITL method is instead expected to.

FROM INDUCTION TO TRANSDUCTION: TWO EMPIRICAL CASES
Up to now we have commented on the fundamental differences between ITL methods and TTL methods. In order to quantify the impact of these differences in terms of effectiveness, we generate transductive variants of two representative ITL methods, SCL [7,41] and DCI [32] (Sections 5.1 and 5.2), and we empirically evaluate the difference in performance between the inductive and the transductive versions (Section 5.3). We have chosen SCL and DCI for several reasons. First, SCL and DCI cater for both cross-domain adaptation and cross-lingual adaptation, which allows us to evaluate the impact of the above differences on a variety of transfer learning scenarios. Second, the code implementing SCL and DCI has been made publicly available by their authors, which eases our task. (Implementation details are given in Section 5.3.) Third, SCL and DCI are among the most representative ITL methods in the TC literature.
While the former method relies on the SCL paradigm already discussed in Section 4.2, DCI relies on the "distributional hypothesis" 12 to generate a vector space specifically devised for knowledge transfer. In this vector space, words that play similar roles across domains are close to each other (e.g., the word "read" from the book domain is close to the word "listen" from the music domain, as both play analogous roles in their respective domains), since word vectors are defined with respect to the pivot words (frequent and highly predictive words that behave similarly across domains; example pivot words are "excellent" or "poor" in any domain having to do with product reviews). Both methods consist of two main phases, representation (Section 5.1) and classification (Section 5.2), which we describe next.
The transductive variants we generate for the (originally inductive) SCL and DCI methods serve the sole purpose of evaluating whether the differences in performance between the inductive and transductive versions are significant or not; these transductive variants are rather obvious, and should not be considered part of our original contribution.

Document Representation
SCL and DCI bridge the gap between the source domain S = (X_S, F_S, ϕ_S) and the target domain T = (X_T, F_T, ϕ_T), where F_S = R^m and F_T = R^n are two vector spaces (into which documents are mapped via, e.g., tf-idf weighting), by working out additional representation functions ϕ_S : R^m → R^k and ϕ_T : R^n → R^k that generate document representations in a shared vector space R^k, whose dimensions are the above-mentioned pivot words. Here, m and n are the numbers of distinct features (i.e., the vocabulary sizes) in the source and target domains, respectively, and k is a user-defined parameter which specifies the number of dimensions of the shared space, i.e., the number of pivot words.
The representation functions are implemented as linear mappings

ϕ_S(x) = x Z_S        ϕ_T(x) = x Z_T

where Z_S ∈ R^{m,k} and Z_T ∈ R^{n,k} are the projection matrices whose rows are the k-dimensional word profiles (or embeddings). 13 In a domain D, entry Z_ij of projection matrix Z quantifies the degree of correlation between the ith word in the original vector space and the jth pivot word.

12 The distributional hypothesis states that words with similar meanings tend to co-occur in the same contexts [20].
13 The word profiles that SCL and DCI generate are indeed essentially word embeddings (low-dimensional, dense vectorial representations of words); however, they are generated by means of simple operations on the co-occurrence matrices, and are not the product of any neural procedure.
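As a minimal sketch (function and variable names are ours, not the authors'), applying a representation function amounts to a single matrix multiplication of the document vectors with the projection matrix:

```python
import numpy as np

def project(X, Z):
    """Map documents from the original space R^n into the shared
    pivot space R^k via the linear mapping phi_D(x) = x Z_D.
    X: documents-by-words matrix in R^{d,n} (e.g., tf-idf weights).
    Z: projection matrix in R^{n,k}; row i is the k-dimensional
       profile (embedding) of the ith word w.r.t. the k pivots."""
    return X @ Z

# Toy example: 2 documents, 4 words, 3 pivots.
X = np.array([[0.5, 0.0, 0.2, 0.0],
              [0.0, 0.3, 0.0, 0.1]])
Z = np.random.default_rng(0).normal(size=(4, 3))
print(project(X, Z).shape)  # (2, 3): documents now live in R^k
```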
SCL and DCI implement different criteria for computing this correlation. In SCL, the correlations between the words and a given pivot in a domain D are measured by solving a structural (classification) problem in which all words are used as features to predict the presence or absence of the pivot in a sample of documents from the domain distribution P_D. The correlation of each word with the pivot is thus taken to be the corresponding coefficient of the hyperplane that defines the separation. The projection matrix Z_D ∈ R^{n,k} is defined as the k principal components of a matrix in R^{n,p} containing, as its columns, all p hyperplanes, with p the number of pivots. When the feature spaces F_S and F_T are not disjoint (that is, when we are not tackling cases of cross-lingual adaptation), SCL replaces the original vector with a concatenation of the vector and the projection [7], i.e., x ← [x; ϕ_D(x)]. However, we have obtained much better results by normalising each component before concatenating them. Specifically, we reduce the dimensionality of x from n to k via principal component analysis, in order to match that of ϕ_D(x), and we then L2-normalize each component before concatenating them.
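The structural step just described can be sketched as follows; this is a simplified illustration (names are ours), in which we use ridge regression as a cheap stand-in for the linear classifiers employed by the original SCL:

```python
import numpy as np

def scl_projection(X, pivot_idx, k):
    """Sketch of SCL's structural learning: for each pivot, train a
    linear predictor of the pivot's presence from all other words,
    stack the p hyperplanes as columns of an n-by-p matrix, and keep
    its k principal components (via SVD). Ridge regression is used
    here as a simple stand-in for the original linear classifiers."""
    n = X.shape[1]
    W = np.zeros((n, len(pivot_idx)))       # one hyperplane per pivot
    for j, piv in enumerate(pivot_idx):
        y = (X[:, piv] > 0).astype(float)   # presence/absence of the pivot
        Xm = X.copy()
        Xm[:, piv] = 0.0                    # mask the pivot itself
        # ridge solution w = (Xm'Xm + I)^-1 Xm'y
        W[:, j] = np.linalg.solve(Xm.T @ Xm + np.eye(n), Xm.T @ y)
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :k]                         # Z_D in R^{n,k}

rng = np.random.default_rng(1)
X = (rng.random((50, 10)) > 0.7).astype(float)   # toy doc-by-word matrix
Z = scl_projection(X, pivot_idx=[0, 1, 2, 3], k=2)
print(Z.shape)  # (10, 2)
```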
In DCI, the correlation Z_ij is defined in terms of the "distributional correspondence" between the ith word and the jth pivot, and is computed via a distributional correspondence function (DCF) 14 f, using a sample of documents from the domain distribution P_D. Each profile dimension is standardized, so that the columns of Z have zero mean and unit variance. Note that, differently from SCL, in DCI it holds that k = p, since the dimensionality of the matrix is not reduced. In this work, we adopt the cosine as DCF, since it outperformed all other DCFs in the experiments reported in [32,33].
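A minimal sketch of DCI with the cosine DCF (again, names are ours; the actual PyDCI implementation is more general):

```python
import numpy as np

def dci_projection(X, pivot_idx):
    """Sketch of DCI with the cosine DCF: entry Z_ij is the cosine
    between the occurrence vector of word i and that of pivot j
    (columns of the documents-by-words matrix X); each column of Z
    is then standardized to zero mean / unit variance. Note that
    k = p: the dimensionality is not reduced."""
    norms = np.linalg.norm(X, axis=0) + 1e-12
    Xn = X / norms                          # unit-norm word vectors
    Z = Xn.T @ Xn[:, pivot_idx]             # cosine similarities, R^{n,p}
    Z = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-12)
    return Z

rng = np.random.default_rng(2)
X = (rng.random((30, 8)) > 0.6).astype(float)
Z = dci_projection(X, pivot_idx=[0, 1, 2])
print(Z.shape)  # (8, 3)
```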
We also use the same pivot selection strategy used in SCL [7,41] and DCI [32] (a strategy that has its roots in the principles expounded in [8]), i.e., we select pivots by first filtering out words that are not frequent enough, and then removing from the remaining words the ones that are not discriminating enough (according to the mutual information between the word and the label, as estimated on the training set Tr^L_S). In the cross-lingual case, pivot selection involves a word-translation oracle, i.e., a mapping from source words to target words (see [41]).
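The frequency-then-mutual-information filter can be sketched as follows (a simplified illustration under our own naming; the original implementations operate on sparse matrices and support translation oracles):

```python
import numpy as np

def select_pivots(X, y, min_df, n_pivots):
    """Sketch of the pivot selection of SCL/DCI: filter out words
    occurring in fewer than min_df training documents, then rank the
    survivors by mutual information with the class label and keep
    the top n_pivots."""
    presence = (X > 0)
    df = presence.sum(axis=0)
    candidates = np.flatnonzero(df >= min_df)
    N = X.shape[0]
    scores = []
    for t in candidates:
        mi = 0.0
        for tv in (0, 1):               # term absent / present
            for cv in (0, 1):           # negative / positive class
                p_tc = np.sum((presence[:, t] == tv) & (y == cv)) / N
                p_t = np.sum(presence[:, t] == tv) / N
                p_c = np.sum(y == cv) / N
                if p_tc > 0:
                    mi += p_tc * np.log2(p_tc / (p_t * p_c))
        scores.append(mi)
    order = np.argsort(scores)[::-1]
    return candidates[order][:n_pivots]

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=100)
X = rng.random((100, 20))
X[:, 0] = y                              # word 0 perfectly predicts the label
pivots = select_pivots(X, y, min_df=5, n_pivots=3)
print(int(pivots[0]))  # 0: the most discriminative frequent word
```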
The projections Z_S and Z_T are learnt from documents-by-words matrices of tf-idf normalised weights. These matrices should be as large as possible, in order to effectively capture the distributional properties of the words. This means that, in scenarios in which the unlabelled sets Tr^U_S and Tr^U_T, of sizes q and r, are available, we first represent them as matrices Tr^U_S ∈ R^{q,m} and Tr^U_T ∈ R^{r,n}, and then compute

(Z_S, Z_T) = ψ(Tr^U_S, Tr^U_T, p⃗)

where ψ is either SCL or DCI, and where p⃗ is the list of pivot words (properly translated into the target language in cases of cross-lingual adaptation). In transductive settings where such unlabelled sets are not available, Z_S and Z_T are directly modelled on the training samples in Tr^L_S and on the object samples in Ob^U_T, of sizes q′ and r′ (properly converted into matrices Tr^L_S ∈ R^{q′,m} and Ob^U_T ∈ R^{r′,n}), as

(Z_S, Z_T) = ψ(Tr^L_S, Ob^U_T, p⃗)
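A thin sketch of these two modes of estimating the projections (psi here is a dummy stand-in for either the SCL or the DCI procedure; all names are ours):

```python
import numpy as np

def build_projections(psi, pivots, TrU_S=None, TrU_T=None,
                      TrL_S=None, ObU_T=None):
    """When large unlabelled samples TrU_S / TrU_T are available,
    the word profiles are estimated on them; otherwise (transductive
    setting without additional unlabelled data), they are modelled
    directly on the labelled training sample TrL_S and on the object
    sample ObU_T."""
    src = TrU_S if TrU_S is not None else TrL_S
    tgt = TrU_T if TrU_T is not None else ObU_T
    return psi(src, pivots), psi(tgt, pivots)

rng = np.random.default_rng(4)
psi = lambda M, piv: M.T @ M[:, piv]        # dummy stand-in for SCL/DCI
TrL_S = rng.random((10, 6))
ObU_T = rng.random((8, 6))
Z_S, Z_T = build_projections(psi, [0, 1], TrL_S=TrL_S, ObU_T=ObU_T)
print(Z_S.shape, Z_T.shape)  # (6, 2) (6, 2)
```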

Learning and Classification
In the transductive modality, both ϕ_S and ϕ_T have to be invoked on Tr^L_S and Ob^U_T, in order to generate (labelled and unlabelled) representations in the shared space before training the transductive classifier. This is required because the transductive classifier directly outputs labels for the elements in Ob^U_T as part of the learning procedure (the transductive step). In the inductive setting, SCL and DCI first use ϕ_S to represent the training documents in Tr^L_S and train the classifier on them (the inductive step), while ϕ_T is invoked only at testing time, in order to classify the documents in Te^U_T (the deductive step).
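The transductive modality just described can be summarized in a few lines of code (a sketch under our own naming, with a trivial 1-nearest-neighbour stand-in for the transductive learner):

```python
import numpy as np

def transductive_pipeline(phi_S, phi_T, TrL_S, y_S, ObU_T, tsvm):
    """Sketch of the transductive modality: both representation
    functions are applied before learning, and the transductive
    learner outputs labels for the object documents directly."""
    X_src = phi_S(TrL_S)             # labelled source docs, shared space
    X_obj = phi_T(ObU_T)             # object docs, shared space
    return tsvm(X_src, y_S, X_obj)   # labels for ObU_T

# Toy stand-ins: identity projections, 1-NN in place of a real TSVM.
tsvm = lambda Xs, ys, Xo: [ys[np.argmin(np.linalg.norm(Xs - x, axis=1))]
                           for x in Xo]
labels = transductive_pipeline(lambda x: x, lambda x: x,
                               np.array([[0.0], [1.0]]), [-1, +1],
                               np.array([[0.9]]), tsvm)
print(labels)  # [1]
```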

Transductive SVMs.
The underlying machine learning algorithm we use for the transductive versions of SCL and DCI 15 is Transductive Support Vector Machines (TSVMs) [24] with soft margins, which assign labels to the elements of the object set as part of the learning process. TSVMs implement transduction by attempting to maximize the margin of the hyperplane that separates both the training and the unlabelled data (instead of the training data alone, as inductive SVMs do). For the TSVMs we use the linear kernel, which has consistently delivered good accuracy in TC applications so far [23].
The transductive SVM classification problem is stated as the following structural risk minimization problem:

Minimize over (y*_1, ..., y*_k, w, b, ξ_1, ..., ξ_n, ξ*_1, ..., ξ*_k):

  (1/2) ||w||^2 + C Σ_{i=1..n} ξ_i + C* Σ_{j=1..k} ξ*_j

subject to, for all i = 1, ..., n: y_i (w · x_i + b) ≥ 1 − ξ_i, with ξ_i ≥ 0,
and, for all j = 1, ..., k: y*_j (w · x*_j + b) ≥ 1 − ξ*_j, with ξ*_j ≥ 0,

where the x_i are the n labelled training documents and the x*_j are the k object documents. Note that the y*_j are the predicted labels for the object documents and that, though the algorithm actually produces a classifier, defined as h(x) = sign(w · x + b), this classifier is not used to (re)classify the object documents. Indeed, there is no guarantee that the label attributed in the transductive step coincides with the label that classifier h would assign, i.e., h(x*_j) = y*_j does not necessarily hold; specifically, it does not hold for the documents x*_j for which ξ*_j > 1. The implementation of TSVMs we use is the one made available by Thorsten Joachims in his SVMlight package 16 [24].
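In practice, SVMlight triggers its transductive learner when the training file contains examples with target value 0, which mark the unlabelled object documents. A sketch of how such a file can be written (the helper name is ours):

```python
def write_svmlight_transductive(path, X_lab, y_lab, X_obj):
    """Write labelled source documents (targets +1/-1) followed by
    object documents (target 0, i.e., unlabelled) in the SVMlight
    input format; svm_learn then runs in transductive mode."""
    with open(path, "w") as f:
        for x, y in zip(X_lab, y_lab):
            feats = " ".join(f"{j+1}:{v}" for j, v in enumerate(x) if v != 0)
            f.write(f"{int(y)} {feats}\n")
        for x in X_obj:                    # object documents: target 0
            feats = " ".join(f"{j+1}:{v}" for j, v in enumerate(x) if v != 0)
            f.write(f"0 {feats}\n")

write_svmlight_transductive("train.dat",
                            [[0.5, 0.0], [0.0, 0.3]], [+1, -1],
                            [[0.2, 0.2]])
# One would then run: svm_learn train.dat model
print(open("train.dat").read().splitlines()[-1])  # 0 1:0.2 2:0.2
```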

Experiments
In this section, we describe the results of experiments that compare the transductive versions of SCL (hereafter: TSCL) and DCI (hereafter: TDCI) against the original inductive ones (hereafter: ISCL and IDCI). The experimental settings we explore account for (i) classification by sentiment and by topic, (ii) inductive settings and transductive settings, and (iii) cross-domain and cross-lingual adaptation. In doing so, we deliberately apply TSCL and TDCI also in environments in which the use of transductive techniques is questionable: the aim of this experimentation is thus that of providing empirical evidence that confounding the inductive and transductive paradigms can indeed bring unfair performance benefits to transductive approaches over their inductive competitors, and that this improvement is statistically significant. Somewhat unconventionally, this experimentation does not aim at setting a new best performance for a given dataset since, as will become clear, some of the current best results from the literature have been obtained, as we argue, unfairly.
The datasets we consider include the following:
- 20Newsgroups: a collection of newsgroup messages organized into a hierarchy of categories, of which this setup uses the four top categories comp, sci, rec, and talk. We adopt the setup proposed by [13,14], in which six datasets are defined by selecting a pair of top categories for each dataset; one top category of the pair acts as the positive category and the other as the negative category (e.g., comp vs. sci and rec vs. talk). The subcategories of a top category are then considered as the different domains on which the transfer learning process is applied (e.g., sci.crypt and sci.med for the top category sci).
- MDS 20: a set of Amazon product reviews for the four domains Books, DVD, Electronics, and Kitchen appliances. The preprocessed version contains bags of uni- and bi-grams, and is labelled according to binary sentiment polarity. There are 2,000 labelled instances for each domain, which are split into five folds according to [7] for performance evaluation. This means that each reported accuracy value is an average across five experiments, each of which considers 1,600 training examples from the source domain and 400 test examples from the target domain. This is the only dataset in which accuracy scores are computed via k-fold cross-validation.
- Webis-CLS-10 21: a set of positive and negative Amazon product reviews for three domains (Books, DVD, and Music) in four languages (English, German, French, and Japanese). English is always used as the source language, following [40].
Tables 2 and 3 display additional characteristics of the datasets; see also Section 4.2 for further details.
[Caption of Table 3: Tr^U indicates the sample that, in the experiments, sometimes plays the role of Tr^U_S and sometimes plays the role of Tr^U_T. When the cardinality of a sample is indicated as an interval, this indicates how this cardinality varies across the various tasks (indicated in column "Tasks"). The last column indicates the works where each dataset was first used. Notational conventions are as in Table 2.]

As the evaluation measure we adopt "vanilla" accuracy, i.e.,

A = (TP + TN) / (TP + FP + FN + TN)
where TP, FP, FN, and TN are the numbers of true positives, false positives, false negatives, and true negatives, respectively, as from the standard 2×2 contingency table. Adopting vanilla accuracy (the metric of choice in previous related work) as the evaluation measure is perfectly reasonable, since the datasets are balanced. We have implemented TSCL by adapting the publicly available implementation of CL-SCL [41] made available as part of the Natural-language Understanding Toolkit (NUT) package. 22 Apart from bypassing the translation of pivot words when the source and target languages are the same, and apart from implementing the normalised concatenation described in Section 5.1, the main change we have made concerns the replacement of the original learning device in charge of the final predictions with SVMlight. 23 TSCL is thus obtained by having SVMlight operate in transductive mode, making the object set (with labels omitted) available at training time. In previous literature, SCL has been tested on Webis-CLS-10 [41] and on MDS [7]. For Webis-CLS-10 we thus adopt the configuration proposed in [41] for this dataset, which uses p = 450 pivots and k = 100 principal components for the shared space, and discards pivot candidates appearing in fewer than ϕ = 30 support documents. For MDS we explore several configurations that have been proposed in past literature: in particular, we test the configuration proposed in [7] (which consists of setting p = 100, k = 50, and ϕ = 5), but we also explore other configurations that worked well for DCI (see [33]) and that consider a higher number of pivots (up to p = 1000), and thus a higher dimensionality (up to k = 1000). As done in [41] for Webis-CLS-10, we choose the configuration that works best on the first task of the dataset (Books-DVD, as typically encountered in most papers); we end up using p = 1000, k = 1000, and ϕ = 5.
We also report results for ISCL and TSCL on the Reuters-21578, SRAA, and 20Newsgroups datasets, for which, to the best of our knowledge, no published results for SCL existed so far. Similarly, we choose the configuration that yields the best result in one of the tasks (we choose comp vs. sci, the first task from the dataset with most tasks), which results in setting p = 1000, k = 100, and ϕ = 5. We do not consider configurations involving p > 450 in Webis-CLS-10, since translating pivots is assumed to incur a cost; p = 450 has been agreed upon in past literature as a reasonable cost-effective tradeoff, and setting k > 100 did not yield any better results.

22 https://github.com/pprett/nut
23 Note that the modification we have made to the NUT software only affects the final classification, and not the generation of the vector representations in the shared space. These representations depend on the predictions of a set of classifiers that are tasked with solving the structural problems. The learners we use for solving these intermediate structural problems still rely on the implementation of Prettenhofer and Stein's truncated stochastic gradient descent variant, made available at https://github.com/pprett/bolt
We have implemented TDCI by adapting the PyDCI [33] package 24 to use SVMlight as the learning device, in place of the scikit-learn implementation of SVMs (which does not cater for transduction). These modifications are now integrated into the PyDCI package. We set the number of pivots to p = 450 for Webis-CLS-10, following [40], and to p = 1000 for the other datasets, as proposed in [33].
Since we have adopted a different learner, the accuracy values we report here do not coincide with those previously reported for SCL in [7,41], nor with those reported for DCI in [33]. While no significant variations exist in the latter case, the differences between the originally reported SCL results and our ISCL turn out to be more pronounced.
We set the parameters C and C*, which control the tradeoff between training error and margin, to the SVMlight default values in all cases.
We compare the performance of TDCI with most of the baselines discussed in Section 4. 25 For TrAdaBoost [14] we report results for TrAdaB (which uses SVM as the learner) and TrAdaB(T) (which uses TSVM instead). Note that ISCL acts as an alternative implementation of CL-SCL on Webis-CLS-10 [41], and of SCL on MDS [7]. The accuracy scores for the baseline methods are taken from the original publications. In all cases, we also report results for (i) ISVM, an (inductive) SVM that simply classifies the target documents without carrying out any sort of adaptation; (ii) TSVM, a transductive SVM that trains on the source domain using the target object set as unlabelled examples (again, without any adaptation); and (iii) Upper, an SVM that trains and tests in the target domain, for which we report the accuracy of 5-fold cross-validation (5FCV) on the object set. In Webis-CLS-10, we also report (iv) CL-MT, an inductive SVM that trains on the source English documents and tests on translations of the non-English target documents (we use the translations made available by [40]). SVMlight is used to generate the classifier in all these baselines.

Tables 4-6 report the accuracy scores of the methods discussed across the various datasets. Boldface indicates the best score for each dataset; the accuracy scores of the transductive variants TSCL and TDCI are listed together with the (percentage of) relative accuracy improvement with respect to the inductive counterparts ISCL and IDCI (positive is better). Methods that access (and are thus optimized for) the object set are marked with a dedicated symbol; this symbol is used to establish which systems can be legitimately compared with each other. Table 6 is the one that mixes transductive and inductive methods the most.
It is clear that methods belonging to the transductive group (those marked with the symbol) tend to obtain higher scores than methods from the inductive group; the difference in performance is indeed statistically significant according to a two-sided t-test for the means of two independent samples at a confidence level of α = 0.05 (the trivial baselines ISVM, TSVM, Upper, and CL-MT were left out of the test for obvious reasons).
Unsurprisingly, the transductive variants of SCL and DCI bring about a considerable gain in most cases (up to a relative improvement in accuracy of 16.5% on Japanese-DVD for SCL, and of 10.9% on comp vs. sci for DCI). There are a few exceptions though, which in some cases (comp vs. rec and comp vs. talk in 20Newsgroups, and Japanese-Books in Webis-CLS-10) are particularly pronounced.

[Tables 4 and 5 compare our methods against the baselines SFA [35], CoCC [13], TrAdaB [14], TrAdaB(T) [14], TPLSA [56], TLDA [51], CDSC [29], MTrick [61], TCA [26], PSCCLDA, TCT [59], SDA [17], CDFL [57], TrAdaB [21], DANN [16], AMN [28], and TKC [22], together with ISCL, IDCI, TSCL, and TDCI. A symbol indicates that the method in the corresponding column has access to the object set; boldface indicates the highest score for each dataset.]

Note that in these cases the inductive variant performed very well (actually outperforming all other competitors in the case of IDCI), which may be an indication that transduction might come at a risk (this is indeed confirmed by the relative performance between the ISVM and TSVM baselines in those cases).
The smallest improvements are achieved on the MDS dataset for TDCI. The reason is probably that the contribution of the object set is limited, since in this case 5FCV is adopted for evaluation; this means that in each experiment only 400 object documents are observed, while the number of object documents observed during training is comparatively higher in the other datasets (see Table 3). TDCI outperforms on average all other competitors in the transductive setting (Table 4), even considering the comp vs. rec and comp vs. talk anomaly described above. A direct comparison between the performance of TDCI and the baselines in the inductive setting (Tables 5 and 6) is to be taken with a grain of salt (that is indeed a core claim of this article), since the baselines are assumed to be inductive (though we argued in Section 4 that some of them are actually transductive). In particular, SSMC, CL-SLF, BiDRL, CLDFA, and TKC access the test data during training and, not surprisingly, most of them rank at the top of the ITL competitors in terms of performance.
Finally, a Wilcoxon signed-rank test reveals the differences in performance between ISCL and TSCL, and between IDCI and TDCI, to be statistically significant at confidence level 0.05 (with p-values of 2.4E-12 and 5.2E-3, respectively), and at a much higher confidence level (p-values of 4.0E-13 and 3.5E-5) if we discard the anomalous cases.
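The two tests used above are standard and readily available; the following sketch illustrates them on hypothetical accuracy scores (the numbers below are illustrative, not the paper's actual results):

```python
import numpy as np
from scipy.stats import ttest_ind, wilcoxon

# Hypothetical per-task accuracy scores (illustrative only).
inductive    = np.array([0.78, 0.80, 0.75, 0.82, 0.79, 0.81])
transductive = np.array([0.81, 0.84, 0.80, 0.88, 0.86, 0.89])

# Two-sided t-test for the means of two independent samples (used
# above for the transductive-group vs inductive-group comparison)...
t_stat, t_p = ttest_ind(transductive, inductive)

# ...and Wilcoxon signed-rank test on paired per-task scores (used
# for ISCL vs TSCL and IDCI vs TDCI).
w_stat, w_p = wilcoxon(transductive, inductive)

print(t_p < 0.05, w_p < 0.05)  # True True
```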

CONCLUSIONS
Quite obviously, the accuracy of a classifier improves when the learner knows, at training time, the set of documents the classifier will later be evaluated on. Transductive approaches focus on devising ways of improving the prediction of labels in cases in which the specific object set is available and known in advance. This improvement comes at the cost of sacrificing the generalization ability that inductive approaches exhibit. Inductive and transductive approaches thus pursue radically different goals, and are thus not interchangeable at will (they are only interchangeable in lab experiments, by wrongly assuming the test set to play the role of an object set). This is a major difference that has largely been overlooked in the transfer learning literature, fostered by a misuse of terminology in the field, and has led to unfair comparisons. We have proposed a clarification of terminology, and have given empirical evidence that there was a need for it. To this aim, we have produced transductive variants of two representative inductive methods, SCL and DCI, which we have used to deliberately reproduce a wrong experimental practice (imitating past evaluations in the field), in which we compare the performance of TSCL and TDCI with that of their inductive counterparts on ITL problems. The goal of this evaluation is to show that confounding the two paradigms may lead to unfair comparisons, and that the differences in performance can be statistically significant.