CONNECTOR, fitting and clustering of longitudinal data to reveal a new risk stratification system

The transition from the evaluation of a single time point to the examination of the entire dynamic evolution of a system is possible only in the presence of the proper framework. The strong variability of dynamic evolution makes the definition of an explanatory procedure for data fitting and data clustering challenging. Here we present CONNECTOR, a data-driven framework able to analyze and inspect longitudinal data in a straightforward and revealing way. When used to analyze tumor growth kinetics over time in 1599 patient-derived xenograft (PDX) growth curves from ovarian and colorectal cancers, CONNECTOR allowed the aggregation of time-series data through an unsupervised approach in informative clusters. Through the lens of a new perspective of mechanism interpretation, CONNECTOR shed light onto novel model aggregations and identified unanticipated molecular associations with response to clinically approved therapies.


| INTRODUCTION
In the biological and medical elds, longitudinal data are valuable to explore the evolution of a given event and are expected to have higher predictive power than crosssectional studies, in which variables are collected at one time point only across a sample population.The investigation of the evolution of a system delivers useful insights into (i) how the measurements change over time within samples; (ii) the time span of relevant events; and (iii) how time evolution is associated with clinical surveillance.
Longitudinal data come in many forms.However, their main characteristic is that they consist of portions of functions or curves, with quantities observed as they evolve through time.In the modeling of temporal data, the state of the art of mathematical methodologies can be classi ed into two types of approaches: statistical models, in which no biological mechanisms are specied, and mechanistic models, in which all relations are speci ed (Kendall et al., 1999).The statistical approach makes few assumptions (hence, it tends to be biologically naïve) and provides no information about the underlying mechanisms of the system under study.Conversely, being based on a theoretical framework imposed by the modelers, mechanistic models are more suited to yield biologically-aware knowledge.However, due to the complexity of biological systems, it is di cult for mechanistic models to include the intricate relationships that connect all the factors underlying temporal dynamics.
In this context, we propose a work ow based on methods for Functional Data Analysis (Ferraty and Vieu, 2006;Ramsay and Silverman, 2005).The fundamental aims of FDA are those of classical statistics for simple points in a general but nite dimension.However, the classical methods developed for nite dimension and independent observations cannot be directly applied to in nite dimensional data such as functions or curves.Indeed, being functional data, the sampled variables are strongly correlated and the problem becomes ill-conditioned in the context of multivariate linear models.By analyzing data that vary over time, FDA statistics provide the analytical ground to interpret longitudinal data.
Using FDA methods, here we present CONNECTOR, a data-driven framework to analyze and inspect longitudinal data with the aim to o er a new perspective of mechanism interpretation.CONNECTOR provides several graphical visualizations, supporting users alongside all the analytical steps and required parameters optimizations.Moreover, we distribute a Docker image to guarantee full reproducibility of the presented analyses and ease of use.
To illustrate the e ectiveness of this computational analysis work ow, we have leveraged cancer growth data.The collection of cancer growth data increased constantly in the last years.These data, retrieved from di erent types of biological material, i.e. cancer cell lines (Sharma et al., 2010), patient-derived xenografts (PDXs) and organoids (Hidalgo et al., 2014;Rizzo et al., 2021) are used as pre-clinical models to investigate the mechanisms underlying cancer progression and to identify e ective treatments for speci c patients' subsets.
In the case of PDXs, a common approach for monitoring tumor growth kinetics consists of evaluating average percentages of tumor volume variations between two time points of interest, hence relying on repetition of the measurements that are presumed independent.The analysis of variance (ANOVA) or t-tests for such values are computed in order to evaluate the punctual time e ect among all study arms.In some cases, a categorical system in which the average percentages are classi ed into speci c clinical categories is implemented.In (Oberg et al., 2021) a linear mixed e ects regression model (with eventually quadratic or cubic terms added) was used to t and compare the tumor area, after natural logarithm transform.A broad class of mechanistic models can be used for tting (Kareva I, 2018) data.However, there is no global consensus or biological/clinical evidence on the optimal model that is expected to best t the growth data (Sarapata and De Pillis, 2014;Benzekry et al., 2014).
All previous approaches revolved around a knowledge-based reduction of the systems convolution.CONNECTOR is presented here as a data-driven

| Overview of the CONNECTOR framework
CONNECTOR is a tool for the unsupervised analysis of longitudinal data, that is, it can process any sample consisting of measurements collected sequentially over time.CONNECTOR is built on the model-based approach for clustering functional data presented in (James and Sugar, 2003), which is particularly e ective when observations are sparse and irregularly spaced, as growth curves usually are.
Hereafter an overview of the software is illustrated (Figure 1), while the full description is reported in the Materials and Methods section.Any collection of observations recorded at several di erent time points, de- The framework pipeline of the CONNECTOR package.The four main stages of the data processing are illustrated together with the supporting plots returned by CONNECTOR.The input data are the sampled curves, which can be also associated to annotation features.Data are pre-processed and curves are plotted as lines with dots corresponding to the sampled time points.The heatmap of the full time grid is also provided.The model selection is supported with the cross validated loglikelihoods and the knots positions for the choice of the dimension of the spline basis, and with the functional Davies and Bouldin (fDB, see eq. ( 5)) violin plots, the total tightness (see eq. ( 4)) violin plots and the stability matrices for the choice of the number of clusters.The output of the process is illustrated with the plots of the clustered curves.
scribing a system evolution, is accepted as a longitudinal data set.These data, together with the description of each sample through a set of relevant features, are the input data sets of CONNECTOR.
The CONNECTOR framework is based on the following steps: • The pre-processing step consists of a visualization of the longitudinal data by a line plot and a time grid heat map helping in the inspection of the sparsity of the time points.
• Then, by means of the FDA analysis, the sampled curves are processed by CONNECTOR with a functional clustering algorithm based on a mixed e ect model.The curves are modeled using a nite set of functions (natural cubic splines) with random e ects term on the coe cients.The model is tted and the unknown cluster membership is estimated.This step requires a model selection phase, in which several measures are computed that help the user to properly set the two free parameters of the model.The dimension of the spline basis vector is the rst free parameter to be set.Two plots, the cross-log-likelihood plot and the knots distribution plot, are generated for this task.The rst is generated by the computation of the cross-log-likelihood exploiting the ten-fold crossvalidation method (James et al., 2000); the largest stable value of the mean cross-log-likelihood function is attained at the optimal dimension of the spline ba- The second free parameter is the number of clusters.
The total tightness and the functional Davies-Bouldin (fDB) index are returned jointly with the stability matrices to support the parameter setting.The violin plots of the cluster dispersion (i.e.tightness) and of the cluster separation (i.e.fDB index) assist the user in the selection of the number of clusters.For the sake of completeness, for di erent numbers of clusters, the stability matrices are also provided to verify consistency of the samples cluster assignments across di erent execution runs.
• Once the model selection phase is completed, the output of CONNECTOR is composed of several graphical visualizations to easily mine the results, see bottom panel in Figure 1.The dynamics data are plot in clusters, and for each cluster the cluster mean curve is also reported.The curves can be colored using the features associated with each sample as reported in the Sample Data le.The discriminant plot o ers a visualization of the sample separation in the CONNECTOR clusters, projected on a plane.Moreover, the discriminant function plot shows the discriminant power of each time point.Finally, the estimated curve, the con dence intervals and the observations are reported for each sample.

| Application of CONNECTOR to PDX models of ovarian cancer
CONNECTOR is extremely exible in analyzing longitudinal data and aims to tackle a key di culty in the visualization of high-dimensional clustering data.To show the basic use of CONNECTOR, we analyzed the spontaneous tumor growth patterns of PDX lines derived from the propagation of one chemotherapy-naïve high-grade serous epithelial ovarian cancer (HGS-EOC), which had been passaged for ve generations until production of 21 xenografts (Erriquez et al., 2016).
The progression curves of the PDX lines are reported in Figure 2A.The time grid suggests that until 70 days the frequency of the observations is larger than 0.75 and it drops for larger times, so we truncated the curves at day 70.The model selection suggests that the optimal parameters set is dimension of the spline basis equals to 3 and number of clusters equals to 4 (CTGC-A, CTGC-B, CTGC-C, CTGC-D).This combination of the parameters led to an average value of the cluster stability equal to 0.8.The detailed CONNECTOR analysis is reported in Figure S1.ering this setting the cluster stability is 0.9.The detailed CONNECTOR analysis is reported in Figure S2.In this cohort of PDXs, the response to drugs of individual models was uneven; some models showed signi cant reduction, while other models proved to be resistant, Figure 2B.Notably, the discriminant function reveals that the time with the highest discriminatory power corresponds to day 35.Indeed, around that time, the curves start to be separated as a consequence of the chemotherapy treatments e ects.In detail, in CTGC-A, CTGC-B and CTGC-F, the tumors remained stable during the treatment time window, while in CTGC-C and CTGC-D a clear increment, followed by a reduction of the tumor mass, was appreciable.This variable distribution of responses can be explained, again, with a high degree of intra-tumor heterogeneity in the original tumor from which the di erent PDX lines were propagated; indeed, genetic deviation is expected to in uence response to therapy (Schmitt et al., 2016).

| Application of CONNECTOR to PDX models of metastatic colorectal cancer
To show the full potential of CONNECTOR in a complex use-case, we analyzed a large dataset of tumor growth curves of patient-derived xenografts (PDXs) from mice treated with cetuximab, an anti-EGFR antibody that is approved for clinical use in patients with RAS/RAF wildtype metastatic colorectal cancer (mCRC).Results from these xenotrials were obtained from a continuously expanding collection of ~400 mCRC PDXs, part of which had been used in previous studies (Bertotti et al., 2011(Bertotti et al., , 2015;;Zanella et al., 2015;Isella et al., 2017;Lupo et al., 2020).tumor volumes were measured weekly after tumor implantation and over the course of treatment.To obtain comparable data on tumor regression or growth under treatment, factoring away di erent engraftment e ciencies, therapy was started when the average tumor volume in the cohort expanded from an individual PDX reached 400 mm 3 and was administered for three weeks.
In previous projects we categorized PDX response to cetuximab based on parameters that are loosely inspired to the RECIST clinical criteria (Ko et al., 2021;Bertotti et al., 2011).This classi cation includes three classes, based on the average tumor volume variation at endpoint compared with tumor volume at baseline (the day before treatment initiation): partial responses (PR) are de ned as tumors that regress by 50% or more during treatment; progressive diseases (PD) are de ned as tumors that increase their volume by 35% or more, despite treatment; tumor volume changes above PR or below PD thresholds are de ned as stable diseases (SD).To allow a direct comparison of CONNECTOR's clustering results with our historical annotation, we decided to analyze, for each tumor, the available measurements between the day before treatment initiation and the following 3 weeks.The selected dataset was extracted from the laboratory LIMS (Baralis et al., 2012)  We studied how the PDX growth curves are distributed intra-parameter setting, among the number of CTGCs selected for each run, and inter-parameter setting, exploring how the curves move among the CTGCs along the three runs (Figure 3A).We observed that by increasing the number of CTGCs, the curves that were separated were mainly those characterized by a sudden volume increase.Based on such analysis, the dataset was optimally described by three primary classes, namely CTGC-A, -B and -C, leading to an average stability matrix of 1.We then assessed whether a higher resolution could be achieved by segregating the larger CTGCs into subclasses.We thus processed again, as independent datasets, the CTGCs composed by more than 200 curves (namely CTGC-A and -B, see Figure S3B  By calculating the Shannon index representing the consistency of the curves obtained from the same model, we observed that in most cases there is high agreement among the CTGCs assigned to di erent curves of the same parental tumor (mean = 1.54, s.d.= 0.52).For the sake of clarity, an additional graphical visualization, based on the t-distributed stochastic neighbour embedding (t-SNE) is proposed.The plot is included in Figure 3C, where distinctly isolated CTGCs can be appreciated, in good agreement with the CONNECTOR CTGCs.The analysis of CTGC associations with the transcriptional subtypes identi ed by a PDX-based, cancer cellintrinsic (CRIS) classi er (Isella et al., 2017)  of keratin pearls in one of them.Indeed, after unblinding, three of the selected keratin-high tumors were found to belong to the latter group, con rming that CTGC-B gene expression traits were likely associated with the acquisition of histological traits typical of squamous epithelia.

| CONNECTOR clusters reveal a subgroup of non-responder tumors that express high levels of strati ed epithelium keratins
DEG analyses comparing keratin-high and keratinlow tumors con rmed the GO functional enrichments previously observed in CTGC-B tumors (see Table S3 and Figure S6).Furthermore, novel associations to terms related to cell motility, wound healing and angiogenesis emerged.This may indicate that the keratinhigh subset is endowed with more aggressive and invasive properties with respect to keratin-low tumors.
Interestingly, we found a robust enrichment for CRIS-B ( sher p value 2.09 ⇥ 10 6 ) among keratin-high tumors.CRIS-B was previously reported as an aggressive transcriptional subtype associated with poor prognosis (Isella et al., 2017) and unde ned etiology.Our data may indicate that the CRIS-B phenotype is sustained by an aberrant di erentiation pattern towards epithelial cornication.In agreement, some of the keratins typical of CTGC-B (KRT17, KRT6A, KRT80) are also marker genes for CRIS-B, further reinforcing the link between the two classes.
The keratinocytic transdi erentiation of CTGC-B tumors may be driven by activation of a stemness-related pathway sustained by the transcription factor HOPX, which is indeed upregulated in our keratin-high subgroup (Log fold change 3.1, adjusted p value 4.23 ⇥ 10 6 , Figure 4C).HOPX is a known regulator of di erentiation lineages (Liu and Zhang, 2020) in various tissues, and sustains the reservoir of intestinal stem cells that can originate all intestinal epithelial compartments (Takeda et al., 2011).Although the role of HOPX in CRC is controversial (Yamashita et al., 2013;Dmitrieva-Posocco et al., 2022), the assumption that its activation could contribute to the progression of a subset of aggressive tumors is in agreement with the observation that high expression of HOPX is signi cantly associated with bad prognosis in CRC (overall survival Log Rank p value 8.6 ⇥ 10 3 in the TCGA colorectal cohort, Figure4D and (Liu and Zhang, 2020)).
The connection between CRC corni cation and HOPX activity is supported by the role of HOPX in the epidermis, where HOPX-positive precursors give rise to a set of keratin 6-positive inner bulge cells of the hair follicle (Takeda et al., 2013).This may explain why HOPX activation, besides sustaining the progression of aggressive and invasive tumors, also induces an aberrant transdi erentiation towards keratinized lining epithelium, which ultimately associates with the tumor growth pattern typical of class B CTGC.

| DISCUSSION
The main goal of this study was to develop a tool for exploring longitudinal data with an unsupervised approach.
Hence, we worked on a tool for aggregating curves into classes that could suggest directions and hypotheses for the in-depth examination of di erent datasets.
We reviewed the current literature on FDA and chose to build CONNECTOR based on the functional clustering method proposed in (James and Sugar, 2003).Indeed, depending on the sampling strategy (high frequency or sparse, regular or irregular), di erent methods have been proved e cient to the clustering task.
We were interested in curves that are observed at  et al., 2016;Schwarz et al., 2015;Castellarin et al., 2013), was reported as a poor-prognosis tumor subgroup, composed of highly invasive tumors that are resistant to currently available therapeutic options (Isella et al., 2017).
Through CONNECTOR, we identi ed a previously unnoticed characteristic of such tumors, which mayat least partially -explain their aggressiveness and, at the same time, their keratinized phenotype.A literature search put forward the transcription factor HOPX as a potential regulator of the transdi erentiation process experienced by the keratin-high subgroup of cetuximabresistant tumors (Takeda et al., 2013).HOPX is involved in the modulation of stem cell renewal in both the intestine and the epidermis (Mariotto et al., 2016).In the epidermis, HOPX is responsible for the production of keratin 6-positive cells, which contribute to the post-injury regeneration of corni ed epithelia by entering proliferation at the wound edge (Moll et al., 2008).
HOPX proved to be strongly upregulated in keratinhigh, CTGC-B CRC tumors, and we provide evidence that HOPX expression correlates with poor prognosis in CRC patients.On this ground, we speculate that the same program that regulates the regeneration of keratinized epithelia may be aberrantly activated in CRC by HOPX overexpression, thus committing cancer cells towards an aggressive phenotype with traits of epithelial corni cation.In this regard, it is worth noting that high expression of keratin 80, one of the markers of the keratin-high subgroup, has been associated with poor prognosis in CRC and with CRIS-B speci c phenotypic traits, such as epithelial to mesenchymal transition and cell invasion (Li et al., 2018).This nding may have substantial implications.Indeed, if functionally validated, the notion that HOPX is a driver for the maintenance of keratin-high tumors with keratinocytic metaplastic differentiation may pave the way for the development of new therapeutic approaches aimed at interfering with HOPX function.We also note that keratinization did not stand out as a predominant characteristic of CRIS-B in the original study (Isella et al., 2017), possibly because the phenotype is speci cally associated with the subpopulation of CRIS-B tumors that are resistant to anti-EGFR therapy.This makes CONNECTOR a versatile tool for discerning relevant trends from longitudinal data, which are harder to detect using more classical sources of molecular and phenotypic annotation.
Preclinical cohort studies involving the collection of longitudinal high-dimensional data are being increasingly conducted to evaluate drug activities and to explore disease evolution, and results from such e orts show potential to be translated into experimental trials and clinical practice.Disease monitoring over time is also an already well-established method that is routinely incorporated in several clinical trials.The CONNECTOR framework was designed and implemented as a userfriendly tool to streamline gathering this kind of information, while providing hints to increase interpretability and molecular accuracy.

| Functional Data Analysis
Functional Clustering Model.The functional clustering method implemented in CONNECTOR is based on the functional clustering model presented in (James and Sugar, 2003).Let us denote as g i (t ) the curve of the i th selected individual.In practice, we observe g i (t ) with measurement errors and only at few discrete time points.Let Y i be the vector of observed values of g i (t ) Then we have where g i and " i are the vector of true values and measurement errors at time grid, respectively.As there are only nite number of observations, individual curves are modeled using basis functions, in particular cubic splines.Let where s(t ) is a p dimensional spline basis vector and ⌘ i is a vector of spline coe cients.The ⌘ i 's are treated with a random-e ects model.In particular, the spline coe cients are modeled using a Gaussian distribution, where z i denotes the unknown cluster membership.Let G denote the true number of clusters.Cluster means are furthermore rewritten as where 0 and ↵ k are p and h dimensional vectors, ⇤ is a (p, h) matrix and h  min(p, G 1).Thus, h represents the dimension of the mean space, allowing a further lower-dimensional representation of the curves with means in a restricted subspace (for h < G 1).
With this formulation, the functional clustering model can be written as where S i = (s(t i 1 ), . . ., s(t in i )) T is the spline basis matrix for the i th curve.There are many possible forms for R and .For now, we use R = 2 I and a common for all clusters, as we are interested in sparse datasets, for which the smallest number of parameters is advisable.
The model is tted following (James and Sugar, 2003) and all the estimated parameters and the predicted cluster membership are returned.
Model Selection.Two free parameters have to be properly chosen before tting: the dimension of the spline basis p and the number of clusters to t G .
The dimension of the spline basis can be chosen as the one corresponding to the largest cross-validated likelihood, as proposed in (James et al., 2000).CONNEC-TOR uses a ten-fold crossvalidation, which involves splitting data into 10 roughly equal-sized parts, tting the model to 9 parts and calculating the loglikelihood on the excluded part.Repeat 10 times and combine loglikelihoods.Notice that, the resulting plot of the mean tested likelihoods versus the dimension of the basis, should be treated as a guide rather than an absolute rule, keeping in mind that working with sparse data pushes to spare parameters.Moreover, as the position of the knots depends on their number, CONNECTOR returns a plot with this information as the parameter p varies.
The number of clusters G must be chosen.CONNEC-TOR provides two di erent plots to properly guide in one of the most di cult problems in cluster analyzis.As in the nite-dimensional case, where data are points in place of curves, we need some proximity measures to validate and compare results of a clustering procedure.
We chose to follow (Ferraty and Vieu, 2006) and rely on the parameterized family of semi-metrics between curves de ned as where f and g are two curves and f (q ) and g (q ) are their q th derivatives.Note that for q = 0, eq. ( 3) is the distance induced by the classical L 2 norm.It turns out that D q can be reliably calculated in our setting where we are interested in proximity measures between each curve in a cluster and the centre-curve of the cluster (tightness of the cluster), as well as proximity measures between each centre-curve and the centre-curves of di erent clusters (separateness of clusters).Hence we may have f being the estimated i th curve and g being the estimated mean curve of cluster k , or f and g being both mean curves.In any case, D q can be calculated taking advantage of the spline representation of the estimated curves and mean curves, see eq. ( 1).Indeed, the computation of successive derivatives, which is numerically sensitive, can be performed by di erentiating their analytic form.It remains to compute the integral which can be done by numerical method.As the proper calculation of D q is a basic condition for the tools shown below, all details are illustrated in the Supporting Information S1.
We de ne a rst quantity to infer the appropriateness of the data partition, which we call total tightness.

It is the dispersion measure given by
where ĝi is the estimated i -th curve given in eq. ( S2) and ḡ k is the center of k -th cluster given in eq.(S1), see Supporting Information S2.As the number of clusters increases, the total tightness decreases to zero, the value which is attained when the number of tted clusters equals the number of sampled curves.In this limiting case, any k -th cluster mean curve coincides with an estimated curve and D 0 ( ĝi , ḡ k ) = 0 for any i and k .A proper number of clusters can be inferred as large enough to let the total tightness drop down to relatively little values but as the smallest over which the total tightness does not decrease substantially.Hence, we look for the location of an "elbow" in the plot of the total tightness against the number of clusters.
We de ne a second index, which is a cluster separation measure.Following (Davies and Bouldin, 1979), we extend the well known Davies-Bouldin (DB) index to the functional setting.Let us call the new index functional DB (fDB).It is de ned as follows where, for each cluster k and k 0 with G k the number of curves in the k -th cluster.The signi cance of eq. ( 5) remains unchanged with respect to the nite-dimensional case.It is the average of the similarity measures of each cluster with its most similar cluster.The "best" choice of clusters, then, will be that which minimizes this average similarity.It should be noted that M k 0 k is the distance between centroids The observation of the fDB violin plots, the total tightness violin plots together with the stability matrix enables the user to properly set the free parameter G , which represents the number of clusters to be tted.
Notice that the functional clustering method allows for a lower dimensional representation of the curves through the parameter h.This reduction is often needed as data could be not enough to estimate the large number of parameter of the functional clustering model.
CONNECTOR optimizes the choice of the parameter h returning the largest value for which the estimation of the parameters is successful with a reasonable, set by the user, frequency.Hence the value of h is not chosen directly, but returned by CONNECTOR.
Functional Clustering Tools.In (James and Sugar, 2003) three important tools to analyze the clustering are presented: the discriminant plot, the discriminant function and the estimation of the entire curve for each single subject.
The discriminant plot is a low dimensional plot of the curves dataset.It helps to visualize the clusters, as each curve is projected in a low dimensional space so that it can be plotted as points.In particular, each curve is represented by its projection onto the h-dimensional space spanned by the means µ k .
The discriminant functions are plots of the weights ⇤ T S T ⌃ 1 , versus time, to apply to each dimension for determining cluster membership.The term S ⇤ is a measure of average separation between clusters and ⌃ is a measure of their variability.There will be h discriminant functions and each curve shows the times with higher discriminatory power, which are the times corresponding to largest absolute (positive or negative) values on the y-axis.
The functional clustering procedure predict unobserved portions of the true curves for each subject.The estimated curves are returned by CONNECTOR and plotted with con dence intervals as well.In Supporting Information S3 is reported an exhaustive review on the performance of di erent procedures with respect to the results obtained by CONNECTOR.

| The CONNECTOR framework
The CONNECTOR software is composed of two main modules which cover the whole analysis , consisting of an R library, called CONNECTOR Package, and a docker image.Docker containerization is utilized to simplify the distribution, use and maintenance of the analysis tools; while the R library provides an easier user interface for which no knowledge of the docker commands is needed.
The framework pipeline is de ned by three necessary steps: 1. the data importing and processing to create the R object exploited through the entire analysis, 2. the model selection by identifying the optimal values of the two free parameters of the FCM approach (the basis spline dimension p and the number of clusters G ), 3. the inspection and visualization of the obtained clusters.
In this contest, the CONNECTOR Package provides the basic functions to deal with each step of the pipeline.Finally, a step-by-step guide for installing and utilizing the framework is reported at https://qbioturin.github.io/connector/.
To make accessible the basic CONNECTOR func-

| Transcriptional analyses
Di erentially expressed genes (DEGs) between di erent CTGC or keratin-high and -low groups were obtained using the R package DESeq2 (v1.26.0) (Love et al., 2014) with design: batch+cluster.Here batch is used to correct for sequencing batches and cluster to select samples belonging to the CTGC of interest (or keratinhigh and -low).We did not compare CTGC clusters for which less than 5 samples were available.Genes with more than 5 reads in only 1 sample were removed before testing for di erential expression, DEGs were identi ed using |LFC| >=0.5849625 and adjusted p-value < 0.05.Gene Ontology analyses for the upregulated or downregulated genes were performed with the R library ClusterPro ler (v3.14.3) (Wu et al., 2021;Yu et al., 2012).
The selection of a robust set of markers for the epithelial strati cation di erentiation phenotype was performed with two criteria: • being assigned to the GO keratinization (GO:0031424 (ebi, 2022)), the most signi cantly enriched term across all the CTGC-Ac vs other CTGC comparisons, via direct experimental evidence, using lters for taxon (Homo Sapiens) and evidence (manual assertion) on the EBI Quick GO web site.

AND
• being signi cantly upregulated either in Bb versus Ac or Bc versus Ac The resulting genes are: KRT80, KRT7, KRT86, KRT81, KRT83, KRT6B, KRT6A, KRT74, KRT79, KRT16.To clas-sify samples in high or low expressing keratin groups, quartiles of the expression level of the ten selected genes were calculated from progressive disease (PD) samples belonging to CTGC-Ac.Each sample was then assigned ten labels, according to its expression of those keratins with respect to CTGC-Ac rst and third quartiles -"high" if the expression is larger than the 3rd quartile and "low" if it is smaller than the 1st.Samples were then classi ed as "keratin-high" if they were assigned no low labels, and "keratin-low" if they were assigned no high labels.
Z transformed expression values in COAD/READ patients for HOPX were downloaded alongside clinical annotations from CBioPortal (Firehose legacy dataset), patients with multiple expression values or missing expression/survival data were ltered out.Survival analysis was performed on the remaining 371 patients using the libraries survival and survminer (Therneau, 2022;Kassambara, 2021), the optimal threshold for HOPX expression was determined using the maximally selected rank statistic approach (maxstat, (Hothorn, 2017)), correcting accordingly the resulting log rank p value.DEG and all the enrichment and survival analyses were performed with R version 3.6.3.

| Morphological analyses
Haematoxilin and Eosin was performed in xenografts from mice treated with vehicle (until tumors reached an average volume of 1500mm 3 ).Tumors were explanted, routinely processed and stained with H&E (Bio-optica).
Images were captured with the Leica LAS EZ software using a Leica DM LB microscope.

Naive Bayesian Classi cation
The naive Bayesian classi cation is a probabilistic classier based on the Bayes' theorem.It can be e ciently exploited to assign a cluster membership to each parental tumor (from now on referred as model) given the cluster membership of every curve of the PDXs derived from it.
Let (k i 1 , . . ., k i n i ) be the clusters assigned to the n i PDX curves belonging to the i th model.Thus, the classi er assigns the cluster ki 2 {1, . . ., G } to the i th model as: where is the frequency of having n k i curves of the i th model in the k th cluster.

Shannon Index
The Shannon index (Shannon, 1948) is an information statistic index that we used for quantifying the curve diversity in a speci c PDX model.It is computed as follows: de-convolution methodology able to t and cluster temporal data with an accuracy that highlights di erences in the dynamics.The CONNECTOR clusters re ected molecular features rooted in the biology of the systems studied.To the best of our knowledge, no other available tools support users without advanced expertise in the statistical analysis of complex temporal data in such an informative and interpretative manner as our proposed software.CONNECTOR was assessed using tumor growth data from PDX models and resulted in the generation of the CONNECTOR Tumor Growth Classes (CTGC).We rst analyzed PDX lines from 36 models generated from chemotherapy-naive high-grade serous epithelial ovarian cancer.CONNECTOR separated the PDX models based on the extent of intra-tumor heterogeneity, which was re ected in the variability of response of individual PDXs to three treatment regimens.Then, we deployed CONNECTOR to inspect the evolution of more than 1500 growth curves of PDXs from metastatic colorectal cancer treated with the clinically approved anti-EGFR monoclonal antibody cetuximab.The clusters generated by CONNECTOR allowed us to identify a subset of cetuximab-resistant tumors associated with previously unrecognized molecular and phenotypic features.
sis.The second plot visualizes the spline basis knots position for di erent values of the dimension.The knots divide the time domain into contiguous intervals, and the curves are tted with separate polynomials in each interval.Hence, this plot allows the user to visualize whether the knots properly split the time domain considering the distribution of the sampled observations.
Figure S1. Figure 2A shows the four CTGCs in which it and comprises measurements from 1563 individual mice, collected from 2012 to 2020 and representing 173 original engraftments from parental tumors in patients.The PDX curves were analyzed by CONNECTOR to achieve a widespread overview of the distribution of the individual PDX models after drug administration.When running CONNECTOR, based on the results from the model selection phase, the optimal value of the base spline was 4. To perform a through analysis we evaluated fDB and tightness for a broad range of clusters numbers, starting with three, that corresponds to the number of clinical response classes.We chose to compare the results obtained with 3, 4 and 5 clusters, as the fDB index worsen for larger values, see Figure S3A.
and Figure S3C, respectively).Through this, we obtained a nal number of 9 CTGCs -where CTGC-A and B are further split in Aa, Ab, Ac, Ad, Ae and Ba, Bb, Bc -which are represented in Figure 3B.We assigned each PDX model to a speci c CTGC, by means of a Naïve Bayesian classi cator, see Section 4.
Di erent CTGCs were evaluated with respect to our three-class cetuximab response annotation and the molecular characteristics of the classi ed tumors (Figure 4A).As expected, CTGCs were associated with RECISTlike response classes.Indeed, PR tumors were enriched in CTGC-Aa (Fisher p value 2.4 ⇥ 10 13 ) and SD tumors were enriched in CTGC-Ab (Fisher p value 8.0 ⇥ 10 8 ).

F
I G U R E 3 CONNECTOR results.A) CONNECTOR tumor growth classes for three di erent choices of the number of clusters: three, four and ve.The circos plots show the repositioning of the curves as the number of CTGCs changes.B) CONNECTOR tumor growth classes with twofold clustering.The nine boxes result from a rst run with number of clusters equal 3 followed by second runs on the CTGC-A (with 5 sub-classes) and on the CTGC-B (with 3 sub-classes).C) t-SNE visualization of the CTGCs induced on models.The color of the dots match the color of the assigned CTGC, represented in panel B. The dimension of the dots is inversely proportional to the Shannon Index calculated on the distribution of the curves of the same model across CTGCs (large dots -small entropy).
showed interactions between the growth patterns detected by CONNECTOR and speci c biological traits.In line with the observation that cetuximab-responsive tumors are enriched in the CRIS-C subtype(Isella et al., 2017), CRIS-C tumors were associated with the CTGC-Aa cluster.However, the enrichment of CRIS-C tumors within CTGC-Aa (Odds Ratio CRIS C/Aa 4.63) was stronger than that observed when considering all responders (Odds Ratio CRIS-C/PR 3.38).This suggests that CON-NECTOR is able to recognize the speci c subset of responders sustained by the CRIS-C phenotype.The nding that CONNECTOR clusters accurately captured de ned molecular features prompted us to investigate whether the diversi cation of cetuximabresistant cases in multiple CTGCs could relate to di erent biological substrates of resistance.To test this hypothesis, we rst assessed the distribution of the genetic variants that are known to cause resistance to cetuximab in colorectal tumors(Bertotti et al., 2015).As expected, all such variants were depleted from CTGC-Aa (Fisher p value 1.3 ⇥ 10 2 ) and, albeit to a nonsigni cant extent, from CTGC-Ab (Fisher p value 0.22); conversely, resistance variants were overall enriched in all the CTGC-B clusters and CTGC-C cluster.Importantly, we did not observe an enrichment for speci c resistance mutations in any of the PD-associated CTGCs, suggesting that the tumor growth patterns recognized by CONNECTOR in non-responders were not driven by speci c resistance genotypes.To further explore the functional characteristics of the di erent PD-enriched CTGCs, we mined a set of RNA-seq data that were generated in the context of an independent study performed in the same cohort(Perron et al, in preparation).Analysis of di erentially expressed genes (DEGs) between all PD-enriched CT-GCs and CTGC-Aa, used as a common reference, uncovered a very speci c GO functional enrichment related to strati ed epithelial di erentiation and keratinization, which was guided by keratin-encoding genes signi cantly upregulated in CTGC-Ba and CTGC-Bb, but not in CTGC-Ac (Figure4B, TableS1 and Figure S4E-F).This was not simply driven by the enrichment of non-responder tumors in CTGC-Ba and CTGC-Bb.Indeed, the same functional enrichments for CTGC-Ba and CTGC-Bb tumors versus CTGC-Ac tumors were observed also when limiting the analysis to PD tumors only (see TableS2and FigureS4A-B-C).Hence, CTGC unveiled two distinct phenotypes associated with cetuximab resistance, one of which (shared by CTGC-Ba and CTGC-Bb) is characterized by markers of keratinized strati ed epithelia.To validate this observation, we strati ed the full cohort of cetuximab-resistant PDXs for which RNAseq data were available (n = 140) based on the expression of 10 keratin genes that were robustly upregulated in CTGC-Ba or CTGC-Bb and associated to "keratinization" according to GO annotations (details in Section 4).By applying thresholds derived from expression quartiles in CTGC-Ac non-responsive tumors, we identi ed 42 keratin-high samples and 15 keratin-low samples.H&E sections of 4 keratin-low and 4 keratin-high PDXs were subjected to blinded histopathological evaluation, which con rmed the existence of two clearly distinct subpopulations (see FigureS5).The rst included tumors displaying histological features that resembled the glandular organization typical of the large intestine (presence of adenomorphic structures with evident luminal spaces, mild to moderate tumor cell pleomorphism, columnar cells with nuclear polarization).tumors pertaining to the second group were less di erentiated, with round or columnar pleomorphic cells that formed solid elds and rare luminal spaces.Intriguingly, in three of the latter four tumors we observed a variable degree of squamous di erentiation, with evidence sparse and irregular times, as usually occurs when data collection is laborious and time limits are constrained by ethical protocols.On these premises, and considering the results illustrated in(Jacques and Preda, 2014), we compared the best reported methods on sparse and irregularly sampled curves.The tests indicated that(James and Sugar, 2003) was the best performing method.The method has been studied to solve the main problems that arise when clustering sparse and irregularly sampled functional data: large number of missing observations on the discretised time grid (sparsity) and di erent covariances of the coe cients (curves are measured at di erent time points).Indeed, the method uses basis functions to project each curve onto a nitedimensional space and considers a random-e ect model for the coe cients.This allows to take advantage of all the sampled curves.The model is extremely exible and computes estimates, con dence intervals and prediction intervals for individual curves.The performances of the functional clustering model are strongly dependent on the choice of several free pa-rameters.CONNECTOR includes a toolset for choosing each free parameter appropriately.To this goal two new indexes have been introduced -the functional Davies and Bouldin (fDB) index and total tightness.The two indexes, together with the cross-loglikelihood plot, the visualization of the positions of the time knots and the stability matrix of the nal clustering, provide the user with all the information needed to choose suitable values of the free parameter.Oberg and co-workers(Oberg et al., 2021) have recently proposed a linear mixed e ects regression models for analysis of PDX repeated measures data.The model was able to t di erent treatment designs and randomization schemes.However, their model is limited in the prediction of the trajectories since only monomials up to cubic degree are considered to describe the time dependency, after log transform.Conversely, the core of the CONNECTOR package is a general mixed effect model where both xed-e ects term and randome ects term are considered on a cubic spline basis, which makes the model as exible as needed.We illustrated the versatility of CONNECTOR starting with the analysis of the spontaneous growth patterns in 21 PDX lines propagated from a single HGS-EOC tumor sample.We observed uneven growth rates in PDX lines derived from the same original tumor.This is consistent with the notion that ovarian cancers show a high degree of intratumor heterogeneity (McPherson results in the establishment of genetically di erent tumor entities during serial propagation.This diversi cation can also explain the varied responses to treatment observed in 15 PDX lines -again derived from the same original tumor -exposed to di erent chemotherapeutics.CONNECTOR allowed us to perform an in-depth study of the growth dynamics of a vast cohort of mCRC PDXs treated with cetuximab, with validation and discovery aspects.First, the reliability of the unsupervised and data-driven identi cation of CTGCs is supported by the observation that clusters were properly enriched in the three classes by which cetuximab response was previously annotated.Moreover, our analysis adds new insights into how cetuximab-resistant tumors can be strati ed according to their molecular diversi cation.Speci cally, CTGC-based categorization allowed us to identify a subset of cetuximab-resistant tumors (CTGC-B) with transcriptional and morphological features of metaplastic di erentiation towards corni ed epithelia (keratin-high tumors).CTGC-B tumors were also enriched for the CRIS-B transcriptional subtype, which

(
mean-curves) of k 0 -th and k -th cluster.It serves to weight the sum S k 0 +S k , which is the total standard deviation of the clusters: k 0 -th and k -th clusters dispersion is measured compared to their relative distance.CONNECTOR returns the violin plots of both the total tightness and the fDB index for a given repetition of runs and for di erent choices of the number of clusters G .To support this decision, CONNECTOR returns as well the consensus matrix for the most frequent clustering at each given number of clusters.The plot informs about the stability of the nal clustering across di erent runs.Indeed, each cell of the matrix is colored proportionally to the frequency of the two corresponding curves belonging to the same cluster across di erent runs.Hence the larger the frequencies are (which corresponds to warmer colors of the cells in the plot), the more stable is the nal clustering.Motivated by its meaning, we refer to such matrix as the stability matrix.
tionalities even to users without expertise in R, we developed a web application through the R package Shiny(RStudio, Inc, 2014), a powerful R library for developing interactive web apps straight from R. Speci cally, the function RunConnectorShiny o ers a fancy web application providing a basic-level interface to con gure and execute a data analysis with the CONNECTOR functionalities.Thus, users without experience in R language can directly focus on analyzing the results rather than spend their e orts setting up an R script.Therefore, the web application enables the user to run each function and go through the three steps of analysis with just a few clicks.All these functionalities make the CONNECTOR Package a very exible and easy tool.|Xenograft In Vivo TreatmentsAfter surgical removal from patients, each metastatic colorectal cancer specimen was fragmented and either frozen or prepared for implantation: cut in small pieces of which 2 fragments were implanted 2 mice.After engraftment and tumor mass formation, the tumors were passaged and expanded for 2 generations until production of 2 cohorts, each consisting of 12 mice.Tumor size was evaluated once-weekly by calliper measurements and the approximate volume of the mass was calculated using the formula 4/3⇡ (d /2) 2 D /2, where d is the minor tumor axis and D is the major tumor axis.PDXs derived from each original fragment were then randomized for treatment with placebo (6 mice) or cetuximab (6 mice); animals with established tumors, de ned as an average volume of 400mm 3 , were then treated with cetuximab (Merck, White House Station, NJ) 20 mg/kg/twiceweekly i.p.For assessing PDX models response to therapy, we used averaged volume measurements at 3 weeks after treatment normalized to the tumorgraft volume at the time of cetuximab treatment initiation.Tumors are then classi ed as follows: 1. "partial response" (PR): decrease of at least 50% in tumor volume 2. "progressive disease" (PD): increase of at least a 35% in tumor volume 3. "stable disease" (SD): the ones between 50% decrease and 35% increase All animal procedures were approved by the Ethical Commission of the Institute for Cancer Research and Treatment and by the Italian Ministry of Health.
where H i is the Shannon index for the i th model, G is the number of clusters, p k ,i is the proportion of PDX curves from the i th model and belonging to the k th cluster.Therefore, a larger value of H i corresponds to a larger diversity of the i th PDX model, i.e. many di erent clusters are assigned to the same model.Whether H i is equal to zero means that the curves of the i th PDX model belong to the same cluster.t-distributedstochastic neighbor embedding (t-SNE)t-SNE (Van der Maaten and Hinton, 2008) is a statistical method used to visualize high-dimensional data in a two or three-dimensional map exploiting a nonlinear dimensionality reduction technique.This approach has been applied to give a spatial visualization of the resulting CT-GCs of the PDS models in Figure3C.expressour thanks to Jude Fitzgibbon, Marco Ladetto and Simone Ferrero for the useful discussions, and to Jessica Giordano and Elena Rosso for the critical check of CONNECTOR.

) Di erential expression of genes in the "keratinization" GO in CTGCs: triple
F I G U R E 4 CONNECTOR clusters molecular annotation and transcriptomic analyses.A) Molecular and phenotypic characterization of the CONNECTOR clustered mCRC xenografts: each sample was annotated according to CRIS subtype, response to cetuximab and somatic alteration known to determine cetuximab resistance (KRAS, NRAS and BRAF single nucleotide variations; MET and ERBB2 ampli cation) or sensitivity (EGFR ampli cation).Bvolcano plot showing the magnitude of expression di erential (x axis, Log2 FoldChange) and signi cance (y axis, -Log10 adjusted p value) of "keratinization" genes when comparing CTGCs enriched in PD versus Aa.Only comparisons involving CTGCs enriched in PD with at least 9 total samples are reported for clarity.Bb and, to a smaller extent, Ba have many keratinization genes upregulated, but the same is not true for Ac.The top 5 upregulated genes in Bb are labeled (SPRR is a family of small proline rich proteins induced during di erentiation of keratinocytes).C) Expression levels of HOPX in keratin-high and keratin-low samples: the y axis shows DESeq2 corrected counts in logarithmic scale.D)