The unidimensionality of the five Brain Injury Rehabilitation Trust Personality Questionnaires (BIRT-PQs) may be improved: preliminary evidence from classical psychometrics

ABSTRACT Objective: To assess the internal construct validity (ICV) of the five Brain Injury Rehabilitation Trust Personality Questionnaires (BIRT-PQs) with Classical Test Theory methods. Methods: Multicenter cross-sectional study involving 11 Italian rehabilitation centers. BIRT-PQs were administered to patients with severe Acquired Brain Injury and their respective caregivers. ICV was assessed by means of an internal consistency analysis (ICA) and a Confirmatory Factor Analysis (CFA). Results: Data from 154 patients and their respective caregivers were pooled, giving a total sample of 308 subjects. Despite good overall values (alphas ranging from 0.811 to 0.937), the ICA revealed that several items within each scale did not contribute as expected to the total score. This result was confirmed by the CFA, which showed the misfit of the data to a unidimensional model (RMSEA ranging from 0.077 to 0.097). However, after accounting for the local dependency found within the data, the fit to a unidimensional model improved significantly (RMSEA ranging from 0.050 to 0.062). Conclusion: Despite some limitations, our analyses demonstrated the lack of ICV for the BIRT-PQ total scores. It is envisaged that a more comprehensive ICV analysis will be performed with Rasch analysis, aiming to improve the measurement properties and to reduce the administrative burden of each BIRT-PQ.


Background
Neurobehavioral changes, often defined as 'personality changes' (1), are frequent long-term sequelae of traumatic brain injury (TBI), occurring in 30% to 60% of cases (2,3). Frequently reported changes include reduced motivation, emotional regulation difficulties (explosive anger, irritability, aggression, and labile mood), difficulties maintaining social relationships, disinhibition, and impulsivity (4)(5)(6)(7). Some authors have termed these changes non-cognitive neurobehavioral sequelae (8), or neurobehavioral disability (5). The latter plays a central role in determining the overall outcome of patients with TBI, even more than cognitive impairments (9)(10)(11)(12). Furthermore, it has been shown that these non-cognitive neurobehavioral impairments are good predictors of caregivers' burden and of patients' quality of life in terms of independent living, development of intimate relationships, and acquisition or maintenance of employment, independently of the severity of the brain injury (13)(14)(15). Neurobehavioral disability has also been described in nontraumatic acquired brain injuries (nt-ABI). Indeed, several behavioral changes have been described among these latter patients (16), including disorders of social perceptiveness, self-control and/or emotional regulation, decreased ability to learn from social experience, difficulties with impulse control, development of sociopathic or borderline traits, loss of self-sense, childish behavior, disinhibited behavior, and poor social judgment (17). Thus, non-cognitive behavioral impairments may be considered a hallmark of ABI regardless of its etiology.
Given the clinical relevance of neurobehavioral disability, an adequate assessment of non-cognitive neurobehavioral impairments plays a central role in rehabilitation programs for patients with ABI, and such aspects always need to be carefully investigated. Nevertheless, despite the recognition of its importance, few valid and reliable measures are available for the assessment of non-cognitive neurobehavioral sequelae (18,19). Indeed, at least three criteria seem to be necessary for the effectiveness of such kinds of instruments as they should: (a) be specific for personality changes following ABI and not created for other types of pathologies, (b) cover a broad domain of functioning, and (c) present a robust internal validity. Several papers underline the limits of the measures commonly used in neurorehabilitation (20).
In a series of studies, Cattran and colleagues (8,21-23) introduced a specific set of scales designed to assess non-cognitive neurobehavioral impairments in patients with ABI, recently translated and cross-culturally adapted for the Italian population (24). The five Brain Injury Rehabilitation Trust personality questionnaires (henceforth BIRT-PQs) consist of five separate scales (totaling 150 items) assessing the following areas: motivation, emotional regulation, social cognition, disinhibition, and impulsivity. As each questionnaire is available in a patient self-administered form and a caregiver-rated form, the comparison of these two parallel forms may offer an indirect measure of the patient's awareness of the problem. Selectivity for the ABI population, adequate coverage of the spectrum of neurobehavioral impairments, and the availability of two different points of view (self and relative) on the same problem are clinical advantages that make the BIRT-PQs attractive for clinicians.
In general terms, the measurement of a latent construct by a scale or questionnaire (e.g. the personality changes as measured by the five BIRT-PQs) is entirely based on the assumption of unidimensionality, by which all items in a scale should contribute to measuring a single underlying construct. Unidimensionality contributes to the so-called internal construct validity (ICV) of a scale (25). Only upon demonstration of adequate ICV would it be legitimate to sum together the item scores to generate the total score of the scale.
As the ICV of the five BIRT-PQs has not been assessed so far, this study aims to evaluate the internal consistency, the unidimensionality, and the respondent burden of the BIRT-PQs under the Classical Test Theory (CTT) framework.

Study design and setting
This was a multicenter cross-sectional study carried out from April 2016 to December 2017 in eleven Italian rehabilitation centers with expertise in care and management of patients with ABI.

Subjects
Participants were patients and their respective caregivers, who were consecutively enrolled within the participating centers. The following inclusion criteria were employed for enrollment: a diagnosis of severe ABI (s-ABI), characterized at the onset by lack of consciousness, as defined by an initial Glasgow Coma Scale score ≤8 (26), lasting more than 24 h; age between 18 and 70 years; a Level of Cognitive Functioning (LCF) (27) score ≥7 at the time of enrollment; a premorbid Modified Barthel Index (MBI) (28) score equal to 100; attending the outpatient clinic after 6 months but not later than 6 years since inpatient discharge from the same center; availability of a caregiver willing to participate. Exclusion criteria were: aphasia severe enough to impair the ability to read or to understand spoken language; previous history of any neurological or psychiatric disorder; being domiciled in long-term care, a nursing home, or another residential care facility; inability to give informed written consent.
The Local Ethical Committees of the participating centers approved the study, which was carried out under the principles outlined in the Helsinki declaration (29). Participants and their respective caregivers gave their written informed consent to take part in the study.

Outcome measures
The primary outcome measures were the BIRT-PQs (21-23), which include five independent scales assessing several dimensions of personality that might be altered after an ABI. The scales are motivation (BMQ, 34 items), regulation of emotions (BREQ, 32 items), social cognition (BSCQ, 28 items), disinhibition (BDQ, 24 items), and impulsivity (BIQ, 32 items). Two versions of each questionnaire are available: a patient version, where the respondent is asked to answer questions regarding aspects of his/her own personality, and a caregiver-rated version, which is identical to the former but is self-administered by a caregiver, and in which the questions regard the personality of the person with ABI he/she cares for. Each item is scored using a Likert format with 4 response options ranging from 'always' to 'never' (item score range: 1 to 4). Most of the items are worded such that the higher the score, the more frequent the related aspect of personality disturbance. However, to reduce acquiescence, some items are reversed in meaning and scoring, so that the higher the score, the less frequent the personality change. The total score of each scale is obtained by adding up the scores of the pertaining items, so that the higher the total score, the greater the degree of personality disturbance. As each scale has a different number of items, the total score range varies by questionnaire: BMQ ranges from 34 to 136 points, BREQ and BIQ from 32 to 128, BSCQ from 28 to 112, and BDQ from 24 to 96. Within this study, the Italian version of the BIRT-PQs was administered (24).
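The scoring rule described above can be sketched as follows. This is only an illustration under stated assumptions: the positions of the reverse-worded items are not given in the text, so `reversed_items` is a hypothetical argument, and the function names are ours rather than part of the BIRT-PQ manual.

```python
# Number of items per scale, as reported in the text.
ITEMS_PER_SCALE = {"BMQ": 34, "BREQ": 32, "BSCQ": 28, "BDQ": 24, "BIQ": 32}
MIN_ITEM_SCORE, MAX_ITEM_SCORE = 1, 4


def score_range(n_items):
    """Lowest and highest possible total score for a scale with n_items items."""
    return n_items * MIN_ITEM_SCORE, n_items * MAX_ITEM_SCORE


def total_score(responses, reversed_items=()):
    """Sum the item scores after flipping the reverse-worded items.

    responses: raw item scores (1 to 4) in questionnaire order.
    reversed_items: zero-based positions of the reverse-worded items
                    (hypothetical; the real positions are not stated here).
    """
    flipped = set(reversed_items)
    total = 0
    for position, raw in enumerate(responses):
        if not MIN_ITEM_SCORE <= raw <= MAX_ITEM_SCORE:
            raise ValueError(f"item {position} has an out-of-range score: {raw}")
        # For reversed items, 1 <-> 4 and 2 <-> 3.
        total += (MIN_ITEM_SCORE + MAX_ITEM_SCORE - raw) if position in flipped else raw
    return total


# Possible total score range of each scale, derived from its item count.
RANGES = {scale: score_range(n) for scale, n in ITEMS_PER_SCALE.items()}
```

Deriving the ranges from the item counts reproduces the figures quoted in the text (for instance 34 to 136 for the BMQ) and confirms that the five scales together contain 150 items.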
Also, the following scales and questionnaires were administered for sample description purposes:
• Disability Rating Scale (DRS) (30). This scale tracks the recovery of functioning after severe head trauma from 'coma to community'. It includes physiologic measures from the Glasgow Coma Scale (GCS), measures of cognitive ability to perform self-care activities, measures of dependence on others, and measures of employability.
• Satisfaction Profile (SAT-P). It assesses the patient's subjective satisfaction within different aspects of functioning (31), including psychological functioning (10 items), physical functioning (9 items), work (5 items), basic needs and free time (5 items), and social functioning (3 items). Respondents are asked to place a mark on a 10 cm horizontal line with the extremes semantically defined (left: completely unsatisfied; right: completely satisfied). As the total score of each subscale is computed as the mean of its item scores, it ranges from 0 to 100 (indicating, respectively, the worst and the best satisfaction).
• Frontal Behavioral Inventory (FBI). It quantifies the personality and behavior changes in persons with dementia from the perspective of caregivers (32). Within the questionnaire, both negative (12 items) and positive behaviors (12 items) are assessed. The total scores of both subscales range from 0 to 36, where higher scores are indicative of a higher level of personality and behavioral change. In this study, the Italian version was employed (33).
• Caregiver Burden Inventory (CBI). It assesses the burden of care for caregivers of dementia patients (34) by exploring five main dimensions: time-dependence, developmental, physical, social, and emotional burden. Each dimension is assessed by five items, except for the physical burden, which has only four items. Each item is scored between 0 (not at all descriptive) and 4 (very descriptive), where higher scores indicate a greater caregiver burden. Therefore, the total scores of all dimensions range from 0 to 20, except for the physical burden, whose total score ranges from 0 to 16.

Procedures
The demographic and clinical characteristics of the patients and their respective caregivers were collected. All subjects (patients and caregivers) were requested to complete the five BIRT-PQs independently. Also, patients were asked to complete the SAT-P, whereas caregivers were required to complete the FBI and CBI. All administration procedures were shared and standardized across the participating centers.

Statistical analysis
Descriptive sample and scale statistics
Descriptive statistics were used to report all the collected variables. Mean ± standard deviation (SD), median with 10th and 90th percentiles, and absolute frequency with percentage were calculated for interval, ordinal, and nominal variables, respectively (35). Ceiling and floor effects were defined as the occurrence of the highest or the lowest possible score for each scale, respectively, for more than 15% of the subjects in the sample (36).
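A minimal sketch of the floor and ceiling criterion defined above is given below. The function name and the returned keys are our own choices, and the example scores in the usage note are invented for illustration only.

```python
def floor_ceiling_effects(total_scores, min_score, max_score, threshold=0.15):
    """Flag floor and ceiling effects for one scale.

    An effect is flagged when more than `threshold` (15% by default) of the
    respondents obtain the lowest or the highest possible total score.
    """
    if not total_scores:
        raise ValueError("at least one total score is required")
    n = len(total_scores)
    floor_share = sum(1 for s in total_scores if s == min_score) / n
    ceiling_share = sum(1 for s in total_scores if s == max_score) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }
```

For example, calling `floor_ceiling_effects(scores, 24, 96)` on a list of BDQ total scores would flag a floor effect only if more than 15% of the respondents scored exactly 24.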

Internal consistency analysis
The internal consistency of each of the five BIRT-PQs was analyzed on a pooled sample including data from the patients and their respective caregivers. In particular, it was assessed by calculating:
• Cronbach's alpha (37) for each BIRT-PQ total score, where values between 0.70 and 0.95 are recommended (38);
• the average inter-item correlation, i.e. the mean of the Spearman's correlation coefficients (39) between each pair of items; values ≥0.2 were considered acceptable (40);
• Cronbach's alpha with item deleted, where the alpha was recalculated after removing each item in turn; values below the total Cronbach's alpha are expected (41);
• the item-to-total correlations, which are the nonparametric correlations (based on Spearman's ϱ) between each item and its restscore (i.e. the total score minus the item score); values ≥0.40 were considered satisfactory (40).
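The four indicators listed above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the software used in the study: it expects complete data (no missing responses) and items with non-zero variance, and all function names are ours.

```python
from itertools import combinations
from statistics import mean, variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (complete data)."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for i in order[start:end + 1]:
            ranks[i] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def average_inter_item_correlation(rows):
    columns = list(zip(*rows))
    return mean(spearman(a, b) for a, b in combinations(columns, 2))


def item_analysis(rows):
    """Item-to-restscore correlation and alpha-if-deleted for every item."""
    report = []
    for j in range(len(rows[0])):
        item = [row[j] for row in rows]
        rest = [sum(row) - row[j] for row in rows]
        without_item = [list(row[:j]) + list(row[j + 1:]) for row in rows]
        report.append({
            "item": j,
            "item_rest_rho": spearman(item, rest),
            "alpha_if_deleted": cronbach_alpha(without_item),
        })
    return report
```

An item would be regarded as problematic under the criteria above when its `item_rest_rho` is below 0.40 or when its `alpha_if_deleted` exceeds the alpha of the full item set.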

Analysis of unidimensionality: confirmatory factor analysis
A confirmatory factor analysis (CFA), here based on polychoric correlations to fit ordinal data, was undertaken to assess the unidimensionality of the BIRT-PQs. CFA has been described in detail elsewhere (35,42,43). Within this analysis, the assessment of model fit was performed using the following indicators (44):
• The model chi-square (χ²), an overall indicator of model fit, which is a measure of the discrepancy between the covariance matrices of the model and of the sample (42). For a well-fitting model, the χ² probability value is not significant, although this statistic is sensitive to large sample sizes.
• The Root Mean Square Error of Approximation (RMSEA), which is a measure of the discrepancy between the covariance matrix predicted by the model and the population covariance matrix, if it were available. In other words, the RMSEA is an estimate of how well the model fits the observed data. It is considered one of the most informative indices of model fit (42) and, unlike the χ², it is less influenced by sample size. Values ≤0.06 are considered indicative of a good fit.
• The Standardized Root Mean Square Residual (SRMR), which is the average value across all residual values derived from the comparison between the predicted and the observed variance-covariance matrix (42). In practice, SRMR is the amount of error by which the model explains the correlations (42). Values ≤0.08 are typical of well-fitting models (45).
• The Comparative Fit Index (CFI) and the Tucker-Lewis Index (TLI). They both measure the proportionate improvement in the model fit by comparing the hypothesized model and the null model. However, while the CFI is normed (i.e. its values range between 0 and 1), the TLI is not (i.e. its values can extend beyond the range [0, 1]). Furthermore, the TLI includes a penalty for overly complex models. For both indicators, values ≥0.95 are considered indicative of a well-fitting model (45).
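To make the relation between the χ² and the derived indices concrete, the textbook formulas for RMSEA, CFI, and TLI can be sketched as below. These are the standard unscaled definitions and are offered as an assumption: the estimator used with polychoric correlations typically reports scaled or robust variants, and some programs use N rather than N - 1 in the RMSEA denominator, so the values printed by a given package may differ.

```python
from math import sqrt


def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation (N - 1 convention)."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative Fit Index, normed to the range [0, 1]."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, 0.0)
    denominator = max(d_model, d_null)
    return 1.0 if denominator == 0 else 1.0 - d_model / denominator


def tli(chi2_model, df_model, chi2_null, df_null):
    """Tucker-Lewis Index; not normed, so it can fall outside [0, 1]."""
    ratio_null = chi2_null / df_null
    ratio_model = chi2_model / df_model
    return (ratio_null - ratio_model) / (ratio_null - 1.0)


def meets_cutoffs(rmsea_value, srmr_value, cfi_value, tli_value):
    """Apply the cutoffs quoted in the text to a set of fit indices."""
    return {
        "rmsea": rmsea_value <= 0.06,
        "srmr": srmr_value <= 0.08,
        "cfi": cfi_value >= 0.95,
        "tli": tli_value >= 0.95,
    }
```

As a purely invented illustration, a model with χ² = 500 on 252 degrees of freedom in a sample of 308 subjects would pass the RMSEA cutoff, while its CFI and TLI against a null model with χ² = 5000 on 276 degrees of freedom would fall just short of 0.95. This shows how the indices can disagree for the same model.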
The first analytical step involved testing for multigroup invariance by respondent (patients vs. caregivers), following the approach outlined in Byrne (46). In particular, an initial separate model was established for each group. Should the initial baseline model fail to fit, we would assess the modification indices (MI) on pairs of items (42,47), which are indicators of model misspecification. In particular, high MIs suggest the presence of residual covariance between items (42). In other words, MIs indicate local dependency between items, where the response to one of the items within the pair is influenced by the response to the other item (48)(49)(50). In these instances, it is possible to re-specify the model after accounting for local dependency by allowing correlation of the error terms of the items in the pair (42,47). Following this, the model fit is reassessed. Once the two group-specific final models were established (one baseline model for patients and another one for caregivers), we would build a further multigroup baseline model called the 'configural model'. The latter model allows the same parameters that were estimated separately in the group-specific baseline models to be tested simultaneously for both groups, without imposing any invariance constraint. Following this, increasing levels of constraints would be imposed on the parameters of the configural model (involving, in order, factor loadings, observed variable intercepts, residual variances, and, finally, factor variances) to test a series of increasingly restrictive hypotheses about multigroup invariance. Should the latter approach fail, we would proceed to pool the data from patients and caregivers to collect evidence about the ICV of the BIRT-PQs, using MIs extensively to adjust for any model misspecifications.

Analysis of respondent burden
The respondent burden was estimated separately for each BIRT-PQ by recording the time needed to administer each questionnaire (51). Individual administration times (for both patients and caregivers) were recorded separately.
For CFA, we estimated that a sample size of 154 cases would yield a subjects-to-parameters ratio for the initial analyses between 4.5:1 and 6.4:1, values below the recommended ratio of 10:1 (52). On the other hand, for a sample of 308 subjects, the same ratio would be between 9.1:1 and 12.8:1, which would be adequate for these analyses (52).
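The ratios quoted above can be reproduced with simple arithmetic if the denominator is taken to be the number of items in the longest (BMQ, 34 items) and the shortest (BDQ, 24 items) scale. That reading is our assumption, since the text does not state how the parameters were counted.

```python
# Number of items per scale, as reported in the text.
ITEMS_PER_SCALE = {"BMQ": 34, "BREQ": 32, "BSCQ": 28, "BDQ": 24, "BIQ": 32}


def subjects_per_item(n_subjects):
    """Subjects-to-items ratio for each scale."""
    return {scale: n_subjects / n_items for scale, n_items in ITEMS_PER_SCALE.items()}


def ratio_range(n_subjects):
    """Lowest and highest ratio across the five scales, rounded to one decimal."""
    ratios = subjects_per_item(n_subjects).values()
    return round(min(ratios), 1), round(max(ratios), 1)
```

Under this assumption, `ratio_range(154)` and `ratio_range(308)` return the two pairs of values given in the text, with the lowest ratio belonging to the BMQ and the highest to the BDQ.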

Sample characteristics
One hundred and fifty-four patients and their respective caregivers were enrolled in this study. Thus, a pooled sample of 308 subjects was available for analysis. Detailed demographic and clinical characteristics of the sample are reported in Tables 1 and 2, respectively. The enrolled patients were, on average, 42 years old, and more than two-thirds of them (68.8%) were men. Almost 50% of them had been awarded a high school diploma, whereas only 16.9% of them had achieved a degree. Nearly half of them were unmarried (47.4%), whereas about two-fifths of them were married. For about 40% of them, the primary caregiver was the spouse/partner, whereas slightly less than two-fifths of them were cared for by their parents. One-half of the patients lived with their own family (50%), whereas about one-third of them lived with the family of origin. About two-thirds of the patients were full-time workers before the brain injury, whereas only one in five of them had returned to full-time work at the time of assessment, although only 42.9% of them had a severe or complete restriction of their occupational capacity. On the other hand, the caregivers were predominantly women (57.8%), they were on average 10 years older (52.1 years), and they had achieved, on average, a lower educational level (37.7% high school diploma), although 11% of the education data were missing.
Almost two-thirds of the patients had suffered a TBI (61% of cases), whereas cerebral hemorrhage (either intracerebral or subarachnoid) accounted for about one-fifth of the etiologies. The brain injury at onset was severe (median GCS = 5 points) with a prolonged disorder of consciousness (almost 3 weeks). The median hospital stay was 4.6 months, although an average length of stay of 6.3 months suggests that some hospital stays had been unusually prolonged. Despite the severity of the initial ABI, patients at discharge had achieved good cognitive and motor functioning (median LCF and MBI were, respectively, 8 and 100), although a Disability Rating Scale score of 3 suggested a 'partial' level of disability (30).
On average, the assessments took place more than 2 years after hospital discharge (range 9-61 months). No floor or ceiling effect was detected for any of the BIRT-PQs. Missing item data for each BIRT-PQ were minimal, ranging from 0.05% for BSCQ to 0.09% for BMQ and BIQ; no systematic pattern of missing item data could be identified in any scale. The median scores of the BIRT-PQs are reported separately for patients and caregivers in Table 2. The scores of all five scales were concentrated toward the lower end of the scale, with all median scores falling below the midpoint of their respective score ranges. Notably, the lowest levels of personality change were reported for the BREQ and the BSCQ, with median scores lying for both questionnaires in the first quartile of the score range for both patients (BREQ: 50; BSCQ: 46) and caregivers (BREQ: 51; BSCQ: 45). Higher median scores, lying in the second quartile of the score range, were reported instead for the BMQ, the BDQ and the BIQ, both by patients (BMQ: 63; BDQ: 42; BIQ: 56) and by caregivers (BMQ: 68; BDQ: 43; BIQ: 59). Caregivers rated, on average, a higher degree of personality change than the patients for the BMQ (5 points), BIQ (3 points), BREQ (1 point), and BDQ (1 point), whereas patients self-reported a higher degree of disturbance than their caregivers within the social cognition scale (1 point).
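The quartile statements above can be reproduced if 'quartile' is read as a quarter of the possible score range of each scale, with a score lying exactly on a boundary assigned to the upper quarter. Both points are our assumptions, chosen because they match the reported placement of the BDQ and BIQ patient medians; the function name is also ours.

```python
def range_quartile(score, min_score, max_score):
    """Quarter (1 to 4) of the possible score range in which a score falls.

    Boundary values are assigned to the upper quarter (an assumption).
    """
    if not min_score <= score <= max_score:
        raise ValueError("score outside the possible range")
    position = 4 * (score - min_score) / (max_score - min_score)
    return min(int(position) + 1, 4)


# Possible score ranges and patient medians, as reported in the text.
SCORE_RANGES = {"BMQ": (34, 136), "BREQ": (32, 128), "BSCQ": (28, 112),
                "BDQ": (24, 96), "BIQ": (32, 128)}
PATIENT_MEDIANS = {"BMQ": 63, "BREQ": 50, "BSCQ": 46, "BDQ": 42, "BIQ": 56}

PATIENT_QUARTILES = {
    scale: range_quartile(PATIENT_MEDIANS[scale], *SCORE_RANGES[scale])
    for scale in SCORE_RANGES
}
```

With these conventions, the BREQ and BSCQ medians fall in the first quarter of their ranges and the BMQ, BDQ, and BIQ medians in the second, in line with the description in the text.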
The levels of satisfaction with life reported by the patients, measured by the SAT-P (Table 2) According to their caregivers, the total scores of the negative and positive behavior subscales, as measured by the FBI, lay on average in the 2nd and the 1st quartile, respectively (negative behaviors: 10; positive behaviors: 4). The FBI total score, indicating the overall amount of personality disturbance, lay in the first quartile of the score distribution (14). All the total scores of the CBI subscales lay in the first quartile of the score distributions, near the upper limit of the quartile for the time dependence (4), social (4) and physical (3.5) subscales, whereas for the remaining two subscales, the total scores lay toward the quartile's lower limit (developmental = 2; emotional = 1). The overall caregiver burden, measured by the CBI total score, lay near the upper limit of the 1st quartile of the score distribution (19).

Internal consistency analysis
Cronbach's α values for the pooled patient and caregiver data were >0.9 for all scales (Table 3), except for BDQ (α = 0.811). Similar results were yielded by the analysis of the average inter-item correlations, which were ≥0.2, except for BDQ (0.177).
The alpha with item deleted suggested that the deletion of most items with item-to-total correlation coefficients<0.4 would lead to an increase of the Cronbach's alpha for the remaining item set (Table 3).

Confirmatory factor analysis
The summary results of the CFA are reported in Table 4. The initial CFA failed to support the ICV of all five BIRT-PQs, both for the patient and the caregiver versions (RMSEA ranging from 0.073 to 0.097, and from 0.078 to 0.095, respectively). Except for the BSCQ, the RMSEA values were lower for the patient version than for the caregiver version. High values of MIs were detected for all scales, both for patients and caregivers. After allowing correlation of the error terms of the locally dependent pairs of items within a scale, it was possible to achieve a better fit to a one-factor model, although the RMSEAs were lower than the recommended cutoff of 0.06 only for the patient's versions of BREQ and BDQ. For the patient's version of BIQ, the final model was un-identified. Furthermore, it was not possible to proceed further with the analysis of multigroup invariance because for all BIRT-PQs (both patient and caregiver versions), the configural models were un-identified too. As a consequence, it was decided to pool the patient and caregiver data together and to proceed with the analysis.
The pooled CFA base analyses failed to support the ICV of all five BIRT-PQs (RMSEA ranging from 0.077 to 0.097; SRMR ranging from 0.081 to 0.105). High values of MIs were detected for all scales, indicating local dependence between a substantial number of item pairs (23, 17, 26, 13, and 31 locally dependent pairs of items for BMQ, BREQ, BSCQ, BDQ, and BIQ, respectively). After accounting for all locally dependent pairs by allowing correlations of their error terms, the fit to a one-factor model improved for all scales, with RMSEA values at or below the cutoff value of 0.06 for BREQ, BDQ, and BIQ, and just above it for BMQ and BSCQ, and with SRMR ≤0.08 for all scales. CFI and TLI were adequate only for BREQ (0.962 and 0.958, respectively).

Analysis of respondent burden
The average administration time for each BIRT-PQ ranged from 5.7 to 7.5 min for the patient version and from 4.1 to 6.2 min for the caregiver version (Table 5). The average overall administration times of the five BIRT-PQs were 32 min (range: 23.5-40.5) and 23.8 min (range: 16.4-31.2) for the patient and caregiver versions, respectively.

Discussion
Within this paper, we undertook a CTT analysis of the internal construct validity of the five BIRT-PQs, here based on internal consistency analysis and CFA. Our results suggested that the total scores of the five BIRT-PQs lacked adequate internal construct validity. In particular, despite good overall internal consistency values for each scale, several items within each scale were problematic as they contributed much less than expected to the total score. This issue was also confirmed by the CFA, which showed the misfit of the data to a one-factor model, suggesting a lack of unidimensionality for the total scores. Furthermore, the baseline CFAs indicated that a source of misfit could be represented by the presence of local dependency within the data. Indeed, after accounting for this local dependency, the fit to a one-factor model improved significantly, although some un-modeled sources of misfit remained concealed within the data. It cannot be excluded that violations of multigroup invariance contributed to these unknown sources of misfit, as the latter could not be investigated because of the larger sample size needed for this kind of analysis. Finally, the study of respondent burden suggested that the BIRT-PQs are demanding instruments in terms of administration time, both for patients and caregivers. As we employed the same inclusion criteria and etiology characteristics as the previous BIRT-PQ studies (21,22), patients with mild or moderate brain injuries could not be included in the sample. Despite this, in comparison to the sample enrolled in the initial validation studies of the BIRT-PQs (21,22), our sample was twice as large (154 subjects vs. 72 subjects). Furthermore, as pooling patients and caregivers together further doubled the sample size, we were able to reach a total sample size of >300, which was sufficient to guarantee a subjects-to-items ratio between 8.8:1 and 12.5:1, values which were close to or above the recommended ratio of 10:1 for CFA (52).
Overall, as our patients' characteristics are consistent with the epidemiological data available on s-ABI regarding age, gender, and etiology (53), our sample is not only numerically appropriate, but it can also be considered representative of the population of patients with s-ABI.
Before this study, the psychometric performance of the BIRT-PQs had been assessed only in terms of classic reliability (test-retest reliability and internal consistency) and external construct validity (concurrent and predictive validity) (8,21,22). The evidence provided in these studies about the validity of the BIRT-PQs is limited. Firstly, the psychometric methods employed were based on parametric statistics (e.g. multiple regression), whereas the BIRT-PQs total scores, being ordinal, would have required the use of psychometric methods based on nonparametric statistics (25,54,55). Secondly, these methods compare the psychometric performance of the instrument's total score with an external indicator, which is supposedly considered a valid indicator of the same or a similar construct (25,56). The main shortcoming of these methods is that they assume the validity of the total score as a unidimensional indicator (i.e. they assume that all items within a scale contribute to measuring a single underlying construct), so that the item scores can be summed together to generate a total score. Unfortunately, this assumption has never been tested in detail so far.
On the other hand, the internal construct validity methods employed in this paper aim to establish 'internally' the validity of the total score by comparing its psychometric performance to that predicted by a measurement model (25,44). The only previously published analyses which could provide some information about the internal construct validity of the BIRT-PQs were those related to internal consistency, as the latter is considered a necessary, although not sufficient, pre-requisite for unidimensionality (40). These analyses reported high values for all BIRT-PQ scales, thus suggesting a high internal consistency for all five scales. In particular, in a sample of 72 subjects with ABI, Cronbach's alphas of 0.940 and 0.950 were reported for the self- and relative-rated BMQ, respectively (21). Cattran et al. (22) found a Cronbach's alpha of 0.960 for both the self- and relative-rated BREQ in a sample of 74 patients with ABI. Finally, Cattran et al. (8) found internal consistency values (respectively, for the self- and relative-rated version) of 0.915 and 0.949 for BSCQ, of 0.899 and 0.942 for BDQ, and of 0.924 and 0.952 for BIQ in a sample of 72 participants.
The internal consistency analysis reported in this paper showed Cronbach's alpha values systematically lower than those reported in the literature (BMQ: 0.930; BREQ: 0.937; BSCQ: 0.909; BDQ: 0.811; BIQ: 0.904). This finding can be explained considering that, as the alphas were calculated on a bigger sample size (n = 308), larger measurement errors could be expected. Notwithstanding this limitation, all of the Cronbach's alpha values suggested high internal consistency. However, as Cronbach's alpha values are known to increase with the number of items in the scale, the supplementary analyses performed here cast some further light on the internal consistency of the five scales. In particular, the average inter-item correlation analysis suggested that the internal consistency of BDQ was less satisfactory than indicated by the Cronbach's alpha, and that BREQ had a higher internal consistency (0.329) than BMQ (0.284), although based on a slightly smaller number of items. This finding can be explained considering that the average inter-item correlation is an internal consistency indicator which, unlike Cronbach's alpha, is not influenced by the number of items in the scale (40). Furthermore, despite the overall internal consistency values, our item-by-item analyses suggested that in each scale, there were some problematic items. This aspect was especially evident for BDQ and BIQ, where some low negative item-to-total correlations were detected, thus suggesting that these items did not contribute at all to the total scores. Furthermore, for all items with a correlation <0.4 with the restscore, deleting the item led to an increase of the Cronbach's alpha, thus suggesting that these items were less consistent than the majority of the other items and, therefore, contributed less to the total score.
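The dependence of alpha on scale length can be illustrated with the standardized form of Cronbach's alpha, which is a function of only the number of items k and the average inter-item correlation r: alpha = k·r / (1 + (k - 1)·r). The sketch below is an approximation under stated assumptions: the alphas in this paper are raw rather than standardized, and the average correlations are Spearman-based, so the computed values are not expected to match the reported ones exactly.

```python
def standardized_alpha(n_items, mean_r):
    """Standardized Cronbach's alpha from scale length and mean inter-item correlation."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


# Item counts and average inter-item correlations quoted in the text.
REPORTED = {"BMQ": (34, 0.284), "BREQ": (32, 0.329), "BDQ": (24, 0.177)}

APPROXIMATE_ALPHAS = {
    scale: standardized_alpha(k, r) for scale, (k, r) in REPORTED.items()
}
```

Under these assumptions the formula gives roughly 0.93 for BMQ, 0.94 for BREQ, and 0.84 for BDQ, close to the reported alphas. It also shows why a scale with an average inter-item correlation below 0.2 can still reach an alpha above 0.8: with the same correlation of 0.177, a 10-item scale would yield a value below 0.70.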
As internal consistency analyses cannot establish whether a scale is unidimensional (40), we performed a CFA to fit a one-factor model, to test specifically the assumption that the items in each scale were unidimensional. Therefore, we performed a non-parametric CFA based on polychoric correlations to fit ordinal data, as the assumptions of normality and linearity of item scores, which are required by parametric CFA (47), are unlikely to be met by ordinal data (25). Within CFA, the first analyses were performed on each scale separately for patients and caregivers to test the ICV assumption of invariance. Indeed, for valid measurements, it is necessary to demonstrate that the various components of the measurement models are equivalent (i.e. invariant) across particular sample subgroups, such as patients and caregivers (46). Unfortunately, we were unable to test this important ICV assumption, as the standard errors of the models' parameters could not be computed, leading to un-identified configural models. This issue could be explained considering that with a sample size of 154 subjects, the sample size-to-parameters ratio for the final analyses fell well below the recommended ratio of 10:1 (actually well below 5:1 for most scales, except for BDQ) (52). Thus, it is highly likely that the additional constraints imposed by the configural model reduced this ratio even further, thus leading to un-identified models.
The CFA on pooled data rejected the unidimensionality assumption. This finding was somewhat expected, given the results of the item-by-item internal consistency analyses. However, the base analyses showed high MI values within each scale (42,47), which are indicators of local dependency (48)(49)(50)(57). Indeed, local dependency frequently occurs in health outcome scales (56), although it is often unreported or inadequately addressed (56,58,59). Local dependency is often a source of model misfit, as it may be linked either to response dependency (the response to an item can be predicted on the basis of the response to a similar item) (48-50), or to multidimensionality, where some of the response variance is accounted for by an additional latent variable (59,60). Indeed, after accounting for local dependency by correlating the error terms of the item pairs with high MIs (42,47), a substantial improvement in fit to the one-factor CFA model was observed for all the scales, thus suggesting a remarkable improvement in the unidimensionality of each scale. Given the impossibility of testing the assumption of multigroup invariance by respondent, we cannot exclude that the suboptimal fit to a unidimensional model observed for most of the BIRT-PQs in the pooled sample could be related to un-modeled violations of this requirement within the data.
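As an illustration of how correlating error terms can change model fit, the following Python sketch computes the RMSEA from a model chi-square, its degrees of freedom, and the sample size, using a commonly cited point-estimate formula, sqrt(max(χ² − df, 0) / (df·(N − 1))). Some software uses N instead of N − 1, and estimators for ordinal data may apply scaling corrections, so this is a sketch under stated assumptions. The chi-square values below are hypothetical and are not results from this study.

```python
import math


def rmsea(chi_square, df, n):
    """RMSEA point estimate from chi-square, degrees of freedom, sample size."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


if __name__ == "__main__":
    n = 308  # pooled sample size reported in this paper

    # Hypothetical base one-factor model.
    base = rmsea(chi_square=900.0, df=350, n=n)

    # Hypothetical model after freeing 10 error covariances: each freed
    # covariance uses one degree of freedom and lowers the chi-square.
    modified = rmsea(chi_square=600.0, df=340, n=n)

    print(base, modified)
```

Each correlated error term costs one degree of freedom, so the RMSEA improves only when the drop in chi-square outweighs that cost.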
The analysis of respondent burden demonstrated that the BIRT-PQs were demanding in terms of administration time, both for patients and caregivers. This finding was somewhat expected, considering a total item set of 150 questions. The substantial length of the questionnaires not only limits the acceptability and feasibility of the BIRT-PQs in routine clinical settings (61), but is also a hindrance because substantially larger sample sizes are needed to perform more advanced and detailed psychometric analyses, such as the multigroup invariance CFA attempted in this paper. Considering the item redundancy uncovered by the CFA on the pooled data, we envisage that the next step will be a Rasch analysis conducted on the pooled data for each scale, aimed at reducing the item set. Within the Rasch analysis, it will be possible to test all ICV requirements (54), including subgroup invariance for a variety of factors, such as gender, age, etiology, and respondent (the so-called analysis of differential item functioning, or DIF). The advantage of testing group invariance within the context of Rasch analysis is that the DIF analysis is performed on pooled data, thus overcoming the sample size issues highlighted in this paper.
This study has some limitations that deserve discussion. Firstly, it should be noted that fit to the one-factor CFA model was entirely adequate only for the BREQ. Indeed, apart from multidimensionality and local dependency, it is likely that other unrecognized sources of misfit could not be modeled within the CFA performed in this study (54), such as the already mentioned violations of multigroup invariance. Another limitation is that, as all the psychometric methods employed here are sample-dependent (54,62), our results may not generalize to different samples and/or to people living in other countries.

Conclusion
Despite the above limitations, our analyses demonstrated the lack of internal construct validity for the original BIRT-PQ total scores. Consequently, the external validity analyses already published should be considered somewhat biased, as they rely on biased total scores. At the same time, the present analyses demonstrated that the unidimensionality of each BIRT-PQ not only can be improved, but is also sufficient to perform a Rasch analysis (44). The latter will be aimed at deleting misfitting and/or redundant items according to a logic that takes into account both the statistical constraints and the clinical knowledge of the construct being measured. Considering also that the administrative burden of the BIRT-PQs is substantial, creating shorter and more valid forms of the questionnaires is undoubtedly desirable for improving both the measurement properties of each BIRT-PQ and their acceptability and feasibility in routine clinical settings.