

Research Article | DOI: https://doi.org/10.31579/2835-8147/074

Problems of rating scales in health measurements

  • Satyendra Chakrabartty 1

1 Indian Statistical Institute, India.

*Corresponding Author: Satyendra Chakrabartty, Indian Statistical Institute, India.

Citation: Satyendra Chakrabartty (2024), Problems of rating scales in health measurements, Clinics in Nursing, 3(4); DOI: 10.31579/2835-8147/074

Copyright: © 2024 Satyendra Chakrabartty. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Received: 05 July 2024 | Accepted: 15 July 2024 | Published: 24 July 2024

Keywords: patient-reported scale; linear transformation; normal distribution; ability to detect changes; elasticity

Abstract

Background: Patient-reported outcomes (PROs) using multi-item rating scales are not comparable because the scales differ in features (length, width, scoring methods) and in the factors they measure.

Objectives: To discuss methodological limitations of PROs and to provide a method for converting ordinal item scores into scores that follow a normal distribution.

Method: Raw item scores are converted to equidistant scores (E), standardized to Z-scores ~ N(0,1), and the Z-scores are converted to proposed scores (P_i) in the range 1 to 100. Scale scores (P_Scale), defined as the sum of the P_i's, and battery scores (B-scores), defined as the sum of scale scores, follow normal distributions.

Results: P_Scale-scores and B-scores each satisfy the desired properties; they enable parametric analysis, comparison of status, and derivation of equivalent scores of two PROs, with implications for classification, and they allow reliability and validity to be assessed in a better fashion.

Conclusion: The suggested method, which improves the scoring of PRO instruments with the additional benefits of identifying poorly performing scales and assessing progress across time, is recommended.

 

Introduction:

Subjective self-reported measures of illness are often evaluated through rating scales to assess objective health [2]. Data resulting from such rating scales are categorical and at the ordinal level. A large number of clinical studies use patient-reported rating scales (PROs) to quantify clinical conditions such as intensity of disease, effects of disease or treatment, health status, quality of life (QoL), pain, sleep disorders, depression, anxiety, stress, and more, as part of the patient decision-making process. The MAPI Trust, a nonprofit organization, provides information on 3000+ patient rating scales (http://www.mapi-trust.org/about-the-trust).

PROs consist of a number of scales which vary in features such as number of items (scale length), number of levels (scale width), scoring methods, etc., and are therefore not comparable. Scale length, scale width, and the frequencies of the levels affect differential item functioning (DIF). Analysis of ordinal data emerging from PROs without satisfying the assumptions of the statistical techniques used may distort the results; [22] suggested prior checking of the measurement properties of PRO instruments.

 

Methods:

Self-reported rating scales consisting of multi-point items suffer from methodological limitations, including addition that is not meaningful. If addition is not meaningful, computations such as the standard deviation (SD), correlations, Cronbach's α, etc. are meaningless. Statistical analyses such as regression, principal component analysis (PCA), factor analysis (FA), and tests of equality of means by t-test or ANOVA assume normal distributions of the variables under study; questionnaire scores violate this assumption and may distort the results. Assigning equal importance to items and constituent scales in summative scoring of PROs is not justified, since the contributions of items or scales to the total battery score, the inter-item correlations, the scale-battery correlations, and the factor loadings all differ [25]. Mean, SD, and Cronbach's alpha tend to increase with the number of levels, which may influence the mean more than the underlying variable does [18]. There is no consensus on the number of levels per item in rating scales [5].

Studies evaluating the effect of selenium supplementation on stroke used different definitions of stroke, either as a categorical variable or as a variable on a ratio scale. While investigating the dose-response correlation between dietary selenium intake and stroke risk, [29,30] used the self-reported single question "Has a doctor ever told you that you had a stroke?" to define stroke; here, stroke was a categorical variable, not one on a ratio scale. [39] asked each participant whether a doctor had ever given a diagnosis of stroke (no, yes, unknown) and defined stroke as a self-reported physician diagnosis during follow-up; the follow-up time was the date of the first discovery of stroke. [28] included adults with ischemic stroke confirmed by neuroimaging during the last 72 hours, with a volume of at least one-third of the MCA territory. Different inclusion criteria for stroke and different analyses produced different relationships between selenium supplementation and stroke, and different conclusions.

The paper suggests a method for transforming the ordinal scores of the i-th item into normally distributed proposed scores (P_i-scores), facilitating meaningful addition, and for deriving the scale score (P_Scale) as the arithmetic aggregation of the P_i-scores, satisfying desired properties and enabling assessment of progress and parametric analysis.

Problems of rating scales:

If the distance between two successive response categories or levels of K-point items (K = 2, 3, 4, 5, …) is denoted by d_{j,j+1}, then d_{j,j+1} ≠ d_{j+1,j+2} for all j = 1, 2, 3, 4, …, i.e., the scores are not equidistant [27]. Thus, addition of ordinal item scores is not meaningful [15], and even comparisons of means such as X̄ > or < Ȳ may be misleading.

Generic or disease-specific multidimensional rating scales for QoL may not consider all relevant constructs. For example, the disease-specific stroke-adapted 30-item SIP version (SA-SIP30), with 8 subscales, excludes domains such as recreation, energy, pain, general health perceptions, overall quality of life, and stroke symptoms [11]. Multidimensional rating scales may even fail to give a global summary, as with the 36-Item Short Form Health Survey questionnaire (SF-36) (http://www.webcitation.org/6cfeefPkf).

A multidimensional scale covers a number of sub-scales/dimensions whose formats differ across sub-scales. For example, the SF-36 has 10 3-point items on physical functioning, 3 6-point items on energy/fatigue, 2 5-point items on social functioning, 6 6-point items on emotional well-being, 5 5-point items on general health, two items on pain (one 6-point and one 5-point), seven binary items, and one further item on reported health transition over the last year. This set-up implies (i) different distributions for binary, 3-point, 5-point, and 6-point items, (ii) higher mean and SD for the sub-class containing 6-point items, and (iii) different reliability and validity for different sub-classes [26]. Two distinct concepts measured by the SF-36 are the Physical Component Summary (PCS) and the Mental Component Summary (MCS). [35] found a paradoxical inverse relationship between PCS and MCS, which would imply that good physical condition presupposes poor mental health and vice versa. The SF-36 was negatively correlated with the Patient Health Questionnaire (PHQ) and the General Anxiety Disorder questionnaire (GAD-7), probably because they measure different factors [16].

Scoring methods of PROs differ. The dimension score of the MacNew Heart Disease Health-Related Quality of Life Questionnaire (MacNew) is based on the mean of the responses to the items belonging to the dimension, but the Cardiovascular Limitations and Symptoms Profile (CLASP) applies weights to find the total for each subscale. Each dimension of the Myocardial Infarction Dimensional Assessment Scale (MIDAS) is scored separately.

There is often no clear understanding of the factors being measured. Against the two factors proposed for the Hospital Anxiety and Depression Scale (HADS), the factor structure of the instrument was found to be three in a range of clinical populations [3], against a recommendation of the HADS as a one-dimensional measure [8] and statistical evidence for a three-factor structure [33]. Similarly, for the Psychological General Well-Being Index (PGWBI), [21] found a single construct of psychological well-being against the six underlying factors of the scale, raising questions about factor-analytic interpretation in the presence of local dependency.

Use of zero as an anchor value does not allow computation of expected values (value of the variable × probability of that value); it reduces the mean and SD of the scale and the item-total correlations, and affects regression or logistic regression, etc. If each respondent of a sub-group selects the level marked "0" for an item, then mean = variance = 0 for the sub-group and the correlation with that item is undefined. [34] found that more than 40% of patients scored zero in 10 subscales of the Sickness Impact Profile (SIP) and in one subclass of the SF-36. It is better to mark the anchor values as 1, 2, 3, and so on, keeping the convention that a higher score ⇔ a higher value of the variable being measured.

Higher scores in the Nottingham Health Profile (NHP) and the Minnesota Living with Heart Failure (MLHF) questionnaire indicate greater health problems, unlike the Sickness Impact Profile (SIP). Thus, the directions of scores differ across scales.

Rating data with floor and ceiling effects follow unknown distributions and do not satisfy the assumptions of PCA, such as bivariate normality for each pair of observed variables, normally distributed scores, etc.

Test reliability by Cronbach's alpha assumes a one-dimensional scale and tau-equivalence (equality of all factor loadings). Multidimensional PROs such as the Insomnia Severity Index (ISI), the Pittsburgh Sleep Quality Index (PSQI), and the Insomnia Symptom Questionnaire (ISQ) violate this assumption and underestimate the coefficient alpha [9]. The coefficient alpha is influenced by variance sources and sampling errors [27], sample size [7], and even test length and test width [20].

 

Validity of a multidimensional scale computed as the correlation with criterion scores raises the question of which dimension/factor the validity reflects. It is desirable to find the validity of the main factor for which the scale was developed and also to derive the relationship between test reliability and test validity. Vaughan (1998) found lower validity where the data contained predominantly high performers. To avoid such problems, structural validity of normally distributed transformed scores by PCA was preferred [4-6].

Different cut-off scores exist for different PROs. For example, the cut-off score of the Sickness Impact Profile (SIP136), with 136 "Yes-No" items distributed over 12 domains, is ≥ 22, while that of the Stroke-Adapted Sickness Impact Profile (SA-SIP30), with 30 items covering 8 subscales, is ≥ 33. A natural question is whether a score of 33 on the SA-SIP30 is equivalent to a score of 22 on the SIP136. Similarly, a score of 14 on the ISI, indicating "no insomnia", is equivalent to which score on the PSQI or ISQ? Finding equivalent scores of two scales thus enables better comparison of PROs for the purpose of classifying individuals. For QoL questionnaires, there may be no cut-off point separating better from worse QoL [31]. Based on treatment status, four different cut-off scores were found for the Cancer Core Questionnaire (EORTC QLQ-C30) [17].

Intra- and inter-observer reliability of ordinal scales such as the Kessler Psychological Distress Scales (K6 and K10) are evaluated by kappa and weighted kappa. Major limitations in this context are:

  • A low kappa does not imply low agreement [1]. A confidence interval for kappa ≤ 0.60 may indicate a large volume of incorrectly evaluated data [32].

  • Methods of deciding the weights for weighted kappa vary and may give different values of weighted kappa.

  • The concepts of agreement in terms of κ or κ_Weighted differ from the concept of the reliability of tests/scales.

Suggested method:

Let X_ij be the raw score of a respondent on the i-th item for choosing the j-th level, where the levels are marked 1, 2, 3, 4, … (avoiding zero) and a higher value of X_ij implies higher dysfunction or impairment. The suggested method transforms the ordinal item scores (X_i) to equidistant scores (E_i) and further to proposed scores (P_i-scores) in the range [1, 100] following N(μ_i, σ_i), facilitating meaningful addition to derive the scale score (P_Scale) as the sum of the P_i-scores. The method is described below.

For the i-th item, find the maximum frequency f_{i,Max} and the minimum frequency f_{i,Min}. For n respondents on a 5-point item (say), take the initial weight ω_{i1} = f_{i,Min}/n, the common difference α = (5 f_{i,Max} − f_{i,Min})/(4n), and the other initial weights ω_{i2} = (ω_{i1} + α)/2, ω_{i3} = (ω_{i1} + 2α)/3, ω_{i4} = (ω_{i1} + 3α)/4, and ω_{i5} = (ω_{i1} + 4α)/5.

Take the final weights W_{ij} = ω_{ij} / Σ_{j=1}^{5} ω_{ij}, so that Σ_{j=1}^{5} W_{ij} = 1. The generated scores E_{ij} = W_{ij} X_{ij} form an arithmetic progression and are therefore continuous, monotonic, and equidistant.

Standardize the equidistant scores (E) of each item as Z = (E − Ē)/SD(E) ∼ N(0, 1), and take

P_i = 99 (Z_i − Min(Z_i)) / (Max(Z_i) − Min(Z_i)) + 1 ∼ N(μ_i, σ_i), where 1 ≤ P_i ≤ 100 irrespective of the length of the scale and the width of the items.

Normality of the item scores (P_i's) facilitates meaningful addition, and the resulting scale score P_Scale = Σ_i P_i follows the convolution of the P_i's. Normally distributed P_Scale-scores can in turn be added to get battery scores (B-scores), which also follow a normal distribution.
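To make the computational steps concrete, the following is a minimal sketch in Python of the item-level transformation described above; the function name, the NumPy dependency, and the simulated data are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def proposed_scores(raw, K=5):
    """Transform ordinal scores of one K-point item (levels 1..K)
    into proposed P-scores in [1, 100], per the suggested method."""
    raw = np.asarray(raw)
    n = raw.size
    freq = np.array([(raw == j).sum() for j in range(1, K + 1)])
    f_min, f_max = freq.min(), freq.max()

    # Initial weights: omega_1 = f_min / n, common difference alpha,
    # and omega_j = (omega_1 + (j - 1) * alpha) / j, so j * omega_j is an AP.
    omega1 = f_min / n
    alpha = (K * f_max - f_min) / ((K - 1) * n)
    omega = np.array([(omega1 + (j - 1) * alpha) / j for j in range(1, K + 1)])

    # Final weights sum to one; the scores E_j = W_j * j are equidistant.
    W = omega / omega.sum()
    E = W[raw - 1] * raw

    # Standardize to Z ~ N(0, 1), then map linearly onto [1, 100].
    Z = (E - E.mean()) / E.std(ddof=1)
    return 99 * (Z - Z.min()) / (Z.max() - Z.min()) + 1

# Example: simulated responses of 200 respondents to one 5-point item.
rng = np.random.default_rng(0)
print(proposed_scores(rng.integers(1, 6, size=200))[:5])
```

For K = 5 the common difference reduces to α = (5 f_{i,Max} − f_{i,Min})/(4n) as in the text, and ω_{i1} + 4α = 5 f_{i,Max}/n, so ω_{i5} = f_{i,Max}/n.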

Major properties of P_Scale-scores and B-scores are:                                                            

Each avoids assigning equal importance to items and dimensions and represents continuous, monotonically increasing scores.

The zero point for scoring K-point items to obtain E-scores is reached when f_ij = 0. Other items on ratio scales can be standardized, transformed to follow a normal distribution in the range [1, 100], and added to the P_i's.

The contribution of the j-th scale to the battery can be found as (P_{j-th Scale}-score)/(B-score).

 

Benefits:

The parameters of the distributions of P_Scale-scores and B-scores can be estimated from data. Normality enables estimation of the population mean (μ) and population variance (σ²), confidence intervals for μ, and tests of statistical hypotheses such as H_0: μ_1 = μ_2 or H_0: σ_1² = σ_2².
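For instance, with normal B-scores the hypothesis H_0: μ_1 = μ_2 can be tested directly with a two-sample t-test; a sketch assuming SciPy is available and using simulated, purely illustrative data:

```python
import numpy as np
from scipy import stats

# Illustrative B-scores for two groups of patients (simulated, not real data).
rng = np.random.default_rng(5)
b_group1 = rng.normal(55, 8, size=100)
b_group2 = rng.normal(58, 8, size=100)

# Two-sample t-test of H0: mu_1 = mu_2, justified because B-scores are normal.
t, p = stats.ttest_ind(b_group1, b_group2)
print(f"t = {t:.2f}, p = {p:.4f}")
```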

Based on battery scores, the progress of the i-th patient in the t-th period over the previous period is given by (B_{i(t)} − B_{i(t−1)})/B_{i(t−1)} × 100. Decline is indicated when B_{i(t)} − B_{i(t−1)} < 0, and B̄_{i(t)} > B̄_{i(t−1)} indicates progress. Similarly, progress with respect to P_Scale-scores can be computed; any decline may be probed to find the critical scale(s) for which P_{Scale(t)} − P_{Scale(t−1)} < 0.
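A one-line illustration of the progress computation (the values are hypothetical):

```python
def progress_pct(b_prev: float, b_curr: float) -> float:
    """Percentage progress of a patient's battery score over the previous
    period, per the formula above; a negative value indicates decline."""
    return (b_curr - b_prev) / b_prev * 100

print(progress_pct(62.0, 68.5))  # ~10.48% change between consecutive periods
```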

The effect of a small change in the i-th scale (P_{i-th Scale}) on the battery score (B-score) can be quantified by elasticity, i.e., the percentage change in B-scores due to a small change in P_{i-th Scale}. The scales can be ranked by such elasticity. Elasticity studies in economics and reliability engineering consider models such as log Q_{jt} = α_j + β_j log P_{jt}, where Q_{jt} denotes the quantity demanded of the j-th industry at time t and P_{jt} is the industry price relative to the price index of the economy. However, for normally distributed P_Scale-scores and B-scores, logarithmic transformations are not required to fit a regression of the form P_Scale = α_i + β_i P_i + ε_i.

The coefficient β_i reflects the impact of a unit change in the independent variable (the i-th dimension) on the dependent variable (P_Scale). Policy makers can decide on appropriate actions: continuing efforts for the scales with high elasticity and taking corrective action for the dimensions with lower elasticity, i.e., the areas of concern.
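A sketch of this regression, together with a point elasticity evaluated at the means; the simulated data, the least-squares fit via np.polyfit, and the elasticity-at-the-means convention are assumptions for illustration:

```python
import numpy as np

def slope_and_elasticity(p_i, p_scale):
    """Fit P_Scale = a + b * P_i by ordinary least squares and return the
    slope b and the point elasticity b * mean(P_i) / mean(P_Scale)."""
    b, a = np.polyfit(p_i, p_scale, 1)
    return b, b * p_i.mean() / p_scale.mean()

rng = np.random.default_rng(1)
p_i = rng.normal(50, 10, size=300)                     # illustrative P_i-scores
p_scale = 20 + 1.5 * p_i + rng.normal(0, 5, size=300)  # illustrative scale scores
print(slope_and_elasticity(p_i, p_scale))
```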

Normality of B-scores facilitates testing H_0: μ_{B(t)} = μ_{B(t−1)}, reflecting the effectiveness of treatment plans, and H_0: Progress_{(t+1) over t} = 0, reflecting progression.

A graph depicting the progress/decline of one patient, or of a group of patients with a similar socio-demographic profile, is analogous to a hazard function and helps to identify high-risk groups and compare responses to treatment from the start.

For two scales X and Y with normal pdfs f(x) and g(y) respectively, the equivalent score y_0 for a given value x_0 can be found by solving ∫_{−∞}^{x_0} f(x) dx = ∫_{−∞}^{y_0} g(y) dy using the standard normal table, even if the scales have different lengths and widths [4-6].
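Since both scores are normal, equating the two areas reduces to matching quantiles; a sketch using Python's statistics.NormalDist (the means and SDs below are assumed values, not estimates from any study):

```python
from statistics import NormalDist

def equivalent_score(x0, mu_x, sd_x, mu_y, sd_y):
    """Find y0 with P(X <= x0) = P(Y <= y0) for X ~ N(mu_x, sd_x), Y ~ N(mu_y, sd_y)."""
    p = NormalDist(mu_x, sd_x).cdf(x0)
    return NormalDist(mu_y, sd_y).inv_cdf(p)

# e.g., which SIP136 score matches an SA-SIP30 score of 33, under assumed parameters?
print(equivalent_score(33, mu_x=28.0, sd_x=9.0, mu_y=20.0, sd_y=7.5))
```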

P-scores and B-scores following normal distributions satisfy the assumptions of PCA and FA and enable finding the factorial validity FV = λ_1/Σλ_i = λ_1/Σ S²_{X_i}, where λ_1, the highest eigenvalue, indicates the validity for the main factor being measured [24]. The significance of λ_1 can be tested using the Tracy-Widom (TW) statistic U = λ_1/Σλ_i, which follows the TW distribution [23]. Such FV avoids the problems of construct validity and of selecting a criterion scale (ensuring matching constructs and two administrations of the scale and the criterion scale).
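A sketch of computing FV as λ_1/Σλ_i from the eigenvalues of the item covariance matrix; the simulated data (rows = respondents, columns = items) are an illustrative assumption:

```python
import numpy as np

def factorial_validity(scores):
    """FV = lambda_1 / sum(lambda_i), from the eigenvalues of the
    covariance matrix of item scores (rows = respondents, columns = items)."""
    eigvals = np.linalg.eigvalsh(np.cov(scores, rowvar=False))  # ascending order
    return eigvals[-1] / eigvals.sum()

rng = np.random.default_rng(2)
common = rng.normal(size=(500, 1))                     # one dominant factor
items = common + rng.normal(scale=0.8, size=(500, 6))  # six correlated items
print(factorial_validity(items))
```

Since the trace of the covariance matrix equals both Σλ_i and Σ S²_{X_i}, the two forms of FV in the text coincide.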

For standardized item scores, FV_{Z-scores} of a test with m items is λ_1/m, and the test variance S_X² can be written as S_X² = Σλ_i + 2 Σ_{i≠j=1}^{m} Cov(X_i, X_j) = λ_1/FV + 2 Σ_{i≠j=1}^{m} Cov(X_i, X_j).   (1)

Equation (1) can be used to find the theoretical reliability

r_{tt(theoretical)} = S_T²/S_X² = S_T² / (λ_1/FV + 2 Σ_{i≠j=1}^{m} Cov(X_i, X_j)).   (2)

Equation (2) gives the relationship between r_{tt(theoretical)} and factorial validity, which is non-linear.

[30] suggested the maximum reliability of a test, α_PCA, which can be derived from the correlation matrix of the m items by

α_PCA = (m/(m−1)) (1 − 1/λ_1).   (3)

The relationship between FV and α_PCA can be derived as:

α_PCA = (m/(m−1)) (1 − 1/λ_1) = (m/(m−1)) (1 − 1/(FV · Σλ_i)) = (m/(m−1)) (1 − 1/(m · FV_{Z-scores})).   (4)

As per (4), a higher value of FV_{Z-scores} increases α_PCA.
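A sketch of equation (3), computing α_PCA from the largest eigenvalue of the item correlation matrix (the simulated data are illustrative):

```python
import numpy as np

def alpha_pca(scores):
    """Equation (3): alpha_PCA = (m/(m-1)) * (1 - 1/lambda_1), where lambda_1
    is the largest eigenvalue of the item correlation matrix."""
    m = scores.shape[1]
    lam1 = np.linalg.eigvalsh(np.corrcoef(scores, rowvar=False))[-1]
    return (m / (m - 1)) * (1 - 1 / lam1)

rng = np.random.default_rng(3)
items = rng.normal(size=(500, 1)) + rng.normal(scale=0.8, size=(500, 6))
print(alpha_pca(items))
```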

Cronbach's alpha of a battery consisting of K scales can be obtained as a function of the scale reliabilities by α̂_Battery = (Σ_{i=1}^{K} r_{tt(i)} S²_{X_i} + 2 Σ_{i≠j=1}^{K} Cov(X_i, X_j)) / (Σ_{i=1}^{K} S²_{X_i} + 2 Σ_{i≠j=1}^{K} Cov(X_i, X_j)),   (5)

where r_{tt(i)} and S²_{X_i} denote respectively the reliability and the variance of the i-th scale.
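A sketch of equation (5), assuming the K scale-score columns and their reliabilities are given (the simulated values are illustrative):

```python
import numpy as np

def alpha_battery(scale_scores, reliabilities):
    """Equation (5): battery alpha from K scale-score columns and their reliabilities."""
    var = scale_scores.var(axis=0, ddof=1)   # per-scale variances
    cov = np.cov(scale_scores, rowvar=False)
    twice_cov = cov.sum() - np.trace(cov)    # 2 * sum of off-diagonal covariances
    return (reliabilities @ var + twice_cov) / (var.sum() + twice_cov)

rng = np.random.default_rng(4)
scales = rng.normal(size=(400, 1)) + rng.normal(size=(400, 3))  # three correlated scales
print(alpha_battery(scales, np.array([0.85, 0.80, 0.90])))
```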

 

Discussion:

The suggested method defines meaningful scale scores and battery scores for each individual. P_Scale-scores and B-scores satisfy the desired properties and support parametric analysis, comparison of patient status and progression (including indications of the effectiveness of treatment plans), and finding equivalent scores of two patient-reported scales (PROs), where the area under the normal curve of PRO-1 up to P⁰_{PRO-1} equals the area under the normal curve of PRO-2 up to P⁰_{PRO-2}. For the classification of individuals, equivalent cut-off scores of class boundaries may be found satisfying Var(group with score ≥ P⁰_{PRO-1})/Var(PRO-1) = Var(group with score ≥ P⁰_{PRO-2})/Var(PRO-2), which may yield similar classification efficiency in terms of within-group and between-group variance.

Factorial validity (FV), reflecting the main factor being measured, provides a clear understanding of the most important factor being measured; however, establishing clinically meaningful content validity remains a vital step. The maximum value of test reliability α_PCA, the relationship between FV_{Z-scores} and α_PCA, and the relationship between r_{tt(theoretical)} and FV can be used effectively to compare scales. Scales with eigenvalues exceeding unity can be retained, keeping in view that results may be distorted by a wrong selection of constituent scales.

 

Conclusions:

The suggested B-scores, reflecting disease severity with respect to the PRO measures, are recommended, with the scales chosen as per the selection criteria mentioned above. Future empirical investigations may evaluate the properties of the suggested method and its clinical validation, along with the effects of sociodemographic factors.

Declarations:

Acknowledgement: Nil

Conflicts of interest/Competing interests: The author has no conflicts of interest to declare.

Funding: The author did not receive any grant from funding agencies in the public, commercial, or not-for-profit sectors.

Ethical approval: Not applicable, since the paper does not involve human participants.

Consent of the participants: Not applicable, since the paper does not involve data from human participants.

Data Availability statement: The paper did not use any datasets.

Code availability: No software package or custom code was used.

CRediT statement: Conceptualization, methodology, analysis, and writing and editing of the paper by the sole author.

References
