February 01, 2017

Personnel Problems: How Useful is the Wonderlic Personnel Test?

Amy Knapp

Nova Southeastern University
Now, possibly more than ever before, the economy in the United States is a pressing issue. For many years, the United States has struggled to lower unemployment rates and to reconcile the fact that, with approximately 22 million workers underemployed or unemployed (Vo, 2012), there are at least 3.2 million positions that remain unfilled (Sullivan, 2011). In many ways, psychology is equipped to deal with issues like these. Specifically, the field of industrial-organizational psychology, one of the newer and still-developing areas of psychology, is concerned with blending the world of work with psychological concepts and tools. This field can be subdivided into the more specific concentrations of personnel psychology and organizational psychology, which focus on different aspects of employment and business. Organizational psychology is involved with workplace functioning, such as motivation and leadership, whereas personnel psychology is more concerned with employee recruitment and selection, as well as matching job performance with job descriptions (Kaplan & Saccuzzo, 2012, p. 484).
When it comes to hiring, a number of factors are used by employers to determine the best candidate for a job. An interview is sometimes considered the single most important factor in making or breaking a deal for employment. A business often places importance on experience and previous jobs held by a candidate as well, facts that are typically included in one's resume, which is, of course, the crucial first step in getting noticed by an employer. These tools are often not enough, however, for an employer to come to a clear decision, especially in today's market, where an overwhelming number of applicants apply to one job. This is one of the many reasons why so many employers have historically implemented employee assessments. Assessments can take a number of forms, but one common strategy is to determine the cognitive ability of applicants. While some might contend that other factors, such as motivation and personality, are more relevant to employee selection (Leverett, Matthews, Lassiter & Bell, 2001), there is much evidence to suggest cognitive ability deserves attention in this respect. Cognitive ability has been found to strongly correlate with job training success (Ree & Earles, 1992) and job-related learning (Schmidt, Hunter & Outerbridge, 1986; Schmidt & Hunter, 1992). Moreover, many researchers have concluded that cognitive ability is one of the best predictors of job performance (Thorndike, 1986; Jensen, 1986; Hunter, 1986; Campbell, 1990). Hunter & Hunter (1984) and Schmidt & Hunter (1998) go further, both proclaiming cognitive ability to be the single most important factor in predicting job performance. There are numerous tests that measure cognitive ability, but the Wonderlic Personnel Test is arguably the most economical and most used in industrial-organizational settings (Leverett et al., 2001). The question to ask, however, is whether its issues and limitations outweigh the justification for its use.
About the Wonderlic
After being used by the United States Navy in 1937 as a means of selecting pilots for training in World War II (Chamorro-Premuzic & Furnham, 2010, p. 100), the Wonderlic Personnel Test (WPT) was first administered publicly in 1938 (Geisinger, 2001). The WPT is a norm-referenced test which was adapted from a previous assessment, the Otis Self-Administering Test, and can be administered individually or in groups (Schraw, 2001). The WPT is intended for use with the working-age population, from 18-year-olds fresh out of high school to those aged 60 and over. However, current, concrete normative data is quite elusive. Geisinger (2001) reports that norms are available for working adults, high school graduates, and college graduates, and by gender, age, education, race, ethnicity, position, and test form. However, Geisinger (2001) adds that not all normative data is satisfactorily current. This does seem to be an issue. Downie (1965) indicated that norms were available from 50,000 adults from 20 to 65 years old. This is the same information provided by Kaplan & Saccuzzo (2012, p. 499) nearly fifty years later. Belcher (1992) reveals that the 1983 norms are based on 126,000 individuals but adds that this data was collected in 1966, which makes it far from current. It is possible that the most recent normative data may be from 1970, as provided by the Wonderlic Company (1970) and reported by Drdege (1972), in which 251,253 individuals were used to compile performance norms for age, sex, education, industry, geographic region and position applied for. When solicited, Dr. Brett Wells, lead research scientist for the Wonderlic Company, offered no further information about the normative data of the WPT. However, he did provide information about the Wonderlic Personnel Test-Revised (WPT-R), which was released in 2007 (Kaplan & Saccuzzo, 2012, p. 500), stating that 427 companies and 36,543 test-takers participated in its development (B. Wells, personal communication, November 5, 2012). Additionally, the WPT is quite highly correlated with the WPT-R (r = .78) (B. Wells, personal communication, November 5, 2012). Despite its issues, the WPT is widely used and is purported to have had 200 million administrations (B. Wells, personal communication, November 5, 2012). Most of these administrations have likely been completed using a paper-and-pencil test, but a computer-based format is also currently available. There are 16 parallel forms of the WPT (Rosenstein & Glickman, 1994, p. 187), and it is also available in large-print, Braille and audiotape versions (Geisinger, 2001). In addition to three English versions, for the United States, United Kingdom, and Australia, the WPT has been translated into 11 languages: Chinese, Japanese, Korean, Vietnamese, Portuguese, Spanish, French, German, Russian, Swedish and Tagalog (Leverett et al., 2001).
While it is intended for use in personnel, the WPT is designed to measure one's ability to learn and general mental ability, or g, which is a concept taken from Spearman's theory of intelligence. In Spearman's theory, he noted that "…when a set of diverse ability tests are administered to large unbiased samples of the population, almost all of the correlations are positive. This phenomenon is called positive manifold, which according to Spearman resulted from the fact that all tests, no matter how diverse, are influenced by g" (Kaplan & Saccuzzo, 2012, p. 233). Built on this premise, the WPT contains 50 questions with an allotted completion time of 12 minutes. However, the test manual provides instructions for adjusting scores on untimed tests (Schraw, 2001). Arranged in order of difficulty, the types of questions on the WPT include word comparisons, number comparisons, number series, geometric figure analyses, disarranged sentences, vocabulary, and logical as well as mathematical story problems. These different types of items are not scaled; the score one receives is simply the number of correct answers, which the administrator tallies using a key provided with the WPT. Due to the ease of scoring and interpretation, a trained administrator is not necessary. In order to successfully complete the test, however, the examinee must possess a sixth-grade reading level and knowledge of how to tell time, the days of the week, months of the year, basic math, U.S. monetary units, measures and weights (Geisinger, 2001).
Age adjustments are advised for test-takers over the age of 29 (Schraw, 2001), with one additional point suggested for 30-39 year olds and as much as 5 additional points to be added for those 60 years of age and older (Schoenfeldt, 1985). Possible scores simply range from 0 to 50 and are intended to indicate whether an applicant for a job is within a range that would be appropriate for the position applied for. For example, in early versions of the manual "…[m]inimum scores range from 8 for janitors and other unskilled jobs to over 30 for some professional and managerial jobs" (Drdege, 1972). Later versions of the WPT explain expected scores further; for instance, the mean score for a Chemist is a 31 and other expected means include a 26 for Newswriters, 17 for Security Guards and 15 for Warehouse Workers (Guion & Highhouse, 2006, p. 240). Curiously, both Clerical Workers and National Football League (NFL) players would be expected to have a mean score of 21 (Guion & Highhouse, 2006, p. 240). Pesta, Bertsch, Poznanski, & Bommer (2008) indicate the population mean to be 22 (SD=7) (p. 430). This is the same expected average for Bank Tellers (Guion & Highhouse, 2006, p. 240).
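To put these occupational means in perspective, each can be expressed as a distance from the population mean of 22 (SD = 7) reported by Pesta et al. (2008). The following is a minimal sketch in Python and is not part of any WPT material. The function names z_score and percentile are illustrative, and the percentile conversion assumes that scores are approximately normally distributed, which the sources cited here do not state.

```python
from statistics import NormalDist

# Population mean and SD as reported by Pesta et al. (2008).
# Treating the score distribution as normal is an assumption made
# here for illustration only.
POPULATION = NormalDist(mu=22, sigma=7)

# Occupational means as cited from Guion & Highhouse (2006, p. 240).
OCCUPATION_MEANS = {
    "Chemist": 31,
    "Newswriter": 26,
    "Bank Teller": 22,
    "Clerical Worker / NFL player": 21,
    "Security Guard": 17,
    "Warehouse Worker": 15,
}


def z_score(raw_score: float) -> float:
    """Distance of a raw score from the population mean, in SD units."""
    return (raw_score - POPULATION.mean) / POPULATION.stdev


def percentile(raw_score: float) -> float:
    """Approximate percentile rank under the normality assumption."""
    return 100 * POPULATION.cdf(raw_score)


if __name__ == "__main__":
    for job, mean in OCCUPATION_MEANS.items():
        print(f"{job:30s} mean={mean:2d}  z={z_score(mean):+.2f}  "
              f"percentile~{percentile(mean):.0f}")
```

Under these assumptions, the Chemist mean of 31 sits about 1.3 standard deviations above the population mean, while the Warehouse Worker mean of 15 sits exactly one standard deviation below it.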
As mentioned above, Guion & Highhouse (2006) indicate the average score for NFL players on the WPT to be 21 (p. 240). Discussing football alongside cognitive ability may seem peculiar when taken out of a bigger context. In truth, it could be claimed that no discussion of the Wonderlic would be complete without also mentioning the use of the WPT by the National Football League. The WPT has been used as part of what is referred to as the NFL Combine since 1970 (Guion & Highhouse, 2006, p. 240). The NFL Combine, an employment assessment in itself, includes a "…series of drills, exercises, interviews, aptitude tests, and physical exams designed to assess the skills of promising college football players and to predict their performance in the NFL" (Kuzmits & Adams, 2008, p. 1721). While it may seem odd that a cognitive ability test would be relevant to performance on a football field, the score one receives on his Wonderlic is probably the most discussed result of the combine screenings. In fact, the WPT is viewed as so important that players often hire tutors to improve their chances of a high score (Hart & Sheldon, 2007, p. 29). As Kuzmits & Adams (2008) point out, "[f]ew studies have examined scientifically whether a relationship exists between a graduating collegiate football player's combine performance and subsequent performance as a professional football player" (p. 1721). Sawyer, Ostarello, Suess & Dempsey (2002) found that there was a significant relationship between many measures of performance and players' ability level on the vertical jump, while other researchers have found the combined results of players' physical assessments to be a good to fair predictor of future performance (McGee & Burkett, 2003). However, the WPT has yet to be examined thoroughly and, in fact, Kuzmits & Adams (2008), using a small sample, concluded that the WPT is unrelated to NFL success. Nevertheless, the League has continued to hold the combine assessment and to incorporate the WPT as part of the process for decades.
Analytic Information
For any assessment to gain respect and longevity as one that can be relied on, it must have demonstrable reliability and validity. In the case of the WPT, there is a sizable amount of evidence to justify its endorsement. Test-retest reliability has been found to range from .82 to .94 (Geisinger, 2001; Dodrill, 1983) and parallel forms reliability has a range of .73 to .95 (Geisinger, 2001). The Kuder-Richardson coefficient establishes internal consistency at .88 (Geisinger, 2001). In terms of validity, numerous experts look favorably upon the Wonderlic and accept it as a good measure of cognitive ability (Belcher, 1992; Bell, Matthews, Lassiter & Leverett, 2002; Geisinger, 2001; Hawkins, Faraone, Pepple, Seidman & Tsuang, 1990; Matthews & Lassiter, 2007; Schmidt, 1985; Schraw, 2001; Schulte, 2004). If that were not enough to establish content validity, the widespread use of the WPT for over 70 years is more than encouraging. Given the vast variety of question types and the results of other validity checks, it seems fair to conclude that the WPT possesses good content validity.
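For readers unfamiliar with the Kuder-Richardson coefficient, the sketch below shows how one common variant, KR-20, is computed for items scored right or wrong. It is an illustration only: the source does not specify which Kuder-Richardson formula produced the .88 figure, the response matrix is hypothetical, and the function name kr20 is not drawn from any test manual.

```python
from statistics import pvariance


def kr20(responses: list[list[int]]) -> float:
    """Kuder-Richardson Formula 20 for dichotomous (0/1) items.

    `responses` holds one row per examinee and one column per item.
    KR-20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / variance of total scores),
    where p_i is the proportion answering item i correctly and q_i = 1 - p_i.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")

    # Population variance of the examinees' total scores.
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    # Sum of item variances p * q across all items.
    sum_pq = 0.0
    for item in range(n_items):
        p = sum(row[item] for row in responses) / n_examinees
        sum_pq += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


if __name__ == "__main__":
    # Hypothetical data: 4 examinees, 3 items of increasing difficulty.
    example = [
        [1, 1, 1],
        [1, 1, 0],
        [1, 0, 0],
        [0, 0, 0],
    ]
    print(f"KR-20 = {kr20(example):.2f}")
```

Values closer to 1 indicate that the items tend to rank examinees consistently, which is the sense in which a coefficient of .88 is taken as evidence of internal consistency.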
Construct validity is similarly well-established for the WPT. The assessment has demonstrated convergent validity by correlating with several other measures of cognitive ability. The correlation between the WPT and the (full scale) Wechsler Adult Intelligence Scale-Revised (WAIS-R) was found to be .92 (Schulte, Ree & Carretta, 2004, p. 1061), while its correlation with the General Aptitude Test Battery (GATB), Aptitude G, is similarly high at .90 (Hunter, 1989). Matthews & Lassiter (2007) found the correlation between the WPT and the Woodcock-Johnson-Revised (WJ-R) Tests of Cognitive Ability Battery to be a moderate .55, and another moderate correlation, .66, was found between the WPT and the Kaufman Adolescent and Adult Intelligence Test (KAIT) (Bell et al., 2002). Furthermore, the WPT has been found to measure both crystallized and fluid intelligence, as evidenced by Bell et al. (2002) in their comparison of the KAIT scales of Crystallized IQ (Gc) and Fluid IQ (Gf), which were both moderately correlated with the WPT (r = .62 and r = .54, respectively). However, in the conclusion of their investigation, Bell et al. (2002) suggest that, in many situations, it may be wise to use alternate tests in order to gauge individual scores for Gf and Gc, as numerous jobs may require one over the other (pp. 117-118). When it comes to divergent validity, according to Schraw (2001), the WPT is uncorrelated with a wide variety of personality assessment measures. A specific example of evidence in support of divergent validity is found in comparing the WPT to the NEO-Five-Factor Inventory (NEO-FFI). The WPT correlated most highly with the "Openness to Experience" scale, r = .255, and with the "Neuroticism" scale, r = -.242 (Schulte et al., 2004, p. 1065). These weak correlations are not only logical, but can also be considered negligible.
The evidence for criterion validity is not as convincing as that of other forms of validity. As mentioned, Schmidt & Hunter (1998) found g to be the most valid predictor of job performance (r = .51) and job training performance (r = .56). As for the WPT specifically, it has been found to be positively correlated with educational and vocational success (Schraw, 2001). Additionally, those who score high on the WPT tend to gain more from formalized training and be more likely to learn from on-the-job experiences (Schulte et al., 2004). Geisinger (2001) reports that predictive validity for performance ranges from .22 to .67 in selected job categories. This range is quite a large one, considering that a .22 correlation would account for less than 5% of variance in job performance. This evidence may still satisfactorily suggest predictive validity, but there is even less apparent evidence for concurrent validity. Although Geisinger (2001) does note that those who have attained higher education tend to score higher, there is no definitive proof that the higher scores are not merely the result of education.
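The arithmetic behind the 5% figure is the coefficient of determination: squaring a correlation gives the proportion of variance in one variable that is linearly associated with the other. A brief Python sketch applies this to the coefficients cited above; the function name variance_explained is illustrative.

```python
def variance_explained(r: float) -> float:
    """Proportion of variance accounted for by a correlation r (r squared)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Correlations as cited in the text.
CITED = {
    "WPT predictive validity, low end (Geisinger, 2001)": 0.22,
    "WPT predictive validity, high end (Geisinger, 2001)": 0.67,
    "g and job performance (Schmidt & Hunter, 1998)": 0.51,
    "g and training performance (Schmidt & Hunter, 1998)": 0.56,
}

if __name__ == "__main__":
    for label, r in CITED.items():
        print(f"{label}: r = {r:.2f}, variance explained = "
              f"{variance_explained(r):.1%}")
```

At the low end, r = .22 corresponds to roughly 4.8% of the variance in job performance; at the high end, r = .67 corresponds to roughly 45%, nearly ten times as much.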
Limitations and Potential Biases
Some of the problems surrounding the WPT have already been alluded to. Evidence for criterion validity is lacking, and more research is needed in this area, especially to verify concurrent validity apart from the potential influence of educational attainment. Schoenfeldt (1985) noted what he referred to as a substantial correlation between the WPT and education, citing a range of r = .3 to .8. Drdege (1972) raised the same issue, describing this as a serious limitation that can lead to unfair disadvantages and even discrimination against those who are less educated. Unfortunately, the shortcomings of the WPT do not end here. Several reviewers have also criticized the overall organization of the literature pertaining to the WPT as well as the apparent lack or withholding of information. Foley (1972) has several criticisms of the WPT manual and corresponding publications, faulting them for poor organization, poor presentation, and missing information, particularly inadequate data with which to evaluate the measure. Foley (1972) also provides an interesting anecdote, obtained from his communication with the tabulation company, in which he requested information for the WPT norms and the company refused to provide any. Schmidt (1985) had similar complaints, especially about the lack of information on validity. Belcher (1992) echoed these sentiments, pointing out that the testing company needs to include more information about its tests and the participants used to gather the information, and to present this data in an accessible manner. Schoenfeldt (1985) noted disorganization in the testing manual materials and specifically expressed concern for inconsistencies in the information provided to assist administrators in dealing with Equal Employment Opportunity (EEO) liability.
Perhaps in part because there is a lack of information on how to use the WPT, several reviewers have remarked that the test may often be inappropriately used. Foley (1972) and Schmidt (1985) agree that there is no supporting data to justify one of the key elements of the WPT, the decided-upon minimum scores for occupations. Foley (1972) also believes the measure is most appropriate for a limited population, which primarily includes white-collar workers. He additionally notes that, despite having many parallel forms, the widespread use of the test impairs its effectiveness due to the resultant familiarity with the WPT. Drdege (1972) expresses similar concerns about the use of "cutting scores," in which some scores are considered too high for certain occupations. Drdege (1972) says that there is no evidence to support this practice and that there is no proof that it is advantageous to anyone, whether it be employers or job seekers. More recently, Geisinger (2001) made many of the same comments about the WPT and Schoenfeldt (1985) was very concerned about the improper use of the WPT and especially how this might affect protected groups.
Schoenfeldt (1985) cited the Supreme Court decision of Griggs v. Duke Power Company (1971), in which it was ruled that the tests used by Duke Power Company, including the Wonderlic, lacked validity and, therefore, the use of them was not justified, making them inappropriate. The court also found the test practices to adversely impact protected groups. Schoenfeldt (1985) agrees with the court decision and believes that the test could be used to discriminate against certain groups and recommends other methods of testing where this risk may be present. One of the major reasons for these concerns is the fact that mean and median scores on the WPT are vastly different for whites (22.76 and 23) compared to African-Americans (16.20 and 16) and Hispanic-Americans (17.26 and 17), which can lead to obvious problems when a member of one of these minority groups and a white person are competing for a job (Geisinger, 2001). To remedy this situation, the 1983 manual for the WPT explained that group norms should be used to prevent adverse impact (Schmidt, 1985). This practice, of course, was made illegal with the passage of the Civil Rights Act of 1991 (Kaplan & Saccuzzo, 2012, p. 55), making it now more difficult to know how this problem is to be handled in occupational settings.
Aside from the above-mentioned dilemmas, there are other groups for whom the WPT may be inappropriate and/or who may be put at a disadvantage. For example, Edinger et al. (1985) and Hawkins et al. (1990) both found the WPT to be an inappropriate measure for populations suffering from mental illness. Hawkins et al. (1990) noted that types of brain damage, visuospatial disabilities and even amotivational states could undermine the value of the WPT as a measure of intelligence (p. 199). Additionally, the single-score format of the WPT is a limitation for such populations, as potential functional deficiencies will not be detected as they would be with other measures (Hawkins et al., 1990, p. 200). Geisinger (2001) also took issue with the speeded nature of the test, postulating that it may create a disadvantage for those with disabilities, those with below-average English skills and older populations. This might be exactly why Rosenstein & Glickman (1994) found score discrepancies in their aged sample. Rosenstein & Glickman (1994) compared two groups, one composed of 40-55 year olds and the other of 56-84 year olds, and found significant differences not only in the scores of the two groups but also in the number of items attempted by each group. The 56 and over group attempted significantly fewer items, a result that occurred regardless of age adjustments and independent of print sizes, which were varied across both groups. Finally, another probable flaw has arisen pertaining to the testing of non-native English speakers and to the recent development of foreign language equivalents of the WPT. Belcher (1992) points out that the WPT makes no accommodations for non-native English speakers and that the speeded nature of the test likely causes these individuals to be disadvantaged. Because of this, Belcher (1992) concludes:
[T]he Wonderlic fails to meet the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1985). Instances of noncompliance include failure to provide a testing situation designed to minimize threats to test reliability and validity arising from language differences (Standard 13.1) and lack of information necessary for appropriate test use and interpretation with linguistically diverse test takers (Standard 13.3).
The development of foreign language equivalents should partially remedy this situation; however, Geisinger (2001) remarks that the validity of these parallel forms is questionable because no validation data is provided. It is equally unclear how accurate these translations are (Geisinger, 2001).
My interests in helping people have always drawn me to psychology. Along the way, I have also discovered that I have a knack for assessing the potential and skills of others. Over the years I have sort of made a hobby of placing people I know in appropriate careers or jobs. This pastime and my background in psychology, along with my experiences acquiring a business degree, have led me to be interested in a career within employee selection and the field of organizational psychology. Thus, I found it only appropriate to consider the usefulness of an assessment instrument that is relevant to a career in job placement or human resources and a test that I may someday come to use to make more informed hiring and job matching decisions. The test that stood out to me was the Wonderlic Personnel Test. Realistically, a personnel test should be brief and, yet, also relatively accurate, since decisions often have to be made quickly in business. After all, in the business world, time is money. Thus, the Wonderlic seemed to fit the bill. However, upon rather thoroughly reviewing the literature available on the WPT, the instrument seems to be very limited, at best. It is clear that the relevance of the WPT in certain contexts may be null, such as its use by the NFL combine; but that instance is a bit outside of the intended use of the WPT and not much of a surprise. When all of the other issues with the WPT are taken into account and it is revealed just how many groups are potentially excluded from attaining a fair and accurate score, though, this cannot be ignored. It is, furthermore, troublesome how little effort is made by the Wonderlic Company to provide detailed, easily accessible information about the soundness of their test. Readily obtainable and verifiable data should always be the goal whenever a company wants its products to be used and taken seriously, and Wonderlic, Inc. seems to be establishing a reputation for being less than forthcoming. At some point, this begins to make a test seem suspect.
The world has changed quite a bit since the advent of intelligence theory and even since the birth of the WPT. In the current state of affairs, obtaining gainful employment has become a challenge for everyone, regardless of intelligence, education, race, ethnicity, or personal circumstances. The law does not tolerate discrimination or the putting in place of obstacles for selected groups. Simply because the WPT has experienced widespread use for a number of decades does not mean it should be continually used without scrutiny. Given the evidence, certainty can only be extended to the fact that the WPT seems to accurately assess general mental ability in white professionals under the age of 56 who are free of psychological disorders and disability and who speak English as their native language. This is a very narrow scope. Beyond it, I could not feel comfortable endorsing the WPT until it has undergone intense scientific investigation that yields current normative data, with satisfactory reliability and validity checks, and until more rigorous efforts have been made to establish clear limitations of the WPT. Until then, if one must use the WPT, it should always be used with several other means of assessment (e.g., interview, job history, education, and other tests), and use should ideally be limited to white-collar environments where the test-takers have some formal education and adequate reading and English abilities.
