Testing and Evaluation
Why Do We Need Tests?
Testing and Evaluation are integral parts of the teaching-learning process. When teaching English to non-native learners, tests are useful for measuring a student’s current knowledge, skills, and understanding of the language. They also help teachers evaluate learners’ progress and identify their areas of strength and weakness. Testing can also motivate learners to keep learning and improving their language skills.
So, tests can serve several important purposes in teaching English to a non-native learner:
- Assessment of Learning: Tests provide a way to assess the progress and understanding of students. They help teachers identify areas where students excel and areas that may need further instruction.
- Feedback: Tests offer students feedback on their performance, highlighting their strengths and weaknesses. This feedback can guide students in understanding what they need to focus on to improve their language skills.
- Motivation: Tests can motivate students to study and engage with the material more actively. The desire to perform well on a test can encourage students to practice their English skills regularly.
- Benchmarking: Tests establish benchmarks for student performance, allowing teachers to compare individual students’ progress over time and to evaluate the effectiveness of teaching methods and materials.
- Accountability: Tests provide a means of accountability for both students and teachers. Students are accountable for their learning progress, while teachers are accountable for their teaching effectiveness.
- Goal Setting: Tests help students and teachers set realistic goals for language learning. By assessing current proficiency levels, students can identify specific areas for improvement and set achievable goals for future learning.
- Preparation for Real-Life Situations: Tests can simulate real-life language use situations, such as writing emails, participating in conversations, or understanding written instructions. This prepares students for using English in practical contexts outside the classroom.
What Makes a Good Test?
To make the Testing and Evaluation process effective and efficient, we need good tests. But what makes a good test? What are its characteristics, and by what criteria should we judge it? Several key factors are crucial when evaluating the quality of a test:
Reliability: Imagine a weighing scale that gives you a different weight every time you step on it. That wouldn’t be very useful! Similarly, a reliable test consistently measures the same skill or knowledge, regardless of when or where the test is administered. This ensures that scores accurately reflect a student’s ability, not random chance or variations in testing conditions.
Validity: A test can be reliable, but if it is measuring the wrong thing, it is not very helpful. Validity ensures the test accurately assesses the specific knowledge or skill it is intended to measure. For instance, a test designed to measure writing ability shouldn’t focus solely on grammar rules and neglect creative expression.
Fairness: A fair test provides an equal opportunity for all test-takers to demonstrate their abilities, regardless of background or cultural experiences. This means avoiding questions or content that might favor a specific demographic group.
Clear Instructions: Clear and concise instructions are like a roadmap for the test-taker. They ensure everyone understands what is expected of them, minimizing confusion and frustration during the testing process.
Appropriateness: The test content, format, and difficulty level should be well-matched to the intended purpose and the target population. Imagine giving a college-level math test to elementary school students – it would not be appropriate.
Objectivity: Subjective scoring, where answers are open to interpretation, can introduce bias. A good test strives for objectivity, where scoring is clear-cut and consistent between different graders. This ensures scores reflect the test-taker’s abilities, not the grader’s personal judgment.
Practicality: Consideration must be given to the practicalities of administering, scoring, and interpreting the test. An ideal test is feasible to implement within reasonable time and resource constraints.
Authenticity: Real-world relevance enhances the value of a test. When test tasks and materials reflect situations or contexts students might encounter outside the classroom, it provides a more meaningful assessment of their skills.
Sensitivity: A good test is like a finely tuned instrument, capable of detecting subtle variations in the abilities or traits being measured. This allows for a more nuanced understanding of student progress and areas for improvement.
Accessibility: Everyone should have a fair chance at demonstrating their knowledge. Accessibility ensures that the test format and delivery methods consider students with disabilities or special needs, providing them with the necessary support to participate effectively.
By considering these factors, teachers can develop and implement effective tests for the testing and evaluation of students’ learning and promote fair and meaningful assessment.
Validity of a Good Test
Validity is the foundation of a good test.
A test is said to be valid if it measures accurately what it is intended to measure.
This makes validity the central concept in language testing. Validity ensures that a test is fit for the intended purpose and a reliable, accurate and fair reflection of a learner’s ability.
Construct Validity
Language tests are designed to measure essentially theoretical constructs (concepts) such as a learner’s ‘reading ability’, ‘fluency in speaking’, ‘knowledge of grammar’, etc. For this reason, the term ‘construct validity’ is now used to refer to the overall notion (idea) of validity.
This means that when we say that a particular test is a valid test, we are saying that it measures correctly the construct or concept we intend to measure.
A construct is a theoretical concept, theme, or idea based on empirical observations. It is a variable that’s usually not directly measurable.
It is, therefore, important that a test should be designed in such a manner that its scores maximise the contribution of the desired ‘construct’ and minimise the contribution of irrelevant factors such as ‘general knowledge’, ‘first language background’, etc. However, no matter how much care is taken to create a valid test, simply claiming that a test has construct validity isn’t sufficient; empirical evidence is necessary to support it. This evidence, which can come in various forms such as content validity and criterion-related validity, is crucial for addressing language testing issues. Let’s begin by looking at these two types of evidence and their importance, and then we’ll delve into other forms of validity evidence.
Content Validity
The first form of empirical evidence is related to the content of the test.
A test is considered to have ‘content validity’ if its content accurately represents the language skills, structures, and other aspects it is designed to assess.
A grammar test, for instance, must consist of items related to grammar knowledge or control. However, this alone doesn’t ensure content validity. Content validity would only be achieved if the test includes a proper sample of relevant structures. The specific structures deemed relevant depend on the purpose of the test. For instance, an achievement test for intermediate learners would not contain the same set of structures as one for advanced learners. Likewise, the content of a reading test should reflect the specific reading skills (such as skimming for gist or scanning for information) and the type and difficulty of texts that successful candidates are expected to handle.
To assess the content validity of a test, we need a detailed outline of the skills or structures it aims to cover, which should be established early in the test development process. It is not realistic to expect everything in the specification to appear in the test, as there may be too many elements to include in a single test. However, the specification provides the test creator with a framework for selecting elements for the test in a principled manner. Comparing the test specification with the actual test content forms the basis for determining content validity. Ideally, these assessments should be conducted by individuals familiar with language teaching and testing, but who are not directly involved in producing the test.
Why is Content Validity Important?
Firstly, the higher the content validity of a test, the more likely it is to measure accurately what it is intended to measure; in other words, the more likely it is to possess construct validity. If major areas outlined in the specifications are under-represented in the test or absent altogether, the test is unlikely to be accurate. Such a test can also have adverse effects: areas it does not cover tend to be overlooked in teaching and learning.
Often, tests prioritize what is easy to assess rather than what is truly important to assess. The best way to prevent this is by creating thorough and detailed test specifications and ensuring that the test content reflects them fairly. Therefore, content validation should be conducted during the test development process, rather than waiting until the test is in use. When designing a language test for a specific purpose, it becomes crucial to consult domain experts, such as air traffic controllers for an aviation English test.
Criterion-related Validity
The second form of empirical evidence for a test’s construct validity is by comparing its results with those of another reliable assessment of the candidate’s ability. This assessment acts as the standard or criterion against which the test is validated. There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity.
Concurrent Validity
Concurrent validity is established when the test and the criterion measure are administered at roughly the same time. For instance, suppose the speaking component of an achievement test is designed as a 45-minute session, but only a 10-minute session is feasible in practice. The shorter session’s content validity depends on how well it represents the skills outlined in the course objectives; to establish its concurrent validity, a random sample of students would take both the shorter and the full versions, and their scores would be compared. A high level of agreement between the two sets of scores means the shorter test can be considered valid; little agreement means it cannot. Agreement is measured using a correlation coefficient, where a coefficient of 1 indicates perfect agreement and 0 indicates no agreement at all.
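To make the idea of a correlation coefficient concrete, here is a minimal Python sketch. The scores are invented purely for illustration; it computes the Pearson correlation between students’ marks on the shorter 10-minute speaking session and the full 45-minute test. A coefficient close to 1 would support the concurrent validity of the shorter version.

```python
# A minimal sketch of estimating concurrent validity with a Pearson correlation.
# All scores below are invented for illustration only.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same ten students on both versions of the speaking test.
short_test = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]   # 10-minute session (out of 20)
full_test  = [55, 68, 42, 80, 63, 50, 71, 60, 47, 76]   # 45-minute test (out of 100)

r = pearson(short_test, full_test)
print(f"Concurrent validity coefficient: {r:.2f}")   # a value near 1 means high agreement
```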
Predictive Validity
The second type of criterion-related validity is predictive validity. This concerns the degree to which a test can predict candidates’ future performance. An example would be how well a proficiency test could predict a student’s ability to cope with a graduate course at a university. The criterion measure in this case could be an evaluation of the student’s English skills by their university supervisor, or it could be the outcome of the course (pass/fail, etc.). However, the choice of criterion measure raises interesting questions. Should we trust the subjective opinions of supervisors who may not be trained in language assessment? How reliable is it to use the final course outcome as the criterion measure when many other factors besides English ability, such as subject knowledge, intelligence, motivation, health, and happiness, contribute to the outcome?
When the course outcome is used as the criterion measure, a validity coefficient of around 0.4 is typically the highest one can expect (the proportion of variance shared by the two measures is the square of the coefficient, so 0.4 corresponds to less than 20% agreement). This is partly because of the other contributing factors, and partly because students whose English the test predicts will be inadequate are usually not allowed to take the course, so the test’s (possible) accuracy in predicting problems for those students goes unrecognised. This is why a validity coefficient of this order is generally regarded as satisfactory.
Another example of predictive validity is a practice test taken before a final exam. If the validity coefficient is around 0.4, practice-test scores account for only about 16% of the variation in final-exam scores, so the relationship between the two is fairly weak. If you do really well on the practice test, there is a fair chance you will do well on the final exam, but it is not guaranteed; and if you do poorly on the practice test, it does not necessarily mean you will do poorly on the final exam. In other words, the practice test gives some idea, but not a very clear one, of how you will perform on the real thing.
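To see why a coefficient of around 0.4 gives only a rough prediction, it helps to look at the shared variance, which is the square of the correlation coefficient. The short sketch below uses a few illustrative coefficient values to show how much of the variation in the criterion a test with each coefficient would account for.

```python
# Sketch: how much of the variation in the criterion a validity coefficient accounts for.
# The proportion of shared variance is the square of the correlation coefficient.
for r in (0.4, 0.6, 0.8, 1.0):
    shared_variance = r ** 2
    print(f"validity coefficient r = {r:.1f} -> shared variance = {shared_variance:.0%}")

# r = 0.4 -> 16%: the test accounts for less than a fifth of the variation in outcomes,
# which is why a coefficient of this order gives only a rough, not a firm, prediction.
```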
To sum up, content validity, concurrent validity and predictive validity, all have a part to play in the development of a good test.
Validity in Scoring
It is worth pointing out that if a test is to have validity, not only the items but also the way in which the responses are scored must be valid. It is no use having excellent items if they are scored invalidly. A reading test may call for short written responses. If the scoring of these responses takes into account spelling and grammar, then it is not valid (assuming the reading test is meant to measure just reading ability!). By measuring more than one ability, it makes the measurement of the one ability in question less accurate. There may be occasions when, because of misspelling or faulty grammar, it is not clear what the test-taker intended. In this case, the problem is with the item, not with the scoring. Similarly, if we are interested in measuring speaking or writing ability, it is not enough to obtain speech or writing in a valid fashion. The rating of that speech or writing has to be valid too. For instance, overemphasis on such mechanical features as spelling and punctuation can invalidate the scoring of written work (and so the test of writing).
Face Validity
A test is said to have face validity if it looks as if it measures what it is supposed to measure.
For example, a test that pretended to measure pronunciation ability but which did not require the test-taker to speak (and there have been some) might be thought to lack face validity. This would be true even if the test’s construct and criterion-related validity could be demonstrated. Face validity is not a scientific notion and is not seen as providing evidence for construct validity, yet it can be very important. A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers. It may simply not be used; and if it is used, the candidates’ reaction to it may mean that they do not perform on it in a way that truly reflects their ability. It is therefore necessary that new techniques, particularly those which provide indirect measures, be introduced slowly, with care, and with convincing explanations.
If you are preparing for UGC NET/JRF, you may find this article useful.
©2024. Md. Rustam Ansari [profrustamansari@gmail.com]