
In the Spring semester of 1995, as part of the New Traditions Chemistry Curriculum Reform Initiative, Professor John Wright, in conjunction with the LEAD Center, designed an "experiment" to see whether the wide variety of "active-learning" reforms that he had introduced in a freshman honors analytical chemistry course, Chem 110, would have any measurable effect on student learning in comparison to students in another lecture of the same course taught by a colleague using a more traditional lecturing format. Below we refer to Wright's lecture by the acronym SAL (Structured Active Learning). The comparison lecture is referred to as RL (Responsive Lecturing), since the comparison lecturer both fielded student questions and posed questions to the students during lecture. Each lecture had about 100 students.
Wright asked a number of the faculty in the Chemistry department
(including the colleague who would run the "comparison"
lecture) what they would consider the most convincing sort of
evidence that might demonstrate the student learning gains associated
with his reforms. After some discussion it was decided that an
oral exam conducted by faculty members from science departments
"other" than Chemistry would provide the most convincing
evidence.
Twenty-five faculty volunteers were found and the oral assessments
were conducted at the end of the semester. (The oral exam counted
for a small portion of each student's final grade.) Each
assessor typically saw about eight students for 30 minutes each
(about half from Wright's SAL lecture, the rest from the comparison,
RL, lecture). A total of 180 students participated in the oral
exams. All students in both lectures were divided into eight
groups, or octiles, of consecutive rankings based on the total
points they earned in the pre-requisite Chem 109 course. To ensure
that a given oral assessor saw students of similar Chem 109 achievement
levels, each assessor's students were chosen from the same Chem
109 octile. Assessors were unaware of the Chem 110 lecture in
which the students were enrolled.
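The octile grouping described above could be sketched as follows. This is a minimal illustration, not the procedure the study actually used: the function name `assign_octiles`, the student identifiers, and the point totals are all hypothetical.

```python
def assign_octiles(points_by_student):
    """Split students into 8 groups of consecutive rank by prior-course points.

    Hypothetical helper: octile 1 holds the highest-scoring students,
    octile 8 the lowest-scoring.
    """
    # Rank students from highest to lowest total points (ties keep input order).
    ranked = sorted(points_by_student, key=points_by_student.get, reverse=True)
    n = len(ranked)
    octiles = {}
    for position, student in enumerate(ranked):
        # position * 8 // n maps rank positions 0..n-1 onto groups 0..7.
        octiles[student] = position * 8 // n + 1
    return octiles


# Made-up students and Chem 109 point totals, for illustration only.
example = {f"s{i:02d}": 1000 - 10 * i for i in range(16)}
octiles = assign_octiles(example)
```

Each assessor's students would then be drawn from a single octile, so that comparisons within an assessor are between students of similar prior achievement.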
Each oral assessor devised their own exam and their own criteria for demonstration of competence, all of which were detailed to LEAD Center staff both in writing and through one-on-one interviews. Assessors were asked to construct a "relative" ranking of the students they interviewed from 1 (most competent) to n (least competent), where n is the number of students they saw, as well as to rate "how far apart/close" the students were by placing hash-marks on a linear scale. In addition, assessors were asked to provide an "absolute" rating of student competence from 1 to 6 (similar to a grade).
The analysis of assessor criteria led to some interesting findings as to the nature of the cognitive gains that one might expect from the introduction of "active learning" methods in large lecture science courses.
The relative rankings posed some non-trivial statistical problems due to the correlation inherent in such a scoring rubric; i.e., if a given student was ranked 1, then no other student seen by the same assessor could be ranked 1. This was unfortunate, since the relative rankings were seen as the most sensitive indicator of contrasts between the experimental (SAL) and comparison (RL) lectures. (This was due to the fact that each assessor interviewed students with similar achievement levels in the pre-requisite Chem 109 course.) Since the relative rankings assigned to the students were not independent, attention was shifted to the assessors as the experimental unit. Namely, each assessor was assigned the
"score" = the average of the RL students' rankings minus the average of the SAL students' rankings
where each average is taken only over those students interviewed by the given assessor. These 25 scores were independent observations, for which the Mann-Whitney Matched Pair Signed-Rank Test (the procedure more commonly called the Wilcoxon signed-rank test) and the cruder, but more generally applicable, Sign-Test provided appropriate test statistics (cf. Chart 1 below).
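As a concrete illustration of the per-assessor "score", the sketch below computes it for one assessor. The function name `assessor_score` and the eight rankings are invented for this example and are not data from the study.

```python
from statistics import mean


def assessor_score(rankings):
    """Average RL rank minus average SAL rank for a single assessor.

    rankings: list of (lecture, rank) pairs, where rank 1 = most competent.
    A positive result means the SAL students were ranked better on average.
    """
    rl = [rank for lecture, rank in rankings if lecture == "RL"]
    sal = [rank for lecture, rank in rankings if lecture == "SAL"]
    return mean(rl) - mean(sal)


# Hypothetical assessor who interviewed eight students (four per lecture).
example = [("SAL", 1), ("RL", 2), ("SAL", 3), ("SAL", 4),
           ("RL", 5), ("RL", 6), ("SAL", 7), ("RL", 8)]
score = assessor_score(example)  # 5.25 - 3.75 = 1.5
```

Because each assessor contributes exactly one number, the dependence among the ranks given by the same assessor no longer enters the analysis.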
The Mann-Whitney test takes account of both the magnitude and the sign (+ or -) of the 25 "scores," but assumes that the distributions of average relative rankings of the RL and SAL students have the same "shape" for all assessors (i.e., the RL and SAL distributions differ only by a translation). The corresponding p-value gives the probability, under the Null-Hypothesis that the translation is zero, of seeing the observed value of the statistic or higher; that is, it measures how surprising the observed data would be if there were no difference between the RL and SAL distributions, not the probability that there is no difference. Meanwhile, the Sign-Test takes account only of the sign (+ or -) of the 25 scores. The corresponding p-value gives the probability of seeing the observed number of +'s or more under the Null-Hypothesis that the chance of a + is 50-50 for each assessor (i.e., that there is no systematic difference between an assessor's average RL rank and his/her average SAL rank). Thus, no assumption is made as to the shape of the distribution of average ranks.
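To make the contrast between the two tests concrete, the following sketch computes the rank sums on which a matched-pair signed-rank test is based. It is a simplified illustration under stated assumptions: zero scores are dropped, tied magnitudes share an average rank, no p-value is computed, and the six scores are invented rather than taken from the study.

```python
def signed_rank_sums(scores):
    """Return (W_plus, W_minus), the rank sums of positive and negative scores.

    Ranks are assigned to the absolute values of the non-zero scores;
    tied magnitudes receive the average of the ranks they span.
    """
    # Zero differences carry no sign information and are dropped.
    ordered = sorted((s for s in scores if s != 0), key=abs)
    w_plus = w_minus = 0.0
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        # Items i..j-1 are tied and occupy ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2
        for s in ordered[i:j]:
            if s > 0:
                w_plus += avg_rank
            else:
                w_minus += avg_rank
        i = j
    return w_plus, w_minus


# Hypothetical assessor scores, for illustration only.
w_plus, w_minus = signed_rank_sums([1.5, -0.5, 2.0, 0.5, -1.0, 3.0])
```

A large imbalance between the two sums is what the signed-rank test treats as evidence of a shift, whereas the Sign-Test would use only the count of positive scores (four of six in this made-up example).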
The analysis of assessor criteria for rating student competency led to the interesting finding that the assessors were easily divided into two types: those who focused their criteria on the "process" by which students understood and solved the problems discussed in their oral exams, and those who focused on the "outcomes" of students' analytical processes. The "Process" group (15 of the 25 assessors) rated students highly primarily for demonstrating what have come to be called "expert" problem-solving strategies (e.g., self-correction, awareness of different problem-solving strategies, being able to form and grasp a "big picture," etc.). On the other hand, the "Outcomes" group (10 of the 25 assessors) rated students highly primarily according to the extent to which students' analytical processes (be they expert or not) led them to correct solutions.
Chart 1 below shows the "score" of each of the 25 assessors as described above, namely, the average of the ranks the assessor assigned to the RL students minus the average of the ranks assigned to the SAL students. Note that 1 was the highest possible rank, so positive differences indicate that the SAL students received the higher relative rankings. Note that 18 of the 25 scores are positive: the Sign-Test p-value indicated in the chart is simply the probability that a "fair" coin tossed 25 times would yield 18 or more heads. The Mann-Whitney Matched Pair Signed-Rank Test p-value takes account of both the sign and the magnitude of the differences in average ranks. Roughly speaking, it indicates the likelihood of seeing an "average" assessor score as high as or higher than the observed average of the 25 scores under the Null-hypothesis that each assessor's "score" is an independent sample from a common distribution symmetric about zero (i.e., that for each assessor the distributions of average RL and average SAL ranks are the same).
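The coin-toss probability just described can be computed directly from the binomial distribution. The sketch below evaluates the one-sided tail for 18 or more heads in 25 tosses; the function name `sign_test_p_value` is hypothetical, and the value shown in the chart may be rounded or reported differently.

```python
from math import comb


def sign_test_p_value(n_positive, n_total):
    """One-sided P(X >= n_positive) for X ~ Binomial(n_total, 1/2)."""
    # Count the equally likely sign patterns with at least n_positive pluses.
    favorable = sum(comb(n_total, k) for k in range(n_positive, n_total + 1))
    return favorable / 2 ** n_total


p = sign_test_p_value(18, 25)
print(f"{p:.4f}")  # prints 0.0216
```

In other words, if each assessor were equally likely to favor either lecture, a split at least as lopsided as 18 to 7 in favor of SAL would occur only about 2% of the time.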
An interesting by-product of this study arose from the analysis of each assessor's criteria for assessing student competency during their oral exams. Charts 2 and 3 below show that the SAL students were far more likely to receive higher relative rankings among the "Process" group of assessors, whereas the relative rankings were much more evenly split between RL and SAL students among the "Outcomes" assessors.
However, the reader should note that professors rated most of the students in both lectures quite highly on their "absolute" scales of competency, i.e., when they were asked not to compare the students with each other, but rather just to give each student an absolute rating from 1 to 6 (similar to a grade). Chart 4 below shows that students from both lectures received an average absolute competency rating above 4.0 on a scale from 1 to 6. Nonetheless, the Wilcoxon-Mann-Whitney test found the difference in means significant, with a p-value < 0.001.
These results are consistent with the interpretation that "active learning" strategies can be implemented in such a way that they produce an increase in "higher-order" cognitive gains compared to traditional pedagogy, and that, in addition, students perform as well or better on more traditional measures of assessment.