Comprehensive Assessment Research Review: Annotated Bibliography

January 29, 2014; updated March 1, 2015

Alvarez, L., & Corn, J. (2008). Exchanging Assessment for Accountability: The Implications of High-Stakes Reading Assessments for English Learners (PDF). Language Arts, 85(5), 354-365. Teacher research demonstrates the detrimental effects on English learners of replacing authentic literacy assessments with standardized assessments designed primarily for purposes of accountability.

Andrade, H. (2007). Self-Assessment Through Rubrics (PDF). Educational Leadership, 65(4), 60-63. Rubrics can be a powerful self-assessment tool -- if teachers disconnect them from grades and give students time and support to revise their work.

Andrade, H., Du, Y., & Mycek, K. (2010). Rubric-Referenced Self-Assessment and Middle School Students’ Writing. Assessment in Education: Principles, Policy & Practice, 17(2), 199-214. This study investigated the relationship between 162 middle school students’ scores for a written assignment and a process that involved students in generating criteria and self-assessing with a rubric. In one condition, a model essay was used to generate a list of criteria for an effective essay, and students reviewed a written rubric and used the rubric to self-assess first drafts. The comparison condition involved generating a list of criteria and reviewing first drafts. The results suggest that reading a model, generating criteria, and using a rubric to self-assess can help middle school students produce more-effective writing.

Andrade, H., Du, Y., & Wang, X. (2008). Putting Rubrics to the Test: The Effect of a Model, Criteria Generation, and Rubric-Referenced Self-Assessment on Elementary School Students’ Writing. Educational Measurement: Issues and Practice, 27(2), 3-13. Third- and fourth-grade students (N = 116) in the experimental condition used a model paper to scaffold the process of generating a list of criteria for an effective story or essay. They received a written rubric and used the rubric to self-assess first drafts. Matched students in the control condition generated a list of criteria for an effective story or essay and reviewed first drafts. Findings include a main effect of treatment and of previous achievement on total writing scores. The results suggest that using a model to generate criteria for an assignment and using a rubric for self-assessment can help elementary school students produce more-effective writing.

Andrade, H., & Valtcheva, A. (2009). Promoting Learning and Achievement Through Self-Assessment. Theory Into Practice, 48(1), 12-19. doi:10.1080/00405840802577544. The authors describe how to do criteria-referenced self-assessment, and they review research in which criteria-referenced self-assessment has been shown to promote achievement. Criteria-referenced self-assessment is a process during which students collect information about their own performance or progress; compare it to explicitly stated criteria, goals, or standards; and revise accordingly. The purpose of self-assessment is to identify areas of strength and weakness in one’s work in order to make improvement and promote learning.

Bandura, A. (1997). Self-Efficacy: The Exercise of Control. New York, NY: W. H. Freeman and Company. Self-Efficacy is the result of more than 20 years of research by the psychologist Albert Bandura and related research that has emerged from Bandura’s original work. The book is based on Bandura’s theory that those with high self-efficacy expectancies -- the belief that one can achieve what one sets out to do -- are healthier, more effective, and generally more successful than those with low self-efficacy expectancies.

Bennett, R. E. (2011). Formative Assessment: A Critical Review. Assessment in Education: Principles, Policy & Practice, 18(1), 5-25. doi:10.1080/0969594X.2010.513678. This paper takes a critical look at the research on formative assessment, raising concerns about the conclusions drawn from landmark studies such as Black & Wiliam (1998). Bennett argues that the term “formative assessment” is problematic since it is often used to capture a wide range of practices. Furthermore, formative assessment lacks a sufficient body of peer-reviewed, methodologically rigorous studies to support a thorough analysis of its effectiveness. He concludes by stating that additional research is needed.

Black, P., Harrison, C., Hodgen, J., Marshall, B., & Serret, N. (2010). Validity in Teachers’ Summative Assessments. Assessment in Education: Principles, Policy & Practice, 17(2), 215-232. This paper describes some of the findings of a project that set out to explore and develop teachers’ understanding and practices in their summative assessments. The focus was on those summative assessments that are used on a regular basis within schools for guiding the progress of pupils and for internal accountability. The project combined both intervention and research elements. The intervention aimed both to explore how teachers might improve those practices in light of their reexamination of their validity and to engage them in moderation exercises within and between schools to audit examples of students’ work and to discuss their appraisals of these examples. It was found that teachers’ attention to validity issues had been undermined by the external test regimes, but teachers could readdress these issues by reflection on their values and by engagement in a shared development of portfolio assessments.

Black, P., & Wiliam, D. (1998, October). Inside the Black Box: Raising Standards Through Classroom Assessment (PDF). Phi Delta Kappan, 92(1), 81-90. Black and Wiliam conducted a review of 250 book chapters and journal articles, finding firm evidence that innovations designed to strengthen the practice of formative assessment yield substantial and significant learning gains. Learning gains are measured by comparing the average improvements in the test scores of pupils, represented by the statistical size of the effect. Typical effect sizes of the formative-assessment experiments were between 0.4 and 0.7 and are larger than most of those found for educational interventions. An effect size gain of 0.7 in the recent international comparative studies in mathematics would have raised the score of a nation in the middle of the pack of 41 countries (e.g., the United States) to one of the top five. The authors conclude that “while formative assessment can help all pupils, it yields particularly good results with low achievers by concentrating on specific problems with their work and giving them a clear understanding of what is wrong and how to put it right.” The authors recommend that “feedback to any pupil should be about the particular qualities of his or her work, with advice on what he or she can do to improve, and should avoid comparisons with other pupils.” In addition, three elements of feedback are defined: recognition of the desired goal, evidence about present position, and some understanding of a way to close the gap between the two. The authors also point out that sustained programs of professional development and support are required “if the substantial rewards promised by the research evidence are to be secured,” so that each teacher can “find his or her own ways of incorporating [feedback] into his or her own patterns of classroom work and into the cultural norms and expectations of a particular school community.”
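The effect sizes Black and Wiliam cite are standardized mean differences (Cohen's d): the gap between the treatment and comparison group means divided by their pooled standard deviation. As a minimal sketch of how such a figure is computed -- all scores below are hypothetical, not data from the studies reviewed here -- the calculation looks like this:

```python
import math

def cohens_d(treatment, control):
    """Cohen's d: the standardized mean difference between two groups,
    using the pooled standard deviation as the scale."""
    n1, n2 = len(treatment), len(control)
    m1 = sum(treatment) / n1
    m2 = sum(control) / n2
    # Sample variances (n - 1 in the denominator)
    var1 = sum((x - m1) ** 2 for x in treatment) / (n1 - 1)
    var2 = sum((x - m2) ** 2 for x in control) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical test scores: a class using formative assessment vs. a comparison class
treated = [78, 85, 82, 90, 74, 88, 81, 79]
comparison = [72, 80, 75, 83, 70, 77, 74, 76]
print(round(cohens_d(treated, comparison), 2))
```

Read this way, an effect size of 0.7 means the average pupil in the treatment group scored 0.7 pooled standard deviations above the average pupil in the comparison group.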

Black, P., & Wiliam, D. (2009). Developing the Theory of Formative Assessment (PDF). Educational Assessment, Evaluation and Accountability, 21(1), 5-31. doi:10.1007/s11092-008-9068-5. This article provides a unifying framework for the diverse set of formative-assessment practices and aims to help practitioners implement the practices more fruitfully.

Bransford, J. D., Brown, A. L., & Cocking, R. R. (Eds.). (2000). Learning and Transfer. In How People Learn: Brain, Mind, Experience, and School (pp. 51-78). Washington, DC: National Academy Press. The authors explore key characteristics of learning and transfer that have important implications for education. The authors assert that all new learning involves transfer based on previous learning and that transfer from school to everyday environments is the ultimate purpose of school-based learning. Transfer is supported by abstract representations of knowledge and best viewed as an active, dynamic process rather than a passive end product of a particular set of learning experiences. Helping learners choose, adapt, and invent tools for solving problems is one way to facilitate transfer while also encouraging flexibility. Adaptive expertise, which involves the ability to monitor and regulate understanding in ways that promote learning, is an important model for students to emulate.

Briggs, D. C., Ruiz-Primo, M. A., Furtak, E., Shepard, L., & Yin, Y. (2012). Meta-Analytic Methodology and Inferences About the Efficacy of Formative Assessment. Educational Measurement: Issues and Practice, 31(4), 13-17. doi:10.1111/j.1745-3992.2012.00251.x. This paper is a commentary on the debate around formative assessment research, focusing on inconsistent results regarding its effectiveness. While Black and Wiliam (1998) found an effect size of 0.4 to 0.7 (moderate), Kingston and Nash (2011) found an effect size of 0.2 (small) in their own meta-analysis. Briggs et al. point out methodological concerns with Kingston and Nash’s analysis, and argue that additional research is needed.

Carlson, D., Borman, G. D., & Robinson, M. (2011). A Multistate District-Level Cluster Randomized Trial of the Impact of Data-Driven Reform on Reading and Mathematics Achievement. Educational Evaluation and Policy Analysis, 33(3), 378-398. In a randomized experiment spanning seven states and more than 500 schools (59 districts for the reading portion of the project and 57 districts for the math portion), approximately half of the participating districts were randomly offered quarterly benchmark student assessments and received extensive training on interpreting and using the data to guide reform. The benchmark assessments monitored the progress of children in grades 3-8 (3-11 in Pennsylvania) in mathematics and reading and guided data-driven reform efforts. The outcome measure was school-level performance on state-administered achievement tests. The Center for Data-Driven Reform in Education model was found to have a statistically significant positive effect on student mathematics achievement. In reading, the results were positive but did not reach statistical significance.

Chang, C.-C. (2009). Self-Evaluated Effects of Web-Based Portfolio Assessment System for Various Student Motivation Levels. Journal of Educational Computing Research, 41(4), 391-405. The purpose of this study was to explore the self-evaluated effects of a Web-based portfolio assessment system on various categories of students’ motivation. The subjects for this study were the students of two computer classes in a junior high school. The experimental group used the Web-based portfolio assessment system whereas the control group used traditional assessment. The results reveal that the Web-based portfolio assessment system was more effective, in terms of self-evaluated learning effects, for low-motivation students.

Chi, B., Snow, J. Z., Goldstein, D., Lee, S., & Chung, J. (2010). Project Exploration: 10-Year Retrospective Program Evaluation Summative Report. This report describes the independent evaluation, conducted in 2010, by the Center for Research, Evaluation, and Assessment (REA) at the Lawrence Hall of Science, University of California, Berkeley. The evaluators undertook a 10-year retrospective study of Project Exploration programming and participation by nearly 1,000 Chicago public school students. The survey and follow-up interviews attempted to surface factors that affected students’ decisions to get involved and stay involved with science. Key findings from the REA study include the following: increased science capacity; positive youth development; and engagement in a community of practice that nurtured relationships and helped students learn from one another, envision careers in science, and feel good about their futures.

Cohen, G. L., Garcia, J., Apfel, N., & Master, A. (2006). Reducing the Racial Achievement Gap: A Social-Psychological Intervention (PDF). Science, 313(5791), 1307-1310. In two field studies, students were led to self-affirm in order to assess the consequences on academic performance. In these studies (separated by a year and composed of a separate set of students), seventh-grade students at a racially diverse middle school in the northeast United States were randomly assigned to self-affirm or not to self-affirm as part of a brief classroom exercise. Students who self-affirmed did so by indicating values that were important to them and writing a paragraph indicating why those values were important. Students who did not self-affirm indicated their least important values and wrote a paragraph regarding why those values might be important to others. The effects on academic performance during the term were dramatic. African American students who had been led to self-affirm performed about 0.3 grade points better during the term than those who had not. Moreover, benefits occurred regardless of preintervention levels of demonstrated ability. The self-affirmation intervention appears to have attenuated a drop in performance occurring for the African American students.

Cohen, G. L., Steele, C. M., & Ross, L. D. (1999). The Mentor’s Dilemma: Providing Critical Feedback Across the Racial Divide (PDF). Personality and Social Psychology Bulletin, 25(10), 1302-1318. Stereotype threat is eliminated and motivation and domain identification are increased by so-called wise mentoring that offers criticism accompanied by high expectations and the view that each student is capable of reaching those expectations. Across two experiments, an emphasis on high standards and student capability eliminated perceived bias, eliminated differences in motivation based on race, and preserved identification with the domain in question. These results suggest that feedback that might be viewed in terms of negative stereotypes differs in effectiveness, according to the presence of an emphasis on high standards and assurance that the individual can meet those standards.

Council of Chief State School Officers (CCSSO). (2013, February). Knowledge, Skills, and Dispositions: The Innovation Lab Network State Framework for College, Career, and Citizenship Readiness, and Implications for State Policy (PDF) (CCSSO White Paper). This white paper communicates the shared framework and definitional elements of college, career, and citizenship readiness accepted by Innovation Lab Network (ILN) chief state school officers in June 2012. Going forward, each ILN state has committed to adopting a definition of college and career readiness that is consistent with these elements, although precise language may be adapted, and to reorienting its education system in pursuit of this goal.

Danaher, K., & Crandall, C. S. (2008). Stereotype Threat in Applied Settings Re-Examined. Journal of Applied Social Psychology, 38(6), 1639-1655. Given the importance of standardized-test performance in determining educational opportunities, career paths, and life choices, Danaher and Crandall argue that the use of standard statistical decision criteria is misplaced in this context. Accordingly, they reexamined the data presented by Stricker and Ward (2004), using criteria of p < .05 from the overall analysis of variance and η ≥ .05 as the standard. Results indicate that soliciting identity information at the end rather than at the beginning of the test-taking session shrank sex differences in performance by 33 percent. When test takers did not report their identities before the test, women’s performance improved noticeably and men’s scores declined slightly. The authors conclude that soliciting social-identity information prior to test taking does produce small differences in performance, consistent with previous findings in the stereotype-threat literature, that, when generalized to the population of test takers, can produce profound differences in outcomes for members of different groups.

Darling-Hammond, L., & Adamson, F. (2010). Beyond Basic Skills: The Role of Performance Assessment in Achieving 21st Century Standards of Learning (PDF). Stanford, CA: Stanford University, Stanford Center for Opportunity Policy in Education. This paper is the culminating report of a Stanford University project aimed at summarizing research and lessons learned regarding the development, implementation, consequences, and costs of performance assessments. A set of seven papers was commissioned to examine experiences with and lessons from large-scale performance assessment in the United States and abroad, including technical advances, feasibility issues, policy implications, usage with English-language learners, and costs.

Darling-Hammond, L., & Barron, B. (2008). Teaching for Meaningful Learning: A Review of Research on Inquiry-Based and Cooperative Learning (PDF). In Powerful Learning: What We Know About Teaching for Understanding (pp. 11-16). San Francisco, CA: Jossey-Bass. This is a comprehensive review of research on inquiry-based-learning outcomes and approaches, including project-based learning, problem-based learning, and design-based instruction. Darling-Hammond and Barron describe the following evidence-based approaches that support inquiry-based teaching in the classroom: (1) clear goals and carefully designed guiding activities; (2) a variety of resources (e.g., museums, libraries, Internet, videos, lectures) and time for students to share, reflect, and apply knowledge while thinking through classroom dilemmas more productively; (3) participation structures and classroom norms that increase the use of discussion and a culture of collaboration (e.g., framing discussions to allow for addressing misconceptions midproject and using public performances); (4) formative assessments that provide opportunities for revision; and (5) assessments that are multidimensional. Ultimately, these practices will support students in evaluating their own work against predefined rubrics and promote assessment, knowledge development, and collaboration.

Darling-Hammond, L., Herman, J., Pellegrino, J., et al. (2013). Criteria for High-Quality Assessment (PDF). Stanford, CA: Stanford University, Stanford Center for Opportunity Policy in Education. The Common Core State Standards, adopted by 45 states, feature an increased focus on deeper learning, or students’ ability to analyze, synthesize, compare, connect, critique, hypothesize, prove, and explain their ideas. This report provides a set of criteria for high-quality student assessments. These criteria can be used by assessment developers, policy makers, and educators as they work to create and adopt assessments that promote deeper learning of 21st-century skills that students need to succeed in today’s knowledge-based economy.

Duckworth, A. L., Grant, H., Loew, B., Oettingen, G., & Gollwitzer, P. M. (2011). Self-Regulation Strategies Improve Self-Discipline in Adolescents: Benefits of Mental Contrasting and Implementation Intentions (PDF). Educational Psychology, 31(1), 17-26. doi:10.1080/01443410.2010.506003. Sixty-six second-year high school students who were preparing during English class to take a high-stakes exam (the PSAT) by practicing the writing section were randomly assigned to one of two conditions: a 30-minute written “mental contrasting combined with implementation intentions” (MCII) exercise or a control condition. All students answered a question about the likelihood of accomplishing a goal (“How likely do you think it is that you will complete all 10 practice tests in the PSAT workbook?”), wrote about the importance of that goal, and listed two positive outcomes associated with completing that goal and two obstacles that could interfere. Students in the control condition wrote a short essay about an influential person or event in their life. Students in the MCII condition elaborated in writing on both the positive outcomes and obstacles of the goal by imagining each as vividly as possible. They then rewrote both obstacles and proposed a specific solution for each one by writing three “if-then” plans (i.e., implementation intentions) in this form: “If [obstacle], then I will [solution].” The third if-then specified where and when they would complete the workbook. Students in the MCII condition completed 60 percent more practice questions than did controls. The authors conclude that “these findings point to the utility of directly teaching to adolescents mental contrasting with implementation intentions as a self-regulatory strategy of successful goal pursuit.”

Duckworth, A. L., Peterson, C., Matthews, M. D., & Kelly, D. R. (2007). Grit: Perseverance and Passion for Long-Term Goals (PDF). Journal of Personality and Social Psychology, 92(6), 1087-1101. doi:10.1037/0022-3514.92.6.1087. The authors tested the importance of one noncognitive trait: grit. Defined as perseverance and passion for long-term goals, grit accounted for an average of 4 percent of the variance in success outcomes, including educational attainment among two samples of adults; grade point average among Ivy League undergraduates; retention among two classes at the United States Military Academy, West Point; and ranking in the Scripps National Spelling Bee. Grit did not relate positively to IQ but was highly correlated with “conscientiousness.” The authors conclude that achieving difficult goals involves not only talent but also sustained and focused application of talent over time.

Duckworth, A. L., & Quinn, P. D. (2009). Development and Validation of the Short Grit Scale (Grit-S) (PDF). Journal of Personality Assessment, 91(2), 166-174. doi:10.1080/00223890802634290. This paper validates the use of a shorter version of the Grit Scale, which measures the trait of perseverance and passion for long-term goals. The shorter version (Grit-S) correlated with educational attainment and fewer career changes among adults and predicted GPA among adolescents in addition to inversely predicting hours watching TV. Among West Point cadets, the Grit-S predicted retention, and among Scripps National Spelling Bee competitors, the Grit-S predicted final round attained, a relationship mediated by spelling practice.

Dweck, C. S. (2006). Mindset: The New Psychology of Success. New York, NY: Ballantine Books/Random House Publishing. Dweck shows how mindset takes shape in childhood and, in adulthood, influences every area of our lives: work, athletics, relationships, and child rearing. Dweck details the ways in which creative talents across all genres -- music, literature, science, sports, business -- use the growth mindset to get results. She also demonstrates how we can change our mindset at any time to achieve true success and fulfillment. Dweck covers a range of applications and helps parents and educators see how they can promote the growth mindset.

Gersten, R., Beckmann, S., Clarke, B., Foegen, A., Marsh, L., Star, J. R., & Witzel, B. (2009). Assisting Students Struggling With Mathematics: Response to Intervention (RtI) for Elementary and Middle Schools (PDF) (NCEE 2009-4060). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education. This guide provides eight specific recommendations intended to help teachers, principals, and school administrators use response to intervention to identify students who need assistance in mathematics and to address the needs of these students through focused interventions. The guide provides suggestions on how to carry out each recommendation and explains how educators can overcome potential roadblocks to implementing the recommendations.

Gersten, R., Compton, D., Connor, C. M., Dimino, J., Santoro, L., Linan-Thompson, S., & Tilly, W. D. (2008). Assisting Students Struggling With Reading: Response to Intervention (RtI) and Multi-Tier Intervention for Reading in the Primary Grades (PDF) (NCEE 2009-4045). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education. This guide offers five specific recommendations to help educators identify struggling readers and implement evidence-based strategies to promote their reading achievement. Teachers and reading specialists can utilize these strategies to implement RtI and multitier intervention methods at the classroom or school level. Recommendations cover screening students for reading problems, designing a multitier intervention program, adjusting instruction to help struggling readers, and monitoring student progress.

Gielen, S., Peeters, E., Dochy, F., Onghena, P., & Struyven, K. (2010). Improving the Effectiveness of Peer Feedback for Learning. Learning and Instruction, 20(4), 304-315. doi:10.1016/j.learninstruc.2009.08.007. A quasi-experimental repeated-measures design examined the effectiveness of (a) peer feedback for learning, more specifically, certain characteristics of the content and style of the provided feedback, and (b) a particular instructional intervention to support the use of the feedback. Writing assignments of 43 students in grade seven in secondary education showed that receiving “justified” comments in feedback improves performance, but this effect diminishes for students with better pretest performance. The presence of justification mattered more than the accuracy of the comments. The instructional intervention of asking students who received peer assessment to reflect upon feedback after peer assessment did not increase learning.

Goldenberg, C. (1992/1993). Instructional Conversations: Promoting Comprehension Through Discussion. The Reading Teacher, 46(4), 316-326. This article describes an instructional-conversation model, developed in collaboration with elementary school teachers, for promoting discussion-based learning opportunities that build students’ comprehension.

Gollwitzer, P. M., & Sheeran, P. (2006). Implementation Intentions and Goal Achievement: A Meta-Analysis of Effects and Processes (PDF). Advances in Experimental Social Psychology, 38, 69-119. Holding a strong goal intention (“I intend to reach Z!”) does not guarantee goal achievement because people may fail to deal effectively with self-regulatory problems during goal striving. This review analyzes whether realization of goal intentions is facilitated by forming an implementation intention that spells out the when, where, and how of goal striving in advance (“If situation Y is encountered, then I will initiate goal-directed behavior X!”). Findings from 94 independent tests showed that implementation intentions had a positive effect of medium-to-large magnitude (d = .65) on goal attainment. Implementation intentions were effective in promoting the initiation of goal striving, the shielding of ongoing goal pursuit from unwanted influences, disengagement from failing courses of action, and conservation of capability for future goal striving. There was also strong support for postulated component processes: Implementation-intention formation both enhanced the accessibility of specified opportunities and automated respective goal‐directed responses. Several directions for future research are outlined.

Griffin, P. (2007). The Comfort of Competence and the Uncertainty of Assessment (PDF). Studies in Educational Evaluation, 33(1), 87-99. This article argues that a probabilistic interpretation of competence can provide the basis for a link between assessment, teaching and learning, curriculum resources, and policy development. Competence is regarded as a way of interpreting the quality of performance in a coherent series of hierarchical tasks. The work of Glaser is combined with that of Rasch and Vygotsky. When assessment performance is reported in terms of competence levels, the score is simply a code for a level of development and helps to indicate Vygotsky’s zone of proximal development in which the student is ready to learn.

Hattie, J. (2009). Visible Learning: A Synthesis of Over 800 Meta-Analyses Relating to Achievement. New York, NY: Routledge. Hattie analyzed a total of about 800 meta-analyses, encompassing 52,637 studies, 146,142 effect sizes, and millions of students. Hattie points out that in education, most things work, more or less, and sets out to identify educational practices that work best and therefore best repay the effort invested. According to Hattie, the simplest prescription for improving teaching is to provide “dollops of feedback.” Providing students with feedback had one of the largest effect sizes on learning compared with other interventions studied.

Karegianes, M. L., Pascarella, E. T., & Pflaum, S. W. (1980). The Effects of Peer Editing on the Writing Proficiency of Low-Achieving Tenth Grade Students. The Journal of Educational Research, 73(4), 203-207. This article found peer editing to be as effective as, if not more effective than, teacher editing for low-achieving students in 10th grade.

Kingston, N., & Nash, B. (2011). Formative Assessment: A Meta-Analysis and a Call for Research. Educational Measurement: Issues and Practice, 30(4), 28-37. doi:10.1111/j.1745-3992.2011.00220.x. This meta-analysis reviews the research on formative assessment, re-examining Black and Wiliam’s (1998) claim that it has an effect size of 0.4-0.7 (moderate) on student learning. Kingston and Nash evaluated each study in Black and Wiliam’s meta-analysis and discovered many that had flawed research designs, reducing the number of valid studies from 300 to 13. Upon conducting their own meta-analysis, they found an effect size of 0.2 (small effect).

Koh, K. (2011). Improving Teachers’ Assessment Literacy Through Professional Development. Teaching Education, 22(3), 255-276. This study examined the effects of professional development on teachers’ assessment literacy between two groups of teachers: (1) teachers who were involved in ongoing and sustained professional development in designing authentic classroom assessment and rubrics and (2) teachers who were given only short-term, one-shot professional-development workshops in authentic assessment. The participating teachers taught fourth- and fifth-grade English, science, and mathematics. The teachers who were involved in ongoing, sustained professional development showed significantly increased understanding of authentic assessment.

Liem, G. A. D., Ginns, P., Martin, A. J., Stone, B., & Herrett, M. (2012). Personal Best Goals and Academic and Social Functioning: A Longitudinal Perspective. Learning and Instruction, 22(3), 222-230. This study examined the role of personal best (PB) goals in academic and social functioning. Alongside academic and social outcome measures, PB goal items were administered to 249 high school students at the beginning and end of their school year. Personal best goals were correlated with a range of positive variables at Time 1; moreover, at Time 2 the effects of personal best goals on deep learning, academic flow, and positive teacher relationship remained significant after controlling for prior variance of corresponding Time 1 factors, suggesting that students with personal best goals show sustained resilience in academic and social development.

Marx, D. M., Stapel, D. A., & Muller, D. (2005). We Can Do It: The Interplay of Construal Orientation and Social Comparisons Under Threat. Journal of Personality and Social Psychology, 88(3), 432. Results of four experiments showed that women tended to perform as well as men on a math test when the test was administered by a woman with high competence in math, but they performed more poorly (and showed a lower state of self-esteem) when the test was administered by a man. Results indicated that these effects were due to the perceived competence, and not just the gender, of the experimenter.

Mento, A. J., Steel, R. P., & Karren, R. J. (1987). A Meta-Analytic Study of the Effects of Goal Setting on Task Performance: 1966-1984. Organizational Behavior and Human Decision Processes, 39(1), 52-83. This meta-analysis of published research from 1966 to 1984 focuses on the relationship between goal-setting variables and task performance. The analyses “yielded support for the efficacy of combining specific hard goals with feedback versus specific hard goals without feedback.”

Murphy, P. K., Wilkinson, I. A. G., Soter, A. O., Hennessey, M. N., & Alexander, J. F. (2009). Examining the Effects of Classroom Discussion on Students’ High-Level Comprehension of Text: A Meta-Analysis. Journal of Educational Psychology, 101(3), 740-764. This comprehensive meta-analysis of empirical studies examined evidence of the effects of classroom discussion on measures of teacher and student talk and on individual student comprehension, critical-thinking, and reasoning outcomes. Results revealed that several discussion approaches produced strong increases in the amount of student talk, with concomitant reductions in teacher talk, as well as substantial improvements in text comprehension. However, while many approaches were effective at increasing students’ literal and inferential comprehension, few were effective at promoting critical thinking and reasoning. While the range of ages of participants in the reviewed studies was large, a majority of studies were conducted with students in grades 4-6.

National Education Association (NEA) (2012). Preparing 21st Century Students for a Global Society: An Educator’s Guide to the “Four Cs” (PDF). NEA, in collaboration with other U.S. professional organizations, developed this guide to help educators integrate policies and practices for building the “Four Cs” (critical thinking, communication, collaboration, and creativity) into their own instruction. They argue that what was considered a good education 50 years ago is no longer enough for success in college, career, and citizenship in the 21st century.

Parker, W. C., Lo, J., Yeo, A. J., Valencia, S. W., Nguyen, D., Abbott, R. D., . . . & Vye, N. J. (2013). Beyond Breadth-Speed-Test: Toward Deeper Knowing and Engagement in an Advanced Placement Course. American Educational Research Journal, 50(6), 1424-1459. This mixed-methods-design experiment was conducted with 289 students in 12 classrooms across four schools in an “excellence for all” context of expanding enrollments and achieving deeper learning in AP U.S. Government and Politics. Findings suggest that quasi-repetitive projects can lead to higher scores on the AP test but a floor effect on the assessment of deeper learning. Implications are drawn for assessing deeper learning and helping students adapt to shifts in the grammar of schooling.

Reis, S. M., McCoach, D. B., Little, C. A., Muller, L. M., & Kaniskan, R. B. (2011). The Effects of Differentiated Instruction and Enrichment Pedagogy on Reading Achievement in Five Elementary Schools. American Educational Research Journal, 48(2), 462-501. Five elementary schools (63 teachers and 1,192 students in grades 2-5) were randomly assigned to differentiated or whole-group classroom instruction in reading. The differentiated approach focused on student engagement in reading using a three-phase model. The model begins with a book discussion and read-aloud that integrate reading strategies or higher-level-thinking questions, along with time for independent reading. Students then listened to other students read and received differentiated reading-strategy instruction in five-minute individual conferences or participated in literary discussions. Groups then chose among independent reading, creativity training, buddy reading, and other options. Differentiated instruction increased reading fluency in three of the five schools. The authors conclude that differentiated instruction was as effective as, if not more effective than, the traditional whole-group approach.

Rosenshine, B., & Meister, C. (1994). Reciprocal Teaching: A Review of the Research. Review of Educational Research, 64(4), 479-530. An analysis of 16 studies indicated that reciprocal teaching was effective as long as the quality of instruction was reasonably high. The effect size was much larger for experimenter-developed comprehension tests (short answers and passage summaries) than when standardized tests were used.

Rowe, M. B. (1974). Wait-Time and Rewards as Instructional Variables, Their Influence on Language, Logic, and Fate Control: Part One -- Wait-Time. Journal of Research in Science Teaching, 11(2), 81-94. The level of complexity in student responses rises as a teacher pauses after asking questions. Analysis of more than 300 tape recordings over six years of investigations showed mean wait time after teachers ask questions to be about one second. If students do not begin a response, teachers then repeat, rephrase, ask a different question, or call on another student. When mean wait times of three to five seconds are achieved through training, the length of student responses increases, the number of unsolicited but appropriate responses also increases, and failures to respond decrease.

Shute, V. J. (2008). Focus on Formative Feedback (PDF). Review of Educational Research, 78(1), 153-189. doi:10.3102/0034654307313795. Shute defines formative feedback as “information communicated to the learner intended to modify his or her thinking or behavior to improve learning.” One hundred and forty-one publications that met the criteria for inclusion serve as the basis for this review, which yields several guidelines for generating effective feedback, including the following: (1) Feedback to the learner should focus on the specific features of his or her work in relation to the task and provide suggestions on how to improve. (2) Feedback should focus on the “what, how, and why” of a problem. (3) Elaborated feedback should be presented in manageable units, and feedback should present only as much information as students need to correct answers on their own. (4) Feedback is more effective when it comes from a trusted source. And (5) immediate feedback is most helpful for procedural or conceptual learning, at the beginning of the learning process, and when the task is new and difficult (relative to the learner’s capability), while delayed feedback is best when tasks are simple (relative to the learner’s capability) or when transfer to other contexts is sought.

Slavin, R. E., Cheung, A., Holmes, G., Madden, N. A., & Chamberlain, A. (2012). Effects of a Data-Driven District Reform Model on State Assessment Outcomes. American Educational Research Journal, 50(2), 371-396. A district-level reform model created by the Center for Data-Driven Reform in Education (CDDRE) provided consultation with district leaders on strategic use of data and selection of proven programs. Fifty-nine districts in seven states were randomly assigned to CDDRE or control conditions. A total of 397 elementary schools and 225 middle schools were followed over a period of up to four years. Positive effects were found on reading outcomes in elementary schools by year four. An exploratory analysis found that reading effects were larger for schools that selected reading programs with good evidence of effectiveness than for those that did not.

Stecher, B. (2010). Performance Assessment in an Era of Standards-Based Educational Accountability (PDF). Stanford, CA: Stanford University, Stanford Center for Opportunity Policy in Education. This paper is one of eight written through a Stanford University project aimed at summarizing research and lessons learned regarding the development, implementation, consequences, and costs of performance assessments. The paper defines performance assessment and different types of performance tasks; reviews recent history of performance assessments in the United States; and summarizes research on the quality, impact, and burden of performance assessments used in large-scale K-12 achievement testing.

Steele, C. M., & Aronson, J. (1995). Stereotype Threat and the Intellectual Test Performance of African Americans (PDF). Journal of Personality and Social Psychology, 69(5), 797-811. This paper raised the possibility that culturally shared stereotypes suggesting poor performance by certain groups can, when made salient in a context involving the stereotype, disrupt the performance of an individual who identifies with that group. This effect was termed stereotype threat, and its existence and consequences were investigated in four experiments. Study 1 involved black and white college students who took a difficult test using items from the verbal GRE under one of three conditions. In the stereotype-threat condition, the test was described as diagnostic of intellectual ability. In a nonthreat condition, the test was described as simply a problem-solving exercise that was not diagnostic of ability. In a third condition, the test was again described as nondiagnostic, but participants were encouraged to view it as a challenge. Performance was compared across the conditions after statistically controlling for self-reported SAT scores. Black participants performed less well than their white counterparts in the diagnostic (stereotype-threat) condition, but in the nonthreat condition their performance was close to that of their white counterparts. Study 2 replicated this effect and also showed that black participants both completed fewer test items and answered fewer items correctly under stereotype threat. In Study 3, black and white undergraduates completed a task that was described either as evaluative (assessing strengths and weaknesses) or as not evaluative of ability, with experimenters encouraging students to try their best and telling them that they could learn about their abilities later.
When the task supposedly measured ability, African American participants showed heightened awareness of their racial identity (completing word fragments related to their race), more doubts about their ability (completing word fragments related to self-doubt), a greater tendency to make advance excuses for poor performance (i.e., self-handicapping), a tendency to avoid racially stereotypic preferences, and a lower likelihood of reporting their race compared with students in the low-threat condition. Study 4 sought to identify the conditions sufficient to activate stereotype threat by having undergraduates complete the nonthreat conditions from Studies 1 and 2. Unlike in those experiments, however, some students were asked to record their race before completing the test items. Performance was poorer only among African Americans whose racial identity had been made salient prior to testing. Together, these studies established the existence of stereotype threat and provided evidence that stereotypes suggesting poor performance, when made salient in a context involving the stereotyped ability, can disrupt performance, produce doubt about one’s abilities, and lead individuals to distance themselves from the stereotyped group.

Strobel, J., & van Barneveld, A. (2009). When Is PBL More Effective? A Meta-Synthesis of Meta-Analyses Comparing PBL to Conventional Classrooms. The Interdisciplinary Journal of Problem-Based Learning, 3(1). Researchers from Purdue University and Concordia University synthesized eight meta-analyses of problem-based learning (PBL) studies to evaluate the effectiveness of PBL and the conditions under which it is most effective. The meta-analyses covered medical students and adult learners in postsecondary settings. PBL was more effective than traditional instruction for long-term retention, skill development, and student and teacher satisfaction. Traditional approaches, on the other hand, were more effective for improving performance on standardized exams, which the researchers considered a measure of short-term retention.

Topping, K. J. (2009). Peer Assessment. Theory Into Practice, 48(1), 20-27. Peer assessment is an arrangement in which learners consider and specify the level, value, or quality of a product or performance of other equal-status learners. Products to be assessed can include writing, oral presentations, portfolios, test performance, or other skilled behaviors. A formative approach to peer assessment helps students help one another plan their learning, identify their strengths and weaknesses, target areas for remedial action, and develop metacognitive and other personal and professional skills. A peer assessor with less assessment skill but more time in which to assess can produce a judgment of reliability and validity equal to a teacher’s. Because peer feedback is available in greater volume and with greater immediacy than teacher feedback, teachers are encouraged to use it.

Tubbs, M. E. (1986). Goal Setting: A Meta-Analytic Examination of the Empirical Evidence. Journal of Applied Psychology, 71(3), 474-483. Tubbs conducted meta-analyses to estimate the amount of empirical support for the major postulates of the goal theory of E. A. Locke (see record 1968-11263-001) and Locke et al. (see record 1981-27276-001). The results of well-controlled studies were generally supportive of the hypotheses that specific and challenging goals led to higher performance than easy goals, “do your best” goals, or no goals. Goals affect performance by directing attention, mobilizing effort, increasing persistence, and motivating strategy development.

Usher, E. L., & Pajares, F. (2008). Sources of Self-Efficacy in School: Critical Review of the Literature and Future Directions. Review of Educational Research, 78(4), 751-796. The purpose of this review was threefold. First, the theorized sources of self-efficacy beliefs proposed by A. Bandura (1986) are described and explained, including how they are typically assessed and analyzed. Second, findings from investigations of these sources in academic contexts are reviewed and critiqued, and problems and oversights in current research and in conceptualizations of the sources are identified. Although mastery experience is typically the most influential source of self-efficacy, the strength and influence of the sources differ as a function of contextual factors such as gender, ethnicity, academic ability, and academic domain. Finally, suggestions are offered to help guide researchers investigating the psychological mechanisms at work in the formation of self-efficacy beliefs in academic contexts.

Walker, A., & Leary, H. (2009). A Problem-Based Learning Meta-Analysis: Differences Across Problem Types, Implementation Types, Disciplines, and Assessment Levels. Interdisciplinary Journal of Problem-Based Learning, 3(1), 12-43. In a meta-analysis of 82 studies, 201 outcomes favored problem-based learning (PBL) over traditional instructional methods. The authors review a typology of 11 problem types proposed by Jonassen (2000), ranging from highly structured problems (focused on an accurate and efficient path to an optimal solution) to ill-structured problems (which do not necessarily have solutions and which prioritize evaluation of evidence and reasoning). The types are logical problems; algorithmic problems; story problems (underlying algorithms with a story wrapper, amounting to algorithmic problems); rule-using problems; decision-making problems (e.g., cost-benefit analysis); troubleshooting (systematically diagnosing a fault and eliminating a problem space); diagnosis-solution problems (characteristic of medical school, with small groups understanding the problem, researching possible causes, generating hypotheses, performing diagnostic tests, and monitoring a treatment to restore a goal state); strategic performance; case analysis (characteristic of law or business school, involving adapting tactics to support an overall strategy and reflecting on authentic situations); design problems; and dilemmas (such as global warming: complex, involving competing values, and possibly lacking any obvious solution). Strategic-performance and design problems were especially effective in producing positive PBL outcomes.

Watt, K. M., Powell, C. A., & Mendiola, I. D. (2004). Implications of One Comprehensive School Reform Model for Secondary School Students Underrepresented in Higher Education (PDF). Journal of Education for Students Placed at Risk, 9(3), 241-259. A study of 10 high schools that implemented Advancement Via Individual Determination (AVID) found that all 10 of the AVID schools improved their accountability ratings during the first three years of AVID implementation. AVID students outperformed their classmates on various standardized tests and attended school more often than their classmates.

Wiggins, G., & McTighe, J. (2005). Understanding by Design. Alexandria, VA: Association for Supervision and Curriculum Development. The ASCD website says the following about this book: “What is understanding and how does it differ from knowledge? How can we determine the big ideas worth understanding? Why is understanding an important teaching goal, and how do we know when students have attained it? How can we create a rigorous and engaging curriculum that focuses on understanding and leads to improved student performance in today’s high-stakes, standards-based environment? Authors Grant Wiggins and Jay McTighe answer these and many other questions in this second edition of Understanding by Design. Drawing on feedback from thousands of educators around the world who have used the UbD framework since its introduction in 1998, the authors have greatly revised and expanded their original work to guide educators across the K-16 spectrum in the design of curriculum, assessment, and instruction.”

Wiliam, D. (2010). The Role of Formative Assessment in Effective Learning Environments. In H. D. Dumont, D. Istance, & F. Benavides (Eds.), The Nature of Learning: Using Research to Inspire Practice (pp. 135-159). OECD Publishing. This chapter summarizes and elaborates upon formative-assessment research and effective practices to date.

Yeh, S. S. (2007). The Cost-Effectiveness of Five Policies for Improving Student Achievement (PDF). American Journal of Evaluation, 28(4), 416-436. doi:10.1177/1098214007307928. Yeh conducts a cost-effectiveness analysis comparing five educational policies: rapid assessment, voucher programs, charter schools, accountability, and increased spending. Rapid assessment is identified as the most cost-effective of the strategies analyzed.

