NAU Department of Educational Leadership Rural Schools Resource Center (NAURRC)
Hot Topics and Current Research
What do Letter Grades Tell Us?
By Mr. Sean E. Rickert, Superintendent, Pima USD & ARSA President-Elect
What do Letter Grades tell us?
Nothing. Next Question.
But seriously, for education leaders there is something to be learned from the history of letter grades, what they can tell us about the quality of schools, and what they actually mean. They don’t mean what people think they mean, and that is why they really don’t mean much. When an indicator is widely believed to mean one thing but actually means another, its real meaning is likely to be lost.
Understanding education policies in Arizona is remarkably simple. Arizona is a parental choice state. Every parent of a school-age child decides where to send their child for educational services. Most choose a traditional public school. Many choose a charter school, some choose to homeschool their children, and a few choose private schools.
In 2000, as part of a proposal to boost funding for schools through an education sales tax, the voters enacted an accountability system. This created a set of tools to assess school quality and guide parents as they made choices within the educational marketplace. Schools were labeled from “Highly Effective” to “Failing” based on how well students performed on the state standardized test. In 2010 the descriptive labels were exchanged for letter grades A through F. Grades were based on the percentage of students testing well on the Arizona Instrument to Measure Standards (AIMS), coupled with the percentage of students achieving a year’s progress. Emphasis was also given to subgroups of students, including students with disabilities, students in poverty, English learners, and the lowest performers. In 2015, Arizona transitioned from AIMS to a new test more closely aligned with the National Assessment of Educational Progress (NAEP). Where 70% of students tested “Proficient” on AIMS, only 30% passed the new Arizona’s Measurement of Educational Readiness to Inform Teaching (AzMERIT) test. This dramatic shift created a need for a new system for measuring school quality, so parents could understand where they should send their children.
Governor Ducey set the task to the State Board of Education, which in turn formed an Ad Hoc committee to revamp the way we measure school performance. The committee included representatives from traditional public schools and charters as well as education policy experts. Their charge was to study accountability systems in other states and report back to the state board with a system for labeling schools. The system was expected to be based primarily on student-level data that could be independently verified. It should also minimize the correlation between socioeconomic status (SES) and school quality: schools shouldn’t be seen as doing well simply because their students were wealthy. As if that wasn’t complicated enough, the committee was also saddled with the new requirements of the latest reauthorization of the Elementary and Secondary Education Act (ESEA), known as the Every Student Succeeds Act (ESSA). ESSA replaced the No Child Left Behind Act and ensconced many of the accountability provisions the Obama Administration had incorporated into their NCLB waivers. These rules would provide important guidelines for Arizona’s new system. The committee was expected to create a simple-to-explain, valid measure of school quality that met the requirements of state and federal law.
Two key data points dominated the discussion over how to measure schools -- proficiency and growth. The committee had to develop methods for measuring each and determine the best way to balance them against each other in the final formula. At the same time, the system for grading schools needed to be easy to interpret; there was broad frustration with the previous system because few people understood how it measured school performance. Proficiency is seen as providing the most basic understanding of how well a school has done its job. If a high percentage of students do very well on the standardized test, we should say the school has done a good job. However, things are not so simple. If I am teaching a physical education class and at the end of the semester we measure student performance by how high they can jump, isn’t there a need to consider how high they jumped at the beginning of the semester before we assess the value of the class? At the most basic level, quality instruction should lead to high levels of student growth. Can you have a school with high proficiency and low growth? Certainly: a school whose students arrive already performing well can post high proficiency while adding little. So, both proficiency and growth are necessary. How do we measure them?
Proficiency can be measured in a number of ways. We can look at the percentage of students who have scores above a certain level; this is called straight proficiency. Arizona uses a weighted proficiency model, which means each student’s score is given one of four values: 0, 0.6, 1, or 1.4 points. If a school has 12 students, with two at the highest level, six at the proficient level, two at the partial proficiency level, and two at the lowest level, its weighted proficiency is [ (2 x 1.4) + (6 x 1) + (2 x 0.6) + (2 x 0) ] / 12 = 0.833. The formula is based on a recognition that the goal is highly proficient, but for some students partially proficient is quite an achievement.
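To make the arithmetic concrete, the calculation above can be sketched in a few lines of Python. This is only an illustration: the point values (0, 0.6, 1, 1.4) come from the description above, while the level names and the function name are placeholders, not anything from the official business rules.

# Illustrative sketch of the weighted proficiency calculation described above.
# Only the point values come from the text; the level names are placeholders.
WEIGHTS = {"lowest": 0.0, "partial": 0.6, "proficient": 1.0, "highest": 1.4}

def weighted_proficiency(counts):
    """counts maps each performance level to the number of students at that level."""
    total_students = sum(counts.values())
    total_points = sum(WEIGHTS[level] * n for level, n in counts.items())
    return total_points / total_students

# The 12-student example from the text: 2 highest, 6 proficient, 2 partial, 2 lowest.
example = {"highest": 2, "proficient": 6, "partial": 2, "lowest": 2}
print(round(weighted_proficiency(example), 3))  # prints 0.833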
Developing a growth measure is even more complicated. How do we measure growth? The Student Growth Percentile (SGP) is a measure of how a student’s improvement over their prior-year performance compares to that of other students who scored similarly the previous year. In 2018 SGP was coupled with Student Growth to Target (SGT), a measure of how close students are to being on track to perform at grade level by the end of their academic career. Both are complicated methods for measuring how much a student has improved over the year. The accountability system isn’t interested in individual students per se, so the A-F system aggregates student growth into a school label. This is done by considering how much growth students at each of the performance levels achieved. Students are grouped into three levels of growth, weights are applied, SGP is weighted against SGT, and a final growth measure is achieved.
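In rough outline, that aggregation can be sketched as follows. Everything in this sketch -- the band cut points, the band weights, and the SGP/SGT blend -- is a hypothetical placeholder chosen for illustration, and it glosses over the cross-tabulation by performance level described above; the adopted values live in the state’s A-F business rules.

# Hypothetical sketch of rolling student growth up into a school-level measure.
# Band cut points, band weights, and the SGP/SGT blend below are placeholders.

def growth_band(sgp):
    """Group a Student Growth Percentile (1-99) into low / average / high growth."""
    if sgp < 34:
        return "low"
    if sgp < 67:
        return "average"
    return "high"

BAND_POINTS = {"low": 0.0, "average": 1.0, "high": 2.0}  # placeholder weights

def school_growth_score(sgp_scores):
    """Average the banded growth points across all tested students in the school."""
    points = [BAND_POINTS[growth_band(s)] for s in sgp_scores]
    return sum(points) / len(points)

def blended_growth(sgp_component, sgt_component, sgp_weight=0.5):
    """Blend the SGP-based score with the Student Growth to Target score."""
    return sgp_weight * sgp_component + (1 - sgp_weight) * sgt_component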
All of this is based entirely upon a group of students’ performance on a single test, so part of the quest remained unfulfilled. For elementary students this gap was filled in with the ‘Acceleration / Readiness’ points. Part of the puzzle was the ESSA requirement that subgroup performance be a contributing factor, so that was built into the A/R points. There was also the lingering question of what else could be included that met the twin requirements of being independently verifiable and non-test based. Arizona doesn’t collect much data about school operations, but attendance is one measure it gathers, and chronic absenteeism fit the bill. At the high school level there was more data available about Career Technical Education (CTE) and accelerated course work (e.g., AP, IB, Dual Enrollment). The creation of the College and Career Readiness Indicators (CCRI) provided a mechanism for secondary schools to include a quantitative measure of some of the quality programs being offered. Secondary schools also included the obligatory graduation rates. How can you grade a high school without looking at its ability to get students across the stage?
The last metric added was the ESSA-required inclusion of measures of English Learner performance. This gave us the complete lexicon of measures to be used for grading schools. Left out in the parking lot were proposals to include important non-tested programs like technology and physical education. There was also a proposal to include post-graduation success, but difficulties operationalizing the idea made it impracticable. With all the measures on the board, the final step was to determine how to balance them. For elementary schools, growth was seen as the predominant metric. For high schools, a more nuanced balance was selected in which growth, proficiency, and CCRI were weighted roughly equally.
How well did this work? The first application of the system was met by a record number of appeals. By the second year, the weights for growth had been shifted in ways that the ad hoc committee knew would unbalance the whole metric. Underlying the whole discussion had been the possibility of a “menu of assessments,” which would prove wholly unmanageable; by the third application of the system this issue rose to the surface. By the fourth application, a pandemic forced the cancellation of the state test. By the fifth application, we had to weigh skip-year growth against a year when most students participated in school in person for only part of the year -- no grades were provided. The sixth application is the one we are currently grappling with.
The 2022 grades are similar to the 2017 grades that initiated the new system, but they differ in that there are no SGT points; growth is now based entirely on Student Growth Percentiles. At the secondary level there are issues with growth because we are comparing how students performed on the ACT test they took as sophomores with the ACT Aspire test they took as freshmen, but many did not take that test.
At the elementary level, we have now (quickly) moved on from the AzMERIT test to the Arizona Assessment of Student Achievement (AASA). This test is different from the prior test, yet results from the two are being compared. We also know that there were significant issues with attendance during the 2021-2022 school year.
Finally, there is always the question of how best to set ‘cut scores’ -- the values differentiating an A from a B, a B from a C, and so on. When the State Board of Education initially set the cut scores back in 2018, they were based on means and standard deviations. Ever since, there has been an annual discussion about how to set them. The conventional wisdom has been that we should leave them where they are to facilitate school improvement. This is a great idea, but it falls a bit flat. Cut scores are arbitrary values. Differentiating between schools based on how many standard deviations from the mean their point total falls would make sense if any of this followed a natural (roughly normal) distribution, but none of the histograms show that it does. We have a system where the mean is biased. But, in the end, it doesn’t really matter.
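For readers unfamiliar with the mechanics, a cut-score scheme “based on means and standard deviations” can be illustrated with a short sketch. The offsets below are hypothetical placeholders, not the board’s adopted values, and the function name is mine.

# Hypothetical illustration of deriving A-F cut scores from the distribution of
# school point totals. The standard-deviation offsets are placeholders only.
from statistics import mean, stdev

def cut_scores(point_totals):
    """Return the minimum point total needed for each letter grade."""
    mu, sigma = mean(point_totals), stdev(point_totals)
    return {
        "A": mu + 1.0 * sigma,  # placeholder: one standard deviation above the mean
        "B": mu,                # placeholder: at the mean
        "C": mu - 1.0 * sigma,
        "D": mu - 2.0 * sigma,  # anything below the D cut is an F
    }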
What does the letter grade a school receives tell us? In the end, it tells us exactly what we need to know. The distribution shows that schools with a D or F label sit on a left tail far from the performance of the vast majority of schools. Schools that receive A or B labels are all pretty much in the same boat: they have done some things well, some things very well, and few things poorly. They likely have student populations predisposed to doing well, but the weights do a pretty good job of ensuring affluence is not a requirement for a good grade. A C means what it usually means: some things were done well, but there is tremendous room for improvement. A school with a D or F label has room for improvement; it may have some strengths it can point to, but more than one thing has gone wrong.
School letter grades in a School Choice environment like Arizona’s exist as a mechanism to help parents decide where it is best to send their child to school. The ironic thing about our letter grading system is that hardly any parents use it for this purpose. This can be attributed to general dissatisfaction with the whole standardized assessment system. If the Check Engine light on your car is always on, but the mechanic can never find anything to repair, eventually you will ignore the light. A test that tells more than two-thirds of parents that their children are not proficient has had the same effect. A letter grade system that gives good schools D and C grades, and gives A grades to schools out of reach of most families, eventually begins to appear elitist. Solving this problem is very difficult. Shifting cut scores, as the state board did for high schools, can lower the bar, but it is difficult to explain why that makes a school better.
This is problematic because one of the goals touted by members of the State Board of Education for the accountability system is to incentivize improvement. A system that is broadly seen as invalid and unrealistic can’t incentivize much. At the same time, the notion that the State Board can make decisions about the A-F formula to drive greater student achievement belies the reality that teachers are already trying. Since the only metric of value to the board is the AASA, all they can do is increase the emphasis teachers place on teaching to the test. Ironically, their discussions over cut scores at their recent meetings indicate that they want to do the exact opposite.
When Governor Ducey provided guidance to the Ad Hoc Committee, he asked them to consider three state accountability systems in their discussion. These three -- Indiana, Florida, and Massachusetts -- each shed light on the issue of school accountability from a different perspective. Indiana is the home state of the Governor’s friend, former Vice President Mike Pence, who took great strides to support school choice while holding school operators accountable. Florida implemented a series of reforms under then-Governor Jeb Bush that provided valuable examples of how to expand school choice for parents. Massachusetts is the outlier in this set. The accountability system adopted by Massachusetts is one of the most complex and comprehensive imaginable. Developed by psychometricians at Harvard and MIT, it strives to consider every variable and weed out every bias. In the end, when you look at the outcomes, one thing remains: students from wealthy communities do well, and those from poor communities do poorly. A system that identifies struggling schools as well as ours does can be considered a success. Even if it doesn’t tell us what we think it does.


Mr. Sean E. Rickert
Superintendent of Pima USD & ARSA President-Elect
What do Letter Grades Tell Us? Takeaways
1. Understanding the history of letter grades is necessary to understand what they actually mean.
2. Governor Ducey gave the State Board of Education the task of creating a new system for measuring school quality, and the board formed an Ad Hoc committee to revamp the way we measure school performance. The committee included representatives from traditional public schools and charters as well as education policy experts.
3. The committee’s charge was to create a simple-to-explain, valid measure of school quality that met state and federal law requirements.
4. The process needs to measure both proficiency and growth.
5. Proficiency can be measured in several ways, but developing a growth measure is more complicated.
6. Ultimately, students from wealthy communities do well, and those from poor communities do poorly.