As Steve Krause has noted, and as has been discussed a fair amount recently on the WPA-list, there is reason to be concerned with the growing role of grading writing by machines. There is a new site and petition (humanreaders.org), and I have added my name to that petition. So it should be clear that fundamentally I share the concerns raised there because I have confidence in the research behind this position. Essentially the point is that current versions of machine grading software are not capable of "reading." What does that mean? It means that machines do not respond to texts in the way that humans do. It is possible to compose a text that humans would identify as nonsensical and still receive a high score from a machine. Machines can be trained to look for certain features of texts that tend to correlate with "good writing" from a human perspective, but those features can be easily produced without producing "good writing." The upshot, given the high-stakes nature of many of these texts, is that students will not be taught to produce "good writing" but rather writing that scores well. The horrors of teaching to the test are a commonplace in our culture, so there's no need to take the argument further.
And yet, of course, you would not be reading this if your computer (or phone) didn't read it first. If you have arrived at this page via Google, then there have been several levels of machine reading that brought you here. If it seems that Google and other search engines do a fairly good job of finding reliable texts on the subject in which you are interested, then it is because, by some means, they are good readers. No doubt, part of Google's system is reliant upon human evaluators who link to and visit pages, and perhaps upon your own preferences. The same might be said of human readers. How did we figure out what "good writing" was? Do we not rely upon social networks for this insight? In its crudest form (and closer to assessment), don't we "norm" scorers of writing for assessment purposes?
Anyone who has ever done search engine optimization has written explicitly for machines. One of the things that makes SEO tricky, though, is the secret, proprietary nature of Google's search algorithm. Unlike these machine grading mechanisms, it is not easy to game Google's search rank. Perhaps what is required for machine grading is a more complex, harder-to-predict mechanism. In other words, while machines do not need to read in the same way as humans do, they might need to simulate the subjective, unreliable responses of human readers in order to serve our purposes. That last sentence encapsulates two potential errors we encounter in our discussions of machine grading.
1. Because machines don't read the way humans do, they don't understand the meaning of the text. Critics complain that machines can't recognize when a text is nonsense or counterfactual. (One might say the same of humans anyway.) On what basis do we claim that humans are the arbiters of sense? Only on the basis that we care only what humans think or, from a correlationist perspective, that we can only understand texts in terms of ourselves anyway. We don't understand why machines grade texts the way they do sometimes, but we don't say that machines are subjective, which is what we say when human readers disagree. Instead, we say that machines produce error. I say that machines are readers too. Maybe they aren't the readers we want to score our tests, but then we wouldn't want a room full of kindergarteners either. So being human is no guarantee of reliable scoring.
2. Good machines would simulate human readers. This is our basic premise, right? That a machine would give the same score as a human to a given text. That is, we recognize that machines and humans will never read the same way but we need them to provide the same output in terms of scores. This would be like a calculator. A calculator doesn't do math like I do, but it gets the same answer. To make this happen we black box both the human and the calculator: the process is irrelevant; only the answer counts. But that's not really a good analogy for the scoring of human writing.
Unlike calculable equations, there is no right score for a text. What human scoring processes demonstrate is that reading takes place within the context of a complex network of actors that serve to create "interrater reliability" and so on. We begin with the premise that humans typically will not agree on the score for a text, even when you take a fairly similar group of readers (e.g. composition instructors teaching in the same department) and writers (e.g. students in their classes). They already are conditioned to a high degree, but then we add on specific conditioning through the norming process and the common conditions in which they are reading. Add into that various social pressures, such as recognizing the seriousness of the scoring and the pressure to grade like other readers so as to reduce the amount of work (discrepancies in scoring lead to additional readings and scorings).
Scoring is not an objective, rational process. Once one abandons the flawed concept of intersubjectivity (the consensual hallucination that we share thoughts when we agree), one has to come up with another explanation for why two readers give an essay the same score, and that explanation, in my view, would involve an investigation of the actor/objects and network-assemblages that operate to produce results. We can complain that machines don't recognize meaning, but that's only because meaning isn't in the text. This has always been the flaw in any form of grading. We evaluate students based upon what their texts do to us as readers. The only reason students have any power to predict what our experience will be is that they participate in a shared network of activity: a network over which they have little control.
So to go back to the original problem of machine grading, I would say that we need to ask what it is that we are trying to determine when we are grading these exams. Do we want to know if students can produce texts that have certain definable features in a testing situation? Do we want to know if students will get good grades on writing assignments in college? Or do we want to know, more nebulously, if students are "good writers"? I think we have proceeded as if these are the same questions. That is, good writers get good grades in college because they can produce texts with certain definable features. But that's not how it works at all, and I think we know that.
In case we don't, just briefly… Good texts don't have certain definable features because the experience of "good" doesn't inhere in the texts we read. This doesn't make the process subjective in the sense of one's reading practice being unpredictable or purely internal. It just makes reading relational. One way of defining rhetorical skill is having the ability to investigate the networks at work and produce texts that respond to those networks. We object to the notion of training students to compose texts that will produce positive responses from machines, but we also object to the notion of training students to compose texts that produce positive responses from normed human scorers.
The real problem, though, is starting with the pedagogical premise that teaching writing means teaching students to reproduce definable textual features without understanding the rhetorical and networked operations underneath. Because what we discover from machine readers is that we can compose texts that have those textual features but are ineffective from our perspective. This is a discovery we have already made a million times, though, as we have all seen many students who diligently replicate the requirements of an assignment and still manage to produce unsatisfactory results. Why? Because they have produced those features without understanding the rhetoricity behind them.
Machines are perfectly good readers. That's not where the problem is. The problem is that we don't understand reading.