Eh? What’s that?
I’m talking about AI approaches to the synthesis of speech on your smartphone and related devices, i.e., how does Siri figure out how to pronounce the words it’s saying?
OK. But what does that have to do with us?
Another necessary detour around the aporias of disciplinary thought… This is really about recognizing the value of computer simulation in articulating the possibility spaces from which reality emerges. Put in the most general terms, if you want to know why one thing is composed rather than another, why there is something rather than nothing, you need a way of describing how one particular thing emerges from the virtual space of possibilities where many other things were more or less probable. Simulation is a longstanding concept in the humanities, mostly through Baudrillard, but the intensification in computing power changes its significance. As DeLanda writes, in Philosophy and Simulation,
Simulations are partly responsible for the restoration of the legitimacy of the concept of emergence because they can stage interactions between virtual entities from which properties, tendencies, and capacities actually emerge. Since this emergence is reproducible in many computers it can be probed and studied by different scientists as if it were a laboratory phenomenon. In other words, simulations can play the role of laboratory experiments in the study of emergence complementing the role of mathematics in deciphering the structure of possibility spaces. And philosophy can be the mechanism through which these insights can be synthesized into an emergent materialist world view that finally does justice to the creative powers of matter and energy.
So this post relies on your willingness to at least play along with that premise. It is, as DeLanda remarks elsewhere, an ontological commitment. You may typically have different commitments, which of course is fine. Within a realist philosophy like DeLanda’s, the value proposition for an ontological commitment is its significance rather than its signification, by which he means it’s less about its capacity to represent/signify Truths and more about its capacity to create capacities that make a difference (its significance).
In this case, we have the simulation of speech. Basically what happens (and basic is the best I can muster here) is that Siri’s voice is constructed from recorded human speech. That speech is divided up into constitutive sounds, and the purpose of speech synthesis is then to figure out how to recombine those sounds to make natural-sounding speech. [n.b. A common error in this conversation is to identify the semblance between a computer and a human at the wrong level: to assert that the human brain is like a computer. However, I don’t think anyone would suggest that humans operate by having a database of sounds that they then have to probabilistically assemble in order to speak.] While humans don’t form speech this way, we do obviously have a cognitive function for speaking that is generally non-conscious (exceptions being when we are sounding out an unfamiliar word, learning a new language, etc.). Generally we don’t even “hear” the words we read in our minds (though I bet you’re doing it right now, just like you can’t not think of a pink elephant).
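For readers who want to see the recombination step in miniature, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the four-entry `DATABASE`, the invented pitch values, the `Unit` fields, and the two hand-written cost functions are hypothetical and are not drawn from Siri or from any real system. It only shows the general idea of choosing, for each sound, the recorded snippet that best fits what is wanted while also joining smoothly with its neighbor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    sound: str    # label of the sound this snippet contains
    pitch: float  # average pitch of the snippet in Hz (invented values)
    clip: str     # hypothetical name of the recording it was cut from


# Hypothetical database of recorded snippets (illustration only).
DATABASE = [
    Unit("h", 120.0, "rec1"),
    Unit("h", 180.0, "rec2"),
    Unit("ay", 125.0, "rec1"),
    Unit("ay", 200.0, "rec3"),
]


def target_cost(unit, wanted_pitch):
    """How far a snippet is from the pitch we want (assumed measure)."""
    return abs(unit.pitch - wanted_pitch)


def join_cost(left, right):
    """How audible the seam between two snippets would be (assumed measure)."""
    return abs(left.pitch - right.pitch)


def select_units(sounds, wanted_pitch, database=DATABASE):
    """Pick one snippet per sound, minimizing target plus join costs."""
    if not sounds:
        return 0.0, []
    best = None  # maps each candidate unit to (cost so far, path so far)
    for sound in sounds:
        candidates = [u for u in database if u.sound == sound]
        if not candidates:
            raise ValueError(f"no recorded unit for sound {sound!r}")
        new_best = {}
        for unit in candidates:
            fit = target_cost(unit, wanted_pitch)
            if best is None:
                new_best[unit] = (fit, [unit])
            else:
                # Cheapest way to arrive at this unit from any previous one.
                cost, path = min(
                    ((c + join_cost(prev, unit), p) for prev, (c, p) in best.items()),
                    key=lambda item: item[0],
                )
                new_best[unit] = (cost + fit, path + [unit])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


total, path = select_units(["h", "ay"], wanted_pitch=120.0)
print(total, [u.clip for u in path])
```

Note that nothing in this procedure refers to what the word means. The selection is driven entirely by numerical distances between snippets of sound.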
One thing that is clear in speech synthesis is that the process that seeks to approximate the sounds of “natural speech” does not know the meaning of the words being spoken, nor does it need to know that the sounds being made are connected to meaning or even that meaning exists. It is a particular technological articulation of Derrida’s deconstruction of logo-phonocentrism, whose heritage he describes as the “absolute proximity of voice and being, of voice and the meaning of being, of voice and the ideality of meaning” (Of Grammatology). Diane Davis takes this up as well, writing “it is not only that each time ‘I’ opens its mouth, language speaks in its place; it is also that each time language speaks, it immediately ‘echos,’ as Claire Nouvet puts it, diffracting or laterally sliding into an endless proliferation of ‘alternative meanings that no consciousness can pretend to comprehend’” (Inessential Solidarity). None of that is to suggest that meaning does not exist or even that the words Siri speaks are meaningless. No, instead it leads one toward a new task of describing how the mechanisms (or assemblages, to stick with DeLanda’s terms) for signification and significance are separate from, though certainly capable of relating to, the assemblages by which speech is composed.
But getting back to speech synthesis. I’ve been clawing my way through a couple pieces on this subject like this one from Apple’s Machine Learning journal and this one coming out of Google research. This is highly disciplinary stuff and at this point my understanding of it is only on a loose conceptual level. However, I’m trying to take seriously DeLanda’s assertion regarding “the role of mathematics in deciphering the structure of possibility spaces,” as well as his claim that “philosophy can be the mechanism through which these insights can be synthesized into an emergent materialist world view that finally does justice to the creative powers of matter and energy.” It is that last part that I am pursuing and which, at least for me, is integral to rhetoric and composition.
Here, however, is my hypothesis. Despite the arrival (and digestion) of poststructuralism in English Studies in the last century, rhetoric and composition remains a logo-phonocentric field. The digital age (or software culture, as Manovich terms it) has put serious pressure on those ontological commitments (and that’s what logo-phonocentrism ultimately is, an ontological commitment). The mathematical description of the possibility spaces of speech synthesis and the subsequent simulation of speech are just one small part of those pressures, a part so esoteric as to be difficult for us to wrap our minds around.
But what happens when we start disambiguating (decentering) the elements of composition that we habitually unify in the idea of the speaking subject? To return to DeLanda here as I conclude:
The original examples of irreducible wholes were entities like “Life,” “Mind,” or even “Deity.” But these entities cannot be considered legitimate inhabitants of objective reality because they are nothing but reified generalities. And even if one does not have a problem with an ontological commitment to entities like these it is hard to see how we could specify mechanisms of emergence for life or mind in general, as opposed to accounting for the emergent properties and capacities of concrete wholes like a metabolic circuit or an assembly of neurons. The only problem with focusing on concrete wholes is that this would seem to make philosophers redundant since they do not play any role in the elucidation of the series of events that produce emergent effects. This fear of redundancy may explain the attachment of philosophers to vague entities as a way of carving out a niche for themselves in this enterprise. But realist philosophers need not fear irrelevance because they have plenty of work creating an ontology free of reified generalities within which the concept of emergence can be correctly deployed. (Philosophy and Simulation)
I would suggest an analogous situation for rhetoricians. Perhaps we fear irrelevance in the face of “reified generalities” that form our disciplinary paradigms. What happens when not just “voice” or “speech” is distributed but expression itself becomes described as emerging within a distributed cognitive media ecology?
In any case, that’s where my work is drifting these days and it was useful for me to glance back toward the discipline here to get my bearings vis-a-vis some future audience I hope to address.
One reply on “On the importance of deep mixture density networks and speech synthesis for composition studies”
Very informative post Alex ! Thank you