Anyone who has called a customer service line in the past decade or so has become well acquainted with a robot. Its voice (or rather his or her voice) is too stiff to be a human’s but too fluent to be a machine’s, thanking you for your call with all the warmth of an e-mail signature. Let him introduce himself: “I’m an automated system that can handle complete sentences. So tell me, how can I help you today?” So go right ahead and tell him just whatever seems to be the matter, and he’ll follow you because — get this — he can handle complete sentences. “OK, I’ll get you right over to our technical support department.” So then you hear a new fellow, who, judging by his slightly higher voice, is gentler and calmer, as if he knows something of yours is broken and you’re probably pissed. “Are you calling about technical support for an Apple product?” which, if you’ve done this right so far, you are. “Do you have a case ID?” which, if you’re calling for the first time, you don’t, but “no” is a perfectly acceptable answer in this choose-your-own-adventure novel. “Please say the name of the Apple product for which you want technical support,” which was probably the very first thing you said, back with the fellow who understood complete sentences, but now try it again with this new guy, just the name now. “Sorry, I’m having trouble. What I’m looking for,” he continues, with the slightest and most disarmingly realistic stammer on the W, “is a product name, like Snow Leopard, iMovie, MacBook Pro or iMac.” Try again. “I think you said iPad. Is this correct?” Glory, hallelujah, it is! Now you get to talk to a real human being.
But wait, what if you’re from Boston? Did you mean iPod?
This concern, and it is a real concern, is not unique to iPads (whose name, when it was announced in January, instantly became a laughingstock for Bostonians and Irishmen in the press and blogosphere) nor to Apple (which declined to comment for this article). Claire Bowern, for one, knows all too well the ordeal of getting a machine to understand her Australian accent when she wants “flight status” but says “flaiht stay-toos.” Computer, you’ve messed with the wrong Yale University linguistics professor. Bowern is currently working on the largest study of North American English dialects ever conducted. To start, she’s doing a dry run with a subset of data from Yalies.
“We were int’rested in seeing if freshmen changed thei’ accents when they came to Yale, given that, especially in the fi’st couple of weeks of claahsses, there’s a bit of a pressure-cooker environment,” Bowern says. “I’ve seen this in my claahsses teaching Intro Linguistics, and when I was a grad student at Hahvahd this was very noticeable, that freshmen would come to campus with aull sorts of local dialects and then within about six weeks they’d aull converged.”
So Bowern set out to measure, more scientifically, how the freshmen converge, and on what. Which regional features were abandoned and which were adopted? But since she found busy freshmen were loath to commit to repeat recordings, she decided to try sampling freshmen and seniors to see if, as groups, they speak differently.
The Yale samples will feed into the continent-wide study — Bowern and her research partners are aiming for 2,500 participants overall and already have more than 1,700. They hope the data collection will wrap up by the end of the year. When it’s done, it will be the largest study of its kind by a factor of three, she says. It is also unusual in scope: where most researchers look only at how geography shapes dialect, Bowern is also studying patterns across gender, age, ethnicity, and class. Bowern explains that sociolinguists know that ethnicity, age, and gender contribute to “dialect die-versity,” but they tend to study those aspects separately from one another and from the larger context of regional variation. “So the question that we’re asking is: when you take a city or a given geographical area, aah the differences in geography still the most salient?”
Anyone who grew up speaking English in the United States or Canada can log onto a website (http://pantheon.yale.edu/~clb3/NorthAmericanDialects/); read the directions and frequently asked questions (Q: “But I don’t speak ‘good’ English … surely you don’t want a sample of my speech?” A: “Yes we do!”); enter his or her age, ethnicity, gender, current zip code, high school zip code, other native language, and parents’ other native languages; and (on a computer equipped with a microphone and a sound card) record: one, two, three, four, five, six, seven, eight, nine, ten, broad, full, skid, prune, heel, coat, maid, duck, hill, boat, sock, clown, might, eggs, foot …
Stop. Where are your telltale paahks and caahs and knickkneeacks and beeackpeeacks? Oats and aboats? Youses and y’alls? Bowern explains: “We wanted exahmples of all of the phonemic vowels of American English between puhticular consonants,” meaning the ones where a different pronunciation carries a different meaning (bite versus bit versus beet), but holding the consonants constant on either end because they can affect the vowel (bat versus bad). So, in the survey, it’s pen versus pin and caught versus cot.
Hear it now? Because if you’re from the South (say, Lixingtin, Kintucky) that first pair probably sounded the same, as did the second if you’re from out West (say, Lahs Angeles). It’s called merging word classes, and about half the country is doing it these days. Homophones are appearing where they weren’t before, and with them, so are miscommunications. “I do” in holy matrimony used to sound different from the dew on the grass, or else you must have been talking about dog doo; but across this continent, most people now favor the French-sounding ieu in all cases. Which and witch used to sound different (hhhwitch and witch) for two Americans out of three, whereas now only a few sticklers bother to sound that wispy aspiration. Don’t blame mass media, for our national speech isn’t homogenizing: Small-time accents might be blending into their respective regions, but those regional distinctions are more pronounced now than ever. Starting about 50 years ago, the people in the metropolitan areas of the Midwest (Buffalo, Detroit, Cleveland, Chicago) started raising their short vowels (meaning the tongue, when pronouncing them, points higher in the mouth: man becomes meean instead of maahn) in what’s known as the “Northern cities vowel shift.”
This is all well documented. But it’s not at all clear why it’s happening. Sociolinguists have postulated that a sort of social Darwinism gradually directs language change: that whenever two sounds compete, the winner shall be the one to which more people attach social prestige. After all, ever since the Gileadites ferreted out the Ephraimites by the latter’s failure to pronounce the sh in shibboleth, people have judged one another based on their accents. The sociolinguists’ theory makes sense, except that it doesn’t explain the long history of lower-class sounds being adopted by the population at large, and it’s just as mysterious how the Midwestern eea would have become, by that logic, a status symbol.
For the Yale sub-study, Bowern set out to measure the Northern cities vowel shift in microcosm — what happens when you take 18-year-olds from all over the world, from all different backgrounds, and let them talk it out? How would Darwinism shape the dialect for this concentrated population in a concentrated period? But better yet, this is no ordinary playground peer pressure. This is Yale, one of those rare institutions that has, for centuries, refined and elevated only the most proper order of speaking by refining and elevating only the most proper order of men; the breeding ground of that voice heard round suppers with as many forks as courses, the one that Tom Wolfe in 1976 termed “the honk”; where they learned that dropping only the right r’s (fuhst for first but never fo’ud for forward) was as important as unbuttoning only the lowest button of your waistcoat, and as important as, for being so important to, surrounding yourself with the right crowd. And even after coeducation and financial aid and the attending shift (at least nominally) from an aristocratic impulse to a meritocratic one, there remains at Yale, just as do the (now nonfunctional) fireplaces in every dorm room and the windowless tombs of the senior societies, the firmest notion of upward phonetic mobility. It’s the belief that by eschewing Rhine in Spine in favor of Raeyn in Spaeyn, anyone from anywhere can pull himself up by his own diphthongs, because language says as much about where you’re from as where you’re going. Accents are as much an intrinsic part of our socially defined identities as birthplace, wealth, gender, ethnicity, and just about anything else you can think of to judge a person by. Except accents are malleable. Even contagious.
Professor Bowern notes that she now says cawffee instead of cohffee, adopting a vowel she didn’t have back in Australia but picked up in New Haven (which, by the way, locals always pronounce New HAY–ven, whereas the part-time population tends to favor NEW Haven).
Sociolinguists like Bowern call this the “interview effect” or “accommodation theory”: the tendency of people to adapt their accents to the person they’re talking to — provided they like them. Otherwise they may amplify the dialectal differences. The latter was the social dynamic that William Labov, the pioneering sociolinguist now at the University of Pennsylvania, found in his legendary study of Martha’s Vineyard, whose small native population, after centuries of isolation from mainland New England, suddenly faced an influx of rich vacationers; in response, a small group of fishermen began exaggerating an existing tendency in their speech as a way to distance themselves from the standard English of the summer invaders who, in their judgment, threatened their traditional home and way of life.
In the latest edition of Principles of Linguistic Change, Labov writes that the Vineyarders’ shift, while a marker of identity, happened subconsciously. When accents do rise to the level of social awareness, it usually relates to some stigma or prestige, as in New York, where Labov’s famous department store experiment found that sales associates at Saks were more likely to sound the final or preconsonantal r in “fourth floor” than their counterparts at Macy’s, who in turn spoke more r-fully than the staff at Klein’s. “Changes from above usually involve superficial and isolated features of language,” Labov writes. “They tell us little about the systematic forces that mold the history of dialect divergence.” So he, too, is still searching for possible social explanations for the Northern cities vowel shift. One of the more curious correlations he has considered is the one between Northern dialect speakers and Democratic voters.
These are just the kind of social and socioeconomic dimensions, as they map onto geography, that Bowern is investigating. By the beginning of October she had collected enough data from Yalies to begin the analysis — just 75 voice samples this time. She was all ready to go with a computer program that would automatically use sound waves to detect the spaces between the words, slice up the recordings, and graph the frequencies at which the vowels resonated. Every vowel resonates in two characteristic frequency bands, called formants (although it sounded to me like Bowern was saying foments), and the slightest change in pronunciation registers at a slightly different frequency. This is how the advanced computer program would precisely detect and quantify human speech.
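Bowern’s actual software isn’t described here, but the idea behind formant measurement can be sketched in a few lines: a vowel’s identity shows up as peaks in its frequency spectrum, and those peaks (F1, F2) are what gets graphed. Purely as an illustration, the toy code below fakes a “vowel” by summing two sine waves at roughly ah-like formant frequencies, then recovers the peaks with a Fourier transform. The numbers and names are invented for the sketch, not taken from the study.

```python
# Toy illustration of formant measurement: build a fake "vowel" from
# two sine waves at ah-like formant frequencies, then recover the two
# biggest spectral peaks (stand-ins for F1 and F2) with an FFT.
import numpy as np

RATE = 8000          # samples per second
DUR = 0.5            # half a second of "speech"
F1, F2 = 700, 1200   # rough formant frequencies for an ah-like vowel

t = np.arange(int(RATE * DUR)) / RATE
signal = np.sin(2 * np.pi * F1 * t) + 0.6 * np.sin(2 * np.pi * F2 * t)

# Magnitude spectrum and the frequency each bin corresponds to.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / RATE)

# Take the two largest spectral peaks as the formant estimates.
peaks = sorted(round(f) for f in freqs[np.argsort(spectrum)[-2:]].tolist())
print(peaks)  # → [700, 1200]
```

A real pipeline works on recorded speech, where the formants are broad resonance bands rather than clean sine peaks, so linear-predictive methods are typically used instead of raw FFT peak-picking; the sketch only shows why a 10-hertz shift in pronunciation is measurable at all.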
And had the test subjects been computers, it would have worked out fine. But the humans counted to twelve instead of ten, or got interrupted by a roommate opening the door, or spoke too quickly and strung the words together (loud-bad-log-pen-tide-pool-lie-home-boy-gun — Bowern acknowledges that, in hindsight, home-boy was probably not the best juxtaposition; same goes for chicken-hat), or supplemented the given list of words with a string of expletives (line, dollar, cot, see, fawn, shit, fuck, damn … ). But more troublingly, the parsing program was reporting different measurements every time Bowern ran the script — a pretty bad sign for both accuracy and replicability. When she removed the uncooperative samples and ran the data again, she found a small but statistically significant difference between freshman and senior women (although not men). She wants to redo the study next year during the first week of classes, perhaps getting more participants by sneaking it into the barrage of mandatory freshman orientation activities. In the meantime, Bowern got a new software program and an eight-core super-processor to avoid similar hang-ups as she moves on to the continent-wide study.
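For readers curious what “small but statistically significant” means mechanically, here is a toy version of the kind of two-group comparison involved, with invented F1 values in hertz. Welch’s t-test is a standard choice for comparing two group means; the article doesn’t say which test Bowern actually used.

```python
# Hypothetical sketch of a freshman-vs-senior comparison: given F1
# measurements for the same vowel from two groups, a two-sample
# t-statistic measures how far apart the group means are relative to
# the spread within each group. All numbers below are invented.
import statistics

freshmen = [430, 445, 438, 452, 441, 436]   # invented F1 values, Hz
seniors = [418, 425, 430, 412, 421, 427]

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5

t = welch_t(freshmen, seniors)
print(round(t, 2))  # well above 2, so the group difference is unlikely to be chance
```

With samples this small the statistic would then be checked against a t-distribution for a p-value; the point of the sketch is only that “significant” is a property of means and variances, not of any single speaker.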
In all, processing the data will take about another six months. When it’s done, Bowern hopes the study could do to accents what genetics did to race: namely, show that they lack any scientific basis as a source of discrimination. “A lot of dialects are quite highly stigmatized in the U.S., and ultimately, when it comes down to a difference of whether F1 is at 420 hertz versus 430 hertz, it seems like a pretty trivial reason to discriminate against someone,” she says, “though it’s widespread.”
Bowern says she plans to make the data available for all non-commercial research purposes. Meanwhile, she acknowledges the commercial interest of better understanding accents for speech technology companies. The largest, Nuance Communications Inc., based in Burlington, Massachusetts, projects that its 2010 sales will reach $1.2 billion. Jeff Foley, Nuance’s resident expert on accents, says the company has collected enough data to accommodate accents without compromising accuracy. “The system learns over time that if people calling in tend to pronounce things a certain way, we can adjust the models to lean in that direction,” he says.
That means the system knows to tolerate New Yorkers who put two syllables in four and Southerners who end their sentences with “sir” or “ma’am” and people who spell using expressions like “A as in apple.” Foley says the technology for call centers is 98 to 99 percent accurate, as good as, or sometimes better than, a trained human operator. The system starts with a realm of possible responses to the given prompt — an expected script — and then tries to match the sounds it hears to the possible choices, Foley explains. It weighs the probabilities of each potential match and, if it’s not confident enough about any of them, asks the subject to try again. All the while, it listens for vocal cues that can help it narrow its parameters by figuring that the speaker talks a certain way and adjusting its expectations to suit that model.
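The match-and-confirm loop Foley describes can be caricatured in a few lines. This is not Nuance’s system (a real recognizer scores acoustic models against audio, not text against text), but the control flow is the same: score what was heard against an expected script, commit only when the best candidate clears a confidence threshold, and otherwise re-prompt. Every name and number below is invented.

```python
# Caricature of a call-center recognizer's decision loop: match the
# input against an expected script, commit only above a confidence
# threshold, otherwise ask the caller to try again.
import difflib

EXPECTED = ["snow leopard", "imovie", "macbook pro", "imac", "ipad", "ipod"]
THRESHOLD = 0.75  # minimum similarity before the system commits

def best_match(heard):
    """Return (candidate, confidence), or (None, confidence) if unsure."""
    scored = [(c, difflib.SequenceMatcher(None, heard, c).ratio())
              for c in EXPECTED]
    candidate, confidence = max(scored, key=lambda pair: pair[1])
    if confidence < THRESHOLD:
        return None, confidence   # "Sorry, I'm having trouble..."
    return candidate, confidence  # "I think you said iPad. Is this correct?"

print(best_match("ipad"))     # → ('ipad', 1.0)
print(best_match("eye pad"))  # likely below threshold, so: re-prompt
```

Notice that “ipad” and “ipod” differ by a single character, so their similarity scores sit close together; that narrow margin is exactly where a Boston vowel can tip the recognizer toward the wrong product.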
In other words, computers have learned to make sense out of sounds in much the same way people do. Someday, they’ll be able to speak the sounds back just as well. With current technology, synthesizing a natural-sounding voice requires a tedious, finicky and expensive process of recording a real human voice talent, which is why the ones you hear tend to be nondescript and inoffensive. “You don’t want to have a Valley girl or a Deep South voice because any extreme might tick people off,” says Juergen Horst, a researcher for AT&T — just as you rarely hear such voices in, say, a Yale seminar.
But within five years, Horst says, it’s conceivable that computers will be able to synthesize voices in real time from broadcasts, user commands or even individual callers. Then they could learn to accommodate the accent of whomever they’re speaking to, just like people do subconsciously. The computer will detect hhhwhich word classes you merge and measure the height of your a’s and listen for which r’s you drop to figure out who you are, where you’re from, and how you want to be spoken to. But it’s not judging.