Monday, September 15, 2014

can you still do linguistics without math?

A reader emailed me an interesting question that's worth giving a wider audience to:
It nearly broke my heart to hear that maths may be a required thing in linguistics, maths has pulled me back from a few opportunities in the past before linguistics, I'd been interested in engineering, marine biology, etc. I was just wondering if there was any work around, anything that would help me with linguistics that didn't require maths. Just. any advice at all, for getting into the field of linguistics with something as troubling as dyscalculia.
The reader makes a good point I hadn't thought about. I remember my phonetics teacher telling us that she often recruited students into linguistics by telling them that it's one of the few fields that teach non-mathematical data analytics. That was something that appealed to me.

I'm not familiar with dyscalculia so I can't speak to how it impacts the study of linguistics directly. But even linguists who don't perceive themselves as "doing math" often still are, in the form of complicated measurements and such, as in phonetics and psycholinguistics. Generally, though, I think there are still many opportunities to do non-mathematical linguistics, especially in fields like sociolinguistics, language policy, and language documentation. Let us not forget that the vast majority of the world's languages remain undocumented, so we need an army of linguists to work with speakers the world over to record, analyze, and describe the lexicons, grammars, and sound systems of those languages. We also need to better understand child language acquisition, slang, pragmatic inference, and a host of other deeply important linguistic issues. It still takes a lot of good old fashioned, non-mathematical linguistics skills to study those topics.

Unfortunately, those are woefully underpaid skills as well. One of the reasons math is taking over linguistics is simple economics: that's where the money is. Both the job market and the research grant market are trending heavily towards quantitative skills and tools, regardless of the discipline. That's just a fact we all have to deal with. I didn't go to grad school in order to work at IBM. That's just where the job is. I couldn't get hired at a university to save my life right now, but I can make twice what a professor makes at IBM. So here I am (don't get me wrong. I have the enviable position of getting paid well to work on real language problems, so I ain't complaining).

Increasingly, the value of descriptive linguistic skills is in the creation of corpora that can be processed automatically with tools like AntConc and such. You can do a lot of corpus linguistics these days without explicit math because the software does much of the work for you. But you will still need to understand the underlying math concepts (like why true "keywords" are not simply frequency searches). For details, I can highly recommend Lancaster University's MOOC "Corpus linguistics: method, analysis, interpretation" (it's free and online right now).
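To make the keyword point concrete, here's a minimal sketch (my own toy example, not AntConc's actual implementation) of a log-likelihood "keyness" score, which compares a word's frequency in a target corpus against a reference corpus. A word like "the" is frequent everywhere, so it scores near zero; a word that is frequent only in your target corpus scores high. That's the difference between a keyword and a mere frequency search:

```python
from collections import Counter
import math

def log_likelihood(word, target, reference):
    """Dunning-style log-likelihood keyness score for one word.
    High scores mean the word is unusually frequent in the target
    corpus relative to the reference, regardless of raw count."""
    a = target[word]               # count in target corpus
    b = reference[word]            # count in reference corpus
    c = sum(target.values())       # target corpus size
    d = sum(reference.values())    # reference corpus size
    e1 = c * (a + b) / (c + d)     # expected target count if no difference
    e2 = d * (a + b) / (c + d)     # expected reference count
    ll = 0.0
    if a > 0:
        ll += 2 * a * math.log(a / e1)
    if b > 0:
        ll += 2 * b * math.log(b / e2)
    return ll

# Tiny made-up corpora, just to show the contrast.
target = Counter("the whale the whale the ship the sea".split())
reference = Counter("the cat the dog the house the street".split())

# "the" is equally frequent in both corpora, so its keyness is zero;
# "whale" occurs only in the target, so its keyness is high.
print(log_likelihood("the", target, reference))
print(log_likelihood("whale", target, reference))
```

Real corpus software runs the same logic over millions of words and adds significance cutoffs, but the principle is the one above: keyness is about comparison, not raw counts.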

The real question is: what do you want to do with linguistics? Do you want to get a PhD and then become a professor? That's a tough road (and not just in linguistics; the academic marketplace is imploding due to funding issues). There aren't many universities that hire pure descriptive linguists anymore. Those jobs do exist, but they're rare. SUNY Buffalo, Oregon, and New Mexico are three US schools that come to mind as still having descriptive field linguists on faculty. But the list is short.

If you want to teach a language, that's the most direct route to a job, but you'll need a TESOL certificate too, and frankly, those tend to be low-paid, part-time jobs. It's hard to build a secure career off of that.

That leaves industry. There are industry jobs for non-quantitative linguists, but they're unpredictable. Marketing agencies occasionally hire linguists to do research on cross-linguistic brand names and such. Check out this old post for some examples.

I hope this helps. I recommend asking this question over at The Linguist List too because I have my own biases. It's smart to get a wide variety of perspectives.

Tuesday, September 2, 2014

neural nets and question answering

I just read A Neural Network for Factoid Question Answering by Iyyer et al. (presented at EMNLP 2014).

I've been particularly keen on question answering research in NLP for a long time because my first ever NLP gig was as a grad student intern at a now-defunct question answering start-up in 2000 (QA was all the rage during the 90s tech bubble). QA is somewhat special among NLP fields because it combines all of the others into a single, deeply complex pipeline.

When I saw this paper Tweeted by Mat Kelcey, I was excited by the title, but after reading it, I suspect the constraints of their task make it not quite applicable to commercial QA applications.

Here are some thoughts on the paper, but to be clear: these comments are my own and do not represent in any way those of my employer.

What they did:
Took question/answer pairs from a college Quiz Bowl game and trained a neural network to find answers to new questions. More to the point, "given a description of an entity, [they trained a neural net to] identify the person, place, or thing discussed".


The downside:
  1. They used factoid questions from a game called Quiz Bowl
  2. Factoid questions assume small, easily identifiable answers (typically one word or maybe a short multi-word phrase)
  3. If you’re unfamiliar with the format of these quiz bowl games, you can play something similar at bars like Buffalo Wild Wings. You get a little device for inputting an answer and the questions are presented on TVs around the room. The *questions* are composed of 4-6 sentences, displayed one at a time. The faster you answer, the more points you get. The sentences in the question are hierarchically ordered in terms of information contained. The first sentence gives very little information away and is presented alone for maybe 5 seconds. If you can’t answer, the second sentence appears for 5 seconds giving a bit more detail. If you still can’t answer, the third sentence appears providing even more detail, but fewer points. And so on.
  4. Therefore, they had large *questions* composed of 4-6 sentences, providing more and more details about the answer. This amount of information is rare (though they report results of experimental guesses after just the first sentence, I believe they still used the entire *question* paragraph for training).
  5. They had fixed, known answer sets to train on. Plus (annotated) incorrect answers to train on.
  6. They whittled down their training and test data to a small set of QA pairs that *fit* their needs (no messy data) - "451 history answers and 595 literature answers that occur on average twelve times in the corpus".
  7. They could not handle multi-word named entities (so they manually pre-processed their corpus to convert these into single strings).
The upside:

  1. Their use of dependency trees instead of bag o' words was nice. As a linguist, I want to see more sophisticated linguistic information used in NLP.
  2. They jointly learned answer and question representations in the same vector space rather than learning them separately because "most answers are themselves words (features) in other questions (e.g., a question on World War II might mention the Battle of the Bulge and vice versa). Thus, word vectors associated with such answers can be trained in the same vector space as question text enabling us to model relationships between answers instead of assuming incorrectly that all answers are independent."
  3. I found their error analysis in sections 5.2 ("Where the Attribute Space Helps Answer Questions") and 5.3 ("Where All Models Struggle") especially thought provoking. More published research should include these kinds of sections.
  4. Footnote 7 is interesting: "We tried transforming Wikipedia sentences into quiz bowl sentences by replacing answer mentions with appropriate descriptors (e.g., "Joseph Heller" with "this author"), but the resulting sentences suffered from a variety of grammatical issues and did not help the final result." Yep, syntax. Find-and-replace not gonna cut it.

Friday, August 1, 2014

for linguists, by linguists

The Speculative Grammarian is at it again, offering a happy hour discount on an already ridiculously inexpensive book of linguistic fun: The Speculative Grammarian Essential Guide to Linguistics.
Speculative Grammarian is the premier scholarly journal featuring research in the oft neglected field of satirical linguistics—and it is now available in book form!

It offers a sidelong look at all that is humorous about the field. Containing over 150 articles, poems, cartoons, humorous ads and book announcements—plus a generous sprinkling of quotes, proverbs and other witticisms—the book discovers things to laugh about in most major subfields of Linguistics. It pokes good-natured fun at linguists (famous or otherwise), linguistic theory, and many aspects of language. The authors and editors are linguists who love their field, but who at the same time love to celebrate the funny aspects of Linguistics. The book invites readers to laugh along.

Sunday, June 29, 2014

Facebook "emotional contagion" Study: A Roundup of Reactions

In case you missed it, there was a dust-up this weekend around the web because of a social science study involving manipulation of Facebook news feeds of users (which might include you, if you are an English language user). Here are three points of contention (in order of intensity):
  • Ethics - Was there informed consent?
  • Statistical significance - The effect was small, but the data large, what does this mean?
  • Linguistics - How did they define and track "emotion"?
First, the original study itself:

Experimental evidence of massive-scale emotional contagion through social networks. Kramer et al., PNAS. Synopsis (from PNAS):
We show, via a massive (N = 689,003) experiment on Facebook, that emotional states can be transferred to others via emotional contagion, leading people to experience the same emotions without their awareness. We provide experimental evidence that emotional contagion occurs without direct interaction between people (exposure to a friend expressing an emotion is sufficient), and in the complete absence of nonverbal cues.
My two cents: We'll never see the actual language data, so the many questions this study raises are destined to be left unanswered.

The Roundup

In Defense of Facebook: If you can only read one analysis, read Tal Yarkoni's deep dive response to the study and its critics. It's worth a full read (comments too). He makes a lot of important points, including the weakness of the effect, the rather tame facts of the actual experiments, and the normalcy of manipulation (that's how life works) but for me, this take-down of the core assumptions underlying the study is the Money Quote:
the fact that users in the experimental conditions produced content with very slightly more positive or negative emotional content doesn’t mean that those users actually felt any differently. It’s entirely possible–and I would argue, even probable–that much of the effect was driven by changes in the expression of ideas or feelings that were already on users’ minds. For example, suppose I log onto Facebook intending to write a status update to the effect that I had an “awesome day today at the beach with my besties!” Now imagine that, as soon as I log in, I see in my news feed that an acquaintance’s father just passed away. I might very well think twice about posting my own message–not necessarily because the news has made me feel sad myself, but because it surely seems a bit unseemly to celebrate one’s own good fortune around people who are currently grieving. I would argue that such subtle behavioral changes, while certainly responsive to others’ emotions, shouldn’t really be considered genuine cases of emotional contagion

the Empire strikes back: Humanities Professor Alan Jacobs counters Yarkoni, using language that at times seemed to verge on unhinged, but hyperbole aside, he takes issue with claims that the experiment was ethical simply because users signed a user agreement (that few of them ever actually read). Money Quote:
This seems to be missing the point of the complaints about Facebook’s behavior. The complaints are not “Facebook successfully manipulated users’ emotions” but rather “Facebook attempted to manipulate users’ emotions without informing them that they were being experimented on.” That’s where the ethical question lies, not with the degree of the manipulation’s success. “Who cares if that guy was shooting at you? He missed, didn’t he?” — that seems to be Yarkoni’s attitude

Facebook admits manipulating users' emotions by modifying news feeds: Across the pond, The Guardian got into the kerfuffle. Never one to miss a chance to go full metal Orwell on us, the Guardian gives us this ridiculous Money Quote with not a whiff of counter-argument:
In a series of Twitter posts, Clay Johnson, the co-founder of Blue State Digital, the firm that built and managed Barack Obama's online campaign for the presidency in 2008, said: "The Facebook 'transmission of anger' experiment is terrifying." He asked: "Could the CIA incite revolution in Sudan by pressuring Facebook to promote discontent? Should that be legal? Could Mark Zuckerberg swing an election by promoting Upworthy [a website aggregating viral content] posts two weeks beforehand? Should that be legal?"
This Clay Johnson guy is hilarious, in a dangerously stupid way. How does his bonkers ranting rate two paragraphs in a Guardian story?


Everything We Know About Facebook's Secret Mood Manipulation Experiment: The Atlantic provides a roundup of sorts, a review of the basic facts, and some much needed sanity about the limitations of LIWC (which is a limited, dictionary-based tool that, except for the evangelical zeal of its creator James Pennebaker, would be little more than a toy for undergrad English majors to play with). The article also provides important quotes from the study's editor, Princeton's Susan Fiske, and links to a full interview with Professor Fiske.

Emotional Contagion on Facebook? More Like Bad Research Methods: If you have time to read two and only two analyses of the Facebook study, first read Yarkoni above, then read John Grohol's excellent fisking of the (mis-)use of LIWC as tool for linguistic study. Money Quote:
much of human communication includes subtleties ... — without even delving into sarcasm, short-hand abbreviations that act as negation words, phrases that negate the previous sentence, emojis, etc. — you can’t even tell how accurate or inaccurate the resulting analysis by these researchers is. Since the LIWC 2007 ignores these subtle realities of informal human communication, so do the researchers.
Analyzing Facebook's PNAS paper on Emotional Contagion: Nitin Madnani provides an NLPer's detailed fisking of the experimental methods, with special attention paid to the flaws of LIWC (with bonus comment from Brendan O'Connor, recent CMU grad and new UMass Amherst professor). Money Quote:
Far and away, my biggest complaint is that the Facebook scientists simply used a word list to determine whether a post was positive or negative. As someone who works in natural language processing (including on the task of analyzing sentiment in documents), such a rudimentary system would be treated with extreme skepticism in our conferences and journals. There are just too many problems with the approach, e.g. negation ("I am not very happy today because ..."). From the paper, it doesn't look like the authors tried to address these problems. In short, I am skeptical about whether the experiment actually measures anything useful. One way to address comments such as mine is to actually release the data to the public along with some honest error analysis about how well such a naive approach actually worked.
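The negation problem these critics describe is easy to demonstrate with a toy word-list counter (the word lists below are my own illustrative stand-ins, not LIWC's actual dictionaries):

```python
# A toy version of the word-list approach the critics describe:
# count positive and negative words from fixed lists, ignore context.
POSITIVE = {"happy", "great", "good", "love"}
NEGATIVE = {"sad", "bad", "terrible", "hate"}

def naive_sentiment(text):
    """Positive minus negative word counts; no syntax, no negation."""
    words = text.lower().replace(".", "").split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg

# The word list sees "happy" and never sees the negation two words
# earlier, so this sentence scores as positive.
print(naive_sentiment("I am not very happy today"))
```

A bag-of-words counter scores "I am not very happy today" as positive, which is exactly the kind of error being pointed at: the method counts words, not meaning.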

Facebook’s Unethical Experiment: Tal Yarkoni's article above provides a pretty thorough fisking of this Slate screed. I'll just add that Slate is never the place I'd go to for well reasoned, scientific analysis. A blow-by-blow deep dive into the last episode of Orange Is The New Black? Oh yeah, Slate has that genre down cold.


Anger Builds Over Facebook's Emotion-Manipulation Study: The site that never met a listicle it didn't love, Mashable provides a short article that fails to live up to its title. They provide little evidence that anger is building beyond screen grabs of a whopping four Twitter feeds. Note, they completely ignore the range of people supporting the study (no quotes from the authors, for example). As far as I can tell, there is no hashtag for anti-Facebook study tweets.


Facebook Manipulated User News Feeds To Create Emotional Responses: Forbes wonders aloud about the mis-use of the study by marketers. Money Quote:
What harm might flow from manipulating user timelines to create emotions?  Well, consider the controversial study published last year (not by Facebook researchers) that said companies should tailor their marketing to women based on how they felt about their appearance.  That marketing study began by examining the days and times when women felt the worst about themselves, finding that women felt most vulnerable on Mondays and felt the best about themselves on Thursdays ... The Facebook study, combined with last year’s marketing study suggests that marketers may not need to wait until Mondays or Thursdays to have an emotional impact, instead  social media companies may be able to manipulate timelines and news feeds to create emotionally fueled marketing opportunities.
You don't have to work hard to convince me that marketing professionals have a habit of half-digesting science they barely understand to try to manipulate consumers. That's par for the course in that field, as far as I can tell. Just don't know what scientists producing the original studies can do about it. Monkey's gonna throw shit. Don't blame the banana they ate.


Creepy Study Shows Facebook Can Tweak Your Moods Through ‘Emotional Contagion’: The Blaze writer Zach Noble summed up the negative reaction this way: a victory for scientific understanding with some really creepy ramifications. But I think it only seems creepy if you misunderstand the actual methods.

Final Thought: It's the bad science that creeps me out more than the questionable ethics. Facebook is data; let's use it wisely.

Friday, June 6, 2014

would you like vocal fries with that?

Actual linguist Christian DiCanio debunks a non-linguists' study about perceptions of faked vocal fry (if The Onion did linguistics parodies, surely this would be it): Vocal fry doesn't harm your career prospects, but not being yourself just might.

Money quote:
...listeners judge the female speakers with vocal fry as sounding "untrustworthy", there is a good possibility that they are simply making such a judgment based on the speaker not sounding like herself. The better lesson that one might take home instead here is that one's job prospects are harmed if you try to talk (or act) like someone who you are not.
Read the full take-down here (including bonus spectrogram!)

PS: I knew Christian briefly when he was an undergrad at SUNY Buffalo. He was talented and motivated. Now he's slumming it at some shady, slacker *university* in Connecticut. Damn waste.

Tuesday, May 27, 2014

mathematical linguistics for high school students


I received the following email this weekend:

I'm a high school junior from southern California.

For our final project in AP Calculus class, I'm doing a presentation on the connection between mathematics and linguistics, and I stumbled on your blogpost "Why Linguists Should Study Math" while researching my topic.

I was wondering if you could point me towards some resources (that are relatively easy to understand) about how math is present in and affects our written and spoken language.
Some things that I am considering are:
- the occurrences of words in our language
- how grammar uses mathematical principles
- algorithms we use to construct sentences

Thanks,
M.
My [edited] response (suggestions from y'all as to better resources are much appreciated; I'll forward; I wanted to get a response out quickly because the final is presumably fast approaching):

M.,

Thanks for reaching out to me. Of course, I think you’ve chosen a good topic. There are two broad ways in which linguistics and math intersect:
  1. How the human brain uses math in natural language (psycholinguistics)
  2. How linguists use math to study and model languages (computational linguistics)
From your email, it appears you are mostly interested in #1. However, in contemporary linguistics, the two are fast becoming one. Most contemporary linguists use math as a tool.

Let me address your three areas of interest with respect to how the human brain might use math to process and produce language:

The occurrences of words in our language: For the most part, this means “frequency” which really means counting. Linguists love to count. We use large corpora of texts to count words and phrases. Lancaster University in the UK is a well-known corpus linguistics school. Their web page has a lot of good introductory information (although I find it a bit clunky looking).

UPDATE: I forgot to include the one item that most directly answers the basic question: frequency effects in language. Humans are very aware of how often they hear words. In some sense, we count words automatically: even if it's not quite a specific count like 75, we somehow know which words, phonemes, and syntactic structures we hear or read more than others. This gives rise to a variety of frequency effects in language processing. This is the clearest example of how the brain uses math for language.

For example, we recognize high frequency words much faster than low frequency words. The website for Paul Warren's book "Introducing Psycholinguistics" has an online demo for a word frequency task you can walk through to see how linguists study this.
 
What do linguists count?
  • Words: I’m sure you’ve seen word clouds like Wordle. This is composed of simple word frequency counts. One of the most enduring facts about word counts is Zipf’s Law which says “the most frequent word [in a corpus of texts] will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.” Why would this be true? Linguists have been studying this for decades.
  • Ngrams: strings of two, three, four (or more) adjacent words. These provide more context than mere single-word frequencies. Have some fun playing around with Google’s Ngram Viewer if you haven’t already. Try plotting the change in frequency of “mathematical linguistics” and “corpus linguistics” (paste those two phrases into the search box with no quotes and only a comma separating them). Scholars are trying to use this to plot changes in culture. For example, take a look at this PDF.
  • Other: We also count many other things, like parts of speech (verbs, nouns, prepositions, etc.). We also count the co-occurrence of linguistic items that are not right next to each other. If you want to dig into more frequency fun, check out the more advanced tools at BYU. You can read more about how these tools help us study language here.
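Everything in the list above starts with counting, and the counting itself is only a few lines of code. A minimal sketch (the tiny corpus is made up for illustration):

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat on the rug".split()

# Simple word frequency counts -- the raw material of corpus linguistics.
unigrams = Counter(text)

# Ngrams: sliding windows of n adjacent words, which preserve
# a little context that single-word counts throw away.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = Counter(ngrams(text, 2))

# The most frequent words and word pairs in this tiny corpus.
print(unigrams.most_common(3))
print(bigrams.most_common(2))
```

Even in this twelve-word toy corpus, "the" dominates the counts, a miniature of the skewed distribution Zipf's Law describes for real corpora.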

How grammar uses mathematical principles: One of the most commonly studied types of mathematical principle in language is statistical learning. A good example of this is transitional probabilities, which are sets of probabilities for what linguistic item might come next given a string of items (e.g., words or phonemes). For example, if you read “The author signed the _______”, you could guess what the blank word is based on the previous four words (most likely, it’s “book”). This is the basis of the psycholinguistic tests called “Cloze tests”. Linguists have discovered that the brain tracks transitional probabilities for all kinds of linguistic items. In fact, this is one of the most robust areas of study in language acquisition. Linguists study how babies use transitional probabilities to learn language. For example, one of the most challenging problems is figuring out how babies learn to separate a continuous stream of audio noise coming into their ears into separate words, without any knowledge of what words are or what they mean. One theory is that babies quickly learn transitional probabilities of sounds that tell them where one word ends and another begins. But transitional probabilities alone are not enough. For a challenge, try reviewing this PDF.
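Transitional probabilities are also just counting, one step removed. A minimal sketch (the toy sentences are my own; real estimates come from large corpora):

```python
from collections import Counter

# Toy corpus; in real work these probabilities are estimated
# from millions of words.
sentences = [
    "the author signed the book",
    "the author wrote the book",
    "the author signed the contract",
]

pair_counts = Counter()   # how often w2 follows w1
first_counts = Counter()  # how often w1 occurs with any word after it
for s in sentences:
    words = s.split()
    for w1, w2 in zip(words, words[1:]):
        pair_counts[(w1, w2)] += 1
        first_counts[w1] += 1

def transitional_probability(w1, w2):
    """P(w2 | w1): how likely w2 is, given that w1 just occurred."""
    if first_counts[w1] == 0:
        return 0.0
    return pair_counts[(w1, w2)] / first_counts[w1]

# "signed" is always followed by "the" in this corpus, so P = 1.0;
# "the" has several possible continuations, so each gets less.
print(transitional_probability("signed", "the"))
print(transitional_probability("the", "book"))
```

The same arithmetic, run over phoneme sequences instead of words, is what the word-segmentation theory attributes to babies: dips in transitional probability suggest a word boundary.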

Algorithms we use to construct sentences: This is the most controversial area you’ve asked about. The fact is, we linguists don’t really know how the brain constructs sentences. As I mentioned above, there are models based on transitional probabilities like Markov models, a computer algorithm designed to make those same kinds of guesses we made about “book”. Markov models and Cloze tests are a good example of psycholinguistics and computational linguistics coming together. As a theoretical contrast to statistical models, there are rule-based models like formal grammars. These are not mathematical in a typical sense, but they are based on formal logic, which is the underlying foundation of mathematics. Linguistics is in the middle of a war between the formal grammar camp and the statistical grammar camp. There’s no consensus on which is the *correct* model of language. However, in the last decade or so, the statistical side seems to have gained the advantage. If you really want to dig into this war, here’s a challenging read.
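A first-order Markov model is just those transitional counts run in the other direction: instead of scoring what came next, you sample it. A minimal sketch (toy training text, not a serious language model):

```python
import random
from collections import defaultdict

# Train a first-order Markov model: for each word, remember which
# words followed it in the training text.
training_text = ("the author signed the book and the author "
                 "wrote the book for the reader").split()

transitions = defaultdict(list)
for w1, w2 in zip(training_text, training_text[1:]):
    transitions[w1].append(w2)

def generate(start, length, seed=0):
    """Build a word sequence by repeatedly sampling a successor."""
    random.seed(seed)  # fixed seed so the toy output is repeatable
    words = [start]
    for _ in range(length - 1):
        successors = transitions.get(words[-1])
        if not successors:
            break  # dead end: this word was never followed by anything
        words.append(random.choice(successors))
    return " ".join(words)

print(generate("the", 6))
```

Every sequence it produces is locally plausible (each word pair occurred in training) but has no grammar, no plan, and no meaning, which is a fair picture of both the power and the limits of the statistical camp's simplest models.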

Additional Reading:
Linguists who count (the comments are especially engaging; your teacher might be particularly interested in the calculus vs. algebra debate that ensues).


I hope this gets you off to a good start. Please don’t hesitate to ask for clarifications or more resources (especially let me know if you need more intro level or more advanced level; I wasn’t sure if I hit the level right or not). I’m happy to be of more assistance if I can. As a smart, dedicated student, I’m sure you’re ready to dig into ngrams and Markov models. But, as a high school junior in southern California with June fast approaching, I’m also sure you’re ready for the beach. Both are required for a healthy life of the mind.

Wednesday, May 21, 2014

Jobs for linguists - May 2014

California is awash in jobs for linguists this Spring...

Update 5/24/14: Branding and Marketing
Interbrand
NY, NY
Consultant, Verbal Identity

B.A. degree, backgrounds of interest include any verbal-focused or writing intensive field (e.g. Linguistics)
Apply Here

Text Analytics Consultant
Medallia, Inc. - Palo Alto, California
Bachelor's degree
Background in Linguistics
Demonstrated interest in technology
Strong preference for a French or German native speaker
(Not visible on company website, found on LinkedIn, sign in required)
Apply Here

Linguistic Intern
Bosch Group
SF Bay Area
Responsibilities: Support the development of next-generation language products in the areas of speech and language technologies and systems. Support the administration of user studies
Qualifications: Senior undergraduate or graduate students in Applied Linguistics, or related fields
Apply Here

Analytical Linguist, Ads Human Evaluation
Google
Los Angeles, CA, USA
Product Management
Responsibilities: Direct, monitor, train, and manage the day-to-day work of temporary workers.
Design and implement tests on data and worker quality, analyzing and reporting on the results using Python, XML/CSS, HTML/JavaScript, database queries, and Google-internal technologies.
Work directly with engineers and statisticians to devise and run experiments to answer specific questions about advertising and product quality.
Minimum Qualifications: MA/MS or PhD degree in an analytical field (e.g., Linguistics, Cognitive Science, Statistics, Mathematics), or equivalent practical experience.
Experience with one or more of the following: Python or another scripting language, Java or C++, XML/HTML/CSS/JavaScript, SQL or specialized database query languages and/or specialized analysis software such as Matlab, R, SPSS, STATA, SAS, Praat, or E-Prime.
Experience working with large quantities of data.
Apply Here
And see my context here

Apple
Lexical Resource Manager
Education
M.A. or PhD in Linguistics or related field
Strong background in phonology
Apply Here

Friday, February 21, 2014

RIP Charles Fillmore

I never met Charles Fillmore, but he had a deep influence on my linguistics education. When I was a graduate student in linguistics at SUNY Buffalo we only half jokingly called it Berkeley East because half the faculty had been trained at Berkeley and the department had a *perspective* on linguistics that was undeniably colored by Berkeley theory. Charles Fillmore was a hero at SUNY Buffalo and it was hard to take a class that didn't reference his work. His work on constructions and frame semantics was the underpinning of my interest in verb classes and prepositions.

I can't offer any unique thoughts on the man, so I'll simply point to some folks around the web who have offered theirs:

A Roundup of Reactions

Paul Kay - Charles J. Fillmore
The magnitude of Fillmore’s contributions to linguistics can hardly be exaggerated

George Lakoff - He Figured Out How Framing Works
He discovered that we think, largely unconsciously, in terms of conceptual frames — mental structures that organize our thought. Further, he found that every word is mentally defined in terms of frame structures.

Dominik Lukes - Linguistics According to Fillmore
Charles J Fillmore who was a towering figure among linguists without writing a single book. In my mind, he changed the face of linguistics three times with just three articles (one of them co-authored).

UC Berkeley - Linguistics Department
He was a gifted teacher, a beloved mentor, a treasured colleague and friend, and one of the great linguists of the last half-century.

Arnold Zwicky - Chuck Fillmore
...with a link to a wonderful video he made about his career in 2012.

Friday, January 31, 2014

The SOTU and Reading Level

Evan Fleischer wrote a cheeky little bit about the reading level of the SOTU over at Esquire: Is the State of the Union Getting Dumber?

It was triggered by this graph in The Guardian:

Evan emailed me and several other linguists to get some reactions. He quotes me, Ben Zimmer, and Angus B. Grieve-Smith. We generally agreed that the trend noted in the graph probably had more to do with changes in who the speech is for than with any change in intelligence level.

It's a fun little read.

Tuesday, January 28, 2014

Anticipating the SOTU

In anticipation of President Obama's 2014 State Of The Union speech tonight, and the inevitable bullshit word frequency analysis to follow, I am re-posting my post from 2010's SOTU reaction, in hope that maybe, just maybe, some political pundit might be slightly less stupid than they were last year ... sigh .. here's to hope

BTW, Liberman has been on top of the SOTU story for a while now. Here's his latest.

(cropped image from Huffington Post)

It has long been a grand temptation to use simple word frequency* counts to judge a person's mental state. Like Freudian Slips, there is an assumption that this will give us a glimpse into what a person "really" believes and feels, deep inside. This trend came and went within linguistics when digital corpora were first being compiled and analyzed several decades ago. Linguists quickly realized that this was, in fact, a bogus methodology when they discovered that many (most) claims or hypotheses based solely on a person's simple word frequency data were easily refuted upon deeper inspection. Nonetheless, the message of the weakness of this technique never quite reached the outside world and word counts continue to be cited, even by reputable people, as a window into the mind of an individual. Geoff Nunberg recently railed against the practice here: The I's Dont Have It.

The latest victim of this scam is one of the blogging world's most respected statisticians, Nate Silver who performed a word frequency experiment on a variety of U.S. presidential State Of The Union speeches going back to 1962 HERE. I have a lot of respect for Silver, but I believe he's off the mark on this one. Silver leads into his analysis talking about his own pleasant surprise at the fact that the speech demonstrated "an awareness of the difficult situation in which the President now finds himself." Then, he justifies his linguistic analysis by stating that "subjective evaluations of Presidential speeches are notoriously useless. So let's instead attempt something a bit more rigorous, which is a word frequency analysis..." He explains his methodology this way:

To investigate, we'll compare the President's speech to the State of the Union addresses delivered by each president since John F. Kennedy in 1962 in advance of their respective midterm elections. We'll also look at the address that Obama delivered -- not technically a State of the Union -- to the Congress in February, 2009. I've highlighted a total of about 70 buzzwords from these speeches, which are broken down into six categories. The numbers you see below reflect the number of times that each President used each term in his State of the Union address.

The comparisons and analysis he reports are bogus and at least as "subjective" as his original intuition. Here's why: