Sunday, November 29, 2015

online psycholinguistics demos 2015

I was asked recently about an old post from 2008 that listed a variety of online psycholinguistics demos. All of the links are dead now, so I was asked if I knew of any updated ones. This is what I can find. Any suggestions would be welcomed.

  • Harvard Implicit Associations Task: Project Implicit is a non-profit organization and international collaboration between researchers who are interested in implicit social cognition - thoughts and feelings outside of conscious awareness and control. The goal of the organization is to educate the public about hidden biases and to provide a “virtual laboratory” for collecting data on the Internet.
  • webspr - Conduct psycholinguistic experiments (e.g. self-paced reading and speeded acceptability judgment tasks) remotely using a web interface
  • Games With Words: Learn about language and about yourself while advancing cutting-edge science. How good is your language sense?
  • Lexical Decision Task demo: In a lexical decision task (LDT), a participant must decide whether combinations of letters are words or not. For example, when you see the word "GIRL", you respond "yes, this is a real English word", but when you see the letters "XLFFE" you respond "no, this is not a real English word" (a toy sketch of such a trial appears just below).
  • Categorical Perception: Categorical perception means that a change in some variable along a continuum is perceived not as gradual but as instances of discrete categories. The test presented here is a classical demonstration of categorical perception for a certain type of speech-like stimuli.
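If you're curious what's under the hood of a demo like that, here is a bare-bones sketch of a few lexical decision trials. This is purely illustrative; the item list and console-based timing are my own stand-ins, and real demos use properly calibrated presentation software.

```python
# A minimal, console-based sketch of lexical decision trials:
# show a letter string, record a yes/no response and a rough reaction time.
import random
import time

items = [("GIRL", True), ("XLFFE", False), ("TABLE", True), ("PLIRT", False)]
random.shuffle(items)

results = []
for letters, is_word in items:
    input("Press Enter when ready for the next trial...")
    start = time.monotonic()
    response = input(f"{letters}  --  real English word? (y/n): ").strip().lower()
    rt_ms = (time.monotonic() - start) * 1000
    results.append((letters, (response == "y") == is_word, round(rt_ms)))

for letters, correct, rt in results:
    print(f"{letters:8s} correct={correct}  RT={rt} ms")
```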
Paul Warren has a variety of demos at the site for his textbook "Introducing Psycholinguistics":

  • McGurk demo

  • Various other demos from Warren's textbook

Saturday, November 14, 2015

    Google's TensorFlow and "mathematical tricks"

TensorFlow is a new open source software library for machine learning distributed by Google. In some ways, it could be seen as a competitor to BlueMix (though much less user-friendly). Erik Mueller, who worked on the original Watson Jeopardy system (and has a vested interest in AI with his new company Symbolic AI), just wrote a brief review of TensorFlow for Wired.

    Google’s TensorFlow Alone Will Not Revolutionize AI

Unfortunately, it's not really a review of TensorFlow itself; rather, it makes a general point against statistical approaches. I basically agree with that point, but the argument requires a much more comprehensive treatment.

    Some good quotes from the article:

    • "I think [TensorFlow] will focus our attention on experimenting with mathematical tricks, rather than on understanding human thought processes."
    • "I’d rather see us design AI systems that are understandable and communicative."

    Wednesday, June 10, 2015

    The Language Myth - Book Review

Linguistics professor Vyvyan Evans recently published a new book that has at least one group of linguists in a state of frenzy: The Language Myth: Why Language Is Not an Instinct. The book's blurb sums up its content:
    Some scientists have argued that language is innate, a type of unique human 'instinct' pre-programmed in us from birth. In this book, Vyvyan Evans argues that this received wisdom is, in fact, a myth. Debunking the notion of a language 'instinct', Evans demonstrates that language is related to other animal forms of communication; that languages exhibit staggering diversity; that we learn our mother tongue drawing on general properties and abilities of the human mind, rather than an inborn 'universal' grammar; that language is not autonomous but is closely related to other aspects of our mental lives; and that, ultimately, language and the mind reflect and draw upon the way we interact with others in the world
Evans grounds his motivation in the idea that a variety of false claims about how language works ("myths") are deeply rooted in our culture's background knowledge as well as explicated in introductory textbooks. He goes further to claim that these false claims have been pushed by a small number of pre-eminent scholars whose fame and influence have caused the claims to be taken more seriously than they deserve on their face.

    By all rights, I should be a good audience for this book. I was trained as a linguist in a department that was openly hostile to the language instinct doctrine that this book argues against (see my post about that experience). 

The book is organized around two principles. First, each chapter starts by stating one false claim and describing why it was proposed as an explanation of how language works. Second, each chapter then deconstructs the myth into component claims and shoots holes in each one.

    The Good
Evans does a service to the lay audience by pointing out that deep divisions exist within the field of linguistics. Too often non-experts assume a technical field is homogeneous and that everyone agrees on the basic theories. This is simply not true of linguistics.

Evans also does a service to his audience by stepping through the logic of refutation. His point-counterpoint style can be detailed at times, but I appreciate a book that doesn't treat its readers like third graders (I'm looking at you, Gladwell).

For me, the standout chapter was Chapter 5: Is language a distinct module in the mind? This chapter is devoted to neurolinguistics, and here Evans is at his sharpest, leading the reader through his point-counterpoint about brain regions and functionality.

    The Bad
Evans fails to do justice to the myths he debunks. He has been accused of creating straw men (and addresses this somewhat in the introduction), and ultimately I have to agree. Evans does not provide a fair description of arguments like the poverty of the stimulus.

Evans quickly shows his bias and directly attacks just two people: Noam Chomsky and Steven Pinker (and to a lesser extent, Jerry Fodor). Evans wants to debunk general notions that have crept into the general public's background beliefs about language, but what he really does is rail against two guys. And worse, he often devolves into a detailed point-counterpoint with just one book, Pinker's 1994 The Language Instinct. Any reader unfamiliar with that book will quickly drown in arguments against claims they have never encountered. As an exercise, I would recommend Evans rewrite this book without a single reference to Chomsky, Pinker, or Fodor. I suspect the result would be a more effective piece of writing.

    Lest some Chomskyean take this review wrongly, let me be clear: I think Chomsky is broadly wrong and Evans is broadly right. But even though I believe Pinker is wrong and Evans is right, I find Pinker a far superior writer and seller of ideas. And that is a serious problem. 

Evans would have been better off throwing away the anti-Chomsky rants and simply writing his own view of how language works. A book on its own terms. Instead he comes across as your drunk uncle at Christmas who can't stop complaining about how the ref in a high school football game 20 years ago screwed him over with a bad call. That might actually be true, but get over it.

I feel Evans has taken on too much. Each myth is worth a small book of its own to debunk properly. This is partly what leads to the straw man arguments: efficiency. A non-straw-man version of Evans' book would be 3,000 pages long and appeal only to the three people in the world who know enough about both Chomsky and functionalist theory to follow all that detail. So I *get* why Evans chose this style. I just think Pinker is better at it. Ultimately Evans alienates his lay audience by ranting about people they don't know and arguments they are unfamiliar with.

A complaint about details: he can be disingenuous with citations. On page 110 he uses the wording "the most recent version of Universal Grammar", but turn to the footnote on page 264 and he cites publications from 1981 and 1993. In a book published in 2015, citations from 1981 and 1993 hardly count as recent. See also page 116, where he cites "a more recent study" that was actually published in 2004 (and probably conducted in 2002).

    I don't want to be critical of a book that argues a position I align with, but I must be honest. This book just doesn't cut it. 

    Sunday, May 17, 2015

    The Language Myth - Preliminary Thoughts

I started reading The Language Myth: Why Language Is Not an Instinct by Vyvyan Evans. The book argues that Noam Chomsky is wrong about the basic nature of language. It has sparked controversy, and more words have probably been published in blogs and tweets in response than are contained in the book itself.

I'm two chapters in, but before I begin posting my review, I wanted to do a post on academic sub-culture, specifically the one I was trained in. I did my (not quite completed) PhD in linguistics at SUNY Buffalo from 1998 to 2004. The students only half-jokingly called it Berkeley East because, at the time, about half the faculty had been trained at Berkeley (and several others were closely affiliated in research), and Berkeley is one of the great strongholds of anti-Chomsky sentiment. Buffalo was clearly a "functionalist" school (though no one ever really knew what that meant, functionalism never really being a field so much as a culture).

In any case, we were clearly, undeniably, virulently anti-Chomsky. And that's the culture I want to describe, to give some sense of how different the associations with the name "Chomsky" are for me (and, I suspect, for Evans) than for non-linguists and for non-Chomskyan linguists.

    So what was it like to be a grad student in a functionalist linguistics department, with respect to Noam Chomsky?

    [SPOILER ALERT - inflammatory language below. Most of this post is intended to represent a thought climate within functionalist linguistics, not factual evidence]

    I never quite drank the functionalist Kool-Aid (nor the Chomskyean Kool-Aid either, to be clear); nonetheless I remain endowed with a healthy dose of Chomsky skepticism.

    Here is how I remember the general critique of Chomsky echoed in the halls of SUNY Buffalo linguistics (this is my memory of ten+ years ago, not intended to be a technical critique; this is meant to give the impression of what the culture of a functionalist department felt like).

    The Presence of Chomsky

    • First, we didn't talk about Chomsky much; he was peripheral. What little we said about him was typically mocking and belittling (grad students, ya know).
    • The syntax courses, however, were designed to teach Chomsky's theories for half a semester, then each instructor was given the second half to teach whatever alternative theory they wanted. For my Syntax I course, we used one of Andrew Radford's Minimalism textbooks (then RRG for the second half). For my Syntax II, we used Elizabeth Cowper's GB textbook (then what Matthew Dryer called "Basic Theory", which I always preferred above all else).
    • We had a summer reading group for years. One summer we read Chomsky’s The Minimalist Program because we felt responsible for understanding the paradigm (we wanted to try to understand the *other*). The group included two senior faculty, both with serious syntax background. 

    The Perception of Chomsky 
    (amongst my cohort, this is what my professors, my fellow grad students, and I thought about the guy; whether we were accurate or not is another thing)

    • Noam Chomsky is a likable man, for those who get to meet him in person.
    • Chomsky did linguistics a great service by taking linguistics in the general direction of hard science.
    • Chomsky's ideas have never been accepted by a majority of linguists if you include semanticists, discourse analysts, sociolinguists, linguists outside the US, psycholinguists, anthropological linguists, historical linguists, field linguists, philologists, etc. Outside of American syntacticians, Chomsky is a footnote, a non-factor.
    • Many of his fiercest critics were former students or colleagues.
    • Chomsky radically changes his theories every ten years or so, simply ignoring his previous claims when they're proved wrong.
    • Chomsky has never made a serious attempt to understand other theories or engage in linguistic debate; he lives in a cocoon.
    • He bases major theoretical mechanisms on scant evidence, often obsessing over a single sentence in a language he himself has never studied, based only on evidence from an obscure source (like a grad student thesis).
    • He condescendingly dismisses most linguistic evidence (like spoken data) with the unfounded distinction between narrow syntax and broad syntax. This allows him to cherry pick data that suits him, and ignore data that refutes his claims.
    • When critiques are presented by serious linguists with evidence, the evidence is discarded as *irrelevant*, the linguists are derided as foolish amateurs, and the critiques are dismissed as naive. But rarely are the points taken as serious debate.
    • Chomsky only debates internal mechanisms of his own theories; anyone who argues using mechanisms outside of those Chomsky-internals is derided as ignorant. In other words, there is only one theoretical mechanism, only one set of theoretical terms and artifacts; only these will be recognized as *legitimate* linguistics. Anything else is ignored. 
    • Chomsky doesn't engage with the wider linguistics community. 
    • Chomsky expects to be taken seriously in a way that he himself would never allow anyone else to be taken seriously: lacking substantial evidence, lacking external coherence, and lacking anything approximating collegiality.
    • Oh, and Chomsky himself hasn't done serious linguistic analysis since the 80s. He has devoted most of the last 30 years to stabbing at political windmills. At most, he spends maybe 10% of his time on linguistics. 

    That’s the image of the man as I recall from the view of a functionalist department devoted to descriptive linguistics. Let the verbal assaults begin!!!

    UPDATE (May 5): This post prompted a spirited Reddit discussion, well worth reading.

    Thursday, March 12, 2015

    Jobs with IBM Watson

    IBM Watson is currently recruiting Washington DC area engineers for "Natural Language Processing Analyst" positions. We're looking for engineers who like to build stuff and travel. You can apply through the link, or feel free to contact me if you want more info (use the "View my complete profile" link to the right for my contact info).

    Here's the official posting (hint: there is wiggle room)

    Job description
    Ready to change the way the world works? IBM Watson uses Cognitive Computing to tackle some of humanity's most challenging problems - like revolutionizing how doctors research cancer or transforming how businesses engage with their customers. We have an exciting opportunity for a Watson Natural Language Processing Analyst responsible for rigorous analysis of system performance phases including search, evidence scoring, and machine learning.

    Natural Language Processing (NLP) Analysts evaluate system performance, and identify steps to drive enhancements. The role is part analyst and part developer. Analysts are required to function independently to dive deep into system components, identify areas for improvement, and devise solutions. Analysts are expected to drive test and evaluation of their solutions, and empirically identify follow on steps to implement continuous system improvement. Natural Language Processing is an explosively dynamic field; analysts must expect ambiguity, and demonstrate the ability to develop courses of action on the basis of data driven analysis. Must be able to work independently and demonstrate initiative. Demonstrated analytical skills, security clearances preferred but not required.

    We live in a moment of remarkable change and opportunity. The convergence of data and technology is transforming industries, society and even the workplace. New roles are being created that never existed before to meet the demands of this transformation. And IBM Watson is now looking for talent in healthcare, life sciences, financial services, the public sector and others to new roles destined to usher in the next era of cognitive computing. Embark on the journey with us at IBM Watson.
    Required:
    • Bachelor's Degree
    • At least 2 years experience in Text Search Engines (such as Lucene)
    • At least 2 years experience in Java Development Proficiency
    • Basic knowledge in Natural Language Processing
    • Basic knowledge in Text Analytics/ Info Retrieval
    • Basic knowledge in Unstructured Data
    • Readiness to travel 50% annually
    • U.S. citizenship required
    • English: Fluent

    Preferred:
    • Master's Degree
    • At least 5 years experience in Text Search Engines (such as Lucene)
    • At least 5 years experience in Java Development Proficiency
    • At least 2 years experience in Natural Language Processing
    • At least 2 years experience in Text Analytics/ Info Retrieval
    • At least 2 years experience in Unstructured Data
    IBM is committed to creating a diverse environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status.

    Monday, March 2, 2015

    The Linguistics behind IBM Watson

    I will be talking about the linguistics behind IBM Watson's Question Answering on March 11 at the DC Natural Language Processing MeetUp. Here's the blurb:

    In February 2011, IBM Watson defeated Brad Rutter and Ken Jennings in the Jeopardy! Challenge. Today, Watson is a cognitive system that enables a new partnership between people and computers that enhances and scales human expertise by providing a more natural relationship between the human and the computer. 

    One part of Watson’s cognitive computing platform is Question Answering. The main objective of QA is to analyze natural language questions and present concise answers with supporting evidence, rather than a list of possibly relevant documents like internet search engines.

    This talk will describe some of the natural language processing components that go into just three of the basic stages of IBM Watson’s Question Answering pipeline:

    • Question Analysis
    • Hypothesis Generation
    • Semantic Types

    The NLP components that help make this happen include a full syntactic parse, entity and relationship extraction, semantic tagging, co-reference, automatic frame discovery, and many others. This talk will discuss how sophisticated linguistic resources allow Watson to achieve true question answering functionality.
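To give a rough flavor of what the question-analysis stage involves, here is a toy sketch of extracting a question's focus and lexical answer type (LAT) with an off-the-shelf parser. To be clear, this is my own illustration of the general idea, not Watson code, and the heuristics are deliberately naive:

```python
# Toy "question analysis": guess a question's focus and lexical answer type (LAT)
# using an off-the-shelf parser (spaCy). Illustrative only; not IBM Watson code.
import spacy

nlp = spacy.load("en_core_web_sm")

def analyze_question(text):
    doc = nlp(text)
    analysis = {
        "question": text,
        "entities": [(ent.text, ent.label_) for ent in doc.ents],  # named entities
    }
    for tok in doc:
        # A wh-word ("who", "what", "which") often marks the focus;
        # the noun it attaches to is a rough stand-in for the LAT.
        if tok.tag_ in ("WP", "WDT"):
            analysis["focus"] = tok.text
            if tok.head.pos_ == "NOUN":
                analysis["lat"] = tok.head.lemma_
    return analysis

print(analyze_question("Which American author wrote Catch-22?"))
# e.g. focus "Which", LAT "author"; entity output depends on the model
```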

    Tuesday, February 24, 2015

    toy data and question answering

    Some first impressions of a really interesting paper on AI and Question Answering: Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks (pdf)

    FWIW, I spent most of the time mis-reading the first author name as Watson, instead of Weston, assuming a false sense of irony :-)

    The basic idea is that in order to facilitate rapid development of artificial intelligence question answering applications, someone ought to create a set of standard data sets that are highly constrained as to what they test (gee, who might be the best people to create those? the authors maybe?).

    Basically, these are regression tests for reasoning. Each test contains a set of 2-5 short statements (= the corpus) and a set of 1-2 questions. The point is to expose the logical requirements for answering the questions given the way the answer is encoded in the set of statements.
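Concretely, you can think of each item as a tiny corpus plus gold answers, with evaluation as simple exact-match accuracy. A sketch of that idea follows; the data structure, example sentences, and dumb baseline are mine, in the spirit of the tasks rather than quoted from the paper:

```python
# A toy-task item in the spirit of the paper's format: a tiny "corpus" of
# statements plus questions with gold answers, scored by exact match.
from dataclasses import dataclass

@dataclass
class ToyTask:
    statements: list          # the 2-5 short sentences a system may read
    questions: list           # 1-2 (question, gold_answer) pairs

task = ToyTask(
    statements=["Mary went to the kitchen.", "Mary picked up the apple."],
    questions=[("Where is the apple?", "kitchen")],
)

def evaluate(system, tasks):
    """Exact-match accuracy over gold answers: a regression test for reasoning."""
    total = correct = 0
    for t in tasks:
        for question, gold in t.questions:
            total += 1
            correct += system(t.statements, question).strip().lower() == gold.lower()
    return correct / total

def last_location_baseline(statements, question):
    # A deliberately dumb baseline: answer with the last known location mentioned.
    known = {"kitchen", "garden", "office"}
    locations = [w.strip(".") for s in statements for w in s.split() if w.strip(".") in known]
    return locations[-1] if locations else ""

print(evaluate(last_location_baseline, [task]))   # 1.0 on this single item
```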

    They define 20 specialized tests. All of them are truly interesting as AI tests.
    Their two simplest tests: 3.1 is a fairly straightforward factoid question-answer pairing, but 3.2 requires that a system extract, store, and connect information across sentences before discovering the answer. Cool task. Non-trivial.

    3.17 is an even more challenging task, one that immediately reminded me of a textual Tarski's World.

    The authors admit that few of these tasks are immediately relevant to an industrial-scale question answering system like Watson, but argue that tasks like these focus research on identifying the skills needed to answer such questions:
    Our goal is to categorize different kinds of questions into skill sets, which become our tasks. Our hope is that the analysis of performance on these tasks will help expose weaknesses of current models and help motivate new algorithm designs that alleviate these weaknesses. We further envision this as a feedback loop where new tasks can then be designed in response, perhaps in an adversarial fashion, in order to break the new models.
    My Big Take-Aways:

    • Not realistic QA from large data.
    • They're testing artificial recombination of facts into new understanding, not NLP or information extraction per se. It's an AI paper after all, so ignoring contemporary approaches to information extraction is fine. But, can one assume that many of these answers are discoverable in large data collections using existing NLP techniques? Their core design encodes the answer in one very limited way. But industrial scale QA systems utilize large data sets where answers are typically repeated, often many many times, in many different forms in the corpora.
    • Their test data is so specific, I worried that it might encourage over-fitting of solutions: Solve THAT problem, not solve THE problem. 
    Oh, BTW, I loved reading this paper, in case that wasn't clear.

    Monday, September 15, 2014

    can you still do linguistics without math?

    A reader emailed me an interesting question that's worth giving a wider audience to:
    It nearly broke my heart to hear that maths may be a required thing in linguistics, maths has pulled me back from a few opportunities in the past before linguistics, I'd been interested in engineering, marine biology, etc. I was just wondering if there was any work around, anything that would help me with linguistics that didn't require maths. Just. any advice at all, for getting into the field of linguistics with something as troubling as dyscalculia.
    The reader makes a good point I hadn't thought about. I remember my phonetics teacher telling us that she often recruited students into linguistics by telling them that it's one of the few fields that teach non-mathematical data analytics. That was something that appealed to me.

    I'm not familiar with dyscalculia, so I can't speak to how it impacts the study of linguistics directly. But even linguists who don't perceive themselves as "doing math" often still are, in the form of complicated measurements and such, as in phonetics and psycholinguistics. Generally, though, I think there are still many opportunities to do non-mathematical linguistics, especially in fields like sociolinguistics, language policy, and language documentation. Let us not forget that the vast majority of the world's languages remain undocumented, so we need an army of linguists to work with speakers the world over to record, analyze, and describe the lexicons, grammars, and sound systems of those languages. We also need to better understand child language acquisition, slang, pragmatic inferences, and a host of other deeply important linguistic issues. Studying those topics still requires a lot of good old fashioned, non-mathematical linguistics skills.

    Unfortunately, those are woefully underpaid skills as well. One of the reasons math is taking over linguistics is simple economics: that's where the money is. Both the job market and the research grant market are trending heavily towards quantitative skills and tools, regardless of the discipline. That's just a fact we all have to deal with. I didn't go to grad school in order to work at IBM. That's just where the job is. I couldn't get hired at a university to save my life right now, but I can make twice what a professor makes at IBM. So here I am (don't get me wrong. I have the enviable position of getting paid well to work on real language problems, so I ain't complaining).

    Increasingly, the value of descriptive linguistic skills is in the creation of corpora that can be processed automatically with tools like AntConc and such. You can do a lot of corpus linguistics these days without explicit math because the software does much of the work for you. But you will still need to understand the underlying math concepts (like why true "keywords" are not simply frequency searches). For details, I can highly recommend Lancaster University's MOOC "Corpus linguistics: method, analysis, interpretation" (it's free and online right now).
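To make that parenthetical concrete: a keyword analysis compares a word's relative frequency in your study corpus against a reference corpus using a statistic such as Dunning's log-likelihood, rather than just ranking raw counts. A minimal sketch with invented counts (not output from AntConc or the MOOC):

```python
# A minimal sketch of "keyness": a word is key when it is unusually frequent
# in the study corpus relative to a reference corpus, not merely frequent.
# Here, Dunning's log-likelihood (G2).
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """G2 keyness score for one word; higher = more 'key'."""
    expected_study = size_study * (freq_study + freq_ref) / (size_study + size_ref)
    expected_ref = size_ref * (freq_study + freq_ref) / (size_study + size_ref)
    g2 = 0.0
    if freq_study:
        g2 += freq_study * math.log(freq_study / expected_study)
    if freq_ref:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * g2

# A very frequent word with the same relative frequency in both corpora is not key;
# a modestly frequent but overrepresented word scores much higher.
print(log_likelihood(7000, 100_000, 70_000, 1_000_000))   # ~0: same relative frequency
print(log_likelihood(200, 100_000, 200, 1_000_000))       # large: overrepresented in the study corpus
```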

    The real question is: what do you want to do with linguistics? Do you want to get a PhD and then become a professor? That's a tough road (and not just in linguistics; the academic marketplace is imploding due to funding issues). There aren't many universities that hire pure descriptive linguists anymore. Those jobs do exist, but they're rare. SUNY Buffalo, Oregon, and New Mexico are three US schools that come to mind as still having descriptive field linguists on faculty. But the list is short.

    If you want to teach a language, that's the most direct route to getting a job, but you'll need a TESOL certificate too, and frankly, those tend to be low-paid, part-time jobs. It's hard to build a secure career off of that.

    That leaves industry. There are industry jobs for non-quantitative linguists, but they're unpredictable. Marketing agencies occasionally hire linguists to do research on cross-linguistic brand names and such. Check out this old post for some examples.

    I hope this helps. I recommend asking this question over at The Linguist List too because I have my own biases. It's smart to get a wide variety of perspectives.

    Tuesday, September 2, 2014

    neural nets and question answering

    I just read A Neural Network for Factoid Question Answering by Iyyer et al. (presented at EMNLP 2014).

    I've been particularly keen on research about question answering NLP for a long time because my first-ever NLP gig was as a grad student intern at a now-defunct question answering start-up in 2000 (QA was all the rage during the '90s tech bubble). QA is somewhat special among NLP fields because it combines all of the others into a single, deeply complex pipeline.

    When I saw this paper Tweeted by Mat Kelcey, I was excited by the title, but after reading it, I suspect the constraints of their task make it not quite applicable to commercial QA applications.

    Here are some thoughts on the paper, but to be clear: these comments are my own and do not represent in any way those of my employer.

    What they did:
    Took question/answer pairs from a college Quiz Bowl game and trained a neural network to find answers to new questions. More to the point, "given a description of an entity, [they trained a neural net to] identify the person, place, or thing discussed".

    The downside:
    1. They used factoid questions from a game called Quiz Bowl
    2. Factoid questions assume small, easily identifiable answers (typically one word or maybe a short multi-word phrase)
    3. If you’re unfamiliar with the format of these quiz bowl games, you can play something similar at bars like Buffalo Wild Wings. You get a little device for inputting an answer and the questions are presented on TVs around the room. The *questions* are composed of 4-6 sentences, displayed one at a time. The faster you answer, the more points you get. The sentences in the question are hierarchically ordered in terms of information contained. The first sentence gives very little information away and is presented alone for maybe 5 seconds. If you can’t answer, the second sentence appears for 5 seconds giving a bit more detail. If you still can’t answer, the third sentence appears providing even more detail, but fewer points. And so on.
    4. Therefore, they had large *questions* composed of 4-6 sentences, providing more and more details about the answer. This amount of information is rare (though they report results of experimental guesses after just the first sentence, I believe they still used the entire *question* paragraph for training).
    5. They had fixed, known answer sets to train on. Plus (annotated) incorrect answers to train on.
    6. They whittled down their training and test data to a small set of QA pairs that *fit* their needs (no messy data) - "451 history answers and 595 literature answers that occur on average twelve times in the corpus".
    7. They could not handle multi-word named entities (so they manually pre-processed their corpus to convert these into single strings).
    The upside:

    1. Their use of dependency trees instead of bag o' words was nice. As a linguist, I want to see more sophisticated linguistic information used in NLP.
    2. They jointly learned answer and question representations in the same vector space rather than learning them separately because "most answers are themselves words (features) in other questions (e.g., a question on World War II might mention the Battle of the Bulge and vice versa). Thus, word vectors associated with such answers can be trained in the same vector space as question text enabling us to model relationships between answers instead of assuming incorrectly that all answers are independent." (A toy sketch of this shared-space idea appears after this list.)
    3. I found their error analysis in sections 5.2 "Where the Attribute Space Helps Answer Questions" and 5.3 "Where all Models Struggle" especially thought-provoking. More published research should include these kinds of sections.
    4. Footnote 7 is interesting: "We tried transforming Wikipedia sentences into quiz bowl sentences by replacing answer mentions with appropriate descriptors (e.g., "Joseph Heller" with "this author"), but the resulting sentences suffered from a variety of grammatical issues and did not help the final result." Yep, syntax. Find-and-replace not gonna cut it.
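Here is a toy sketch of that shared-space idea from upside #2: questions and answers embedded in one vector space, with answering done by nearest neighbor. This is my own illustration, not the paper's dependency-tree model; the vectors are hand-built stand-ins rather than anything learned.

```python
# Toy illustration of the shared-space idea: questions and answers live in one
# vector space, and a question is answered by the nearest answer vector.
import numpy as np

rng = np.random.default_rng(0)
dim = 50

word_vecs = {w: rng.normal(size=dim) for w in
             ["author", "novel", "wrote", "war", "battle", "catch-22", "bulge"]}

# In the paper these are learned jointly; here I fake "learned" answer vectors
# by placing each answer near the words of questions that mention it.
answer_vecs = {
    "Joseph Heller": word_vecs["author"] + word_vecs["novel"] + word_vecs["catch-22"],
    "Battle of the Bulge": word_vecs["war"] + word_vecs["battle"] + word_vecs["bulge"],
}

def embed(tokens):
    # A bag-of-words average; the paper uses dependency-tree composition instead.
    return np.mean([word_vecs.get(t, np.zeros(dim)) for t in tokens], axis=0)

def answer(tokens):
    q = embed(tokens)
    def score(v):
        return np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)
    return max(answer_vecs, key=lambda a: score(answer_vecs[a]))

print(answer(["this", "author", "wrote", "catch-22"]))   # should pick "Joseph Heller"
```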

    Friday, August 1, 2014

    for linguists, by linguists

    The Speculative Grammarian is at it again, offering a happy hour discount on an already ridiculously inexpensive book of linguistic fun: The Speculative Grammarian Essential Guide to Linguistics.
    Speculative Grammarian is the premier scholarly journal featuring research in the oft neglected field of satirical linguistics—and it is now available in book form!

    a sidelong look at all that is humorous about the field. Containing over 150 articles, poems, cartoons, humorous ads and book announcements—plus a generous sprinkling of quotes, proverbs and other witticisms—the book discovers things to laugh about in most major subfields of Linguistics. It pokes good-natured fun at linguists (famous or otherwise), linguistic theory, and many aspects of language. The authors and editors are linguists who love their field, but who at the same time love to celebrate the funny aspects of Linguistics. The book invites readers to laugh along.