Contextual Thesaurus FAQ

Q: What is this?

A: This is a prototype Contextual Thesaurus developed by Microsoft Research. Actually, it's quite a bit more than that: it's an English-to-English machine translation system that employs the same architecture that the Microsoft Translator uses when translating different languages. To the best of our knowledge, this is the first large-scale paraphrasing system anywhere.

Q: What do you mean by "Contextual Thesaurus"?

A: An ordinary thesaurus provides synonyms and near synonyms, usually only for single words, often without offering much information about when to use these terms. Try looking up the word "break" in a conventional thesaurus. Then look up "businesses are asking for tax breaks" in the Contextual Thesaurus. You will see the difference.

Q: How do I use it?

A: Type a short phrase into the input box. Then click the Submit button (the arrowhead in an orange circle) or hit the Enter key on your keyboard. The system accepts only one sentence at a time. Some suggestions:

· Limit your input to 4-8 words. The system is capable of generating paraphrases much longer than that, but results will usually be more varied and interesting if you type in fewer words rather than more. Even two or three words will sometimes be enough to retrieve a useful set of equivalents.

· Formal language works better than colloquial language. Because our training data consists mostly of documents in the business, government, or technology domains, the system performs better on input related to these domains than it does on song lyrics or first-person blog posts.

· Click one of the paraphrases to highlight the path through the graph taken by that sentence.

· If you click on a word in the graph, the top-ranked paraphrase containing that term will be highlighted.

· If you click the check mark beside a paraphrase, the text will be moved into the input box to be paraphrased in turn. This way you can round-trip your paraphrases to see more alternatives.
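
The click behavior and the length suggestion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's actual code: the function names are hypothetical, matching is simple case-insensitive whole-word matching, and the ranked list is made-up sample data rather than real output from the Contextual Thesaurus.

```python
from typing import List, Optional


def top_paraphrase_containing(word: str, ranked: List[str]) -> Optional[str]:
    """Return the highest-ranked paraphrase containing `word`, or None.

    `ranked` is assumed to be ordered best-first, like the list shown to the user.
    """
    target = word.lower()
    for paraphrase in ranked:
        # Whole-word, case-insensitive match (an assumption for this sketch).
        if target in paraphrase.lower().split():
            return paraphrase
    return None


def within_suggested_length(phrase: str, low: int = 4, high: int = 8) -> bool:
    """Check whether the input falls in the suggested 4-8 word range."""
    return low <= len(phrase.split()) <= high


# Made-up sample data for illustration only, ordered best-first.
ranked = [
    "companies are seeking tax relief",
    "businesses are requesting tax cuts",
    "firms are asking for tax relief",
]

if __name__ == "__main__":
    print(top_paraphrase_containing("relief", ranked))
    print(within_suggested_length("businesses are asking for tax breaks"))
```

Because the list is ordered best-first, returning the first match is enough to pick the top-ranked paraphrase for a clicked word.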

Q: When I type [favorite phrase] it doesn't show me [favorite paraphrase]. Why don't you have this obvious pair?

A: Our English-English translation model is learned from large amounts of text found on the web. The system may not find some perfectly good expressions that don't occur often enough in our data for them to surface. On the other hand, because we are using real data that reflects real usage, you probably won't see too many out-of-date expressions of the kind that you would find in a conventional thesaurus.

Q: It makes a lot of grammatical errors.

A: Yes, it does. The system has no knowledge of grammar, and the kinds of errors it produces are typical of machine translation systems. It doesn't do well on pronouns and function words, and tense and number often suffer badly. As we improve our algorithms, we expect grammatical quality to get better over time. In the meantime, non-native speakers of English might wish to use the system with caution.

Q: When I type in a long sentence, everything in the output seems pretty much similar. Why is this?

A: This is because of the way the algorithm selects what it thinks are the best options. Shorter phrases (4-8 words) generally produce results that are more varied.

Q: The first few suggestions seem OK, but there is a pile of real junk in there.

A: What you are seeing is the ranked output of the algorithm. Most translation systems don't show users what is happening under the hood. In general, the best suggestions will be found towards the top of the list. But there may still be gems to be found even among the lower ranked items.

Q: I've found an offensive result. Why does this happen? And who can I tell about it?

A: We do try to filter out the most obviously offensive terms. However, because much of our data has been scraped off the web, inappropriate material may occasionally slip through. In addition, the system can sometimes create inappropriate juxtapositions even when the input is innocuous. If you do find something inappropriate or offensive, please report it via the Feedback link, giving both the input and output so that we can address the issue.

Q: What is this good for?

A: We expect that the system will prove useful in many applications that need to recognize or generate semantically similar words and phrases. The following are just a few examples, in no special order: writing assistance, document simplification, document style adaptation, in-house style enforcement, grading of essays and short answers, language learning, plagiarism detection, steganography, document fingerprinting, summarizing and abstracting, question answering, conversational agents, interaction with game characters, search and information extraction and retrieval, search engine optimization, and command and control. (Contrary to rumor, we have not yet trained it to wash the dishes.)

Q: Is there an API?

A: We are preparing to make this available as an API, using this page to collect thoughts and feedback.

Q: If I paste a really large block of text three or four times into the input box, it hangs my browser.

A: Don't do that.



© 2011 Microsoft Corporation. All rights reserved.