!pip install -q gensim
import gensim.downloader
Problem 1: Word embeddings
(15 pts)
As discussed in Section 6.10 in Jurafsky & Martin, word embeddings can be used to solve
proportional analogies ("W is to X as Y is to ___"), which shows that they contain some kind of
relational knowledge.
To solve an analogy using the parallelogram method, we add the word vectors for X and Y and
subtract the word vector for W. Whatever word has a vector that's closest to the result is the
solution. To take the example in the book, for "apple is to tree as grape is to ____", compute the
vector:
answer = wv["grape"] + wv["tree"] - wv["apple"]
and then find the word whose vector is closest to this (by cosine distance):
wv.similar_by_vector(answer)
If the closest words are any of grape, tree, or apple, then we skip those and take the next closest
word.
To get started, download a set of word vectors:
print(gensim.downloader.info("glove-wiki-gigaword-50")['description'])
wv = gensim.downloader.load("glove-wiki-gigaword-50")
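The parallelogram method described above can be wrapped in a small helper. This is a sketch, not part of the assignment: the name `solve_analogy` and the `topn` cutoff are my own choices, and `wv` is any gensim `KeyedVectors`-style object loaded as shown.

```python
def solve_analogy(w, x, y, wv, topn=10):
    """Solve "w is to x as y is to ___" with the parallelogram method,
    skipping the three input words as the instructions require."""
    answer = wv[y] + wv[x] - wv[w]
    # similar_by_vector returns (word, cosine similarity) pairs,
    # most similar first
    for word, _score in wv.similar_by_vector(answer, topn=topn):
        if word not in (w, x, y):
            return word
    return None  # no candidate found within topn
```

With the GloVe vectors loaded above, `solve_analogy("apple", "tree", "grape", wv)` should reproduce the book's example.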
1. Show that vine is the solution to the analogy "apple is to tree as grape is to ____".
# <your code here>
2. Find another analogy that can be correctly solved using these word vectors:
# <your code here>
< your text here >
3. We can also use this method to find some kinds of antonyms. Show that unskilled is the
solution to the analogy "Good is to skilled as bad is to _____".
# <your code here>
<your text here>
4. Find another pair of antonyms using word vectors:
# <your code here>
< your text here >
5. There are some kinds of analogies that the model has trouble with. Show the results for the
analogies "Tennis is to Navratilova as baseball is to ____" and "Ruth is to baseball as tennis is to ____".
Why do you think the model comes up with these answers?
# <your code here>
< your text here >
6. Find another analogy that the model is not able to solve, and say what you think the problem
with it is:
# <your code here>
<your text here >
7. In other cases, the model "solves" analogies in ways that reflect social biases but not
necessarily objective reality. What is the solution (according to the model) for the analogy "Man
is to chef as woman is to ____"?
# <your code here>
<your text here>
8. Find another analogy that reflects possibly unwanted social bias, and say why you think the
model does what it does.
# <your code here>
<your text here>
Problem 2: Sentiment analysis
(10 pts)
In section 4.4, Jurafsky and Martin describe a simple method for improving sentiment analysis
by taking negation into account: "A very simple baseline that is commonly used in sentiment
analysis to deal with negation is the following: during text normalization, prepend the prefix
NOT to every word after a token of logical negation (n't, not, no, never) until the next
punctuation mark. Thus the phrase didn't like this movie, but I becomes didn't
NOT_like NOT_this NOT_movie, but I. Newly formed 'words' like NOT_like or NOT_recommend will probably occur more often in negative documents and act as cues for
negative sentiment, while words like NOT_bored or NOT_dismiss will acquire positive
associations." (p. 65)
For this problem, write a function negify(text) that implements this trick: it should take a
text string as input and return the text with added NOT_s as output:
>>> negify("didn't like this movie, but I")
"didn't NOT_like NOT_this NOT_movie, but I"
You may use either pyfoma or Python's regular expression tools. You may also find it helpful to
use the split() function (which converts a string to a list of strings by cutting it at spaces) and
the join() function (which converts a list of strings into a single string by concatenating them
with a space as a separator).
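One possible shape for negify, sketched with split() and Python's re module (the particular regular expressions and the PUNCT/NEGATOR names are my own assumptions; the punctuation set could be extended):

```python
import re

PUNCT = r"[.,!?;:]"  # marks that end a negation scope
NEGATOR = re.compile(r".*n't$|^(not|no|never)$")  # tokens that open one

def negify(text):
    """Prepend NOT_ to every word after a negation token (n't, not, no,
    never) until the next punctuation mark."""
    out = []
    negating = False
    for word in text.split():
        out.append("NOT_" + word if negating else word)
        # a negation token opens the scope ...
        if NEGATOR.match(word.lower().strip(".,!?;:")):
            negating = True
        # ... and any punctuation mark closes it
        if re.search(PUNCT, word):
            negating = False
    return " ".join(out)
```

On the book's example, `negify("didn't like this movie, but I")` returns `"didn't NOT_like NOT_this NOT_movie, but I"`.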
Problem 3: Authorship identification
(25 pts)
Authorship identification is the problem of determining the writer or speaker of a given text
based on the analysis of its linguistic patterns, stylistic features, and content characteristics. It is
a field of study that combines computational linguistics, natural language processing, and
forensic linguistics.
One simple technique for authorship identification uses unigram frequencies of unusual words
as a marker for the writing style of individual authors. For authors A and B, find:
P(r_1, ..., r_n | A) = P(r_1 | A) × ... × P(r_n | A)
P(r_1, ..., r_n | B) = P(r_1 | B) × ... × P(r_n | B)
where r_1, ..., r_n are the words that occur only once each in the disputed text. If
P(r_1, ..., r_n | A) > P(r_1, ..., r_n | B), then guess A wrote the text. Otherwise, guess B.
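The products above are most safely computed in log space, since multiplying many small probabilities underflows. A sketch, assuming add-one smoothing and helper names (unigram_model, log_score, hapaxes) that are my own, not prescribed by the assignment:

```python
import math
from collections import Counter

def unigram_model(words, vocab_size=50_000):
    """Return a function computing P(word | author) with add-one smoothing.
    vocab_size is an assumed smoothing constant, not given in the assignment."""
    counts = Counter(words)
    total = len(words)
    return lambda w: (counts[w] + 1) / (total + vocab_size)

def log_score(rare_words, model):
    """log P(r_1, ..., r_n | author) as a sum of log unigram probabilities."""
    return sum(math.log(model(w)) for w in rare_words)

def hapaxes(text):
    """The words r_1, ..., r_n occurring exactly once in the disputed text."""
    return [w for w, c in Counter(text).items() if c == 1]
```

Comparing log_score(hapaxes(text), model_a) against the same score under model_b gives the decision rule above, since log is monotonic.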
For this problem, write a function that can distinguish texts written by Jane Austen from texts by
G.K. Chesterton. Follow these steps:
1. Construct unigram language models for the two authors using the texts in austen.txt
and chesterton.txt.
2. Write a function called austen_or_chesterton(text) that takes a text (as a list of
words) and returns either "austen" or "chesterton".
3. Evaluate your authorship identifier on the text samples in disputed.csv using:
import csv

correct = 0
total = 0
for item in csv.DictReader(open('disputed.csv')):
    guess = austen_or_chesterton(item['text'].split())
    if guess == item['author']:
        correct += 1
    total += 1
print(f'{correct / total * 100:.2f}% correct guesses')
4. How well does your system work? How does its accuracy compare to a baseline of
always guessing "austen"?
5. What are some potential weaknesses of this unigram-based method, and how could you
improve on it?
<your code here>
< your text here >