    !pip install -q gensim
    import gensim.downloader

Problem 1: Word embeddings (15 pts)

As discussed in Section 6.10 of Jurafsky & Martin, word embeddings can be used to solve proportional analogies ("W is to X as Y is to ___"), which shows that they contain some kind of relational knowledge. To solve an analogy using the parallelogram method, we add the word vectors for X and Y and subtract the word vector for W. Whatever word has a vector closest to the result is the solution. To take the example in the book, for "apple is to tree as grape is to ___", compute the vector:

    answer = wv["grape"] + wv["tree"] - wv["apple"]

and then find the word whose vector is closest to this (by cosine distance):

    wv.similar_by_vector(answer)

If the closest words are any of grape, tree, or apple, then we skip those and take the next closest word.

To get started, download a set of word vectors:

    print(gensim.downloader.info("glove-wiki-gigaword-50")['description'])
    wv = gensim.downloader.load("glove-wiki-gigaword-50")

1. Show that vine is the solution to the analogy "apple is to tree as grape is to ___".

    # <your code here>

2. Find another analogy that can be correctly solved using these word vectors:

    # <your code here>

<your text here>

3. We can also use this method to find some kinds of antonyms. Show that unskilled is the solution to the analogy "Good is to skilled as bad is to ___".

    # <your code here>

<your text here>

4. Find another pair of antonyms using word vectors:

    # <your code here>

<your text here>

5. There are some kinds of analogies that the model has trouble with. Show the results for the analogies "Tennis is to Navratilova as baseball is to ___" and "Ruth is to baseball as tennis is to ___". Why do you think the model comes up with these answers?

    # <your code here>

<your text here>

6. Find another analogy that the model is not able to solve, and say what you think the problem with it is:

    # <your code here>

<your text here>

7. In other cases, the model "solves" analogies in ways that reflect social biases but not necessarily objective reality. What is the solution (according to the model) for the analogy "Man is to chef as woman is to ___"?

    # <your code here>

<your text here>

8. Find another analogy that reflects possibly unwanted social bias, and say why you think the model does what it does.

    # <your code here>

<your text here>

Problem 2: Sentiment analysis (10 pts)

In Section 4.4, Jurafsky and Martin describe a simple method for improving sentiment analysis by taking negation into account:

"A very simple baseline that is commonly used in sentiment analysis to deal with negation is the following: during text normalization, prepend the prefix NOT to every word after a token of logical negation (n't, not, no, never) until the next punctuation mark. Thus the phrase didn't like this movie, but I becomes didn't NOT_like NOT_this NOT_movie, but I. Newly formed 'words' like NOT_like or NOT_recommend will probably occur more often in negative documents and act as cues for negative sentiment, while words like NOT_bored or NOT_dismiss will acquire positive associations." (p. 65)

For this problem, write a function negify(text) that implements this trick: it should take a text string as input and return the text with added NOT_s as output:

    >>> negify("didn't like this movie, but I")
    "didn't NOT_like NOT_this NOT_movie, but I"

You may use either pyfoma or Python's regular expression tools. You may also find it helpful to use the split() function (which converts a string to a list of strings by cutting it at spaces) and the join() function (which converts a list of strings into a single string by concatenating them with a space as a separator). A sketch of one possible approach follows.
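For illustration, here is a minimal sketch of one way negify might be written using split(), join(), and the re module. The punctuation set [.,;:!?] is an assumption of the sketch; the textbook only says "the next punctuation mark":

    import re

    # Assumed patterns: a token ends in n't or is exactly not/no/never
    # (negation), and any of . , ; : ! ? ends the negation span.
    NEGATION = re.compile(r"n't$|^(not|no|never)$", re.IGNORECASE)
    PUNCT = re.compile(r"[.,;:!?]")

    def negify(text):
        """Prepend NOT_ to every word after a logical negation token
        until the next punctuation mark."""
        out = []
        negating = False
        for token in text.split():
            out.append("NOT_" + token if negating else token)
            if NEGATION.search(token):
                negating = True    # a negation token opens a span
            elif PUNCT.search(token):
                negating = False   # a punctuation mark closes it
        return " ".join(out)

This reproduces the textbook example:

    >>> negify("didn't like this movie, but I")
    "didn't NOT_like NOT_this NOT_movie, but I"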
Problem 3: Authorship identification (25 pts)

Authorship identification is the problem of determining the writer or speaker of a given text based on the analysis of its linguistic patterns, stylistic features, and content characteristics. It is a field of study that combines computational linguistics, natural language processing, and forensic linguistics.

One simple technique for authorship identification uses unigram frequencies of unusual words as a marker for the writing style of individual authors. For authors A and B, find:

    P(r_1, \ldots, r_n \mid A) = \prod_{i=1}^{n} P(r_i \mid A) = P(r_1 \mid A) \times \cdots \times P(r_n \mid A)

    P(r_1, \ldots, r_n \mid B) = \prod_{i=1}^{n} P(r_i \mid B) = P(r_1 \mid B) \times \cdots \times P(r_n \mid B)

where r_1, ..., r_n are the words that occur only once each in the disputed text. If P(r_1, ..., r_n | A) > P(r_1, ..., r_n | B), then guess that A wrote the text. Otherwise, guess B.

For this problem, write a function that can distinguish texts written by Jane Austen from texts by G.K. Chesterton. Follow these steps:

1. Construct unigram language models for the two authors using the texts in austen.txt and chesterton.txt.

2. Write a function called austen_or_chesterton(text) that takes a text (as a list of words) and returns either "austen" or "chesterton".

3. Evaluate your authorship identifier on the text samples in disputed.csv using:

    import csv

    correct = 0
    total = 0
    for item in csv.DictReader(open('disputed.csv')):
        guess = austen_or_chesterton(item['text'].split())
        if guess == item['author']:
            correct += 1
        total += 1
    print(f'{correct / total * 100:.2f}% correct guesses')

Then answer two questions:

1. How well does your system work? How does its accuracy compare to a baseline of always guessing "austen"?

2. What are some potential weaknesses of this unigram-based method, and how could you improve on it?

One possible starting point is sketched after the problem.

<your code here>

<your text here>
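To make the method concrete, here is a minimal sketch of the unigram approach, not a complete solution. The add-one (Laplace) smoothing and the lowercasing are assumptions of the sketch, since the problem statement does not say how to handle unseen words or case, and it sums log-probabilities rather than multiplying raw probabilities to avoid numerical underflow:

    import math
    from collections import Counter

    def unigram_model(path):
        """Word counts and total token count for one author's training text."""
        words = open(path, encoding='utf-8').read().lower().split()
        return Counter(words), len(words)

    austen_counts, austen_total = unigram_model('austen.txt')
    chesterton_counts, chesterton_total = unigram_model('chesterton.txt')

    def log_prob(word, counts, total):
        # Add-one smoothing (an assumption): unseen words get a small
        # nonzero probability instead of zeroing out the whole product.
        return math.log((counts[word] + 1) / (total + len(counts)))

    def austen_or_chesterton(text):
        # r_1, ..., r_n: the words occurring exactly once in the disputed text
        once = [w for w, c in Counter(w.lower() for w in text).items() if c == 1]
        score_a = sum(log_prob(w, austen_counts, austen_total) for w in once)
        score_c = sum(log_prob(w, chesterton_counts, chesterton_total) for w in once)
        return 'austen' if score_a > score_c else 'chesterton'

Because log is monotonic, comparing summed log-probabilities yields the same decision as comparing the products in the formulas above.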