Question

You have been provided with a starter notebook that reads a collection of tweets and a collection of news articles about one particular company. Your goals are:

1. Identify the name of this company by looking at the entity distributions across both the tweets and the news articles.
   o While analyzing news articles, extract entities separately from titles and texts.
2. Identify which other companies are most frequently mentioned along with your primary company.
   o That is, which companies most frequently appear within the same document (tweet or news article) as your primary company.
3. Identify the most frequent locations of events by extracting the appropriate named entities.
   o Locations may include countries, states, cities, regions, etc.

In order to complete this analysis:
• Discard non-English results.
• Apply appropriate text cleaning methods.
• Within your Jupyter notebook:
   o Show a table or chart with your top-20 companies, sorted in descending order.
   o You are welcome to use separate tables for the titles and the texts of the news articles.
• Use a couple of different NER packages and options (e.g., both NLTK and spaCy, each with and without sentence segmentation), so that you can evaluate which model gives you the best results. A sketch of one possible pipeline is shown below.
   o Your top-20 list should be based only on the most accurate results from the best-performing NER package.

Rules and requirements:
• Your final output and code should be contained within a Jupyter Notebook (ipynb): NLP Assignment 5 Starter.ipynb
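A minimal sketch of how the three goals could be approached with spaCy, assuming the starter notebook has already loaded the data into plain lists of strings (the variable name `tweets` below is a placeholder, not part of the actual starter notebook; the same functions would be applied to the article titles and texts):

```python
from collections import Counter

import spacy
from langdetect import detect, LangDetectException  # pip install langdetect

# Small English pipeline; install with:
#   pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


def is_english(text):
    """Return True only for documents langdetect identifies as English."""
    try:
        return detect(text) == "en"
    except LangDetectException:
        return False


def entities_per_doc(docs, labels):
    """Yield the set of entity strings with the given labels for each English document."""
    for doc in nlp.pipe(d for d in docs if is_english(d)):
        yield {ent.text.strip() for ent in doc.ents if ent.label_ in labels}


# Goal 1: overall ORG distribution -- the most frequent ORG is the primary company.
org_counts = Counter()
org_sets = list(entities_per_doc(tweets, {"ORG"}))  # `tweets` is a placeholder list of strings
for ents in org_sets:
    org_counts.update(ents)
primary_company = org_counts.most_common(1)[0][0]

# Goal 2: companies co-mentioned in the same document as the primary company.
co_mentions = Counter()
for ents in org_sets:
    if primary_company in ents:
        co_mentions.update(ents - {primary_company})

# Goal 3: most frequent locations (countries, states, cities, regions).
loc_counts = Counter()
for ents in entities_per_doc(tweets, {"GPE", "LOC"}):
    loc_counts.update(ents)

print(co_mentions.most_common(20))
print(loc_counts.most_common(20))
```

For the NLTK comparison, the same counting can be run over the output of `nltk.ne_chunk` applied to POS-tagged tokens (via `nltk.sent_tokenize`, `nltk.word_tokenize`, and `nltk.pos_tag`), counting chunks labeled ORGANIZATION and GPE; whichever package produces the cleaner top-20 list is the one whose results should be reported.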
