Question
You have been provided with a starter notebook that reads a collection of tweets and a collection of news articles about one particular company. Your
goals are:
1. Identify the name of this company by looking at the entity distributions across both tweets and news articles
o While analyzing news articles, extract separate entities from titles and texts
2. Identify which other companies are most frequently mentioned along with your primary company
o That is, which companies are most frequently mentioned within the same document (tweet or news article) as your primary company
3. Identify the most frequent locations of events by extracting the appropriate named entities
o Locations may include countries, states, cities, regions, etc.
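The co-mention count in goal 2 can be sketched as follows. This is only an illustration under stated assumptions: the function name `co_mentioned_orgs`, the sample company names, and the input shape (one list of already-extracted organization strings per document, as you might collect from whichever NER package you use) are all hypothetical and not part of the starter notebook.

```python
from collections import Counter


def co_mentioned_orgs(docs_entities, primary):
    """Count organizations that share a document with `primary`.

    docs_entities: iterable of lists of organization names, one list per
    document (tweet or news article). Hypothetical input shape.
    """
    primary_key = primary.casefold()
    counts = Counter()
    for ents in docs_entities:
        # Case-insensitive de-duplication, so each document contributes
        # at most one count per organization.
        unique = {e.strip().casefold() for e in ents if e.strip()}
        if primary_key not in unique:
            continue  # the primary company is not mentioned in this document
        for name in unique - {primary_key}:
            counts[name] += 1
    return counts


# Made-up example documents, each reduced to its organization entities.
docs = [
    ["Acme Corp", "Globex", "Acme Corp"],
    ["Globex", "Initech"],
    ["acme corp", "Initech", "Globex"],
]
top = co_mentioned_orgs(docs, "Acme Corp").most_common(20)
```

Counting each organization once per document keeps a single long article from dominating the ranking. The same function could be applied to location entities by passing those lists instead.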
In order to complete this analysis:
• Discard non-English results
• Apply appropriate text cleaning methods
• Within your Jupyter notebook:
o Show a table or chart with your top-20 companies (sorted in descending order)
o You are welcome to use separate tables for titles and texts of the news articles
• Use a couple of different NER packages and options (e.g. both NLTK and spaCy, each with and without sentence segmentation). This way you can
evaluate which model gives you the best results
o Your top-20 list should be based only on your most accurate results from the best-performing NER package
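As one possible starting point for the text cleaning step, the sketch below uses only the Python standard library. The function name `clean_text` and the specific rules (dropping URLs, retweet prefixes, and @mentions, and keeping hashtag words) are assumptions chosen for illustration; the assignment does not prescribe them. Capitalization is deliberately left untouched, since NER models typically rely on it.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
RETWEET_RE = re.compile(r"^RT\s+@\w+:?\s*")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Apply light, NER-friendly cleaning to a tweet or article snippet."""
    text = html.unescape(text)          # "&amp;" -> "&"
    text = URL_RE.sub(" ", text)        # drop links
    text = RETWEET_RE.sub("", text)     # drop a leading "RT @user:" prefix
    text = MENTION_RE.sub(" ", text)    # drop remaining @mentions
    text = HASHTAG_RE.sub(r"\1", text)  # keep the hashtag word, drop "#"
    return SPACE_RE.sub(" ", text).strip()


# Made-up sample tweet.
sample = "RT @news: Acme &amp; Globex announce #merger https://t.co/abc123  today"
cleaned = clean_text(sample)
```

Discarding non-English documents would need a language-identification step in addition to this. That step is not shown here because it depends on which library you choose.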
Rules and requirements:
• Your final output and code should be contained within a Jupyter Notebook (.ipynb)
NLP Assignment 5 Starter.ipynb