Science tweets sentiment analysis: using Stanford CoreNLP to train Google AutoML
I have been wanting to play with the Google AutoML tools, so I decided to write a quick article on how to use the Stanford CoreNLP library to produce labeled training data for Google AutoML. I took a simple example: parsing tweets about science and passing them through the two libraries. Here is a summary of this work.
Note: All content of the work can be found on my GitHub, here: https://github.com/paulvid/corenlp-to-auto-ml
Getting data is easy with NiFi. I created a flow that grabs tweets from the Science Filter Endpoint, as depicted below:
As you can see, I then use PutFile and PutGCSObject to write the tweet messages to my computer and to a GCS bucket (for historical purposes), respectively.
Running CoreNLP with Python
To run CoreNLP, you will need to:
- Install the Python library:
pip install stanfordcorenlp
- Download the latest CoreNLP release: https://stanfordnlp.github.io/CoreNLP/download.html
- Run your program
The super simple code I used below creates two types of files for our AutoML data set:
- A new file for each sentence in a tweet
- A list.csv file containing the name of each file above along with the sentiment value of its sentence
# Import libraries
import json
import os
import re

from stanfordcorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP(r'/Users/paulvidal/Documents/Nerd/NLP/stanford-corenlp-full-2018-10-05')

# Init variables
veryNegativeResults = 0
negativeResults = 0
neutralResults = 0
positiveResults = 0
veryPositiveResults = 0
invalidResults = 0

# Loop through files and get the tweet
rootdir = '/Users/paulvidal/Documents/Nerd/NLP/tweets'
outputDirectory = '/Users/paulvidal/Documents/Nerd/NLP/results'
csvF = open(os.path.join(outputDirectory, 'list.csv'), "w+")
sentenceNumber = 1
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        with open(os.path.join(subdir, file)) as f:
            rawTweet = f.read()
        for k in rawTweet.split("\n"):
            # Keep only letters and digits; everything else becomes a space
            sentence = re.sub(r"[^a-zA-Z0-9]+", ' ', k)
            # print(sentence)
            # The arguments after `sentence` were cut off in the source listing.
            # The properties below are an assumption based on how the result is
            # used afterwards (JSON output with a per-sentence sentiment).
            jsonSentiment = json.loads(nlp.annotate(
                sentence,
                properties={'annotators': 'sentiment', 'outputFormat': 'json'}))
            for s in jsonSentiment["sentences"]:
                # Rebuild the text of this sentence from its tokens
                currentSentence = ''
                for y in s["tokens"]:
                    currentSentence = currentSentence + ' ' + y["originalText"]
                # Tally the sentiment label
                if s["sentiment"] == "Verynegative":
                    veryNegativeResults = veryNegativeResults + 1
                elif s["sentiment"] == "Negative":
                    negativeResults = negativeResults + 1
                elif s["sentiment"] == "Neutral":
                    neutralResults = neutralResults + 1
                elif s["sentiment"] == "Positive":
                    positiveResults = positiveResults + 1
                elif s["sentiment"] == "Verypositive":
                    veryPositiveResults = veryPositiveResults + 1
                else:
                    invalidResults = invalidResults + 1
                # One text file per sentence, plus a matching row in list.csv
                currentFilename = 'analyzed-sentence-' + str(sentenceNumber) + '.txt'
                with open(os.path.join(outputDirectory, currentFilename), "w+") as sentenceFile:
                    sentenceFile.write(currentSentence)
                csvF.write(currentFilename + ',' + s["sentiment"] + '\n')
                sentenceNumber = sentenceNumber + 1
csvF.close()

print("Very Negative Results:", veryNegativeResults)
print("Negative Results:", negativeResults)
print("Neutral Results:", neutralResults)
print("Positive Results:", positiveResults)
print("Very Positive Results:", veryPositiveResults)
print("Invalid Results:", invalidResults)

# Do not forget to close! The backend server will consume a lot of memory.
nlp.close()
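If you want to check the tallying logic without starting the CoreNLP server, here is a minimal sketch that runs the same kind of loop over a hand-written response. The `sample_response` dictionary is made up for illustration (it only mimics the `sentences`, `tokens`, `originalText` and `sentiment` keys the script reads), and `summarize` and `KNOWN_LABELS` are hypothetical names, not part of any library.

```python
from collections import Counter

# Hand-written stand-in for one parsed CoreNLP response (NOT real output).
sample_response = {
    "sentences": [
        {"sentiment": "Negative",
         "tokens": [{"originalText": "Science"},
                    {"originalText": "is"},
                    {"originalText": "hard"}]},
        {"sentiment": "Positive",
         "tokens": [{"originalText": "I"},
                    {"originalText": "love"},
                    {"originalText": "it"}]},
    ]
}

# The five labels the script checks for; anything else counts as invalid.
KNOWN_LABELS = {"Verynegative", "Negative", "Neutral", "Positive", "Verypositive"}


def summarize(response):
    """Return (sentence_text, label) pairs and a per-label tally."""
    rows = []
    tally = Counter()
    for s in response["sentences"]:
        # Join the tokens back into one sentence string
        text = " ".join(t["originalText"] for t in s["tokens"])
        label = s["sentiment"] if s["sentiment"] in KNOWN_LABELS else "Invalid"
        rows.append((text, label))
        tally[label] += 1
    return rows, tally


rows, tally = summarize(sample_response)
print(rows)
print(dict(tally))
```

A `Counter` replaces the six separate counter variables here; that is only a convenience, the counting itself is the same.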
Running through about 1300 tweets, here are the results I get per sentence:
Very Negative Results: 3
Negative Results: 1019
Neutral Results: 638
Positive Results: 110
Very Positive Results: 1
Invalid Results: 0
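To get a feel for how unbalanced these labels are, the counts can be turned into percentages. This is just arithmetic on the numbers above; the variable names are my own.

```python
# Per-sentence counts copied from the results above
counts = {
    "Verynegative": 3,
    "Negative": 1019,
    "Neutral": 638,
    "Positive": 110,
    "Verypositive": 1,
}

total = sum(counts.values())  # total number of analyzed sentences

# Share of each label, as a percentage rounded to one decimal
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print("Total sentences:", total)
for label, share in shares.items():
    print(label, str(share) + "%")
```

Negative and Neutral together account for the vast majority of the sentences, while the two extreme labels are almost empty, which matters for the training step further down.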
Using the data set to train Google AutoML
Step 0: Create a project and add AutoML Natural Language to it.
This should be fairly straightforward, and there is a lot of documentation out there, so I'm not going to spend time on it. Refer to the Google Cloud tutorials; they are easy to follow.
Step 1: Upload the output of your Python program to your AutoML bucket
Using the storage interface or the gsutil command line, upload all the files you created and the list.csv as depicted below:
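One thing worth checking before importing: my list.csv only contains bare file names. If the AutoML import expects a full gs:// URI in the first column (this is an assumption on my part, so verify it against the current AutoML Natural Language documentation), a small sketch like the one below could rewrite the rows. The function name `to_gcs_rows`, the bucket `my-automl-bucket` and the `tweets` prefix are all hypothetical.

```python
import csv
import io


def to_gcs_rows(list_csv_text, bucket, prefix=""):
    """Rewrite 'file,label' rows so the first column is a gs:// URI."""
    clean_prefix = prefix.strip("/")
    base = "gs://" + bucket + "/" + (clean_prefix + "/" if clean_prefix else "")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(list_csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        filename, label = row
        writer.writerow([base + filename, label])
    return out.getvalue()


# Two rows in the same shape the script writes to list.csv
sample = "analyzed-sentence-1.txt,Negative\nanalyzed-sentence-2.txt,Neutral\n"
converted = to_gcs_rows(sample, "my-automl-bucket", "tweets")
print(converted)
```

The function works on text rather than on files so it can be tried without touching the bucket; reading list.csv and writing the result back is left out on purpose.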
Step 2: Add a data set to your AutoML
Create a data set using the list.csv you just uploaded:
Step 3: Train Model
The data set will upload and you should see your sentences labeled according to the CoreNLP classification.
Note that you will most likely have to remove the Verynegative and Verypositive labels, as each of them might contain fewer than 10 items.
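Instead of removing labels by hand in the interface, the rare ones could also be filtered out of list.csv before uploading. Here is a minimal sketch, assuming rows of (file name, label) and the 10-item threshold mentioned above; `drop_rare_labels` and `MIN_ITEMS_PER_LABEL` are names I made up, and the rows are toy data.

```python
from collections import Counter

# Threshold taken from the note above; check the current AutoML limits.
MIN_ITEMS_PER_LABEL = 10


def drop_rare_labels(rows, min_items=MIN_ITEMS_PER_LABEL):
    """Keep only (filename, label) rows whose label appears at least min_items times."""
    label_counts = Counter(label for _, label in rows)
    return [(name, label) for name, label in rows
            if label_counts[label] >= min_items]


# Toy data: 12 Negative rows and a single Verypositive row
rows = [("analyzed-sentence-%d.txt" % i, "Negative") for i in range(1, 13)]
rows.append(("analyzed-sentence-13.txt", "Verypositive"))

kept = drop_rare_labels(rows)
print(len(kept), "rows kept out of", len(rows))
```

Dropping the rows entirely is one option; merging Verynegative into Negative and Verypositive into Positive would be another, at the cost of changing what the labels mean.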
You can now use the AutoML interface to train your model:
Pretty straightforward, all things considered. I haven't run the model training yet, as I need to think about compute costs, but this is pretty cool!