paul vidal - pragmatic big data nerd


Science tweets sentiment analysis: using Stanford CoreNLP to train Google AutoML

by paul

I have been wanting to play with the Google AutoML tools, so I decided to write a quick article on how to use the Stanford CoreNLP library to train Google AutoML. I took a simple example: parsing tweets that talk about science and passing them through the two libraries. Here is a summary of this work.

Note: All content of the work can be found on my GitHub, here:

Getting Data

Getting data is easy with NiFi. I created a flow that grabs tweets from the Science Filter Endpoint, as depicted below:

As you can see, I then use PutFile and PutGCSObject to store the tweet messages on my computer and in a GCS bucket, respectively (the latter for historical purposes).

Running CoreNLP with Python

To run CoreNLP you will need to:

The super simple code I used below creates two types of files for our AutoML data set:

  • A new file for each sentence in a tweet
  • A CSV file (list.csv) listing the name of each file above along with the sentiment value of its sentence
# Import libraries
from stanfordcorenlp import StanfordCoreNLP
import json
import os
import re

nlp = StanfordCoreNLP(r'/Users/paulvidal/Documents/Nerd/NLP/stanford-corenlp-full-2018-10-05')

# Init variables
veryNegativeResults = 0
negativeResults = 0
neutralResults = 0
positiveResults = 0
veryPositiveResults = 0
invalidResults = 0

# Loop through files and get the tweets

rootdir = '/Users/paulvidal/Documents/Nerd/NLP/tweets'
outputDirectory = '/Users/paulvidal/Documents/Nerd/NLP/results'
csvF = open(os.path.join(outputDirectory, 'list.csv'), "w+")
sentenceNumber = 1

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        # Read the raw tweet text
        with open(os.path.join(subdir, file)) as f:
            rawTweet = f.read()
        for k in rawTweet.split("\n"):
            # Replace anything that is not alphanumeric with a space
            sentence = re.sub(r"[^a-zA-Z0-9]+", ' ', k)
            # print(sentence)
            # Sentiment
            jsonSentiment = json.loads(nlp.annotate(sentence,
                properties={
                    'annotators': 'sentiment',
                    'outputFormat': 'json'
                }))

            for s in jsonSentiment["sentences"]:
                # Rebuild the sentence from its tokens (reset for every sentence)
                currentSentence = ''
                for y in s["tokens"]:
                    currentSentence = currentSentence + ' ' + y["originalText"]
                if s["sentiment"] == "Verynegative":
                    veryNegativeResults = veryNegativeResults + 1
                elif s["sentiment"] == "Negative":
                    negativeResults = negativeResults + 1
                elif s["sentiment"] == "Neutral":
                    neutralResults = neutralResults + 1
                elif s["sentiment"] == "Positive":
                    positiveResults = positiveResults + 1
                elif s["sentiment"] == "Verypositive":
                    veryPositiveResults = veryPositiveResults + 1
                else:
                    invalidResults = invalidResults + 1
                # One file per sentence, plus one line in list.csv with its label
                with open(os.path.join(outputDirectory, 'analyzed-sentence-' + str(sentenceNumber) + '.txt'), "w+") as currentFilename:
                    currentFilename.write(currentSentence)

                csvF.write('analyzed-sentence-' + str(sentenceNumber) + '.txt' + ',' + s["sentiment"] + '\n')
                sentenceNumber = sentenceNumber + 1

csvF.close()

print("Very Negative Results:", veryNegativeResults)
print("Negative Results:", negativeResults)
print("Neutral Results:", neutralResults)
print("Positive Results:", positiveResults)
print("Very Positive Results:", veryPositiveResults)
print("Invalid Results:", invalidResults)
# Do not forget to close! The backend server will consume a lot of memory.
nlp.close()
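
The per-sentence logic of the script can also be tried in isolation, without a running CoreNLP server. This is only a minimal sketch, not the code used for this article: the helper name `tally_sentiments` and the sample annotation are made up for illustration, and the only assumption about the JSON is the shape the script already relies on (`sentences`, each with a `sentiment` and a list of `tokens` carrying `originalText`).

```python
from collections import Counter

# The five labels checked by the script; anything else is counted as invalid.
KNOWN_LABELS = {"Verynegative", "Negative", "Neutral", "Positive", "Verypositive"}


def tally_sentiments(annotation, counts):
    """Return (sentence_text, label) pairs and update `counts` in place.

    `annotation` is the dict obtained by calling json.loads() on the
    annotator output.
    """
    rows = []
    for s in annotation.get("sentences", []):
        # Rebuild the sentence from its tokens, one sentence at a time
        text = " ".join(tok["originalText"] for tok in s["tokens"])
        label = s["sentiment"]
        counts[label if label in KNOWN_LABELS else "Invalid"] += 1
        rows.append((text, label))
    return rows


# Hand-written sample in the same shape, for illustration only.
sample = {
    "sentences": [
        {"sentiment": "Negative",
         "tokens": [{"originalText": "Science"}, {"originalText": "is"},
                    {"originalText": "hard"}]},
        {"sentiment": "Positive",
         "tokens": [{"originalText": "I"}, {"originalText": "love"},
                    {"originalText": "it"}]},
    ]
}

counts = Counter()
rows = tally_sentiments(sample, counts)
print(rows)
print(dict(counts))
```

Using a `Counter` instead of six separate variables keeps the tally in one place; each returned pair maps directly to one sentence file and one line of list.csv.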

Running through about 1300 tweets, here are the results I get per sentence:

 Very Negative Results: 3
Negative Results: 1019
Neutral Results: 638
Positive Results: 110
Very Positive Results: 1
Invalid Results: 0
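
To get a feel for how skewed this data set is, the counts above can be turned into shares. A small sketch using only the numbers printed above (the variable names are just for illustration):

```python
# Per-sentence counts reported above.
results = {
    "Verynegative": 3,
    "Negative": 1019,
    "Neutral": 638,
    "Positive": 110,
    "Verypositive": 1,
}

total = sum(results.values())
# Share of each label, as a percentage of all analyzed sentences
shares = {label: 100.0 * n / total for label, n in results.items()}

print("Total sentences:", total)
for label, pct in shares.items():
    print("{:<13} {:6.2f}%".format(label, pct))
```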

Using the data set to train Google AutoML

Step 0: Create a project and add AutoML Natural Language to it.

This should be fairly straightforward and there is a lot of documentation out there, so I’m not going to spend time on this. Refer to the Google Cloud tutorials; they are easy to follow.

Step 1: Upload the output of your python program to your AutoML bucket

Using the storage interface or the gsutil command line, upload all the files you created, along with list.csv, as depicted below:
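
The upload can also be scripted. The sketch below is one possible way to do it, not necessarily what was done for this article: it assumes the `google-cloud-storage` Python client is installed and authenticated, and the bucket name in the usage comment is a placeholder. The hypothetical helper `upload_directory` takes the bucket object as an argument, so the file-walking logic can be exercised without touching GCS.

```python
import os


def upload_directory(bucket, local_dir, prefix=""):
    """Upload every regular file in local_dir to the bucket; return object names.

    `bucket` is expected to behave like google.cloud.storage.Bucket:
    bucket.blob(name) returns an object with upload_from_filename(path).
    """
    uploaded = []
    for name in sorted(os.listdir(local_dir)):
        path = os.path.join(local_dir, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        object_name = prefix + name
        bucket.blob(object_name).upload_from_filename(path)
        uploaded.append(object_name)
    return uploaded


# Usage with the real client (the bucket name is a placeholder):
#   from google.cloud import storage
#   bucket = storage.Client().bucket("my-automl-bucket")
#   upload_directory(bucket, outputDirectory)
```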

Step 2: Add a data set to your AutoML

Create a data set using the list.csv you just uploaded:

Step 3: Train Model

The data set will upload and you should see your sentences labeled according to the CoreNLP classification.

Note that you will most likely have to remove the Verynegative and Verypositive labels, as there might be fewer than 10 items in each.
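
Dropping the sparse labels can also be done on list.csv itself before it is uploaded. A minimal sketch, assuming the two-column `filename,label` layout written by the Python script earlier; the function name and the `min_items` parameter are illustrative, with the value 10 taken from the note above.

```python
import csv
import io
from collections import Counter


def drop_rare_labels(rows, min_items=10):
    """Keep only (filename, label) rows whose label occurs at least min_items times."""
    rows = list(rows)
    label_counts = Counter(label for _, label in rows)
    return [(name, label) for name, label in rows
            if label_counts[label] >= min_items]


# Small in-memory example standing in for list.csv:
# twelve Negative sentences and a single Verypositive one.
sample_csv = io.StringIO(
    "".join("analyzed-sentence-{}.txt,Negative\n".format(i) for i in range(1, 13))
    + "analyzed-sentence-13.txt,Verypositive\n"
)
kept = drop_rare_labels(csv.reader(sample_csv))
print(len(kept), "rows kept")
```

For the real file, the same function could be fed with `csv.reader(open('list.csv', newline=''))` and the result written back out with `csv.writer`.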

You can now use the AutoML interface to train your model:

Pretty straightforward, all things considered. I haven’t run the model training yet as I need to think about compute costs, but this is pretty cool!

True cloud automation for large organizations: a case of leading by example.

by paul
If you're reading this Ashish, this is for you


As a solution engineer, it is my mandate to advise large IT organizations on how to best leverage the tools available in the market to empower their users to be autonomous. In the world of big data, this has many aspects, from defining business processes to ensuring security and governance. The key aspect, however, is the ability to “shift to the left”, meaning that end users are in control of the infrastructure necessary for their job.

Ultimately this is the unicorn every organization wants: a highly guarded infrastructure that feels like a plain of freedom to the end user.

Some organizations choose to use one vendor (e.g. a cloud vendor) to implement this unicorn vision, but soon realize (or soon will) that limiting yourself to one vendor not only restricts the playing field of end users, it paradoxically makes the implementation of consistent governance and security harder (because of vendor limitations and a tendency towards restricting cross-platform compatibility).

The solution is what I advise my customers every day: building the backbone of an organization on open, intercompatible standards that allow growth while maintaining consistency. In today’s world, that means the hybrid cloud approach.

When I joined Hortonworks (now Cloudera) a few months ago, I was impressed by the work done by a few Solution Engineers (Dan Chaffelson and Vadim Vaks). They had built an automation tool called Whoville that allows any Solution Engineer in the organization to spin up their own demo cluster and load whatever demo they want. I have led a Pre-Sales organization before: this is the dream. Keeping demos consistent across every member of the organization while empowering the herd of cats that Solution Engineers are is no easy feat. Not only that, we were eating our own dog food, leveraging our own tools to instantiate what we had been preaching!

I’m still in awe of the intelligence and competency of the team I just joined. But, being as lazy as I am, I decided to try and build an even easier button for Whoville which would:
1. Empower the rest of our organization even more
2. Show true thought leadership to our customers

So with the help of my good friend Josiah Goodson and the support of the Solution Engineering team (allowing us to work on this during a hackathon, for instance) we built Cloudbreak Cuisine.

Introducing Cloudbreak Cuisine

Cuisine is a fully featured application, running in containers, that allows Solution Engineers, in its alpha version, to:
1. Get access to the Whoville library, and deploy and delete demo clusters
2. Monitor cluster deployments via the Cloudbreak API
3. Create their own combination of Cloudbreak Blueprints & Recipes (a.k.a. bundles) for their own demo
4. Push those bundles to Cloudbreak

Below is a high level architecture of Cuisine:

Cloudbreak Cuisine Architecture

The deployment of the tool is automated, but requires access to Whoville (restricted within our own organization). All details can be found here:

The couple of videos below showcase what Cuisine can do, in its alpha version.

Glossary: Cuisine Bundles are a combination of Cloudbreak Blueprints & Recipes.


Push a bundle from Whoville

Push Bundle

Delete a Bundle via Cloudbreak

Delete Bundle

Add Custom Recipe

Add Recipe

Create Custom Bundle

Create Bundle

Download/Push Bundle

Download/Push Bundle

Additional tips/tricks

Tips & Tricks

Parting thoughts

First and foremost: thank you. Thank you Dan and Vadim for Whoville, thank you Josiah for your continuous help, and thank you Hortonworks/Cloudera for allowing us to demonstrate such thought leadership.

Secondly, if you’re a Cloudera Solution Engineer, test it and let me know what you think!

Finally, for every other reader out there: this is what an open platform can do for your organization. It truly allows you to leverage any piece of data and any infrastructure available to you, from EDGE to AI :)

A Hybrid approach to the Hybrid Cloud

by paul

Unless you have been hiding under a rock, or maybe spending too much time looking at the clouds passing by, you will have noticed that over the last couple of years the cloud has been adopted as a major part of enterprise IT infrastructure. As with everything in IT infrastructure, trends are followed with and without good reason. Like I’ve argued before, outsourcing your non-business-critical software to SaaS may make sense, while maintaining your core business on site seems to be a good approach. In this piece, however, I’d like to address the adoption of the cloud as PaaS, the pitfalls of that type of approach, and how adopting the cloud as IaaS could alleviate some of these pitfalls. Perhaps more importantly, I’d like to offer a nuanced view that will hopefully avoid an all-or-nothing approach. In short, here is how I view it:

Note: As always, I am touching here on enterprise data strategies as the backbone of a business, and therefore talking about data platforms as a whole. I’m not talking about expert systems/system of records and where/how they should be implemented.

Going all in

Exposing the flaws of going all in on one cloud is fairly straightforward. Cloud infrastructure is super attractive. Being able to spin up nodes and services at will is super attractive. It’s like being a kid at an arcade choosing what to play next. Until you run out of quarters and want to take a worthy prize home. Here are some clear limitations of going all in on one cloud and using all its services:

  • The services you use lock you in. If you develop something on AWS for instance, using Lambda or any other tool, you will have a hard time the day you want to move these applications. To some extent, all the work you have done to liberate your data from your systems of record and drive a truly data-driven business could be rendered void by going all in with one cloud.
  • Cloud vendors are very good at getting your data in and storing it for cheap, as well as running ephemeral elastic workloads. However, running long-lasting compute or getting data out can be extremely costly.
  • Options for maintaining your internal governance & security processes are very limited in the cloud.
  • Not all clouds are equal across the globe. If you truly are a global business, you must have the ability to choose the cloud vendor that is available in your region.

The hybrid approach: a great option for today

The response to these limitations comes in the form of a hybrid cloud. It is the idea of having workloads running on components that can be deployed on demand, on premises or in the cloud, in the same manner. Frankly, this solves 99% of the problems IT is trying to solve:

  • The services you use are infrastructure agnostic, and therefore allow you to maintain control of your data.
  • You can leverage cloud vendors for ephemeral workloads and on site for long lasting ones.
  • Governance and security are shared across cloud/on-prem.
  • You get to leverage any cloud.

As always, the devil is in the details. The only true way to implement a hybrid cloud is to have the same architecture on premises and in the cloud. This means separating storage and compute, as opposed to having storage and compute coupled as they traditionally are on premises. Theoretically, considering the advances in networking and in container management, morphing traditional architectures to have compute and storage separated should be fine.

The hybrid-er approach: a path towards true agnosticity

Like I mentioned before, I am a firm proponent of the hybrid approach. Nevertheless, I can’t help but imagine a world in 5 to 10 years where everyone has implemented their hybrid data platform backend and the hot new tech is a platform that provides a very specific and essential set of capabilities (think complex AI workloads) only possible by coupling compute and storage. Traditional RDBMSs weren’t fit for many types of work (e.g. large scale), but that does not mean they completely disappeared. I think we are going to see the same thing with containerization. It will be essential for many cases, but for others, different resource managers may be more appropriate. Regardless, these are truly exciting times, and I am very excited to be in the midst of this transformation.

Should all your data move to the cloud?

by paul

On several recent occasions I have been drawn into conversations about whether Fortune 500 organizations are ready to move all their data to the cloud. While I’m not disputing the benefits of distributed systems, I have encountered a significant number of organizations that are not ready to move to a SaaS model. Beyond the obvious security reasons, I think maintaining control over the core of your business is crucial to driving innovation (see the Tesla example). Furthermore, many organizations’ strategies seem to be heading towards building IaaS/PaaS, and eventually SaaS, within their own IT. These tendencies lead me to believe that the dichotomy between SaaS and traditional in-house implementation isn’t absolute. Therefore, the market will see the advent of solutions enabling control over internal data while leveraging SaaS functionalities.

Since I work for a company offering one of these solutions, I wrote a white paper about it, so here it is. Enjoy the read!