paul vidal - pragmatic big data nerd

Category Archives

22 Articles

Computer Science needs the scientific method

by paul 0 Comments
Computer Science needs the scientific method

Software development is basically the wild-west

At the exception of a small portions of academics, Computer Science is generally equated to software development. When I talk about Software development I mean it in the broadest sense possible, from scripting unix cron jobs to developing a mobile application. Thus, computer science in the colloquial term is an adolescent engineering practice that operates in a free wild-west like open world.

We basically develop new technologies through trial and error, then either iterate or choose another starting point when we hit a wall.

A good example of that is the evolution of data storage and analysis. The relational model has been extremely successful until it didn’t scale enough so we changed the paradigm to be distributed.

This method of development has been successful for one good reason: the outcome of software is 0 or 1. Data scales or doesn’t. The software works or doesn’t. To put it in better terms, the complexity of the systems we are implementing and studying in Computer Science are simple enough to be finite and fully evaluated.

Open Source isn’t Open Science

This also explains the success of the open source model. Crowd sourcing is easy when you can easily say if something is correct or not.

However, Computer Science is different today. Machine Learning and Artificial Intelligence are changing the system. Predictability of results is now statistical instead of binary, and the algorithms used can be black boxes.

This means that depending on the assumptions you took while developing your software, its outcome can differ without you being able to know if you are achieving what you were after.

For instance, you could create a deep learning software that uses facial recognition, natural language processing and data mining to determine whether one is inclined to like orange juice. There is no easy way for you to know whether the prediction of your software are true. You can know whether your model is accurate, and optimize for it, but you don’t know its real world truth value.

This is scary. Especially when you’re not trying to predict how much one likes orange juice.

Today’s guard rails are inadequate

The AI/ML industry brandishes model drift evaluation, model explainability or even ethics to address this fundamental shift. However, while these methods are necessary they fail to address the lack of methodology needed inherent to the increased complexity of the software development realm.

Indeed, before developing any software we should follow a typical scientific process including publication of methodology and hypothesis prior to development, and have the software peer reviewed.

To get a head start on how to apply this methodology, the open science taxonomy proposed by the center for open science is a good start.


  • Center for Open Science:
  • Open Science on Wikipedia:

We need ethics of data as much as we need ethics of AI

by paul 0 Comments
We need ethics of data as much as we need ethics of AI

I am always humbled and grateful by the persons I get to work with. I often find myself amazed (in the real sense of the term) and inspired by the intellectual journey of the people I encounter in the world of data. The data industry is at the forefront of human advancement: self driving cars, social networks, rocket launches, cyber attack prevention, medicine advancements, genomics… all fueled by data.

As technology advances, society advances and with it our code of ethics or what is considered morally good. A great example of that is applying modern standards of ethics to a generation or two before us. I’m sure we all had uncomfortable conversations with grand-parents.

Recently, one of the most mind boggling problem of ethics surrounds machine learning and artificial intelligence. Whether it is responding to sci-fi skynet-like ideas of artificial consciousness, or more down to earth considerations such as yet another iteration of a trolley problem substituting the subject with a self driving car, you will see write-ups popping up everywhere (often by people that don’t know much about the field). Of course, this work is important, and if you want to read a good book about this, here is the reference: Ethics and Data Science, by DJ Patil, Hilary Mason, Mike Loukides.

You do not need data science to think about ethics

Here is the core of my argument: you do not need data science to think about ethics. Ethics should be part of every piece of software brought to the public, especially when dealing with data.

Growing up, I was inspirited by Google’s modo: “Don’t be evil”, which later changed to “Do the right thing” (see wikipedia). Unfortunately, a few years later, I feel a different code of conduct coming from certain software companies, especially those who are at the center of our lives like social media.

Facebook is an example that comes very easily to mind, with recent scandals such as the Cambridge Analytica scandal. For those who need more that a moral reason of why that type of scandal is bad, look at the recent news and the exec that are leaving Facebook. Not considering ethics is just bad for business. Read the Power Paradox to get a bit more context.

Not considering ethics is just bad for business.

Now, I’m not an idealist that thinks that Facebook will give such a great service to the world without trying to monetize it. So much so that I was brought up with the idea of: if you don’t want something to be known, you should not put it on the internet. Here is the problem though: as our society evolves, this motto will be impossible to follow. Your personal data will be available on the next generation of quantum ai driven global network, or whichever buzz wordy combination makes you think of the future of technology.

This is why it is important to consider ethics now.

For software companies, it means enabling your customer to evaluate your ethics. I think that it would be done following these 3 pillars:

  • Transparency: If a social network comes to me and says: look, we are going to use your data to target adds at you, but we will never leak it to organizations promoting political campaigns and you get to see exactly which data is given to whom, I’m in! This is true for social medias, but for cloud, Software as a Service as well. If I am a small business and I subscribe to a payment processing engine, I want to know how you are using my sales data.
  • Openness: Software companies should not only tell us who uses our data, you should tell us what type of algorithms you use to create your software (not necessarily opening the code, but at least an idea of the architecture)
  • Accountability: Finally, if a company makes a mistake, own it and communicate it to the public. This seems obvious to me but apparently not common practice

For users, we need to vote with our decisions. I want companies to consider ethics to be an equally important consideration as customer satisfaction for instance, though I believe that in the long run, both are closely related. This is why I stopped using certain social networks, and I spent my days fighting against bad practices in the world of large enterprises software.

I realize defining and observing a code of ethics is difficult and that it is a moving target. Nevertheless, I feel that the software industry engine, and its oil that is data does not have the appropriate ethics institutions in place. I’d like for this to change.

Science tweets sentiment analysis: using Stanford CoreNLP to train Google AutoML

by paul 0 Comments

I have been wanting to play with Google AutoML tools, so I decided to do a quick article on how to use the Stanford CoreNLP library to train Google Auto ML. I took a simple example of parsing tweets talking about science and passing them through the two libraries. Here is a summary of this work.

Note: All content of the work can be found on my GitHub, here:

Getting Data

Getting data is easy with Nifi. I created a flow that grabs from the Science Filter Endpoint as depicted below:

As you can see, I then use PutGCSObject and PutFile to put the tweet messages respectively on my computer and on a GCS bucket (for historical purposes).

Running CoreNLP with Python

To run core NLP you will need to:

The super simple code I used below creates two type of files for our AutoML data set:

  • A new file for each sentence in a tweet
  • A list csv file containing the name of file above as well as the sentiment value of the sentence
# Import Libraries
from stanfordcorenlp import StanfordCoreNLP
import json
import os
import re

nlp = StanfordCoreNLP(r'/Users/paulvidal/Documents/Nerd/NLP/stanford-corenlp-full-2018-10-05')

# Init variables
veryNegativeResults = 0
negativeResults = 0
neutralResults = 0
positiveResults = 0
veryPositiveResults = 0
invalidResults = 0

# Loop trough files and get the tweet

rootdir = '/Users/paulvidal/Documents/Nerd/NLP/tweets'
outputDirectory = '/Users/paulvidal/Documents/Nerd/NLP/results'
csvF = open(os.path.join(outputDirectory, 'list.csv'), "w+")
sentenceNumber = 1

for subdir, dirs, files in os.walk(rootdir):
for file in files:
f = open(os.path.join(subdir, file))
rawTweet =
for k in rawTweet.split("\n"):
sentence = re.sub(r"[^a-zA-Z0-9]+", ' ', k)
# print sentence
# Sentiment
jsonSentiment = json.loads(nlp.annotate(sentence,
'annotators': 'sentiment',
'outputFormat': 'json'

currentSentence = ''
for s in jsonSentiment["sentences"]:
for y in s["tokens"]:
currentSentence = currentSentence + ' ' + y["originalText"]
if(s["sentiment"] == "Verynegative"):
veryNegativeResults = veryNegativeResults + 1
elif(s["sentiment"] == "Negative"):
negativeResults = negativeResults + 1
elif(s["sentiment"] == "Neutral"):
neutralResults = neutralResults + 1
elif(s["sentiment"] == "Positive"):
positiveResults = positiveResults + 1
elif(s["sentiment"] == "Verypositive"):
veryPositiveResults = veryPositiveResults + 1
invalidResults = invalidResults + 1
currentFilename = open(os.path.join(outputDirectory, 'analyzed-sentence-' + str(sentenceNumber) + '.txt'), "w+")

csvF.write('analyzed-sentence-' + str(sentenceNumber) + '.txt' + ',' + s["sentiment"] + '\n')
sentenceNumber = sentenceNumber + 1


print "Very Negative Results:", veryNegativeResults
print "Negative Results:", negativeResults
print "Neutral Results:", neutralResults
print "Positive Results:", positiveResults
print "Very Positive Results:", veryPositiveResults
print "Invalid Results:", invalidResults
# Do not forget to close! The backend server will consume a lot memory.

Running through about 1300 tweets, here are the results I get per sentence:

 Very Negative Results: 3
Negative Results: 1019
Neutral Results: 638
Positive Results: 110
Very Positive Results: 1
Invalid Results: 0

Using the data set to train Google AutoML

Step 0: Create a project and add AutoML Natural Language to it.

This should be fairly straight forward, a lot of documentation out there, so I’m not going to spend time on this. Refer to the google cloud tutorials, they are easy to follow.

Step 1: Upload the output of your python program to your AutoML bucket

Using the storage interface or the gsutils command line, upload all the files you created and the list.csv as depicted below:

Step 2: Add a data set to your AutoML

Create a data set using the list.csv you just uploaded:

Step 3: Train Model

The data set will upload and you should see your sentences labeled according to the coreNLP classification.

Note that you will have to most likely remove the label from verynegativs or verypositive labels as there might be less than 10 items in it.

You can now use the AutoML interface to train your model:

Pretty straight forward all considered. I haven’t run the training model yet as I need to think of compute costs, but this is pretty cool!

True cloud automation for large organizations: a case of leading by example.

by paul 0 Comments
True cloud automation for large organizations: a case of leading by example.
If you're reading this Ashish, this is for you


As a solution engineer, it is my mandate to advocate large IT organizations on how to best leverage the tools available in the market to empower their users to be autonomous. In the world of big data, this has many aspects from defining business processes to ensuring security and governance. The key aspect however is the ability to “shift to the left”, meaning that the end user is in control of the infrastructure necessary for his job.

Ultimately this is the unicorn every organization wants: a highly guarded infrastructure that feels like a plain of freedom to the end user.

Some organizations choose to use one vendor (e.g. cloud vendor) to implement this unicorn vision, but soon realize (or will soon) that the limiting yourself to one vendor not only restrict the plain field of end users it paradoxically make the implementation of consistent governance and security harder (because of vendor limitations and a tendency towards restricting cross platform compatibility)

The solution is what I advice my customer every day: building the backbone of an organization on open inter compatible standards that allow growth while maintaining consistency. In today’s world, it’s the hybrid cloud approach.

When I joined Hortonworks (now Cloudera) a few months ago I was impressed by the work done by a few Solution Engineers (Dan Chaffelson and Vadim Vaks). They had built an automation tool that would allow any Solution Engineer in the organization to spin their own demo cluster and load whatever demo they want called Whoville. I led a Pre-Sales organization before: this is the dream. Having consistency over demos for each member of the organization while empowering the herd of cats that Solution Engineers are is no easy feat. Not only that, we were eating our own dog food, leveraging our tools to instantiate what we had been preaching!

I’m still in awe of the intelligence and competency of the team I just joined. But, being as lazy as I am, I decided to try and build an even easier button for Whoville which would:
1. Empower the rest of our organization even more
2. Show true thought leadership to our customers

So with the help of my good friend Josiah Goodson and the support of the Solution Engineering team (allowing us to work on this during a hackathon for instabce) we built Cloudbreak Cuisine.

Introducing Cloudbreak Cuisine

Cuisine is fully featured application, running in containers, that allows, in its alpha version, Solution Engineers to:
1. Get access to Whoville library, deploy and delete demo clusters
2. Monitor clusters deployment via Cloudbreak API
3. Create your own combination of Cloudbreak Blueprints & Recipes (a.k.a. bundles) for your own demo
4. Push those bundles to Cloudbreak

Below is a high level architecture of Cuisine:

Cloudbreak Cuisine Architecture

The deployment of the tool is automated, but requires access to Whoville (restricted within our own organization). All details can be found here:

The couple of videos below showcase what Cuisine can do, in its alpha version.

Glossary: Cuisine Bundles are a combination of Cloudbreak Blueprints & Recipes.


Push a bundle from Whoville

Push Bundle

Delete a Bundle via Cloudbreak

Delete Bundle

Add Custom Recipe

Add Recipe

Create Custom Bundle

Create Bundle

Download/Push Bundle

Download/Push Bundle

Additional tips/tricks

Tips & Tricks

Parting thoughts

First and foremost: thank you. Thank you Dan and Vadim for Whoville, thank you Josiah for your continuous help and thank you Hortonworks/Cloudera to allow us to be demonstrating such thought leadership.

Secondly, if you’re a Cloudera Solution Engineer, test it and let me know what you think!

Finally, for every other reader out there: this is what an open platform can do to your organization. Truly allow you to leverage any piece of data or any infrastructure available to you; from EDGE to AI :)

Use Altus Director to launch a Cloudera Data Science Workbench cluster on AWS

by paul 0 Comments
Use Altus Director to launch a Cloudera Data Science Workbench cluster on AWS

My blood recently turned from green to blue (after the Hortonworks-Cloudera merger) and I couldn’t be more excited to play with new toys. What I am particularly excited about is Cloudera Data Science Workbench. But, like in everything I do, I am very lazy. So here is a quick tutorial to install Altus Director, and use it to deploy a CDH 5.15 + CDSW cluster.

Step 1: Install Altus Director

Many ways to do that, but the one I chose was the AWS install, detailed here:

The installation documentation is very well done, but here are the important excerpts

Create a VPC for your Altus instance

Follow the documentation.

Few important points:

  • In the name of laziness I also recommend to add a 0-65535 rule from your personal IP.
  • Your VPC should have an internet gateway associated with it (you could do it without, but would require you manually pulling the CM/CDH software down and make internal repositories within your subnet)
  • Do not forget to open all traffic to your security group as described here. Your deployment will not work otherwise.

Launch a Redhat 7.3 instance

You can either search communities AMIs, or use this one: ami-6871a115

Install Altus

Connect to your ec2 instance:

ssh -i your_file.pem ec2-user@your_instance_ip

Install JDK and wget

sudo yum install java-1.8.0-openjdk
sudo yum install wget

Install/Start Altus server and client:

cd /etc/yum.repos.d/
sudo wget ""
sudo yum install cloudera-director-server cloudera-director-client
sudo service cloudera-director-server start
sudo systemctl disable firewalld
sudo systemctl stop firewalld

Connect to Altus Director

Go to http://your_instance_ip:7189/ and connect with admin/admin

Step 2: Modify the Director configuration file

CDSW cluster configuration can be found here

Modify the configuration file to use:

  • Your AWS accessKeyId/secretAccessKey
  • Your AWS region
  • Your AWS subnetId (same as the one you created for your Director instance)
  • Your AWS securityGroupsIds (same as the one you created for your Director instance)
  • Your private key path (e.g. /home/ec2-user/field.pem)
  • Your AWS image (e.g. ami-6871a115)

Step 3: Launch the cluster via director client

Go to your EC2 instance where Director is installed, and load your modified configuration file as well as the appropriate key.

Finally, run the following:

cloudera-director bootstrap-remote your_configuration_file.conf \
--lp.remote.username=admin \

Step 4: Access Cloudera Manager

You can follow the bootstrapping of the cluster both on command line or in the Director interface; once done, you can connect to Cloudera Manager using: http://your_manager_instance_ip:7180/

Step 5: Configure CDSW domain with your IP

Cloudera Data Science Workbench uses DNS. The correct approach is to setup a wildcard DNS record is required, as described here.

However, for testing purposes I used The only parameter to change is the Cloudera Data Science Workbench Domain, from as the conf file sets it up to, to cdsw.[YOUR_AWS_PUBLIC_IP], as depicted below:

Restart the CDSW service, then you should be able to access CDSW by clicking on the CDSW Web UI link. Register for a new account and you will have access to CDSW:

A Hybrid approach to the Hybrid Cloud

by paul 0 Comments
A Hybrid approach to the Hybrid Cloud

Unless you have been hiding under a rock, or maybe spending too much time looking at the clouds passing by, the last couple of years have seen the advent of the adoption of the cloud as a major part of enterprise IT infrastructure.  As with everything in IT infrastructure, trends are followed for and without good reasons. Like I’ve argued before, outsourcing your non business critical software to SaaS may make sense, while maintaining your core business on site seem to be a good approach. In this piece however, I’d like to address the adoption of the cloud as PaaS, what are the pitfalls of that type of approach, and how adopting cloud as IaaS could alleviate some of these pitfalls. Perhaps more importantly, I’d like to offer a nuanced approach that will hopefully avoid an all-or-nothing approach. In short, here is how I view it:

Note: As always, I am touching here on enterprise data strategies as the backbone of a business, and therefore talking about data platforms as a whole. I’m not talking about expert systems/system of records and where/how they should be implemented.

Going all in

Exposing the flaws of going all into one cloud is fairly straight forward. Cloud infrastructure is super attractive. Being able to spin up at will nodes and services is super attractive. It’s like a kid at an arcade choosing what to play next. Until you run out of quarters and want to take a worthy price home. Here are some clear limitations of going all in into one cloud, and using all its services:

  • The services you use lock you in. If you develop something on AWS for instance, using lambda or any other tool, you will have a hard time the day you want to move these applications. To some extent, all the work you have done to liberate your data from your system of records and drive a true data driven business could be rendered void by going all in with one cloud.
  • Cloud vendors are very good a getting your data in and storing it for cheap, as well as running ephemeral elastic workloads. However, running long lasting compute or getting data out can be extremely costly.
  • Maintaining internal process of governance & security are very limited in the cloud.
  • Not all clouds are equal across the globe. If you truly are a global business you must have the ability to chose the cloud vendor that is available in your region.

The hybrid approach: a great option for today

The response to these limitations comes in form of a hybrid cloud. It is the idea of having workloads running on components that can be deployed on demand on premises or in the cloud in the same manner. Frankly, this solves 99% of the problems IT is trying to solve:

  • The services you use are infrastructure agnostic, and therefore allow you to maintain control of your data.
  • You can leverage cloud vendors for ephemeral workloads and on site for long lasting ones.
  • Governance and security are shared across cloud/on-prem.
  • You get to leverage any cloud.

As always, the devil is in the details. The only true way to implement a hybrid cloud is to have the same architecture on prem and in the cloud. This means separating storage and compute, as opposed to having storage and compute coupled like it is traditionally setup on premises. Theoretically, considering the advances of networking, and the advances of container management, morphing traditional architectures to have compute and storage separated should be fine.

The hybrid-er approach: a path towards true agnosticity

Like I mentioned before, I am a firm proponent of the hybrid approach. Nevertheless, I can’t help but imagining a world in 5 to 10 years, where everyone has implemented their hybrid data platform backend and the hot new tech is a new platform that provides a very specific and essential set of capabilities (think complex AI workloads) only possible by coupling compute and storage. Traditional RDBMS weren’t fit for many types of work (e.g. large scale, etc.), that does not mean they completely disappeared. I think we are going to see the same thing with containerization. It will be essential for many cases, but for others, different resources managers may be more appropriate. Regardless, these are truly exciting times, and I am very excited to be in the midst of this transformation.

Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin

by paul 0 Comments
Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin

Introduction & Context

There is a reason why I spent my life studying and working in computer science: understanding a computer’s psychology is usually fairly straight forward. Indeed, when presented with a specific input, computer programs tend to respond in a very predictable way, as opposed to our fellow human beings. Of course, this observation goes out of the window as our algorithms become increasingly more complex and capable of learning.

Regardless, as much as I love computer science, I always had a keen interest in human sciences. Personality psychology is a fascinating subject that has seen its ups and downs as any science topic. At the center of personality psychology reside the big five personality traits:

  • Openness to Experience
  • Conscientiousness
  • Extraversion
  • Agreeableness
  • Neuroticism (or Emotional Stability)

This taxonomy was determined by applying statistical models to personality surveys, essentially clustering results of surveys of people describing fellow human beings. As such, these traits are meant to categorize common aspect of personality across human beings without moral connotation. The validity of the model and its predictability for real life outcomes is of course controversial, and I wouldn’t make it justice here (I most likely already irritated any personality psychologist that read these first few lines).

Recently, multiple machine learning algorithms have been designed to determine these 5 personality traits from texts have surfaced, including IBM Watson personality insights. For this article I chose to use the personality recognizer written by Francois Mairesse, and automate personality detection of New York Times articles using HDF 3.1 and HDP 3.0.

Solution Overview

The solution put in place uses 3 main elements:

  • A NiFi flow to orchestrate data ingestion from API, personality detection and storage to Hive
  • Hive to store the results of the personality detection
  • Zeppelin for visualization of the results

The figure below gives an overview of the solution flow:

More precisely, the solution can be dissected in 5 main steps, that I’m describing in details below:

  • Step 1: Retrieving data from New York Times API
  • Step 2: Scrape HTML article data
  • Step 3: Run machine learning models for personality detection
  • Step 4: Store results to Hive
  • Step 5: Create simple Zeppelin notebook

Step 1: Retrieving data from New York Times API

Obtaining an API Key

This step is very straight forward. Go to and sign-up for a key:

Note: The New-York Times API is for non-commercial use only. I could have of course used any news API, but I’m not creative.

Configuring InvokeHTTP

The InvokeHTTP is used here with all default parameters, except for the URL. Here are some key configuration items and a screenshot of the Processor configuration:

  • HTTP Method: GET
  • Remote URL:“The New York Times”)&page=0&sort=newest&fl=web_url,snippet,headline,pub_date,document_type,news_desk,byline&api-key=[YOUR_KEY] (This URL selects article from the New York Times as a source, and only selects some of the fields I am interested in: web_url,snippet,headline,pub_date,document_type,news_desk,byline).
  • Content-Type: ${mime.type}
  • Run Schedule: 5 mins (could be set to a little more, I’m not sure the frequency at which new articles are published)

Extracting results from Invoke HTTP response

The API call parameter page=0 returns results 0-9; for this exercise, I’m only interested in the latest article, so I setup an evaluateJSONpath processor take care of that, as you can see below:

A few important points here:

  • The destination is set to flowfile-attribute because we are going to re-use these attributes later in the flow
  • I’m expecting the API to change after some time this article is published. Just to make sure that the JSON paths are good for your version of the API, I recommend JSON paths evaluators online.

Massage data to avoid conflicts when inserting to Hive

This step is definitely not optimized. The point here is to escape the special characters to avoid errors when inserting into hive. The only thing I am doing here is removing the ‘ from the snippet as you can see, but it would deserve a second path I think:

Step 2: Scrape HTML article data

Once we retrieved the meta data of the article, we must obtain the actual text of the article. For this, I’m using boilerpipe, an open source boilerplate removal and fulltext extraction from HTML pages (see reference for details).

Create a simple Java class to call boilerplate

After downloading the boilerpipe jars (using, use your favorite Java IDE and create this simple class:

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class extractArticle {
public static void main (String args[]) throws MalformedURLException, BoilerpipeProcessingException {
if(args.length == 1) {
URL url = new URL("" + args[0]);
String text = ArticleExtractor.INSTANCE.getText(url);
} else {
System.out.println("Please Specify URL");

Once tested, create an executable jar (in my case extractArticle.jar).

Transfer jars to nifi server

Connect to your nifi server with your nifi user and create the following directory structure:

$ cd /home/nifi
$ mkdir extractArticle
$ cd extractArticle
$ mkdir lib

Transfer the following libraries to ~/extractArticle/lib/ :

  • xerces-2.9.1.jar
  • nekohtml-1.9.13.jar
  • boilerpipe-sources-1.2.0.jar
  • boilerpipe-javadoc-1.2.0.jar
  • boilerpipe-demo-1.2.0.jar
  • boilerpipe-1.2.0.jar
  • extractArticle.jar

Create a simple Unix script to execute HTML scraping

Under ~/extractArticle/ create the script as follows:

$JDK_PATH/bin/java -Xmx512m -classpath $LIBS extractArticle $*

Configure ExecuteStreamCommand processor

Configure the processor to pass the URL in argument and outputting the output stream to the next processor, as follows:

Step 3: Run machine learning models for personality detection

Setup PersonalityRecognizer on NiFi server

Just as for boilerpipe, we’re going to run an ExecuteStream command. To prepare the files, run the following commands:

$ cd /home/nifi
$ wget recognizer-1.0.3.tar.gz
$ tar -xvf recognizer-1.0.3.tar.gz
$ cd PersonalityRecognizer
$ mkdir texts

Modify the file as follows:

# Configuration File of the Personality Recognizer
# All variables should be modified according to your
# directory structure
# Warning: for Windows paths, backslashes need to be
# doubled, e.g. c:\\Program Files\\Recognizer
# Root directory of the application
appDir = /home/nifi/PersonalityRecognizer
# Path to the LIWC dictionary file (LIWC.CAT)
liwcCatFile = ./lib/LIWC.CAT
# Path to the MRC Psycholinguistic Database file (mrc2.dct)
mrcPath = ./ext/mrc2.dct

Modify the script PersonalityRecognizer as follows:

#! /bin/bash -
# ----------------------------------
$JDK_PATH/bin/java -Xmx512m -classpath $LIBS recognizer.PersonalityRecognizer $*

Finally, create a wrapper script that, using the latest file from the folder texts runs PersonalityRecognizer and outputs only the results in a json format:

text=`ls -t texts/ | head -1`
./PersonalityRecognizer -i ./texts/$text > tmp.txt
extraversion=`cat tmp.txt | grep extraversion | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
emotional_stability=`cat tmp.txt | grep emotional | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
agreeableness=`cat tmp.txt | grep agreeableness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
conscientiousness=`cat tmp.txt | grep conscientiousness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
openness_to_experience=`cat tmp.txt | grep openness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
json_output="{\"web_url\" : \"$1\", \"extraversion\" : \"$extraversion\",\"emotional_stability\" : \"$emotional_stability\",\"agreeableness\" : \"$agreeableness\",\"conscientiousness\" : \"$conscientiousness\",\"openness_to_experience\" : \"$openness_to_experience\"}"
echo $json_output
rm tmp.txt texts/*

Configure PutFile processor to create article file

This processor takes the output stream of the HTML scraping to create a file, under the appropriate folder, as shown below:

Configure the ExecuteStreamCommand Processor

Just as for HTML scraping, configure the processor to pass the URL in argument and outputting the output stream to the next processor, as follows:

Extract attributes from JSON output

Using EvaluateJSONPath, retrieve the results of the PersonalityRecognizer to attributes:

Step 4: Store results to Hive

Create Hive DB and tables

Because we don’t control wether we receive the same article twice from the New York Times API, we need to make sure that we don’t insert the same data twice into Hive (i.e. upsert data into Hive). Upsert can be implemented by two tables and the merge command.

Therefore connect to your hive server and create one database and two tables as follows:

CREATE DATABASE personality_detection;
use personality_detection;
CREATE TABLE text_evaluation (
web_url String,
snippet String,
byline String,
pub_date date,
headline String,
document_type String,
news_desk String,
last_updated String,
extraversion decimal(10,4),
emotional_stability decimal(10,4),
agreeableness decimal(10,4),
conscientiousness decimal(10,4),
openness_to_experience decimal(10,4)
clustered by (web_url) into 2 buckets stored as orc

CREATE TABLE all_updates (
web_url String,
snippet String,
byline String,
pub_date date,
headline String,
document_type String,
news_desk String,
last_updated String,
extraversion decimal(10,4),
emotional_stability decimal(10,4),
agreeableness decimal(10,4),
conscientiousness decimal(10,4),
openness_to_experience decimal(10,4)
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

Create HiveQL script

Using a ReplaceText processor, create the appropriate HiveQL command to be executed to upsert data into your tables from the data collected in the flow.

Code for Replacement Value (note that I remove the timestamp from the pub_date here, because I’m storing it as a date):

use personality_detection;

insert into all_updates values('${web_url}','${snippet}','${byline}','${pub_date:substring(0,10)}','${headline}','${document_type}','${news_desk}','${now()}','${extraversion}','${emotional_stability}','${agreeableness}','${conscientiousness}','${openness_to_experience}');

merge into text_evaluation
using (select distinct web_url, snippet, byline, pub_date, headline, document_type, news_desk, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience from all_updates) all_updates on text_evaluation.web_url = all_updates.web_url
when matched then update set
when not matched then insert
values(all_updates.web_url,all_updates.snippet, all_updates.byline, all_updates.pub_date, all_updates.headline, all_updates.document_type,
all_updates.news_desk, from_unixtime(unix_timestamp()), all_updates.extraversion, all_updates.emotional_stability, all_updates.agreeableness, all_updates.conscientiousness, all_updates.openness_to_experience);

truncate table all_updates;

Processor Overview:

Upsert data to hive

Finally, configure a simple PutHiveQL processor as follows (make sure you configured your HiveConnectionPool beforehand):

Step 5: Create simple Zeppelin notebook

Lastly, after running the NiFi flow for a while, create a simple Zeppelin notebook to show your result. This notebook will use the jdbc interpreter for Hive and run the following query:

select byline, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience from personality_detection.text_evaluation limit 10

Then, you can play with Zeppelin visualizations to display the average of the big 5 by byline:


While being a very simple, this exercise is a good starting point for on-the-wire personality recognition. More importantly, in an age of information overload or even misinformation, having the ability to classifying the psychology of a text on the fly can be extremely useful. I do plan on tinkering with this project, improving performance, optimizing models and ingesting more data, so stay tuned!

Known possible improvements

  • Better control of data retrieval to avoid duplicate flows (depends on API)
  • Better special character replacement for HiveQL command
  • More elegant way to execute data scraping and run personality recognition java classes
  • Additional scraping from article text to remove title, byline, and other unnecessary information from boilerpipe output
  • More thorough testing of different personality recognizer models (and use other/more recent libraries)


  • Big Five personality traits:
  • The Big-Five Trait Taxonomy: History, Measurement, and Theoretical Perspectives:
  • IBM Watson Personality Insights:
  • Personality Recognizer by Francois Mairesse:
  • NYT API sign up:
  • NYT Article Search readme:
  • JSON Path Evaluator:
  • Boilerpipe jar download:
  • Boilerpipe github:

Making data analytics operational

by paul 0 Comments
Making data analytics operational
I refuse to use the term-that-should-not-be-used when describing stale data lakes.

After 6 months of silence, I finally take the time to get back behind my keyboard. I would like to say that I used these 6 months to reflect upon my writing, the current data market and came out of this hiatus a better, more informed and well versed person, but that would be a lie. And, despite the current pace at which the social fabric of our society is moving towards considering lies as acceptable and moral, I prefer not to. I don’t really know why I stopped writing for a bit, but most likely because I had nothing to say. So today, brace yourselves for a semi-informed opinion piece on data analytics, because I actually changed my opinion a bit on it through real-life experience.

My opinion then: analytics are a fringe use case of data management

In my article “Why data driven companies should stop investing in data analytics” I argued for the death of dashboards. I still stand by that point of view, as too often the Business Intelligence (BI) platforms are an end point of the data life cycle. Countless data replication processes, ETL, busses and other goldengate push data into data warehouses or data lakes where data scientists pat themselves on the back by showing dashboards that could potentially contain information to be integrated in the current business processes. Quick aside and nugget of knowledge from my PhD friends: if your title contains “science” in it, you’re not a real scientist. Shots fired. Moving on, while I still stand knee deep in stale data lakes despite being on my soapbox, there is one thing I did not consider enough: Machine Learning algorithms. There are two main reasons why the existence of machine learning algorithms as they are implemented now changes my opinion. First and foremost, the problem I describe of BI being the end of the data chain and its outcome only being driven by humans trying to improve business process can be alleviated with analytics automation via these algorithms (to some extent at the moment, but will be more and more true as the technology progresses). Secondly, ML needs access to data lakes, not operational big data. The algorithms need to be able to train using any data sets, looking at data from any angle in order to make usable predictions.

My opinion now: analytics need to be better integrated in the data life cycle

Consequently, here my proposal to the data world. We need to envision an architecture where data warehouses are not the raiders of the lost ark type but more the amazon type: they need to be an inherent part of the data life cycle. Drilling a bit further in the architecture I contemplate, your data as a service layer would feed current data sets to your data warehouse, where ML would run asynchronously, but the outcome of these analytics would then feed back the rules of data manipulation embedded in your DaaS layer. If you manage a constant feedback loop of the kind, your end user application served by your DaaS will constantly get fed more accurate and relevant data, which in turn can enable the next generation of platforms: Information as a Service. But that’s for another day.

Why data driven companies should stop investing in data analytics

by paul 0 Comments
Why data driven companies should stop investing in data analytics

Despite the blatantly click-baitish and oxymoronic nature of the title of this post might lead you to believe, I actually have a point that I think is worth expressing (which I suppose is the reason I write any post, really). As it is often the case, this point is coming from an accumulation of real-life situations I encounter at my job. A lot of my work consist in rapidly enabling access to data to which companies have a really hard time accessing and distributing it in a very efficient big data architecture. The funny thing is, when I go to a customer and tell them that they will get access to all this data within a few days timeframe, they are either unprepared as to what to do with this data. Indeed, they (and we as an industry) have been focusing so much on the integration and exposure of data that we forgot to think about the value it can bring us. What’s the use case everyone think of when their data access is enabled? Analytics. “We’ll use [Name your favorite BI tool] to build dashboards and gain insight. This, to me, is incomplete and extremely short-sighted.

How analytics became synonym with Big Data

Before I dive into the reasons I suddenly want to make a good chunk of the big data industry hate me, let me try to express how we got to equate the term big data, or data in general with Business Intelligence or analytics. Note that I am trying to describe trends here; I am aware that there are outliers to these trends. With this said, and now that I have a license to express completely unverified facts since I apologized in advance, here is what I observed. When databases were first created, they were an enablement layer for software. We just needed a way to store data more efficiently in order for you application to run faster. Eventually, data changed from being a necessary evil to being a source of value. Indeed, once we realized that every activity that we engage in involves a piece of software, the data piece entrenched in each and every of these pieces of software became the best way to understand our own (and our users’) behavior. This is the advent of data warehouses, and BI tools. Then we realized there was a lot of data (I think the technical term is a shit ton), so we started developing big data lakes. In this transition, big data primary use became what data warehouses were used for: data analytics.

The death of dashboards

However, data analytics is only a very small percentage of the data use cases. Remember, data layers were design to enable applications, not showing you what they contain. Yes, dashboards and graphics are pretty but what is their goal? Their goal is to give you an idea on what whoever interacts with your software is doing in order to design solutions to palliate to the problems you find. Somehow however, analytics became a finality. Companies spend an insane amount of investment to achieve data analytics. This is extremely misguided. To be fair, part of the reason why analytics are a finality is a product of the limitation of data lakes, but that’s another topic.

Using data pro-actively

Solving the data integration, consolidation, distribution and exposure problem is not easy but it is being solved (I can tell that with confidence since I am on the front of that battle every day, though I would not say I put my life in danger every day, so that battle analogy stops here). My advice is to think beyond analyzing the data as a use case once you are able to have access to it. Instead of trying to identify trends, think about how to change them. Instead of trying to build an individualized snapshot of your customer, think about what action you should take based on that snapshot. Instead of getting a consolidated view of all of your systems, think about how to better orchestrate data flows between these systems to minimize the need for consolidation. I am purposely not listing specific examples, because they require a deep industry expertise which is not what I am trying to highlight here (my expertise being in data, not a specific industry). So, next time you are confronted with someone building a dashboard, ask yourself: why? why am building BI on top of my data? is BI going to give me insight on a problem to solve? if so, what is that problem? Once you have the answer to that question, try to build a platform that identifies and solves that problem rather than a platform that only allows you to identify it.

The Big Data market in 2017

by paul 0 Comments
The Big Data market in 2017
Buildings always look badass from the ground and with a black and white filter

Accepting reality is not trivial. As we sit in our echo-chambers, particularly exacerbated by our social networks, preference algorithms and suggested searches, our cognitive biases betray our picture of the world around us. Add that to the fact that everyone else than us is an idiot with whom we should not engage in a conversation (or if you prefer, call this mild social anxiety), and pretty soon you can convince yourself of anything. With this in mind, I decided to take the time today to expose my understanding of the reality of the big data market in 2017, for large entreprises. While it is inherently biased, arguably like any piece of writing, I did try to do my research reading a lot of white papers recently (of which you can find some reference below), but mostly, this is my domain of expertise which means I’m confronted to it every day. Of course, this categorization is subject to discussion and constructive criticism that I always welcome.

Out, or very little relevancy

  • Data Lakes: The fascination for unlimited data distribution has passed. Enterprises struggle to find a use to their data lakes and the layers written on top of them to make them useful seem too much effort for little reward.
  • Pure data analytics: The terms Data Analytics or Business Intelligence encapsulate a vast number of concepts that will always be useful one way or another in our data driven world. What is nowadays losing momentum is solutions making analytics the end goal (analyzing trends, population sub group preferences, etc.). BI is a very small portion of what Big Data offers and if the end goal of a solution is to give you trend analytics, it is too reductive.
  • SOA, ESBs, Convergent applications: This is been dead for a while but worth mentioning. The idea of a single convergent enterprise solution to encapsulate all data and functionalities is practically not feasible (too much market change, too much cost, too much complexity, too little agility).

Extremely relevant for 2017

  • Agile data platforms: At the opposite end of ESBs and massive consolidation into one data system is the micro-services architecture. The architecture enables extremely rapid and agile deployment of applications to respond to an ever changing market where end customers have more choices than ever and thus are very hard to retain. The bottleneck of micro-services architectures is often data. Being able to consolidate, cleanse and expose rapidly data to anywhere is a complicated proposition but some platforms can do it. If a platform is able to integrate from multiple sources, consolidate and expose data rapidly, then it enables today’s hottest use cases: digital transformation, agile test data management, micro-services implementation, and more.
  • Cloud enablement: More than ever, and despite previous reticense vis-a-vis security (just like older generations are still reluctant to use their credit card numbers on the web), the movement to cloud applications and platforms. Enabling the cloud, that is not only exposing/migrating data to cloud applications but also ensuring security, compliance and control over the data exposed is therefore a very important market trend.
  • Data Personalization: we live in a world where everyone expects their experience to be catered to them. Having to repeat your identity while being tossed from department to department on a help desk line is one of the most infuriating experience (after the complete loss of human rights and dignity one experiences in an airport). Seriously though, enabling the understanding of the individual is crucial, whether that individual is a person, a product or a machine in IoT use cases.

Not quite there yet

  • Predictive Individual Analytics: We already see some implementations of this in ad personalization or preference settings, but being able to predict what an entity (a person, a machine, a car, a product) will do, what it wants and needs is going to open the door to systems that give answer instead of respond to questions. It requires the problem of data personalization to be solved beforehand though.
  • Smart Data Discovery: Once agile data platforms are in place, the use-case of automated data mining will explode. Too many systems with too little experts on the systems will give birth to solution that’ll enable the enterprise to recover a fair percentage of the relevant data without human intervention.
  • Expert AI systems: Finally, and most exciting of all are expert AI systems. These are software that will replace the way that data is currently fed to our everyday software (CRM, machine monitoring, marketing analytics, etc.). The use cases are still not clear in my head but I know that finding the point where human intervention is the most costly (where it requires pattern recognition), and replacing it with automated AI will be a game changer.

Some references

  • Is the Cloud Secure? (Gartner): link
  • Marketing data management (Ascend2): link
  • Seizing the Digital Advantage in Banking and Financial Services (Cognizant): link
  • The Big data workbook (Informatica): link
  • Agile Test Data Management: The New Must-Have (Forrester): link