paul vidal - thoughts about tech and more


Why I chose to go meta.

by paul 0 Comments

I’ve always been one to encourage people to embrace complexity, especially when it comes to tech. Having worked in technical sales/pre-sales/solution engineering for the past few years, I often fall back on my favorite answer when asked a yes or no question: “it depends”.

The fact is that there is always a choice in tech, there are always elements you can’t control or predict, and there is always a deeper level you can dive into. The result is that decision makers have a hard time … making a decision.

My approach to helping these decision makers (often my direct customers) is, and always has been, transparency and enablement. If you don’t believe me, check out the articles I wrote in the Cloudera community over the past few years, my GitHub, or this very blog.

Bottom line: I have always loved enabling people to explain hard concepts.

Now, I have been living in a very complex and technical ecosystem for essentially my whole career, and showcasing this stuff is … cumbersome. There is a reason why finding a good SE is very hard.

Recently however, while working on a side project (yes, it involves Magic: The Gathering), I came across something so elegant and useful for this type of enablement that it really piqued my interest.

My first instinct was to try it out. It delivered.

From there, I got a chance to enable the founders of this tool to build their enterprise sales organization. And this is exactly what I intend to do.

So to summarize: I’m starting a position to enable sales for a company that enables software users to enable their audience. It’s pretty straightforward.

Seriously, here is the TL;DR:

  • I’m starting a new position at Reprise as VP of Enterprise Sales
  • Reprise has an awesome product that is ahead of the market: it enables you to record your application and distribute it in a matter of minutes without cost.
  • This means that the tool drives cost reduction, faster enablement and pipeline generation.
  • I’m particularly excited about this new challenge, because I think it will really help our industry
  • Yes, this is a sales role and I’m coming from the tech side. So you can be sure I can see value for the techies.
  • Yes, I will continue to code stuff on the side.
  • Also yes, you can expect tutorials of my favorite tools on a regular basis.

Computer Science needs the scientific method

by paul 0 Comments

Software development is basically the wild-west

With the exception of a small portion of academics, Computer Science is generally equated to software development. When I talk about software development I mean it in the broadest sense possible, from scripting unix cron jobs to developing a mobile application. Thus, computer science in the colloquial sense is an adolescent engineering practice that operates in a free, wild-west-like open world.

We basically develop new technologies through trial and error, then either iterate or choose another starting point when we hit a wall.

A good example of that is the evolution of data storage and analysis. The relational model was extremely successful until it didn’t scale enough, so we changed the paradigm to be distributed.

This method of development has been successful for one good reason: the outcome of software is 0 or 1. Data scales or doesn’t. The software works or doesn’t. To put it in better terms, the systems we are implementing and studying in Computer Science are simple enough to be finite and fully evaluated.

Open Source isn’t Open Science

This also explains the success of the open source model. Crowd sourcing is easy when you can easily say if something is correct or not.

However, Computer Science is different today. Machine Learning and Artificial Intelligence are changing the system. Predictability of results is now statistical instead of binary, and the algorithms used can be black boxes.

This means that depending on the assumptions you took while developing your software, its outcome can differ without you being able to know if you are achieving what you were after.

For instance, you could create deep learning software that uses facial recognition, natural language processing and data mining to determine whether one is inclined to like orange juice. There is no easy way for you to know whether the predictions of your software are true. You can know whether your model is accurate, and optimize for it, but you don’t know its real world truth value.

This is scary. Especially when you’re not trying to predict how much one likes orange juice.

Today’s guard rails are inadequate

The AI/ML industry brandishes model drift evaluation, model explainability or even ethics to address this fundamental shift. However, while these methods are necessary, they fail to address the lack of methodology that the increased complexity of software development now calls for.

Indeed, before developing any software we should follow a typical scientific process including publication of methodology and hypothesis prior to development, and have the software peer reviewed.

To get started applying this methodology, the open science taxonomy proposed by the Center for Open Science is a good place to begin.


  • Center for Open Science:
  • Open Science on Wikipedia:

We need ethics of data as much as we need ethics of AI

by paul 0 Comments

I am always humbled by and grateful for the people I get to work with. I often find myself amazed (in the real sense of the term) and inspired by the intellectual journey of the people I encounter in the world of data. The data industry is at the forefront of human advancement: self driving cars, social networks, rocket launches, cyber attack prevention, medical advances, genomics… all fueled by data.

As technology advances, society advances and with it our code of ethics or what is considered morally good. A great example of that is applying modern standards of ethics to a generation or two before us. I’m sure we all had uncomfortable conversations with grand-parents.

Recently, one of the most mind boggling problems of ethics surrounds machine learning and artificial intelligence. Whether it is responding to sci-fi skynet-like ideas of artificial consciousness, or more down to earth considerations such as yet another iteration of the trolley problem substituting the subject with a self driving car, you will see write-ups popping up everywhere (often by people who don’t know much about the field). Of course, this work is important, and if you want to read a good book about this, here is the reference: Ethics and Data Science, by DJ Patil, Hilary Mason, Mike Loukides.

You do not need data science to think about ethics

Here is the core of my argument: you do not need data science to think about ethics. Ethics should be part of every piece of software brought to the public, especially when dealing with data.

Growing up, I was inspired by Google’s motto: “Don’t be evil”, which later changed to “Do the right thing” (see Wikipedia). Unfortunately, a few years later, I sense a different code of conduct coming from certain software companies, especially those who are at the center of our lives like social media.

Facebook is an example that comes very easily to mind, with recent scandals such as the Cambridge Analytica scandal. For those who need more than a moral reason why that type of scandal is bad, look at the recent news and the execs that are leaving Facebook. Not considering ethics is just bad for business. Read the Power Paradox to get a bit more context.


Now, I’m not an idealist who thinks that Facebook will give such a great service to the world without trying to monetize it. So much so that I was brought up with the idea of: if you don’t want something to be known, you should not put it on the internet. Here is the problem though: as our society evolves, this motto will be impossible to follow. Your personal data will be available on the next generation of quantum AI-driven global network, or whichever buzzwordy combination makes you think of the future of technology.

This is why it is important to consider ethics now.

For software companies, it means enabling your customers to evaluate your ethics. I think it could be done following these 3 pillars:

  • Transparency: If a social network comes to me and says: look, we are going to use your data to target ads at you, but we will never leak it to organizations promoting political campaigns and you get to see exactly which data is given to whom, I’m in! This is true for social media, but for cloud and Software as a Service as well. If I am a small business and I subscribe to a payment processing engine, I want to know how you are using my sales data.
  • Openness: Software companies should not only tell us who uses our data, they should also tell us what type of algorithms they use to create their software (not necessarily opening the code, but at least an idea of the architecture).
  • Accountability: Finally, if a company makes a mistake, it should own it and communicate it to the public. This seems obvious to me but is apparently not common practice.

For users, we need to vote with our decisions. I want companies to treat ethics as a consideration equally important as customer satisfaction for instance, though I believe that in the long run, both are closely related. This is why I stopped using certain social networks, and why I spend my days fighting against bad practices in the world of large enterprise software.

I realize defining and observing a code of ethics is difficult and that it is a moving target. Nevertheless, I feel that the software industry engine, and the data that is its oil, do not have the appropriate ethics institutions in place. I’d like for this to change.

Science tweets sentiment analysis: using Stanford CoreNLP to train Google AutoML

by paul 0 Comments

I have been wanting to play with the Google AutoML tools, so I decided to do a quick article on how to use the Stanford CoreNLP library to train Google AutoML. I took a simple example of parsing tweets talking about science and passing them through the two libraries. Here is a summary of this work.

Note: All content of the work can be found on my GitHub, here:

Getting Data

Getting data is easy with NiFi. I created a flow that grabs tweets from the Science Filter Endpoint as depicted below:

As you can see, I then use PutGCSObject and PutFile to put the tweet messages respectively on my computer and on a GCS bucket (for historical purposes).

Running CoreNLP with Python

To run core NLP you will need to:

The super simple code I used below creates two types of files for our AutoML data set:

  • A new file for each sentence in a tweet
  • A list.csv file containing the name of each file above as well as the sentiment value of its sentence

# Import libraries
from stanfordcorenlp import StanfordCoreNLP
import json
import os
import re

nlp = StanfordCoreNLP(r'/Users/paulvidal/Documents/Nerd/NLP/stanford-corenlp-full-2018-10-05')

# Init variables
veryNegativeResults = 0
negativeResults = 0
neutralResults = 0
positiveResults = 0
veryPositiveResults = 0
invalidResults = 0

# Loop through files and get the tweet
rootdir = '/Users/paulvidal/Documents/Nerd/NLP/tweets'
outputDirectory = '/Users/paulvidal/Documents/Nerd/NLP/results'
csvF = open(os.path.join(outputDirectory, 'list.csv'), "w+")
sentenceNumber = 1

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        f = open(os.path.join(subdir, file))
        rawTweet = f.read()
        f.close()
        for k in rawTweet.split("\n"):
            # Keep only letters and digits so odd characters do not confuse the parser
            sentence = re.sub(r"[^a-zA-Z0-9]+", ' ', k)
            # Sentiment
            jsonSentiment = json.loads(nlp.annotate(sentence, properties={
                'annotators': 'sentiment',
                'outputFormat': 'json'
            }))

            for s in jsonSentiment["sentences"]:
                # Rebuild the sentence from its tokens (reset for each sentence so that
                # every output file holds exactly one sentence)
                currentSentence = ''
                for y in s["tokens"]:
                    currentSentence = currentSentence + ' ' + y["originalText"]
                if s["sentiment"] == "Verynegative":
                    veryNegativeResults = veryNegativeResults + 1
                elif s["sentiment"] == "Negative":
                    negativeResults = negativeResults + 1
                elif s["sentiment"] == "Neutral":
                    neutralResults = neutralResults + 1
                elif s["sentiment"] == "Positive":
                    positiveResults = positiveResults + 1
                elif s["sentiment"] == "Verypositive":
                    veryPositiveResults = veryPositiveResults + 1
                else:
                    invalidResults = invalidResults + 1
                # One text file per sentence, plus one "file name,label" row in list.csv
                currentFile = open(os.path.join(outputDirectory, 'analyzed-sentence-' + str(sentenceNumber) + '.txt'), "w+")
                currentFile.write(currentSentence)
                currentFile.close()
                csvF.write('analyzed-sentence-' + str(sentenceNumber) + '.txt' + ',' + s["sentiment"] + '\n')
                sentenceNumber = sentenceNumber + 1

csvF.close()

print("Very Negative Results:", veryNegativeResults)
print("Negative Results:", negativeResults)
print("Neutral Results:", neutralResults)
print("Positive Results:", positiveResults)
print("Very Positive Results:", veryPositiveResults)
print("Invalid Results:", invalidResults)

# Do not forget to close! The backend server will consume a lot memory.
nlp.close()

Running through about 1300 tweets, here are the results I get per sentence:

Very Negative Results: 3
Negative Results: 1019
Neutral Results: 638
Positive Results: 110
Very Positive Results: 1
Invalid Results: 0
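
To put these counts in proportion, here is a minimal sketch (not part of the original script) that turns them into shares of all analyzed sentences. The dictionary simply restates the numbers above; the function name label_shares is made up for illustration.

```python
# Counts copied from the results above
counts = {
    "Verynegative": 3,
    "Negative": 1019,
    "Neutral": 638,
    "Positive": 110,
    "Verypositive": 1,
}


def label_shares(counts):
    """Return each label's share of the total, as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


shares = label_shares(counts)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The two extreme labels are tiny compared to the rest, which is worth keeping in mind when the data set is loaded into AutoML.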

Using the data set to train Google AutoML

Step 0: Create a project and add AutoML Natural Language to it.

This should be fairly straightforward and there is a lot of documentation out there, so I’m not going to spend time on this. Refer to the Google Cloud tutorials, they are easy to follow.

Step 1: Upload the output of your python program to your AutoML bucket

Using the storage interface or the gsutil command line, upload all the files you created and the list.csv as depicted below:

Step 2: Add a data set to your AutoML

Create a data set using the list.csv you just uploaded:

Step 3: Train Model

The data set will upload and you should see your sentences labeled according to the coreNLP classification.

Note that you will most likely have to remove the Verynegative and Verypositive labels, as each of them might contain fewer than 10 items.
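
As an alternative to removing those labels in the interface, the rare labels could be filtered out of list.csv before uploading. The sketch below is one possible way to do it, not what I actually ran: the function names are hypothetical, and the threshold of 10 is taken from the note above.

```python
import csv
from collections import Counter

MIN_ITEMS = 10  # assumption: minimum number of items per label, from the note above


def drop_rare_labels(rows, min_items=MIN_ITEMS):
    """Keep only (filename, label) rows whose label appears at least min_items times."""
    label_counts = Counter(label for _, label in rows)
    return [(name, label) for name, label in rows if label_counts[label] >= min_items]


def filter_csv(src_path, dst_path, min_items=MIN_ITEMS):
    """Read list.csv-style rows from src_path, write the filtered rows to dst_path.

    Returns the number of rows that were dropped.
    """
    with open(src_path, newline="") as src:
        rows = [tuple(r) for r in csv.reader(src) if len(r) == 2]
    kept = drop_rare_labels(rows, min_items)
    with open(dst_path, "w", newline="") as dst:
        csv.writer(dst, lineterminator="\n").writerows(kept)
    return len(rows) - len(kept)
```

Calling filter_csv('list.csv', 'list-filtered.csv') would leave the original file untouched and write a second file containing only labels with enough examples.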

You can now use the AutoML interface to train your model:

Pretty straightforward, all things considered. I haven’t run the model training yet as I need to think about compute costs, but this is pretty cool!

True cloud automation for large organizations: a case of leading by example.

by paul 0 Comments
If you're reading this Ashish, this is for you


As a solution engineer, it is my mandate to advise large IT organizations on how to best leverage the tools available in the market to empower their users to be autonomous. In the world of big data, this has many aspects, from defining business processes to ensuring security and governance. The key aspect however is the ability to “shift to the left”, meaning that end users are in control of the infrastructure necessary for their job.

Ultimately this is the unicorn every organization wants: a highly guarded infrastructure that feels like a plain of freedom to the end user.

Some organizations choose to use one vendor (e.g. a cloud vendor) to implement this unicorn vision, but soon realize (or soon will) that limiting yourself to one vendor not only restricts the playing field of end users, it paradoxically makes the implementation of consistent governance and security harder (because of vendor limitations and a tendency towards restricting cross platform compatibility).

The solution is what I advise my customers every day: building the backbone of an organization on open, inter-compatible standards that allow growth while maintaining consistency. In today’s world, it’s the hybrid cloud approach.

When I joined Hortonworks (now Cloudera) a few months ago I was impressed by the work done by a few Solution Engineers (Dan Chaffelson and Vadim Vaks). They had built an automation tool called Whoville that allows any Solution Engineer in the organization to spin up their own demo cluster and load whatever demo they want. I led a Pre-Sales organization before: this is the dream. Keeping demos consistent across the organization while empowering the herd of cats that Solution Engineers are is no easy feat. Not only that, we were eating our own dog food, leveraging our tools to instantiate what we had been preaching!

I’m still in awe of the intelligence and competency of the team I just joined. But, being as lazy as I am, I decided to try and build an even easier button for Whoville which would:
1. Empower the rest of our organization even more
2. Show true thought leadership to our customers

So with the help of my good friend Josiah Goodson and the support of the Solution Engineering team (allowing us to work on this during a hackathon for instance) we built Cloudbreak Cuisine.

Introducing Cloudbreak Cuisine

Cuisine is a fully featured application, running in containers, that allows, in its alpha version, Solution Engineers to:
1. Get access to Whoville library, deploy and delete demo clusters
2. Monitor clusters deployment via Cloudbreak API
3. Create your own combination of Cloudbreak Blueprints & Recipes (a.k.a. bundles) for your own demo
4. Push those bundles to Cloudbreak

Below is a high level architecture of Cuisine:

Cloudbreak Cuisine Architecture

The deployment of the tool is automated, but requires access to Whoville (restricted within our own organization). All details can be found here:

The couple of videos below showcase what Cuisine can do, in its alpha version.

Glossary: Cuisine Bundles are a combination of Cloudbreak Blueprints & Recipes.


Push a bundle from Whoville

Push Bundle

Delete a Bundle via Cloudbreak

Delete Bundle

Add Custom Recipe

Add Recipe

Create Custom Bundle

Create Bundle

Download/Push Bundle

Download/Push Bundle

Additional tips/tricks

Tips & Tricks

Parting thoughts

First and foremost: thank you. Thank you Dan and Vadim for Whoville, thank you Josiah for your continuous help and thank you Hortonworks/Cloudera for allowing us to demonstrate such thought leadership.

Secondly, if you’re a Cloudera Solution Engineer, test it and let me know what you think!

Finally, for every other reader out there: this is what an open platform can do for your organization. It truly allows you to leverage any piece of data or any infrastructure available to you; from EDGE to AI :)

Use Altus Director to launch a Cloudera Data Science Workbench cluster on AWS

by paul 0 Comments

My blood recently turned from green to blue (after the Hortonworks-Cloudera merger) and I couldn’t be more excited to play with new toys. What I am particularly excited about is Cloudera Data Science Workbench. But, like in everything I do, I am very lazy. So here is a quick tutorial to install Altus Director, and use it to deploy a CDH 5.15 + CDSW cluster.

Step 1: Install Altus Director

Many ways to do that, but the one I chose was the AWS install, detailed here:

The installation documentation is very well done, but here are the important excerpts:

Create a VPC for your Altus instance

Follow the documentation.

A few important points:

  • In the name of laziness I also recommend adding a 0-65535 rule from your personal IP.
  • Your VPC should have an internet gateway associated with it (you could do without, but that would require manually pulling the CM/CDH software down and making internal repositories within your subnet).
  • Do not forget to open all traffic to your security group as described here. Your deployment will not work otherwise.

Launch a Red Hat 7.3 instance

You can either search the community AMIs, or use this one: ami-6871a115

Install Altus

Connect to your ec2 instance:

ssh -i your_file.pem ec2-user@your_instance_ip

Install JDK and wget

sudo yum install java-1.8.0-openjdk
sudo yum install wget

Install/Start Altus server and client:

cd /etc/yum.repos.d/
sudo wget ""
sudo yum install cloudera-director-server cloudera-director-client
sudo service cloudera-director-server start
sudo systemctl disable firewalld
sudo systemctl stop firewalld

Connect to Altus Director

Go to http://your_instance_ip:7189/ and connect with admin/admin

Step 2: Modify the Director configuration file

CDSW cluster configuration can be found here

Modify the configuration file to use:

  • Your AWS accessKeyId/secretAccessKey
  • Your AWS region
  • Your AWS subnetId (same as the one you created for your Director instance)
  • Your AWS securityGroupsIds (same as the one you created for your Director instance)
  • Your private key path (e.g. /home/ec2-user/field.pem)
  • Your AWS image (e.g. ami-6871a115)
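
Since a single forgotten value is enough to make the bootstrap fail, a quick pre-flight check of the file can save a round trip. This is only a sketch under assumptions: the setting names are taken from the list above and may be spelled differently in your file, REPLACE-ME is an assumed marker for values that have not been filled in, and the function name is made up.

```python
import re

# Setting names taken from the list above; the exact spelling in your file may differ
REQUIRED_KEYS = ["accessKeyId", "secretAccessKey", "region",
                 "subnetId", "securityGroupsIds", "image"]
PLACEHOLDER = "REPLACE-ME"  # assumption: marker for values not yet filled in


def missing_settings(conf_text, keys=REQUIRED_KEYS, placeholder=PLACEHOLDER):
    """Return the keys that are absent, empty, or still set to the placeholder."""
    missing = []
    for key in keys:
        # Match `key: value` or `key = value` at the start of a line,
        # which skips lines that are commented out
        match = re.search(rf"^\s*{re.escape(key)}\s*[:=]\s*(.+)$", conf_text, re.MULTILINE)
        value = match.group(1).strip().strip('"') if match else ""
        if not value or placeholder in value:
            missing.append(key)
    return missing


sample = (
    'accessKeyId: "AKIA123"\n'
    'secretAccessKey: "REPLACE-ME"\n'
    'region: us-east-1\n'
    '# subnetId: subnet-1\n'
    'securityGroupsIds: sg-1\n'
    'image: ami-6871a115\n'
)
print(missing_settings(sample))
```

The check is purely textual and does not parse the file format, so treat an empty result as a hint rather than a guarantee.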

Step 3: Launch the cluster via director client

Go to your EC2 instance where Director is installed, and load your modified configuration file as well as the appropriate key.

Finally, run the following:

cloudera-director bootstrap-remote your_configuration_file.conf \
--lp.remote.username=admin \
--lp.remote.password=admin

Step 4: Access Cloudera Manager

You can follow the bootstrapping of the cluster both on command line or in the Director interface; once done, you can connect to Cloudera Manager using: http://your_manager_instance_ip:7180/

Step 5: Configure CDSW domain with your IP

Cloudera Data Science Workbench uses DNS. The correct approach is to set up a wildcard DNS record, as described here.

However, for testing purposes I used The only parameter to change is the Cloudera Data Science Workbench Domain, from as the conf file sets it up to, to cdsw.[YOUR_AWS_PUBLIC_IP], as depicted below:

Restart the CDSW service, then you should be able to access CDSW by clicking on the CDSW Web UI link. Register for a new account and you will have access to CDSW:

A Hybrid approach to the Hybrid Cloud

by paul 0 Comments

Unless you have been hiding under a rock, or maybe spending too much time looking at the clouds passing by, you know that the last couple of years have seen the cloud adopted as a major part of enterprise IT infrastructure. As with everything in IT infrastructure, trends are followed with and without good reasons. Like I’ve argued before, outsourcing your non business critical software to SaaS may make sense, while maintaining your core business on site seems to be a good approach. In this piece however, I’d like to address the adoption of the cloud as PaaS, the pitfalls of that type of approach, and how adopting cloud as IaaS could alleviate some of them. Perhaps more importantly, I’d like to offer a nuanced view that will hopefully avoid an all-or-nothing approach. In short, here is how I view it:

Note: As always, I am touching here on enterprise data strategies as the backbone of a business, and therefore talking about data platforms as a whole. I’m not talking about expert systems/system of records and where/how they should be implemented.

Going all in

Exposing the flaws of going all in on one cloud is fairly straightforward. Cloud infrastructure is super attractive. Being able to spin up nodes and services at will is super attractive. It’s like a kid at an arcade choosing what to play next. Until you run out of quarters and want to take a worthy prize home. Here are some clear limitations of going all in on one cloud, and using all its services:

  • The services you use lock you in. If you develop something on AWS for instance, using Lambda or any other tool, you will have a hard time the day you want to move these applications. To some extent, all the work you have done to liberate your data from your systems of record and drive a true data driven business could be rendered void by going all in with one cloud.
  • Cloud vendors are very good at getting your data in and storing it for cheap, as well as running ephemeral elastic workloads. However, running long lasting compute or getting data out can be extremely costly.
  • The ability to maintain internal governance & security processes is very limited in the cloud.
  • Not all clouds are equal across the globe. If you truly are a global business you must have the ability to choose the cloud vendor that is available in your region.

The hybrid approach: a great option for today

The response to these limitations comes in form of a hybrid cloud. It is the idea of having workloads running on components that can be deployed on demand on premises or in the cloud in the same manner. Frankly, this solves 99% of the problems IT is trying to solve:

  • The services you use are infrastructure agnostic, and therefore allow you to maintain control of your data.
  • You can leverage cloud vendors for ephemeral workloads and on site for long lasting ones.
  • Governance and security are shared across cloud/on-prem.
  • You get to leverage any cloud.

As always, the devil is in the details. The only true way to implement a hybrid cloud is to have the same architecture on prem and in the cloud. This means separating storage and compute, as opposed to having storage and compute coupled like they traditionally are on premises. Theoretically, considering the advances of networking and of container management, morphing traditional architectures to have compute and storage separated should be fine.

The hybrid-er approach: a path towards true agnosticity

Like I mentioned before, I am a firm proponent of the hybrid approach. Nevertheless, I can’t help but imagine a world in 5 to 10 years, where everyone has implemented their hybrid data platform backend and the hot new tech is a new platform that provides a very specific and essential set of capabilities (think complex AI workloads) only possible by coupling compute and storage. Traditional RDBMS weren’t fit for many types of work (e.g. large scale, etc.), but that does not mean they completely disappeared. I think we are going to see the same thing with containerization. It will be essential for many cases, but for others, different resource managers may be more appropriate. Regardless, these are truly exciting times, and I am very excited to be in the midst of this transformation.

Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin

by paul 0 Comments

Introduction & Context

There is a reason why I spent my life studying and working in computer science: understanding a computer’s psychology is usually fairly straightforward. Indeed, when presented with a specific input, computer programs tend to respond in a very predictable way, as opposed to our fellow human beings. Of course, this observation goes out of the window as our algorithms become increasingly complex and capable of learning.

Regardless, as much as I love computer science, I always had a keen interest in human sciences. Personality psychology is a fascinating subject that has seen its ups and downs as any science topic. At the center of personality psychology reside the big five personality traits:

  • Openness to Experience
  • Conscientiousness
  • Extraversion
  • Agreeableness
  • Neuroticism (or Emotional Stability)

This taxonomy was determined by applying statistical models to personality surveys, essentially clustering results of surveys of people describing fellow human beings. As such, these traits are meant to categorize common aspects of personality across human beings without moral connotation. The validity of the model and its predictive power for real life outcomes are of course controversial, and I wouldn’t do it justice here (I most likely already irritated any personality psychologist that read these first few lines).

Recently, multiple machine learning algorithms designed to determine these 5 personality traits from texts have surfaced, including IBM Watson Personality Insights. For this article I chose to use the personality recognizer written by Francois Mairesse, and automate personality detection of New York Times articles using HDF 3.1 and HDP 3.0.

Solution Overview

The solution put in place uses 3 main elements:

  • A NiFi flow to orchestrate data ingestion from API, personality detection and storage to Hive
  • Hive to store the results of the personality detection
  • Zeppelin for visualization of the results

The figure below gives an overview of the solution flow:

More precisely, the solution can be dissected into 5 main steps, which I’m describing in detail below:

  • Step 1: Retrieving data from New York Times API
  • Step 2: Scrape HTML article data
  • Step 3: Run machine learning models for personality detection
  • Step 4: Store results to Hive
  • Step 5: Create simple Zeppelin notebook

Step 1: Retrieving data from New York Times API

Obtaining an API Key

This step is very straight forward. Go to and sign-up for a key:

Note: The New-York Times API is for non-commercial use only. I could have of course used any news API, but I’m not creative.

Configuring InvokeHTTP

The InvokeHTTP is used here with all default parameters, except for the URL. Here are some key configuration items and a screenshot of the Processor configuration:

  • HTTP Method: GET
  • Remote URL:“The New York Times”)&page=0&sort=newest&fl=web_url,snippet,headline,pub_date,document_type,news_desk,byline&api-key=[YOUR_KEY] (This URL selects article from the New York Times as a source, and only selects some of the fields I am interested in: web_url,snippet,headline,pub_date,document_type,news_desk,byline).
  • Content-Type: ${mime.type}
  • Run Schedule: 5 mins (could be set to a little more; I’m not sure how frequently new articles are published)
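
To make the query parameters of the Remote URL easier to read, here is a minimal sketch that assembles them outside of NiFi. The parameter names page, sort, fl and api-key come from the URL above; BASE_URL is a hypothetical placeholder to replace with the endpoint from the API documentation, and the fq filter syntax is an assumption about how the source restriction is expressed.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the endpoint given in the API documentation
BASE_URL = "https://example.invalid/articlesearch.json"

# Fields requested through the `fl` parameter, as listed above
FIELDS = ["web_url", "snippet", "headline", "pub_date",
          "document_type", "news_desk", "byline"]


def build_remote_url(api_key, page=0, base_url=BASE_URL):
    """Assemble the Remote URL value from the query parameters listed above."""
    params = {
        "fq": 'source:("The New York Times")',  # assumption: filter syntax for the source
        "page": page,
        "sort": "newest",
        "fl": ",".join(FIELDS),
        "api-key": api_key,
    }
    # urlencode percent-encodes quotes, parentheses and commas
    return f"{base_url}?{urlencode(params)}"


print(build_remote_url("YOUR_KEY"))
```

In NiFi itself the URL is simply pasted into the processor; the sketch only shows which part of it does what.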

Extracting results from Invoke HTTP response

The API call parameter page=0 returns results 0-9; for this exercise, I’m only interested in the latest article, so I set up an EvaluateJsonPath processor to take care of that, as you can see below:

A few important points here:

  • The destination is set to flowfile-attribute because we are going to re-use these attributes later in the flow
  • I’m expecting the API to change some time after this article is published. To make sure that the JSON paths are good for your version of the API, I recommend testing them with an online JSON path evaluator.
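
For readers who want to check their paths without a NiFi instance, the extraction can be mimicked in a few lines. This is a sketch under assumptions: the response shape ({"response": {"docs": [...]}}), the sample values and the attribute names are illustrative, and should be compared with a real response from your version of the API.

```python
import json

# Assumed response shape; the values are made up for the example
sample_response = json.dumps({
    "response": {
        "docs": [
            {"web_url": "https://example.invalid/latest", "snippet": "Newest article",
             "headline": {"main": "Latest headline"}, "pub_date": "2018-01-02T00:00:00Z"},
            {"web_url": "https://example.invalid/older", "snippet": "Older article",
             "headline": {"main": "Older headline"}, "pub_date": "2018-01-01T00:00:00Z"},
        ]
    }
})


def latest_article_attributes(body):
    """Mimic JSON paths such as $.response.docs[0].web_url on the first (newest) result."""
    doc = json.loads(body)["response"]["docs"][0]
    return {
        "web_url": doc["web_url"],
        "snippet": doc["snippet"],
        "headline": doc["headline"]["main"],
        "pub_date": doc["pub_date"],
    }


attributes = latest_article_attributes(sample_response)
print(attributes)
```

Because the request sorts by newest, index 0 is the latest article; the other nine results on the page are ignored.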

Massage data to avoid conflicts when inserting to Hive

This step is definitely not optimized. The point here is to escape the special characters to avoid errors when inserting into Hive. The only thing I am doing here is removing the ‘ from the snippet as you can see, but it would deserve a second pass I think:
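
To illustrate the difference between what the flow does today and what a second pass could look like, here is a small sketch. Both function names are hypothetical, and the second one assumes that your Hive version accepts backslash escapes inside single-quoted string literals.

```python
def strip_single_quotes(snippet):
    """What the flow does today: drop single quotes so they cannot end a string early."""
    return snippet.replace("'", "")


def escape_for_hive(value):
    """A possible second pass: keep the text, escape backslashes and single quotes."""
    # Backslashes are doubled first so that the ones added for quotes are not doubled again
    return value.replace("\\", "\\\\").replace("'", "\\'")


snippet = "It's the paper's lead story"
print(strip_single_quotes(snippet))
print(escape_for_hive(snippet))
```

Stripping is lossy but safe; escaping keeps the original text at the cost of depending on how the insert statement is built downstream.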

Step 2: Scrape HTML article data

Once we have retrieved the metadata of the article, we must obtain the actual text of the article. For this, I’m using boilerpipe, an open source library for boilerplate removal and fulltext extraction from HTML pages (see reference for details).

Create a simple Java class to call boilerpipe

After downloading the boilerpipe jars (using, use your favorite Java IDE and create this simple class:

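import java.net.MalformedURLException;
import java.net.URL;
import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class extractArticle {
    public static void main (String args[]) throws MalformedURLException, BoilerpipeProcessingException {
        if(args.length == 1) {
            URL url = new URL("" + args[0]);
            // Extract the article text, leaving out navigation and other boilerplate
            String text = ArticleExtractor.INSTANCE.getText(url);
            // Print the text so the calling process can read it from standard output
            System.out.println(text);
        } else {
            System.out.println("Please Specify URL");
        }
    }
}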

Once tested, create an executable jar (in my case extractArticle.jar).

Transfer jars to nifi server

Connect to your nifi server with your nifi user and create the following directory structure:

$ cd /home/nifi
$ mkdir extractArticle
$ cd extractArticle
$ mkdir lib

Transfer the following libraries to ~/extractArticle/lib/ :

  • xerces-2.9.1.jar
  • nekohtml-1.9.13.jar
  • boilerpipe-sources-1.2.0.jar
  • boilerpipe-javadoc-1.2.0.jar
  • boilerpipe-demo-1.2.0.jar
  • boilerpipe-1.2.0.jar
  • extractArticle.jar

Create a simple Unix script to execute HTML scraping

Under ~/extractArticle/ create the script as follows:

$JDK_PATH/bin/java -Xmx512m -classpath $LIBS extractArticle $*
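Both $JDK_PATH and $LIBS have to be defined before that line runs. Here is a minimal sketch of a complete script, assuming the directory layout created above. The default JDK location and the build_classpath helper are placeholders of my own for illustration, not values from the original setup, so adjust them to your environment.

```shell
#!/bin/bash
# Assumed locations (placeholders): override them via the environment if needed
JDK_PATH="${JDK_PATH:-/usr/lib/jvm/default}"
LIB_DIR="${LIB_DIR:-$HOME/extractArticle/lib}"

# Join every jar found in a directory into a colon-separated classpath
build_classpath() {
  local dir="$1" cp="" jar
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue      # skip the unexpanded pattern when no jar exists
    cp="${cp:+$cp:}$jar"
  done
  printf '%s' "$cp"
}

LIBS="$(build_classpath "$LIB_DIR")"

# Only call Java when a URL was actually passed as an argument
if [ "$#" -ge 1 ]; then
  "$JDK_PATH/bin/java" -Xmx512m -classpath "$LIBS" extractArticle "$@"
fi
```

Building the classpath from the folder content means a new jar only has to be dropped into lib/, with no script change.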

Configure ExecuteStreamCommand processor

Configure the processor to pass the URL as an argument and send its output stream to the next processor, as follows:

Step 3: Run machine learning models for personality detection

Setup PersonalityRecognizer on NiFi server

Just as for boilerpipe, we’re going to use an ExecuteStreamCommand processor. To prepare the files, run the following commands:

$ cd /home/nifi
$ wget recognizer-1.0.3.tar.gz
$ tar -xvf recognizer-1.0.3.tar.gz
$ cd PersonalityRecognizer
$ mkdir texts

Modify the file as follows:

# Configuration File of the Personality Recognizer
# All variables should be modified according to your
# directory structure
# Warning: for Windows paths, backslashes need to be
# doubled, e.g. c:\\Program Files\\Recognizer
# Root directory of the application
appDir = /home/nifi/PersonalityRecognizer
# Path to the LIWC dictionary file (LIWC.CAT)
liwcCatFile = ./lib/LIWC.CAT
# Path to the MRC Psycholinguistic Database file (mrc2.dct)
mrcPath = ./ext/mrc2.dct

Modify the script PersonalityRecognizer as follows:

#! /bin/bash -
# ----------------------------------
$JDK_PATH/bin/java -Xmx512m -classpath $LIBS recognizer.PersonalityRecognizer $*

Finally, create a wrapper script that takes the latest file from the texts folder, runs PersonalityRecognizer on it, and outputs only the results in JSON format:

#!/bin/bash
# Pick the most recently written article in texts/
text=$(ls -t texts/ | head -1)
# Run the recognizer and keep its raw output in a temporary file
./PersonalityRecognizer -i "./texts/$text" > tmp.txt
# Extract the numeric score reported for each of the Big Five traits
extraversion=$(grep extraversion tmp.txt | grep -Eo '[+-]?[0-9]+([.][0-9]+)?')
emotional_stability=$(grep emotional tmp.txt | grep -Eo '[+-]?[0-9]+([.][0-9]+)?')
agreeableness=$(grep agreeableness tmp.txt | grep -Eo '[+-]?[0-9]+([.][0-9]+)?')
conscientiousness=$(grep conscientiousness tmp.txt | grep -Eo '[+-]?[0-9]+([.][0-9]+)?')
openness_to_experience=$(grep openness tmp.txt | grep -Eo '[+-]?[0-9]+([.][0-9]+)?')
# Emit a single JSON object, keyed by the article URL passed as first argument
json_output="{\"web_url\" : \"$1\", \"extraversion\" : \"$extraversion\",\"emotional_stability\" : \"$emotional_stability\",\"agreeableness\" : \"$agreeableness\",\"conscientiousness\" : \"$conscientiousness\",\"openness_to_experience\" : \"$openness_to_experience\"}"
echo "$json_output"
# Clean up so that the next run only sees its own article
rm tmp.txt texts/*
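The score extraction is the fragile part of this wrapper: if a matching line carries more than one number, every one of them ends up in the variable. Here is a minimal sketch of a slightly more defensive version. The extract_score helper and the sample output lines are hypothetical and only there for illustration; the real recognizer output may be formatted differently.

```shell
#!/bin/bash
# Return the first number found on the first line mentioning the trait keyword
extract_score() {
  local keyword="$1" file="$2"
  grep -i "$keyword" "$file" | head -1 | grep -Eo '[+-]?[0-9]+([.][0-9]+)?' | head -1
}

# Hypothetical sample output, NOT the actual PersonalityRecognizer format
sample="$(mktemp)"
cat > "$sample" <<'EOF'
Extraversion score: 4.25
Emotional stability score: 3.10
EOF

# Build the JSON with printf to keep the quoting readable
printf '{"extraversion" : "%s", "emotional_stability" : "%s"}\n' \
  "$(extract_score extraversion "$sample")" \
  "$(extract_score emotional "$sample")"
rm -f "$sample"
```

With this sample, the script prints {"extraversion" : "4.25", "emotional_stability" : "3.10"}.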

Configure PutFile processor to create article file

This processor takes the output stream of the HTML scraping to create a file, under the appropriate folder, as shown below:

Configure the ExecuteStreamCommand Processor

Just as for HTML scraping, configure the processor to pass the URL as an argument and send its output stream to the next processor, as follows:

Extract attributes from JSON output

Using EvaluateJsonPath, extract the results of the PersonalityRecognizer into attributes:

Step 4: Store results to Hive

Create Hive DB and tables

Because we don’t control whether we receive the same article twice from the New York Times API, we need to make sure that we don’t insert the same data twice into Hive (i.e. we need to upsert data into Hive). An upsert can be implemented with two tables and the merge command.

Therefore, connect to your Hive server and create one database and two tables as follows:

CREATE DATABASE personality_detection;
use personality_detection;
-- Target table of the merge; merge needs an ACID table, hence the transactional property
CREATE TABLE text_evaluation (
web_url String,
snippet String,
byline String,
pub_date date,
headline String,
document_type String,
news_desk String,
last_updated String,
extraversion decimal(10,4),
emotional_stability decimal(10,4),
agreeableness decimal(10,4),
conscientiousness decimal(10,4),
openness_to_experience decimal(10,4)
)
clustered by (web_url) into 2 buckets stored as orc
tblproperties ("transactional" = "true");

CREATE TABLE all_updates (
web_url String,
snippet String,
byline String,
pub_date date,
headline String,
document_type String,
news_desk String,
last_updated String,
extraversion decimal(10,4),
emotional_stability decimal(10,4),
agreeableness decimal(10,4),
conscientiousness decimal(10,4),
openness_to_experience decimal(10,4)
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
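To make the intended upsert behavior concrete, here is a small sketch outside Hive, in shell. Rows are tab-separated, the first field plays the role of web_url, a later row with an already known key replaces the earlier one (update), and a row with a new key is appended (insert). The upsert_by_key helper and the sample rows are hypothetical and only illustrate the semantics; the flow itself relies on the Hive merge command.

```shell
#!/bin/bash
# Keep exactly one row per key (first tab-separated field):
# a repeated key overwrites the stored row, a new key is appended.
upsert_by_key() {
  awk -F'\t' '
    !($1 in row) { order[++n] = $1 }   # first time this key is seen: remember its position
    { row[$1] = $0 }                   # always keep the most recent row for the key
    END { for (i = 1; i <= n; i++) print row[order[i]] }
  ' "$@"
}

# Hypothetical sample: the same URL arrives twice with different scores
printf 'url-a\t3.1\nurl-b\t4.0\nurl-a\t3.7\n' | upsert_by_key
```

The sample prints two rows: url-a with 3.7 (the later value wins) followed by url-b with 4.0.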

Create HiveQL script

Using a ReplaceText processor, create the appropriate HiveQL command to be executed to upsert data into your tables from the data collected in the flow.

Code for Replacement Value (note that I remove the timestamp from the pub_date here, because I’m storing it as a date):

use personality_detection;

insert into all_updates values('${web_url}','${snippet}','${byline}','${pub_date:substring(0,10)}','${headline}','${document_type}','${news_desk}','${now()}','${extraversion}','${emotional_stability}','${agreeableness}','${conscientiousness}','${openness_to_experience}');

merge into text_evaluation
using (select distinct web_url, snippet, byline, pub_date, headline, document_type, news_desk, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience from all_updates) all_updates on text_evaluation.web_url = all_updates.web_url
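when matched then update set
snippet = all_updates.snippet, byline = all_updates.byline, pub_date = all_updates.pub_date, headline = all_updates.headline,
document_type = all_updates.document_type, news_desk = all_updates.news_desk, last_updated = from_unixtime(unix_timestamp()),
extraversion = all_updates.extraversion, emotional_stability = all_updates.emotional_stability, agreeableness = all_updates.agreeableness,
conscientiousness = all_updates.conscientiousness, openness_to_experience = all_updates.openness_to_experience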
when not matched then insert
values(all_updates.web_url,all_updates.snippet, all_updates.byline, all_updates.pub_date, all_updates.headline, all_updates.document_type,
all_updates.news_desk, from_unixtime(unix_timestamp()), all_updates.extraversion, all_updates.emotional_stability, all_updates.agreeableness, all_updates.conscientiousness, all_updates.openness_to_experience);

truncate table all_updates;

Processor Overview:

Upsert data to hive

Finally, configure a simple PutHiveQL processor as follows (make sure you have configured your HiveConnectionPool beforehand):

Step 5: Create simple Zeppelin notebook

Lastly, after running the NiFi flow for a while, create a simple Zeppelin notebook to show your results. This notebook will use the JDBC interpreter for Hive and run the following query:

select byline, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience from personality_detection.text_evaluation limit 10

Then, you can play with Zeppelin visualizations to display the average of the big 5 by byline:


While very simple, this exercise is a good starting point for on-the-wire personality recognition. More importantly, in an age of information overload or even misinformation, the ability to classify the psychology of a text on the fly can be extremely useful. I do plan on tinkering with this project, improving performance, optimizing models and ingesting more data, so stay tuned!

Known possible improvements

  • Better control of data retrieval to avoid duplicate flows (depends on API)
  • Better special character replacement for HiveQL command
  • More elegant way to execute data scraping and run personality recognition java classes
  • Additional scraping from article text to remove title, byline, and other unnecessary information from boilerpipe output
  • More thorough testing of different personality recognizer models (and use other/more recent libraries)


  • Big Five personality traits:
  • The Big-Five Trait Taxonomy: History, Measurement, and Theoretical Perspectives:
  • IBM Watson Personality Insights:
  • Personality Recognizer by Francois Mairesse:
  • NYT API sign up:
  • NYT Article Search readme:
  • JSON Path Evaluator:
  • Boilerpipe jar download:
  • Boilerpipe github:

Making data analytics operational

by paul
I refuse to use the term-that-should-not-be-used when describing stale data lakes.

After 6 months of silence, I am finally taking the time to get back behind my keyboard. I would like to say that I used these 6 months to reflect upon my writing and the current data market, and came out of this hiatus a better, more informed and well-versed person, but that would be a lie. And, despite the current pace at which the social fabric of our society is moving towards considering lies as acceptable and moral, I prefer not to. I don’t really know why I stopped writing for a bit, but it was most likely because I had nothing to say. So today, brace yourselves for a semi-informed opinion piece on data analytics, because I actually changed my opinion a bit on it through real-life experience.

My opinion then: analytics are a fringe use case of data management

In my article “Why data driven companies should stop investing in data analytics” I argued for the death of dashboards. I still stand by that point of view, as too often Business Intelligence (BI) platforms are an end point of the data life cycle. Countless data replication processes, ETL jobs, buses and other GoldenGate-like tools push data into data warehouses or data lakes, where data scientists pat themselves on the back by showing dashboards that could potentially contain information to be integrated into the current business processes. Quick aside and nugget of knowledge from my PhD friends: if your title contains “science” in it, you’re not a real scientist. Shots fired.

Moving on: while I still stand knee deep in stale data lakes despite being on my soapbox, there is one thing I did not consider enough: Machine Learning algorithms. There are two main reasons why the existence of machine learning algorithms as they are implemented now changes my opinion. First and foremost, the problem I describe, of BI being the end of the data chain with its outcome driven only by humans trying to improve business processes, can be alleviated with analytics automation via these algorithms (to some extent at the moment, but more and more as the technology progresses). Secondly, ML needs access to data lakes, not operational big data. The algorithms need to be able to train on any data set, looking at data from any angle, in order to make usable predictions.

My opinion now: analytics need to be better integrated in the data life cycle

Consequently, here is my proposal to the data world. We need to envision an architecture where data warehouses are not the Raiders of the Lost Ark type but more the Amazon type: they need to be an inherent part of the data life cycle. Drilling a bit further into the architecture I contemplate, your data as a service (DaaS) layer would feed current data sets to your data warehouse, where ML would run asynchronously, and the outcome of these analytics would then feed back into the rules of data manipulation embedded in your DaaS layer. If you manage a constant feedback loop of this kind, the end user application served by your DaaS will constantly get fed more accurate and relevant data, which in turn can enable the next generation of platforms: Information as a Service. But that’s for another day.

What I Don’t Talk About When I Don’t Talk About Running

by paul
The Track. My old nemesis.

Today’s post will differ quite a bit from my previous ones. While I enjoy discussing the current state of affairs in the fascinating world of data management, I would like to take the opportunity to write a very personal piece. So very personal in fact, that I have hesitated a long time to publish it. To put things into context, at this instant in time in my life, I strive to be a very logical person. One of the things I loathe the most in this world is the proportion of emotion-driven decisions that occur. It’s been a soapbox of mine for quite a while now, upon which I stand and promote logic and critical thinking, comfortably sitting (or I suppose in this case, standing) in my echo chamber of like-minded skeptics. That being said, there are really only two domains in my life for which I do not strive to apply that very rational framework and let my emotions get the best of me: my family, and running (obviously to a greater extent with regard to my family, I am not a monster). Here is the thing about very emotional topics: I do not feel comfortable talking about them, which is why I so seldom share anything about my love for my family; and that’s not going to start in this blog post either, let me make things clear. No, today, I’m going to talk about my relationship with running. Once, then I will go back to not talking about it, except on rare occasions and with selected people.

I do not run to be healthy

Before I start pouring out my logorrhea of deeply intimate and potentially completely uninteresting facts about the place that running has in my life, let me do a little bit of housecleaning. Yes, exercise is good for you. The list of benefits is never ending, from better longevity and easier weight management to improved cognitive functions, stress reduction and much more. This is NOT the running I’m going to talk about here. As a by-product of running I may be healthier, but that’s really not the goal or the attitude I have towards running. I don’t need an end goal to run. I run because this is one of the things I love the most in life.

I am in a relationship with running

Indeed, while I recognize that I’m still a “new” runner, having only started seriously about 5 years ago, I can tell you that running is part of my life. As soon as I started, I sincerely fell in love with it. I’m choosing the word love carefully, meaning that I am emotionally involved with running. The joy and pain I feel at getting the opportunity to spend time running, not even running itself, just the idea that I can go on a run, are extremely potent for me. To give you a recent example, about a week ago, I got a free bib to run a race. The day of the race, while warming up, I realized I was sick, something completely out of my control. In any other domain, things that I cannot control do not affect me greatly. But that day, as I realized I wasn’t going to be able to run hard, and that it made more sense for me not to race, I sat down behind a tree and cried. In hindsight, this reaction is completely disproportionate to the situation. The bib was free, I signed up less than a week in advance, I did not really prepare for the race, I could not control being sick or not, my family still loves me, the sun is going to rise tomorrow, etc. I realize that this type of emotional response is completely alienating to my entourage, from people who don’t care about my last workout splits to my wife having to deal with my training schedule and nervous breakdowns. However, this relationship I am in is not unhealthy, not like it was when I started. I know when to take time off. Reluctantly sure, but I take it nonetheless. I even questioned the very nature of it, to see if the emotions I feel about running are not a coping mechanism for deeper underlying issues, but I ultimately arrived at the conclusion that they aren’t. I stepped into despair from the pit of dread, and as described by Kierkegaard I emerged trying to know myself. And the person that I am is a very lucky man, who gets to really be passionate about running. If running were taken away from me, it would really suck, but I would still be me.

I am not a social runner

Don’t think for a minute that I consider my case exceptional in any way. I’m sure people feel very passionate about many things, running included. Here is one thing that I am not though: a social runner. The community of runners is perhaps one of the most amazing things I get to witness on a regular basis. The camaraderie and motivation that groups offer, from people that are just starting out to people that are constantly striving to better themselves, are amazing, and reach far beyond the act of running itself. Running promotes charity, inclusiveness, and self-worth. When you run in a group, you are always welcome, no matter where you come from, how fast you’re going, how long you’ve been running or if you will ever go and run with this group again. It took me a while to admit this but I, however, have very little need for belonging to a group. Note that this is how I currently feel, and that it may be subject to change in the future, but I think it is a mature position that I hold, which is why I think it is fair for me to put it in black and white. I enjoy running with people I like (and I like most of the people I meet while running), but I do not look to be part of a community (running group or beyond). I do not seek out runners to go out for a run. I do not feel like waving a flag saying to the world that I am a runner. I do not wear running clothes outside of a running-related event. I do not post on a regular basis on running groups. I do not like to talk about running all the time while I’m running. I enjoy a little bit of it, but only a little. I confess that I have unfollowed many groups and friends that constantly post about running on Facebook. And I think that I figured out why. First, I have psychopathic tendencies and therefore have never really felt the appeal of community. I don’t care about identifying myself one way or the other. Yes, I am a runner, as a matter of fact I am very passionate about running, but I don’t need group recognition. Second, to some extent, I think I am a jealous lover. I do not want to share running with anyone else. Running is very intimate to me, so much so that sometimes I hear songs describing a relationship and I feel they are describing my relationship with running. So you will understand why it is so hard for me to see pictures of people having an orgy of good feelings during a race I just raced, and for which I am not completely satisfied with my results.

Running is my Sisyphus’ boulder

Which brings me to my next point: I am a competitive person. And the person I love to beat the most is myself. Running is the best avenue for me to compete with myself. And I thoroughly enjoy the process. Training every day to gain little edges of fitness is the most gratifying process in which I get to engage. The process is complex and dependent on many variables, which is why I am lucky to have a coach who is almost as passionate and knowledgeable about coaching as I am about running. This by no means ensures that I am always making the optimum choice of training every day, but leaning on people’s expertise is a good rule of thumb if you want to improve at a particular skill set. The great thing about self-competition is that you can always find ways to improve, whether it is getting better at a certain type of workout/distance or acknowledging the person that you are and knowing your limitations. More than improvement, training to be a better runner teaches me to have short and long term goals, all the while giving me a constant in my ever changing life. No matter where I am, I know that I can train. Which is why days like today, where I am unfortunately sick and it is advisable not to run at all, are very hard. Regardless, I know that the task of training will never finish, and that makes me pretty happy, because that means I get to run more!

Final thoughts

As I mentioned at the beginning of this post, the strings of words that are part of this article are a departure from my normal writing. They may in fact not even be coherent. Why publish them at all? Partly to experiment and see people’s reaction to it, but mostly to silence the voice inside me that begs me to write about the subject; just once, so that you understand how deeply and sincerely attached I am to running, and how much it humbles me. Running is so important to me, I try to advertise it as little as possible.