paul vidal - thoughts about tech and more

Tag Archives

4 Articles

Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin

by paul 0 Comments
Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin

Introduction & Context

There is a reason why I spent my life studying and working in computer science: understanding a computer’s psychology is usually fairly straight forward. Indeed, when presented with a specific input, computer programs tend to respond in a very predictable way, as opposed to our fellow human beings. Of course, this observation goes out of the window as our algorithms become increasingly more complex and capable of learning.

Regardless, as much as I love computer science, I always had a keen interest in human sciences. Personality psychology is a fascinating subject that has seen its ups and downs as any science topic. At the center of personality psychology reside the big five personality traits:

  • Openness to Experience
  • Conscientiousness
  • Extraversion
  • Agreeableness
  • Neuroticism (or Emotional Stability)

This taxonomy was determined by applying statistical models to personality surveys, essentially clustering results of surveys of people describing fellow human beings. As such, these traits are meant to categorize common aspect of personality across human beings without moral connotation. The validity of the model and its predictability for real life outcomes is of course controversial, and I wouldn’t make it justice here (I most likely already irritated any personality psychologist that read these first few lines).

Recently, multiple machine learning algorithms have been designed to determine these 5 personality traits from texts have surfaced, including IBM Watson personality insights. For this article I chose to use the personality recognizer written by Francois Mairesse, and automate personality detection of New York Times articles using HDF 3.1 and HDP 3.0.

Solution Overview

The solution put in place uses 3 main elements:

  • A NiFi flow to orchestrate data ingestion from API, personality detection and storage to Hive
  • Hive to store the results of the personality detection
  • Zeppelin for visualization of the results

The figure below gives an overview of the solution flow:

More precisely, the solution can be dissected in 5 main steps, that I’m describing in details below:

  • Step 1: Retrieving data from New York Times API
  • Step 2: Scrape HTML article data
  • Step 3: Run machine learning models for personality detection
  • Step 4: Store results to Hive
  • Step 5: Create simple Zeppelin notebook

Step 1: Retrieving data from New York Times API

Obtaining an API Key

This step is very straight forward. Go to and sign-up for a key:

Note: The New-York Times API is for non-commercial use only. I could have of course used any news API, but I’m not creative.

Configuring InvokeHTTP

The InvokeHTTP is used here with all default parameters, except for the URL. Here are some key configuration items and a screenshot of the Processor configuration:

  • HTTP Method: GET
  • Remote URL:“The New York Times”)&page=0&sort=newest&fl=web_url,snippet,headline,pub_date,document_type,news_desk,byline&api-key=[YOUR_KEY] (This URL selects article from the New York Times as a source, and only selects some of the fields I am interested in: web_url,snippet,headline,pub_date,document_type,news_desk,byline).
  • Content-Type: ${mime.type}
  • Run Schedule: 5 mins (could be set to a little more, I’m not sure the frequency at which new articles are published)

Extracting results from Invoke HTTP response

The API call parameter page=0 returns results 0-9; for this exercise, I’m only interested in the latest article, so I setup an evaluateJSONpath processor take care of that, as you can see below:

A few important points here:

  • The destination is set to flowfile-attribute because we are going to re-use these attributes later in the flow
  • I’m expecting the API to change after some time this article is published. Just to make sure that the JSON paths are good for your version of the API, I recommend JSON paths evaluators online.

Massage data to avoid conflicts when inserting to Hive

This step is definitely not optimized. The point here is to escape the special characters to avoid errors when inserting into hive. The only thing I am doing here is removing the ‘ from the snippet as you can see, but it would deserve a second path I think:

Step 2: Scrape HTML article data

Once we retrieved the meta data of the article, we must obtain the actual text of the article. For this, I’m using boilerpipe, an open source boilerplate removal and fulltext extraction from HTML pages (see reference for details).

Create a simple Java class to call boilerplate

After downloading the boilerpipe jars (using, use your favorite Java IDE and create this simple class:

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class extractArticle {
public static void main (String args[]) throws MalformedURLException, BoilerpipeProcessingException {
if(args.length == 1) {
URL url = new URL("" + args[0]);
String text = ArticleExtractor.INSTANCE.getText(url);
} else {
System.out.println("Please Specify URL");

Once tested, create an executable jar (in my case extractArticle.jar).

Transfer jars to nifi server

Connect to your nifi server with your nifi user and create the following directory structure:

$ cd /home/nifi
$ mkdir extractArticle
$ cd extractArticle
$ mkdir lib

Transfer the following libraries to ~/extractArticle/lib/ :

  • xerces-2.9.1.jar
  • nekohtml-1.9.13.jar
  • boilerpipe-sources-1.2.0.jar
  • boilerpipe-javadoc-1.2.0.jar
  • boilerpipe-demo-1.2.0.jar
  • boilerpipe-1.2.0.jar
  • extractArticle.jar

Create a simple Unix script to execute HTML scraping

Under ~/extractArticle/ create the script as follows:

$JDK_PATH/bin/java -Xmx512m -classpath $LIBS extractArticle $*

Configure ExecuteStreamCommand processor

Configure the processor to pass the URL in argument and outputting the output stream to the next processor, as follows:

Step 3: Run machine learning models for personality detection

Setup PersonalityRecognizer on NiFi server

Just as for boilerpipe, we’re going to run an ExecuteStream command. To prepare the files, run the following commands:

$ cd /home/nifi
$ wget recognizer-1.0.3.tar.gz
$ tar -xvf recognizer-1.0.3.tar.gz
$ cd PersonalityRecognizer
$ mkdir texts

Modify the file as follows:

# Configuration File of the Personality Recognizer
# All variables should be modified according to your
# directory structure
# Warning: for Windows paths, backslashes need to be
# doubled, e.g. c:\\Program Files\\Recognizer
# Root directory of the application
appDir = /home/nifi/PersonalityRecognizer
# Path to the LIWC dictionary file (LIWC.CAT)
liwcCatFile = ./lib/LIWC.CAT
# Path to the MRC Psycholinguistic Database file (mrc2.dct)
mrcPath = ./ext/mrc2.dct

Modify the script PersonalityRecognizer as follows:

#! /bin/bash -
# ----------------------------------
$JDK_PATH/bin/java -Xmx512m -classpath $LIBS recognizer.PersonalityRecognizer $*

Finally, create a wrapper script that, using the latest file from the folder texts runs PersonalityRecognizer and outputs only the results in a json format:

text=`ls -t texts/ | head -1`
./PersonalityRecognizer -i ./texts/$text > tmp.txt
extraversion=`cat tmp.txt | grep extraversion | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
emotional_stability=`cat tmp.txt | grep emotional | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
agreeableness=`cat tmp.txt | grep agreeableness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
conscientiousness=`cat tmp.txt | grep conscientiousness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
openness_to_experience=`cat tmp.txt | grep openness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
json_output="{\"web_url\" : \"$1\", \"extraversion\" : \"$extraversion\",\"emotional_stability\" : \"$emotional_stability\",\"agreeableness\" : \"$agreeableness\",\"conscientiousness\" : \"$conscientiousness\",\"openness_to_experience\" : \"$openness_to_experience\"}"
echo $json_output
rm tmp.txt texts/*

Configure PutFile processor to create article file

This processor takes the output stream of the HTML scraping to create a file, under the appropriate folder, as shown below:

Configure the ExecuteStreamCommand Processor

Just as for HTML scraping, configure the processor to pass the URL in argument and outputting the output stream to the next processor, as follows:

Extract attributes from JSON output

Using EvaluateJSONPath, retrieve the results of the PersonalityRecognizer to attributes:

Step 4: Store results to Hive

Create Hive DB and tables

Because we don’t control wether we receive the same article twice from the New York Times API, we need to make sure that we don’t insert the same data twice into Hive (i.e. upsert data into Hive). Upsert can be implemented by two tables and the merge command.

Therefore connect to your hive server and create one database and two tables as follows:

CREATE DATABASE personality_detection;
use personality_detection;
CREATE TABLE text_evaluation (
web_url String,
snippet String,
byline String,
pub_date date,
headline String,
document_type String,
news_desk String,
last_updated String,
extraversion decimal(10,4),
emotional_stability decimal(10,4),
agreeableness decimal(10,4),
conscientiousness decimal(10,4),
openness_to_experience decimal(10,4)
clustered by (web_url) into 2 buckets stored as orc

CREATE TABLE all_updates (
web_url String,
snippet String,
byline String,
pub_date date,
headline String,
document_type String,
news_desk String,
last_updated String,
extraversion decimal(10,4),
emotional_stability decimal(10,4),
agreeableness decimal(10,4),
conscientiousness decimal(10,4),
openness_to_experience decimal(10,4)
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

Create HiveQL script

Using a ReplaceText processor, create the appropriate HiveQL command to be executed to upsert data into your tables from the data collected in the flow.

Code for Replacement Value (note that I remove the timestamp from the pub_date here, because I’m storing it as a date):

use personality_detection;

insert into all_updates values('${web_url}','${snippet}','${byline}','${pub_date:substring(0,10)}','${headline}','${document_type}','${news_desk}','${now()}','${extraversion}','${emotional_stability}','${agreeableness}','${conscientiousness}','${openness_to_experience}');

merge into text_evaluation
using (select distinct web_url, snippet, byline, pub_date, headline, document_type, news_desk, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience from all_updates) all_updates on text_evaluation.web_url = all_updates.web_url
when matched then update set
when not matched then insert
values(all_updates.web_url,all_updates.snippet, all_updates.byline, all_updates.pub_date, all_updates.headline, all_updates.document_type,
all_updates.news_desk, from_unixtime(unix_timestamp()), all_updates.extraversion, all_updates.emotional_stability, all_updates.agreeableness, all_updates.conscientiousness, all_updates.openness_to_experience);

truncate table all_updates;

Processor Overview:

Upsert data to hive

Finally, configure a simple PutHiveQL processor as follows (make sure you configured your HiveConnectionPool beforehand):

Step 5: Create simple Zeppelin notebook

Lastly, after running the NiFi flow for a while, create a simple Zeppelin notebook to show your result. This notebook will use the jdbc interpreter for Hive and run the following query:

select byline, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience from personality_detection.text_evaluation limit 10

Then, you can play with Zeppelin visualizations to display the average of the big 5 by byline:


While being a very simple, this exercise is a good starting point for on-the-wire personality recognition. More importantly, in an age of information overload or even misinformation, having the ability to classifying the psychology of a text on the fly can be extremely useful. I do plan on tinkering with this project, improving performance, optimizing models and ingesting more data, so stay tuned!

Known possible improvements

  • Better control of data retrieval to avoid duplicate flows (depends on API)
  • Better special character replacement for HiveQL command
  • More elegant way to execute data scraping and run personality recognition java classes
  • Additional scraping from article text to remove title, byline, and other unnecessary information from boilerpipe output
  • More thorough testing of different personality recognizer models (and use other/more recent libraries)


  • Big Five personality traits:
  • The Big-Five Trait Taxonomy: History, Measurement, and Theoretical Perspectives:
  • IBM Watson Personality Insights:
  • Personality Recognizer by Francois Mairesse:
  • NYT API sign up:
  • NYT Article Search readme:
  • JSON Path Evaluator:
  • Boilerpipe jar download:
  • Boilerpipe github:

Making data analytics operational

by paul 0 Comments
Making data analytics operational
I refuse to use the term-that-should-not-be-used when describing stale data lakes.

After 6 months of silence, I finally take the time to get back behind my keyboard. I would like to say that I used these 6 months to reflect upon my writing, the current data market and came out of this hiatus a better, more informed and well versed person, but that would be a lie. And, despite the current pace at which the social fabric of our society is moving towards considering lies as acceptable and moral, I prefer not to. I don’t really know why I stopped writing for a bit, but most likely because I had nothing to say. So today, brace yourselves for a semi-informed opinion piece on data analytics, because I actually changed my opinion a bit on it through real-life experience.

My opinion then: analytics are a fringe use case of data management

In my article “Why data driven companies should stop investing in data analytics” I argued for the death of dashboards. I still stand by that point of view, as too often the Business Intelligence (BI) platforms are an end point of the data life cycle. Countless data replication processes, ETL, busses and other goldengate push data into data warehouses or data lakes where data scientists pat themselves on the back by showing dashboards that could potentially contain information to be integrated in the current business processes. Quick aside and nugget of knowledge from my PhD friends: if your title contains “science” in it, you’re not a real scientist. Shots fired. Moving on, while I still stand knee deep in stale data lakes despite being on my soapbox, there is one thing I did not consider enough: Machine Learning algorithms. There are two main reasons why the existence of machine learning algorithms as they are implemented now changes my opinion. First and foremost, the problem I describe of BI being the end of the data chain and its outcome only being driven by humans trying to improve business process can be alleviated with analytics automation via these algorithms (to some extent at the moment, but will be more and more true as the technology progresses). Secondly, ML needs access to data lakes, not operational big data. The algorithms need to be able to train using any data sets, looking at data from any angle in order to make usable predictions.

My opinion now: analytics need to be better integrated in the data life cycle

Consequently, here my proposal to the data world. We need to envision an architecture where data warehouses are not the raiders of the lost ark type but more the amazon type: they need to be an inherent part of the data life cycle. Drilling a bit further in the architecture I contemplate, your data as a service layer would feed current data sets to your data warehouse, where ML would run asynchronously, but the outcome of these analytics would then feed back the rules of data manipulation embedded in your DaaS layer. If you manage a constant feedback loop of the kind, your end user application served by your DaaS will constantly get fed more accurate and relevant data, which in turn can enable the next generation of platforms: Information as a Service. But that’s for another day.

The future of Data is Augmentation, not Disruption

by paul 1 Comment
The future of Data is Augmentation, not Disruption
I'm disrupting the light bulb market by enabling wireless. I call it "Photoshop"

I spent last week enjoying the Cassandra Summit, so much that I did not take the time to write a blog post. I had a few ideas but I chose quality over quantity. That being said, something interesting happened at the summit: we coined the term “augmentation” for one of my companies key go to market use case, instead of data layer modernization or digitalization. even got the opportunity to try both terms to the different people visiting our booth. In this extremely small sample, people really tended to have a much better degree of understanding when I used the word augmentation, which got me thinking. I even read a very interesting article from Tom O’Reilly called: Don’t Replace People. Augment Them. in which he argues against technology fully replacing people. Could this concept of augmentation be applied in a broader scale to understand our data technology trends? Maybe, at least that’s what I’m going to try to lay out in this article.

Technological progress relies on augmentation.

That’s the first thing that struck me when I pondered on augmentation in our world, and more specifically when it comes to software. At the exception of very few, the platforms, apps and tool that we use are all based on augmentation of existing basic functions: Amazon? Augmentation of store using technology. Uber? Augmentation of taxis. Chatbots? Augmentation of chat clients. Slack? Augmentation of email + chats. Distributed/Cloud applications? Augmentation of legacy applications. To some extent even Google is an augmentation of a manual filing system. I would admit listing examples that confirm an idea that I already had is close to a logical fallacy, so I tried to find counter examples, i.e. software solutions that try to introduce completely new concepts, but could not think of any. Of course we could argue over semantics in defining what constitute true innovation versus augmentation of an existing technology, but ultimately I think it is fair to say that the most successful technologies are augmenting our experience rather than being completely disruptive, despite what most of my field would argue. Therefore, augmentation must be at least considered as part of the future of any software industry, such as the Big Data industry.

Augmentation is better than transformation

Human nature needs comfort, that’s why most of us prefer augmentation over disruption. By disruption, I’m talking about transforming or replacing the existing systems, not adding features: selling unpaired socks over internet is not disrupting the sock industry, despite what the TED talks would like me to believe. Seriously, when you have existing technologies, as every company does, a replacement/transformation is a hard pill to swallow. Loss of investment, knowledge, process, etc. It is especially risky and complex when talking about data layer transformation, as I argued before in this very blog. So when given a choice, augmenting existing data layers is an obvious choice for risk-advert IT organizations.

Augmentation drives innovation

Perhaps the most convincing argument towards acknowledging that augmentation is the future of data is the analysis of the most innovative big data software solutions: machine learning, neural networks and all of these extremely complex systems which behaviors are almost impossible to predict, even for experts. These systems are design to augment their own capabilities, instead of having a set of deterministic rules to follow. Indeed, these systems are designed to approach the capabilities of complex biological systems and therefore incorporate their “messiness”. We can’t think of big data systems using physics thinking (i.e. here is an algorithm, here is a set of parameters, this is the result expected), but we should rather rely to biology thinking (i.e. what is the results I get if I input this parameter). A great example of this type of thinking is Netflix’s Chaos Monkey, a service running on AWS to simulate failures and understand the behavior of their architecture. Self-augmentation is the principle upon which the technologies of the future are built. We understand the algorithms we input but not necessarily the outcome, which can have unintended consequences sometimes (see: Microsoft Tay), but ultimately is a better pathway to intelligent technologies. I’m a control freak, and not being able to understand a system end to end drives me nuts, but I’m willing to relinquish my sanity for the good of Artificial Intelligence.


With software Augmentation being part of our everyday life, a safer and easier way to add features to existing data layer, and the core concept of machine learning, I think it is fair to say that it is the future of Data. Did I convince myself? Yes, which is good because my opinion is usually my first go to when it comes to figuring out what I think. Seriously though, what do you think? As always, I long to learn more and listen to everyone’s opinion!

Essential resources on Machine Learning

by paul 0 Comments
Essential resources on Machine Learning
"Maybe you should be spending some time learning instead of relying on machines" - Some hipster

I’ve always been fascinated by Artificial Intelligence in science fiction. I’m so lucky to live in an era that is seeing the birth of a new kind of Artificial Intelligence, enabled by Big Data, advancement in super computers and Machine Learning. I’m even working in a field that gets to implement that kind of technologies, which continues to excite and fascinate me. Machine learning is today moving out of the realm of pure research to real-world applicability. But like any new cutting-edge technology, we need to beware of products untruthfully using the word Machine Learning in their marketing message or Machine Learning being the cure for all diseases. Therefore, I think it’s important that we spend some time understanding what Machine Learning is, as well as what it does and can do in the industry. Since I’m not an expert on Machine Learning (… yet), I spent some time gathering resources to enhance your Human Learning about Machine Learning. Happy reading!


  • First things first, wikipedia: link
  • An excellent visual introduction on Machine Learning from R2D3: link
  • An early draft of a Machine Learning book from Stanford University: link
  • Introduction to Machine Learning from Cambridge University: link
  • Technical courses

  • In-depth videos on Machine Learning, from Data School: link
  • What is Machine Learning, from Data Camp: link
  • Introduction to Machine Learning, from Udacity: link
  • Machine Learning, from Coursera: link
  • Machine Learning in the market

  • Gartner 2015 Hype Cycle: Big Data is Out, Machine Learning is in: link
  • Gartner 2016 top 10 trends: link
  • Machine Learning, What it is & why it matters, from SAS: link
  • Marketplace for Machine Learning Algorithm, Algorithmia: link
  • The future of Machine Learning, from David Karger on Quora: link