paul vidal - thoughts about tech and more


A Hybrid approach to the Hybrid Cloud


Unless you have been hiding under a rock, or maybe spending too much time watching the clouds pass by, you will have noticed that over the last couple of years the cloud has become a major part of enterprise IT infrastructure. As with everything in IT infrastructure, trends are followed with and without good reason. As I’ve argued before, outsourcing your non-business-critical software to SaaS may make sense, while keeping your core business on site seems to be a good approach. In this piece, however, I’d like to address the adoption of the cloud as PaaS, the pitfalls of that approach, and how adopting the cloud as IaaS could alleviate some of them. Perhaps more importantly, I’d like to offer a nuanced view that hopefully avoids all-or-nothing thinking. In short, here is how I view it:

Note: As always, I am touching here on enterprise data strategies as the backbone of a business, and therefore talking about data platforms as a whole. I’m not talking about expert systems/system of records and where/how they should be implemented.

Going all in

Exposing the flaws of going all in on one cloud is fairly straightforward. Cloud infrastructure is super attractive: being able to spin up nodes and services at will is like being a kid at an arcade choosing what to play next, until you run out of quarters and want to take a worthy prize home. Here are some clear limitations of going all in on one cloud and using all of its services:

  • The services you use lock you in. If you develop something on AWS, for instance, using Lambda or any other tool, you will have a hard time the day you want to move those applications. To some extent, all the work you have done to liberate your data from your systems of record and drive a truly data-driven business could be rendered void by going all in with one cloud.
  • Cloud vendors are very good at getting your data in and storing it cheaply, as well as running ephemeral elastic workloads. However, running long-lasting compute or getting data out can be extremely costly.
  • Your ability to maintain internal governance and security processes is very limited in the cloud.
  • Not all clouds are equal across the globe. If you truly are a global business, you must have the ability to choose a cloud vendor that is available in your region.

The hybrid approach: a great option for today

The response to these limitations comes in the form of the hybrid cloud: the idea of having workloads run on components that can be deployed on demand, on premises or in the cloud, in the same manner. Frankly, this solves 99% of the problems IT is trying to solve:

  • The services you use are infrastructure agnostic, and therefore allow you to maintain control of your data.
  • You can leverage cloud vendors for ephemeral workloads and on-premises infrastructure for long-lasting ones.
  • Governance and security are shared across cloud/on-prem.
  • You get to leverage any cloud.

As always, the devil is in the details. The only true way to implement a hybrid cloud is to have the same architecture on premises and in the cloud. This means separating storage and compute, as opposed to keeping them coupled as is traditionally done on premises. Theoretically, considering the advances in networking and in container management, morphing traditional architectures to separate compute and storage should be fine.

The hybrid-er approach: a path towards true agnosticity

Like I mentioned before, I am a firm proponent of the hybrid approach. Nevertheless, I can’t help but imagine a world, 5 to 10 years from now, where everyone has implemented their hybrid data platform backend and the hot new tech is a platform that provides a very specific and essential set of capabilities (think complex AI workloads) only possible by coupling compute and storage. Traditional RDBMSs weren’t fit for many types of work (e.g. large scale), but that does not mean they completely disappeared. I think we are going to see the same thing with containerization: it will be essential for many cases, but for others, different resource managers may be more appropriate. Regardless, these are truly exciting times, and I am excited to be in the midst of this transformation.

Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin


Introduction & Context

There is a reason why I spent my life studying and working in computer science: understanding a computer’s psychology is usually fairly straightforward. Indeed, when presented with a specific input, computer programs tend to respond in a very predictable way, as opposed to our fellow human beings. Of course, this observation goes out the window as our algorithms become increasingly complex and capable of learning.

Regardless, as much as I love computer science, I have always had a keen interest in the human sciences. Personality psychology is a fascinating subject that has seen its ups and downs, as any scientific topic does. At the center of personality psychology reside the big five personality traits:

  • Openness to Experience
  • Conscientiousness
  • Extraversion
  • Agreeableness
  • Neuroticism (or Emotional Stability)

This taxonomy was determined by applying statistical models to personality surveys, essentially clustering the results of surveys in which people describe fellow human beings. As such, these traits are meant to categorize common aspects of personality across human beings, without moral connotation. The validity of the model and its predictive power for real-life outcomes are of course controversial, and I couldn’t do them justice here (I have most likely already irritated any personality psychologist reading these first few lines).

Recently, multiple machine learning algorithms designed to determine these 5 personality traits from text have surfaced, including IBM Watson Personality Insights. For this article I chose to use the personality recognizer written by Francois Mairesse, and to automate personality detection on New York Times articles using HDF 3.1 and HDP 3.0.

Solution Overview

The solution put in place uses 3 main elements:

  • A NiFi flow to orchestrate data ingestion from the New York Times API, personality detection, and storage to Hive
  • Hive to store the results of the personality detection
  • Zeppelin for visualization of the results

The figure below gives an overview of the solution flow:

More precisely, the solution can be dissected into 5 main steps, which I describe in detail below:

  • Step 1: Retrieving data from New York Times API
  • Step 2: Scrape HTML article data
  • Step 3: Run machine learning models for personality detection
  • Step 4: Store results to Hive
  • Step 5: Create simple Zeppelin notebook

Step 1: Retrieving data from New York Times API

Obtaining an API Key

This step is very straightforward. Go to https://developer.nytimes.com/signup and sign up for a key:

Note: The New York Times API is for non-commercial use only. I could of course have used any news API, but I’m not creative.

Configuring InvokeHTTP

The InvokeHTTP processor is used here with all default parameters except for the URL. Here are some key configuration items and a screenshot of the processor configuration:

  • HTTP Method: GET
  • Remote URL: http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=source:(“The New York Times”)&page=0&sort=newest&fl=web_url,snippet,headline,pub_date,document_type,news_desk,byline&api-key=[YOUR_KEY] (This URL selects articles from The New York Times as a source, and only selects the fields I am interested in: web_url, snippet, headline, pub_date, document_type, news_desk, byline.)
  • Content-Type: ${mime.type}
  • Run Schedule: 5 mins (could be set a little higher; I’m not sure how frequently new articles are published). A quick command-line sanity check of this request is shown below.
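
To validate the request before wiring it into InvokeHTTP, you can issue the same call with curl (a minimal sketch; --data-urlencode takes care of quoting the fq parameter, and [YOUR_KEY] is your own key):

$ curl -G "http://api.nytimes.com/svc/search/v2/articlesearch.json" \
    --data-urlencode 'fq=source:("The New York Times")' \
    --data "page=0" --data "sort=newest" \
    --data "fl=web_url,snippet,headline,pub_date,document_type,news_desk,byline" \
    --data "api-key=[YOUR_KEY]"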

Extracting results from Invoke HTTP response

The API call parameter page=0 returns results 0-9; for this exercise, I’m only interested in the latest article, so I set up an EvaluateJSONPath processor to take care of that, as you can see below:

A few important points here:

  • The destination is set to flowfile-attribute because we are going to re-use these attributes later in the flow.
  • I expect the API to change some time after this article is published. To make sure the JSON paths are valid for your version of the API, I recommend testing them with an online JSON path evaluator. The paths I used are listed below.
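
For reference, here are the attributes and the JSON paths they are extracted from. The nested headline.main and byline.original paths are my reading of the current Article Search response shape, so double-check them against an actual response:

web_url         →  $.response.docs[0].web_url
snippet         →  $.response.docs[0].snippet
headline        →  $.response.docs[0].headline.main
pub_date        →  $.response.docs[0].pub_date
document_type   →  $.response.docs[0].document_type
news_desk       →  $.response.docs[0].news_desk
byline          →  $.response.docs[0].byline.original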

Massage data to avoid conflicts when inserting to Hive

This step is definitely not optimized. The point here is to escape special characters to avoid errors when inserting into Hive. The only thing I am doing here is removing the ' character from the snippet, as you can see, but it deserves a second pass I think:
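
The screenshot carries the exact configuration; one way to achieve this (an assumption on my part, since any processor that rewrites the attribute would work) is an UpdateAttribute processor with a dynamic property that strips single quotes using the expression language:

snippet  →  ${snippet:replace("'", "")}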

Step 2: Scrape HTML article data

Once we have retrieved the article’s metadata, we must obtain the actual text of the article. For this, I’m using boilerpipe, an open-source library for boilerplate removal and full-text extraction from HTML pages (see references for details).

Create a simple Java class to call boilerpipe

After downloading the boilerpipe jars (from http://www.java2s.com/Code/Jar/b/Downloadboilerpipe120jar.htm), use your favorite Java IDE and create this simple class:

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import java.net.MalformedURLException;
import java.net.URL;

public class extractArticle {
    public static void main(String[] args) throws MalformedURLException, BoilerpipeProcessingException {
        if (args.length == 1) {
            // Fetch the page and strip boilerplate, keeping only the article text
            URL url = new URL(args[0]);
            String text = ArticleExtractor.INSTANCE.getText(url);
            System.out.println(text);
        } else {
            System.out.println("Please Specify URL");
        }
    }
}

Test the class from the command line, then create an executable jar (in my case extractArticle.jar).
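
A quick smoke test, assuming the boilerpipe jars and the compiled class sit in the current directory (the article URL is just a placeholder):

$ javac -cp boilerpipe-1.2.0.jar extractArticle.java
$ java -cp boilerpipe-1.2.0.jar:nekohtml-1.9.13.jar:xerces-2.9.1.jar:. extractArticle "https://www.nytimes.com/some-article.html"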

Transfer the jars to the NiFi server

Connect to your NiFi server as the nifi user and create the following directory structure:

$ cd /home/nifi
$ mkdir extractArticle
$ cd extractArticle
$ mkdir lib

Transfer the following libraries to ~/extractArticle/lib/:

  • xerces-2.9.1.jar
  • nekohtml-1.9.13.jar
  • boilerpipe-sources-1.2.0.jar
  • boilerpipe-javadoc-1.2.0.jar
  • boilerpipe-demo-1.2.0.jar
  • boilerpipe-1.2.0.jar
  • extractArticle.jar

Create a simple Unix script to execute HTML scraping

Under ~/extractArticle/ create the script extract_article.sh as follows:

#!/bin/bash
# Build the classpath from the jars transferred to ./lib and run the extractor
JDK_PATH=/usr
LIB1=./lib/xerces-2.9.1.jar
LIB2=./lib/nekohtml-1.9.13.jar
LIB3=./lib/boilerpipe-sources-1.2.0.jar
LIB4=./lib/boilerpipe-javadoc-1.2.0.jar
LIB5=./lib/boilerpipe-demo-1.2.0.jar
LIB6=./lib/boilerpipe-1.2.0.jar
LIB7=./lib/extractArticle.jar
LIBS=$LIB1:$LIB2:$LIB3:$LIB4:$LIB5:$LIB6:$LIB7
$JDK_PATH/bin/java -Xmx512m -classpath $LIBS extractArticle $*
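
Once the script is in place (and made executable with chmod +x extract_article.sh), a quick check from the shell prints the scraped article text to stdout; the URL below is again just a placeholder:

$ cd /home/nifi/extractArticle
$ ./extract_article.sh "https://www.nytimes.com/some-article.html"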

Configure ExecuteStreamCommand processor

Configure the processor to pass the URL as an argument and to stream its output to the next processor, as follows:
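
The screenshot holds the full configuration; given the directory layout above, the key ExecuteStreamCommand properties would presumably be (web_url being the attribute extracted earlier in the flow):

Command Path: /home/nifi/extractArticle/extract_article.sh
Command Arguments: ${web_url}
Working Directory: /home/nifi/extractArticle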

Step 3: Run machine learning models for personality detection

Setup PersonalityRecognizer on NiFi server

Just as for boilerpipe, we’re going to use an ExecuteStreamCommand processor. To prepare the files, run the following commands:

$ cd /home/nifi
$ wget http://farm2.user.srcf.net/research/personality/recognizer-1.0.3.tar.gz
$ tar -xvf recognizer-1.0.3.tar.gz
$ cd PersonalityRecognizer
$ mkdir texts

Modify the file PersonalityRecognizer.properties as follows:

##################################################
# Configuration File of the Personality Recognizer
##################################################
# All variables should be modified according to your
# directory structure
# Warning: for Windows paths, backslashes need to be
# doubled, e.g. c:\\Program Files\\Recognizer
# Root directory of the application
appDir = /home/nifi/PersonalityRecognizer
# Path to the LIWC dictionary file (LIWC.CAT)
liwcCatFile = ./lib/LIWC.CAT
# Path to the MRC Psycholinguistic Database file (mrc2.dct)
mrcPath = ./ext/mrc2.dct

Modify the script PersonalityRecognizer as follows:

#! /bin/bash -
# ENVIRONMENT VARIABLES
JDK_PATH=/usr
WEKA=./ext/weka-3-4/weka.jar
# ----------------------------------
COMMONS_CLI=./lib/commons-cli-1.0.jar
MRC=./lib/jmrc.jar
LIBS=.:$WEKA:$COMMONS_CLI:$MRC:bin/
$JDK_PATH/bin/java -Xmx512m -classpath $LIBS recognizer.PersonalityRecognizer $*

Finally, create a wrapper script that takes the latest file from the texts folder, runs PersonalityRecognizer on it, and outputs only the results in JSON format:

#!/bin/bash
# Pick up the most recent article file dropped in texts/ by the flow
text=`ls -t texts/ | head -1`
./PersonalityRecognizer -i ./texts/$text > tmp.txt
# Extract the numeric score for each of the big 5 traits from the output
extraversion=`cat tmp.txt | grep extraversion | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
emotional_stability=`cat tmp.txt | grep emotional | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
agreeableness=`cat tmp.txt | grep agreeableness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
conscientiousness=`cat tmp.txt | grep conscientiousness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
openness_to_experience=`cat tmp.txt | grep openness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
# Emit a flat JSON object; $1 is the article URL passed in by NiFi
json_output="{\"web_url\" : \"$1\", \"extraversion\" : \"$extraversion\",\"emotional_stability\" : \"$emotional_stability\",\"agreeableness\" : \"$agreeableness\",\"conscientiousness\" : \"$conscientiousness\",\"openness_to_experience\" : \"$openness_to_experience\"}"
echo $json_output
# Clean up so the next run only sees its own article
rm tmp.txt texts/*
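
Run by hand, with an article file already sitting in texts/, it behaves like this (the script name and the scores are purely illustrative):

$ ./personality_wrapper.sh "https://www.nytimes.com/some-article.html"
{"web_url" : "https://www.nytimes.com/some-article.html", "extraversion" : "2.85", "emotional_stability" : "3.12", ...}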

Configure PutFile processor to create article file

This processor takes the output stream of the HTML scraping step and creates a file under the appropriate folder, as shown below:

Configure the ExecuteStreamCommand Processor

Just as for the HTML scraping, configure the processor to pass the URL as an argument and to stream its output to the next processor, as follows:

Extract attributes from JSON output

Using EvaluateJSONPath, extract the results of the PersonalityRecognizer into attributes:
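
Since the wrapper emits a flat JSON object, the paths are straightforward, one attribute per trait:

extraversion             →  $.extraversion
emotional_stability      →  $.emotional_stability
agreeableness            →  $.agreeableness
conscientiousness        →  $.conscientiousness
openness_to_experience   →  $.openness_to_experience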

Step 4: Store results to Hive

Create Hive DB and tables

Because we don’t control whether we receive the same article twice from the New York Times API, we need to make sure that we don’t insert the same data twice into Hive (i.e. we upsert data into Hive). Upsert can be implemented with two tables and the MERGE command.

Therefore, connect to your Hive server and create one database and two tables as follows:

CREATE DATABASE personality_detection;
use personality_detection;
CREATE TABLE text_evaluation (
web_url String,
snippet String,
byline String,
pub_date date,
headline String,
document_type String,
news_desk String,
last_updated String,
extraversion decimal(10,4),
emotional_stability decimal(10,4),
agreeableness decimal(10,4),
conscientiousness decimal(10,4),
openness_to_experience decimal(10,4)
)
clustered by (web_url) into 2 buckets stored as orc
tblproperties("transactional"="true");

CREATE TABLE all_updates (
web_url String,
snippet String,
byline String,
pub_date date,
headline String,
document_type String,
news_desk String,
last_updated String,
extraversion decimal(10,4),
emotional_stability decimal(10,4),
agreeableness decimal(10,4),
conscientiousness decimal(10,4),
openness_to_experience decimal(10,4)
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

Create HiveQL script

Using a ReplaceText processor, create the appropriate HiveQL command to upsert data into your tables from the data collected in the flow.

Code for the Replacement Value (note that I strip the timestamp from pub_date here, because I’m storing it as a date):

use personality_detection;

insert into all_updates values('${web_url}','${snippet}','${byline}','${pub_date:substring(0,10)}','${headline}','${document_type}','${news_desk}','${now()}','${extraversion}','${emotional_stability}','${agreeableness}','${conscientiousness}','${openness_to_experience}');

merge into text_evaluation
using (select distinct web_url, snippet, byline, pub_date, headline, document_type, news_desk, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience from all_updates) all_updates on text_evaluation.web_url = all_updates.web_url
when matched then update set
snippet=all_updates.snippet,
byline=all_updates.byline,
pub_date=all_updates.pub_date,
headline=all_updates.headline,
document_type=all_updates.document_type,
news_desk=all_updates.news_desk,
last_updated=from_unixtime(unix_timestamp()),
extraversion=all_updates.extraversion,
emotional_stability=all_updates.emotional_stability,
agreeableness=all_updates.agreeableness,
conscientiousness=all_updates.conscientiousness,
openness_to_experience=all_updates.openness_to_experience
when not matched then insert
values(all_updates.web_url,all_updates.snippet, all_updates.byline, all_updates.pub_date, all_updates.headline, all_updates.document_type,
all_updates.news_desk, from_unixtime(unix_timestamp()), all_updates.extraversion, all_updates.emotional_stability, all_updates.agreeableness, all_updates.conscientiousness, all_updates.openness_to_experience);

truncate table all_updates;

Processor Overview:

Upsert data to Hive

Finally, configure a simple PutHiveQL processor as follows (make sure you have configured your HiveConnectionPool beforehand):
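
The screenshot shows the full setup; the properties that matter, presumably, are the connection pool and the statement delimiter, since the ReplaceText output contains several statements separated by semicolons:

Hive Database Connection Pooling Service: HiveConnectionPool
Statement Delimiter: ;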

Step 5: Create simple Zeppelin notebook

Lastly, after running the NiFi flow for a while, create a simple Zeppelin notebook to show your results. This notebook uses the JDBC interpreter for Hive and runs the following query:

%jdbc(hive)
select byline, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience from personality_detection.text_evaluation limit 10

Then, you can play with Zeppelin visualizations to display the average of the big 5 by byline, for instance with a query like the one below:
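
A minimal sketch of such an aggregation (the column aliases are mine):

%jdbc(hive)
select byline,
       avg(extraversion) as avg_extraversion,
       avg(emotional_stability) as avg_emotional_stability,
       avg(agreeableness) as avg_agreeableness,
       avg(conscientiousness) as avg_conscientiousness,
       avg(openness_to_experience) as avg_openness_to_experience
from personality_detection.text_evaluation
group by byline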

Conclusion

While very simple, this exercise is a good starting point for on-the-wire personality recognition. More importantly, in an age of information overload, or even misinformation, the ability to classify the psychology of a text on the fly can be extremely useful. I plan on tinkering with this project, improving performance, optimizing models, and ingesting more data, so stay tuned!

Known possible improvements

  • Better control of data retrieval to avoid duplicate flows (depends on API)
  • Better special character replacement for HiveQL command
  • More elegant way to execute data scraping and run personality recognition java classes
  • Additional scraping from article text to remove title, byline, and other unnecessary information from boilerpipe output
  • More thorough testing of different personality recognizer models (and use other/more recent libraries)

References

  • Big Five personality traits: https://en.wikipedia.org/wiki/Big_Five_personality_traits
  • The Big-Five Trait Taxonomy: History, Measurement, and Theoretical Perspectives: http://moityca.com.br/pdfs/bigfive_john.pdf
  • IBM Watson Personality Insights: https://personality-insights-demo.ng.bluemix.net/
  • Personality Recognizer by Francois Mairesse: http://farm2.user.srcf.net/research/personality/recognizer
  • NYT API sign up: https://developer.nytimes.com/signup
  • NYT API FAQ: https://developer.nytimes.com/faq
  • NYT Article Search readme: https://developer.nytimes.com/article_search_v2.json#/README
  • JSON Path Evaluator: http://jsonpath.com/
  • Boilerpipe jar download: http://www.java2s.com/Code/Jar/b/Downloadboilerpipe120jar.htm
  • Boilerpipe github: https://github.com/kohlschutter/boilerpipe