paul vidal - pragmatic big data nerd

Tag Archives

4 Articles

A Hybrid approach to the Hybrid Cloud

by paul 0 Comments
A Hybrid approach to the Hybrid Cloud

Unless you have been hiding under a rock, or maybe spending too much time looking at the clouds passing by, the last couple of years have seen the advent of the adoption of the cloud as a major part of enterprise IT infrastructure.  As with everything in IT infrastructure, trends are followed for and without good reasons. Like I’ve argued before, outsourcing your non business critical software to SaaS may make sense, while maintaining your core business on site seem to be a good approach. In this piece however, I’d like to address the adoption of the cloud as PaaS, what are the pitfalls of that type of approach, and how adopting cloud as IaaS could alleviate some of these pitfalls. Perhaps more importantly, I’d like to offer a nuanced approach that will hopefully avoid an all-or-nothing approach. In short, here is how I view it:

Note: As always, I am touching here on enterprise data strategies as the backbone of a business, and therefore talking about data platforms as a whole. I’m not talking about expert systems/system of records and where/how they should be implemented.

Going all in

Exposing the flaws of going all into one cloud is fairly straight forward. Cloud infrastructure is super attractive. Being able to spin up at will nodes and services is super attractive. It’s like a kid at an arcade choosing what to play next. Until you run out of quarters and want to take a worthy price home. Here are some clear limitations of going all in into one cloud, and using all its services:

  • The services you use lock you in. If you develop something on AWS for instance, using lambda or any other tool, you will have a hard time the day you want to move these applications. To some extent, all the work you have done to liberate your data from your system of records and drive a true data driven business could be rendered void by going all in with one cloud.
  • Cloud vendors are very good a getting your data in and storing it for cheap, as well as running ephemeral elastic workloads. However, running long lasting compute or getting data out can be extremely costly.
  • Maintaining internal process of governance & security are very limited in the cloud.
  • Not all clouds are equal across the globe. If you truly are a global business you must have the ability to chose the cloud vendor that is available in your region.

The hybrid approach: a great option for today

The response to these limitations comes in form of a hybrid cloud. It is the idea of having workloads running on components that can be deployed on demand on premises or in the cloud in the same manner. Frankly, this solves 99% of the problems IT is trying to solve:

  • The services you use are infrastructure agnostic, and therefore allow you to maintain control of your data.
  • You can leverage cloud vendors for ephemeral workloads and on site for long lasting ones.
  • Governance and security are shared across cloud/on-prem.
  • You get to leverage any cloud.

As always, the devil is in the details. The only true way to implement a hybrid cloud is to have the same architecture on prem and in the cloud. This means separating storage and compute, as opposed to having storage and compute coupled like it is traditionally setup on premises. Theoretically, considering the advances of networking, and the advances of container management, morphing traditional architectures to have compute and storage separated should be fine.

The hybrid-er approach: a path towards true agnosticity

Like I mentioned before, I am a firm proponent of the hybrid approach. Nevertheless, I can’t help but imagining a world in 5 to 10 years, where everyone has implemented their hybrid data platform backend and the hot new tech is a new platform that provides a very specific and essential set of capabilities (think complex AI workloads) only possible by coupling compute and storage. Traditional RDBMS weren’t fit for many types of work (e.g. large scale, etc.), that does not mean they completely disappeared. I think we are going to see the same thing with containerization. It will be essential for many cases, but for others, different resources managers may be more appropriate. Regardless, these are truly exciting times, and I am very excited to be in the midst of this transformation.

Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin

by paul 0 Comments
Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin

Introduction & Context

There is a reason why I spent my life studying and working in computer science: understanding a computer’s psychology is usually fairly straight forward. Indeed, when presented with a specific input, computer programs tend to respond in a very predictable way, as opposed to our fellow human beings. Of course, this observation goes out of the window as our algorithms become increasingly more complex and capable of learning.

Regardless, as much as I love computer science, I always had a keen interest in human sciences. Personality psychology is a fascinating subject that has seen its ups and downs as any science topic. At the center of personality psychology reside the big five personality traits:

  • Openness to Experience
  • Conscientiousness
  • Extraversion
  • Agreeableness
  • Neuroticism (or Emotional Stability)

This taxonomy was determined by applying statistical models to personality surveys, essentially clustering results of surveys of people describing fellow human beings. As such, these traits are meant to categorize common aspect of personality across human beings without moral connotation. The validity of the model and its predictability for real life outcomes is of course controversial, and I wouldn’t make it justice here (I most likely already irritated any personality psychologist that read these first few lines).

Recently, multiple machine learning algorithms have been designed to determine these 5 personality traits from texts have surfaced, including IBM Watson personality insights. For this article I chose to use the personality recognizer written by Francois Mairesse, and automate personality detection of New York Times articles using HDF 3.1 and HDP 3.0.

Solution Overview

The solution put in place uses 3 main elements:

  • A NiFi flow to orchestrate data ingestion from API, personality detection and storage to Hive
  • Hive to store the results of the personality detection
  • Zeppelin for visualization of the results

The figure below gives an overview of the solution flow:

More precisely, the solution can be dissected in 5 main steps, that I’m describing in details below:

  • Step 1: Retrieving data from New York Times API
  • Step 2: Scrape HTML article data
  • Step 3: Run machine learning models for personality detection
  • Step 4: Store results to Hive
  • Step 5: Create simple Zeppelin notebook

Step 1: Retrieving data from New York Times API

Obtaining an API Key

This step is very straight forward. Go to https://developer.nytimes.com/signup and sign-up for a key:

Note: The New-York Times API is for non-commercial use only. I could have of course used any news API, but I’m not creative.

Configuring InvokeHTTP

The InvokeHTTP is used here with all default parameters, except for the URL. Here are some key configuration items and a screenshot of the Processor configuration:

  • HTTP Method: GET
  • Remote URL: http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=source:(“The New York Times”)&page=0&sort=newest&fl=web_url,snippet,headline,pub_date,document_type,news_desk,byline&api-key=[YOUR_KEY] (This URL selects article from the New York Times as a source, and only selects some of the fields I am interested in: web_url,snippet,headline,pub_date,document_type,news_desk,byline).
  • Content-Type: ${mime.type}
  • Run Schedule: 5 mins (could be set to a little more, I’m not sure the frequency at which new articles are published)

Extracting results from Invoke HTTP response

The API call parameter page=0 returns results 0-9; for this exercise, I’m only interested in the latest article, so I setup an evaluateJSONpath processor take care of that, as you can see below:

A few important points here:

  • The destination is set to flowfile-attribute because we are going to re-use these attributes later in the flow
  • I’m expecting the API to change after some time this article is published. Just to make sure that the JSON paths are good for your version of the API, I recommend JSON paths evaluators online.

Massage data to avoid conflicts when inserting to Hive

This step is definitely not optimized. The point here is to escape the special characters to avoid errors when inserting into hive. The only thing I am doing here is removing the ‘ from the snippet as you can see, but it would deserve a second path I think:

Step 2: Scrape HTML article data

Once we retrieved the meta data of the article, we must obtain the actual text of the article. For this, I’m using boilerpipe, an open source boilerplate removal and fulltext extraction from HTML pages (see reference for details).

Create a simple Java class to call boilerplate

After downloading the boilerpipe jars (using http://www.java2s.com/Code/Jar/b/Downloadboilerpipe120jar.htm), use your favorite Java IDE and create this simple class:

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import java.net.MalformedURLException;
import java.net.URL;
public class extractArticle {
public static void main (String args[]) throws MalformedURLException, BoilerpipeProcessingException {
if(args.length == 1) {
URL url = new URL("" + args[0]);
String text = ArticleExtractor.INSTANCE.getText(url);
System.out.println(text);
} else {
System.out.println("Please Specify URL");
}
}
}

Once tested, create an executable jar (in my case extractArticle.jar).

Transfer jars to nifi server

Connect to your nifi server with your nifi user and create the following directory structure:

$ cd /home/nifi
$ mkdir extractArticle
$ cd extractArticle
$ mkdir lib

Transfer the following libraries to ~/extractArticle/lib/ :

  • xerces-2.9.1.jar
  • nekohtml-1.9.13.jar
  • boilerpipe-sources-1.2.0.jar
  • boilerpipe-javadoc-1.2.0.jar
  • boilerpipe-demo-1.2.0.jar
  • boilerpipe-1.2.0.jar
  • extractArticle.jar

Create a simple Unix script to execute HTML scraping

Under ~/extractArticle/ create the script extract_article.sh as follows:

#!/bin/bash
JDK_PATH=/usr
LIB1=./lib/xerces-2.9.1.jar
LIB2=./lib/nekohtml-1.9.13.jar
LIB3=./lib/boilerpipe-sources-1.2.0.jar
LIB4=./lib/boilerpipe-javadoc-1.2.0.jar
LIB5=./lib/boilerpipe-demo-1.2.0.jar
LIB6=./lib/boilerpipe-1.2.0.jar
LIB7=./lib/extractArticle.jar
LIBS=$LIB1:$LIB2:$LIB3:$LIB4:$LIB5:$LIB6:$LIB7
$JDK_PATH/bin/java -Xmx512m -classpath $LIBS extractArticle $*

Configure ExecuteStreamCommand processor

Configure the processor to pass the URL in argument and outputting the output stream to the next processor, as follows:

Step 3: Run machine learning models for personality detection

Setup PersonalityRecognizer on NiFi server

Just as for boilerpipe, we’re going to run an ExecuteStream command. To prepare the files, run the following commands:

$ cd /home/nifi
$ wget http://farm2.user.srcf.net/research/personality/recognizer-1.0.3.tar.gz recognizer-1.0.3.tar.gz
$ tar -xvf recognizer-1.0.3.tar.gz
$ cd PersonalityRecognizer
$ mkdir texts

Modify the file PersonalityRecognizer.properties as follows:

##################################################
# Configuration File of the Personality Recognizer
##################################################
# All variables should be modified according to your
# directory structure
# Warning: for Windows paths, backslashes need to be
# doubled, e.g. c:\\Program Files\\Recognizer
# Root directory of the application
appDir = /home/nifi/PersonalityRecognizer
# Path to the LIWC dictionary file (LIWC.CAT)
liwcCatFile = ./lib/LIWC.CAT
# Path to the MRC Psycholinguistic Database file (mrc2.dct)
mrcPath = ./ext/mrc2.dct

Modify the script PersonalityRecognizer as follows:

#! /bin/bash -
# ENVIRONMENT VARIABLES
JDK_PATH=/usr
WEKA=./ext/weka-3-4/weka.jar
# ----------------------------------
COMMONS_CLI=./lib/commons-cli-1.0.jar
MRC=./lib/jmrc.jar
LIBS=.:$WEKA:$COMMONS_CLI:$MRC:bin/
$JDK_PATH/bin/java -Xmx512m -classpath $LIBS recognizer.PersonalityRecognizer $*

Finally, create a wrapper script that, using the latest file from the folder texts runs PersonalityRecognizer and outputs only the results in a json format:

#!/bin/bash
text=`ls -t texts/ | head -1`
./PersonalityRecognizer -i ./texts/$text > tmp.txt
extraversion=`cat tmp.txt | grep extraversion | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
emotional_stability=`cat tmp.txt | grep emotional | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
agreeableness=`cat tmp.txt | grep agreeableness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
conscientiousness=`cat tmp.txt | grep conscientiousness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
openness_to_experience=`cat tmp.txt | grep openness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
json_output="{\"web_url\" : \"$1\", \"extraversion\" : \"$extraversion\",\"emotional_stability\" : \"$emotional_stability\",\"agreeableness\" : \"$agreeableness\",\"conscientiousness\" : \"$conscientiousness\",\"openness_to_experience\" : \"$openness_to_experience\"}"
echo $json_output
rm tmp.txt texts/*

Configure PutFile processor to create article file

This processor takes the output stream of the HTML scraping to create a file, under the appropriate folder, as shown below:

Configure the ExecuteStreamCommand Processor

Just as for HTML scraping, configure the processor to pass the URL in argument and outputting the output stream to the next processor, as follows:

Extract attributes from JSON output

Using EvaluateJSONPath, retrieve the results of the PersonalityRecognizer to attributes:

Step 4: Store results to Hive

Create Hive DB and tables

Because we don’t control wether we receive the same article twice from the New York Times API, we need to make sure that we don’t insert the same data twice into Hive (i.e. upsert data into Hive). Upsert can be implemented by two tables and the merge command.

Therefore connect to your hive server and create one database and two tables as follows:

CREATE DATABASE personality_detection;
use personality_detection;
CREATE TABLE text_evaluation (
web_url String,
snippet String,
byline String,
pub_date date,
headline String,
document_type String,
news_desk String,
last_updated String,
extraversion decimal(10,4),
emotional_stability decimal(10,4),
agreeableness decimal(10,4),
conscientiousness decimal(10,4),
openness_to_experience decimal(10,4)
)
clustered by (web_url) into 2 buckets stored as orc
tblproperties("transactional"="true");

CREATE TABLE all_updates (
web_url String,
snippet String,
byline String,
pub_date date,
headline String,
document_type String,
news_desk String,
last_updated String,
extraversion decimal(10,4),
emotional_stability decimal(10,4),
agreeableness decimal(10,4),
conscientiousness decimal(10,4),
openness_to_experience decimal(10,4)
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

Create HiveQL script

Using a ReplaceText processor, create the appropriate HiveQL command to be executed to upsert data into your tables from the data collected in the flow.

Code for Replacement Value (note that I remove the timestamp from the pub_date here, because I’m storing it as a date):

use personality_detection;

insert into all_updates values('${web_url}','${snippet}','${byline}','${pub_date:substring(0,10)}','${headline}','${document_type}','${news_desk}','${now()}','${extraversion}','${emotional_stability}','${agreeableness}','${conscientiousness}','${openness_to_experience}');

merge into text_evaluation
using (select distinct web_url, snippet, byline, pub_date, headline, document_type, news_desk, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience from all_updates) all_updates on text_evaluation.web_url = all_updates.web_url
when matched then update set
snippet=all_updates.snippet,
byline=all_updates.byline,
pub_date=all_updates.pub_date,
headline=all_updates.headline,
document_type=all_updates.document_type,
news_desk=all_updates.news_desk,
last_updated=from_unixtime(unix_timestamp()),
extraversion=all_updates.extraversion,
emotional_stability=all_updates.emotional_stability,
agreeableness=all_updates.agreeableness,
conscientiousness=all_updates.conscientiousness,
openness_to_experience=all_updates.openness_to_experience
when not matched then insert
values(all_updates.web_url,all_updates.snippet, all_updates.byline, all_updates.pub_date, all_updates.headline, all_updates.document_type,
all_updates.news_desk, from_unixtime(unix_timestamp()), all_updates.extraversion, all_updates.emotional_stability, all_updates.agreeableness, all_updates.conscientiousness, all_updates.openness_to_experience);

truncate table all_updates;

Processor Overview:

Upsert data to hive

Finally, configure a simple PutHiveQL processor as follows (make sure you configured your HiveConnectionPool beforehand):

Step 5: Create simple Zeppelin notebook

Lastly, after running the NiFi flow for a while, create a simple Zeppelin notebook to show your result. This notebook will use the jdbc interpreter for Hive and run the following query:

%jdbc(hive)
select byline, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience from personality_detection.text_evaluation limit 10

Then, you can play with Zeppelin visualizations to display the average of the big 5 by byline:

Conclusion

While being a very simple, this exercise is a good starting point for on-the-wire personality recognition. More importantly, in an age of information overload or even misinformation, having the ability to classifying the psychology of a text on the fly can be extremely useful. I do plan on tinkering with this project, improving performance, optimizing models and ingesting more data, so stay tuned!

Known possible improvements

  • Better control of data retrieval to avoid duplicate flows (depends on API)
  • Better special character replacement for HiveQL command
  • More elegant way to execute data scraping and run personality recognition java classes
  • Additional scraping from article text to remove title, byline, and other unnecessary information from boilerpipe output
  • More thorough testing of different personality recognizer models (and use other/more recent libraries)

References

  • Big Five personality traits: https://en.wikipedia.org/wiki/Big_Five_personality_traits
  • The Big-Five Trait Taxonomy: History, Measurement, and Theoretical Perspectives: http://moityca.com.br/pdfs/bigfive_john.pdf
  • IBM Watson Personality Insights: https://personality-insights-demo.ng.bluemix.net/
  • Personality Recognizer by Francois Mairesse: http://farm2.user.srcf.net/research/personality/recognizer
  • NYT API sign up: https://developer.nytimes.com/signup
  • NYT API FAQ: https://developer.nytimes.com/faq
  • NYT Article Search readme: https://developer.nytimes.com/article_search_v2.json#/README
  • JSON Path Evaluator: http://jsonpath.com/
  • Boilerpipe jar download: http://www.java2s.com/Code/Jar/b/Downloadboilerpipe120jar.htm
  • Boilerpipe github: https://github.com/kohlschutter/boilerpipe

Making data analytics operational

by paul 0 Comments
Making data analytics operational
I refuse to use the term-that-should-not-be-used when describing stale data lakes.

After 6 months of silence, I finally take the time to get back behind my keyboard. I would like to say that I used these 6 months to reflect upon my writing, the current data market and came out of this hiatus a better, more informed and well versed person, but that would be a lie. And, despite the current pace at which the social fabric of our society is moving towards considering lies as acceptable and moral, I prefer not to. I don’t really know why I stopped writing for a bit, but most likely because I had nothing to say. So today, brace yourselves for a semi-informed opinion piece on data analytics, because I actually changed my opinion a bit on it through real-life experience.

My opinion then: analytics are a fringe use case of data management

In my article “Why data driven companies should stop investing in data analytics” I argued for the death of dashboards. I still stand by that point of view, as too often the Business Intelligence (BI) platforms are an end point of the data life cycle. Countless data replication processes, ETL, busses and other goldengate push data into data warehouses or data lakes where data scientists pat themselves on the back by showing dashboards that could potentially contain information to be integrated in the current business processes. Quick aside and nugget of knowledge from my PhD friends: if your title contains “science” in it, you’re not a real scientist. Shots fired. Moving on, while I still stand knee deep in stale data lakes despite being on my soapbox, there is one thing I did not consider enough: Machine Learning algorithms. There are two main reasons why the existence of machine learning algorithms as they are implemented now changes my opinion. First and foremost, the problem I describe of BI being the end of the data chain and its outcome only being driven by humans trying to improve business process can be alleviated with analytics automation via these algorithms (to some extent at the moment, but will be more and more true as the technology progresses). Secondly, ML needs access to data lakes, not operational big data. The algorithms need to be able to train using any data sets, looking at data from any angle in order to make usable predictions.

My opinion now: analytics need to be better integrated in the data life cycle

Consequently, here my proposal to the data world. We need to envision an architecture where data warehouses are not the raiders of the lost ark type but more the amazon type: they need to be an inherent part of the data life cycle. Drilling a bit further in the architecture I contemplate, your data as a service layer would feed current data sets to your data warehouse, where ML would run asynchronously, but the outcome of these analytics would then feed back the rules of data manipulation embedded in your DaaS layer. If you manage a constant feedback loop of the kind, your end user application served by your DaaS will constantly get fed more accurate and relevant data, which in turn can enable the next generation of platforms: Information as a Service. But that’s for another day.

The future of Data is Augmentation, not Disruption

by paul 1 Comment
The future of Data is Augmentation, not Disruption
I'm disrupting the light bulb market by enabling wireless. I call it "Photoshop"

I spent last week enjoying the Cassandra Summit, so much that I did not take the time to write a blog post. I had a few ideas but I chose quality over quantity. That being said, something interesting happened at the summit: we coined the term “augmentation” for one of my companies key go to market use case, instead of data layer modernization or digitalization. even got the opportunity to try both terms to the different people visiting our booth. In this extremely small sample, people really tended to have a much better degree of understanding when I used the word augmentation, which got me thinking. I even read a very interesting article from Tom O’Reilly called: Don’t Replace People. Augment Them. in which he argues against technology fully replacing people. Could this concept of augmentation be applied in a broader scale to understand our data technology trends? Maybe, at least that’s what I’m going to try to lay out in this article.

Technological progress relies on augmentation.

That’s the first thing that struck me when I pondered on augmentation in our world, and more specifically when it comes to software. At the exception of very few, the platforms, apps and tool that we use are all based on augmentation of existing basic functions: Amazon? Augmentation of store using technology. Uber? Augmentation of taxis. Chatbots? Augmentation of chat clients. Slack? Augmentation of email + chats. Distributed/Cloud applications? Augmentation of legacy applications. To some extent even Google is an augmentation of a manual filing system. I would admit listing examples that confirm an idea that I already had is close to a logical fallacy, so I tried to find counter examples, i.e. software solutions that try to introduce completely new concepts, but could not think of any. Of course we could argue over semantics in defining what constitute true innovation versus augmentation of an existing technology, but ultimately I think it is fair to say that the most successful technologies are augmenting our experience rather than being completely disruptive, despite what most of my field would argue. Therefore, augmentation must be at least considered as part of the future of any software industry, such as the Big Data industry.

Augmentation is better than transformation

Human nature needs comfort, that’s why most of us prefer augmentation over disruption. By disruption, I’m talking about transforming or replacing the existing systems, not adding features: selling unpaired socks over internet is not disrupting the sock industry, despite what the TED talks would like me to believe. Seriously, when you have existing technologies, as every company does, a replacement/transformation is a hard pill to swallow. Loss of investment, knowledge, process, etc. It is especially risky and complex when talking about data layer transformation, as I argued before in this very blog. So when given a choice, augmenting existing data layers is an obvious choice for risk-advert IT organizations.

Augmentation drives innovation

Perhaps the most convincing argument towards acknowledging that augmentation is the future of data is the analysis of the most innovative big data software solutions: machine learning, neural networks and all of these extremely complex systems which behaviors are almost impossible to predict, even for experts. These systems are design to augment their own capabilities, instead of having a set of deterministic rules to follow. Indeed, these systems are designed to approach the capabilities of complex biological systems and therefore incorporate their “messiness”. We can’t think of big data systems using physics thinking (i.e. here is an algorithm, here is a set of parameters, this is the result expected), but we should rather rely to biology thinking (i.e. what is the results I get if I input this parameter). A great example of this type of thinking is Netflix’s Chaos Monkey, a service running on AWS to simulate failures and understand the behavior of their architecture. Self-augmentation is the principle upon which the technologies of the future are built. We understand the algorithms we input but not necessarily the outcome, which can have unintended consequences sometimes (see: Microsoft Tay), but ultimately is a better pathway to intelligent technologies. I’m a control freak, and not being able to understand a system end to end drives me nuts, but I’m willing to relinquish my sanity for the good of Artificial Intelligence.

Conclusion

With software Augmentation being part of our everyday life, a safer and easier way to add features to existing data layer, and the core concept of machine learning, I think it is fair to say that it is the future of Data. Did I convince myself? Yes, which is good because my opinion is usually my first go to when it comes to figuring out what I think. Seriously though, what do you think? As always, I long to learn more and listen to everyone’s opinion!