paul vidal - pragmatic big data nerd

Tag Archives

19 Articles

Use Altus Director to launch a Cloudera Data Science Workbench cluster on AWS

by paul 0 Comments
Use Altus Director to launch a Cloudera Data Science Workbench cluster on AWS

My blood recently turned from green to blue (after the Hortonworks-Cloudera merger) and I couldn’t be more excited to play with new toys. What I am particularly excited about is Cloudera Data Science Workbench. But, like in everything I do, I am very lazy. So here is a quick tutorial to install Altus Director, and use it to deploy a CDH 5.15 + CDSW cluster.

Step 1: Install Altus Director

Many ways to do that, but the one I chose was the AWS install, detailed here: https://www.cloudera.com/documentation/director/latest/topics/director_aws_setup_client.html

The installation documentation is very well done, but here are the important excerpts

Create a VPC for your Altus instance

Follow the documentation.

Few important points:

  • In the name of laziness I also recommend to add a 0-65535 rule from your personal IP.
  • Your VPC should have an internet gateway associated with it (you could do it without, but would require you manually pulling the CM/CDH software down and make internal repositories within your subnet)
  • Do not forget to open all traffic to your security group as described here. Your deployment will not work otherwise.

Launch a Redhat 7.3 instance

You can either search communities AMIs, or use this one: ami-6871a115

Install Altus

Connect to your ec2 instance:

ssh -i your_file.pem ec2-user@your_instance_ip

Install JDK and wget

sudo yum install java-1.8.0-openjdk
sudo yum install wget

Install/Start Altus server and client:

cd /etc/yum.repos.d/
sudo wget "http://archive.cloudera.com/director6/6.1/redhat7/cloudera-director.repo"
sudo yum install cloudera-director-server cloudera-director-client
sudo service cloudera-director-server start
sudo systemctl disable firewalld
sudo systemctl stop firewalld

Connect to Altus Director

Go to http://your_instance_ip:7189/ and connect with admin/admin

Step 2: Modify the Director configuration file

CDSW cluster configuration can be found here https://github.com/cloudera/director-scripts/blob/master/configs/aws.cdsw.conf

Modify the configuration file to use:

  • Your AWS accessKeyId/secretAccessKey
  • Your AWS region
  • Your AWS subnetId (same as the one you created for your Director instance)
  • Your AWS securityGroupsIds (same as the one you created for your Director instance)
  • Your private key path (e.g. /home/ec2-user/field.pem)
  • Your AWS image (e.g. ami-6871a115)

Step 3: Launch the cluster via director client

Go to your EC2 instance where Director is installed, and load your modified configuration file as well as the appropriate key.

Finally, run the following:

cloudera-director bootstrap-remote your_configuration_file.conf \
--lp.remote.username=admin \
--lp.remote.password=admin

Step 4: Access Cloudera Manager

You can follow the bootstrapping of the cluster both on command line or in the Director interface; once done, you can connect to Cloudera Manager using: http://your_manager_instance_ip:7180/

Step 5: Configure CDSW domain with your IP

Cloudera Data Science Workbench uses DNS. The correct approach is to setup a wildcard DNS record is required, as described here.

However, for testing purposes I used nip.io. The only parameter to change is the Cloudera Data Science Workbench Domain, from cdsw.my-domain.com as the conf file sets it up to, to cdsw.[YOUR_AWS_PUBLIC_IP].nip.io, as depicted below:

Restart the CDSW service, then you should be able to access CDSW by clicking on the CDSW Web UI link. Register for a new account and you will have access to CDSW:

A Hybrid approach to the Hybrid Cloud

by paul 0 Comments
A Hybrid approach to the Hybrid Cloud

Unless you have been hiding under a rock, or maybe spending too much time looking at the clouds passing by, the last couple of years have seen the advent of the adoption of the cloud as a major part of enterprise IT infrastructure. ¬†As with everything in IT infrastructure, trends are followed for and without good reasons. Like I’ve argued before, outsourcing your non business critical software to SaaS may make sense, while maintaining your core business on site seem to be a good approach. In this piece however, I’d like to address the adoption of the cloud as PaaS, what are the pitfalls of that type of approach, and how adopting cloud as IaaS could alleviate some of these pitfalls. Perhaps more importantly, I’d like to offer a nuanced approach that will hopefully avoid an all-or-nothing approach. In short, here is how I view it:

Note: As always, I am touching here on enterprise data strategies as the backbone of a business, and therefore talking about data platforms as a whole. I’m not talking about expert systems/system of records and where/how they should be implemented.

Going all in

Exposing the flaws of going all into one cloud is fairly straight forward. Cloud infrastructure is super attractive. Being able to spin up at will nodes and services is super attractive. It’s like a kid at an arcade choosing what to play next. Until you run out of quarters and want to take a worthy price home. Here are some clear limitations of going all in into one cloud, and using all its services:

  • The services you use lock you in. If you develop something on AWS for instance, using lambda or any other tool, you will have a hard time the day you want to move these applications. To some extent, all the work you have done to liberate your data from your system of records and drive a true data driven business could be rendered void by going all in with one cloud.
  • Cloud vendors are very good a getting your data in and storing it for cheap, as well as running ephemeral elastic workloads. However, running long lasting compute or getting data out can be extremely costly.
  • Maintaining internal process of governance & security are very limited in the cloud.
  • Not all clouds are equal across the globe. If you truly are a global business you must have the ability to chose the cloud vendor that is available in your region.

The hybrid approach: a great option for today

The response to these limitations comes in form of a hybrid cloud. It is the idea of having workloads running on components that can be deployed on demand on premises or in the cloud in the same manner. Frankly, this solves 99% of the problems IT is trying to solve:

  • The services you use are infrastructure agnostic, and therefore allow you to maintain control of your data.
  • You can leverage cloud vendors for ephemeral workloads and on site for long lasting ones.
  • Governance and security are shared across cloud/on-prem.
  • You get to leverage any cloud.

As always, the devil is in the details. The only true way to implement a hybrid cloud is to have the same architecture on prem and in the cloud. This means separating storage and compute, as opposed to having storage and compute coupled like it is traditionally setup on premises. Theoretically, considering the advances of networking, and the advances of container management, morphing traditional architectures to have compute and storage separated should be fine.

The hybrid-er approach: a path towards true agnosticity

Like I mentioned before, I am a firm proponent of the hybrid approach. Nevertheless, I can’t help but imagining a world in 5 to 10 years, where everyone has implemented their hybrid data platform backend and the hot new tech is a new platform that provides a very specific and essential set of capabilities (think complex AI workloads) only possible by coupling compute and storage. Traditional RDBMS weren’t fit for many types of work (e.g. large scale, etc.), that does not mean they completely disappeared. I think we are going to see the same thing with containerization. It will be essential for many cases, but for others, different resources managers may be more appropriate. Regardless, these are truly exciting times, and I am very excited to be in the midst of this transformation.

Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin

by paul 0 Comments
Determining the big 5 traits of Personality Psychology of news articles using NiFi, Hive & Zeppelin

Introduction & Context

There is a reason why I spent my life studying and working in computer science: understanding a computer’s psychology is usually fairly straight forward. Indeed, when presented with a specific input, computer programs tend to respond in a very predictable way, as opposed to our fellow human beings. Of course, this observation goes out of the window as our algorithms become increasingly more complex and capable of learning.

Regardless, as much as I love computer science, I always had a keen interest in human sciences. Personality psychology is a fascinating subject that has seen its ups and downs as any science topic. At the center of personality psychology reside the big five personality traits:

  • Openness to Experience
  • Conscientiousness
  • Extraversion
  • Agreeableness
  • Neuroticism (or Emotional Stability)

This taxonomy was determined by applying statistical models to personality surveys, essentially clustering results of surveys of people describing fellow human beings. As such, these traits are meant to categorize common aspect of personality across human beings without moral connotation. The validity of the model and its predictability for real life outcomes is of course controversial, and I wouldn’t make it justice here (I most likely already irritated any personality psychologist that read these first few lines).

Recently, multiple machine learning algorithms have been designed to determine these 5 personality traits from texts have surfaced, including IBM Watson personality insights. For this article I chose to use the personality recognizer written by Francois Mairesse, and automate personality detection of New York Times articles using HDF 3.1 and HDP 3.0.

Solution Overview

The solution put in place uses 3 main elements:

  • A NiFi flow to orchestrate data ingestion from API, personality detection and storage to Hive
  • Hive to store the results of the personality detection
  • Zeppelin for visualization of the results

The figure below gives an overview of the solution flow:

More precisely, the solution can be dissected in 5 main steps, that I’m describing in details below:

  • Step 1: Retrieving data from New York Times API
  • Step 2: Scrape HTML article data
  • Step 3: Run machine learning models for personality detection
  • Step 4: Store results to Hive
  • Step 5: Create simple Zeppelin notebook

Step 1: Retrieving data from New York Times API

Obtaining an API Key

This step is very straight forward. Go to https://developer.nytimes.com/signup and sign-up for a key:

Note: The New-York Times API is for non-commercial use only. I could have of course used any news API, but I’m not creative.

Configuring InvokeHTTP

The InvokeHTTP is used here with all default parameters, except for the URL. Here are some key configuration items and a screenshot of the Processor configuration:

  • HTTP Method: GET
  • Remote URL: http://api.nytimes.com/svc/search/v2/articlesearch.json?fq=source:(“The New York Times”)&page=0&sort=newest&fl=web_url,snippet,headline,pub_date,document_type,news_desk,byline&api-key=[YOUR_KEY] (This URL selects article from the New York Times as a source, and only selects some of the fields I am interested in: web_url,snippet,headline,pub_date,document_type,news_desk,byline).
  • Content-Type: ${mime.type}
  • Run Schedule: 5 mins (could be set to a little more, I’m not sure the frequency at which new articles are published)

Extracting results from Invoke HTTP response

The API call parameter page=0 returns results 0-9; for this exercise, I’m only interested in the latest article, so I setup an evaluateJSONpath processor take care of that, as you can see below:

A few important points here:

  • The destination is set to flowfile-attribute because we are going to re-use these attributes later in the flow
  • I’m expecting the API to change after some time this article is published. Just to make sure that the JSON paths are good for your version of the API, I recommend JSON paths evaluators online.

Massage data to avoid conflicts when inserting to Hive

This step is definitely not optimized. The point here is to escape the special characters to avoid errors when inserting into hive. The only thing I am doing here is removing the ‘ from the snippet as you can see, but it would deserve a second path I think:

Step 2: Scrape HTML article data

Once we retrieved the meta data of the article, we must obtain the actual text of the article. For this, I’m using boilerpipe, an open source boilerplate removal and fulltext extraction from HTML pages (see reference for details).

Create a simple Java class to call boilerplate

After downloading the boilerpipe jars (using http://www.java2s.com/Code/Jar/b/Downloadboilerpipe120jar.htm), use your favorite Java IDE and create this simple class:

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import java.net.MalformedURLException;
import java.net.URL;
public class extractArticle {
public static void main (String args[]) throws MalformedURLException, BoilerpipeProcessingException {
if(args.length == 1) {
URL url = new URL("" + args[0]);
String text = ArticleExtractor.INSTANCE.getText(url);
System.out.println(text);
} else {
System.out.println("Please Specify URL");
}
}
}

Once tested, create an executable jar (in my case extractArticle.jar).

Transfer jars to nifi server

Connect to your nifi server with your nifi user and create the following directory structure:

$ cd /home/nifi
$ mkdir extractArticle
$ cd extractArticle
$ mkdir lib

Transfer the following libraries to ~/extractArticle/lib/ :

  • xerces-2.9.1.jar
  • nekohtml-1.9.13.jar
  • boilerpipe-sources-1.2.0.jar
  • boilerpipe-javadoc-1.2.0.jar
  • boilerpipe-demo-1.2.0.jar
  • boilerpipe-1.2.0.jar
  • extractArticle.jar

Create a simple Unix script to execute HTML scraping

Under ~/extractArticle/ create the script extract_article.sh as follows:

#!/bin/bash
JDK_PATH=/usr
LIB1=./lib/xerces-2.9.1.jar
LIB2=./lib/nekohtml-1.9.13.jar
LIB3=./lib/boilerpipe-sources-1.2.0.jar
LIB4=./lib/boilerpipe-javadoc-1.2.0.jar
LIB5=./lib/boilerpipe-demo-1.2.0.jar
LIB6=./lib/boilerpipe-1.2.0.jar
LIB7=./lib/extractArticle.jar
LIBS=$LIB1:$LIB2:$LIB3:$LIB4:$LIB5:$LIB6:$LIB7
$JDK_PATH/bin/java -Xmx512m -classpath $LIBS extractArticle $*

Configure ExecuteStreamCommand processor

Configure the processor to pass the URL in argument and outputting the output stream to the next processor, as follows:

Step 3: Run machine learning models for personality detection

Setup PersonalityRecognizer on NiFi server

Just as for boilerpipe, we’re going to run an ExecuteStream command. To prepare the files, run the following commands:

$ cd /home/nifi
$ wget http://farm2.user.srcf.net/research/personality/recognizer-1.0.3.tar.gz recognizer-1.0.3.tar.gz
$ tar -xvf recognizer-1.0.3.tar.gz
$ cd PersonalityRecognizer
$ mkdir texts

Modify the file PersonalityRecognizer.properties as follows:

##################################################
# Configuration File of the Personality Recognizer
##################################################
# All variables should be modified according to your
# directory structure
# Warning: for Windows paths, backslashes need to be
# doubled, e.g. c:\\Program Files\\Recognizer
# Root directory of the application
appDir = /home/nifi/PersonalityRecognizer
# Path to the LIWC dictionary file (LIWC.CAT)
liwcCatFile = ./lib/LIWC.CAT
# Path to the MRC Psycholinguistic Database file (mrc2.dct)
mrcPath = ./ext/mrc2.dct

Modify the script PersonalityRecognizer as follows:

#! /bin/bash -
# ENVIRONMENT VARIABLES
JDK_PATH=/usr
WEKA=./ext/weka-3-4/weka.jar
# ----------------------------------
COMMONS_CLI=./lib/commons-cli-1.0.jar
MRC=./lib/jmrc.jar
LIBS=.:$WEKA:$COMMONS_CLI:$MRC:bin/
$JDK_PATH/bin/java -Xmx512m -classpath $LIBS recognizer.PersonalityRecognizer $*

Finally, create a wrapper script that, using the latest file from the folder texts runs PersonalityRecognizer and outputs only the results in a json format:

#!/bin/bash
text=`ls -t texts/ | head -1`
./PersonalityRecognizer -i ./texts/$text > tmp.txt
extraversion=`cat tmp.txt | grep extraversion | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
emotional_stability=`cat tmp.txt | grep emotional | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
agreeableness=`cat tmp.txt | grep agreeableness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
conscientiousness=`cat tmp.txt | grep conscientiousness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
openness_to_experience=`cat tmp.txt | grep openness | grep -Eo '[+-]?[0-9]+([.][0-9]+)?'`
json_output="{\"web_url\" : \"$1\", \"extraversion\" : \"$extraversion\",\"emotional_stability\" : \"$emotional_stability\",\"agreeableness\" : \"$agreeableness\",\"conscientiousness\" : \"$conscientiousness\",\"openness_to_experience\" : \"$openness_to_experience\"}"
echo $json_output
rm tmp.txt texts/*

Configure PutFile processor to create article file

This processor takes the output stream of the HTML scraping to create a file, under the appropriate folder, as shown below:

Configure the ExecuteStreamCommand Processor

Just as for HTML scraping, configure the processor to pass the URL in argument and outputting the output stream to the next processor, as follows:

Extract attributes from JSON output

Using EvaluateJSONPath, retrieve the results of the PersonalityRecognizer to attributes:

Step 4: Store results to Hive

Create Hive DB and tables

Because we don’t control wether we receive the same article twice from the New York Times API, we need to make sure that we don’t insert the same data twice into Hive (i.e. upsert data into Hive). Upsert can be implemented by two tables and the merge command.

Therefore connect to your hive server and create one database and two tables as follows:

CREATE DATABASE personality_detection;
use personality_detection;
CREATE TABLE text_evaluation (
web_url String,
snippet String,
byline String,
pub_date date,
headline String,
document_type String,
news_desk String,
last_updated String,
extraversion decimal(10,4),
emotional_stability decimal(10,4),
agreeableness decimal(10,4),
conscientiousness decimal(10,4),
openness_to_experience decimal(10,4)
)
clustered by (web_url) into 2 buckets stored as orc
tblproperties("transactional"="true");

CREATE TABLE all_updates (
web_url String,
snippet String,
byline String,
pub_date date,
headline String,
document_type String,
news_desk String,
last_updated String,
extraversion decimal(10,4),
emotional_stability decimal(10,4),
agreeableness decimal(10,4),
conscientiousness decimal(10,4),
openness_to_experience decimal(10,4)
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");

Create HiveQL script

Using a ReplaceText processor, create the appropriate HiveQL command to be executed to upsert data into your tables from the data collected in the flow.

Code for Replacement Value (note that I remove the timestamp from the pub_date here, because I’m storing it as a date):

use personality_detection;

insert into all_updates values('${web_url}','${snippet}','${byline}','${pub_date:substring(0,10)}','${headline}','${document_type}','${news_desk}','${now()}','${extraversion}','${emotional_stability}','${agreeableness}','${conscientiousness}','${openness_to_experience}');

merge into text_evaluation
using (select distinct web_url, snippet, byline, pub_date, headline, document_type, news_desk, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience from all_updates) all_updates on text_evaluation.web_url = all_updates.web_url
when matched then update set
snippet=all_updates.snippet,
byline=all_updates.byline,
pub_date=all_updates.pub_date,
headline=all_updates.headline,
document_type=all_updates.document_type,
news_desk=all_updates.news_desk,
last_updated=from_unixtime(unix_timestamp()),
extraversion=all_updates.extraversion,
emotional_stability=all_updates.emotional_stability,
agreeableness=all_updates.agreeableness,
conscientiousness=all_updates.conscientiousness,
openness_to_experience=all_updates.openness_to_experience
when not matched then insert
values(all_updates.web_url,all_updates.snippet, all_updates.byline, all_updates.pub_date, all_updates.headline, all_updates.document_type,
all_updates.news_desk, from_unixtime(unix_timestamp()), all_updates.extraversion, all_updates.emotional_stability, all_updates.agreeableness, all_updates.conscientiousness, all_updates.openness_to_experience);

truncate table all_updates;

Processor Overview:

Upsert data to hive

Finally, configure a simple PutHiveQL processor as follows (make sure you configured your HiveConnectionPool beforehand):

Step 5: Create simple Zeppelin notebook

Lastly, after running the NiFi flow for a while, create a simple Zeppelin notebook to show your result. This notebook will use the jdbc interpreter for Hive and run the following query:

%jdbc(hive)
select byline, extraversion, emotional_stability, agreeableness, conscientiousness, openness_to_experience from personality_detection.text_evaluation limit 10

Then, you can play with Zeppelin visualizations to display the average of the big 5 by byline:

Conclusion

While being a very simple, this exercise is a good starting point for on-the-wire personality recognition. More importantly, in an age of information overload or even misinformation, having the ability to classifying the psychology of a text on the fly can be extremely useful. I do plan on tinkering with this project, improving performance, optimizing models and ingesting more data, so stay tuned!

Known possible improvements

  • Better control of data retrieval to avoid duplicate flows (depends on API)
  • Better special character replacement for HiveQL command
  • More elegant way to execute data scraping and run personality recognition java classes
  • Additional scraping from article text to remove title, byline, and other unnecessary information from boilerpipe output
  • More thorough testing of different personality recognizer models (and use other/more recent libraries)

References

  • Big Five personality traits: https://en.wikipedia.org/wiki/Big_Five_personality_traits
  • The Big-Five Trait Taxonomy: History, Measurement, and Theoretical Perspectives: http://moityca.com.br/pdfs/bigfive_john.pdf
  • IBM Watson Personality Insights: https://personality-insights-demo.ng.bluemix.net/
  • Personality Recognizer by Francois Mairesse: http://farm2.user.srcf.net/research/personality/recognizer
  • NYT API sign up: https://developer.nytimes.com/signup
  • NYT API FAQ: https://developer.nytimes.com/faq
  • NYT Article Search readme: https://developer.nytimes.com/article_search_v2.json#/README
  • JSON Path Evaluator: http://jsonpath.com/
  • Boilerpipe jar download: http://www.java2s.com/Code/Jar/b/Downloadboilerpipe120jar.htm
  • Boilerpipe github: https://github.com/kohlschutter/boilerpipe

Making data analytics operational

by paul 0 Comments
Making data analytics operational
I refuse to use the term-that-should-not-be-used when describing stale data lakes.

After 6 months of silence, I finally take the time to get back behind my keyboard. I would like to say that I used these 6 months to reflect upon my writing, the current data market and came out of this hiatus a better, more informed and well versed person, but that would be a lie. And, despite the current pace at which the social fabric of our society is moving towards considering lies as acceptable and moral, I prefer not to. I don’t really know why I stopped writing for a bit, but most likely because I had nothing to say. So today, brace yourselves for a semi-informed opinion piece on data analytics, because I actually changed my opinion a bit on it through real-life experience.

My opinion then: analytics are a fringe use case of data management

In my article “Why data driven companies should stop investing in data analytics” I argued for the death of dashboards. I still stand by that point of view, as too often the Business Intelligence (BI) platforms are an end point of the data life cycle. Countless data replication processes, ETL, busses and other goldengate push data into data warehouses or data lakes where data scientists pat themselves on the back by showing dashboards that could potentially contain information to be integrated in the current business processes. Quick aside and nugget of knowledge from my PhD friends: if your title contains “science” in it, you’re not a real scientist. Shots fired. Moving on, while I still stand knee deep in stale data lakes despite being on my soapbox, there is one thing I did not consider enough: Machine Learning algorithms. There are two main reasons why the existence of machine learning algorithms as they are implemented now changes my opinion. First and foremost, the problem I describe of BI being the end of the data chain and its outcome only being driven by humans trying to improve business process can be alleviated with analytics automation via these algorithms (to some extent at the moment, but will be more and more true as the technology progresses). Secondly, ML needs access to data lakes, not operational big data. The algorithms need to be able to train using any data sets, looking at data from any angle in order to make usable predictions.

My opinion now: analytics need to be better integrated in the data life cycle

Consequently, here my proposal to the data world. We need to envision an architecture where data warehouses are not the raiders of the lost ark type but more the amazon type: they need to be an inherent part of the data life cycle. Drilling a bit further in the architecture I contemplate, your data as a service layer would feed current data sets to your data warehouse, where ML would run asynchronously, but the outcome of these analytics would then feed back the rules of data manipulation embedded in your DaaS layer. If you manage a constant feedback loop of the kind, your end user application served by your DaaS will constantly get fed more accurate and relevant data, which in turn can enable the next generation of platforms: Information as a Service. But that’s for another day.

Why data driven companies should stop investing in data analytics

by paul 0 Comments
Why data driven companies should stop investing in data analytics

Despite the blatantly click-baitish and oxymoronic nature of the title of this post might lead you to believe, I actually have a point that I think is worth expressing (which I suppose is the reason I write any post, really). As it is often the case, this point is coming from an accumulation of real-life situations I encounter at my job. A lot of my work consist in rapidly enabling access to data to which companies have a really hard time accessing and distributing it in a very efficient big data architecture. The funny thing is, when I go to a customer and tell them that they will get access to all this data within a few days timeframe, they are either unprepared as to what to do with this data. Indeed, they (and we as an industry) have been focusing so much on the integration and exposure of data that we forgot to think about the value it can bring us. What’s the use case everyone think of when their data access is enabled? Analytics. “We’ll use [Name your favorite BI tool] to build dashboards and gain insight. This, to me, is incomplete and extremely short-sighted.

How analytics became synonym with Big Data

Before I dive into the reasons I suddenly want to make a good chunk of the big data industry hate me, let me try to express how we got to equate the term big data, or data in general with Business Intelligence or analytics. Note that I am trying to describe trends here; I am aware that there are outliers to these trends. With this said, and now that I have a license to express completely unverified facts since I apologized in advance, here is what I observed. When databases were first created, they were an enablement layer for software. We just needed a way to store data more efficiently in order for you application to run faster. Eventually, data changed from being a necessary evil to being a source of value. Indeed, once we realized that every activity that we engage in involves a piece of software, the data piece entrenched in each and every of these pieces of software became the best way to understand our own (and our users’) behavior. This is the advent of data warehouses, and BI tools. Then we realized there was a lot of data (I think the technical term is a shit ton), so we started developing big data lakes. In this transition, big data primary use became what data warehouses were used for: data analytics.

The death of dashboards

However, data analytics is only a very small percentage of the data use cases. Remember, data layers were design to enable applications, not showing you what they contain. Yes, dashboards and graphics are pretty but what is their goal? Their goal is to give you an idea on what whoever interacts with your software is doing in order to design solutions to palliate to the problems you find. Somehow however, analytics became a finality. Companies spend an insane amount of investment to achieve data analytics. This is extremely misguided. To be fair, part of the reason why analytics are a finality is a product of the limitation of data lakes, but that’s another topic.

Using data pro-actively

Solving the data integration, consolidation, distribution and exposure problem is not easy but it is being solved (I can tell that with confidence since I am on the front of that battle every day, though I would not say I put my life in danger every day, so that battle analogy stops here). My advice is to think beyond analyzing the data as a use case once you are able to have access to it. Instead of trying to identify trends, think about how to change them. Instead of trying to build an individualized snapshot of your customer, think about what action you should take based on that snapshot. Instead of getting a consolidated view of all of your systems, think about how to better orchestrate data flows between these systems to minimize the need for consolidation. I am purposely not listing specific examples, because they require a deep industry expertise which is not what I am trying to highlight here (my expertise being in data, not a specific industry). So, next time you are confronted with someone building a dashboard, ask yourself: why? why am building BI on top of my data? is BI going to give me insight on a problem to solve? if so, what is that problem? Once you have the answer to that question, try to build a platform that identifies and solves that problem rather than a platform that only allows you to identify it.

The Big Data market in 2017

by paul 0 Comments
The Big Data market in 2017
Buildings always look badass from the ground and with a black and white filter

Accepting reality is not trivial. As we sit in our echo-chambers, particularly exacerbated by our social networks, preference algorithms and suggested searches, our cognitive biases betray our picture of the world around us. Add that to the fact that everyone else than us is an idiot with whom we should not engage in a conversation (or if you prefer, call this mild social anxiety), and pretty soon you can convince yourself of anything. With this in mind, I decided to take the time today to expose my understanding of the reality of the big data market in 2017, for large entreprises. While it is inherently biased, arguably like any piece of writing, I did try to do my research reading a lot of white papers recently (of which you can find some reference below), but mostly, this is my domain of expertise which means I’m confronted to it every day. Of course, this categorization is subject to discussion and constructive criticism that I always welcome.

Out, or very little relevancy

  • Data Lakes: The fascination for unlimited data distribution has passed. Enterprises struggle to find a use to their data lakes and the layers written on top of them to make them useful seem too much effort for little reward.
  • Pure data analytics: The terms Data Analytics or Business Intelligence encapsulate a vast number of concepts that will always be useful one way or another in our data driven world. What is nowadays losing momentum is solutions making analytics the end goal (analyzing trends, population sub group preferences, etc.). BI is a very small portion of what Big Data offers and if the end goal of a solution is to give you trend analytics, it is too reductive.
  • SOA, ESBs, Convergent applications: This is been dead for a while but worth mentioning. The idea of a single convergent enterprise solution to encapsulate all data and functionalities is practically not feasible (too much market change, too much cost, too much complexity, too little agility).

Extremely relevant for 2017

  • Agile data platforms: At the opposite end of ESBs and massive consolidation into one data system is the micro-services architecture. The architecture enables extremely rapid and agile deployment of applications to respond to an ever changing market where end customers have more choices than ever and thus are very hard to retain. The bottleneck of micro-services architectures is often data. Being able to consolidate, cleanse and expose rapidly data to anywhere is a complicated proposition but some platforms can do it. If a platform is able to integrate from multiple sources, consolidate and expose data rapidly, then it enables today’s hottest use cases: digital transformation, agile test data management, micro-services implementation, and more.
  • Cloud¬†enablement: More than ever, and despite previous reticense vis-a-vis security (just like older generations are still reluctant to use their credit card numbers on the web), the movement to cloud applications and platforms. Enabling the cloud, that is not only exposing/migrating data to cloud applications but also ensuring security, compliance and control over the data exposed is therefore a very important market trend.
  • Data Personalization: we live in a world where everyone expects their experience to be catered to them. Having to repeat your identity while being tossed from department to department on a help desk line is one of the most infuriating experience (after the complete loss of human rights and dignity one experiences in an airport). Seriously though, enabling the understanding of the individual is crucial, whether that individual is a person, a product or a machine in IoT use cases.

Not quite there yet

  • Predictive Individual Analytics: We already see some implementations of this in ad personalization or preference settings, but being able to predict what an entity (a person, a machine, a car, a product) will do, what it wants and needs is going to open the door to systems that give answer instead of respond to questions. It requires the problem of data personalization to be solved beforehand though.
  • Smart Data Discovery: Once agile data platforms are in place, the use-case of automated data mining will explode. Too many systems with too little experts on the systems will give birth to solution that’ll enable the enterprise to recover a fair percentage of the relevant data without human intervention.
  • Expert AI systems: Finally, and most exciting of all are expert AI systems. These are software that will replace the way that data is currently fed to our everyday software (CRM, machine monitoring, marketing analytics, etc.). The use cases are still not clear in my head but I know that finding the point where human intervention is the most costly (where it requires pattern recognition), and replacing it with automated AI will be a game changer.

Some references

  • Is the Cloud Secure? (Gartner): link
  • Marketing data management (Ascend2): link
  • Seizing the Digital Advantage in Banking and Financial Services (Cognizant): link
  • The Big data workbook (Informatica): link
  • Agile Test Data Management: The New Must-Have (Forrester): link

Data As A Microservice: the future of data architecture

by paul 0 Comments
Data As A Microservice: the future of data architecture

Let me preface this article with an understatement: sometimes, enterprise architecture can be complicated. Large companies run thousands of applications, multiplied by dozens of environments replicated for testing, user testing, sandboxing, accumulated over years of acquisitions, re-architecturing (yes, it is a word I made up), and experiments all with the purpose of driving business forward. Like any complex system, human beings have been trying to make sense out of it by conceptualizing models and architectures aimed at simplifying the system thus making it more efficient, robust, scalable, secure, and spiritually vertuous (OK, maybe not the last part, although can a piece of software be inherently virtuous? A question for another day). With all this in mind, I would like to take some time to reflect on one of these concepts: Micro-Services and how this concept can apply in the realm of data management.

Microservices VS Enterprise Service Bus

First introduced in 2011 during workshop of software architects held near Venice in May 2011, Microservice Architecture is defined by James Lewis as follows:

The term “Microservice Architecture” has sprung up over the last few years to describe a particular way of designing software applications as suites of independently deployable services. While there is no precise definition of this architectural style, there are certain common characteristics around organization around business capability, automated deployment, intelligence in the endpoints, and decentralized control of languages and data.

Microservice architecture is a subset of a Service Oriented Architecture (SOA), aiming at distributing microcomponents to deploy applications as opposed to a centralized application integraiton layer, often called Enterprise Integration Layer (EAI) or Enterprise Service Bus (ESB). Leaving aside the obvious angry developer argument stating that all of this is marketing jargon and rebranding of the same products, it’s interesting to take note of a fundamental trend I covered before in this blog: enterprise are looking to implement agile enviroment, which extremely granular elements in order to ensure business reactivity. The dream of the all-integrated all-consolidated entreprise layer is fading.

Data As A Microservice

In a very similar manner, the idea of single source of truth containing all the enterprise data is coming to an end. And, unlike some data lakes proponents would like to make you believe, it is not because of the pitfalls of traditional technologies that can’t handle large volume of data or distribute it efficiently. Building a single centralized source of data is an utopia. Instead, companies are now shifting their focus towards platforms enabling rapid agnostic data integration, agile data schema modification, and complete distribution. These platforms can then be used in a microservice architecture, making them Data As A Microservice platforms. I’ll admit, I may have made that term up too because it sounds cool, but it is very important to note for you data vendor, data scientist or data consumer (CIOs and CTOs organization). The future of data microservice-like agility, not monolithic unification.

References:

http://martinfowler.com/articles/microservices.html
http://stackoverflow.com/questions/25501098/difference-between-microservices-architecture-and-soa
https://www.voxxed.com/blog/2015/01/good-microservices-architectures-death-enterprise-service-bus-part-one/
https://en.wikipedia.org/wiki/Microservices

What is the most underrated aspect of software development and why is it measurability?

by paul 0 Comments
What is the most underrated aspect of software development and why is it measurability?

Designing and developing software is complicated. I have heard there might even be a full industry gathering experts in this domain, and that it could be doing well. Not sure if it will ever be a thing. All joking aside, theories about the optimum way to approach software development are numerous and constantly evolving, which is excellent. Today however, I want to talk about an underrated concept, especially within the realm of software development: measurability. Despite online dictionaries results, I’m pretty sure I just made up that word, or at least the concept attached to it vis-a-vis software development, so let me define it.

What do you mean by measurability and why should I care about it?

Within the realm of software development, measurability can be catalogued in the same category as other transversal high-level concepts, that must be considered at each and every step of the development process, such as user experience, performance, scalability, re-usability and security. Measurability in this sense is the idea that each and every feature of you develop for your software can be measured for popularity and efficacy in order to ultimately evaluate its necessity. That is a lot of y-ending words, which should have convinced you already. Hoping it didn’t, let me explain you why it is important to consider. First, I believe that the importance of these types of high-level concepts does not need further justification: we have all witnessed software failures when their development ignored one of these key concepts, security being the one making the front page most often. The impact of measurability is more subtle but nonetheless crucial. Without measurability, decisions you make about feature prioritization or design become irrational. For instance, if you are developing an API that contains multiple methods of access, if you are unable to measure their popularity or efficacy you will end up with either features that are being costly maintained for no benefits to your end user or features that are massively used by necessity but incrementally building your end user’s frustration. This is a very simple example but it illustrate an underlying notion that we rarely see in the world of zeros and one: irrationality. Indeed, a piece of software is usually extremely rational and quantifiable, which makes evaluating performance, scalability, security or even re-usability a relatively easy mathematical problem. With the advent of software popularization we see user experience has been on the forefront of Agile development, making customer feedback a key piece of feature release. What I am proposing here is to go one step further. Whenever developing a feature for your software, one should ask himself: how will I know if this feature is necessary or not? How will I test for it?

Implementing measurability

Implementing measurability acknowledges the fact that you are operating in an uncertain environment, which inherently makes its implementation uncertain. That being said, a good starting point is to measure its use and performance and then compare it to the other features you develop. This measurement and analysis can be done using trace or audit mechanisms, which, bonus, you should implement anyway to cater to security. A more robust approach would be to first select the metrics you want to measure for each software feature and have a dedicated module to implement measurability over those metrics. You may think it’s an overkill but with the advent of scalable and cheap storage, why not do it?

Beyond software development

Big Data, monitoring, analysis data science, all of these concepts are design to increase the world’s measurability, and they are definitely what everyone talks about now. And while the idea of being data driven in any aspect of our lives, from corporate management to personal fitness, it has yet to really make an impact within the realm of software development, or at least the tools dedicated to measurability only are scarce. That being said, making rational decisions does not seem to be as appealing to me as it is for the rest of the world, which could explain this scarcity.

Who decided that stored procedures should not be commented?

by paul 0 Comments
Who decided that stored procedures should not be commented?

I’ve been spending the past couple of weeks working on stored procedures. Glimpsing into my career so far, I realize how much stored procedures are the backbone of many organizations dealing with data. Stored procedures are something of a potpourri between magic behavior, bespoke black boxes, and the sedimentation of code layers accumulated over years of feature additions implemented by a battalion of sometimes well-intented PL/SQL programmers with tight deadlines. Furthermore, stored procedures, more than any other type of data manipulation, are what the actual live production systems rely upon. It is not uncommon for a piece of software to have hundreds of store procedures essential for it to work, and for good reason. Indeed, store procedures are extremely efficient. So much so that even unoptimized pieces of code harboring redundant test and an unreasonable amount of nested outer joins still run in a few milliseconds. Efficient they are. But you know what they are not? Commented. Seriously, the packages I worked with recently contain tens of thousands of lines of code but never contain more than 10 lines of comments, mostly containing something along the lines of “– 10/10/2014 added by Jay” or “– requirement R3045”. And as far as I can remember, relying solely on my flawed memory and anecdotal evidence, this is the case with the vast majority of stored procs. Therefore, after I spending some time curved into a ball crying, I asked myself: “why?”.

Common consensus about commenting code

Childishly, I first assumed that every piece of code should be commented, and the only reason for not commenting code would be laziness/lack of time/lack of understanding/hatred for whomever would read your code in the future. I was obviously misguided as one often is when assuming anything to be simple. Indeed, they are many times when commenting renders your code less readable, or is an excuse for bad coding. This article in particular, Common Excuses Used To Comment Code and What To Do About Them does an excellent job at highlighting when commenting is sub-optimal:

The code is not readable without comments. Or, when someone (possibly myself) revisits the code, the comments will make it clear as to what the code does. The code makes it clear what the code does. In almost all cases, you can choose better variable names and keep all code in a method at the same level of abstraction to make is easy to read without comments.

    We want to keep track of who changed what and when it was changed. Version control does this quite well (along with a ton of other benefits), and it only takes a few minutes to set up. Besides, does this ever work? (And how would you know?)
    I wanted to keep a commented-out section of code there in case I need it again. Again, version control systems will keep the code in a prior revision for you – just go back and find it if you ever need it again. Unless you’re commenting out the code temporarily to verify some behavior (or debug), I don’t buy into this either. If it stays commented out, just remove it.
    The code too complex to understand without comments. I used to think this case was a lot more common than it really is. But truthfully, it is extremely rare. Your code is probably just bad, and hard to understand. Re-write it so that’s no longer the case.
    Markers to easily find sections of code. I’ll admit that sometimes I still do this. But I’m not proud of it. What’s keeping us from making our files, classes, and functions more cohesive (and thus, likely to be smaller)? IDEs normally provide easy navigation to classes and methods, so there’s really no need to scan for comments to identify an area you want to work in. Just keep the logical sections of your code small and cohesive, and you won’t need these clutterful comments.
    Natural language is easier to read than code. But it’s not as precise. Besides, you’re a programmer, you ought not have trouble reading programs. If you do, it’s likely you haven’t made it simple enough, and what you really think is that the code is too complex to understand without comments.

Why this consensus does not apply to stored procedures

As much as these arguments make sense I don’t think they apply to store procedures:

    “you can choose better variable names and keep all code in a method at the same level of abstraction”: You can’t easily change table fields and names, nor can you cut a big nested SQL statement gracefully.
    “Version control does this quite well”: Version control is almost never implemented for stored procedures.
    “I wanted to keep a commented-out section of code there in case I need it again.”: OK, that’s just BS.
    “[complex code] is extremely rare.”: Nested SQL queries are inherently complex and MUCH less readable than traditional code.
    “Markers to easily find sections of code.”: I never saw a problem with that.
    “you ought not have trouble reading programs”: Except queries are the opposite of natural language. Please, please, please SQL developer, let me know why you are doing this particular join

To summarize, I still don’t understand why stored procedures are generally not commented, while it would seem they are the type of code that could benefit the most from comments. Maybe NoSQL will change this, but in the meantime, I will start this crusade, and make sure people explain their code, yo!

Choosing the correct DBMS – Gartner Reprint review

by paul 0 Comments
Choosing the correct DBMS – Gartner Reprint review
What do you mean my speed isn't linearly scalable?

If you haven’t had the chance to look at it yet, I encourage you to read this gartner reprint: Critical Capabilities for Operational Database Management Systems. This report is extremely interesting if you’re into data at all. I’m in too deep, so I’m going to talk about this report for the whole length of this marvelous post.

The importance of specific use-cases

The first thing that jumped out to me, was that Gartner uses the word DBMS Gartner, thus highlights the fact that the dichotomy between traditional relational database management systems and what has been labelled “NoSQL” is fading out. Instead, Gartner advises to “Classify the use cases under consideration and map them to the costs, deployment options and skills requirements of the products evaluated here.”. This is extremely important and a departure from some of the preconceptions I witness amongst my fellow professionals. Often enough, I get confronted with consultants trying to categorize DBMS by capabilities (distribution capabilities, support of languages, etc.). More importantly, these platforms are marketed through those capabilities. As I argued before, and confirmed by this report, end users, consultants and software provides alike should select, recommend and market according to the use-cases at which they excel instead of the capabilities inherent to a specific platform (see previous post: The importance of specialization in software sales).

Evaluation criteria

But enough surrendering to my own confirmation biases in order to pat myself on the back and delude myself into thinking that my observations may be going in the same direction as Gartner’s. The report evaluates different vendors (selected themselves base on the set of Inclusion Criteria), using the following criteria:

  • High-Speed Ingest and Processing
  • ACID Support
  • Tunable Consistency
  • Multimodel Support
  • Automated Data Distribution
  • Cloud/Hybrid Deployment
  • Programmability for HTAP
  • Administration and Management
  • Security
  • These criteria are then weighted according to four different use cases:

  • Traditional Transactions
  • Distributed Variable Data
  • Lightweight Events and Observations
  • Hybrid Transactional/Analytical Processing (HTAP)
  • I’m not going to spend time describing the criteria, Gartner put up very readable charts to compare the different vendors. In short, it seems that Oracle is leading the traditional transaction world while DataStax is leading the distributed one. On a personal note, I’m super excited for DataStax, I get to work with many members of their team and the company I am working for leverages their solution, so it’s excellent recognition.

    I would however perhaps have had another two criteria: integration ecosystem and cost. Regarding the latter, I would have created two sets of charts: one considering the cost and one not considering it. Of course, I understand cost is a delicate and fluctuant subject and I understand Gartner’s decision. Integration ecosystem however is very important. Being able to evaluate how easy it is to integrate and use data once it is in these DBMS is extremely important when making an architecture choice.

    Personal Conclusion

    I’m always impressed by the conciseness of Gartner reports. This one does not fail in that regard, and gives a very good basis to one evaluating data management systems. That being said, and to make sure that horse is dead, think of your use-case before going for an RFP. Many DBMS can do many things, but few excel at all use-cases.