paul vidal - pragmatic big data nerd

True cloud automation for large organizations: a case of leading by example.

by paul
If you're reading this Ashish, this is for you

Introduction

As a solution engineer, it is my mandate to advise large IT organizations on how best to leverage the tools available in the market to empower their users to be autonomous. In the world of big data, this has many aspects, from defining business processes to ensuring security and governance. The key aspect, however, is the ability to “shift to the left”, meaning that end users are in control of the infrastructure necessary for their job.

Ultimately this is the unicorn every organization wants: a highly guarded infrastructure that feels like a plain of freedom to the end user.

Some organizations choose a single vendor (e.g. a cloud vendor) to implement this unicorn vision, but soon realize (or soon will) that limiting themselves to one vendor not only restricts the playing field of end users; it paradoxically makes consistent governance and security harder to implement (because of vendor limitations and a tendency to restrict cross-platform compatibility).

The solution is what I advise my customers every day: building the backbone of an organization on open, interoperable standards that allow growth while maintaining consistency. In today’s world, that means the hybrid cloud approach.

When I joined Hortonworks (now Cloudera) a few months ago, I was impressed by the work done by a few Solution Engineers (Dan Chaffelson and Vadim Vaks). They had built an automation tool called Whoville that allows any Solution Engineer in the organization to spin up their own demo cluster and load whatever demo they want. I have led a Pre-Sales organization before: this is the dream. Keeping demos consistent across every member of the organization while empowering the herd of cats that Solution Engineers are is no easy feat. Not only that, we were eating our own dog food, leveraging our tools to instantiate what we had been preaching!

I’m still in awe of the intelligence and competency of the team I just joined. But, being as lazy as I am, I decided to try and build an even easier button for Whoville which would:
1. Empower the rest of our organization even more
2. Show true thought leadership to our customers

So, with the help of my good friend Josiah Goodson and the support of the Solution Engineering team (which allowed us to work on this during a hackathon, for instance), we built Cloudbreak Cuisine.

Introducing Cloudbreak Cuisine

Cuisine is a fully featured application, running in containers, that in its alpha version allows Solution Engineers to:
1. Get access to the Whoville library, and deploy and delete demo clusters
2. Monitor cluster deployments via the Cloudbreak API
3. Create their own combination of Cloudbreak Blueprints & Recipes (a.k.a. bundles) for their own demos
4. Push those bundles to Cloudbreak

Below is a high level architecture of Cuisine:

Cloudbreak Cuisine Architecture

The deployment of the tool is automated, but requires access to Whoville (restricted within our own organization). All details can be found here: https://github.com/josiahg/cloudbreak-cuisine

The couple of videos below showcase what Cuisine can do, in its alpha version.

Glossary: Cuisine Bundles are a combination of Cloudbreak Blueprints & Recipes.

Features

Push a bundle from Whoville

[Video: Push Bundle]

Delete a Bundle via Cloudbreak

[Video: Delete Bundle]

Add Custom Recipe

[Video: Add Recipe]

Create Custom Bundle

[Video: Create Bundle]

Download/Push Bundle

[Video: Download/Push Bundle]

Additional tips/tricks

[Video: Tips & Tricks]

Parting thoughts

First and foremost: thank you. Thank you Dan and Vadim for Whoville, thank you Josiah for your continuous help, and thank you Hortonworks/Cloudera for allowing us to demonstrate such thought leadership.

Secondly, if you’re a Cloudera Solution Engineer, test it and let me know what you think!

Finally, for every other reader out there: this is what an open platform can do for your organization. It truly allows you to leverage any piece of data or any infrastructure available to you, from EDGE to AI :)

Use Altus Director to launch a Cloudera Data Science Workbench cluster on AWS

by paul

My blood recently turned from green to blue (after the Hortonworks-Cloudera merger) and I couldn’t be more excited to play with new toys. What I am particularly excited about is Cloudera Data Science Workbench. But, like in everything I do, I am very lazy. So here is a quick tutorial to install Altus Director, and use it to deploy a CDH 5.15 + CDSW cluster.

Step 1: Install Altus Director

There are many ways to do that, but the one I chose was the AWS install, detailed here: https://www.cloudera.com/documentation/director/latest/topics/director_aws_setup_client.html

The installation documentation is very well done, but here are the important excerpts.

Create a VPC for your Altus instance

Follow the documentation.

A few important points:

  • In the name of laziness, I also recommend adding a rule opening ports 0-65535 to your personal IP.
  • Your VPC should have an internet gateway associated with it (you could do without one, but you would then have to pull the CM/CDH software down manually and build internal repositories within your subnet).
  • Do not forget to open all traffic to your security group as described here. Your deployment will not work otherwise.

Launch a Red Hat 7.3 instance

You can either search the community AMIs, or use this one: ami-6871a115

Install Altus

Connect to your ec2 instance:

ssh -i your_file.pem ec2-user@your_instance_ip

Install JDK and wget

sudo yum install java-1.8.0-openjdk
sudo yum install wget

Install/Start Altus server and client:

cd /etc/yum.repos.d/
sudo wget "http://archive.cloudera.com/director6/6.1/redhat7/cloudera-director.repo"
sudo yum install cloudera-director-server cloudera-director-client
sudo service cloudera-director-server start
sudo systemctl disable firewalld
sudo systemctl stop firewalld

Connect to Altus Director

Go to http://your_instance_ip:7189/ and connect with admin/admin
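
The server may take a little while to answer after the service starts. Here is a minimal sketch, assuming curl is available on the machine you connect from, that polls the UI until it responds. The function names are hypothetical; only the port 7189 comes from the step above:

```shell
# Hypothetical helper names; only the port (7189) comes from the tutorial.

# Build the Director UI URL for a given host or IP.
director_url() {
  printf 'http://%s:7189/\n' "$1"
}

# Poll the UI until it answers, or give up after a number of attempts.
# Usage: wait_for_director <host> [attempts] [delay_seconds]
wait_for_director() {
  local url attempts delay i
  url="$(director_url "$1")"
  attempts="${2:-30}"
  delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s: quiet, -o /dev/null: discard the body, --max-time: never hang
    if curl -s -o /dev/null --max-time 5 "$url"; then
      echo "Director is answering at $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Director did not answer at $url after $attempts attempts" >&2
  return 1
}
```

Calling wait_for_director your_instance_ip returns 0 as soon as the UI answers over HTTP, and 1 if it never does within the given number of attempts.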

Step 2: Modify the Director configuration file

The CDSW cluster configuration can be found here: https://github.com/cloudera/director-scripts/blob/master/configs/aws.cdsw.conf

Modify the configuration file to use:

  • Your AWS accessKeyId/secretAccessKey
  • Your AWS region
  • Your AWS subnetId (same as the one you created for your Director instance)
  • Your AWS securityGroupsIds (same as the one you created for your Director instance)
  • Your private key path (e.g. /home/ec2-user/field.pem)
  • Your AWS image (e.g. ami-6871a115)
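
Before moving on, it can help to check that no template value was left unfilled. The sketch below assumes the template marks such values with the string REPLACE-ME. That marker is an assumption, so check your copy of the file and pass a different marker if needed. The function name is hypothetical:

```shell
# Hypothetical pre-flight check: list the lines of a configuration file
# that still contain a placeholder marker.
# The default marker (REPLACE-ME) is an assumption about the template.
# Usage: find_placeholders <conf_file> [marker]
find_placeholders() {
  local conf marker
  conf="$1"
  marker="${2:-REPLACE-ME}"
  if [ ! -r "$conf" ]; then
    echo "Cannot read $conf" >&2
    return 2
  fi
  # -n: show line numbers, -F: treat the marker as a fixed string
  if grep -n -F -- "$marker" "$conf"; then
    echo "Unfilled values found in $conf" >&2
    return 1
  fi
  echo "No '$marker' placeholders left in $conf"
  return 0
}
```

The function returns 0 when the file is clean, 1 when at least one line still contains the marker (those lines are printed with their line numbers), and 2 when the file cannot be read.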

Step 3: Launch the cluster via director client

Go to the EC2 instance where Director is installed, and copy your modified configuration file and the appropriate private key onto it.

Finally, run the following:

cloudera-director bootstrap-remote your_configuration_file.conf \
--lp.remote.username=admin \
--lp.remote.password=admin

Step 4: Access Cloudera Manager

You can follow the bootstrapping of the cluster either on the command line or in the Director interface. Once it is done, you can connect to Cloudera Manager at: http://your_manager_instance_ip:7180/

Step 5: Configure CDSW domain with your IP

Cloudera Data Science Workbench uses DNS. The correct approach is to set up a wildcard DNS record, as described here.

However, for testing purposes I used nip.io. The only parameter to change is the Cloudera Data Science Workbench Domain: replace the cdsw.my-domain.com value set by the conf file with cdsw.[YOUR_AWS_PUBLIC_IP].nip.io.
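
For illustration, the domain value can be built from the public IP with a small helper. This is only a sketch: the function name is hypothetical, and the check validates nothing more than the general shape of an IPv4 address:

```shell
# Hypothetical helper: build the nip.io-based CDSW domain for a public IPv4.
# Usage: cdsw_nip_domain <public_ipv4>
cdsw_nip_domain() {
  local ip
  ip="$1"
  # Loose shape check only: four dot-separated groups of 1 to 3 digits
  if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "Not an IPv4 address: $ip" >&2
    return 1
  fi
  printf 'cdsw.%s.nip.io\n' "$ip"
}
```

For example, cdsw_nip_domain 203.0.113.10 prints cdsw.203.0.113.10.nip.io. nip.io resolves any name that embeds an IP address to that address, which is why no DNS record has to be created for a test setup.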

Restart the CDSW service, then you should be able to access CDSW by clicking on the CDSW Web UI link. Register for a new account and you will have access to CDSW.

Essential resources on Machine Learning

by paul
"Maybe you should be spending some time learning instead of relying on machines" - Some hipster

I’ve always been fascinated by Artificial Intelligence in science fiction. I’m lucky to live in an era that is seeing the birth of a new kind of Artificial Intelligence, enabled by Big Data, advances in supercomputing and Machine Learning. I even work in a field that gets to implement these kinds of technologies, which continues to excite and fascinate me. Machine Learning is now moving out of the realm of pure research and into real-world applicability. But as with any cutting-edge technology, we need to beware of products that use the words Machine Learning untruthfully in their marketing, and of Machine Learning being presented as the cure for all diseases. Therefore, I think it’s important that we spend some time understanding what Machine Learning is, as well as what it does and can do in the industry. Since I’m not an expert on Machine Learning (… yet), I spent some time gathering resources to enhance your Human Learning about Machine Learning. Happy reading!

Introductions

  • First things first, Wikipedia: link
  • An excellent visual introduction on Machine Learning from R2D3: link
  • An early draft of a Machine Learning book from Stanford University: link
  • Introduction to Machine Learning from Cambridge University: link

Technical courses

  • In-depth videos on Machine Learning, from Data School: link
  • What is Machine Learning, from Data Camp: link
  • Introduction to Machine Learning, from Udacity: link
  • Machine Learning, from Coursera: link

Machine Learning in the market

  • Gartner 2015 Hype Cycle: Big Data is Out, Machine Learning is in: link
  • Gartner 2016 top 10 trends: link
  • Machine Learning, What it is & why it matters, from SAS: link
  • Marketplace for Machine Learning Algorithm, Algorithmia: link
  • The future of Machine Learning, from David Karger on Quora: link

What is a data scientist and do I need one?

by paul

A good friend once told me: “If your profession is not represented by a cartoon animal, your job description is made up and society does not need you”. This is when I realized that my life was a lie and I was condemned to eternal despair. But I digress. On the subject of data scientists, this is a role that has recently been introduced to the marketplace, so I think it’s important to ask ourselves what this role is and who can benefit from a data scientist.

The evolution of data

In recent years, data has experienced profound changes. Not only has the technology behind data storage and management evolved dramatically from standard relational models to distributed solutions, but the place of data in the enterprise and in people’s minds has also changed. Suddenly data is becoming a sexy buzzword instead of being a necessary evil. Indeed, data has become its own entity within business organizations, with entire teams dedicated to it. Companies no longer ask “do I really need to keep this data?” but “how can I make sure that I keep all the data I have?”. With this shift, new roles started to emerge, and this is when “Data Scientists” were introduced.

Buzzword or actual role?

Many argue that data scientists are just a fancier replacement for the role of business/data analysts; “A Data Scientist is a Data Analyst Who Lives in San Francisco”, as you can read in this article (a very good read, I might add). I agree to a certain extent: data scientists are people who dive into software to get results that will ultimately help make business decisions. Company leadership has always relied on this type of analysis from experts called business analysts. Business analysts even use business intelligence software to mine data, generate statistics and guide business solutions, which are some of the principal prerogatives of data scientists.

But I do think there is a fundamental change to be considered: data platforms are now a separate piece of software. Before the advent of big data, software used data layers. Nowadays, you have data lakes, data virtualization layers and real-time data warehousing that are their own entities. Using these platforms requires a combined set of skills: knowing how to use data platforms intimately (skills formerly owned by data administrators) and being able to generate business intelligence data out of them (skills formerly owned by business analysts).

As such, I think that a new designation for this combined set of skills is fair; and it looks like Wikipedia agrees with me by calling data science an interdisciplinary field: “Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).”

Do I need to hire a data science team?

I think that there is a better question to be asked: “what am I doing with my data?”. Don’t get me wrong, the trend of wanting to accumulate as much data as possible is great. It is especially great for me, since I work for a company that provides data management solutions. But I have seen implementations of massive data lakes take months or years and then get very little use, and this is a shame.

New data platforms give businesses a tremendous opportunity. Instead of relying on the wisdom of visionaries or accumulated experience to make difficult business decisions, we get to gather evidence and make an informed decision. But you need to know what you want to know first. Once you do, you can decide which platform is good for you and what type of data scientist you should hire. This will give you much more tangible results than buying a huge data platform, hiring an army of data scientists and doing fundamental data research. OK, I made that last part up, fundamental data research is not a real thing… yet!