It’s All About the Cloud: Insights from the 2019 Spark + AI Summit in San Francisco

12/09/2019 Comments off

Apache Spark plays a critical role in the adoption and evolution of Big Data technologies because it gives enterprises far more sophisticated ways to leverage Big Data than Hadoop does. The amounts of data being analyzed and processed through the framework are massive and continue to push the boundaries of the engine.

Built on the premise that Apache Spark is the only unified analytics engine that combines large-scale data processing with state-of-the-art machine learning and AI algorithms, the 2019 Spark + AI Summit rolled into San Francisco in the last week of April. The event was billed as the world’s largest data and machine learning conference.

This was not a large-scale event, but it was very well attended. Big Data is increasingly important in business operations that rely on massive data to support revenue-generating online applications. Think streaming services, on-demand applications, and business-critical transactional processing for retail, travel, banking, insurance, and healthcare services.
There were plenty of product announcements for Apache Spark, MLflow, and the newest open-source addition, Delta Lake!

Product Announcements

Reynold Xin, Apache Spark PMC member and number-one code committer to Spark, opened the summit by presenting the upcoming work planned for Spark 3.0 (a preview release shipped on Nov 6, 2019), with more than 1,000 improvements, features, and bug fixes, ranging from Hydrogen accelerator-aware scheduling and Spark Graph to Spark on Kubernetes, an ANSI SQL parser, and many more.

He also announced a new release, bringing the Pandas DataFrame API to Spark under a new open-source project called Koalas. Pandas has long been the Python standard for manipulating and analyzing data, particularly for small and medium-sized datasets, and the project opens up a more frictionless progression to large datasets on Spark. With compatible API syntax, data scientists trained on Pandas can now use Koalas to transition easily to working on larger, distributed datasets geared for production environments.
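
As a quick illustration of how close the two APIs are, here is a minimal sketch in Python (the data and column names are hypothetical; it assumes the databricks.koalas package, the distribution under which Koalas was announced):

import pandas as pd
import databricks.koalas as ks

# The familiar Pandas workflow on a small, in-memory dataset.
pdf = pd.DataFrame({"city": ["SF", "NY", "SF"], "sales": [10, 20, 30]})
print(pdf.groupby("city")["sales"].sum())

# The same logic on a Koalas DataFrame, which is backed by Spark and can
# scale out to much larger, distributed datasets with the same syntax.
kdf = ks.from_pandas(pdf)  # or ks.read_csv("s3://bucket/large_sales.csv")
print(kdf.groupby("city")["sales"].sum().to_pandas())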

Ali Ghodsi, CEO and Co-founder of Databricks, announced the open-source release of Delta Lake, a storage layer that increases the reliability and quality of data lakes. Previously, data lakes frequently faced garbage-in-garbage-out issues that made data quality too low for data science and machine learning, resulting in large, expensive “data swamps.” The project brings new features to data lakes, including ACID transactions, schema enforcement, and data time travel, to help ensure data integrity for downstream analytics and projects. Customers who previously used Databricks Delta gave extremely positive feedback on the core problems it solved for them, and Databricks is now open-sourcing the project for the larger community. The ecosystem of Apache Spark, MLflow, and now Delta Lake continues to expand to solve end-to-end data and ML challenges.
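
To get a feel for what those features look like in practice, here is a minimal PySpark sketch of writing a Delta table and reading an older version back via time travel (the path and the toy data are hypothetical, and it assumes the Delta Lake library is on the Spark classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()
path = "/tmp/events_delta"  # hypothetical table location

# The first write creates version 0; the transaction log behind it is what
# provides ACID guarantees and schema enforcement on subsequent writes.
spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"]) \
    .write.format("delta").mode("overwrite").save(path)

# An append becomes version 1 of the same table.
spark.createDataFrame([(3, "click")], ["id", "event"]) \
    .write.format("delta").mode("append").save(path)

# Time travel: read the table exactly as it looked at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()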

Matei Zaharia, Databricks’ chief technologist, announced the next components for the open-source MLflow project with MLflow Workflows and MLflow Model Registry. These modules extend the end-to-end machine learning lifecycle management with multistep pipelines and model management. Matei also announced the upcoming work around MLflow 1.0, with a stabilized API for long-term usage and additional feature releases. Managed MLflow is also now Generally Available on AWS and Azure.
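
For readers who have not used MLflow yet, the core tracking API that these new components build on looks roughly like this (a minimal sketch; the run name, parameter, and metric are made up for illustration):

import mlflow

# Each run records parameters and metrics (and optionally artifacts and models),
# so experiments can be compared and reproduced across the ML lifecycle.
with mlflow.start_run(run_name="baseline-model"):
    mlflow.log_param("max_depth", 5)   # hypothetical hyperparameter
    mlflow.log_metric("rmse", 0.42)    # hypothetical evaluation result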

There were also keynote sessions from Turing Award winner David Patterson; Netflix VP of Data Science and Analytics Caitlin Smallwood; Michael I. Jordan; Timnit Gebru of Google Brain and Black in AI; Google’s Anitha Vijayakumar; and Jitendra Malik of Facebook AI Research.

The talk I liked the most was delivered by Andrew Clegg.

He talked about efficient joins in Spark. Anyone who has worked with Spark for a while knows how painful joins can be in terms of execution time and resources. These problems usually come around when you are dealing with skewed data: some keys have a ton of rows, whereas most of the remaining ones have just a few. A good way to pinpoint the problem is to look at the Spark UI and check the task time distributions.

Andrew proposed a way to deal with these situations fairly easily:

  • Suppose you have a dataset D1 with (many) repeated, skewed keys and another dataset D2 with no repeated keys. We also assume that D2 is too big to broadcast (otherwise, a broadcast join would already do the trick). Even though this seems very restrictive, most cases are of this type, so the solution is widely applicable.
  • Now, we can add a new column to D1; this column contains random numbers ranging from 1 to R. Then we create a composed key in D1, where we append the new column to the original key; let’s call this new column CK (stands for composed key).
  • The next step is to replicate every row of D2 R times and, for each copy, add a corresponding number ranging from 1 to R. We also generate the CK column in D2. Finally, we perform the join between D1 and D2 on CK. Easy, non-intrusive, and very performant (see the sketch below).
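
Here is a minimal PySpark sketch of the idea, written as a hypothetical helper rather than Andrew’s actual code; it assumes both DataFrames share a join column named key, and the salt factor R is something you would tune:

from pyspark.sql import DataFrame, functions as F

def salted_join(d1: DataFrame, d2: DataFrame, key: str = "key", R: int = 16) -> DataFrame:
    """Join a skewed DataFrame d1 with d2 by salting the join key."""
    # D1: add a random salt (0..R-1) and build the composed key CK.
    d1_salted = (d1
        .withColumn("salt", (F.rand() * R).cast("int"))
        .withColumn("ck", F.concat_ws("_", F.col(key), F.col("salt"))))

    # D2: replicate every row R times, once per possible salt value,
    # and build the matching composed key.
    salts = F.explode(F.array([F.lit(i) for i in range(R)])).alias("salt")
    d2_replicated = (d2
        .select("*", salts)
        .withColumn("ck", F.concat_ws("_", F.col(key), F.col("salt"))))

    # Join on the composed key: each hot key is now spread across R smaller groups.
    return (d1_salted
        .join(d2_replicated.drop(key, "salt"), on="ck")
        .drop("ck", "salt"))

With hypothetical DataFrames d1 (the skewed side) and d2 (unique keys, too big to broadcast), the call is simply joined = salted_join(d1, d2).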

Isn’t it awesome??

However, this might introduce many new rows for D2, which is undesirable. Thus, instead of doing this for the complete data frame, we can do it only for the keys we know are skewed, perform a regular join for the rest of the keys, and then union the two results. Still easy (a sketch of this variant follows).
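
Continuing the hypothetical helper above, the targeted variant salts only a known list of hot keys and unions the result with a plain join for everything else (hot_keys is something you would derive from the Spark UI or a simple count):

hot_keys = ["key_a", "key_b"]        # hypothetical: keys identified as skewed
is_hot = F.col("key").isin(hot_keys)

hot_part = salted_join(d1.filter(is_hot), d2.filter(is_hot))        # salted join for the hot keys
rest_part = d1.filter(~is_hot).join(d2.filter(~is_hot), on="key")   # regular join for the rest

result = hot_part.unionByName(rest_part)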

I loved the approach overall, and it’s extendable to stranger use cases. It can also be tweaked for group-by operations (the other operation that causes many problems).

The second very interesting talk was given by Beck Cronin-Dixon, a data engineer at Eventbrite.

She explained how to build basic ETL pipelines that are very performant depending on the use case. She suggested several different ingestion approaches for doing near-real-time analytics:

  • Full overwrite: This strategy is very simple; every time, the Spark job reads a batch of data, transforms it, and stores it, overwriting previous results. It is simple to implement but has high latency, and it puts a significant load on the real-time DB.
  • Batch incremental merge: In this case, the job gets new/changed rows and then appends them to previous results. If there is inconsistency from duplicated rows, then a process must fix those. It clearly puts less load on the real-time DB, but it requires reliable incremental fields (or a second deduplication process), which introduces high latency.
  • Append-only: A slight variation of the previous approach is to query the real-time DB for new/changed rows, then coalesce them and write new part files, with a compaction job running hourly (a rough sketch follows after this list).

Good: latency in minutes; ingestion is easy to implement; simplifies the data lake.
Bad: requires a compaction process; extra logic in the queries.

  • Key-value store: If the use case is key-value data, then duplicates are less of a problem, since the store can simply overwrite or append to what is already there.

Good: straightforward to implement; a good bridge between a data lake and web services.
Bad: batch writes to a key-value store are slower than to HDFS; not optimized for large scans.

  • Hybrid: This last strategy ingests from DB transaction logs in Kafka, then merges new rows to base rows and stores transaction IDs in the base table. The duplicate rows can be removed on the fly if necessary.

Good: very fast and relatively easy to implement.
Bad: the streaming merge is complicated, and extra processing is required at read time.
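
To make the append-only pattern above a bit more concrete, here is a rough PySpark sketch of one incremental run (the JDBC connection details, table, watermark column, and output path are all hypothetical, and the hourly compaction would be a separate job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("append-only-ingest").getOrCreate()

# Hypothetical watermark: the latest updated_at value ingested by the previous run.
last_watermark = "2019-04-25 00:00:00"

# Pull only the new/changed rows from the operational database.
incremental = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/app")  # hypothetical connection
    .option("dbtable",
            "(select * from orders where updated_at > '{}') src".format(last_watermark))
    .option("user", "etl")
    .option("password", "***")
    .load())

# Coalesce into a few part files and append to the data lake; a separate
# hourly compaction job rewrites the small files into larger ones.
incremental.coalesce(4).write.mode("append").parquet("s3://lake/orders/")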

Here are some of my impressions and takeaways from this event…

Although focused on Spark and AI, the common thread that ran through the summit was adopting and migrating to the cloud. The big cloud service providers were in evidence, including the Big Four: AWS, Azure, Google Cloud, and IBM Cloud. And, of course, Databricks runs its unified analytics platform in the cloud. Almost every exhibitor had booth messaging that referenced the cloud, and many sessions at the event had a cloud theme. Up, up, and away!

I spoke with many people who are considering managed platforms, ephemeral clusters, or no clusters because they are frustrated with wrestling with large Hadoop platforms. Data science in the cloud is happening at scale, but the cloud can be hard to manage. So, many enterprises are choosing to off-load cloud management to vendors like Databricks and Snowflake.

And if you’re a data scientist or Spark programmer, the many postings cluttering the job board indicated the high demand for your services. Three of the “exhibiting” vendors were at the event for the stated purpose of recruiting.

In summary, the 2019 edition of the Spark + AI Summit in San Francisco was a rich learning experience for me and my team.

Categories: Conferences Tags:

Everything we learned at AWS re: Invent 2018

12/04/2018 Comments off

AWS re:Invent, Amazon’s annual user conference in Las Vegas, concluded last week.
The event is only getting bigger and better with each passing year, and re:Invent 2018 will be remembered as a milestone for Amazon and the industry.

Before I share the detailed analysis of the event, I want to highlight some of the
observations from the conference:

  • It was really surprising that AWS shared its market-share numbers alongside the competition’s. It’s way ahead, but it is still mindful of Microsoft and Google catching up.
  • It’s moving away from Intel and will be using ARM-based processors for some of its EC2 instances.
  • AWS doesn’t care about containers. The keynotes had no mention related to ECS,
    EKS or Fargate.
  • They are coming up with new instance types to improve the performance of EC2
    instances.
  • They have often been criticized for consuming open-source software without contributing back; open-sourcing Firecracker is a stepping stone toward changing that. Firecracker is basically a virtual machine monitor that runs functions at high speed; it’s based on the same technology used in AWS Lambda, AWS Fargate, etc.
  • To completely outdo the competition, they are moving towards building a hybrid cloud.
  • AWS Outposts delivers fully managed and configurable compute and storage racks built with AWS-designed hardware that customers can use to operate a seamless hybrid cloud. This will let them run VMware software as well as EC2 and EBS on premises using the same APIs.
  • QLDB (Quantum Ledger Database): a cryptographically verifiable ledger for applications where multiple parties work with a centralized, trusted authority to maintain a verifiable record of transactions. Mind you, this is not “Blockchain” or distributed ledger technology; for that, they have Amazon Managed Blockchain.
  • Amazon Aurora now supports globally distributed databases
  • There were a lot of inroads made on AI-based services. There were enhancements to Comprehend, which now supports vertical industries through Comprehend Medical, a service that can extract relevant medical information from unstructured text, such as medical conditions, medications, dosage, strength, and frequency.
  • AWS Inferentia is a machine learning inference chip designed to provide inferencing for models developed in the TensorFlow, Apache MXNet, and PyTorch deep learning frameworks.
  • There were some unconventional services added to the AWS ecosystem; the surprise was AWS Ground Station, a managed service that lets customers control satellite communications, downlink and process satellite data, and scale satellite operations. The service is available in all the geographies where AWS has a region.
  • The other service that gained attention was AWS RoboMaker, a service to develop, test, and deploy intelligent robotics applications at scale. The service is integrated with the Cloud9 IDE, which comes with the Robot Operating System (ROS) pre-installed and includes sample applications.
  • Continuing the tradition of launching a hardware device aligned with its cloud, AWS launched DeepRacer, an autonomous toy car that runs a reinforcement learning model. At earlier re:Invents, AWS launched the IoT Button and DeepLens.
  • Jassy used his keynote to earmark two distinct classes of AWS users: the traditional
    ‘builder’ class of developers and a growing class of enterprise users that value
    simple solutions over the depth of the product.
  • He described this new set of customers as less “interested in getting into the details
    of all of the services and stitching them together, they are willing to trade some of
    that flexibility in exchange for more prescriptive guidance that allows them to get
    started faster”.
  • In the past, this has included products like Elastic Beanstalk for deploying web apps or SageMaker for simplifying the design and deployment of machine learning algorithms. This year, AWS added Control Tower, Security Hub, and Lake Formation.
  • Control Tower “is the easiest way to set up, govern and secure a compliant, multi-account environment or landing zone on AWS,” along with policy guardrails and
    analytics for visibility into this environment.
  • Security Hub is a “central hub to view and manage security and compliance across
    an entire AWS environment,” which integrates with a bunch of best-of-breed
    vendors, including Splunk, AlertLogic, and IBM Security.
  • Lastly, there is Lake Formation, a tool for simplifying the establishment of an
    enterprise data lake using a range of AWS tools and services. It promises customers
    the ability to set up a data lake in “days, not months” with a point-and-click interface
    to identify data sources before automatically taking care of crawling schemas and
    setting metadata tags, along with a list of prescriptive security policies to put in place
    from day one.
  • Lake Formation is available, while Control Tower and Security Hub are in preview.

Now, with all these observations out of the way, let me tell you what actually happens at re:Invent. Several different types of sessions take place:

  • Keynotes: This is where you hear from AWS leaders and get to be the first to learn about new product and solution announcements from AWS. This year’s keynotes featured Andy Jassy, CEO of Amazon Web Services, and Dr. Werner Vogels, CTO of Amazon.com.
  • Chalk Talk: Chalk talks are highly interactive sessions with a smaller audience. They begin with a 10–15-minute lecture delivered by an AWS expert, followed by a 45–50-minute Q&A session with the audience. The goal is to foster a technical discussion around real-world architecture challenges. Chalk talks are one hour long, presented by AWS experts, and feature expert-level content.
  • Workshops: Workshops are two-hour, hands-on sessions where you work in teams to solve problems using AWS. Workshops organize attendees into small groups and provide scenarios to encourage interaction, giving you the opportunity to learn from and teach each other. Each workshop starts with a 10–15-minute lecture by the main speaker, and the rest of the time is spent working as a group. Additional AWS experts in the room ensure every group gets the assistance they need.
  • Sessions: These are lecture-style and 60 minutes long. They take place throughout the re:Invent campus and cover all topics at all levels (200–400). Sessions are delivered by AWS experts, customers, and partners, and they typically include 10–15 minutes at the end for Q&A.
  • Hands-on Labs: It’s a way to choose a lab that you like from a catalog provided
    to you and learn at your own pace as you walk through scenarios step-by-step. Lab
    topics range in level from introductory to expert and take approximately 30–60
    minutes to complete. (But honestly, this should have been a little more advanced than
    I thought it was, just my opinion, though )
  • Hacks & Jams: Hackathons and Jam Sessions are highly gamified events where
    participants complete tasks that challenge and educate on the use of a wide range
    of AWS services and have fun in the process.
  • Builders Sessions: Each builders session begins with a short explanation
    or demonstration of what you will build. There will not be any formal
    presentation. Once the demonstration is complete, you will use your laptop to
    experiment and build with the AWS expert.
  • Bootcamps: Bootcamps are an opportunity to hone existing skills and learn
    new ways of working with AWS. This year they offered Technical Bootcamps,
    Business Bootcamps, Partner Bootcamps, and AWS Certification Exam Readiness
    Bootcamps.

There were also things outside of what would be termed sessions, but which were still useful:

  • AWS Certification: This was an opportunity to get yourself AWS Certified at re:Invent! They provided special recognition if you cleared an exam.
  • Community: This showed the world how much AWS cares about communities. Offerings included a Mother’s Room, Reflection/Quiet Rooms, gender-neutral restrooms, and accessibility services.
  • Builder’s Fair: It was placed to learn from AWS experts and get hands-on
    experience. There were over 70 projects created by AWS which you could browse.
    Best of all, we got to talk to the experts who built these projects,
    diagram, and problem-solve with them.
  • Content Theater: This was where you could sit back and watch innovative cloud architectures from AWS partners and customers in the ‘This is My Architecture’ video series.
  • Partner Theaters: Learn from AWS sponsors and experts in their demo theaters.
  • Developer Lounge: This was a place for casual dev chats and for checking out the Amazon Sumerian AR/VR experience.
  • AWS Village: This was the place to get your questions answered by AWS engineers and product leaders and to enjoy the AWS Launchpad live-stream production.
  • The Quad: This was used to host the sponsor activations, AWS Marketplace and
    Service Catalog Experience Hub, Startup Central, overflow breakout content, and a
    hands-on LEGO experience.

That was the “Learn” side of the AWS re:Invent world; now let’s talk about “Play”. A lot of cool stuff happened apart from learning new things from AWS. Some of it is listed below:

  • Harley Ride: There were two types of rides. One was a 145-mile loop circling the historic Valley of Fire State Park, and the other was a shorter 54-mile ride around Red Rock Canyon.
  • Tatonka Challenge: This is a quirky Amazon tradition of eating your way to honor through the mass consumption of buffalo chicken wings (or celery for our plant-loving friends).
  • 4K and 8K Run: This was the re:Invent 4K and 8K Charity Fun Run, held in memory of Sam Blackman, CEO of Elemental. It also supported Girls Who Code, a nonprofit organization that aims to support and increase the number of women in computer science.
  • Midnight Madness: This was a sneak peek into re:Invent, complete with fun, drinks, and snacks. We heard AWS executives give a behind-the-scenes take on AWS, watched a marching band, and danced until 1 AM!
  • Pub Crawl: This provided the opportunity to network and connect with fellow re:Invent attendees and AWS sponsors.
  • Giving Back: This was a GiveBack event supporting Three Square’s BackPack for Kids program, which provides bags of nutritious, single-serving, ready-to-eat food items each Friday for children who might otherwise go without during weekends and long breaks from school.
  • Hydrate & Help: This was a way to help fund a clean water project in Tanzania through WaterAid’s partnership with Cupanion’s Fill It Forward program. They provided us with a bottle, and we scanned it with an app (Fill It Forward) every time we filled it.
  • Broomball: This was a fun game mixing traditional broomball and soccer.
  • Re:Play: This was a fun venue with a DJ, a live concert, games, and lots more.

If you want to learn more about the new product announcements made at re:Invent, check out https://aws.amazon.com/new/reinvent/

Until the next re:Invent, happening Dec 2–6, 2019: I hope you learned something about what happened at AWS re:Invent 2018. I’ll end with a quote, albeit a rephrased one: “What happens in Vegas stays in the cloud.”

Categories: Conferences

The Unparalleled Explosion in Cryptocurrencies

09/14/2017 Comments off

Our CIO has always talked about Blockchain and its future. It has always amazed me what currencies are used to make it happen. Everybody has heard of Bitcoin, but it’s just one of the currencies that fall under a category called cryptocurrencies. Bitcoin is very costly (present cost USD 3,430 and increasing as I am writing this blog).

After the massive Bitcoin price surge in 2013, the popularity of launching new cryptocurrencies took off along with it. In fact, if you go back to historical snapshots around that time, you’ll see that there were literally hundreds of new coins available to mine and buy. By sometime in 2014, there were only 32 coins worth more than $1 million in market cap and 354 coins worth less than $50,000, usually trading for tiny fractions of a cent. It seems like everyone and their dog was launching cryptocurrencies back then, even if they were a long shot to materialize into anything.

But fast forward to today and you hear of a new mechanism called the ICO (Initial Coin Offering), similar to an IPO. Today, there is real money at play, and in 12 months the number of cryptocurrencies worth more than $1 million has soared by 468%. Meanwhile, the total value of all currencies together has skyrocketed by 1,466%. Cryptocurrency is so hot, in fact, that raising money through ICOs has become more effective than traditional early-stage angel and VC funding.

And with this ICO activity and a wealth of opportunities emerging, a new breed of Bitcoin millionaire has been born. Like the wealthy tech founders that exit and give back to their local startup ecosystems, these new digital tycoons are using their newfound wealth to invest in upstart crypto projects that show potential – ultimately, further enhancing the ecosystem.

Of course, whenever there is a massive surge in prices and speculation, there are two other players that tend to come out of the woodwork. One is of the scammer and shyster variety, and certainly, crypto-fueled scams are a concern for everyone else in the broader ecosystem.

Perhaps even a bigger threat, however, are the regulators – and in recent times the SEC has voiced concerns about ICO “pump and dump” schemes, while Canadian authorities have clearly stated that “most ICOs need oversight”.

With the market exploding with hundreds of new cryptocurrencies and the total value reaching $177 billion, a new series of questions has emerged: what risk do ICO scams ultimately have on the market? And, could misguided regulation disrupt the momentum of the crypto boom?

A teaser: the unrealistic rise of cryptocurrencies, with their combined valuation compared against America’s largest corporations in 2016.

What the future of these currencies will be, only time will tell. But till then, this will keep exciting people like me. My next blog will be about the big cryptocurrencies available in the market and their nuances. Till then, stay tuned.

Categories: Cryptocurrencies Tags: ,

Modern approach to managing Infrastructure

01/11/2017 Comments off

We are all moving to a world with many firsts, like API First, Cloud First, and DevOps First. I would like to touch on something that should be really interesting for people to read: Infrastructure as Code (IaC). By definition, IaC is the process of managing and provisioning computing infrastructure and its configuration through machine-processable definition files rather than physical hardware configuration or interactive configuration tools.

Everybody wonders what a common use case for it would be. Imagine you receive a notification that a server is unreachable. You follow your usual quick-fix routines (for example, flipping through the logs to see what happened), only to discover that the server has crashed. You freeze! Immediately, you get flashbacks of the hassle you went through while trying to configure that server. You try to recall every component installed on the dead machine, plus their versions. You cannot even recall the order in which everything was installed, let alone the nitty-gritty. You beg the ground to open up and swallow you, but unfortunately (or fortunately), it cannot hear you.

Infrastructure as code, or programmable infrastructure, means writing code (in a high-level language or any descriptive language) to manage configurations and automate the provisioning of infrastructure in addition to deployments. This is not simply writing scripts; it involves using tested and proven software development practices that are already used in application development, for example, version control, testing, small deployments, and the use of design patterns. In short, it means you write code to provision and manage your servers, in addition to automating processes.

So how does a developer contribute to this? There are a number of tools, like Vagrant, Ansible, Puppet, and Docker, that make our lives easy. This is even more useful if you are thinking from an AWS standpoint. The best part of these tools is that we can use the same configuration to run the same procedures multiple times whenever we want to achieve the same results (a small sketch of the idea follows below).
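
As a small illustration of the idea, and not tied to any of the tools above, here is a sketch using troposphere, a Python library that generates AWS CloudFormation templates; the AMI ID, key pair, and instance size are placeholders:

from troposphere import Template, ec2

# The template is plain code: it lives in version control and produces the
# same CloudFormation JSON every time it is run.
template = Template()

web_server = ec2.Instance(
    "WebServer",
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="t2.micro",
    KeyName="my-keypair",             # placeholder key pair
)
template.add_resource(web_server)

# Feed this JSON to CloudFormation to provision (and re-provision) the server.
print(template.to_json())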

With the help of these tools and more, development and testing become simpler, as one can easily spin up a server and fully configure it, even on a development box, for use while developing. This eliminates breaking the common server that is usually shared during development. QAs can do the same thing, for example, easily spinning up another server for a staging environment.

Despite all the different machines involved, there is a sense of confidence that all of them have the same configurations, thus avoiding issues like snowflake servers. The ability to use version control on your infrastructure code means that you can easily track all the changes in your infrastructure environment. Therefore, you have an easier option of rolling back to a previous working configuration in case of a problem. On the other hand, there are a few glitches I can think of at the moment:

  1. Having to do a lot of planning before configuration, such as choosing the right tools
  2. Bad configurations can get duplicated across all the servers
  3. Configuration drift, in which server configurations are modified on the machines themselves (for example, through hot-fixes applied without updating the original templates), causing the configurations on the server and in the template to differ. This is especially true if strict discipline is not followed.

Despite these few mishaps, infrastructure as code will surely bring a smile to your face once you try it out.

Categories: Cloud, Infrastructure Tags: , , ,

Detect Dead Code

06/01/2011 Comments off

While trying to learn more about WPF and MVVM, I came across a very useful tool (Code Analysis) that helps developers write clean, standardized code. This was something I had never seen before, but it turned out to be a really helpful tool that satisfies many of our needs. The first thing I tackled was the removal of all the dead code in my application. Dead code includes unused local variables in functions, uncalled private methods, and unused fields in a class. So I went ahead and ran the tool against my application.

Here are some samples illustrating the kinds of dead-code warnings that generally go unnoticed by developers like me:

public class SampleApp
{
    /* CA1823 : Microsoft.Performance : 
    * It appears that field 'SampleApp.testSampleApp' is never used 
    * or is only ever assigned to. 
    * Use this field or remove it.	
    */
    private string testSampleApp = string.Empty;
        
    public SampleApp()
    { 
        
    }
 	    
    /* CA1811 : Microsoft.Performance :  
    * 'SampleApp.NotCalledPrivateMethod(string)' 
    * appears to have no upstream public or protected callers.
    */
    private void NotCalledPrivateMethod(string successMethod)
    {
        /* CA1801 : Microsoft.Usage : 
        * Parameter 'successMethod' of 
        * 'SampleApp.NotCalledPrivateMethod(string)' 
        * is never used. 
        * Remove the parameter or use it in the method body. 
        */
        Console.WriteLine("Success");
    }

    public void PublicMethod(string successMethod)
    {
        /* CA1804 : Microsoft.Performance : 'SampleApp.PublicMethod(string)'
        * declares a variable, 'successMethodName', of type 'string', 
        * which is never used or is only assigned to. 
        * Use this variable or remove it. 
        */

        string successMethodName = string.Empty;
        if (!string.IsNullOrEmpty(successMethod))
        {
            Console.WriteLine(string.Format("Success:{0}", successMethod));
        }
    }

    /* CA1812 : Microsoft.Performance : 
    * 'SampleApp.SmallSampleApp' is an internal class 
    * that is apparently never instantiated. 
    * If so, remove the code from the assembly. 
    * If this class is intended to contain only static methods, 
    * consider adding a private constructor to prevent the compiler 
    * from generating a default constructor.	
    */
    private class SmallSampleApp
    { }
}
Categories: Code Analysis Tags: