
Quality Assurance (QA) for Data Projects or Data Applications

This post discusses Quality Assurance (QA) activities for data projects.

What is Quality Assurance (QA)? Simply put, Quality Assurance, also called QA, Testing, or Validation, is the practice of testing an application or solution to ensure that all of the stated, promised, and expected requirements are met. It is a critically important activity for any software development or implementation effort. Data applications are no different. They need to be tested to ensure they work as intended.

QA stands between development and deployment. And QA makes the difference between a delivered product and a high quality delivered product.

There are a number of things to keep in mind when you plan your Quality Assurance activities for data solutions. I present some of them in this post as suggestions, considerations, or prompting questions. The things mentioned here will not apply to all data applications but can be used as a guide or a check.

People / Teams

The number of people and teams involved in a project will vary depending on the size, scope and complexity of the project.

The technical team building the application needs to perform an initial level of validation of the solution.

If there is a Quality Assurance team that performs the validation tasks, then that team will need to perform the “official” validation.

The business analysts and end-users of the application also need to validate. Where possible, work with as many end users as you efficiently can. The more real users you have testing the application, the better the chances of finding issues early.

Where it makes sense, Test IDs that simulate various types of users or groups should be used to help test various usage and security scenarios. This is particularly useful in automated testing.
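As a minimal illustration of this idea, the sketch below (Python with pytest assumed; the Test IDs, departments, and the fetch_visible_departments stub are all hypothetical) runs the same check for several Test IDs that simulate different user types:

import pytest

def fetch_visible_departments(user_id):
    # Stub standing in for a real call into the application's security layer.
    simulated_security = {
        "qa_finance_analyst": {"Finance"},
        "qa_hr_analyst": {"HR"},
        "qa_readonly_user": set(),
    }
    return simulated_security[user_id]

@pytest.mark.parametrize("user_id, expected", [
    ("qa_finance_analyst", {"Finance"}),
    ("qa_hr_analyst", {"HR"}),
    ("qa_readonly_user", set()),   # negative case: no department data expected
])
def test_test_id_sees_only_expected_departments(user_id, expected):
    assert fetch_visible_departments(user_id) == expected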

On large projects where there is a lot to be tested, it is best to break up the testing across multiple people or teams. This will help to prevent testing fatigue and sloppy testing and result in higher quality testing.

Plan ahead to ensure that access for all the relevant users is set up in the testing environments.

Communication

With all the teams and people involved, it is important to have a plan for how they will communicate. Things to consider and have a plan for include:

  • How will team members communicate within their own team? Email, Microsoft Teams, SharePoint, and shared files are some options.
  • How will the various teams involved communicate with each other? In other words, how will cross-team communication be handled? Again, Email, Microsoft Teams, SharePoint, and shared files are options.
  • How will issues and status be communicated? Weekly meetings, status emails or documents, and files available on shared spaces are options.
  • How will changes and resolutions be tracked? Files, SDLC applications, Change Management applications are options.
  • How will teams and individuals be notified when they need to perform a task? Manual communication or automated notifications from tools are options.

Data

The most important thing to ensure in data projects is that the data is high quality, particularly the “base” data set. If the base data is incorrect, everything built on top of it will be wrong. The correctness of intermediate and user-facing data is just as important, but validating the base data is critical to achieving correct data everywhere else.

  • Ensure that table counts, field counts and row counts of key data are correct.
  • Does the data warehouse data match the source data?
  • Test detailed, low-level records with small samples of data.
  • Test to ensure that the data and the values conform to what is expected. For example, ensure that there is no data older than 3 years, or that there are no account values outside a certain range. The Data Governance Team may become involved in these activities across all projects. (A minimal sketch of such checks follows this list.)
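For illustration, here is a minimal sketch of such checks in Python, assuming sqlite3 connections stand in for the source system and the data warehouse; the table names and the three-year rule are examples only.

import sqlite3

source = sqlite3.connect("source.db")        # stand-in for the source system
warehouse = sqlite3.connect("warehouse.db")  # stand-in for the data warehouse

def row_count(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# 1. Row counts of key tables should match between source and warehouse.
for table in ["customers", "accounts", "transactions"]:
    src, dwh = row_count(source, table), row_count(warehouse, table)
    assert src == dwh, f"{table}: source has {src} rows, warehouse has {dwh}"

# 2. Values should conform to expectations, e.g. no data older than 3 years.
stale = warehouse.execute(
    "SELECT COUNT(*) FROM transactions "
    "WHERE transaction_date < DATE('now', '-3 years')"
).fetchone()[0]
assert stale == 0, f"{stale} transactions are older than 3 years"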

Next in line is the “intermediate” data such as derived metrics, aggregates, specialized subsets, and more. These will also need to be verified.

  • Are the calculated values correct?
  • Are the aggregates correct? Test aggregate data with small, medium and large sets of data.
  • Verify metric calculations. (A minimal aggregate-reconciliation sketch follows this list.)
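As an example of verifying an aggregate, the following sketch (pandas assumed; file and column names are illustrative) recomputes a regional total from the detail data and compares it to the pre-built aggregate table.

import pandas as pd

detail = pd.read_csv("sales_detail.csv")        # stand-in for the detail data
aggregate = pd.read_csv("sales_by_region.csv")  # stand-in for the aggregate table

# Recompute the aggregate from the detail rows.
recomputed = (
    detail.groupby("region", as_index=False)["sales_amount"]
          .sum()
          .rename(columns={"sales_amount": "total_sales"})
)

# Compare the stored aggregate to the recomputed values.
compared = aggregate.merge(recomputed, on="region", suffixes=("_agg", "_detail"))
mismatches = compared[
    (compared["total_sales_agg"] - compared["total_sales_detail"]).abs() > 0.01
]
print(f"{len(mismatches)} regions where the aggregate does not match the detail data")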

Then the user-facing data or data prepared for self-service usage needs to be validated.

  • Does the data on the dashboard match the data in the database?
  • Are the KPIs correctly reflecting the status?

Test the full flow of the data. The validity of the data should be verified at each stage of the data flow – from the source, to the staging area, to the final tables in the data warehouse, to aggregates or subsets, to the dashboard.

Take snapshots of key datasets or reports so you can compare results post data migration.
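A minimal sketch of this snapshot-and-compare approach, assuming pandas and a sqlite3 connection (the query and file name are illustrative):

import sqlite3
import pandas as pd

conn = sqlite3.connect("warehouse.db")   # stand-in for the warehouse connection
query = "SELECT region, SUM(sales_amount) AS total_sales FROM sales GROUP BY region"

def take_snapshot(path="pre_migration_snapshot.csv"):
    # Run before the migration: persist the current results of a key report.
    pd.read_sql_query(query, conn).to_csv(path, index=False)

def compare_to_snapshot(path="pre_migration_snapshot.csv"):
    # Run after the migration: an empty result means the report still matches.
    before = pd.read_csv(path)
    after = pd.read_sql_query(query, conn)
    merged = before.merge(after, on="region", suffixes=("_before", "_after"))
    return merged[merged["total_sales_before"] != merged["total_sales_after"]]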

Some additional data prep might be needed in some cases.

  • These include making sure that you have sourced adequate data for testing. For example, if you need to test an annual trend, then it might be best to have at least a year’s worth of data, preferably two.
  • You may need to scramble or redact some data for testing. Often, test data is taken from the Production environment and then scrambled and/or redacted so that sensitive information is not exposed. (A minimal scrambling sketch follows this list.)
  • You may need to temporarily load in data for testing. For various reasons, you may need to load some Production data into the QA environment just to test the solution or a particular feature and then remove the data after the testing is complete. While this can be time consuming, sometimes it’s necessary, and it’s good to be aware of the need early and make plans accordingly.
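A minimal sketch of scrambling an extract before it is loaded into a QA environment (pandas assumed; the column names are illustrative, and dedicated data-masking tools may be more appropriate at scale):

import hashlib
import pandas as pd

def scramble(df):
    out = df.copy()
    # Hash natural identifiers so records stay joinable for testing
    # without exposing the real values.
    out["customer_id"] = out["customer_id"].astype(str).map(
        lambda v: hashlib.sha256(v.encode()).hexdigest()[:12]
    )
    # Redact sensitive name and contact fields outright.
    out["customer_name"] = "REDACTED"
    out["email"] = "redacted@example.com"
    return out

prod_extract = pd.read_csv("prod_customers.csv")              # illustrative input
scramble(prod_extract).to_csv("qa_customers.csv", index=False)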

Aesthetics & Representation of Data

Presentation matters. Although the most critical thing is data correctness, how the data is presented is also very important. Good presentation helps with understanding, usability, and adoption. A few things to consider include:

  • Does the application, such as a dashboard, look good? Does it look right?
  • Are the components laid out properly so that there is no overcrowding?
  • Are the logos, colors and fonts in line with company expectations?
  • Are proper chart options used to display the various types of data and metrics?
  • Is the information provided in a way that users can digest?

Usage

The data application or solution should be user friendly, preferably intuitive, or at least well documented. The data must be useful to the intended audience, in that it should help them understand the information and make good decisions or take sensible actions based on it.

The application should present data in a manner that is effective – easy to access, and easy to understand.

The presentation should satisfy the analytic workflows of the various users. Users should be able to logically step through the application to find information at the appropriate level of detail that they need based on their role.

A few things that affect usability include:

  • Prompts – ensure that all the proper prompts or selections are available to users to slice and filter the data as necessary. And of course, verify that they work.
  • Drill downs and drill throughs – validate that users can drill-down and across data to find the information they need in a simple, logical manner.
  • Easy interrogation of the data – if the application is ad-hoc in nature, validate that users can navigate it or at least verify that the documentation is comprehensive enough for users to follow.

Security

Securing the application and its data so that only authorized users have access to it is critical.

Application security comprises “authentication” (access to the application) and “authorization” (what a user is authorized to do when he or she accesses the application).

Authorization (what a user is authorized to do within the application) can be broken into “object security” – what objects or features a user has access to, and “data security” – what data elements a user has access to within the various objects or features.

For example, a user has access to an application (authenticated / can log in), and within the application the user has access to (authorized to see and use) 3 of 10 reports (object-level security). The user is not authorized to see the other 7 reports (object-level security) and, therefore, will not have access to them. Now, within the 3 reports that the user has access to, he or she can only see data related to 1 of 5 departments (data-level security).

All object-level and data-level security needs to be validated. This includes negative testing: not only should you test that users have the access they need, you should also test that users do not have access they should not have. (A minimal sketch follows the list below.)

  • Data for testing should be scrambled or redacted as appropriate to protect it.
  • Some extremely sensitive data may need to be filtered out entirely.
  • Can all the appropriate users access the application?
  • Are non-authorized users blocked from accessing the application?
  • Can users see the data they should be able to see to perform their jobs?
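A minimal sketch of validating object-level access, including the negative cases (the expected-access matrix and the get_accessible_reports stub are hypothetical placeholders for your application's security layer):

# Expected object-level access per test user; adjust to your own application.
EXPECTED_ACCESS = {
    "finance_user": {"Revenue Report", "Cost Report", "Margin Report"},
    "hr_user": {"Headcount Report"},
    "external_auditor": set(),   # should see nothing
}

def get_accessible_reports(user):
    # Stub: replace with a real call into the application's security layer.
    return EXPECTED_ACCESS[user]

for user, expected in EXPECTED_ACCESS.items():
    actual = get_accessible_reports(user)
    missing = expected - actual   # positive test: access the user needs
    extra = actual - expected     # negative test: access the user must not have
    assert not missing, f"{user} is missing access to {missing}"
    assert not extra, f"{user} has unexpected access to {extra}"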

Performance

Performance of the data solution is important to user efficiency and user adoption. If users cannot get the results they need in a timely manner, they will look elsewhere to get what they need. Even if they have no choice, a poorly performing application will result in wasted time and dollars.

A few things to consider for ensuring quality around performance:

  • Application usage – is the performance acceptable? Are results returned in an acceptable time? (A timing sketch follows this list.)
  • Data Integration – is the load performance acceptable?
  • Data processing – can the application perform all the processing it needs to do in a reasonable amount of time?
  • Stress Testing – how is performance with many users? How is it with a lot of data?
  • How is performance with various selections or with no selections at all?
  • Is ad-hoc usage set up to be flexible while avoiding rogue analyses that could cripple the system?
  • Is real-time analysis needed and is the application quick enough?
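A minimal sketch of a response-time check (the queries, thresholds, and sqlite3 connection are illustrative stand-ins for your own database and agreed service levels):

import sqlite3
import time

conn = sqlite3.connect("warehouse.db")   # stand-in for the warehouse connection

# (name, query, maximum acceptable seconds)
CHECKS = [
    ("dashboard summary", "SELECT region, SUM(sales_amount) FROM sales GROUP BY region", 2.0),
    ("detail drill-down", "SELECT * FROM sales WHERE region = 'East' LIMIT 1000", 5.0),
]

for name, query, max_seconds in CHECKS:
    start = time.perf_counter()
    conn.execute(query).fetchall()
    elapsed = time.perf_counter() - start
    status = "OK" if elapsed <= max_seconds else "TOO SLOW"
    print(f"{name}: {elapsed:.2f}s (limit {max_seconds}s) {status}")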

These items need to be validated and any issues need to be reported to the appropriate teams for performance tuning before the application is released for general usage.

Methodology

Each organization, and even each team within an organization, will have a preferred methodology for application development and change management, including how they perform QA activities.

Some things to consider include:

  • Get QA resources involved in projects early so that they gain an early understanding of the requirements and the solutions to assess and plan how best to test.
  • When appropriate, do not wait until all testing is complete before notifying development teams of issues discovered. Notifying them early could make the difference between your project being on time or late.
  • Create a test plan and test scripts – even if they are high-level.
  • Where possible, execute tasks in an agile, iterative manner.
  • Each environment will have unique rules and guidelines that need to be validated. For example, your application may have a special naming convention, color & font guidelines, special metadata items, and more. You need to validate that these rules and guidelines are followed.
  • Use a checklist to ensure that you validate with consistency from deliverable to deliverable. (A minimal checklist sketch follows this list.)
  • When the solution being developed is replacing an existing system or dataset, use the new and old solutions in parallel to validate the new against the old.
  • Document test results. All testing participants should document what has been tested and the results. This may be as simple as a checkmark or a “Done” status, but may also include things like data entered, screenshots, results, errors, and more.
  • Update the appropriate tracking tools (such as your SDLC or Change Management tools) to document changes and validation. These tools will vary from company to company, but it is best to have a trail of the development, testing, and release to production.
  • For each company and application, there will be a specific, unique set of things that will need to be done. It is best if you have a standard test plan or test checklist to help you confirm that you have tested all important aspects and scenarios of the application.
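As a minimal sketch of a reusable checklist with a documented results trail (the checklist items and file name are illustrative):

import csv
from datetime import date

CHECKLIST = [
    "Row counts match between source and data warehouse",
    "Key metric and aggregate calculations verified",
    "Dashboard values match database values",
    "Object-level and data-level security validated (including negative tests)",
    "Naming conventions, colors, and fonts follow guidelines",
]

with open("qa_checklist_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["test_date", "check", "status", "tester", "notes"]
    )
    writer.writeheader()
    for item in CHECKLIST:
        # Status, tester, and notes would be filled in as testing proceeds.
        writer.writerow({
            "test_date": date.today().isoformat(),
            "check": item,
            "status": "Not started",
            "tester": "",
            "notes": "",
        })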

This is not an all-encompassing coverage of Quality Assurance for data solutions, but I hope the article gives you enough information to get started or tips for improving what you currently have in place. You can share your questions, thoughts and input via comments to this post. Thanks for reading!

Creating a Business Intelligence (BI) & Analytics Strategy and Roadmap

This post provides some of my thoughts on how to go about creating a Business Intelligence (BI) & Analytics Strategy and Roadmap for your client or company.  Please comment with your suggestions from your experience for improving this information.

 

When creating or updating the BI & Analytics Strategy and Roadmap for a company, one of the first things to understand is:

Who are all the critical stakeholders that need to be involved?

Understanding who needs and uses the BI & Analytics systems is critical for starting the process of understanding and documenting the “who needs what, why, and when”.

These are some of the roles that are typically important stakeholders:

  • High-level business executives that are paying for the projects
  • Business directors involved in the usage of the systems
  • IT directors involved in the development and support of the systems
  • Business Subject Matter Experts (SMEs) & Business Analysts
  • BI/Analytics/Data/System Architects
  • BI/Analytics/Data/System Developers and Administrators

 

Then, you need to ask all these stakeholders, especially those from the business:

What are the drivers for BI & Analytics? And what is the level of importance for each of these drivers?

This will help you to understand and document which business needs are driving the demand for new or modified BI & Analytics solutions. You should then go deeper to understand the business objectives and goals that are driving those business needs.  This will help you to understand and document the bigger picture so that a more comprehensive strategy and roadmap can be created.

The questions and discussions surrounding the above will require deep and broad business involvement. Getting the perspective of a wide range of users from all business areas that are using the BI & Analytics Systems is critical.  The business should be involved throughout the process of creating the strategy and roadmap, and all decisions should tie back to support for business objectives and goals. And the trail leading to all these decisions must be documented.

Some examples of business drivers include:

  • Gain more insight into who our best customers are and how best to acquire them.
  • Understand how weather affects our sales/revenue.
  • Determine how we can sell more to our existing customers.
  • Understand what causes employee turnover.
  • Gain insight into how we can improve staffing schedules.

 

And examples of business objectives and goals may include things like:

  • Increase corporate revenues by 10%
  • Grow our base of recurring customers
  • Stabilize corporate revenues over all seasons
  • Create an environment where employees love to work
  • Reduce payroll costs without a reduction in staff, for example, reduce turnover.

 

Then, turn to understanding and documenting the current scenario (if it is not already known). Identify what systems (including data sources) are in place, who is using them (and why and how), what capabilities they offer, what the must-haves are, and what the pain points and positive highlights are.

Also, you will need to determine the current workload (and future workload if it can be determined) of the primary team members involved in developing, testing, and implementing BI & Analytics solutions.

This will help you understand a few things:

  • Some of the highest priority needs of the users
  • Gaps in capabilities and data between what is needed and what is currently in place (including an understanding of what is liked and disliked about the current systems)
  • Current user base knowledge and engagement
  • IT knowledge and skills
  • Resource availability – when are people available to work on new initiatives

 

What are the options and limitations?

  • Can existing systems be customized to meet the requirements?
  • Can they be upgraded to a new version that has the needed functionality?
  • Do we need to consider adding a new platform or replacing one or more of the existing systems with a new platform?
  • Can we migrate from/integrate one system to/with another system that we already have up and running?
  • Are any of our current systems losing vendor support or require an upgrade for other reasons? Has the pricing changed for any of our software applications?
  • What options does our budget permit us to explore?
  • What options do our knowledge and skills permit us to explore?

 

Once you have identified these items …

  • Identify and engage stakeholders, and document these roles and the people
  • Identify and document business drivers, objectives and goals
  • Understand and document the current landscape – needs (including must-haves), technology, gaps, users, IT staff, resource availability, and more
  • Identify and document options – based on current landscape, technology, budget, staff resources, etc.

… you can develop a “living” Strategy and Roadmap for BI & Analytics. And when I say “living”, I mean it will not be a static document; it will be fine-tuned over time as new information emerges and as changes arise in business needs, technology, and staff resources.

 

Your Strategy and Roadmap for BI & Analytics should include, but not be limited to:

  • BI & Analytics that will be used to satisfy business drivers, objectives and goals
  • Data acquisition and storage plan for meeting the analytics needs
  • Technology platforms that will be used to process and store data, and deliver the analytics
  • Information about any new technologies that need to be acquired or implemented, and schedules
  • Roles and Responsibilities for all stakeholders involved in BI & Analytics projects
  • Planned staffing allocations and schedules
  • Planned staffing changes and schedules
  • User training (business users) and Delivery team training (technical implementers & developers for example)
  • Dependencies for each item or set of items

The Apache Hadoop Ecosystem

Apache Hadoop, simply termed Hadoop, is an increasingly popular open-source framework for distributed computing.  It has had a major impact on the business intelligence / data analytics / data warehousing space, spawning a new practice in this space referred to as Big Data.  Hadoop’s core architecture consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce.  It provides a reliable, scalable, and cost-effective means for storing and processing large data sets, and it does so like no other software framework before it.

It is cost-effective and scalable because it is designed to run on commodity hardware servers that can be scaled from one to hundreds, or even thousands, thereby avoiding the cost of expensive super-computers (which eventually hit their limits).  With Hadoop, you are able to add commodity servers as needed without much difficulty and at minimal cost.

It is reliable because all the modules in Hadoop are designed with a fundamental assumption that hardware failures will occur and these failures should be automatically handled in software by the Hadoop framework.

Beyond the core components, the Hadoop eco-system has grown to include a number of additional packages that run on top of or alongside the core Hadoop components, including but not limited to, Apache Hive, Apache Pig, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Impala, Apache Flume, Apache Sqoop, Apache Oozie, Apache Storm, Apache Mahout, Ambari, Apache Drill, Tez, and others.  This post will serve as a quick look-up for the components of the eco-system to allow you to quickly identify what the components are and understand what they do.

Each component is listed below in the format Component [Category], followed by its purpose / usage.

Hadoop [The ecosystem]: The core Apache Hadoop framework is composed of the following modules:

  • Hadoop Common
  • Hadoop Distributed File System (HDFS)
  • Hadoop YARN
  • Hadoop MapReduce
Hadoop Common [Software libraries shared across the ecosystem]: Hadoop Common contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS) [Distributed Storage]: HDFS is a distributed file system that is the foundational storage component of Hadoop, and it sits on top of the file system of the commodity hardware that Hadoop runs on. It stores data on these commodity servers and provides high bandwidth and throughput across the cluster of servers.
Hadoop YARN [Resource Management & Scheduling]: YARN (which stands for Yet Another Resource Negotiator) is a resource-management platform that manages computing resources in Hadoop clusters and uses them to schedule users’ applications.
Hadoop MapReduce [Distributed Processing]: MapReduce is a programming and processing paradigm that pairs with HDFS for large scale data processing.

It is a distributed computational algorithm comprised of a Map() procedure and a Reduce() procedure that pushes computation down to each server in the Hadoop cluster.

The Map procedure performs functions such as filtering and sorting of data; while the Reduce() procedure performs summary / aggregate type operations on the data.
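To make the Map/Reduce split concrete, here is a toy word-count illustration in plain Python (this is not Hadoop code; it only mimics the map, shuffle, and reduce steps):

from collections import defaultdict

documents = ["big data is big", "hadoop processes big data"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group intermediate values by key (the Hadoop framework does this).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}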

Hive [MapReduce Abstraction / Analysis / Querying]: Apache Hive is a data warehouse infrastructure that provides an abstraction layer on top of MapReduce.  It provides a SQL-like language called HiveQL and transparently converts queries to MapReduce, Apache Tez, and Apache Spark jobs.

It can handle analysis of large datasets and provides functionality for indexing, data summarization, query, and analysis of the data stored in HDFS or other compatible file systems.

Pig [MapReduce Abstraction / Analysis / Querying]: Pig is a functional programming interface that allows you to use a higher-level scripting language (called Pig Latin) to create MapReduce code for Hadoop. Pig is similar to PL/SQL and can be extended using UDFs written in Java, Python, and other languages.

It was originally developed to provide analysts an ad-hoc way of creating and executing map-reduce jobs on very large data sets.

Ambari [Monitoring & Management of Clusters]: Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It includes support for many of the key components of the Hadoop eco-system, such as Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop.

Ambari also provides a user-friendly dashboard for viewing cluster health and MapReduce, Pig and Hive applications, and it provides features to diagnose the performance of the various components.

HBase [Storage / database]: HBase is a non-relational (NoSQL) distributed, fast, and scalable database that runs on top of HDFS.  It is modeled after Google’s Big Table, providing BigTable-like capabilities to Hadoop, and is written in Java.

It provides fault-tolerant storage and retrieval of huge quantities of sparse data – such as the top 10 out of 10 billion records, or the 0.1% of records that are non-zero.

HBase features include compression, and in-memory operation.  HBase tables can be used as the input and output for MapReduce jobs run in Hadoop, and are accessed through APIs.

HBase can be integrated with BI and Analytics applications through drivers and through Apache Phoenix’s SQL layer. However, HBase is not a RDBMS replacement.

Hue [Web GUI]: Hue is an open-source Web interface for end users that supports Apache Hadoop and its ecosystem.

Hue provides a single interface for the most common Apache Hadoop components with an emphasis on user experience. Its main goal is to have the users make the most of Hadoop without worrying about the underlying complexity or using a command line.

Sqoop [Data Integration]: Sqoop, named from a combination of SQL+Hadoop, is an application with a command-line interface that pulls and pushes data from/to relational data sources, to/from Hadoop.

It supports compression, incremental loads of a single table, or a free form SQL query. You can also save jobs which can be run multiple times to perform the incremental loads. Imports can also be used to populate tables in Hive or HBase.

Exports can be used to put data from Hadoop into a relational database.

Several software vendors provide Sqoop-based functionality into their database and BI/analytics products.

Flume [Data Integration]: Apache Flume is a distributed service for efficiently collecting, aggregating, and moving large amounts of log data (such as web logs or sensor data) into and out of Hadoop (HDFS).

Its features include fault tolerance and a simple, extensible data model that supports streaming data flows and allows for online analysis.

Impala [Analysis / Querying]: Cloudera Impala is a massively parallel processing, low-latency SQL query engine that runs on Hadoop and communicates directly with HDFS, bypassing MapReduce.

It allows you to run SQL queries in lower data volume scenarios on data stored in HDFS and HBase, and returns results much quicker than Pig and Hive.

Impala is designed and integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software, which allows for both large scale data processing and interactive queries to be done on the same system.

Impala is great for data analysts and scientists to perform analytics on data stored in Hadoop via SQL or other business intelligence tools.

Avro [Data Integration]: Avro is a data interchange protocol/framework that provides data serialization and de-serialization in a compact binary format.

Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.

Storm [Data Integration]: Apache Storm is a distributed computation framework, written predominantly in the Clojure programming language, that moves streaming data into and out of Hadoop.

It allows for the definition of information sources and manipulations to allow batch, distributed processing of streaming data.

Storm’s architecture acts as a data transformation pipeline. At a very high level the general architecture is similar to a MapReduce job, with the main difference being that data is processed in real-time as opposed to in individual batches.

Oozie [Workflow Builder]: Apache Oozie is a server-based workflow scheduling system, built using Java, to manage Hadoop jobs. It chains together MapReduce jobs and data import/export scripts.

Workflows in Oozie are defined as a collection of control flow and action nodes. Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task. Oozie provides support for different types of actions including Hadoop MapReduce, HDFS operations, Pig, SSH, and email; and it can be extended to support additional types of actions.

Mahout [Machine Learning]: Apache Mahout is a set of libraries for distributed, scalable machine learning, data mining, and mathematical algorithms that run primarily on Hadoop, focused primarily on collaborative filtering, clustering, and classification.

Mahout’s core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm, but it is not restricted to Hadoop-based implementations.

ZooKeeper [Coordination]: ZooKeeper is a high-performance, high-availability coordination service for distributed applications. It provides a distributed configuration service, synchronization service, and naming registry for large distributed systems.

ZooKeeper is used by open source enterprise search systems like Solr.

Spark [Data Integration, Processing, Machine Learning]: Apache Spark sits directly on top of HDFS, bypassing MapReduce, and is a fast, general compute engine for Hadoop.  It is said that Spark could eventually replace MapReduce because it provides solutions for everything MapReduce does, plus a lot more functionality.

It uses a different paradigm from MapReduce (analogous to row-based versus set-based processing in SQL) and makes greater use of in-memory capabilities, which typically makes it faster than MapReduce.  In contrast to MapReduce’s two-stage, disk-based paradigm, Spark’s multi-stage in-memory primitives provide performance up to 100 times faster for certain applications.

Spark is very versatile and provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

Spark requires a cluster manager and supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos; and also requires a distributed storage system such as Hadoop Distributed File System (HDFS), Cassandra, Amazon S3, or even custom systems; but it does support a pseudo-distributed local mode (for development and testing).

Spark is one of the most active projects in the Apache Software Foundation.
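As a small illustration of the programming model, here is a minimal PySpark sketch (it assumes the pyspark package is installed and runs in local mode; the data is made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (in a real cluster this would run under YARN,
# Mesos, or Spark's standalone manager).
spark = (
    SparkSession.builder
    .appName("spark-sketch")
    .master("local[*]")
    .getOrCreate()
)

# Build a tiny DataFrame and aggregate it, the kind of work that would
# otherwise be expressed as MapReduce jobs.
sales = spark.createDataFrame(
    [("East", 100.0), ("West", 250.0), ("East", 75.0)],
    ["region", "amount"],
)
totals = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))
totals.show()

spark.stop()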

Phoenix [Data Manipulation]: Apache Phoenix is a massively parallel, relational database layer on top of noSQL stores such as Apache HBase.

Phoenix provides a JDBC driver that hides the complexities of the noSQL store enabling users to use the familiar SQL to create, delete, and alter SQL tables, views, indexes, and sequences; insert, update, and delete rows singly and in bulk; and query data.

Phoenix compiles queries and other statements into native noSQL store APIs rather than using MapReduce, enabling the building of low-latency applications on top of noSQL stores.

Cassandra [Storage / database]: Apache Cassandra is an open source, scalable, multi-master, high-performance, distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Cassandra supports clusters spanning multiple datacenters, with asynchronous masterless replication allowing for low latency operations.

Solr [Search]: Solr (pronounced “solar”) is an open source enterprise search platform, written in Java, that runs as a standalone full-text search server.

Its features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL capabilities, and rich document (e.g., Word, PDF) handling. Solr is designed for scalability and fault tolerance with distributed search and index replication. Solr is a very popular enterprise search engine.

Solr has APIs and a plugin architecture that makes it customizable using various programming languages.

There is the flexibility to route the data being brought in by Sqoop and Flume directly into Solr for indexing on the fly.  But you can also tell Solr to index the data in batches.

MongoDB [Storage / database]: MongoDB (from “humongous”) is an open-source, cross-platform, document-oriented NoSQL database. It uses a JSON-like structure, called BSON, with dynamic schemas, which makes the integration of data in certain types of applications easier and faster.

MongoDB is one of the most popular types of database management systems, and is said to be the most popular for document stores.

Kafka [Data Integration]: Apache Kafka is an open-source message broker project written in Scala. It provides a unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs.

Apache Kafka was originally developed by LinkedIn, and was subsequently open sourced in early 2011.

Accumulo [Storage / database]: Apache Accumulo is a sorted, distributed key/value data store and, like HBase, is based on the BigTable technology from Google.  It is written in Java, and is built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift. As of 2015, Accumulo is said to be the third most popular NoSQL wide column store behind Apache Cassandra and HBase.
Chukwa [Data Integration]: Chukwa is a data collection system for managing large distributed systems.
Tez [Processing]: Tez is a flexible data-flow programming framework, built on Hadoop YARN, that processes data in both batch and interactive modes. It is being adopted by Hive, Pig and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop MapReduce as the underlying execution engine.
Drill [Processing]: Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets, even across multiple data stores, at amazing speed.

It is the open source version of Google’s Dremel system which is available as a service called Google BigQuery, and it supports a variety of NoSQL databases and file systems, including HBase, MongoDB, HDFS, Amazon S3, Azure Blob Storage, local files, and more.

Apache Sentry [Security]: Apache Sentry is a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster.

 

You can get more information about Apache Hadoop here: http://hadoop.apache.org/

It’s all about the users – Identifying Users for your BI applications / dashboards

One of the first things you will need to do before developing your Business Intelligence (BI) applications or dashboards is … identify who will use them.  You need to identify who will be using the application – what business areas they belong to, what groups they belong to, what the various functions or roles within those groups are, and eventually, who the actual people are.  After identifying the various roles (groups of users typically associated with a business process or function), you can then identify their needs.  Starting any development before knowing who will be using the system could result in a lot of wasted time and effort, or a sub-optimal system.  The grouping of information on dashboards, the available functionality, and security will all be driven by these roles and their respective needs.

After identifying the various functions or roles that users possess, it is important to understand how each role performs its job functions.  You need to understand what information they need and in what order, how it’s used, and the level of detail required at various stages. With this information, you will determine the dashboards, the dashboard pages and their order, the information on each dashboard page and its precedence and level of detail, and what detailed information is needed via drill down. Basically, you will be creating the analytic workflows for the identified roles and the various processes, functions, and tasks that they perform.

When performing the above exercise, please keep the roles as discrete as possible.  For example, even if someone doubles as an AP/AR Analyst, you should still analyze and plan for 2 separate roles – AP Analyst and AR Analyst – because those are 2 separate functions.  Later, the individual or group can be granted permissions to both roles.  From a security standpoint in general, you will create the necessary BI application roles to support your business roles, and then assign security based on these roles.

In general, always keep the focus on the users, what they need to accomplish, and the most efficient ways to help them perform their jobs.  When you build the BI security and dashboards to meet those needs and usage scenarios, it will result in higher and faster user adoption.  This will take time, so do not rush the process.  Get detailed information about all the steps in their workflow upfront, document it, and then build around it.  On the other hand, you do not have to document to perfection upfront; you can take a more agile approach of developing based on fairly good user profiles to give users working prototypes, and then adjusting as new information and feedback is received from the users.

Good luck identifying your users and their needs as you get your BI project rolling.  And remember, it’s all about the users!