Category: Data Development

Oracle Error when table or other object has the same name as the schema name

We recently had a situation where a procedure ran fine in two environments but failed in a third. During debugging, we determined that the procedure call succeeded if the schema prefix was removed from it, and failed otherwise.

The following error was produced:

ERROR at line 1:

ORA-06550: line 1, column 14:

PLS-00302: component 'MyProcedure' must be declared

ORA-06550: line 1, column 7:

PL/SQL: Statement ignored

After some research, the DBA found a web post that indicated that this error is generated if you have an object with the same name as the schema.

You can check if you have any such objects by running this SQL command:

SQL> select * from dba_objects where object_name = upper('Your_Schema_Name');

(of course, where “Your_Schema_Name” is the actual name of your schema; the UPPER is there because Oracle stores unquoted object names in uppercase)

If you do, then you should rename the object, or remove it if it is no longer needed. Of course, if it is a valid object that is being used, you will need to rename it in all the places in which it is referenced.
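For example, if the conflicting object turns out to be a table, the rename could look like this (MYSCHEMA and MYSCHEMA_DATA are placeholder names, and the exact statement depends on the object type):

SQL> alter table MYSCHEMA.MYSCHEMA rename to MYSCHEMA_DATA;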

Thanks for reading! Good luck on your data journey!

SQL tips: Get last day and first day of month – Oracle SQL

At times you may need to dynamically determine the first or last day of a month based on the current date or on a date in your data. This post provides a few options for deriving the first day and the last day of a month, along with a couple of tweaks to get the last day of the next or previous month.

Get the current date (today’s date):
select sysdate from dual;

SYSDATE

16-MAY-22

Get the last day of the current month:
select trunc(last_day(sysdate)) as LastDayOfMonth from dual;

LASTDAYOFMONTH

31-MAY-22

Side note: The last_day() function can also be useful for dynamically determining leap years (that is, whether February has 28 or 29 days):
select
last_day(date '2020-02-01') LastDayOfFeb2020, -- leap year
last_day(date '2021-02-01') LastDayOfFeb2021  -- not a leap year
from dual;

LASTDAYOFFEB2020 LASTDAYOFFEB2021

29-FEB-20 28-FEB-21

Get the last day of the next month:
select add_months(trunc(last_day(sysdate)), 1) as LastDayOfNextMonth from dual;

LASTDAYOFNEXTMONTH

30-JUN-22

Get the last day of the previous month:
select add_months(trunc(last_day(sysdate)), -1) as LastDayOfPreviousMonth from dual;

LASTDAYOFPREVIOUSMONTH

30-APR-22

Get the first day of the current month:

select trunc(sysdate,'month') as FirstDayOfMonth from dual;
select trunc(sysdate,'MM') as FirstDayOfMonth from dual;

FIRSTDAYOFMONTH

01-MAY-22

Get the first day of the next month:

select add_months(trunc(sysdate,'MM'), 1) as FirstDayOfNextMonth from dual;

FIRSTDAYOFNEXTMONTH

01-JUN-22

You can also get the first day of the next month using this:
select trunc(last_day(sysdate)+1) as FirstDayOfNextMonth from dual;

FIRSTDAYOFNEXTMONTH

01-JUN-22

Get the first day of the previous month:

select add_months(trunc(sysdate,'MM'), -1) as FirstDayOfPreviousMonth from dual;

FIRSTDAYOFPREVIOUSMONTH

01-APR-22

Here is a resource for getting first-day and last-day of month values on a SQL Server database:

https://zarez.net/?p=2462

Thanks for reading! I hope you found this information useful!

Good luck on your data journey.

SQL Tips: How to replace a character or a part of a string in a column value – Oracle SQL

If you work with data regularly, you will occasionally need to replace part of the values in a column. In our case, we recently had data loaded into a table from a file source, and every value had an unwanted leading and trailing tab character.

We needed to remove these tabs, preferably without re-sourcing and reloading the data. The solution was to use the REPLACE SQL function.

This is the form of the UPDATE statement that uses the REPLACE function:

update TABLENAME set COLUMN_TO_BE_UPDATED = REPLACE(COLUMN_TO_BE_UPDATED, 'Char_or_String_to_replace_in_column_values', 'Char_or_String_to_replace_it_with');

Here is an example that replaces the long street names in the ADDRESS column with their abbreviated corresponding names. Note that Oracle does not allow the same column to be assigned twice in one SET clause (doing so raises ORA-00957), so the REPLACE calls are nested instead:

update EMPLOYEE set ADDRESS = REPLACE(REPLACE(REPLACE(ADDRESS, 'Street', 'St'), 'Avenue', 'Ave'), 'Circle', 'Cir') where COUNTRY = 'United States';

To solve the issue we had, we used the SQL statement below to update the EMPLOYEE column values, replacing the tab character (CHR(9) in Oracle) with NULL. Note that REPLACE removes every tab in the value, not just the leading and trailing ones; for our data that was acceptable.

Important Note: Before making any major updates to a table, it’s a best practice to back up the table first.

update EMPLOYEE_EXPENSE_DATA set EMPLOYEE = REPLACE(EMPLOYEE, CHR(9), NULL) where EMPLOYEE like CHR(9) || '%';

Important Note: Verify your update before you commit the changes.

We repeated this SQL for each column that had the “bad” values (unwanted leading and trailing tabs).
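Side note: if you need to strip only the leading and trailing tabs while preserving any interior ones, TRIM is the better fit than REPLACE. A sketch against the same table:

update EMPLOYEE_EXPENSE_DATA set EMPLOYEE = TRIM(CHR(9) from EMPLOYEE) where EMPLOYEE like CHR(9) || '%' or EMPLOYEE like '%' || CHR(9);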

Thanks for reading! I hope you found this information useful.

SQL Tips: How to generate a delimited list from the values in an Oracle database table column

This is a quick post that shows how to generate the values from a table column in a delimited list format.

Suppose you have a table named CONTACTS with a column named NAME that contains these values, one per row: Adam, Jane, John, Lisa, Mark, Mary, Noah.

And you want to generate a delimited list of the names in the NAME column like this:

Adam, Jane, John, Lisa, Mark, Mary, Noah

Or like this:

Adam | Jane | John | Lisa | Mark | Mary | Noah

You would use SQL statements like these to generate the above output – note the delimiter in each case, and also note that you can sort the output:

— comma delimiter example …

SELECT listagg(NAME, ', ') within group (order by NAME) FROM CONTACTS;

Adam, Jane, John, Lisa, Mark, Mary, Noah

— pipe delimiter example …

SELECT listagg(NAME, ' | ') within group (order by NAME) FROM CONTACTS;

Adam | Jane | John | Lisa | Mark | Mary | Noah
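One caveat worth knowing: LISTAGG returns a VARCHAR2, so if the concatenated result exceeds the 4000-byte limit you will get an ORA-01489 error. On Oracle 12.2 and later, you can have it truncate gracefully instead (shown here against the same CONTACTS table):

SELECT listagg(NAME, ', ' ON OVERFLOW TRUNCATE) within group (order by NAME) FROM CONTACTS;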

You can find more information about the listagg function in the Oracle documentation.

Thanks for reading. I hope you found this useful.

Python Libraries for Data Science

Python has grown quickly to become one of the most widely used programming languages. While it’s a powerful, multi-purpose language used for creating just about any type of application, it has become a go-to language for data science, rivaling even “R”, the longtime favorite language and platform for data science.

Python’s popularity for data-based solutions has grown because of the many powerful, open-source, data-centric libraries available for it. Some of these libraries include:

NumPy

A library for creating and manipulating multi-dimensional data arrays, with efficient routines for complex mathematical operations on that data.

Pandas

Pandas is a library that provides easy-to-use but high-performance data structures, such as the DataFrame, and data analysis tools.

Matplotlib

Matplotlib is a library used for data visualization such as creating histograms, bar charts, scatter plots, and much more.

SciPy

SciPy is a library that provides integration, statistics, and linear algebra packages for numerical computations.

Scikit-learn

Scikit-learn is a library used for machine learning. It is built on top of some other libraries including NumPy, Matplotlib, and SciPy.

There are many other data-centric Python libraries, and some will be introduced in future articles. You can learn more at https://www.python.org/

Quality Assurance (QA) for Data Projects or Data Applications

This post discusses Quality Assurance (QA) activities for data projects.

What is Quality Assurance (QA)?  Simply put, Quality Assurance, also called QA, Testing or Validation, is about testing an application or solution to ensure that all the stated/promised/expected requirements are met. It is a critically important activity for all software application development or implementations. Data applications are no different. They need to be tested to ensure they work as intended.

QA stands between development and deployment, and it makes the difference between a delivered product and a high-quality delivered product.

There are a number of things to keep in mind when you plan your Quality Assurance activities for data solutions. I present some of them in this post as suggestions, considerations, or prompting questions. The things mentioned here will not apply to all data applications but can be used as a guide or a check.

People / Teams

The number of people and teams involved in a project will vary depending on the size, scope and complexity of the project.

The technical team building the application needs to perform an initial level of validation of the solution.

If there is a Quality Assurance team that performs the validation tasks, then that team will need to perform the “official” validation.

The business analysts and end-users of the application also need to validate. Where possible, involve as many end users as you efficiently can. The more real users you have testing the application, the better the chances of finding issues early.

Where it makes sense, Test IDs that simulate various types of users or groups should be used to help test various usage and security scenarios. This is particularly useful in automated testing.

On large projects where there is a lot to be tested, it is best to break up the testing across multiple people or teams. This will help to prevent testing fatigue and sloppy testing and result in higher quality testing.

Plan ahead to ensure that access for all the relevant users is set up in the testing environments.

Communication

With all the teams and people involved, it is important to have a plan for how they will communicate. Things to consider and have a plan for include:

  • How will teams communicate internally? Email, Microsoft Teams, SharePoint, and shared files are some options.
  • How will the various teams involved communicate with each other? In other words, how will cross-team communication be handled? The same options apply: email, Microsoft Teams, SharePoint, and shared files.
  • How will issues and status be communicated? Weekly meetings, Status emails or documents, Shared files available on shared spaces are options.
  • How will changes and resolutions be tracked? Files, SDLC applications, Change Management applications are options.
  • How will teams and individuals be notified when they need to perform a task? Manual communication or automated notifications from tools are options.

Data

The most important thing to ensure in data projects is that the data is high quality, particularly the “base” data set. If the base data is incorrect, everything built on top of it will be bad. The correctness of intermediate and user-facing data is just as important, of course, but validating the base data is critical to achieving correct data everywhere else.

  • Ensure that table counts, field counts, and row counts of key data are correct (a count-check sketch follows this list).
  • Does the data warehouse data match the source data?
  • Test detailed, low-level records with small samples of data.
  • Test to ensure that the data and the values conform to what is expected. For example, ensuring that there is no data older than 3 years old, or ensuring that there are no account values outside a certain range. The Data Governance Team may become involved in these activities across all projects.
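As an example of the count checks mentioned above, here is a minimal sketch; SRC_ORDERS and DW_ORDERS are placeholder names for a source table and its data warehouse counterpart:

select 'SOURCE' as layer, count(*) as row_count from SRC_ORDERS
union all
select 'WAREHOUSE' as layer, count(*) as row_count from DW_ORDERS;

If the counts differ, narrow the comparison (by day, region, or other key slices) to isolate where rows were dropped or duplicated.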

Next in line is the “intermediate” data such as derived metrics, aggregates, specialized subsets, and more. These will also need to be verified.

  • Are the calculated values correct?
  • Are the aggregates correct? Test aggregate data with small, medium, and large sets of data (an example check follows this list).
  • Verify metric calculations.
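One way to verify an aggregate, sketched with hypothetical tables SALES_DETAIL and SALES_MONTHLY_AGG: re-derive the totals from the detail rows and report any month where the pre-built aggregate disagrees. An empty result means the aggregate matches the detail.

select d.MONTH_KEY, sum(d.SALE_AMOUNT) as detail_total, a.TOTAL_SALES as aggregate_total
from SALES_DETAIL d
join SALES_MONTHLY_AGG a on a.MONTH_KEY = d.MONTH_KEY
group by d.MONTH_KEY, a.TOTAL_SALES
having sum(d.SALE_AMOUNT) <> a.TOTAL_SALES;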

Then the user-facing data or data prepared for self-service usage needs to be validated.

  • Does the data on the dashboard match the data in the database?
  • Are the KPIs correctly reflecting the status?

Test the full flow of the data. The validity of the data should be verified at each stage of the data flow – from the source, to the staging, to the final tables in the data warehouse, to aggregates or subsets, to the dashboard.

Take snapshots of key datasets or reports so you can compare results post data migration.
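A simple way to take such a snapshot in Oracle (SALES and SALES_PRE_MIGRATION are placeholder names):

create table SALES_PRE_MIGRATION as select * from SALES;

After the migration, a MINUS query in each direction will surface any rows that changed:

select * from SALES_PRE_MIGRATION minus select * from SALES;
select * from SALES minus select * from SALES_PRE_MIGRATION;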

Some additional data prep might be needed in some cases.

  • These include making sure that you have sourced adequate data for testing. For example, if you need to test an annual trend, then it might be best to have at least a year’s worth of data, preferably two.
  • You may need to scramble or redact some data for testing. Often, test data is taken from the Production environment and then scrambled and/or redacted so as not to expose sensitive information (a scrambling sketch follows this list).
  • You may need to temporarily load in data for testing. For various reasons, you may need to load some Production data into the QA environment just to test the solution or a particular feature and then remove the data after the testing is complete. While this can be time consuming, sometimes it’s necessary, and it’s good to be aware of the need early and make plans accordingly.
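As a sketch of the scrambling idea (EMPLOYEE_TEST, LAST_NAME, and SSN are hypothetical names), Oracle's DBMS_RANDOM package can overwrite sensitive values in a test copy:

update EMPLOYEE_TEST set LAST_NAME = dbms_random.string('U', 8), SSN = lpad(trunc(dbms_random.value(0, 999999999)), 9, '0');

dbms_random.string('U', 8) returns eight random uppercase letters, and the SSN expression produces a random nine-digit string.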

Aesthetics & Representation of Data

Presentation matters. Although the most critical thing is data correctness, how the data is presented is also very important. Good presentation helps with understanding, usability, and adoption. A few things to consider include:

  • Does the application (a dashboard, for example) look good? Does it look right?
  • Are the components laid out properly so that there is no overcrowding?
  • Are the logos, colors and fonts in line with company expectations?
  • Are proper chart options used to display the various types of data and metrics?
  • Is the information provided in a way that users can digest?

Usage

The data application or solution should be user friendly, preferably intuitive or at least have good documentation. The data must be useful to the intended audience, in that, it should help them to understand the information and make good decisions or take sensible actions based on it.

The application should present data in a manner that is effective – easy to access, and easy to understand.

The presentation should satisfy the analytic workflows of the various users. Users should be able to logically step through the application to find information at the appropriate level of detail that they need based on their role.

A few things that affect usability include:

  • Prompts – ensure that all the proper prompts or selections are available to users to slice and filter the data as necessary. And of course, verify that they work.
  • Drill downs and drill throughs – validate that users can drill-down and across data to find the information they need in a simple, logical manner.
  • Easy interrogation of the data – if the application is ad-hoc in nature, validate that users can navigate it or at least verify that the documentation is comprehensive enough for users to follow.

Security

Securing the application and its data so that only authorized users have access to it is critical.

Application security comprises “authentication” – access to the application – and “authorization” – what a user is authorized to do when he or she accesses the application.

Authorization (what a user is authorized to do within the application) can be broken into “object security” – what objects or features a user has access to, and “data security” – what data elements a user has access to within the various objects or features.

For example, a user has access to an application (authenticated / can log in), and within the application the user has access to (authorized to see and use) 3 of 10 reports (object-level security). The user is not authorized to see the other 7 reports (object-level security) and, therefore, will not have access to them. Now, within the 3 reports that the user has access to, he or she can only see data related to 1 of 5 departments (data-level security).
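Here is a simplified sketch of how data-level security like this is often implemented in the database (all object names here are hypothetical): a view joins the data to a user-to-department mapping keyed on the database login, so each user sees only their own department's rows. Oracle's Virtual Private Database (VPD) policies are a more robust option; the view is just the simplest illustration.

create or replace view SALES_SECURE as
select s.*
from SALES s
join USER_DEPT_MAP m on m.DEPARTMENT_ID = s.DEPARTMENT_ID
where m.USERNAME = sys_context('USERENV', 'SESSION_USER');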

All object-level and data-level security needs to be validated. This includes negative testing: not only test to make sure that users have the access they need, but also test to ensure that users do not have access that they should not have.

  • Data for testing should be scrambled or redacted as appropriate to protect it.
  • Some extremely sensitive data may need to be filtered out entirely.
  • Can all the appropriate users access the application?
  • Are non-authorized users blocked from accessing the application?
  • Can users see the data they should be able to see to perform their jobs?

Performance

Performance of the data solution is important to user efficiency and user adoption. If users cannot get the results they need in a timely manner, they will look elsewhere to get what they need. Even if they have no choice, a poorly performing application will result in wasted time and dollars.

A few things to consider for ensuring quality around performance:

  • Application usage – is the performance acceptable? Do the results get returned in an acceptable time? (A quick timing check is sketched after this list.)
  • Data Integration – is the load performance acceptable?
  • Data processing – can the application perform all the processing it needs to do in a reasonable amount of time?
  • Stress Testing – how is performance with many users? How is it with a lot of data?
  • How is performance with various selections or with no selections at all?
  • Is ad-hoc usage set up to be flexible while avoiding rogue analyses that may cripple the system?
  • Is real-time analysis needed, and is the application quick enough?
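For the simple response-time checks above, SQL*Plus and SQL Developer can report elapsed time per statement – a quick first measure before any formal load testing (SALES is a placeholder table):

SET TIMING ON
select count(*) from SALES;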

These items need to be validated and any issues need to be reported to the appropriate teams for performance tuning before the application is released for general usage.

Methodology

Each organization, and even each team within an organization, will have a preferred methodology for application development and change management, including how they perform QA activities.

Some things to consider include:

  • Get QA resources involved in projects early so that they gain an early understanding of the requirements and the solutions to assess and plan how best to test.
  • When appropriate, do not wait until all testing is complete before notifying development teams of issues discovered. Notifying them early could make the difference between your project being on time or late.
  • Create a test plan and test scripts – even if they are high-level.
  • Where possible, execute tasks in an agile, iterative manner.
  • Each environment will have unique rules and guidelines that need to be validated. For example, your application may have a special naming convention, color & font guidelines, special metadata items, and more. You need to validate that these rules and guidelines are followed.
  • Use a checklist to ensure that you validate with consistency from deliverable to deliverable.
  • When the solution being developed is replacing an existing system or dataset, use the new and old solutions in parallel to validate the new against the old.
  • Document test results. All testing participants should document what has been tested and the results. This may be as simple as a checkmark or a “Done” status, but may also include things like data entered, screenshots, results, errors, and more.
  • Update the appropriate tracking tools (such as your SDLC or Change Management tools) to document changes and validation. These tools will vary from company to company, but it is best to have a trail of the development, testing, and release to production.
  • For each company and application, there will be a specific, unique set of things that will need to be done. It is best to have a standard test plan or test checklist to help you confirm that you have tested all important aspects and scenarios of the application.

This is not an all-encompassing coverage of Quality Assurance for data solutions, but I hope the article gives you enough information to get started or tips for improving what you currently have in place. You can share your questions, thoughts and input via comments to this post. Thanks for reading!

Connecting to Microsoft SQL Server database from Oracle SQL Developer

If you work primarily with Oracle databases, you may use SQL Developer. But you may also need to connect to Microsoft SQL Server databases and not necessarily want to install a new front-end database tool, such as Microsoft SQL Server Management Studio (SSMS).  You can connect to SQL Server from SQL Developer.

First, download the appropriate JDBC Driver for the version of SQL Server that you need to connect to. Then follow the steps in the video at the link below on the Oracle website.

https://www.oracle.com/technetwork/developer-tools/sql-developer/sql-server-connection-viewlet-swf-089886.html

Good luck.

 

InfatoODI – Informatica to ODI conversion tool

We are currently in the process of upgrading Oracle Business Intelligence Applications (OBIA) from version 7.9.6 to OBIA 11g. Oracle has replaced Informatica as the data integration tool in the platform with its own tool, Oracle Data Integrator (ODI). This was a selfish, profit-driven move on Oracle’s part with no consideration for the impact on customers, but it is what it is.

Because of this, as a part of the upgrade to the new OBIA release, we need to convert all our hundreds of Informatica mappings to ODI.  As you can imagine, this is a lot of work.  We are getting help from a company that has developed a specialized conversion tool called InfatoODI, which converts Informatica mappings to ODI interfaces.

We are performing the conversions specifically for an OBIA application, but the tool can be used as a straight conversion tool for Informatica-to-ODI for any type of application.

We are in the beginning stages of the project, but early indications are that the tool will save us time; I am just not sure yet how significant the savings will be. I will post updates as we progress through the conversions, with my experience and opinion of the tool.

Working with built-in datasets in R

This is a quick cheat sheet of commands for working with built-in datasets in R.

R has a number of base datasets that come with the install, and there are many packages that also include additional datasets.  These are very useful and easy to work with.

 

> ?datasets                                        # get help info on the datasets package

> library(help = "datasets")                       # provides detailed information on the datasets package, including a listing and description of the datasets in the package

> data()                                           # lists all the datasets currently available, by package

> data(package = .packages(all.available = TRUE))  # lists the datasets in all installed packages, even packages not currently loaded

> data(EuStockMarkets)                             # loads the dataset EuStockMarkets into the workspace

> summary(AirPassengers)                           # provides a summary of the dataset AirPassengers

> ?airquality                                      # outputs Help information about the dataset "airquality"

> str(airmiles)                                    # shows the structure of the dataset "airmiles"

> View(airmiles)                                   # shows the data of the dataset "airmiles" in spreadsheet format

 

 

Working with special character letters in Oracle database (grave accent, acute accent, circumflex accent, tilde, and umlaut)

We have some external data feeds that we receive as CSV files, and we load them into our data warehouse and process them. It turns out that data values containing special characters were getting mangled at some point before or during the load to the Oracle database.

In this post, I will go over one way to select, insert or update data values that contain these special characters.  The “special characters” I am referring to are the grave accent, acute accent, circumflex accent, tilde, and umlaut.

The most common one we come across is the “acute accent” which is present in words such as café or entrée (accent over the e in both words).

If you want to insert these words without the accents, into an example table, WORDS, with a single column, SPECIAL_CHAR_WORD, it would simply be:

insert into WORDS (SPECIAL_CHAR_WORD) values ('cafe');
insert into WORDS (SPECIAL_CHAR_WORD) values ('entree');

But if you want to insert these words with the accents, then you would need to do this:

insert into WORDS (SPECIAL_CHAR_WORD) values (UNISTR('cafe\0301'));
insert into WORDS (SPECIAL_CHAR_WORD) values (UNISTR('entre\0301e'));

(The \0301 escape is the UNISTR code for the Unicode combining acute accent.)

To select the word café with the accent, run this statement:

select UNISTR('cafe\0301') from dual;

Once a column value is already in a table in the proper format, you can simply select the column name using a normal SQL select statement, and the output will show properly (maybe with just the need for some formatting, as you will see later in the article).

And for update – to update the word entree to entrée, run this statement:

update WORDS set SPECIAL_CHAR_WORD = UNISTR('entre\0301e') where SPECIAL_CHAR_WORD = 'entree';

To see several special characters (grave accent, acute accent, circumflex accent, tilde, and umlaut), run this statement …

select UNISTR('bare\0300ge') from dual   -- barège    -- grave accent
union
select UNISTR('entre\0301e') from dual   -- entrée    -- acute accent
union
select UNISTR('pa\0302turer') from dual  -- pâturer   -- circumflex accent
union
select UNISTR('jalapen\0303o') from dual -- jalapeño  -- tilde
union
select UNISTR('fu\0308r') from dual;     -- für       -- umlaut

… the output would look like this … which seems not quite right …

[Screenshot: the five words with the accents rendered as separate, decomposed characters]

Add the COMPOSE function (which normalizes the characters into their composed, display-friendly form) as shown below …

select COMPOSE(UNISTR('bare\0300ge')) as SPECIAL_CHAR_WORD from dual  -- barège    -- grave accent
union
select COMPOSE(UNISTR('entre\0301e')) from dual   -- entrée    -- acute accent
union
select COMPOSE(UNISTR('pa\0302turer')) from dual  -- pâturer   -- circumflex accent
union
select COMPOSE(UNISTR('jalapen\0303o')) from dual -- jalapeño  -- tilde
union
select COMPOSE(UNISTR('fu\0308r')) from dual;     -- für       -- umlaut

and the output will look like this …

[Screenshot: the same five words with the accented characters displayed correctly]

As you can see, the key to this is knowing the code for the special character you need, then using the UNISTR function to add the special character to the rest of the text, and, if necessary, using COMPOSE for display purposes.

Thanks for reading. Hope you found this helpful.