
Open Source Tool for Data Cleansing and Master Data Management

Last weekend SQL Power released an improved version of SQL Power DQguru (formerly known as SQL Power MatchMaker), one of the few open source tools available for data cleansing and master data management (MDM). Version 0.96 brings a new feature that lets you run SQL Power DQguru from the command line, so you can integrate it into batch scripts and your ETL jobs.

As a BI consultant for SQL Power I have used SQL Power DQguru in different projects and it has made my job a lot easier. Some of the features I like the most are:

  • Easy connection to any database with a JDBC driver, including SQL Server, Oracle, MySQL and PostgreSQL
  • Lets you create complex merge rules so your dependent data will always be updated when you merge records.
  • You can combine over 25 steps to find possible duplicate data with a match rule, for example:
    • Word Count
    • Regular Expressions
    • Substrings
    • Retain certain characters
    • Translate Words (you can create your own translation rules)
  • You can preview what your data will look like when you apply the match rules
  • Automatic Address correction (for Canadian addresses, Premium version)
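
To give a rough idea of how chaining such steps surfaces possible duplicates, here is a minimal Python sketch. It is not DQguru code and does not use any DQguru API: the function names, the translation table and the sample addresses are all made up for illustration, and the order of the steps is just one plausible choice.

```python
import re

# Hypothetical translation table; in practice you would define your own rules.
TRANSLATIONS = {"street": "st", "avenue": "ave", "incorporated": "inc"}

ALLOWED = "abcdefghijklmnopqrstuvwxyz0123456789 "


def retain_chars(value, allowed=ALLOWED):
    """'Retain certain characters': lower-case, then drop everything not in `allowed`."""
    return "".join(ch for ch in value.lower() if ch in allowed)


def collapse_whitespace(value):
    """'Regular expression' step: squeeze runs of whitespace into one space."""
    return re.sub(r"\s+", " ", value).strip()


def translate_words(value, table=TRANSLATIONS):
    """'Translate words': replace each word that has an entry in the table."""
    return " ".join(table.get(word, word) for word in value.split())


def substring(value, start=0, length=12):
    """'Substring': keep only part of the cleaned value."""
    return value[start:start + length]


def word_count(value):
    """'Word count': number of whitespace-separated words."""
    return len(value.split())


def match_key(value):
    """Chain the steps into one comparison key (illustrative order)."""
    cleaned = translate_words(collapse_whitespace(retain_chars(value)))
    return substring(cleaned), word_count(cleaned)


def possible_duplicates(records):
    """Group records that share a match key; return groups with more than one entry."""
    groups = {}
    for record in records:
        groups.setdefault(match_key(record), []).append(record)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    sample = ["123 Main Street", "123  main st.", "45 Oak Avenue"]
    for group in possible_duplicates(sample):
        print(group)
```

In this sketch the two spellings of the Main Street address end up with the same key and are reported together, while the Oak Avenue record stays on its own. A real match rule would usually compare several columns and leave the final merge decision to a person.
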
Here is an example of what a simple match rule could look like using some of the available steps:


Even though the user interface is mostly straightforward, it can be worth getting the user guide, which is available for a small fee. You will find that SQL Power DQguru is very powerful once you know how to use it.
