
Open Source tool for Data cleansing and Master Data Management

Last weekend SQL Power released an improved version of SQL Power DQguru (formerly known as SQL Power MatchMaker), one of the few open source tools available for data cleansing and master data management (MDM). Version 0.96 adds the ability to run SQL Power DQguru from the command line, so you can integrate it into batch scripts and your ETL jobs.

As a BI consultant for SQL Power I have used SQL Power DQguru in different projects and it has made my job a lot easier. Some of the features I like the most are:

  • Easy connection to any database with JDBC drivers, including SQL Server, Oracle, MySQL and Postgres
  • Lets you create complex merge rules so your dependent data will always be updated when you merge records.
  • You can combine over 25 steps to find possible duplicate data with a match rule, for example:
    • Word Count
    • Regular Expressions
    • Substrings
    • Retain certain characters
    • Translate Words (you can create your own translation rules)
  • You can preview what your data will look like when you apply the match rules
  • Automatic Address correction (for Canadian addresses, Premium version)
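To make the idea of chaining steps more concrete, here is a minimal Python sketch of a match key built from steps similar to the ones listed above. This is not how SQL Power DQguru implements its steps; the function names, the translation table and the sample records are all made up for illustration.

```python
import re

# Hypothetical translation rules; in a real project you would define your own.
TRANSLATIONS = {"street": "st", "avenue": "ave", "incorporated": "inc"}


def retain_chars(text, allowed="abcdefghijklmnopqrstuvwxyz0123456789 "):
    """Lowercase the text and keep only the characters listed in `allowed`."""
    return "".join(ch for ch in text.lower() if ch in allowed)


def regex_replace(text, pattern=r"\s+", replacement=" "):
    """Apply a regular expression; by default collapse runs of whitespace."""
    return re.sub(pattern, replacement, text).strip()


def translate_words(text, table=TRANSLATIONS):
    """Replace whole words according to the translation table."""
    return " ".join(table.get(word, word) for word in text.split())


def substring(text, start=0, length=12):
    """Return `length` characters of the text starting at `start`."""
    return text[start:start + length]


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def match_key(raw):
    """Chain the steps into one key; records with equal keys are possible duplicates."""
    cleaned = translate_words(regex_replace(retain_chars(raw)))
    return substring(cleaned), word_count(cleaned)


records = ["123 Main Street", "123  MAIN St.", "45 Oak Avenue"]
keys = [match_key(r) for r in records]
```

The first two records end up with the same key even though they differ in case, spacing, punctuation and abbreviation, which is the kind of candidate duplicate a match rule is meant to surface.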
Here is an example of what a simple match rule could look like using some of the available steps:


Even though the user interface is mostly straightforward, it might be useful to take advantage of the user guide, which is available for a small fee. You will see that SQL Power DQguru is very powerful once you know how to use it.

Popular posts from this blog

Creating YTD transformation tables

The other day I had to set up a new data warehouse that will be used for reporting with MicroStrategy. Part of it was setting up the date dimension, including the transformation tables. I had a quick look online and couldn't find any script doing the work for me, so I created them myself (with the help of a colleague). All you need is an existing date dimension with date_id, year_id, quarter_id, month_id and week_id; you can find plenty of scripts for that online.

YTD table

select t1.day_id, t2.day_id as ytd_day_id
INTO YTD_DAY
from LU_DAY t1, LU_DAY t2
where t1.day_id >= t2.day_id
and t1.year_id = t2.year_id

QTD table

select t1.day_id, t2.day_id as qtd_day_id
INTO QTD_DAY
from LU_DAY t1, LU_DAY t2
where t1.day_id >= t2.day_id
and t1.QUARTER_id = t2.QUARTER_id

MTD table

select t1.day_id, t2.day_id as mtd_day_id
INTO MTD_DAY
from LU_DAY t1, LU_DAY t2
where t1.day_id >= t2.day_id
and t1.month_id = t2.month_id

WTD table

select t1.day_id, t2.day_id as wtd_day_id
INTO WTD_DAY
from LU_DAY t1, LU_DAY t2
where …
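
To see what such a transformation table contains, here is a small Python sketch that builds a YTD table in an in-memory SQLite database. Several things here are assumptions for the sake of a runnable example: the toy LU_DAY rows, the YYYYMMDD format of day_id, and the use of CREATE TABLE ... AS SELECT, because SQLite does not support SELECT ... INTO. The second column is named ytd_day_id, following the naming used for the QTD, MTD and WTD tables.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LU_DAY (day_id INTEGER PRIMARY KEY, year_id INTEGER)")

# Toy date dimension: 2008-12-30 through 2009-01-03, day_id stored as YYYYMMDD.
start = date(2008, 12, 30)
days = [start + timedelta(days=i) for i in range(5)]
conn.executemany(
    "INSERT INTO LU_DAY VALUES (?, ?)",
    [(d.year * 10000 + d.month * 100 + d.day, d.year) for d in days],
)

# Self-join: pair every day with all earlier-or-equal days of the same year.
conn.execute("""
    CREATE TABLE YTD_DAY AS
    SELECT t1.day_id AS day_id, t2.day_id AS ytd_day_id
    FROM LU_DAY t1
    JOIN LU_DAY t2
      ON t1.year_id = t2.year_id
     AND t1.day_id >= t2.day_id
""")


def ytd_days(day_id):
    """Return all days that count towards the year-to-date value of `day_id`."""
    rows = conn.execute(
        "SELECT ytd_day_id FROM YTD_DAY WHERE day_id = ? ORDER BY ytd_day_id",
        (day_id,),
    )
    return [r[0] for r in rows]


total_rows = conn.execute("SELECT COUNT(*) FROM YTD_DAY").fetchone()[0]
```

Joining a fact table to YTD_DAY on ytd_day_id and grouping by day_id then sums every fact from the start of the year up to each day. The same pattern with quarter_id, month_id or week_id in the join condition gives the other tables.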

Dynamic cell references in spreadsheets with Google Docs

During my former internships at a consulting company I had to work A LOT with Microsoft Excel and often had to use dynamic cell references across multiple worksheets. Recently I started using the spreadsheets of Google Docs to track my bank account balance and to figure out where all my money goes. I decided to have one sheet for every month + one sheet for the month I want to analyze. But how do I dynamically change the reference to the sheet (the monthly sheet) I want to analyze without editing every single formula? Here is my solution:

Create the target sheets and your overview sheet
- I gave my sheets the names Month + Year (July 09)
In the overview, choose one cell that you want to contain the reference sheet and enter the sheet name
- cell D24 in the example
- Using the month names you might have to write 'July 09, otherwise Google will think it's a date.
In the overview you can now dynamically reference a detail sheet using the following formula:
=INDIRECT("'"&…
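
The idea behind the approach can be mimicked in a few lines of Python: a reference is assembled as text from the sheet name stored in one cell and is only then resolved to a value. This is a rough sketch of the concept, not the spreadsheet formula itself. The toy workbook, the cell addresses other than D24, and the helper names are all made up, and the reference format 'Sheet Name'!Cell is an assumption about how the sheet name and cell are combined.

```python
# Toy workbook: sheet name -> {cell address -> value}. All values are made up.
workbook = {
    "July 09": {"B2": 1250.0, "B3": 310.5},
    "August 09": {"B2": 980.0, "B3": 275.0},
    "Overview": {"D24": "July 09"},  # the cell that holds the sheet to analyze
}


def build_reference(sheet_name, cell):
    """Build a reference string such as 'July 09'!B2 (quoted because of the space)."""
    return "'" + sheet_name + "'!" + cell


def indirect(reference):
    """Resolve a 'Sheet Name'!Cell reference string against the toy workbook."""
    sheet_part, cell = reference.rsplit("!", 1)
    return workbook[sheet_part.strip("'")][cell]


def overview_value(cell):
    """Read `cell` from whichever sheet is currently named in Overview!D24."""
    selected = workbook["Overview"]["D24"]
    return indirect(build_reference(selected, cell))


july = overview_value("B2")
workbook["Overview"]["D24"] = "August 09"  # switch months by editing one cell
august = overview_value("B2")
```

Changing the single entry in D24 redirects every lookup, which is why none of the formulas in the overview have to be edited when you switch months.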

Pentaho BI Server: Using action sequences as a web service with PHP

For my master's thesis I had to figure out how to use the action sequences as a web service with PHP. According to the documentation you can receive SOAP messages, but the action sequences don't offer a WSDL that would help you build your client. I also had problems with the HTTP basic authentication that Pentaho uses.
After a couple of hours of research and trial and error, I found a solution. I doubt that's the best way to go, but at least it works. All you need is the PEAR HTTP Request class.
Here is the code:

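//PEAR Request
require_once 'Request.php';
// Assumption: the excerpt omits the request setup. The URL and credentials below are placeholders.
$req = new HTTP_Request("http://localhost:8080/pentaho/Home");
$req->setBasicAuth("username", "password");
$response = $req->sendRequest();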

if (PEAR::isError($response)) {
echo $response->getMessage();
} else {
$req->clearPostData();
$req->setURL("http://localhost:8080/pentaho/ServiceAction");
$req->addQueryString("solution", "bi-developers");
$req->addQueryString("path", "reporting");
$req->addQueryString("action", "Testreport.xaction");