Archive for the ‘Data warehouse’ Category

Considering Key errors while cube processing

Considering Key errors while OLAP (SSAS) Cube processing:

In designing a data mart we often come across a situation while defining relationship between the fact and the dimension tables that is it feasible or not to have foreign keys NULLable? Are the built-in foreign keys of SQL Server are sufficient to handle relationship.

Foreign Key Null ability

It is not a good practice to leave the foreign key NULLable , because while moving data into the cubes causes relationship failure between the fact and

Read full story »

Natural and surrogate Keys

Natural and Surrogate Keys:-

According to the definition by Hall, Owlett and Todd (1976), a surrogate represents an entity in the outside world. The surrogate is internally generated by the system but is nevertheless visible to the user or application. From another practitioner perspective namely Wieringa and De Jonge in his point view a surrogate represents an object in the database itself. The surrogate is internally generated by the system and is invisible to the user or application, this indicate that the surrogate key has business meaning, in the fact table all natural keys are used as surrogate key.

Types of Fact Tables

An ETL developer must know the types of fact tables he/she will be using during dimensional modeling and ETL development.



Mid table or Factless table:

A table used to resolve many-to-many relationship between two dimension tables called a factless table, in Analysis Services these table are modeled as measure groups that bridges in different dimension tables. Read full story »

Slowly Changing Dimensions — SCD

Slowly Changing Dimension:-

For an ETL developer this is must to know that it is possible that the dimension might changes over time, like wise a product may changes his price and a customer changes his contact information, some of the changes are needed to be recorded and some of them may not, to track these changes  a technique used called Slowly Changing Dimensions(SCDs)

Reason to Model SCDs

If a product price changes over time we will lost the previous one that would result in the stats that the prices is remain same.

Degenerated Dimensions

Degenerated Dimension:-

The columns in the fact that are participated in some sort of analysis but do not have any relation with any of the existing dimension, they are kept in the degenerate dimension and used as and when needed, while during ETL process remember that normally the degenerate dimension has the  same cardinality as of fact table.

Junk Dimensions

Junk Dimension:-

Often at the end of the dimension modeling process, we come across with some attributes that do not belong to any specific diemsion these attributes are either have a very limited range or they seems to be not important enough to identified as dimension. One of the advantage of junk dimension is that if we join all these similar dimension in a single dimension we can reduce the number of fields in the fact table.

Dimensional Modeling- Star Schema

Dimension Modeling Methodologies:

The below are most used methodologies that are used for building a data mart.

Data warehouse schemas:

Collection of database objects, including tables, views, indexes, and synonyms are termed as schema.

Star Schema:

When you have identified your dimension and fact tables now it is needed to relate them, star schema methodology bound these tables in such a way that the fact table is kept center surrounded by any number of dimension tables representing a star style and called star schema. The Fact table is normally in the third normal form because the dependency is either on one dimension or all of them but not on combination of few dimensions.

The reason to choose a star schema is its simplicity from the users’ point of view: queries are never complex because the only joins and conditions involve a fact table and a single level of dimension tables, without the indirect dependencies.

Snowflake Schema:

A Kind of Star schema that is more complex data warehouse schema option to choose than its predecessor, It is called a snowflake schema because the diagram of the schema resembles a snowflake. The focus of this schema is to normalize the data warehouse to reduce the query performance, the stated schema normalized a dimension table into more sub tables although it saves space as it increases the dimension tables and require more foreign key to join tables but leaving the extraction process little bit complex as it increases the usage of complex SQL queries.

Dimensions and Facts

Data Mart:-

Regardless of choosing the data warehousing methodology your frond end using Analysis Services would ultimately be a data mart. In simple a data mart can be understand as  is a database that is organized according to the methodology you have chosen to your warehouse and is composed of fact and dimension tables.

  • Data Mart Structure:-

The structure of the data mart separate the entire database into to two distinct entities.

  • Dimension table:-

Dimension table hold attributes of different entities that describe fact records that are part of fact tables, among these some are representing descriptive information while rest of the other are used to specify how attributes of fact table should be modeled and sketched that would be helpful to the analyst. Dimension table normally consists of hierarchies of attributes that support summarization phase. Let us consider an example of a dimension containing customer  information that contain a hierarchy that divides customer in different categories  and subcategories until it reaches at the lowest level.

  • Fact Table:

Normally a data warehouse contain one or more fact table in their data mart structure, a fact table confined of measures, fact table normally contain huge  number of  rows as they contain historical data    for number of years. To acquire the information about different part of the organization the fact table contain the numeric(fact) that are summarized to fulfill the task to analyze.

Fact tables should not hold descriptive information or any data other than the numerical measurement fields and the index fields that relate the facts to corresponding entries in the dimension tables.

The fact table is the composition of foreign and primary keys that are related to relevant dimension table.

Data warehousing Approaches

Data Warehousing:-

A centralized data repository designed with enterprise-wide usage in mind. Data warehouse provides facility for getting quick, accurate, and often insightful information. A Data Warehouse is designed so that its users can recognize the information they want and access that information using simple tools. A Data Warehouse integrates operational data from various sources into a single and consistent architecture that supports analysis and decision-making within the organization.

Data warehouses may consist of large amount of data, sometimes in smaller logical units called Data marts. Often the schemas of data marts are stored in what are known as “star schemas”, or dimensional modeling form; however there is no industry standard requiring that the schemas of data marts be in any particular form. Data warehouses are usually accessed (queried) via “data marts”, which are purpose-specific access points to or sub-sets of the warehouse. Data marts are designed to answer the probable queries of a given kind of user.

Data warehousing Approaches:-

With Data Warehousing we have two schools of thought.

Bill Inmon’s paradigm:-

Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.

Inmon beliefs in creating a data warehouse on a subject-by-subject area basis. Hence the development of the data warehouse can start with data from the online store. Other subject areas can be added to the data warehouse as their needs arise. Point-of-sale (POS) data can be added later if management decides it is necessary.

Ralph Kimball Methodology-

Kimball views data warehousing as a constituency of data marts. Data marts are focused on delivering business objectives for departments in the organization. And the data warehouse is a conformed dimension of the data marts. Hence a unified view of the enterprise can be obtain from the dimension modeling on a local departmental level.

ETL Pentaho Data Integration Tutorial

ETL Extract, transform and Load is the corner stone for Business Intelligence and Data warehousing projects. Learning ETL is not always an easy job and that is the reason ETL developers are highly paid.
However, tools like Pentaho Data Integration Suite (Kettle) has made this job easy for newbies learning ETL.

I thought to share this simple and easy to learn Pentaho (Kettle) ETL  starter tutorial.

Einfobuzz Site Map