MS SSAS 2008 Unleashed - Chapter 2. Multidimensional Space

2012-8-16 14:40 | Posted by: demo

Chapter 2. Multidimensional Space

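In This Chapter

Describing Multidimensional Space
Dimension Attributes
Cells
Measures
Aggregation Functions
Subcubes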

Working with relational databases, we’re used to a two-dimensional space: the table, with its records (rows) and fields (columns). We use the term cube to describe a multidimensional space, but it’s not a cube in the geometrical sense of the word. A geometrical cube has only three dimensions. A multidimensional data space can have any number of dimensions, and those dimensions don’t have to be the same (or even similar) size.

One of the most important differences between geometric space and data space is that a geometric line is made up of an infinite number of contiguous points, but our multidimensional space is discrete and contains a finite number of values on each dimension.

Describing Multidimensional Space

We’re going to define the terms that we use to describe multidimensional space. To a certain extent, they are meaningful only in relation to each other:

A dimension describes some aspect of the data that the company wants to analyze. For example, your company’s data probably has a time element in it; Time could become a dimension in your model.
A member corresponds to one point on a dimension. For example, in the Time dimension, Monday would be a dimension member.
A value is a unique characteristic of a member. For example, in the Time dimension, 5/12/2008 might be the value of the member with the caption “Monday.”
An attribute is the full collection of members. For example, all the days of the week would be an attribute of the Time dimension.
The size, or cardinality, of a dimension is the number of members it contains. For example, a Time dimension made up of the days of the week would have a size of 7.
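
The terms above can be pictured with a small sketch. This is only an illustration in plain Python, not how Analysis Services stores a dimension; the names `day_members`, `time_dimension`, and `cardinality` are invented for the example, and the dates simply extend the 5/12/2008 value from the text across one week.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# 5/12/2008 is the example value the text gives for the member "Monday".
week_start = date(2008, 5, 12)

# Each member is one point on the dimension: a caption plus a unique value.
day_members = [
    {"caption": name, "value": week_start + timedelta(days=offset)}
    for offset, name in enumerate(DAY_NAMES)
]

# The attribute is the full collection of members; the dimension owns it.
time_dimension = {"name": "Time", "attributes": {"Day": day_members}}


def cardinality(dimension, attribute_name):
    """Return the size of a dimension attribute: the number of its members."""
    return len(dimension["attributes"][attribute_name])


print(cardinality(time_dimension, "Day"))  # 7
```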

To illustrate, we’ll start with a three-dimensional space for the sake of simplicity. In Figure 2.1, we have three dimensions: (1) Time in months, (2) Products described by name, and (3) Customers described by their names. We can use these three dimensions to define a space of the sales of a specific product to specific customers over a specific period of time, measured in months.

Figure 2.1. A three-dimensional data space describes sales of products to customers over a time period.

In Figure 2.1, we have only one sales transaction represented by a point in the data space. If we represent every sales transaction of the product by a point in the multidimensional space, those points, taken together, constitute a “fact space” or “fact data.”

It goes without saying that the number of actual sales is much smaller than the number of sales possible if we were to sell each of our products to all our customers each month of the year. That’s the dream of every manager, of course, but in reality it doesn’t happen.

The total number of possible points creates a theoretical space. The size of the theoretical space is defined mathematically by multiplying the size of one dimension by the product of the sizes of the other two. When you have a large number of dimensions, the theoretical space can become huge; but no matter how large the space gets, it remains limited because each dimension is discrete and is limited by the finite number of its members.
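
To make the arithmetic concrete, here is a minimal sketch that computes the size of a theoretical space as the product of the dimension sizes and compares it with a sparse fact space. The member lists are deliberately tiny; "Product B", "Customer B", and the sale amount are placeholders invented for the example.

```python
from itertools import product
from math import prod

# Three small dimensions. "Product B" and "Customer B" are placeholders.
products = ["Club 2% Milk", "Product B"]
customers = ["Edward Melomed", "Customer B"]
months = ["January", "February", "March"]
dimensions = [products, customers, months]

# Size of the theoretical space: the product of the dimension sizes.
theoretical_size = prod(len(members) for members in dimensions)

# Enumerating every possible point gives the same count.
theoretical_space = set(product(*dimensions))

# The fact space holds only the sales that actually happened.
# The amount 3.5 is an invented value.
fact_space = {("Club 2% Milk", "Edward Melomed", "March"): 3.5}

empty_cells = theoretical_size - len(fact_space)
print(theoretical_size, len(fact_space), empty_cells)  # 12 1 11
```

Even in this toy case most of the theoretical space is empty, and the gap widens quickly as dimensions and members are added.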

The following list defines some more of the common terms we use in describing a multidimensional space:

A tuple is a coordinate in multidimensional space.
A slice is a section of multidimensional space that can be defined by a tuple.

Each point of a geometric space is defined by a set of coordinates; in a three-dimensional space, these are x, y, and z. Just as a point in geometric space is defined by a set of coordinates, a point in multidimensional space is also defined by a set of coordinates. This set is called a tuple.

For example, one point of the space shown in Figure 2.1 is defined by the tuple ([Club 2% Milk], [Edward Melomed], [March]).

An element on one or more dimensions in a tuple could be replaced with an asterisk (*) indicating a wildcard. In our terminology, that is a way to specify not a single member but all the members of this dimension. By specifying an asterisk in the tuple, we turn the tuple from a single point into a subspace (actually, a normal subspace). This sort of normal subspace is called a slice.

For example, the slice for the sales of all the products to all customers in January could be written as (*, *, [January]). But for simplicity, the wildcards in the definition of a slice are not written; in our case, it would be simply ([January]). Figure 2.2 shows the slice that contains the sales that occurred during January.

Figure 2.2. A slice of the sales from January.

You can think of many other slices, such as the sales of all the products to a specific customer ([Edward Melomed]), the sales of one product to all customers ([Club 2% Milk]), and so on.
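
The relationship between a tuple and a slice can be sketched in a few lines. This is an illustrative model only: the fact cells, the helper `slice_cells`, and the placeholder members "Product B" and "Customer B" are assumptions made for the example, and the wildcard is written out explicitly even though the text normally omits it.

```python
WILDCARD = "*"

# Fact cells keyed by (product, customer, month). Amounts are invented.
fact_cells = {
    ("Club 2% Milk", "Edward Melomed", "March"): 4.0,
    ("Club 2% Milk", "Edward Melomed", "January"): 2.5,
    ("Product B", "Customer B", "January"): 7.0,
}


def slice_cells(cells, pattern):
    """Return the cells whose coordinates match the pattern.

    A wildcard position matches every member of that dimension, which is
    what turns a single point into a slice.
    """
    return {
        coordinate: value
        for coordinate, value in cells.items()
        if all(p == WILDCARD or p == c for p, c in zip(pattern, coordinate))
    }


# A fully specified tuple addresses a single point.
point = slice_cells(fact_cells, ("Club 2% Milk", "Edward Melomed", "March"))

# (*, *, [January]) addresses the January slice.
january = slice_cells(fact_cells, (WILDCARD, WILDCARD, "January"))

print(len(point), len(january))  # 1 2
```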

Dimension Attributes

But how would you define the space of sales by quarter rather than by month? As long as you have a single attribute (months) for your Time dimension, you would have to group the months into quarters manually (or in your imagination). When you’re looking at multiple years, your manual grouping becomes unwieldy.

What you need is some way to visualize the months, quarters, and years (and any other kind of division of time, maybe days) in relation to each other—sort of like a ruler enables us to visualize the various divisions of a foot or a yard, and the inches and standard fractions of inches along the way.

In essence, what you need is additional attributes (quarters, years, and so forth). Now you can use months as your key attribute and relate the other attributes (related attributes) to the months—3 months to a quarter, 12 to a year.

So, back to our example. We want to “see” the individual months in each quarter and year. To do this, we’ll add two related attributes to the Time dimension (quarter and year) and create a relationship between those related attributes and the key attribute (month). Now we can create a “ruler,” like the one in Figure 2.3, for the dimension: year-quarter-month.

Figure 2.3. Related attributes (year, quarter) are calibrated relative to the key attribute (month).

Now we have a hierarchical structure for our “ruler”: a dimension hierarchy. The dimension hierarchy contains three hierarchy levels: Years, Quarters, and Months. Each level corresponds to an attribute. If you look at Figure 2.4, which appears a little later, you can see our ruler, with its hierarchical structure, within our multidimensional space.
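
As a rough sketch of the "ruler," the snippet below derives the related attributes (quarter and year) from the key attribute (month) and then groups the months into a year-quarter-month hierarchy. The function names and the label format ("Q1 2008") are choices made for this example, not something the model prescribes.

```python
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]


def related_attributes(year, month_name):
    """Map a key-attribute member (a month) to its related attribute members."""
    month_number = MONTHS.index(month_name) + 1
    quarter = (month_number - 1) // 3 + 1  # 3 months to a quarter
    return {
        "Year": year,
        "Quarter": f"Q{quarter} {year}",
        "Month": f"{month_name} {year}",
    }


def build_hierarchy(year):
    """Group the months of one year under their quarters: year > quarter > month."""
    quarters = {}
    for month_name in MONTHS:
        levels = related_attributes(year, month_name)
        quarters.setdefault(levels["Quarter"], []).append(levels["Month"])
    return {year: quarters}


ruler = build_hierarchy(2008)
print(ruler[2008]["Q1 2008"])
# ['January 2008', 'February 2008', 'March 2008']
```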

Figure 2.4. Related attributes create new points in multidimensional space.

A dimension can have more than one hierarchy. For example, if we count time in days, we could add another attribute: the days of the week. And we could remove the key attribute designation from month and give it to day.

Now we can have two dimension hierarchies: year, quarter, month, day; and year, week, day.

Note

We sneaked an additional attribute in there—week. We had to do that because a month doesn’t divide nicely into weeks. So, in the second dimension hierarchy, we dropped month and substituted week (by ordinal number).

Cells

With our ruler added to this multidimensional space, we can see (in Figure 2.4) some new positions on the ruler that correspond to the members of the related attributes (quarter, year) that were added. These members, in turn, create a lot of new points in the multidimensional space. However, you don’t have any values for those new points because the data from our external source contained only months. You won’t have values for those points until (or unless) you calculate them.

At this point, you have a new space, the logical space, as opposed to the fact space, which contains only the points that represent actual sales, and the theoretical space, which represents all the sales transactions that could possibly happen.

Your cube, then, is made up of the collection of points of both the theoretical (including fact space) and logical spaces (in other words, the “full space” of the multidimensional model). Each point in the cube’s space is called a cell.

Therefore, a cell in the cube can fall into one of the three spaces. The cell in the fact space is associated with an actual sale of a product to a customer. In Figure 2.5, we can see a fact cell that represents an actual sale: It contains the amount that a customer paid for the product. If the sale wasn’t made (that is, a potential sale), our cell is just a theoretical point in the cube (a theoretical cell). We don’t have any data in this cell. It’s an empty cell with a value of NULL. For the fact cell, where we have the amount that the customer paid, that amount is the cell value.

Figure 2.5. This cube diagram shows two cells: one with a real value and one with a NULL value.
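
The three kinds of cell can be told apart with a simple lookup. The sketch below is a toy model under stated assumptions: the member sets, the single fact cell and its amount, and the helpers `classify` and `cell_value` are all invented for illustration, and Python's `None` stands in for NULL.

```python
# Members of the key attribute (months) and of the related attributes.
MONTH_MEMBERS = {"January", "February", "March"}
RELATED_MEMBERS = {"Q1", "2008"}

# One actual sale, keyed by (product, customer, time). The amount is invented.
fact_cells = {("Club 2% Milk", "Edward Melomed", "March"): 3.5}


def classify(coordinate):
    """Say which space a cell belongs to, based on its time coordinate."""
    time_member = coordinate[2]
    if time_member in RELATED_MEMBERS:
        return "logical"      # exists only because of related attributes
    if coordinate in fact_cells:
        return "fact"         # an actual sale
    return "theoretical"      # a sale that could have happened but did not


def cell_value(coordinate):
    """Return the stored value, or None (standing in for NULL) if empty."""
    return fact_cells.get(coordinate)


sold = ("Club 2% Milk", "Edward Melomed", "March")
not_sold = ("Club 2% Milk", "Edward Melomed", "January")
quarter = ("Club 2% Milk", "Edward Melomed", "Q1")

print(classify(sold), cell_value(sold))          # fact 3.5
print(classify(not_sold), cell_value(not_sold))  # theoretical None
print(classify(quarter))                         # logical
```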

Measures

The value in a cell is called a measure. Figure 2.5 shows the amount the customer paid for the product. To tell the truth, we arbitrarily chose the amount paid as the value for that cell. We could have used some other value that describes the sale—such as the number of items (of that product) the customer bought. As a matter of fact, that’s a good idea. We’ll just add another measure so that we have two: the amount the customer paid and the quantity of items of the product that she bought.

These measures, taken together, can be seen as a dimension of measures—a measure dimension. Each member of this dimension (a measure) has a set of properties, such as data type, unit of measure, and (this is the most important one) the calculation type for the data aggregation function.
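
One way to picture the measure dimension is as one more coordinate on each cell, with the properties stored per measure. The `Measure` class, the "USD" and "items" units, and the sample values below are assumptions for the sake of the example; they are not the property names used by Analysis Services.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Measure:
    """A member of the measure dimension, with a few illustrative properties."""
    name: str
    data_type: type
    unit: str
    aggregate: Callable[[Iterable], object]  # the aggregation function


measures = {
    "Amount": Measure("Amount", float, "USD", sum),
    "Quantity": Measure("Quantity", int, "items", sum),
}

# Cells are now addressed by (product, customer, month, measure).
# The values are invented.
cells = {
    ("Club 2% Milk", "Edward Melomed", "March", "Amount"): 3.98,
    ("Club 2% Milk", "Edward Melomed", "March", "Quantity"): 2,
}

for (product_name, customer, month, measure_name), value in cells.items():
    measure = measures[measure_name]
    print(f"{measure.name}: {value} {measure.unit}")
```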

Aggregation Functions

The type of calculation is the link that binds together the theoretical (fact) and logical space of the cube. It is the data aggregation function that enables us to calculate the values of cells in the logical space from the values of the cells in the fact space; we cannot calculate values based on the empty values in the theoretical space.

An aggregation function can be either simple (additive) or complex (semi-additive). The list of additive aggregation functions is pretty limited—the sum of the data, the minimum and maximum values of the data, and a calculation of the count, which is really just a variation on the sum. All other functions are complex and use complex formulas and algorithms, which we discuss in Chapter 12, “Cube-Based MDX Calculations.”

As opposed to geometric space, in which the starting point is the point at which all the coordinates equal 0, the starting point for multidimensional space is harder to define. For example, if one of our dimensions is month, we don’t have a value of 0 anywhere along the dimension. Therefore, you can define the beginning of multidimensional space by the attribute that unites all the members of the dimension; that attribute contains only one member, All. For a simple aggregation function, such as sum, the member All is equivalent to the sum of the values of all the members of the fact space; for complex aggregation functions, All is calculated by the formula associated with the function.
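
The following sketch shows, under simplifying assumptions, how values for logical cells (a quarter, or the All member) could be derived from fact cells with a simple aggregation function. Only the Time dimension is aggregated here; the month-to-quarter map, the amounts, and the `aggregate` helper are invented for the example. Empty theoretical cells contribute nothing, and a logical cell with no fact cells beneath it stays empty.

```python
# Related attribute: which quarter each month belongs to (partial map).
QUARTER_OF = {"January": "Q1", "February": "Q1", "March": "Q1", "April": "Q2"}

# Fact cells keyed by (product, customer, month). There is no February sale.
fact_cells = {
    ("Club 2% Milk", "Edward Melomed", "January"): 2.0,
    ("Club 2% Milk", "Edward Melomed", "March"): 3.0,
    ("Club 2% Milk", "Edward Melomed", "April"): 5.0,
}


def aggregate(cells, product_name, customer, months, func=sum):
    """Compute a logical cell from the fact cells of the given months.

    Months without a fact cell are skipped; if none has data, the result
    is None (standing in for NULL).
    """
    values = [
        cells[(product_name, customer, month)]
        for month in months
        if (product_name, customer, month) in cells
    ]
    return func(values) if values else None


q1_months = [month for month, quarter in QUARTER_OF.items() if quarter == "Q1"]

q1_total = aggregate(fact_cells, "Club 2% Milk", "Edward Melomed", q1_months)
q1_max = aggregate(fact_cells, "Club 2% Milk", "Edward Melomed", q1_months, max)
all_total = aggregate(fact_cells, "Club 2% Milk", "Edward Melomed", QUARTER_OF)

print(q1_total, q1_max, all_total)  # 5.0 3.0 10.0
```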

Subcubes

An important concept in the multidimensional data model is a subspace or subcube. A subcube represents a part of the full space of the cube as some multidimensional figure inside the cube. Because the multidimensional space of the cube is discrete and limited, the subcube is also discrete and limited. The slice that we discussed earlier is a case of a subcube in which the boundaries are defined by a single member in the dimension.

The subcube can be either normal or of an arbitrary shape. A subcube consists of points in the multidimensional space. In a normal subcube, a coordinate that exists on one dimension must be present for every coordinate on the other dimensions among the subcube’s points. An arbitrary-shape subcube doesn’t have this limitation and can include points with any coordinates. In Figure 2.6, you can see examples of a normal and an arbitrary-shaped subcube.

Figure 2.6.
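
The difference between a normal and an arbitrary-shape subcube can be checked mechanically: a set of points is normal exactly when it equals the full cross product of the members it uses on each dimension. The helper `is_normal` and the two-dimensional sample points below are assumptions made for this sketch.

```python
from itertools import product


def is_normal(points):
    """Return True if the points form a normal subcube.

    The points are normal when every member used on one dimension appears
    together with every combination of members used on the others.
    """
    points = set(points)
    if not points:
        return True
    # Collect the members used on each dimension.
    axes = [set(axis) for axis in zip(*points)]
    return points == set(product(*axes))


# Two products over two months, every combination present: normal.
normal_subcube = {
    ("Club 2% Milk", "January"), ("Club 2% Milk", "February"),
    ("Product B", "January"), ("Product B", "February"),
}

# Only two of the four combinations present: arbitrary shape.
arbitrary_subcube = {("Club 2% Milk", "January"), ("Product B", "February")}

print(is_normal(normal_subcube), is_normal(arbitrary_subcube))  # True False
```

A slice such as ([January]) always passes this check, because it contains every combination of the members on the remaining dimensions.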

Describing multidimensional space requires a new vocabulary:

Aggregation function— A function that enables us to calculate the values of cells in the logical space from the values of the cells in the fact space
Attribute— A collection of similar members of a dimension
Cell value— A measure value of a cell
Dimension— An element in the data that the company wants to analyze
Dimension hierarchy— An ordered structure of dimension members
Dimension size— The number of members a dimension contains
Measure— The value in a cell
Member— One point on a dimension
Member value— A unique characteristic of a member
Tuple— A coordinate in multidimensional space
Slice— A section of multidimensional space that can be defined by a tuple
Subcube— A portion of the full space of a cube