Microsoft SSAS logistic regression mining model node types

For writing DMX queries, here is the node type reference for a Microsoft SSAS logistic regression mining model:

NODE_TYPE
A logistic regression model outputs the following node types:

Node Type ID Description
1 Model.
17 Organizer node for the subnetwork.
18 Organizer node for the input layer.
19 Organizer node for the hidden layer. The hidden layer is empty.
20 Organizer node for the output layer.
21 Input attribute node.
23 Output attribute node.
24 Marginal statistics node.

So, for example, this DMX query returns the output attribute nodes:

SELECT * FROM [Model_Name].Content
WHERE NODE_TYPE = 23

Mining Model Content for Logistic Regression Models


Microsoft Decision Trees algorithm parameters: COMPLEXITY_PENALTY, SCORE_METHOD, SPLIT_METHOD

The Microsoft Decision Trees algorithm supports parameters that affect the performance and accuracy of the resulting mining model. You can also set modeling flags on the mining model columns or mining structure columns to control the way that data is processed.

Setting Algorithm Parameters

The following table describes the parameters that you can use with the Microsoft Decision Trees algorithm.

COMPLEXITY_PENALTY
Controls the growth of the decision tree. A low value increases the number of splits, and a high value decreases the number of splits. The default value is based on the number of attributes for a particular model, as described in the following list:

  • For 1 through 9 attributes, the default is 0.5.
  • For 10 through 99 attributes, the default is 0.9.
  • For 100 or more attributes, the default is 0.99.
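The default selection above can be made concrete with a small helper (a hypothetical function written just to mirror the documented thresholds, not part of any SSAS API):

```python
def default_complexity_penalty(attribute_count):
    """Return the documented COMPLEXITY_PENALTY default for a model
    with the given number of attributes (illustrative helper)."""
    if attribute_count < 1:
        raise ValueError("a model needs at least one attribute")
    if attribute_count <= 9:
        return 0.5    # 1 through 9 attributes
    if attribute_count <= 99:
        return 0.9    # 10 through 99 attributes
    return 0.99       # 100 or more attributes

print(default_complexity_penalty(5))    # 0.5
print(default_complexity_penalty(50))   # 0.9
print(default_complexity_penalty(150))  # 0.99
```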
FORCE_REGRESSOR
Forces the algorithm to use the specified columns as regressors, regardless of the importance of the columns as calculated by the algorithm. This parameter is only used for decision trees that are predicting a continuous attribute.

SCORE_METHOD

Determines the method that is used to calculate the split score. The following options are available:

ID Name
1 Entropy
3 Bayesian with K2 Prior
4 Bayesian Dirichlet Equivalent (BDE) with uniform prior (default)

The default is 4, or BDE.

SPLIT_METHOD

Determines the method that is used to split the node. The following options are available:

ID Name
1 Binary: Indicates that regardless of the actual number of values for the attribute, the tree should be split into two branches.
2 Complete: Indicates that the tree can create as many splits as there are attribute values.
3 Both: Specifies that Analysis Services can determine whether a binary or complete split should be used to produce the best results.

The default is 3.

MAXIMUM_INPUT_ATTRIBUTES

Defines the number of input attributes that the algorithm can handle before it invokes feature selection.

The default is 255.

Set this value to 0 to turn off feature selection.

[Available only in some editions of SQL Server]

MAXIMUM_OUTPUT_ATTRIBUTES

Defines the number of output attributes that the algorithm can handle before it invokes feature selection.

The default is 255.

Set this value to 0 to turn off feature selection.

[Available only in some editions of SQL Server]

MINIMUM_SUPPORT

Determines the minimum number of leaf cases that is required to generate a split in the decision tree.

The default is 10.

You may need to increase this value if the dataset is very large, to avoid overtraining.

 

Microsoft Decision Trees Algorithm Technical Reference

SSAS DMX Query for Dependency Network pane of the Microsoft Naive Bayes Viewer

SELECT NODE_CAPTION, MSOLAP_NODE_SCORE
FROM TM_NaiveBayes.CONTENT
WHERE NODE_TYPE = 10
ORDER BY MSOLAP_NODE_SCORE DESC

The node caption is used to identify the node, rather than the node name, because the caption shows both the attribute name and attribute value.

The MSOLAP_NODE_SCORE is a special value provided only for the input attribute nodes, and indicates the relative importance of this attribute in the model. You can see much the same information in the Dependency Network pane of the viewer; however, the viewer does not provide scores.

NODE_TYPE = 10 represents an input attribute node, and NODE_TYPE = 11 represents a node for each value of that attribute.

Reference: Naive Bayes Model Query Examples

Choosing Microsoft Prediction Model or Algorithm

From Microsoft: “An algorithm in data mining (or machine learning) is a set of heuristics and calculations that creates a model from data. To create a model, the algorithm first analyzes the data you provide, looking for specific types of patterns or trends. The algorithm uses the results of this analysis over many iterations to find the optimal parameters for creating the mining model. These parameters are then applied across the entire data set to extract actionable patterns and detailed statistics.”

Reference: SSAS has 9 data mining models or algorithms

Choosing an Algorithm by Type

SQL Server Data Mining includes the following algorithm types:

  • Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset.
  • Regression algorithms predict one or more continuous numeric variables, such as profit or loss, based on other attributes in the dataset.
  • Segmentation algorithms divide data into groups, or clusters, of items that have similar properties.
  • Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis.
  • Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a series of clicks in a web site, or a series of log events preceding machine maintenance.

However, there is no reason that you should be limited to one algorithm in your solutions. Experienced analysts will sometimes use one algorithm to determine the most effective inputs (that is, variables), and then apply a different algorithm to predict a specific outcome based on that data. SQL Server Data Mining lets you build multiple models on a single mining structure, so within a single data mining solution you could use a clustering algorithm, a decision trees model, and a Naïve Bayes model to get different views on your data. You might also use multiple algorithms within a single solution to perform separate tasks: for example, you could use regression to obtain financial forecasts, and use a neural network algorithm to perform an analysis of factors that influence forecasts. You can use the Microsoft Decision Trees algorithm not only for prediction, but also as a way to reduce the number of columns in a dataset, because the decision tree can identify columns that do not affect the final mining model.

Choosing an Algorithm by Task

To help you select an algorithm for use with a specific task, the following table provides suggestions for the types of tasks for which each algorithm is traditionally used.

Examples of tasks, and the Microsoft algorithms to use:

Predicting a discrete attribute:

  • Flag the customers in a prospective buyers list as good or poor prospects.
  • Calculate the probability that a server will fail within the next 6 months.
  • Categorize patient outcomes and explore related factors.

Algorithms: Microsoft Decision Trees, Microsoft Naive Bayes, Microsoft Clustering, Microsoft Neural Network.

Predicting a continuous attribute:

  • Forecast next year’s sales.
  • Predict site visitors given past historical and seasonal trends.
  • Generate a risk score given demographics.

Algorithms: Microsoft Decision Trees, Microsoft Time Series, Microsoft Linear Regression.

Predicting a sequence:

  • Perform clickstream analysis of a company’s Web site.
  • Analyze the factors leading to server failure.
  • Capture and analyze sequences of activities during outpatient visits, to formulate best practices around common activities.

Algorithms: Microsoft Sequence Clustering.

Finding groups of common items in transactions:

  • Use market basket analysis to determine product placement.
  • Suggest additional products to a customer for purchase.
  • Analyze survey data from visitors to an event, to find which activities or booths were correlated, to plan future activities.

Algorithms: Microsoft Association, Microsoft Decision Trees.

Finding groups of similar items:

  • Create patient risk profile groups based on attributes such as demographics and behaviors.
  • Analyze users by browsing and buying patterns.
  • Identify servers that have similar usage characteristics.

Algorithms: Microsoft Clustering, Microsoft Sequence Clustering.

Reference Microsoft

SSAS configuration file msmdsrv.ini

This file resides at: ?:\Program Files\Microsoft SQL Server\MSAS??.Instance_Name\OLAP\Config

Another way to find this path:

a) Go to the Services console (services.msc)

b) Find the SSAS service

c) Right-click the SSAS service and open its properties

d) In the General section, look for "Path to executable"

e) Below "Path to executable" you will find the complete path, which for a default configuration looks something like this:

“C:\Program Files\Microsoft SQL Server\MSAS11.TABULAR\OLAP\bin\msmdsrv.exe” -s “C:\Program Files\Microsoft SQL Server\MSAS11.TABULAR\OLAP\Config”

f) The second part of the path (after -s) is the location of the Config directory, where you will find the msmdsrv.ini file

In case you want to rebuild this msmdsrv.ini:

a) Stop SQL Server Analysis Services

b) Go to the SSAS Config folder (default path: ?:\Program Files\Microsoft SQL Server\MSAS??.Instance_Name\OLAP\Config)

c) In the Config folder you will find two files, msmdsrv.ini and msmdsrv.bak; move these two files to any temporary folder

d) Start SSAS

e) After starting the SSAS service you will find newly created copies of the msmdsrv.ini and msmdsrv.bak files in the Config folder

f) The file is created with default settings for all properties

g) To start with, you can provide the right paths for the Data, Temp, Log, and Backup directories; otherwise the root directory will be used for these folders. In a newly created msmdsrv.ini file, the path looks something like this: <DataDir>\data</DataDir>, which means the root directory is used for creating the Data folder
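Since msmdsrv.ini is an XML file, its settings can be read with any XML parser. A minimal sketch (parsing a trimmed, hypothetical fragment from a string rather than the real file; actual files contain many more elements):

```python
import xml.etree.ElementTree as ET

# A trimmed, hypothetical fragment of msmdsrv.ini; a value like "\data"
# means the folder is created under the root directory.
SAMPLE_INI = r"""<ConfigurationSettings>
  <DataDir>\data</DataDir>
  <LogDir>\log</LogDir>
</ConfigurationSettings>"""

root = ET.fromstring(SAMPLE_INI)
data_dir = root.findtext("DataDir")  # read a single setting by element name
print(data_dir)  # \data
```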

Thanks to this Reference.

So long AdventureWorks, and hello Wide World Importers!

Any future development on Microsoft sample databases should use Wide World Importers instead of AdventureWorks, since AdventureWorks will not be updated or maintained after 2016.

The Wide World Importers sample database can be downloaded from here.

Can I still use AdventureWorks with SQL Server 2016?

Yes, the final AdventureWorks build will be updated on the official download site here and you can still use AdventureWorks with your SQL Server 2016 instance.

Microsoft Reference

What is Slowly Changing Dimension

A data warehouse dimensional model consists of facts and measures defined in the context of their dimensions.
Dimensions and their attributes are relatively constant, but they do change over time.

The term Slowly Changing Dimension (SCD) is about tracking the variation in dimensional attributes over time.
The word slowly in this context might seem incorrect.
For example, a salesperson or store dimension, might change rapidly if a company reorganizes.
But in general, when compared to a measure in a fact table, changes to dimensional data occur slowly.

Within a data warehouse, you can handle slowly changing dimensions in several ways.

Type 0 – The passive method
In this method no special action is performed upon dimensional changes.
Some dimension data can remain the same as when it was first inserted; other data may be overwritten.

Type 1 (overwrite):
A Type One change overwrites an existing dimensional attribute with new information.
In the customer name-change example, the new name overwrites the old name, and the value for the old version is lost.
A Type One change updates only the attribute, doesn’t insert new records, and affects no keys.
No history is stored; existing data is overwritten by new values.

Before the change:

Customer_ID Customer_Name Customer_Type
1 Cust_1 Corporate

After the change:

Customer_ID Customer_Name Customer_Type
1 Cust_1 Retail
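The mechanics of a Type One update can be sketched in a few lines of Python (an illustrative in-memory model with hypothetical field names, not warehouse ETL code):

```python
# Dimension rows keyed by Customer_ID; a Type 1 change overwrites in place.
customers = {1: {"Customer_Name": "Cust_1", "Customer_Type": "Corporate"}}

def scd_type1_update(dim, customer_id, new_type):
    """Overwrite the attribute; the old value is lost and no row is added."""
    dim[customer_id]["Customer_Type"] = new_type

scd_type1_update(customers, 1, "Retail")
print(customers[1]["Customer_Type"])  # Retail
```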

Type 2 (add a row):
A Type Two change writes a record with the new attribute information and preserves a record of the old dimensional data.
Type Two changes let us preserve historical data.
The history of data changes is preserved.
A new record is inserted each time a change is made.
Every data row has a valid-from date and a valid-to date indicating the time period of the data’s validity, and each row usually has an isCurrent-type field that is set to Yes for the active record and No for the others.
When a fact table record is inserted, it will be given the appropriate surrogate key of the dimension record.
Since Type Two changes add records, they can increase the database’s size.

Before the change:

Customer_ID Customer_Name Customer_Type Start_Date End_Date Current_Flag
1 Cust_1 Corporate 22-07-2010 31-12-9999 Y

After the change:

Customer_ID Customer_Name Customer_Type Start_Date End_Date Current_Flag
1 Cust_1 Corporate 22-07-2010 17-05-2012 N
2 Cust_1 Retail 18-05-2012 31-12-9999 Y
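The Type Two steps above (close the old version, insert a new one) can be sketched in Python; this is an illustrative in-memory model with hypothetical names, not warehouse ETL code:

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # open-ended "valid to" date

dim = [
    {"Key": 1, "Customer_Name": "Cust_1", "Customer_Type": "Corporate",
     "Start_Date": date(2010, 7, 22), "End_Date": HIGH_DATE, "Current_Flag": "Y"},
]

def scd_type2_update(rows, name, new_type, change_date):
    """Close the current version and insert a new row with a new surrogate key."""
    for row in rows:
        if row["Customer_Name"] == name and row["Current_Flag"] == "Y":
            row["End_Date"] = change_date - timedelta(days=1)  # close old version
            row["Current_Flag"] = "N"
    rows.append({"Key": max(r["Key"] for r in rows) + 1,
                 "Customer_Name": name, "Customer_Type": new_type,
                 "Start_Date": change_date, "End_Date": HIGH_DATE,
                 "Current_Flag": "Y"})

scd_type2_update(dim, "Cust_1", "Retail", date(2012, 5, 18))
```

After the call, the dimension holds both versions, matching the example table: the old row ends on 17-05-2012 with the flag set to N, and the new row starts on 18-05-2012 with the flag set to Y.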

Type 3 (add a column):
This method traces changes using separate columns (but no new rows).
This means there is a limit to history preservation based on the number of columns in each row that are designated for storing historical data.
For example, a record may have the fields Territory1, Territory1EffectiveDate, Territory2, Territory2EffectiveDate, etc.
A Type Three change is implemented only if we have a limited need to preserve and accurately describe history, such as when someone gets married and you need to retain the previous name.
Instead of creating a new dimensional record to hold the attribute change, a Type Three change places a value for the change in the original dimensional record.
We can create multiple fields to hold distinct values for separate points in time.
In the case of a name change, you could create an OLD_NAME and NEW_NAME field and a NAME_CHANGE_EFF_DATE field to record when the change occurs.
This method preserves the change.
But how would we handle a second name change, or a third, and so on?
The side effects of this method are increased table size and, more important,
increased complexity of the queries that analyze historical values from these old fields.
After more than a couple of iterations, queries become impossibly complex, and ultimately you’re constrained by the maximum number of attributes allowed on a table.

Before the change:

Customer_ID Customer_Name Current_Type Previous_Type
1 Cust_1 Corporate Corporate

After the change:

Customer_ID Customer_Name Current_Type Previous_Type
1 Cust_1 Retail Corporate
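A Type Three update, sketched the same way (an illustrative in-memory model with hypothetical field names):

```python
# Type 3 keeps history in extra columns on the same row, not in extra rows.
customer = {"Customer_ID": 1, "Customer_Name": "Cust_1",
            "Current_Type": "Corporate", "Previous_Type": "Corporate"}

def scd_type3_update(row, new_type):
    """Shift the current value into the history column, then overwrite it.
    Only one prior value survives; a second change discards the first."""
    row["Previous_Type"] = row["Current_Type"]
    row["Current_Type"] = new_type

scd_type3_update(customer, "Retail")
print(customer["Current_Type"], customer["Previous_Type"])  # Retail Corporate
```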

Type 4 – Using a historical table.
In this method a separate historical table is used to track all historical changes to the dimension’s attributes, one historical table per dimension.

Current table:

Customer_ID Customer_Name Customer_Type
1 Cust_1 Corporate

Historical table:

Customer_ID Customer_Name Customer_Type Start_Date End_Date
1 Cust_1 Retail 01-01-2010 21-07-2010
1 Cust_1 Other 22-07-2010 17-05-2012
1 Cust_1 Corporate 18-05-2012 31-12-9999
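A Type Four update, sketched as two in-memory tables (illustrative only, with hypothetical names; the change date is made up for the example):

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # open-ended "valid to" date

# Current table holds one row per customer; history table holds every version.
current = {1: {"Customer_Name": "Cust_1", "Customer_Type": "Corporate"}}
history = [{"Customer_ID": 1, "Customer_Name": "Cust_1",
            "Customer_Type": "Corporate", "Start_Date": date(2012, 5, 18),
            "End_Date": HIGH_DATE}]

def scd_type4_update(cur, hist, customer_id, new_type, change_date):
    """Overwrite the current table, and append a new period to the history table."""
    for row in hist:
        if row["Customer_ID"] == customer_id and row["End_Date"] == HIGH_DATE:
            row["End_Date"] = change_date - timedelta(days=1)  # close open period
    cur[customer_id]["Customer_Type"] = new_type
    hist.append({"Customer_ID": customer_id,
                 "Customer_Name": cur[customer_id]["Customer_Name"],
                 "Customer_Type": new_type, "Start_Date": change_date,
                 "End_Date": HIGH_DATE})

scd_type4_update(current, history, 1, "Retail", date(2013, 6, 1))
```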

Type 6 – Combine the approaches of types 1, 2 and 3 (1+2+3=6). In this type the dimension table has additional columns such as:
current_type – keeps the current value of the attribute. All history records for a given item have the same current value.
historical_type – keeps the historical value of the attribute. History records for a given item can have different values.
start_date – keeps the start date of the attribute value’s effective period.
end_date – keeps the end date of the attribute value’s effective period.
current_flag – marks the most recent record.

In this method, to capture an attribute change we add a new record, as in type 2.
The current_type information is overwritten with the new value on all records, as in type 1.
We store the history in the historical_type column, as in type 3.
Optionally, the ‘main’ dimension table keeps only the current data, e.g. separate customer and customer_history tables.

Customer_ID Customer_Name Current_Type Historical_Type Start_Date End_Date Current_Flag
1 Cust_1 Corporate Retail 01-01-2010 21-07-2010 N
2 Cust_1 Corporate Other 22-07-2010 17-05-2012 N
3 Cust_1 Corporate Corporate 18-05-2012 31-12-9999 Y
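The combined Type 6 behaviour can be sketched like this (illustrative in-memory model with hypothetical names; the change date is made up for the example):

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # open-ended "valid to" date

dim = [{"Key": 1, "Customer_Name": "Cust_1", "Current_Type": "Corporate",
        "Historical_Type": "Corporate", "Start_Date": date(2012, 5, 18),
        "End_Date": HIGH_DATE, "Current_Flag": "Y"}]

def scd_type6_update(rows, name, new_type, change_date):
    """Type 2 insert of a new version, plus a Type 1 overwrite of Current_Type
    on every version, with a Type 3 style Historical_Type column per row."""
    for row in rows:
        if row["Customer_Name"] == name:
            row["Current_Type"] = new_type          # Type 1: same on all versions
            if row["Current_Flag"] == "Y":
                row["End_Date"] = change_date - timedelta(days=1)
                row["Current_Flag"] = "N"           # close the current version
    rows.append({"Key": max(r["Key"] for r in rows) + 1,
                 "Customer_Name": name, "Current_Type": new_type,
                 "Historical_Type": new_type,       # Type 3: value for this period
                 "Start_Date": change_date, "End_Date": HIGH_DATE,
                 "Current_Flag": "Y"})

scd_type6_update(dim, "Cust_1", "Retail", date(2013, 1, 1))
```

After the call, Current_Type is "Retail" on every version of the customer, while Historical_Type still records what the value was during each period.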

References:

SQLMag Reference

Datawarehouse4u Reference

jamesserra blog reference