DBTimes
Overriding
Configuration on Command Line
stgSource.isDefaultForEffectiveDate
dimAuthority.dimensionNamePattern
dimAuthority.dimensionsDestinationFile
dimAuthority.dimensionsDestinationFile.parquet
dimAuthority.dimensionsDestinationFile.path
dimAuthority.dimensionsDestinationFile.previousCopyPath
dimension.dimensionStgSources.moniker
dimension.dimensionStgSources.name
dimension.dimensionStgSources.schema
dimension.schema.column.colName
dimension.schema.column.colType
dimension.schema.column.isSurrogateKey
dimension.schema.column.unknownValue
dimension.schema.column.isNaturalKey
dimension.schema.column.isTypeTwo
dataMart.factsDestinationFile.parquet
dataMart.factsDestinationFile.path
dataMart.factsDestinationFile.pathWithKeySetters
fact.schema.column.isForPartition
fact.schema.column.isForSettingForeignKey
fact.schema.column.underlyingDim
fact.schema.column.underlyingDimCol
fact.schema.column.isEffDateForTypeTwo
Using previous
version of a dimension to restore a field value on initial load
Enable/disable
Debug flag without creating separate configuration file
Data Warehouse ETL (DW ETL) is a library to maintain dimensional model in the file system, HDFS.
The library implements ETL process for the dimensional model created using Kimball methodology (see “The Data Warehouse Toolkit” by Kimball/Ross).
The library’s Dim and Fact classes generically handle various aspects of dimension design and lifecycle.
Dim class:
- Load dimension from multiple sources
- Historical load of data on initial load for both Type 1 and Type 2 dimensions
- Re-load dimension from specific point in time
- Type 1 and Type 2 changes, and by extension Type 3 and Type 4. Type 2 dimension allows for type 1 columns, so it more like Type 7, but we still call it Type 2
- Generation of surrogate keys
- Algorithmic surrogate keys
- Determination of dates for historical load. The dimension may need to be processed for multiple dates on model re-generation (on initial load or when the design changed) or to catch-up when several days are skipped for any reason
- Load dimension for multiple effective dates in bulk or one at a time.
- Setting keys on a fact table with the option to override the default implementation
- Full and incremental dimension load
Fact class:
- Assignment of foreign keys
- Load fact table from multiple sources
- Load multiple dates in bulk or one at a time.
- Full and incremental fact table load
- Supports different strategies of incremental fact table processing
- Creation of table with key setting columns to troubleshoot Unknown keys and setting keys in general
Object ModelObject holds library’s entry point.
package com.dbtimes.dw.etl
object ModelObject
def runEtl(appConfig: Config): Unit
The method is used to execute the ETL process. The parameter holds the configuration that describes the dimensional model. Only one invocation of runEtl is allowed per application execution.
import org.apache.spark.sql.SparkSession
import com.typesafe.config.{Config, ConfigFactory}
import scala.collection.JavaConverters._
import com.dbtimes.dw.etl.ModelObject
object DwNFL {
def main(args: Array[String]) {
val configFileName = if (args.length == 1) args(0) else "dwEtlDefault.json"
val appConfig = ConfigFactory.load(configFileName)
val sparkSessionBuilder = DwNFL.configureSparkSession(appConfig)
val spark = sparkSessionBuilder.getOrCreate()
// Do initialization specific to a given DW
…
// ETL process to build or update dimensions and fact tables
ModelObject.runEtl(appConfig)
spark.stop()
}
private def configureSparkSession(appConfig: Config): SparkSession.Builder = {
var sparkSessionBuilder = SparkSession.builder()
sparkSessionBuilder = if (appConfig.hasPath("sparkParams.sessionAppName")) sparkSessionBuilder.appName(appConfig.getString("sparkParams.sessionAppName")) else sparkSessionBuilder
sparkSessionBuilder = if (appConfig.hasPath("sparkParams.sessionConfigMaster")) sparkSessionBuilder.config("spark.master", appConfig.getString("sparkParams.sessionConfigMaster")) else sparkSessionBuilder
sparkSessionBuilder = if (appConfig.hasPath("sparkParams.sessionConfigDriverHost")) sparkSessionBuilder.config("spark.driver.host", appConfig.getString("sparkParams.sessionConfigDriverHost")) else sparkSessionBuilder
sparkSessionBuilder = if (appConfig.hasPath("sparkParams.sessionConfigLocalDir")) sparkSessionBuilder.config("spark.local.dir", appConfig.getString("sparkParams.sessionConfigLocalDir")) else sparkSessionBuilder
sparkSessionBuilder = sparkSessionBuilder.config("spark.driver.maxResultSize", "5g")
sparkSessionBuilder
}
}
package com.dbtimes.dw.etl
class Dim
In the most common scenario there are only two steps required to implement a new dimension:
1. Create a new class derived from Dim
2. Override loadDim method
Additional customization can be done by overriding appropriate methods of a Dim class.
protected val dimName: String
Name of the dimension.
protected val dimKeyUnknownValue: String
Unknown value for dimension key. Populated from
configuration dimension schema attribute "unknownValue" for dimension
key – see dimension schema attributes.
protected val spark: SparkSession
A copy of SparkSession. The value is the same as calling SparkSession.builder().getOrCreate()
protected val isInitialLoad: Boolean
This flag is populated from configuration path "dwEtl.isInitialLoad". Flag is used to indicate if the dimension created from scratch – initial load - or is being updated with new data. The initial load is executed when dimension is created for the first time or when the dimensional model has changed.
protected val effDateColumnNameInDatesToProcess: String
The name of the date column in the view viewWithDatesToProcess (see next).
protected val viewWithDatesToProcess: String
The name of the view with effective dates. The view holds dates for the current run of the ETL process.
Code example:
class DimDate(dimName: String) extends Dim(dimName) {
override def loadDim(stgSrcTempView: String): Option[DataFrame] = {
val sqlStgSource
=
s"""| SELECT
| CAST( DATE_FORMAT($effDateColumnNameInDatesToProcess, 'yyyyMMdd') AS INT) AS DateKey,
| $effDateColumnNameInDatesToProcess AS Date,
| DATE_FORMAT($effDateColumnNameInDatesToProcess, 'yyyy-MM-dd') AS DateDesc,
| DATE_FORMAT($effDateColumnNameInDatesToProcess, 'MMM') AS MonthName,
| CAST( DATE_FORMAT($effDateColumnNameInDatesToProcess, 'yyyy') AS INT) AS Year,
| QUARTER($effDateColumnNameInDatesToProcess) AS Quarter,
| CONCAT( 'Q', CAST( QUARTER($effDateColumnNameInDatesToProcess) AS STRING ) ) AS QuarterDesc
| FROM $viewWithDatesToProcess """.stripMargin
val dfStgSource = spark.sql(sqlStgSource)
Some(dfStgSource)
}
}
protected def getLastProcessedStgSourceTimestamp(stgSourceTempView: String): Option[SqlTimestamp]
The method returns the maximum timestamp for the staging data source that was processed in the previous ETL run for the current dimension. It is used to load data into dimensional model object incrementally.
protected def isStgSourceChangedSinceLastLoad(stgSourceTempView: String): Boolean
The method returns false if the data source has not changed since was processed in the previous ETL run for the current dimension, and true otherwise. The method is used to load data into dimensional model object incrementally.
protected def loadDim(stgSrcTempView: String): Option[DataFrame]
Override this method to load new data in bulk or for all effective dates. This is one of the two versions of loadDim. Either this one of the other must be overridden, but not both.
protected def loadDim(effDateYYYY_MM_DD: String, stgSrcTempView: String): Option[DataFrame]
Override this method to load new data for each effective date. This version of loadDim must be used to load a dimension with Type 2 columns. This is one of the two versions of loadDim. Either this one or the other must be overridden, but not both.
protected def enrichDim(dfSrcDim: DataFrame): DataFrame
Optional override. If implemented, it will be called after all the sources have been loaded using loadDim(stgSrcTempView: String) or all sources loaded for one effective date using loadDim(effDateYYYY_MM_DD: String, stgSrcTempView: String). Implement this override to perform transformation on the source data before is is added to the dimension. The returned DatFrame has to have the same schema as the one returned from loadDim method.
protected def setForeignKeyOnFactTable(fact: DataFrame, mapDimColsToFactCols: Map[String, String], effDateColAsOption: Option[String]): DataFrame
Optional override. Used when setting a key is other than an equi-join on natural keys in fact table template and dimension. For, example when setting a key s done in some priority order, or when there multiple ways to join on natural keys.
protected def preProcess(stgSourceTempViews: List[String], datesToProcess: List[Date]): Unit
Optional override. It is called before sources are loaded. Can be used to create views later used in the load process.
protected def postProcess(stgSourceTempViews: List[String], datesToProcess: List[Date], dfModelObject: DataFrame): DataFrame
Optional override. It is called after dimension processing is complete right before the dimension is persisted into a parquet file. One possible use is to do some do logging or for debugging.
protected def getCustomDatesToProcess(dfDatesToProcessBasedOnConfig: DataFrame): Option[DataFrame]
Optional override. It is called after effective dates are determined from the sources based on configuration. In this method these dates can be modified or completely replaced based on custom logic.
package com.dbtimes.dw.etl
class Fact
In the most common scenario there are only two steps required to implement a new fact table:
1. Create a new class derived from Fact
2. Override loadFact method
Additional customization can be done by overriding appropriate methods of a Fact class.
protected val factName: String
Name of the dimension.
protected val spark: SparkSession
A copy of
SparkSession. The value is the same as calling SparkSession.builder().getOrCreate()
protected val isInitialLoad: Boolean
This flag is populated from configuration path "dwEtl.isInitialLoad". Flag is used to indicate if the dimension created from scratch – initial load - or is being updated with new data. The initial load is executed when dimension is created for the first time or when the dimensional model has changed.
protected val effDateColumnNameInDatesToProcess: String
The name of the date column in the view viewWithDatesToProcess (see next).
protected val viewWithDatesToProcess: String
The name of the view with effective dates. The view holds dates for the current run of the ETL process.
Code example:
class FactPlays(factName: String) extends FactsNFL(factName)
{
override def loadFact(stgSrcTempView:
String): Option[DataFrame] = {
if (
!isStgSourceChangedSinceLastLoad(stgSrcTempView) ) {
dwEtlLog.info(s"--
Skipping loading fact $factName from source $stgSrcTempView as is has not
changed since last load")
None
}
else {
val dfFact = if (stgSrcTempView
== "PlayByPlay") {
val sqlStgSource =
s"""
|SELECT
|
src.SeasonYear AS SeasonYear,
| "$stgSrcTempView" AS DataSourceMoniker,
| CAST(
src.GameId AS STRING ) AS
GameId,
| CAST(
src.GameDate AS DATE ) AS
PlayDate,
| CAST(
src.PenaltyYards AS DOUBLE ) AS PenaltyYards
|FROM $stgSrcTempView AS src
| INNER JOIN $viewWithDatesToProcess AS dates ON src.SeasonYear = udfSeasonYear( dates.$effDateColumnNameInDatesToProcess )
|
| """.stripMargin
dwEtlLog.info(s"--
Loading fact $factName from source $stgSrcTempView using sql:\n" + sqlStgSource)
val dfStgSource = spark.sql(sqlStgSource)
dfStgSource.dropDuplicates()
}
else {
throw new RuntimeException(s"""Etl
ERROR: Unknown stg source while loading fact $factName""")
}
if (DwNFL.getIsDebug)
{
dfFact.printSchema()
dfFact.show(40)
}
Some(dfFact)
}
}
}
protected def getLastProcessedStgSourceTimestamp(stgSourceTempView: String): Option[SqlTimestamp]
The method returns the maximum timestamp for the staging data source that was processed in the previous ETL run for the current dimension. It is used to load data into dimensional model object incrementally.
protected def isStgSourceChangedSinceLastLoad(stgSourceTempView: String): Boolean
The method returns false if the data source has not changed since was processed in the previous ETL run for the current dimension, and true otherwise. The method is used to load data into dimensional model object incrementally.
protected def loadFact(stgSrcTempView: String): Option[DataFrame]
Override this method to load new data in bulk or for all effective dates. This is one of the two versions of loadFact. Either this one of the other must be overridden, but not both.
protected def loadFact(effDateYYYY_MM_DD: String, stgSrcTempView: String): Option[DataFrame]
Override this method to load new data for each effective date. This is one of the two versions of loadFact. Either this one or the other must be overridden, but not both.
protected def preProcess(stgSourceTempViews: List[String], datesToProcess: List[Date]): Unit
Optional override. It is called before sources are loaded. Can be used to create views later used in the load process.
protected def postProcess(stgSourceTempViews: List[String], datesToProcess: List[Date], dfModelObject: DataFrame): DataFrame
Optional override. It is called after fact processing is complete right before the fact table is persisted into a parquet file. One possible use is to do some do logging or for debugging.
protected def getCustomDatesToProcess(dfDatesToProcessBasedOnConfig: DataFrame): Option[DataFrame]
Optional override. It is called after effective dates are determined from the sources based on configuration. In this method these dates can be modified or completely replaced based on custom logic.
Code example:
override def getCustomDatesToProcess(dfDatesToProcessBasedOnConfig:
DataFrame): Option[DataFrame] = {
dfDatesToProcessBasedOnConfig.createOrReplaceTempView("DatesToProcessBasedOnConfig")
val sqlDatesToProcess
=
s"""|SELECT
MAX( $effDateColumnNameInDatesToProcess ) AS
SeasonYearDate
|FROM DatesToProcessBasedOnConfig
|GROUP BY udfSeasonYear( $effDateColumnNameInDatesToProcess )
|""".stripMargin
val dfDatesToProcess = spark.sql(sqlDatesToProcess)
Some(dfDatesToProcess)
}
Configuration file describes dimensional model and parameters required to run the ETL process. The section used by the etl library must be called "dwEtl". Configuration file can have additional attributes required by application. Additional attributes can be added to any section of configuration.
Optional attributes are highlighted:
· "effectiveDateRule" : "ALL", highlighted in grey and are shown with default values, or
· "rerunEtlAfter", highlighted in yellow and are not required, or
· “isDebug”, highlighted in magenta and are additional application specific attributes that are not used by the library. These attributes can be organized in any way. The additional attributes can also be placed inside "dwEtl" section, like "description" and "transactionName" in the configuration structure below.
{
"isDebug" : "false",
"sparkParams" :
{
"sessionAppName" : "Spark NFL DW",
"sessionConfigMaster" : "local",
"sessionConfigDriverHost" : "localhost",
"sessionConfigLocalDir" : "C:\\DATA\\Temp"
},
"dwEtl" :
{
"jobType" : "LoadDimsFacts",
"isInitialLoad" : "true",
"rerunEtlAfter" : "2021-04-28 16:33:00.119",
"logFileDir" : "C:\\LOGS\\Etl\\",
"etlLogFilePath" : "C:\\DATA\\DW.ETL\\etlDwNFL.parquet\\",
"stgSources" :
[
{ "moniker" : "PlayByPlay",
"description" : "Play by play s/sheet for all games",
"transactionName" : "play",
"effectiveDateColumn" : "GameDate",
"isDefaultForEffectiveDate" : "true",
"effectiveDateRule" : "ALL",
"timestampColumn" : "_SrcDt_RowTimestamp",
"fileSource" : { "path" : "C:\\ASCODE\\DATA\\NFL\\STAGING\\stgPbp.parquet", "parquet" : { } }
},
…
],
"dimAuthority" :
{
"packageName" : "com.dbtimes.nfldw.dimauth",
"dimensionNamePattern" : "(?<dimName>Dim[A-Z][a-zA-Z0-9]+)",
"dimensionsDestinationFile" :
{
"parquet" : {},
"path" : "C:\\DATA\\NFL\\DW\\DimNamePlaceholder.parquet",
"previousCopyPath" : "C:\\ASCODE\\DATA\\NFL\\DimAuthorityPrev\\DimNamePlaceholder.prev.parquet"
}
},
"__dimension_name__comment" : "dimension name must match the dimensionNamePattern.It will be used to replace name placeholder with actual dimension name",
"dimensions" :
[
{
"name" : "DimMyDimension",
"isLoad" : "true",
"dimensionStgSources" :
[
{ "moniker" : "someMoniker" },
…
],
"dimensionDimSources"
:
[
{ "name"
: "DimDate",
"schema"
:
[
{ "colName" : "Date"
},
{ "colName" : "DateDesc"
},
{ "colName" : "MonthName"
}
]
}
],
"schema" :
[
{ "colName" : "MyDimensionKey","colType" : "Long", "isKey" : "true", "isSurrogateKey" : "true", "unknownValue" : "0" },
{ "colName" : "MyDimensionNaturalKeyOne","colType" : "String", "isNaturalKey" : "true", "unknownValue" : "Unknown" },
{ "colName" : "MyDimensionNaturalKeyTwo","colType" : "String", "isNaturalKey" : "true", "unknownValue" : "Unknown" },
{ "colName" : "SomeStringDimensionAttr","colType" : "String", "isNaturalKey" : "false", "unknownValue" : "Unknown" },
{ "colName" : "SomeIntDimensionAttr","colType" : "Integer", "unknownValue" : "-1" },
{ "colName" : "SomeDateDimensionAttr ","colType" : "Date", "unknownValue" : "1900-01-01" }
{ "colName" : "SomeType2Attr", "colType" : "String", "isTypeTwo" : "true", "unknownValue" : "" },
]
},
…
],
"dataMart" :
{
"packageName" : "com.dbtimes.nfldw.datamartgames",
"factNamePattern" : "(?<factName>Fact[A-Z][a-zA-Z0-9]+)",
"factsDestinationFile" :
{
"parquet" : {},
"path" : "C:\\DATA\\NFL\\DW\\FactNamePlaceholder.parquet",
"pathWithKeySetters" : "C:\\DATA\\NFL\\DW.ETL\\FactNamePlaceholder.WithKeySetters.parquet"
}
},
"facts" :
[
{ "name" : "FactPlaysForMerge",
"isLoad" : "false",
"processingMode"
: "MERGE_PARTITION",
"factStgSources" :
[
{ "moniker" : "PlayByPlay"
}
],
"schema" :
[
{ "colName" : "SeasonYear", "colType"
: "Integer", "isForPartition"
: "true"
},
{ "colName" : "DataSourceMoniker", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "DataSourceKey", "underlyingDim"
: "DimDataSource", "underlyingDimCol"
: "DataSourceMoniker"
},
{ "colName"
: "GameId", "colType"
: "String", "isMergeKey"
: "true", "isForSettingForeignKey"
: "true", "foreignKey"
: "GameKey", "underlyingDim"
: "DimGame", "underlyingDimCol"
: "GameId"
},
{ "colName"
: "PlayDate", "colType"
: "Date", "isMergeKey"
: "true", "isForSettingForeignKey"
: "true", "isEffDateForTypeTwo"
: "true", "foreignKey"
: "PlayDateKey", "underlyingDim"
: "DimDate", "underlyingDimCol"
: "Date"
},
{ "colName" : "OffenseTeamAbbrevName", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "OffenseTeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol"
: "HistTeamAbbrevName"
},
{ "colName" : "DefenseTeamAbbrevName", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "DefenseTeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol"
: "HistTeamAbbrevName"
},
{ "colName" : "PenaltyTeamAbbrevName", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PenaltyTeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol"
: "HistTeamAbbrevName"
},
{ "colName" : "TimeoutTeamAbbrevName", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "TimeoutTeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol"
: "HistTeamAbbrevName"
},
{ "colName" : "Formation", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol"
: "Formation"
},
{ "colName" : "PlayType", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol"
: "PlayType"
},
{ "colName" : "PassType", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol"
: "PassType"
},
{ "colName" : "PenaltyType", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol"
: "PenaltyType"
},
{ "colName" : "ToGo", "colType"
: "Double", "isMeasure" : "true"
},
{ "colName" : "PenaltyYards", "colType"
: "Double", "isMeasure" : "true"
}
]
},
…
]
}
}
Ability to override configuration file can be useful, for example, to load different objects from the same configuration by disabling load of specific objects in specific runs.
Use this code in main method of a Data Warehouse implementation to load configuration file and apply overrides
val configFileName = if (args.length >= 1) args(0) else throw new RuntimeException("""ERROR: Need to pass a parameter
- configuration file name """)
val appConfig = MiscHelper.loadConfigAndApplyOverrides( configFileName, args.drop(1) )
instead of this that only loads the configuration from a file
val configFileName = if (args.length == 1) args(0) else throw new RuntimeException("""ERROR: Need to pass a parameter
- configuration file name """)
val appConfig = ConfigFactory.load(configFileName)
The ETL process consists of two parts – ETL demensions and ETL facts. jobType is used to run either both or one of these parts. The allowed values for this attribute are "LoadDimsFacts", or "LoadDims", or "LoadFacts". The main scenario for running ETL dimensions separate from ETL facts is loading dimensions in parallel by defining separate configurations for one or more dimensions (using isLoadFlag) and running ETL facts after - serially or in parallel.
The flag indicates if the load is initial or subsequent. If the load is initial the existing files for dimensions and/or facts are deleted. Which ones are deleted depends on the jobType - for "LoadDimsFacts" both dimensions and fact files are deleted, for "LoadDims" - dimensions are deleted, and for "LoadFacts" - fact files are deleted.
There is one use case when this flag should be set to “true” for ongoing loads: when dimension or fact has to be recreated every time based on current sources, i.e., when rows for non-existent natural keys deleted and all attributes are Type 1.
The attribute is used to rerun subsequent load for effective dates processed earlier after the timestamp value specified based on the log.
A full path of the directory for log files. Used primarily for development. It is not available on HDFS. There are two log files created – one with library messages and, another, with Spark messages.
An array of specifications for staging files. The staging files are a source for dimensions and facts. We will use stgSource to refer to one element of stgSources array.
A name or label of the staging source. The moniker must be unique within a set of staging sources.
An optional attribute to specify a column in a staging source with effective date. Effective dates are used to load dimensional model objects incrementally, to process Type 2 dimensions, to process Date dimension and to set Date key on a fact table. Only one out of all sources for a dimension can have stgSource.effectiveDateColumn . If more than one sources have effective date defined, only the first one will be used.
stgSource.isDefaultForEffectiveDate
An optional attribute to designate a staging source as a default source for effective date. The staging source that is default for effective date is used when none of the sources for a specific model object have effective date column.
An attribute to specify how to derive effective date from a column stgSource.effectiveDateColumn. The rule can have following predefined values:
· ALL (default) – all calendar dates between minimum and maximum dates in the source,
· DISTINCT – all distinct dates in the source,
· WEEKDAYS - distinct weekdays (Monday-Friday) between minimum and maximum dates in the source.
An optional attribute to specify a column with the timestamp. If staging sources are loaded using source load library then every source will have a timestamp column. Timestamp is used for incremental processing in subsequent ETL runs.
An attribute to specify a file as a type of a staging source.
An attribute to specify a full path of a staging source file.
An attribute to specify that file is in parquet format.
A section of specifications for attributes common to all dimensions.
All dimension classes described in a configuration must be defined in the same package. Dimension classes can be split up between different configurations for parallel load. In that case dimension classes can be implemented in different packages. Package name is used to dynamically create an instance of a dimension class.
dimAuthority.dimensionNamePattern
A regular expression pattern for dimension name. Each dimension’s name must match this pattern.
dimAuthority.dimensionsDestinationFile
A section to describe the dimension destination file. The file will be created and maintained by the ETL library.
dimAuthority.dimensionsDestinationFile.parquet
Defines dimension destination file as parquet.
dimAuthority.dimensionsDestinationFile.path
Defines destination file path for all dimensions. For
example, "path" : "/dw/data/dm/DimNamePlaceholder.parquet". Text 'DimNamePlaceholder' in path must match the
dimensionNamePattern above. Text 'DimNamePlaceholder' will be replaced with
actual dimension name which also has to match the pattern.
dimAuthority.dimensionsDestinationFile.previousCopyPath
Defines path for the previous copy of the dimension. The
previous copy can be used to re-run the ETL in case of the latest version got
corrupt and has unfixable data issues.
An array of specifications for dimensions. We will use dimension to refer to one element of dimensions array.
The name of the dimension. The name must match the pattern defined in dimAuthority.dimensionNamePattern attribute.
If set to false the dimension will not be loaded. Can be used to create separate configurations to load specific dimensions or during development to skip the load of already tested dimensions.
An array of staging source monikers used to load this dimension. The Etl library will call loadDim for each staging source passing the moniker name as parameter. The array can be empty. If it is empty the loadDim will be called once with “N/A” as the value for staging source moniker.
dimension.dimensionStgSources.moniker
A staging source moniker from the stgSources array.
An array of dimensions that are a source for loading current
dimension. In some instances the dimension can be used to load another
dimension for performance reasons and to minimize code duplication. A dimension
that is a source will be loaded prior to loading this dimension. The array can
be empty. A loadDim will not be called for each dimSource and will not be
passed as a parameter to loadDim, but dimension source can be referenced as a
view with the same name in the SQL to load dimension.
dimension. dimensionDimSources.name
A dimension Name that is a source for loading current dimension.
dimension. dimensionDimSources.schema
A
dimension source can optionally have a schema that list a subset of columns
used to load current dimension. If schema is absent all
columns can be used in loading current dimendion.
An array of specifications for dimension columns. We will use dimension.schema.column to refer to one element of dimension.schema array.
A name of the dimension column.
A data type of the dimension column. colType must be Short, Integer or Long if this column is a dimension key (i.e., "isKey" : "true")
If true, this column is a dimension key.
dimension.schema.column.isSurrogateKey
isSurrogateKey can only be defined for dimension key column, i.e., when "isKey" : "true". If true, this column is a surrogate key and its value will be generated using increasing – but not sequential - distinct values. The data type for a surrogate key must be “Long”.
The alternative to a surrogate key can be a key generated using some algorithm. For example, the key for Date dimension can be based on the calendar date member and generated using YYYYMMDD format. The key like that will satisfy all the requirements for dimension key. If the key is algorithmically generated, specify "isSurrogateKey" : "false" and include dimension key in the result set created in loadDim override (see loadDim override for DimDate or DimTeamRole dimensiona in sample implementation of NFL Data Warehouse).
Limitation: the maximum number of rows
in a dimension with surrogate key cannot exceed 1 Billion.
dimension.schema.column.unknownValue
Unknown value must be the valid value for column data type. Use yyyy-mm-dd format for Date type. If unknown value is not specified it will be set to NULL to any column but the dimension key – the default unknown value for dimension key is 0 (zero).
dimension.schema.column.isNaturalKey
A flag to indicate that a column is a natural key for the dimension. Multiple columns can be defined as natural key to create a compound natural key.
dimension.schema.column.isTypeTwo
A flag to indicate that a column is a Type 2 column. Multiple columns can be defined as Type 2. If the flag is not present or set to false the column is Type 1 column.
A section of specifications for attributes common to all facts or fact tables.
All fact classes described in a configuration must be defined in the same package. Fact classes can be split up between different configurations for parallel load. In that case fact classes can be implemented in different packages. Package name is used to dynamically create an instance of a fact class.
A regular expression pattern for fact table name. Each fact table’s name must match this pattern.
A section to describe the fact destination file. The file will be created and maintained by the ETL library.
dataMart.factsDestinationFile.parquet
Defines fact destination file as parquet.
dataMart.factsDestinationFile.path
Defines destination file path for all facts. For example, "path" : "/dw/data/dm/FactNamePlaceholder.parquet". Text FactNamePlaceholder in path must match the
factNamePattern above. Text 'FactNamePlaceholder' will be replaced with actual
fact table name which also has to match the pattern.
dataMart.factsDestinationFile.pathWithKeySetters
Defines path for the fact table with columns used to set keys.
This file is used mostly in development to insure the correct setting on
dimension keys on a fact table and troubleshoot the unknown keys. The help in
troubleshooting the unknown keys can make this file useful in production.
Creating of this file has relatively low performance cost.
An array of specifications for fact tables or facts (fact table and fact are used interchangeably). We will use fact to refer to one element of facts array.
The name of the fact table. The name must match the pattern defined in dataMart.factNamePattern attribute.
If set to false the fact table will not be loaded. Can be used to create separate configurations to load specific facts or during development to skip the load of already tested facts.
The following fact processing modes are supported on subsequent ETL runs:
· REPLACE – new fact table replaces the old one. Fact table can have partition columns (one or more), i.e., columns with isForPartition flag set to true. If partition columns are defined the fact table will be partitioned.
· REPLACE_PARTITION – this mode is similar to REPLACE, but requires to have partition columns. In this mode new partitions replace old partitions with the same name, but if new partition does not exist for some existing old partitions – the old partitions will remain.
· MERGE – old fact table rows are merged with new based on the merge key columns. All rows in old fact table with the same merge key are replaced with new ones. The old rows with merge keys that do not exist in the new fact table will remain.
· MERGE_PARTITION – this mode is similar to MERGE, only in addition to merge key takes partition column into consideration.
· ADD – new rows are added to existing ones. Fact table can have one or more partition columns. If partition columns are defined the new rows are added to correct partitions.
An array of staging source monikers used to load this fact. The Etl library will call loadFact for each staging source passing the moniker name as parameter. The array can be empty. If it is empty the loadFact will be called once with “N/A” as the value for staging source moniker.
A staging source moniker from the stgSources array.
An array of specifications for fact columns. We use fact.schema.column to refer to one element of fact.schema array.
The fact table is constructed from schema using following rule:
· If "isForPartition" column or columns are defined the fact table will be partioned on this columns using Spark partitioning feature.
· Distinct "foreignKey" values will be used to create fact table foreign keys. If multiple columns are needed to set a foreign key, the schema would have multiple elements with the same value of a foreign key – one for each column needed to asset a key. For example, if there are two columns from the underlying dimension are needed to set a key, the fact table schema will have two elements with the same value of "foreignKey".
· Columns with "isMeasure" attribute set will be added to a fact table as measures
For example the following schema
"schema"
:
[
{ "colName"
: "SeasonYear", "colType" : "Integer", "isForPartition"
: "true" },
{ "colName"
: "DataSourceMoniker", "colType" : "String", "isForSettingForeignKey"
: "true", "foreignKey" : "DataSourceKey", "underlyingDim"
: "DimDataSource", "underlyingDimCol" : "DataSourceMoniker"
},
{ "colName"
: "GameId", "colType" : "String", "isMergeKey"
: "true", "isForSettingForeignKey"
: "true", "foreignKey" : "GameKey", "underlyingDim"
: "DimGame", "underlyingDimCol" : "GameId"
},
{ "colName"
: "PlayDate", "colType" : "Date", "isMergeKey"
: "true", "isForSettingForeignKey"
: "true", "isEffDateForTypeTwo" : "true", "foreignKey"
: "PlayDateKey", "underlyingDim" : "DimDate", "underlyingDimCol"
: "Date" },
{ "colName"
: "OffenseTeamAbbrevName", "colType" : "String", "isForSettingForeignKey"
: "true", "foreignKey" : "OffenseTeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName"
},
{ "colName"
: "DefenseTeamAbbrevName", "colType" : "String", "isForSettingForeignKey"
: "true", "foreignKey" : "DefenseTeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName"
},
{ "colName"
: "PenaltyTeamAbbrevName", "colType" : "String", "isForSettingForeignKey"
: "true", "foreignKey" : "PenaltyTeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName"
},
{ "colName"
: "TimeoutTeamAbbrevName", "colType" : "String", "isForSettingForeignKey"
: "true", "foreignKey" : "TimeoutTeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName"
},
{ "colName"
: "Formation", "colType" : "String", "isForSettingForeignKey"
: "true", "foreignKey" : "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol" : "Formation"
},
{ "colName"
: "PlayType", "colType" : "String", "isForSettingForeignKey"
: "true", "foreignKey" : "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol" : "PlayType"
},
{ "colName"
: "PassType", "colType" : "String", "isForSettingForeignKey"
: "true", "foreignKey" : "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol" : "PassType"
},
{ "colName"
: "PenaltyType", "colType" : "String", "isForSettingForeignKey"
: "true", "foreignKey" : "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol" : "PenaltyType"
},
{ "colName"
: "ToGo", "colType" : "Double", "isMeasure" : "true"
},
{ "colName"
: "PenaltyYards", "colType" : "Double", "isMeasure" : "true"
}
]
will result in a fact table with following columns
· DataSourceKey (FK)
· GameKey (FK)
· PlayDateKey (FK)
· OffenseTeamKey (FK)
· PlayTypeKey (FK)
· ToGo
· PenaltyYards
The
fact able will be partitioned on SeasonYear.
In this example there are four columns in the schema needed to set foreign key PlayTypeKey.
A name of the fact column.
A data type of the fact column. colType must be Double , Short, Integer or Long if this column is a measure, i.e., "isMeasure" : "true"
fact.schema.column.isForPartition
A flag to indicate that this column is used for partitioning fact data. For example, for column { "colName" : "SeasonYear", "colType" : "Integer", "isForPartition" : "true" } the fact directory structure would look like below
To query partitioned Fact tables in , for example Apache Drill, use syntax like this
select * from dfs.`fact-file-path\FactPlay.parquet` where dir0=’SeasonYear=2016’
limit 5;
A flag to indicate that this column is used for merge of fact data during subsequent loads. All fact table records that have a merge key value that is present in a new fact table will be deleted and replaced with new values.
fact.schema.column.isForSettingForeignKey
When isForSettingForeignKey is set, this column is used to set a foreign key from a dimension on a fact row. The column is either a part of natural dimension key or some other column in a dimension that is used in a custom logic to set a foreign dimension key.
If this attribute of a column is set to true, there should be three more properties defined: foreignKey, underlyingDim, and underlyingDimCol. The underlyingDimCol is optional.
For example, if some dimension has three natural key columns that comprise a unique compound key, in the schema there would be three columns - one for each natural key – to set one foreign key on the resulting fact table.
If underlyingDimCol is not defined the setForeignKeyOnFactTable method must be overridden to implement custom logic for setting the foreign key on the fact table.
The name of the column on the fact table to be set using this field.
fact.schema.column.underlyingDim
The name of the dimension with the primary key.
fact.schema.column.underlyingDimCol
The name of the column in the underlying dimension. If all underlying dimension columns have isNaturalKey set to true, the default implementation of setForeignKeyOnFactTable will be used, otherwise this method will have to be overridden in the corresponding dimension class.
fact.schema.column.isEffDateForTypeTwo
This flag can be set on a column of Date type to designate a column as an effective date for Type 2 dimension. The column will be used to set the key of Type 2 dimension via equation: eff-date-for-type-two-column between dimension-row-start-date and dimension-row-end-date. Only one column in fact table schema can have this attribute set to true.
This flag designates a column as a fact table measure or fact. The measure column must be one of the numeric types – Float, Short, Integer, or Long.
{
"isDebug" : "false",
"sparkParams" :
{
"sessionAppName" : "Spark NFL
DW"
},
"dwEtl" :
{
"jobType" : "LoadDimsFacts",
"isInitialLoad" : "true",
"optional rerunEtlAfter" : "2021-04-28
16:33:00.119",
"logFileDir" : "/dw/data/logs/etl",
"etlLogFilePath" : "/dw/data/etl/etldwnfl.parquet",
"__moniker_comment" : "moniker
must be unique for each etl source, the sources will be searched by their
monikers",
"stgSources" :
[
{ "moniker" : "PlayByPlay",
"description" : "Play by
play s/sheet for all games",
"__transactionName" : "transactionName
is an optional attribute",
"transactionName" : "play",
"__effectiveDateColumn"
: "effectiveDateColumn
is an optional attribute",
"effectiveDateColumn" : "GameDate",
"isDefaultForEffectiveDate"
: "true",
"__timestampColumn" : "timestampColumn
is optional and can be used for incremental load",
"timestampColumn" : "_SrcDt_RowTimestamp",
"__effectiveDateDays_comment1_"
: "Effective
Date\n|2014-01-11|\n|2013-12-15|\n|2014-10-12|\n rule selected
from Stg Sources to define Data dimension and to load dimensions with type 2
columns. The values can be \"distinct\", i.e., only
the values in the stg source used in the etl, \"weekdays\", i.e., all
days between min and max values excluding weekends, or \"all\" - all days
bewenn min and max values",
"__effectiveDateDays_comment2_"
: "EffectiveDateRule
default = all",
"effectiveDateRule" : "DISTINCT",
"fileSource" : { "path"
: "/dw/data/staging/stgpbp.parquet", "parquet" : { } }
},
{ "moniker" : "Teams",
"description" : "Team
names historically",
"fileSource" : { "path"
: "/dw/data/staging/team.parquet", "parquet" : { } }
}
],
"dimAuthority" :
{
"packageName" : "com.dbtimes.nfldw.DataWarehouseNFL.DimAuthNFL",
"__effectiveDateStart" : "2015-01-01",
"__effectiveDateEnd" : "2020-01-01",
"dimensionNamePattern" : "(?<dimName>Dim[A-Z][a-zA-Z0-9]+)",
"dimensionsDestinationFile" :
{
"parquet" : {},
"__path__comment" : "text
'DimNamePlaceholder' in path must match the dimensionNamePattern above. Text
'DimNamePlaceholder' will be replaced with actual dimension name which also has
to match the pattern",
"path" : "/dw/data/dm/dimnameplaceholder.parquet",
"previousCopyPath" : "/dw/data/dmprev/dimnameplaceholder.prev.parquet"
}
},
"__dimension_name__comment" : "dimension
name must match the dimensionNamePattern.It will be used to replace name
placeholder with actual dimension name",
"dimensions" :
[
{
"name" : "DimDataSource",
"isLoad" : "true",
"dimensionStgSources" :
[
],
"schema" :
[
{ "colName"
: "DataSourceKey","colType"
: "Long", "isKey"
: "true", "isSurrogateKey"
: "true", "unknownValue"
: "0" },
{ "colName"
: "DataSourceMoniker","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "Unknown"
},
{ "colName"
: "DataSourceDescription","colType"
: "String", "unknownValue"
: "Unknown"
},
{ "colName"
: "TransactionName","colType"
: "String", "unknownValue"
: "Unknown"
}
]
},
{
"__dimension_name__comment"
: "dimension
name must match the dimensionNamePattern. It will be used to replace name
placeholder with actual dimension name",
"name" : "DimDate",
"isLoad" : "true",
"__dimensionStgSources__comment_1"
: "dimensionStgSources
is optional and can be empty",
"__dimensionStgSources__comment_2"
: "dimensionStgSources
can have secondary sources to join to primary source to build the
dimenion",
"dimensionStgSources" :
[
],
"schema" :
[
{ "colName"
: "DateKey","colType"
: "Integer", "isKey"
: "true", "isSurrogateKey"
: "false", "unknownValue"
: "0" },
{ "colName"
: "Date","colType"
: "Date", "isNaturalKey"
: "true", "unknownValue"
: "1900-01-01"
},
{ "colName"
: "DateDesc","colType"
: "String", "unknownValue"
: "Unknown"
},
{ "colName"
: "MonthName","colType"
: "String", "unknownValue"
: "Unknown"
},
{ "colName"
: "Year","colType"
: "Integer", "unknownValue" : "1900"
},
{ "colName"
: "Quarter","colType"
: "Integer", "unknownValue"
: "0" },
{ "colName"
: "QuarterDesc","colType"
: "String", "unknownValue"
: "Unknown"
},
{ "colName"
: "SeasonYear","colType"
: "Date", "unknownValue"
: "1900-01-01" },
{ "colName"
: "SeasonDescription","colType"
: "String", "unknownValue"
: "Unknown"
}
]
},
{
"name" : "DimGame",
"isLoad" : "true",
"dimensionStgSources" :
[
{ "moniker"
: "PlayByPlay"
},
{ "moniker"
: "Teams"
}
],
"schema" :
[
{ "colName"
: "GameKey","colType"
: "Long", "isKey"
: "true", "isSurrogateKey"
: "true", "unknownValue"
: "0" },
{ "colName"
: "GameId","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "Unknown"
},
{ "colName"
: "GameTeams","colType"
: "String", "unknownValue"
: "Unknown"
}
]
},
{
"name" : "DimFieldPosition",
"isLoad" : "true",
"dimensionStgSources" :
[
{ "moniker"
: "PlayByPlay"
}
],
"schema" :
[
{ "colName"
: "FieldPositionKey","colType"
: "Long", "isKey"
: "true", "isSurrogateKey"
: "true", "unknownValue"
: "0" },
{ "colName"
: "Down","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "YardLine","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "RushDirection","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "YardLineFixed","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "YardLineDirection","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" }
]
},
{
"name" : "DimPlayType",
"isLoad" : "true",
"dimensionStgSources" :
[
{ "moniker"
: "PlayByPlay"
}
],
"schema" :
[
{ "colName"
: "PlayTypeKey","colType"
: "Long", "isKey"
: "true", "isSurrogateKey"
: "true", "unknownValue"
: "0" },
{ "colName"
: "Formation","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "PlayType","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "PassType","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "PenaltyType","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" }
]
},
{
"name" : "DimPlayTime",
"isLoad" : "true",
"dimensionStgSources" :
[
{ "moniker"
: "PlayByPlay"
}
],
"schema" :
[
{ "colName"
: "PlayTimeKey","colType"
: "Long", "isKey"
: "true", "isSurrogateKey"
: "true", "unknownValue"
: "0" },
{ "colName"
: "Quarter","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "Minute","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "Second","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" }
]
},
{
"name" : "DimPlayDetails",
"isLoad" : "true",
"dimensionStgSources" :
[
{ "moniker"
: "PlayByPlay"
}
],
"schema" :
[
{ "colName"
: "PlayDetailsKey","colType"
: "Long", "isKey"
: "true", "isSurrogateKey"
: "true", "unknownValue"
: "0" },
{ "colName"
: "IsRush","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsPass","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsIncomplete","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsTouchdown","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsSack","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsChallenge","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsChallengeReversed","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsMeasurement","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsInterception","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsFumble","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsPenalty","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsTwoPointConversion","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsTwoPointConversionSuccessful","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsPenaltyAccepted","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "IsNoPlay","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" }
]
},
{
"name" : "DimTeam",
"isLoad" : "true",
"dimensionStgSources" :
[
{ "moniker"
: "Teams"
}
],
"schema" :
[
{ "colName"
: "TeamKey","colType"
: "Long", "isKey"
: "true", "isSurrogateKey"
: "true", "unknownValue"
: "0" },
{ "colName"
: "TeamOriginalName","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" },
{ "colName"
: "TeamName","colType"
: "String", "isNaturalKey" : "false", "unknownValue"
: "" },
{ "colName"
: "TeamAbbrevName","colType"
: "String", "unknownValue"
: "" },
{ "colName"
: "TeamConference","colType"
: "String", "unknownValue" : "" },
{ "colName"
: "HistTeamName","colType"
: "String", "isTypeTwo"
: "true", "unknownValue"
: "" },
{ "colName"
: "HistTeamAbbrevName","colType"
: "String", "isTypeTwo"
: "true", "unknownValue"
: "" }
]
},
{
"name" : "DimTeamRole",
"__comment" : "This is
static dimension. Its population is not needed in incremental load",
"isLoad" : "true",
"dimensionStgSources" :
[
],
"schema" :
[
{ "colName"
: "TeamRoleKey","colType"
: "Integer", "isKey"
: "true", "isSurrogateKey"
: "false", "unknownValue"
: "0" },
{ "colName"
: "TeamRole","colType"
: "String", "isNaturalKey" : "true", "unknownValue"
: "" }
]
}
],
"dataMart" :
{
"packageName" : "com.dbtimes.nfldw.DataWarehouseNFL.DataMartNFLGames",
"factNamePattern" : "(?<factName>Fact[A-Z][a-zA-Z0-9]+)",
"factsDestinationFile" :
{
"parquet" : {},
"__path__comment" : "text
'FactNamePlaceholder' in path must match the factNamePattern above. Text
'FactNamePlaceholder' will be replaced with actual fact table name which also
has to match the pattern",
"path" : "/dw/data/dm/factnameplaceholder.parquet",
"pathWithKeySetters" : "/dw/data/etl/factnameplaceholder.withkeysetters.parquet"
}
},
"facts" :
[
{ "name" : "FactPlays",
"isLoad" : "true",
"processingMode" : "REPLACE_PARTITION",
"factStgSources" :
[
{ "moniker"
: "PlayByPlay"
}
],
"schema" :
[
{ "colName"
: "SeasonYear", "colType"
: "Integer", "isForPartition"
: "true"
},
{ "colName"
: "DataSourceMoniker", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "DataSourceKey", "underlyingDim"
: "DimDataSource", "underlyingDimCol"
: "DataSourceMoniker"
},
{ "colName"
: "GameId", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "GameKey", "underlyingDim"
: "DimGame", "underlyingDimCol"
: "GameId"
},
{ "colName"
: "PlayDate", "colType"
: "Date", "isForSettingForeignKey"
: "true", "isEffDateForTypeTwo"
: "true", "foreignKey"
: "PlayDateKey", "underlyingDim"
: "DimDate", "underlyingDimCol"
: "Date"
},
{ "colName"
: "OffenseTeamAbbrevName", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "OffenseTeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol"
: "HistTeamAbbrevName"
},
{ "colName" : "DefenseTeamAbbrevName", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "DefenseTeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol"
: "HistTeamAbbrevName"
},
{ "colName"
: "PenaltyTeamAbbrevName", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PenaltyTeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol"
: "HistTeamAbbrevName"
},
{ "colName"
: "TimeoutTeamAbbrevName", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "TimeoutTeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol"
: "HistTeamAbbrevName"
},
{ "colName"
: "Formation", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol"
: "Formation"
},
{ "colName"
: "PlayType", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol"
: "PlayType"
},
{ "colName"
: "PassType", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol"
: "PassType"
},
{ "colName"
: "PenaltyType", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol"
: "PenaltyType"
},
{ "colName"
: "Down", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "FieldPositionKey", "underlyingDim"
: "DimFieldPosition", "underlyingDimCol"
: "Down"
},
{ "colName"
: "YardLine", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "FieldPositionKey", "underlyingDim"
: "DimFieldPosition", "underlyingDimCol"
: "YardLine"
},
{ "colName"
: "RushDirection", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "FieldPositionKey", "underlyingDim"
: "DimFieldPosition", "underlyingDimCol"
: "RushDirection"
},
{ "colName"
: "YardLineFixed", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "FieldPositionKey", "underlyingDim"
: "DimFieldPosition", "underlyingDimCol"
: "YardLineFixed"
},
{ "colName"
: "YardLineDirection", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "FieldPositionKey", "underlyingDim"
: "DimFieldPosition", "underlyingDimCol"
: "YardLineDirection"
},
{ "colName"
: "Quarter", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTimeKey", "underlyingDim"
: "DimPlayTime", "underlyingDimCol"
: "Quarter"
},
{ "colName"
: "Minute", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTimeKey", "underlyingDim"
: "DimPlayTime", "underlyingDimCol"
: "Minute"
},
{ "colName"
: "Second", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTimeKey", "underlyingDim"
: "DimPlayTime", "underlyingDimCol"
: "Second"
},
{ "colName"
: "IsRush", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsRush"
},
{ "colName"
: "IsPass", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsPass"
},
{ "colName"
: "IsIncomplete", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsIncomplete"
},
{ "colName"
: "IsTouchdown", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsTouchdown"
},
{ "colName"
: "IsSack", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsSack"
},
{ "colName"
: "IsChallenge", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsChallenge"
},
{ "colName"
: "IsChallengeReversed", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsChallengeReversed"
},
{ "colName"
: "IsMeasurement", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsMeasurement"
},
{ "colName"
: "IsInterception", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsInterception"
},
{ "colName" : "IsFumble", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsFumble"
},
{ "colName"
: "IsPenalty", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsPenalty"
},
{ "colName"
: "IsTwoPointConversion", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsTwoPointConversion"
},
{ "colName"
: "IsTwoPointConversionSuccessful", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsTwoPointConversionSuccessful"
},
{ "colName"
: "IsPenaltyAccepted", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsPenaltyAccepted"
},
{ "colName"
: "IsNoPlay", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayDetailsKey", "underlyingDim"
: "DimPlayDetails", "underlyingDimCol"
: "IsNoPlay"
},
{ "colName"
: "Score", "colType"
: "Double", "isMeasure"
: "true"
},
{ "colName"
: "Yards", "colType"
: "Double", "isMeasure" : "true"
},
{ "colName"
: "ToGo", "colType"
: "Double", "isMeasure" : "true"
},
{ "colName"
: "PenaltyYards", "colType"
: "Double", "isMeasure" : "true"
}
]
},
{ "name" : "FactBridgeTeamsPlays",
"isLoad" : "true",
"processingMode" : "REPLACE",
"factStgSources" :
[
{ "moniker"
: "PlayByPlay"
}
],
"schema" :
[
{ "colName"
: "DataSourceMoniker", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "DataSourceKey", "underlyingDim"
: "DimDataSource", "underlyingDimCol"
: "DataSourceMoniker"
},
{ "colName" : "GameId", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "GameKey", "underlyingDim"
: "DimGame", "underlyingDimCol"
: "GameId"
},
{ "colName"
: "PlayDate", "colType"
: "Date", "isForSettingForeignKey"
: "true", "isEffDateForTypeTwo"
: "true", "foreignKey"
: "PlayDateKey", "underlyingDim"
: "DimDate", "underlyingDimCol"
: "Date"
},
{ "colName"
: "TeamAbbrevName", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "TeamKey", "underlyingDim"
: "DimTeam", "underlyingDimCol"
: "HistTeamAbbrevName"
},
{ "colName"
: "TeamRole","colType"
: "String", "isForSettingForeignKey" : "true", "foreignKey"
: "TeamRoleKey", "underlyingDim"
: "DimTeamRole", "underlyingDimCol"
: "TeamRole"
},
{ "colName"
: "Formation", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol"
: "Formation"
},
{ "colName"
: "PlayType", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol"
: "PlayType"
},
{ "colName"
: "PassType", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol"
: "PassType"
},
{ "colName"
: "PenaltyType", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTypeKey", "underlyingDim"
: "DimPlayType", "underlyingDimCol"
: "PenaltyType"
},
{ "colName"
: "Down", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "FieldPositionKey", "underlyingDim"
: "DimFieldPosition", "underlyingDimCol"
: "Down"
},
{ "colName"
: "YardLine", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "FieldPositionKey", "underlyingDim"
: "DimFieldPosition", "underlyingDimCol"
: "YardLine"
},
{ "colName"
: "RushDirection", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "FieldPositionKey", "underlyingDim"
: "DimFieldPosition", "underlyingDimCol"
: "RushDirection"
},
{ "colName"
: "YardLineFixed", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "FieldPositionKey", "underlyingDim"
: "DimFieldPosition", "underlyingDimCol"
: "YardLineFixed"
},
{ "colName"
: "YardLineDirection", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "FieldPositionKey", "underlyingDim"
: "DimFieldPosition", "underlyingDimCol"
: "YardLineDirection"
},
{ "colName"
: "Quarter", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTimeKey", "underlyingDim"
: "DimPlayTime", "underlyingDimCol"
: "Quarter"
},
{ "colName"
: "Minute", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTimeKey", "underlyingDim"
: "DimPlayTime", "underlyingDimCol"
: "Minute"
},
{ "colName"
: "Second", "colType"
: "String", "isForSettingForeignKey"
: "true", "foreignKey"
: "PlayTimeKey", "underlyingDim"
: "DimPlayTime", "underlyingDimCol"
: "Second"
}
]
}
]
}
}
In some cases, for example to preserve a value of a field on initial load, the previous version of a dimension can be used. The previous version of a dimension is a standard feature of a library. To use previous version of a dimension specify it as a source in stgSources configuration section and then reference it in dimStgSources
Use “Overriding Configuration on Command Line” feature