DBTimes

Home

Products

Frameworks

Contact us


 

Data Warehouse ETL library. 3

Object ModelObject 4

Value Members. 4

Example. 4

Class Dim.. 5

Value Members. 5

Overrides. 6

Class Fact 7

Value Members. 7

Overrides. 9

Configuration file. 10

Overriding Configuration on Command Line. 12

Configuration Attributes. 13

jobType. 13

isInitialLoad. 13

rerunEtlAfter 13

logFileDir 13

stgSources. 13

stgSource.moniker 13

stgSource.effectiveDateColumn. 13

stgSource.isDefaultForEffectiveDate. 14

stgSource.effectiveDateRule. 14

stgSource.timestampColumn. 14

stgSource.fileSource. 14

stgSource.fileSource.path. 14

stgSource.fileSource.parquet 14

dimAuthority. 14

dimAuthority.packageName. 14

dimAuthority.dimensionNamePattern. 14

dimAuthority.dimensionsDestinationFile. 14

dimAuthority.dimensionsDestinationFile.parquet 14

dimAuthority.dimensionsDestinationFile.path. 14

dimAuthority.dimensionsDestinationFile.previousCopyPath. 15

dimensions. 15

dimension.name. 15

dimension.isLoad. 15

dimension.dimensionStgSources. 15

dimension.dimensionStgSources.moniker 15

dimension.dimensionDimSources. 15

dimension.dimensionStgSources.name. 15

dimension.dimensionStgSources.schema. 15

dimension.schema. 15

dimension.schema.column.colName. 15

dimension.schema.column.colType. 15

dimension.schema.column.isKey. 15

dimension.schema.column.isSurrogateKey. 16

dimension.schema.column.unknownValue. 16

dimension.schema.column.isNaturalKey. 16

dimension.schema.column.isTypeTwo. 16

dataMart 16

dataMart.packageName. 16

dataMart.factNamePattern. 16

dataMart.factsDestinationFile. 16

dataMart.factsDestinationFile.parquet 16

dataMart.factsDestinationFile.path. 16

dataMart.factsDestinationFile.pathWithKeySetters. 17

facts. 17

fact.name. 17

fact.isLoad. 17

fact.processingMode. 17

fact.factStgSources. 17

fact.factStgSources.moniker 17

fact.schema. 17

fact.schema.column.colName. 18

fact.schema.column.colType. 19

fact.schema.column.isForPartition. 19

fact.schema.column.isMergeKey. 19

fact.schema.column.isForSettingForeignKey. 19

fact.schema.column.foreignKey. 19

fact.schema.column.underlyingDim.. 19

fact.schema.column.underlyingDimCol 19

fact.schema.column.isEffDateForTypeTwo. 19

fact.schema.column.isMeasure. 20

Sample configuration file. 20

Spark configuration. 28

Scenarios and solutions. 28

Using previous version of a dimension to restore a field value on initial load. 28

Enable/disable Debug flag without creating separate configuration file. 28

Data Warehouse ETL library

 

Data Warehouse ETL (DW ETL) is a library to maintain dimensional model in the file system, HDFS.

 

The library implements ETL process for the dimensional model created using Kimball methodology (see “The Data Warehouse Toolkit” by Kimball/Ross).

 

The library’s Dim and Fact classes generically handle various aspects of dimension design and lifecycle.

 

Dim class:

-    Load dimension from multiple sources

-    Historical load of data on initial load for both Type 1 and Type 2 dimensions

-    Re-load dimension from specific point in time

-    Type 1 and Type 2 changes, and by extension Type 3 and Type 4. Type 2 dimension allows for type 1 columns, so it more like Type 7, but we still call it Type 2

-    Generation of surrogate keys

-    Algorithmic surrogate keys

-    Determination of dates for historical load. The dimension may need to be processed for multiple dates on model re-generation (on initial load or when the design changed) or to catch-up when several days are skipped for any reason

-    Load dimension for multiple effective dates in bulk or one at a time.

-    Setting keys on a fact table with the option to override the default implementation

-    Full and incremental dimension load

 

Fact class:

-    Assignment of foreign keys

-    Load fact table from multiple sources

-    Load multiple dates in bulk or one at a time.

-    Full and incremental fact table load

-    Supports different strategies of incremental fact table processing

-    Creation of table with key setting columns to troubleshoot Unknown keys and setting keys in general

 

Object ModelObject holds library’s entry point.

Object ModelObject

package com.dbtimes.dw.etl
 
object ModelObject

Value Members

def runEtl(appConfig: Config): Unit

               The method is used to execute the ETL process. The parameter holds the configuration that describes the dimensional model. Only one invocation of runEtl is allowed per application execution.

Example

 
import org.apache.spark.sql.SparkSession
import com.typesafe.config.{Config, ConfigFactory}
import scala.collection.JavaConverters._
import com.dbtimes.dw.etl.ModelObject
 
object DwNFL {
  def main(args: Array[String]) {
    val configFileName = if (args.length == 1) args(0) else "dwEtlDefault.json"
    val appConfig = ConfigFactory.load(configFileName)
    val sparkSessionBuilder = DwNFL.configureSparkSession(appConfig)
    val spark = sparkSessionBuilder.getOrCreate()
 
    // Do initialization specific to a given DW
   
    // ETL process to build or update dimensions and fact tables
    ModelObject.runEtl(appConfig)
    spark.stop()
  }
 
  private def configureSparkSession(appConfig: Config): SparkSession.Builder = {
    var sparkSessionBuilder = SparkSession.builder()
    sparkSessionBuilder = if (appConfig.hasPath("sparkParams.sessionAppName")) sparkSessionBuilder.appName(appConfig.getString("sparkParams.sessionAppName")) else sparkSessionBuilder
    sparkSessionBuilder = if (appConfig.hasPath("sparkParams.sessionConfigMaster")) sparkSessionBuilder.config("spark.master", appConfig.getString("sparkParams.sessionConfigMaster")) else sparkSessionBuilder
    sparkSessionBuilder = if (appConfig.hasPath("sparkParams.sessionConfigDriverHost")) sparkSessionBuilder.config("spark.driver.host", appConfig.getString("sparkParams.sessionConfigDriverHost")) else sparkSessionBuilder
    sparkSessionBuilder = if (appConfig.hasPath("sparkParams.sessionConfigLocalDir")) sparkSessionBuilder.config("spark.local.dir", appConfig.getString("sparkParams.sessionConfigLocalDir")) else sparkSessionBuilder
    sparkSessionBuilder = sparkSessionBuilder.config("spark.driver.maxResultSize", "5g")
    sparkSessionBuilder
  }
}

Class Dim

package com.dbtimes.dw.etl
 
class Dim

 

In the most common scenario there are only two steps required to implement a new dimension:

 

1.    Create a new class derived from Dim

2.    Override loadDim method

 

Additional customization can be done by overriding appropriate methods of a Dim class.

Value Members

protected val dimName: String

Name of the dimension.

 

protected val dimKeyUnknownValue: String

               Unknown value for dimension key. Populated from configuration dimension schema attribute "unknownValue" for dimension key – see dimension schema attributes.

         
protected val spark: SparkSession

               A copy of SparkSession. The value is the same as calling SparkSession.builder().getOrCreate()

 

protected val isInitialLoad: Boolean 

               This flag is populated from configuration path "dwEtl.isInitialLoad". Flag is used to indicate if the dimension created from scratch – initial load - or is being updated with new data. The initial load is executed when dimension is created for the first time or when the dimensional model has changed.

 
protected val effDateColumnNameInDatesToProcess: String 

               The name of the date column in the view viewWithDatesToProcess (see next).

 
protected val viewWithDatesToProcess: String

               The name of the view with effective dates. The view holds dates for the current run of the ETL process.

 

Code example:

 
class DimDate(dimName: String) extends Dim(dimName) {
 
  override def loadDim(stgSrcTempView: String): Option[DataFrame] = {
 
    val sqlStgSource
    =
      s"""| SELECT
          |   CAST( DATE_FORMAT($effDateColumnNameInDatesToProcess, 'yyyyMMdd') AS INT) AS DateKey,
          |   $effDateColumnNameInDatesToProcess AS Date,
          |   DATE_FORMAT($effDateColumnNameInDatesToProcess, 'yyyy-MM-dd') AS DateDesc,
          |   DATE_FORMAT($effDateColumnNameInDatesToProcess, 'MMM') AS MonthName,
          |   CAST( DATE_FORMAT($effDateColumnNameInDatesToProcess, 'yyyy') AS INT) AS Year,
          |   QUARTER($effDateColumnNameInDatesToProcess) AS Quarter,
          |   CONCAT( 'Q', CAST( QUARTER($effDateColumnNameInDatesToProcess) AS STRING ) ) AS QuarterDesc
          | FROM $viewWithDatesToProcess """.stripMargin
 
    val dfStgSource = spark.sql(sqlStgSource)
    Some(dfStgSource)
  }
}
         
protected def getLastProcessedStgSourceTimestamp(stgSourceTempView: String): Option[SqlTimestamp]
         The method returns the maximum timestamp for the staging data source that was processed in the previous ETL run for the current dimension. It is used to load data into dimensional model object incrementally.
 
protected def isStgSourceChangedSinceLastLoad(stgSourceTempView: String): Boolean
         The method returns false if the data source has not changed since was processed in the previous ETL run for the current dimension, and true otherwise. The method is used to load data into dimensional model object incrementally.

Overrides

protected def loadDim(stgSrcTempView: String): Option[DataFrame]

               Override this method to load new data in bulk or for all effective dates. This is one of the two versions of loadDim. Either this one of the other must be overridden, but not both.


protected def loadDim(effDateYYYY_MM_DD: String, stgSrcTempView: String): Option[DataFrame]

               Override this method to load new data for each effective date. This version of loadDim must be used to load a dimension with Type 2 columns. This is one of the two versions of loadDim. Either this one or the other must be overridden, but not both.

 

protected def enrichDim(dfSrcDim: DataFrame): DataFrame

               Optional override. If implemented, it will be called after all the sources have been loaded using loadDim(stgSrcTempView: String)  or all sources loaded for one effective date using loadDim(effDateYYYY_MM_DD: String, stgSrcTempView: String). Implement this override to perform transformation on the source data before is is added to the dimension. The returned DatFrame has to have the same schema as the one returned from loadDim method.

 

protected def setForeignKeyOnFactTable(fact: DataFrame, mapDimColsToFactCols: Map[String, String], effDateColAsOption: Option[String]): DataFrame

               Optional override. Used when setting a key is other than an equi-join on natural keys in fact table template and dimension. For, example when setting a key s done in some priority order, or when there multiple ways to join on natural keys.

 
protected def preProcess(stgSourceTempViews: List[String], datesToProcess: List[Date]): Unit

Optional override. It is called before sources are loaded. Can be used to create views later used in the load process.

 
protected def postProcess(stgSourceTempViews: List[String], datesToProcess: List[Date], dfModelObject: DataFrame): DataFrame

Optional override. It is called after dimension processing is complete right before the dimension is persisted into a parquet file. One possible use is to do some do logging or for debugging.

 
protected def getCustomDatesToProcess(dfDatesToProcessBasedOnConfig: DataFrame): Option[DataFrame]

Optional override. It is called after effective dates are determined from the sources based on configuration. In this method these dates can be modified or completely replaced based on custom logic.

Class Fact

package com.dbtimes.dw.etl
 
class Fact

 

In the most common scenario there are only two steps required to implement a new fact table:

 

1.    Create a new class derived from Fact

2.    Override loadFact method

 

Additional customization can be done by overriding appropriate methods of a Fact class.

Value Members

protected val factName: String

Name of the dimension.

 

protected val spark: SparkSession

               A copy of SparkSession. The value is the same as calling SparkSession.builder().getOrCreate()

 

protected val isInitialLoad: Boolean 

               This flag is populated from configuration path "dwEtl.isInitialLoad". Flag is used to indicate if the dimension created from scratch – initial load - or is being updated with new data. The initial load is executed when dimension is created for the first time or when the dimensional model has changed.

 
protected val effDateColumnNameInDatesToProcess: String 

               The name of the date column in the view viewWithDatesToProcess (see next).

 
protected val viewWithDatesToProcess: String

               The name of the view with effective dates. The view holds dates for the current run of the ETL process.

 

Code example:

 

class FactPlays(factName: String) extends FactsNFL(factName) {

 

  override def loadFact(stgSrcTempView: String): Option[DataFrame] = {

 

    if ( !isStgSourceChangedSinceLastLoad(stgSrcTempView) ) {

      dwEtlLog.info(s"-- Skipping loading fact $factName from source $stgSrcTempView as is has not changed since last load")

      None

    }

    else {

      val dfFact = if (stgSrcTempView == "PlayByPlay") {

        val sqlStgSource =

          s"""

             |SELECT

             |  src.SeasonYear                             AS SeasonYear,

             |  "$stgSrcTempView"                          AS DataSourceMoniker,

             |  CAST( src.GameId AS STRING )               AS GameId,

             |  CAST( src.GameDate AS DATE )               AS PlayDate,

             |  CAST( src.PenaltyYards      AS DOUBLE )    AS PenaltyYards

             |FROM $stgSrcTempView AS src

             |   INNER JOIN $viewWithDatesToProcess AS dates   ON  src.SeasonYear = udfSeasonYear( dates.$effDateColumnNameInDatesToProcess )

             |

             | """.stripMargin

 

        dwEtlLog.info(s"-- Loading fact $factName from source $stgSrcTempView using sql:\n" + sqlStgSource)

 

        val dfStgSource = spark.sql(sqlStgSource)

 

        dfStgSource.dropDuplicates()

      }

      else {

        throw new RuntimeException(s"""Etl ERROR: Unknown stg source while loading fact $factName""")

      }

 

      if (DwNFL.getIsDebug) {

        dfFact.printSchema()

        dfFact.show(40)

      }

 

      Some(dfFact)

    }

  }

}

         
protected def getLastProcessedStgSourceTimestamp(stgSourceTempView: String): Option[SqlTimestamp]
         The method returns the maximum timestamp for the staging data source that was processed in the previous ETL run for the current dimension. It is used to load data into dimensional model object incrementally.
 
protected def isStgSourceChangedSinceLastLoad(stgSourceTempView: String): Boolean
         The method returns false if the data source has not changed since was processed in the previous ETL run for the current dimension, and true otherwise. The method is used to load data into dimensional model object incrementally.

Overrides

protected def loadFact(stgSrcTempView: String): Option[DataFrame]

               Override this method to load new data in bulk or for all effective dates. This is one of the two versions of loadFact. Either this one of the other must be overridden, but not both.


protected def loadFact(effDateYYYY_MM_DD: String, stgSrcTempView: String): Option[DataFrame]

               Override this method to load new data for each effective date. This is one of the two versions of loadFact. Either this one or the other must be overridden, but not both.

 

protected def preProcess(stgSourceTempViews: List[String], datesToProcess: List[Date]): Unit

Optional override. It is called before sources are loaded. Can be used to create views later used in the load process.

 
protected def postProcess(stgSourceTempViews: List[String], datesToProcess: List[Date], dfModelObject: DataFrame): DataFrame

Optional override. It is called after fact processing is complete right before the fact table is persisted into a parquet file. One possible use is to do some do logging or for debugging.

 
protected def getCustomDatesToProcess(dfDatesToProcessBasedOnConfig: DataFrame): Option[DataFrame]

Optional override. It is called after effective dates are determined from the sources based on configuration. In this method these dates can be modified or completely replaced based on custom logic.

 

Code example:

 

override def getCustomDatesToProcess(dfDatesToProcessBasedOnConfig: DataFrame): Option[DataFrame] = {

 

  dfDatesToProcessBasedOnConfig.createOrReplaceTempView("DatesToProcessBasedOnConfig")

 

  val sqlDatesToProcess

  =

    s"""|SELECT MAX( $effDateColumnNameInDatesToProcess ) AS SeasonYearDate

        |FROM DatesToProcessBasedOnConfig

        |GROUP BY udfSeasonYear( $effDateColumnNameInDatesToProcess )

        |""".stripMargin

  val dfDatesToProcess = spark.sql(sqlDatesToProcess)

 

  Some(dfDatesToProcess)

}

 

 

Configuration file

 

Configuration file describes dimensional model and parameters required to run the ETL process. The section used by the etl library must be called "dwEtl". Configuration file can have additional attributes required by application. Additional attributes can be added to any section of configuration.

 

Optional attributes are highlighted:

·         "effectiveDateRule" : "ALL", highlighted in grey and are shown with default values, or

·         "rerunEtlAfter", highlighted in yellow and are not required, or

·         isDebug, highlighted in magenta and are additional application specific attributes that are not used by the library. These attributes can be organized in any way. The additional attributes can also be placed inside "dwEtl" section, like "description" and "transactionName" in the configuration structure below.

 

{
  "isDebug" : "false",
  "sparkParams" :
  {
    "sessionAppName" : "Spark NFL DW",
    "sessionConfigMaster" : "local",
    "sessionConfigDriverHost" : "localhost",
    "sessionConfigLocalDir" : "C:\\DATA\\Temp"
  },
  "dwEtl" :
  {
    "jobType" : "LoadDimsFacts",
    "isInitialLoad" : "true",
    "rerunEtlAfter" : "2021-04-28 16:33:00.119",
    "logFileDir" : "C:\\LOGS\\Etl\\",
    "etlLogFilePath" : "C:\\DATA\\DW.ETL\\etlDwNFL.parquet\\",
    "stgSources" :
    [
      { "moniker" : "PlayByPlay",
        "description" : "Play by play s/sheet for all games",
        "transactionName" : "play",
        "effectiveDateColumn" : "GameDate",
        "isDefaultForEffectiveDate" : "true",
        "effectiveDateRule" : "ALL",
        "timestampColumn" : "_SrcDt_RowTimestamp",
        "fileSource" : { "path" : "C:\\ASCODE\\DATA\\NFL\\STAGING\\stgPbp.parquet",  "parquet" : { } }
      },
      
    ],
    "dimAuthority" :
    {
      "packageName" : "com.dbtimes.nfldw.dimauth",
      "dimensionNamePattern" : "(?<dimName>Dim[A-Z][a-zA-Z0-9]+)",
      "dimensionsDestinationFile" :
      {
        "parquet" : {},
        "path" : "C:\\DATA\\NFL\\DW\\DimNamePlaceholder.parquet",
        "previousCopyPath" : "C:\\ASCODE\\DATA\\NFL\\DimAuthorityPrev\\DimNamePlaceholder.prev.parquet"
      }
    },
    "__dimension_name__comment" : "dimension name must match the dimensionNamePattern.It will be used to replace name placeholder with actual dimension name",
    "dimensions" :
    [
      {
        "name" : "DimMyDimension",
        "isLoad" : "true",
        "dimensionStgSources" :
        [
          { "moniker" : "someMoniker" },
          
        ],

        "dimensionDimSources" :
         [
 
         { "name" : "DimDate",
 
           "schema" :
 
          [
              { "colName" : "Date" },
    
         { "colName" : "DateDesc" },
  
           { "colName" : "MonthName" }
  
         ]
 
         }
        ],

        "schema" :
        [
          { "colName" : "MyDimensionKey","colType" : "Long", "isKey" : "true", "isSurrogateKey" : "true", "unknownValue" : "0" },
          { "colName" : "MyDimensionNaturalKeyOne","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "Unknown" },
          { "colName" : "MyDimensionNaturalKeyTwo","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "Unknown" },
          { "colName" : "SomeStringDimensionAttr","colType" : "String",  "isNaturalKey" : "false", "unknownValue" : "Unknown" },
          { "colName" : "SomeIntDimensionAttr","colType" : "Integer",  "unknownValue" : "-1" },
          { "colName" : "SomeDateDimensionAttr ","colType" : "Date", "unknownValue" : "1900-01-01"  }
          { "colName" : "SomeType2Attr", "colType" : "String", "isTypeTwo" : "true", "unknownValue" : "" },
        ]
      },

     

    ],

    "dataMart" :

    {

      "packageName" : "com.dbtimes.nfldw.datamartgames",

      "factNamePattern" : "(?<factName>Fact[A-Z][a-zA-Z0-9]+)",

      "factsDestinationFile" :

      {

        "parquet" : {},

        "path" : "C:\\DATA\\NFL\\DW\\FactNamePlaceholder.parquet",

        "pathWithKeySetters" : "C:\\DATA\\NFL\\DW.ETL\\FactNamePlaceholder.WithKeySetters.parquet"

      }

    },

    "facts" :

    [

      { "name" : "FactPlaysForMerge",

        "isLoad" : "false",

        "processingMode" : "MERGE_PARTITION",

        "factStgSources" :

        [

          { "moniker" : "PlayByPlay" }

        ],

        "schema" :

        [

          { "colName" : "SeasonYear", "colType" : "Integer", "isForPartition" : "true" },

          { "colName" : "DataSourceMoniker", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "DataSourceKey", "underlyingDim" : "DimDataSource", "underlyingDimCol" : "DataSourceMoniker" },

          { "colName" : "GameId", "colType" : "String", "isMergeKey" : "true", "isForSettingForeignKey" : "true", "foreignKey" : "GameKey", "underlyingDim" : "DimGame", "underlyingDimCol" : "GameId" },

          { "colName" : "PlayDate", "colType" : "Date", "isMergeKey" : "true", "isForSettingForeignKey" : "true", "isEffDateForTypeTwo" : "true", "foreignKey" : "PlayDateKey", "underlyingDim" : "DimDate", "underlyingDimCol" : "Date" },

          { "colName" : "OffenseTeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "OffenseTeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },

          { "colName" : "DefenseTeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "DefenseTeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },

          { "colName" : "PenaltyTeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PenaltyTeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },

          { "colName" : "TimeoutTeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "TimeoutTeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },

          { "colName" : "Formation", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "Formation" },

          { "colName" : "PlayType", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "PlayType" },

          { "colName" : "PassType", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "PassType" },

          { "colName" : "PenaltyType", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "PenaltyType" },

          { "colName" : "ToGo", "colType" : "Double",  "isMeasure" : "true" },

          { "colName" : "PenaltyYards", "colType" : "Double",  "isMeasure" : "true" }

        ]

      },

     

    ]

  }

}

 

Overriding Configuration on Command Line

Ability to override configuration file can be useful, for example, to load different objects from the same configuration by disabling load of specific objects in specific runs.

Use this code in main method of a Data Warehouse implementation to load configuration file and apply overrides

 

val configFileName = if (args.length >= 1) args(0) else throw new RuntimeException("""ERROR: Need to pass a parameter - configuration file name """)
val appConfig = MiscHelper.loadConfigAndApplyOverrides( configFileName, args.drop(1) )

 

instead of this that only loads the configuration from a file

 

val configFileName = if (args.length == 1) args(0) else throw new RuntimeException("""ERROR: Need to pass a parameter - configuration file name """)
val appConfig = ConfigFactory.load(configFileName)

 

Configuration Attributes

jobType

The ETL process consists of two parts – ETL demensions and ETL facts. jobType  is used to run either both or one of these parts. The allowed values for this attribute are "LoadDimsFacts", or "LoadDims", or "LoadFacts". The main scenario for running ETL dimensions separate from ETL facts is loading dimensions in parallel by defining separate configurations for one or more dimensions (using isLoadFlag) and running ETL facts after - serially or in parallel.

isInitialLoad

The flag indicates if the load is initial or subsequent. If the load is initial the existing files for dimensions and/or facts are deleted. Which ones are deleted depends on the jobType -  for "LoadDimsFacts" both dimensions and fact files are deleted, for "LoadDims" - dimensions are deleted, and for "LoadFacts" - fact files are deleted.

 

There is one use case when this flag should be set to “true” for ongoing loads: when dimension or fact has to be recreated every time based on current sources, i.e., when rows for non-existent natural keys deleted and all attributes are Type 1.

rerunEtlAfter

The attribute is used to rerun subsequent load for effective dates processed earlier after the timestamp value specified based on the log.

logFileDir

A full path of the directory for log files. Used primarily for development. It is not available on HDFS. There are two log files created – one with library messages and, another, with Spark messages.

stgSources

An array of specifications for staging files. The staging files are a source for dimensions and facts. We will use stgSource to refer to one element of stgSources array.

stgSource.moniker

A name or label of the staging source. The moniker must be unique within a set of staging sources.

stgSource.effectiveDateColumn

An optional attribute to specify a column in a staging source with effective date. Effective dates are used to load dimensional model objects incrementally, to process Type 2 dimensions, to process Date dimension and to set Date key on a fact table. Only one out of all sources for a dimension can have stgSource.effectiveDateColumn . If more than one sources have effective date defined, only the first one will be used.

stgSource.isDefaultForEffectiveDate

An optional attribute to designate a staging source as a default source for effective date. The staging source that is default for effective date is used when none of the sources for a specific model object have effective date column.

stgSource.effectiveDateRule

An attribute to specify how to derive effective date from a column stgSource.effectiveDateColumn. The rule can have following predefined values:

·         ALL (default) – all calendar dates between minimum and maximum dates in the source,

·         DISTINCT – all distinct dates in the source,

·         WEEKDAYS - distinct weekdays (Monday-Friday) between minimum and maximum dates in the source.

stgSource.timestampColumn

An optional attribute to specify a column with the timestamp. If staging sources are loaded using source load library then every source will have a timestamp column. Timestamp is used for incremental processing in subsequent ETL runs.

stgSource.fileSource

An attribute to specify a file as a type of a staging source.

stgSource.fileSource.path

An attribute to specify a full path of a staging source file.

stgSource.fileSource.parquet

An attribute to specify that file is in parquet format.

dimAuthority

A section of specifications for attributes common to all dimensions.

dimAuthority.packageName

All dimension classes described in a configuration must be defined in the same package. Dimension classes can be split up between different configurations for parallel load. In that case dimension classes can be implemented in different packages. Package name is used to dynamically create an instance of a dimension class.

dimAuthority.dimensionNamePattern

A regular expression pattern for dimension name. Each dimension’s name must match this pattern.

dimAuthority.dimensionsDestinationFile

A section to describe the dimension destination file. The file will be created and maintained by the ETL library.

dimAuthority.dimensionsDestinationFile.parquet

Defines dimension destination file as parquet.

dimAuthority.dimensionsDestinationFile.path

Defines destination file path for all dimensions. For example, "path" : "/dw/data/dm/DimNamePlaceholder.parquet". Text 'DimNamePlaceholder' in path must match the dimensionNamePattern above. Text 'DimNamePlaceholder' will be replaced with actual dimension name which also has to match the pattern.

dimAuthority.dimensionsDestinationFile.previousCopyPath

Defines path for the previous copy of the dimension. The previous copy can be used to re-run the ETL in case of the latest version got corrupt and has unfixable data issues.

dimensions

An array of specifications for dimensions. We will use dimension to refer to one element of dimensions array.

dimension.name

The name of the dimension. The name must match the pattern defined in dimAuthority.dimensionNamePattern attribute.

dimension.isLoad

If set to false the dimension will not be loaded. Can be used to create separate configurations to load specific dimensions or during development to skip the load of already tested dimensions.

dimension.dimensionStgSources

An array of staging source monikers used to load this dimension. The Etl library will call loadDim for each staging source passing the moniker name as parameter. The array can be empty. If it is empty the loadDim will be called once with “N/A” as the value for staging source moniker.

dimension.dimensionStgSources.moniker

A staging source moniker from the stgSources array.

dimension.dimensionDimSources

An array of dimensions that are a source for loading current dimension. In some instances the dimension can be used to load another dimension for performance reasons and to minimize code duplication. A dimension that is a source will be loaded prior to loading this dimension. The array can be empty. A loadDim will not be called for each dimSource and will not be passed as a parameter to loadDim, but dimension source can be referenced as a view with the same name in the SQL to load dimension.

dimension. dimensionDimSources.name

A dimension Name that is a source for loading current dimension.

dimension. dimensionDimSources.schema

A dimension source can optionally have a schema that list a subset of columns used to load current dimension. If schema is absent all columns can be used in loading current dimendion.

dimension.schema

An array of specifications for dimension columns. We will use dimension.schema.column to refer to one element of dimension.schema array.

dimension.schema.column.colName

A name of the dimension column.

dimension.schema.column.colType

A data type of the dimension column. colType must be Short, Integer or Long if this column is a dimension key (i.e., "isKey" : "true")

dimension.schema.column.isKey

If true, this column is a dimension key.

dimension.schema.column.isSurrogateKey

isSurrogateKey can only be defined for dimension key column, i.e., when "isKey" : "true". If true, this column is a surrogate key and its value will be generated using increasing – but not sequential - distinct values. The data type for a surrogate key must be “Long”.

 

The alternative to a surrogate key can be a key generated using some algorithm. For example, the key for Date dimension can be based on the calendar date member and generated using YYYYMMDD format. The key like that will satisfy all the requirements for dimension key. If the key is algorithmically generated, specify "isSurrogateKey" : "false"  and include dimension key in the result set created in loadDim override (see loadDim override for DimDate or DimTeamRole dimensiona in sample implementation of NFL Data Warehouse).

 

Limitation: the maximum number of rows in a dimension with surrogate key cannot exceed 1 Billion.

dimension.schema.column.unknownValue

Unknown value must be the valid value for column data type. Use yyyy-mm-dd format for Date type. If unknown value is not specified it will be set to NULL to any column but the dimension key – the default unknown value for dimension key is 0 (zero).

dimension.schema.column.isNaturalKey

A flag to indicate that a column is a natural key for the dimension. Multiple columns can be defined as natural key to create a compound natural key.

dimension.schema.column.isTypeTwo

A flag to indicate that a column is a Type 2 column. Multiple columns can be defined as Type 2. If the flag is not present or set to false the column is Type 1 column.

dataMart

A section of specifications for attributes common to all facts or fact tables.

dataMart.packageName

All fact classes described in a configuration must be defined in the same package. Fact classes can be split up between different configurations for parallel load. In that case fact classes can be implemented in different packages. Package name is used to dynamically create an instance of a fact class.

dataMart.factNamePattern

A regular expression pattern for fact table name. Each fact table’s name must match this pattern.

dataMart.factsDestinationFile

A section to describe the fact destination file. The file will be created and maintained by the ETL library.

dataMart.factsDestinationFile.parquet

Defines fact destination file as parquet.

dataMart.factsDestinationFile.path

Defines destination file path for all facts. For example, "path" : "/dw/data/dm/FactNamePlaceholder.parquet". Text FactNamePlaceholder in path must match the factNamePattern above. Text 'FactNamePlaceholder' will be replaced with actual fact table name which also has to match the pattern.

dataMart.factsDestinationFile.pathWithKeySetters

Defines path for the fact table with columns used to set keys. This file is used mostly in development to insure the correct setting on dimension keys on a fact table and troubleshoot the unknown keys. The help in troubleshooting the unknown keys can make this file useful in production. Creating of this file has relatively low performance cost.

facts

An array of specifications for fact tables or facts (fact table and fact are used interchangeably). We will use fact to refer to one element of facts array.

fact.name

The name of the fact table. The name must match the pattern defined in dataMart.factNamePattern attribute.

fact.isLoad

If set to false the fact table will not be loaded. Can be used to create separate configurations to load specific facts or during development to skip the load of already tested facts.

fact.processingMode

The following fact processing modes are supported on subsequent ETL runs:

·         REPLACE – new fact table replaces the old one. Fact table can have partition columns (one or more), i.e., columns with isForPartition flag set to true. If partition columns are defined the fact table will be partitioned.

·         REPLACE_PARTITION – this mode is similar to REPLACE, but requires to have partition columns. In this mode new partitions replace old partitions with the same name, but if new partition does not exist for some existing old partitions – the old partitions will remain.

·         MERGE – old fact table rows are merged with new based on the merge key columns. All rows in old fact table with the same merge key are replaced with new ones. The old rows with merge keys that do not exist in the new fact table will remain.

·         MERGE_PARTITION – this mode is similar to MERGE, only in addition to merge key takes partition column into consideration.

·         ADD – new rows are added to existing ones. Fact table can have one or more partition columns. If partition columns are defined the new rows are added to correct partitions.

fact.factStgSources

An array of staging source monikers used to load this fact. The Etl library will call loadFact for each staging source passing the moniker name as parameter. The array can be empty. If it is empty the loadFact will be called once with “N/A” as the value for staging source moniker.

fact.factStgSources.moniker

A staging source moniker from the stgSources array.

fact.schema

An array of specifications for fact columns. We use fact.schema.column to refer to one element of fact.schema array.

 

The fact table is constructed from schema using following rule:

·         If "isForPartition" column or columns are defined the fact table will be partioned on this columns using Spark partitioning feature.

·         Distinct "foreignKey" values will be used to create fact table foreign keys. If multiple columns are needed to set a foreign key, the schema would have multiple elements with the same value of a foreign key – one for each column needed to asset a key. For example, if there are two columns from the underlying dimension are needed to set a key, the fact table schema will have two elements with the same value of "foreignKey".

·         Columns with "isMeasure" attribute set will be added to a fact table as measures

 

For example the following schema

        "schema" :
        [
          {
"colName" : "SeasonYear", "colType" : "Integer", "isForPartition" : "true" },
          {
"colName" : "DataSourceMoniker", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "DataSourceKey", "underlyingDim" : "DimDataSource", "underlyingDimCol" : "DataSourceMoniker" },
          {
"colName" : "GameId", "colType" : "String", "isMergeKey" : "true", "isForSettingForeignKey" : "true", "foreignKey" : "GameKey", "underlyingDim" : "DimGame", "underlyingDimCol" : "GameId" },
          {
"colName" : "PlayDate", "colType" : "Date", "isMergeKey" : "true", "isForSettingForeignKey" : "true", "isEffDateForTypeTwo" : "true", "foreignKey" : "PlayDateKey", "underlyingDim" : "DimDate", "underlyingDimCol" : "Date" },
          {
"colName" : "OffenseTeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "OffenseTeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },
          {
"colName" : "DefenseTeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "DefenseTeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },
          {
"colName" : "PenaltyTeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PenaltyTeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },
          {
"colName" : "TimeoutTeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "TimeoutTeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },
          {
"colName" : "Formation", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "Formation" },
          {
"colName" : "PlayType", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "PlayType" },
          {
"colName" : "PassType", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "PassType" },
          {
"colName" : "PenaltyType", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "PenaltyType" },
          {
"colName" : "ToGo", "colType" : "Double",  "isMeasure" : "true" },
          {
"colName" : "PenaltyYards", "colType" : "Double",  "isMeasure" : "true" }
        ]

will result in a fact table with following columns

·         DataSourceKey (FK)

·         GameKey (FK)

·         PlayDateKey (FK)

·         OffenseTeamKey (FK)

·         PlayTypeKey (FK)

·         ToGo

·         PenaltyYards

 

The fact able will be partitioned on SeasonYear.

 

In this example there are four columns in the schema needed to set foreign key PlayTypeKey.

fact.schema.column.colName

A name of the fact column.

fact.schema.column.colType

A data type of the fact column. colType must be Double , Short, Integer or Long if this column is a measure, i.e., "isMeasure" : "true"

fact.schema.column.isForPartition

A flag to indicate that this column is used for partitioning fact data. For example, for column { "colName" : "SeasonYear", "colType" : "Integer", "isForPartition" : "true" } the fact directory structure would look like below

 

 

To query partitioned Fact tables in , for example Apache Drill, use syntax like this

select * from dfs.`fact-file-path\FactPlay.parquet` where dir0=’SeasonYear=2016’ limit 5;

fact.schema.column.isMergeKey

A flag to indicate that this column is used for merge of fact data during subsequent loads. All fact table records that have a merge key value that is present in a new fact table will be deleted and replaced with new values.

fact.schema.column.isForSettingForeignKey

When isForSettingForeignKey is set, this column is used to set a foreign key from a dimension on a fact row. The column is either a part of natural dimension key or some other column in a dimension that is used in a custom logic to set a foreign dimension key.

If this attribute of a column is set to true, there should be three more properties defined: foreignKey, underlyingDim, and underlyingDimCol. The underlyingDimCol is optional.

 

For example, if some dimension has three natural key columns that comprise a unique compound key, in the schema there would be three columns  - one for each natural key – to set one foreign key on the resulting fact table.

 

If underlyingDimCol is not defined the setForeignKeyOnFactTable method must be overridden to implement custom logic for setting the foreign key on the fact table.

fact.schema.column.foreignKey

The name of the column on the fact table to be set using this field.

fact.schema.column.underlyingDim

The name of the dimension with the primary key.

fact.schema.column.underlyingDimCol

The name of the column in the underlying dimension. If all underlying dimension columns have isNaturalKey set to true, the default implementation of setForeignKeyOnFactTable will be used, otherwise this method will have to be overridden in the corresponding dimension class.

fact.schema.column.isEffDateForTypeTwo

This flag can be set on a column of Date type to designate a column as an effective date for Type 2 dimension. The column will be used to set the key of Type 2 dimension via equation:  eff-date-for-type-two-column between dimension-row-start-date and dimension-row-end-date. Only one column in fact table schema can have this attribute set to true.

fact.schema.column.isMeasure

This flag designates a column as a fact table measure or fact. The measure column must be one of the numeric types – Float, Short, Integer, or Long.

Sample configuration file

 

{
 
"isDebug" : "false",
 
"sparkParams" :
  {
   
"sessionAppName" : "Spark NFL DW"
 
},
 
"dwEtl" :
  {
   
"jobType" : "LoadDimsFacts",
   
"isInitialLoad" : "true",
   
"optional rerunEtlAfter" : "2021-04-28 16:33:00.119",
   
"logFileDir" : "/dw/data/logs/etl",
   
"etlLogFilePath" : "/dw/data/etl/etldwnfl.parquet",
   
"__moniker_comment" : "moniker must be unique for each etl source, the sources will be searched by their monikers",
   
"stgSources" :
    [
      {
"moniker" : "PlayByPlay",
       
"description" : "Play by play s/sheet for all games",
       
"__transactionName" : "transactionName is an optional attribute",
       
"transactionName" : "play",
       
"__effectiveDateColumn" : "effectiveDateColumn is an optional attribute",
       
"effectiveDateColumn" : "GameDate",
       
"isDefaultForEffectiveDate" : "true",
       
"__timestampColumn" : "timestampColumn is optional and can be used for incremental load",
       
"timestampColumn" : "_SrcDt_RowTimestamp",
       
"__effectiveDateDays_comment1_" : "Effective Date\n|2014-01-11|\n|2013-12-15|\n|2014-10-12|\n rule selected from Stg Sources to define Data dimension and to load dimensions with type 2 columns. The values can be \"distinct\", i.e., only the values in the stg source used in the etl, \"weekdays\", i.e., all days between min and max values excluding weekends, or \"all\" - all days bewenn min and max values",
       
"__effectiveDateDays_comment2_" : "EffectiveDateRule default = all",
       
"effectiveDateRule" : "DISTINCT",
       
"fileSource" : { "path" : "/dw/data/staging/stgpbp.parquet",  "parquet" : { } }
      },
      {
"moniker" : "Teams",
       
"description" : "Team names historically",
       
"fileSource" : { "path" : "/dw/data/staging/team.parquet",  "parquet" : { } }
      }
    ],
   
"dimAuthority" :
    {
     
"packageName" : "com.dbtimes.nfldw.DataWarehouseNFL.DimAuthNFL",
     
"__effectiveDateStart" : "2015-01-01",
     
"__effectiveDateEnd" : "2020-01-01",
     
"dimensionNamePattern" : "(?<dimName>Dim[A-Z][a-zA-Z0-9]+)",
      
"dimensionsDestinationFile" :
      {
       
"parquet" : {},
       
"__path__comment" : "text 'DimNamePlaceholder' in path must match the dimensionNamePattern above. Text 'DimNamePlaceholder' will be replaced with actual dimension name which also has to match the pattern",
       
"path" : "/dw/data/dm/dimnameplaceholder.parquet",
       
"previousCopyPath" : "/dw/data/dmprev/dimnameplaceholder.prev.parquet"
     
}
    },
   
"__dimension_name__comment" : "dimension name must match the dimensionNamePattern.It will be used to replace name placeholder with actual dimension name",
   
"dimensions" :
    [
      {
       
"name" : "DimDataSource",
       
"isLoad" : "true",
       
"dimensionStgSources" :
        [
        ],
       
"schema" :
        [
          {
"colName" : "DataSourceKey","colType" : "Long", "isKey" : "true", "isSurrogateKey" : "true", "unknownValue" : "0" },
          {
"colName" : "DataSourceMoniker","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "Unknown" },
          {
"colName" : "DataSourceDescription","colType" : "String", "unknownValue" : "Unknown" },
          {
"colName" : "TransactionName","colType" : "String", "unknownValue" : "Unknown" }
        ]
      },
      {
       
"__dimension_name__comment" : "dimension name must match the dimensionNamePattern. It will be used to replace name placeholder with actual dimension name",
       
"name" : "DimDate",
       
"isLoad" : "true",
       
"__dimensionStgSources__comment_1" : "dimensionStgSources is optional and can be empty",
       
"__dimensionStgSources__comment_2" : "dimensionStgSources can have secondary sources to join to primary source to build the dimenion",
       
"dimensionStgSources" :
        [
        ],

        
"schema" :
        [
          {
"colName" : "DateKey","colType" : "Integer", "isKey" : "true", "isSurrogateKey" : "false", "unknownValue" : "0" },
          {
"colName" : "Date","colType" : "Date", "isNaturalKey" : "true", "unknownValue" : "1900-01-01" },
          {
"colName" : "DateDesc","colType" : "String", "unknownValue" : "Unknown" },
          {
"colName" : "MonthName","colType" : "String", "unknownValue" : "Unknown" },
          {
"colName" : "Year","colType" : "Integer",  "unknownValue" : "1900" },
          {
"colName" : "Quarter","colType" : "Integer", "unknownValue" : "0"  },
          {
"colName" : "QuarterDesc","colType" : "String", "unknownValue" : "Unknown" },
          {
"colName" : "SeasonYear","colType" : "Date", "unknownValue" : "1900-01-01"  },
          {
"colName" : "SeasonDescription","colType" : "String", "unknownValue" : "Unknown" }
        ]
      },
      {
       
"name" : "DimGame",
       
"isLoad" : "true",
       
"dimensionStgSources" :
        [
          {
"moniker" : "PlayByPlay" },
          {
"moniker" : "Teams" }
        ],
       
"schema" :
        [
          {
"colName" : "GameKey","colType" : "Long", "isKey" : "true", "isSurrogateKey" : "true", "unknownValue" : "0" },
          {
"colName" : "GameId","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "Unknown" },
          {
"colName" : "GameTeams","colType" : "String", "unknownValue" : "Unknown" }
        ]
      },
      {
       
"name" : "DimFieldPosition",
       
"isLoad" : "true",
       
"dimensionStgSources" :
        [
          {
"moniker" : "PlayByPlay" }
        ],
       
"schema" :
        [
          {
"colName" : "FieldPositionKey","colType" : "Long", "isKey" : "true", "isSurrogateKey" : "true", "unknownValue" : "0" },
          {
"colName" : "Down","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "YardLine","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "RushDirection","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "YardLineFixed","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "YardLineDirection","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" }
        ]
      },
      {
       
"name" : "DimPlayType",
       
"isLoad" : "true",
       
"dimensionStgSources" :
        [
          {
"moniker" : "PlayByPlay" }
        ],
       
"schema" :
        [
          {
"colName" : "PlayTypeKey","colType" : "Long", "isKey" : "true", "isSurrogateKey" : "true", "unknownValue" : "0" },
          {
"colName" : "Formation","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "PlayType","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "PassType","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "PenaltyType","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" }
        ]
      },
      {
       
"name" : "DimPlayTime",
       
"isLoad" : "true",
       
"dimensionStgSources" :
        [
          {
"moniker" : "PlayByPlay" }
        ],
       
"schema" :
        [
          {
"colName" : "PlayTimeKey","colType" : "Long", "isKey" : "true", "isSurrogateKey" : "true", "unknownValue" : "0" },
          {
"colName" : "Quarter","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "Minute","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "Second","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" }
        ]
      },
      {
       
"name" : "DimPlayDetails",
       
"isLoad" : "true",
       
"dimensionStgSources" :
        [
          {
"moniker" : "PlayByPlay" }
        ],
       
"schema" :
        [
          {
"colName" : "PlayDetailsKey","colType" : "Long", "isKey" : "true", "isSurrogateKey" : "true", "unknownValue" : "0" },
          {
"colName" : "IsRush","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsPass","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsIncomplete","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsTouchdown","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsSack","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsChallenge","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsChallengeReversed","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsMeasurement","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsInterception","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsFumble","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsPenalty","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsTwoPointConversion","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsTwoPointConversionSuccessful","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsPenaltyAccepted","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "IsNoPlay","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" }
        ]
      },
      {
       
"name" : "DimTeam",
       
"isLoad" : "true",
       
"dimensionStgSources" :
        [
          {
"moniker" : "Teams" }
        ],
       
"schema" :
        [
          {
"colName" : "TeamKey","colType" : "Long", "isKey" : "true", "isSurrogateKey" : "true", "unknownValue" : "0" },
          {
"colName" : "TeamOriginalName","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" },
          {
"colName" : "TeamName","colType" : "String",  "isNaturalKey" : "false", "unknownValue" : "" },
          {
"colName" : "TeamAbbrevName","colType" : "String", "unknownValue" : "" },
          {
"colName" : "TeamConference","colType" : "String",  "unknownValue" : "" },
          {
"colName" : "HistTeamName","colType" : "String", "isTypeTwo" : "true", "unknownValue" : "" },
          {
"colName" : "HistTeamAbbrevName","colType" : "String", "isTypeTwo" : "true", "unknownValue" : "" }
        ]
      },
      {
       
"name" : "DimTeamRole",
       
"__comment" : "This is static dimension. Its population is not needed in incremental load",
       
"isLoad" : "true",
       
"dimensionStgSources" :
        [
        ],
       
"schema" :
        [
          {
"colName" : "TeamRoleKey","colType" : "Integer", "isKey" : "true", "isSurrogateKey" : "false", "unknownValue" : "0" },
          {
"colName" : "TeamRole","colType" : "String",  "isNaturalKey" : "true", "unknownValue" : "" }
        ]
      }
    ],
   
"dataMart" :
    {
     
"packageName" : "com.dbtimes.nfldw.DataWarehouseNFL.DataMartNFLGames",
     
"factNamePattern" : "(?<factName>Fact[A-Z][a-zA-Z0-9]+)",
     
"factsDestinationFile" :
      {
       
"parquet" : {},
       
"__path__comment" : "text 'FactNamePlaceholder' in path must match the factNamePattern above. Text 'FactNamePlaceholder' will be replaced with actual fact table name which also has to match the pattern",
       
"path" : "/dw/data/dm/factnameplaceholder.parquet",
       
"pathWithKeySetters" : "/dw/data/etl/factnameplaceholder.withkeysetters.parquet"
     
}
    },
   
"facts" :
    [
      {
"name" : "FactPlays",
       
"isLoad" : "true",
       
"processingMode" : "REPLACE_PARTITION",
        
"factStgSources" :
        [
          {
"moniker" : "PlayByPlay" }
        ],
       
"schema" :
        [
          {
"colName" : "SeasonYear", "colType" : "Integer", "isForPartition" : "true" },
          {
"colName" : "DataSourceMoniker", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "DataSourceKey", "underlyingDim" : "DimDataSource", "underlyingDimCol" : "DataSourceMoniker" },
          {
"colName" : "GameId", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "GameKey", "underlyingDim" : "DimGame", "underlyingDimCol" : "GameId" },
          {
"colName" : "PlayDate", "colType" : "Date", "isForSettingForeignKey" : "true", "isEffDateForTypeTwo" : "true", "foreignKey" : "PlayDateKey", "underlyingDim" : "DimDate", "underlyingDimCol" : "Date" },
          {
"colName" : "OffenseTeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "OffenseTeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },
          {
"colName" : "DefenseTeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "DefenseTeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },
          {
"colName" : "PenaltyTeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PenaltyTeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },
          {
"colName" : "TimeoutTeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "TimeoutTeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },
          {
"colName" : "Formation", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "Formation" },
          {
"colName" : "PlayType", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "PlayType" },
          {
"colName" : "PassType", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "PassType" },
          {
"colName" : "PenaltyType", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "PenaltyType" },
          {
"colName" : "Down", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "FieldPositionKey", "underlyingDim" : "DimFieldPosition", "underlyingDimCol" : "Down" },
          {
"colName" : "YardLine", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "FieldPositionKey", "underlyingDim" : "DimFieldPosition", "underlyingDimCol" : "YardLine" },
          {
"colName" : "RushDirection", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "FieldPositionKey", "underlyingDim" : "DimFieldPosition", "underlyingDimCol" : "RushDirection" },
          {
"colName" : "YardLineFixed", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "FieldPositionKey", "underlyingDim" : "DimFieldPosition", "underlyingDimCol" : "YardLineFixed" },
          {
"colName" : "YardLineDirection", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "FieldPositionKey", "underlyingDim" : "DimFieldPosition", "underlyingDimCol" : "YardLineDirection" },
          {
"colName" : "Quarter", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTimeKey", "underlyingDim" : "DimPlayTime", "underlyingDimCol" : "Quarter" },
          {
"colName" : "Minute", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTimeKey", "underlyingDim" : "DimPlayTime", "underlyingDimCol" : "Minute" },
          {
"colName" : "Second", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTimeKey", "underlyingDim" : "DimPlayTime", "underlyingDimCol" : "Second" },
          {
"colName" : "IsRush", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsRush" },
          {
"colName" : "IsPass", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsPass" },
          {
"colName" : "IsIncomplete", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsIncomplete" },
          {
"colName" : "IsTouchdown", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsTouchdown" },
          {
"colName" : "IsSack", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsSack" },
          {
"colName" : "IsChallenge", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsChallenge" },
          {
"colName" : "IsChallengeReversed", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsChallengeReversed" },
          {
"colName" : "IsMeasurement", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsMeasurement" },
          {
"colName" : "IsInterception", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsInterception" },
          {
"colName" : "IsFumble", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsFumble" },
          {
"colName" : "IsPenalty", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsPenalty" },
          {
"colName" : "IsTwoPointConversion", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsTwoPointConversion" },
          {
"colName" : "IsTwoPointConversionSuccessful", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsTwoPointConversionSuccessful" },
          {
"colName" : "IsPenaltyAccepted", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsPenaltyAccepted" },
          {
"colName" : "IsNoPlay", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayDetailsKey", "underlyingDim" : "DimPlayDetails", "underlyingDimCol" : "IsNoPlay" },
          {
"colName" : "Score", "colType" : "Double", "isMeasure" : "true" },
          {
"colName" : "Yards", "colType" : "Double",  "isMeasure" : "true" },
          {
"colName" : "ToGo", "colType" : "Double",  "isMeasure" : "true" },
          {
"colName" : "PenaltyYards", "colType" : "Double",  "isMeasure" : "true" }
        ]
      },
      {
"name" : "FactBridgeTeamsPlays",
       
"isLoad" : "true",
       
"processingMode" : "REPLACE",
       
"factStgSources" :
        [
          {
"moniker" : "PlayByPlay" }
        ],
       
"schema" :
        [
          {
"colName" : "DataSourceMoniker", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "DataSourceKey", "underlyingDim" : "DimDataSource", "underlyingDimCol" : "DataSourceMoniker" },
          {
"colName" : "GameId", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "GameKey", "underlyingDim" : "DimGame", "underlyingDimCol" : "GameId" },
          {
"colName" : "PlayDate", "colType" : "Date", "isForSettingForeignKey" : "true", "isEffDateForTypeTwo" : "true", "foreignKey" : "PlayDateKey", "underlyingDim" : "DimDate", "underlyingDimCol" : "Date" },
          {
"colName" : "TeamAbbrevName", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "TeamKey", "underlyingDim" : "DimTeam", "underlyingDimCol" : "HistTeamAbbrevName" },
          {
"colName" : "TeamRole","colType" : "String",  "isForSettingForeignKey" : "true", "foreignKey" : "TeamRoleKey", "underlyingDim" : "DimTeamRole", "underlyingDimCol" : "TeamRole" },
          {
"colName" : "Formation", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "Formation" },
          {
"colName" : "PlayType", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "PlayType" },
          {
"colName" : "PassType", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "PassType" },
          {
"colName" : "PenaltyType", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTypeKey", "underlyingDim" : "DimPlayType", "underlyingDimCol" : "PenaltyType" },
          {
"colName" : "Down", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "FieldPositionKey", "underlyingDim" : "DimFieldPosition", "underlyingDimCol" : "Down" },
          {
"colName" : "YardLine", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "FieldPositionKey", "underlyingDim" : "DimFieldPosition", "underlyingDimCol" : "YardLine" },
          {
"colName" : "RushDirection", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "FieldPositionKey", "underlyingDim" : "DimFieldPosition", "underlyingDimCol" : "RushDirection" },
          {
"colName" : "YardLineFixed", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "FieldPositionKey", "underlyingDim" : "DimFieldPosition", "underlyingDimCol" : "YardLineFixed" },
          {
"colName" : "YardLineDirection", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "FieldPositionKey", "underlyingDim" : "DimFieldPosition", "underlyingDimCol" : "YardLineDirection" },
          {
"colName" : "Quarter", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTimeKey", "underlyingDim" : "DimPlayTime", "underlyingDimCol" : "Quarter" },
          {
"colName" : "Minute", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTimeKey", "underlyingDim" : "DimPlayTime", "underlyingDimCol" : "Minute" },
          {
"colName" : "Second", "colType" : "String", "isForSettingForeignKey" : "true", "foreignKey" : "PlayTimeKey", "underlyingDim" : "DimPlayTime", "underlyingDimCol" : "Second" }
        ]
      }
    ]
  }
}

Sample dimension data

Spark configuration

Scenarios and solutions

Using previous version of a dimension to restore a field value on initial load

In some cases, for example to preserve a value of a field on initial load, the previous version of a dimension can be used. The previous version of a dimension is a standard feature of a library. To use previous version of a dimension specify it as a source in stgSources configuration section and then reference it in dimStgSources

Enable/disable Debug flag without creating separate configuration file

Use “Overriding Configuration on Command Line  12” feature