[Repost] Managing Kettle job configuration

Introduction:

Over time I've grown a habit of making a configuration file for my Kettle jobs. This is especially useful if you have a reusable job, where the same work has to be done against different conditions. A simple example where I found this useful is when you have separate development, testing and production environments: when you're done developing your job, you transfer the .kjb file (and its dependencies) to the testing environment. This is the easy part. But the job still has to run within the new environment, against different database connections, web service URLs and file system paths.

Variables

In the past, much has been written about using Kettle variables, parameters and arguments. Variables are the basic feature that provides the mechanism to configure transformation steps and job entries: instead of using literal configuration values, you use a variable reference, as illustrated below. This way, you can initialize all variables to whatever values are appropriate at that time, and for that environment. Today, I don't want to discuss variables and variable references as such - instead I'm just focusing on how to manage the configuration once you already use variable references inside your jobs and transformations.
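For instance, here is a hypothetical illustration (using the variable names from the config.properties snippet shown further below) of a database connection configured through variable references instead of literals:

Host name:     ${STAGING_HOST}
Port number:   ${STAGING_PORT}
Database name: ${STAGING_DATABASE}
User name:     ${STAGING_USER}
Password:      ${STAGING_PASSWORD}

At runtime, Kettle substitutes whatever values those variables hold in the current environment.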

Managing configuration

To manage the configuration, I typically start the main job with a set-variables.ktr transformation. This transformation reads configuration data from a config.properties file and assigns it to variables so any subsequent jobs and transformations can access the configuration data through variable references. The main job has one parameter called ${CONFIG_DIR} which has to be set by the caller so the set-variables.ktr transformation knows where to look for its config.properties file.
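For example, the file name in the "Property Input" step can then be derived from that parameter (assuming, as Kettle input steps generally allow, that variable references may appear in file paths):

${CONFIG_DIR}/config.properties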


Reading configuration properties

The config.properties file is just a list of key/value pairs separated by an equals sign. Each key represents a variable name, and the value is the value that should be assigned to that variable. The following snippet should give you an idea:

#staging database connection
STAGING_DATABASE=staging
STAGING_HOST=localhost
STAGING_PORT=3351
STAGING_USER=staging
STAGING_PASSWORD=$74g!n9

The set-variables.ktr transformation reads it using a "Property Input" step, and this yields a stream of key/value pairs:
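For the config.properties snippet above, that stream would look roughly like this, one row per pair (the step emits two string fields, named "Key" and "Value" here):

Key                Value
STAGING_DATABASE   staging
STAGING_HOST       localhost
STAGING_PORT       3351
STAGING_USER       staging
STAGING_PASSWORD   $74g!n9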


Pivoting key/value pairs to use the "Set variables" step

In the past, I used to set the variables using the "Set variables" step. This step works by creating a variable from a selected field in the incoming stream and assigning that field's value to it. This means that you can't just feed the stream of key/value pairs from the "Property Input" step into the "Set variables" step: that stream contains multiple rows with just two fields, called "Key" and "Value". Feeding it directly into the "Set variables" step would just create two variables called Key and Value, and assign them new values over and over, once for each key/value pair in the stream. So in order to meaningfully assign variables, I used to pivot the stream of key/value pairs into a single row having one field for each key in the stream, using the "Row Denormaliser" step:

  • Key field: the value of this field is scanned to determine in which output field to put the corresponding value.

  • Group fields: none. We want all key/value pairs to end up in one big row; put another way, there is just one group comprising all key/value pairs.

  • Target fields: the grid specifies, for each distinct value of the "Key" field, the output field name it should be mapped to, and in all cases we want the value of the "Value" field to be stored in those fields, as in the sketch below.
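As a sketch (using the variable names from the config.properties snippet above; the exact column labels of the "Row Denormaliser" dialog may differ), the target-fields grid maps each distinct key to an output field of the same name:

Key value          Target fieldname    Value fieldname
STAGING_DATABASE   STAGING_DATABASE    Value
STAGING_HOST       STAGING_HOST        Value
STAGING_PORT       STAGING_PORT        Value
STAGING_USER       STAGING_USER        Value
STAGING_PASSWORD   STAGING_PASSWORD    Value

After denormalisation, the stream is a single row with one field per key, which can then safely feed the "Set variables" step.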


Drawbacks

There are two important drawbacks to this approach:

  • The "Row Normaliser" uses the value of the keys to map the value to a new field. This means that we have to type the name of each and every variable appearing in the config.properties file. So you manually need to keep he config.propeties and the "Denormaliser" synchronized, and in practice it's very easy to make mistakes here.

  • Due to the fact that the "Row Denormaliser" step literally needs to know all variables, the set-variables.ktr transformation becomes specific for just one particular project.

Given these drawbacks, I seriously started to question the usefulness of a separate configuration file: because the set-variables.ktr transformation has to know all variable names anyway, I was tempted to store the configuration values themselves inside the transformation as well (using a "Generate rows" or "Data grid" step or something like that), and "simply" make a new set-variables.ktr transformation for every environment. Of course, that didn't feel right either.


Solution: JavaScript

As it turns out, there is in fact a very simple solution that solves all of these problems: don't use the "Set variables" step for this kind of problem! We still need to set the variables of course, but we can conveniently do this using a JavaScript step. The new set-variables.ktr transformation is now simply the "Property Input" step feeding its key/value rows into a JavaScript step.

The actual variable assignment is done with Kettle's built-in setVariable(key, value, scope) function. The "Key" and "Value" fields from the incoming stream are passed as the key and value arguments of the setVariable() function. The third argument is a string that identifies the scope of the variable, and must have one of the following values:

  • "s" - system-wide

  • "r" - up to the root

  • "p" - up to the parent job of this transormation

  • "g" - up to the grandparent job of this transormation

For my purpose, I settle for "r".
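With that decided, the script body of the JavaScript step reduces to a single call - a minimal sketch, assuming the incoming fields are named "Key" and "Value" as produced by the "Property Input" step:

// Runs once for every incoming row: create (or overwrite) a variable
// named after this row's Key, assign it this row's Value, and make it
// visible up to the root job ("r" scope).
setVariable(Key, Value, "r");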

The bonus is that this set-variables.ktr transformation is less complex than the previous one, and it is now completely independent of the content of the configuration: it has become a reusable transformation that you can use over and over.

------------------------------------------------------------------------------------------------------

Some people argue: why not configure these variables in kettle.properties (located in the .kettle directory under the user's home directory)? That approach can achieve the same goal, but it is not a good one: when the Kettle job (.kjb) or transformation (.ktr) files are moved to another environment (dev, test, alpha, beta, prod...), you need to edit that file on every machine, and you need to do so again whenever one or more properties change.

See also: http://wiki.pentaho.com/display/EAI/Expose+key-value+pairs+from+a+.properties+file+as+variables
