sql-data-profiler
Generate sql queries to describe and explore sql data sets
Last updated 5 years ago by pcothenet .
MIT · Repository · Bugs · Original npm · Tarball · package.json
$ cnpm install sql-data-profiler 
SYNC missed versions from official npm registry.

npm version Build Status

sql-data-profiler

sql-data-profiler is a utility module that generates sql code to profile data sets (eg. tables) in Redshift.

Installation

npm install sql-data-profiler --save

Quick start

var sql_data_profiler = require('sql-data-profiler');

Table profiler

Generates a sql statement that provides basic stats on the data in the "contacts" table.

var data_profiler = sql_data_profiler.data_profiler;

var options = {
	target_table: 'contacts',
	target_columns: ['email', 'a_industry']
};
var sql_code = data_profiler(options);

which accepts the following options

  • target_table
  • target_columns
  • results_table
  • calculate_frequency
  • use_perm_table
  • truncate_table

Distribution analysis

** What does this do?**

var distribution_analyzer = sql_data_profiler.distribution_analyzer;

var sql_code = distribution_analyzer({
	target_table: 'contacts',
	target_variable: {
		data_set: 'SELECT DISTINCT contact_id FROM events WHERE meta_event = \'conversion\'',
		join: {
			source_column: 'contact_id',
			target_column: 'contact_id'
		}
	},
	target_columns: [
		{
			name: 'a_name',
			max_number_of_values: 10,
		},
		{
			name: 'a_gender',
			max_number_of_values: 10
		},
		{
			name: 'a_address',
			max_number_of_values: 10,
			transformation: 'has_any_value'
		}
	]
});

Table stats

The following stats are calculated for each column.

Stats Description
count_total number of records in the table
count_not_null number of records where the value for the specified column is not null
fill_rate number of non-null values divided by number of records
count_distinct number of distinct values
dupe_rate number of distinct values divided by number of records where the value for the specified column is not null
maximum_value
minimum_value
most_frequent_value_1
most_frequent_value_1_frequency
most_frequent_value_2
most_frequent_value_2
most_frequent_value_3
most_frequent_value_3

TO DO

  • handle different data type (eg. boolean)
  • performance improvement

Current Tags

  • 0.2.6                                ...           latest (5 years ago)

9 Versions

  • 0.2.6                                ...           5 years ago
  • 0.2.5                                ...           5 years ago
  • 0.2.4                                ...           5 years ago
  • 0.2.3                                ...           5 years ago
  • 0.2.2                                ...           5 years ago
  • 0.2.1                                ...           5 years ago
  • 0.1.2                                ...           5 years ago
  • 0.1.1                                ...           5 years ago
  • 0.1.0                                ...           5 years ago
Maintainers (2)
Downloads
Today 0
This Week 0
This Month 0
Last Day 0
Last Week 0
Last Month 0
Dependencies (4)
Dev Dependencies (1)
Dependents (0)
None

Copyright 2014 - 2016 © taobao.org |