Import data from AWS S3 to AWS Aurora

This post describes an easy way of importing a CSV-file saved in Amazon S3 to a table in Amazon Aurora.

Technologies used in this post:

  • Amazon Aurora is a cloud-based relation database which is compatible with both MySQL and PostgreSQL
  • Amazon S3 is a cloud-based object storage

To get started we need to download a sample CSV file. I this case I will use this dataset: https://www.stats.govt.nz/assets/Uploads/Births-by-statistical-area-2-and-area-unit-for-comparison.csv

First you need to upload this file to an S3 Bucket

Then connect to your Aurora database and create a new table

create table s3_import_test
(
    geography_type varchar(100),
    gCode varchar(100),
    gDescription varchar(100),
    birth_2013 int,
    birth_2014 int,
    birth_2015 int,
    birth_2016 int,
    birth_2017 int,
    inserted_date DATETIME
)

To import the data we will use this statement

LOAD DATA FROM S3 's3://your-S3-url/Births-by-statistical-area-2-and-area-unit-for-comparison.csv'

INTO TABLE s3_import_test
CHARACTER SET UTF8
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
ESCAPED BY ''
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(
    geography_type ,
    gCode ,
    gDescription,
    birth_2013 ,
    birth_2014 ,
    birth_2015 ,
    birth_2016 ,
    birth_2017 ,
    @inserted_date
)

SET inserted_date = CURRENT_TIMESTAMP;

@inserted_date is a variable that will be assigned in the SET-statement at the end. I use variables a lot to call custom functions or to use some of the built-in functionality in Aurora.

And the result will look like this:

Result

You can read more in the documentation here: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.LoadFromS3.html

Advertisements

Print all column names in SSMS (SSMS trick)

Every once in a while I come across some “hidden” and nice features in SSMS. One of them is really nice if you want to print all columns names of a table.

So instead of writing all the column names by hand you could follow these easy steps.

– Open SSMS
– Expand you database
– Expand your table

expand-ssms

And then drag the folder called “Columns” to your query window

all-columns

Using filters in DAX

In this post I will show some samples on different ways to use filtering in DAX. If you write ineffective filters you will most likely always run into performance issues. We will also see the difference between using SUMMARIZE() and ROW() in queries that only return one row.

Lets pretend one of your customers wants to know the sum of sales for the product “AWC Logo Cap”. And the minimum sales.

The benchmarking is done by running the different queries in DAX Studio multiple times. I have cleared cache between the executions.

Sample query 1.

DEFINE
    MEASURE 'FactInternetSales'[MyFilteredNumber] =
        CALCULATE (
            SUM ( FactInternetSales[UnitPrice] ),
            FILTER ( DimProduct, [EnglishProductName] = "AWC Logo Cap" )
        )
    MEASURE 'FactInternetSales'[MyFilteredNumberMin] =
        CALCULATE (
            MIN ( FactInternetSales[UnitPrice] ),
            FILTER ( DimProduct, [EnglishProductName] = "AWC Logo Cap" )
        )
EVALUATE
 (
    FILTER (
        SUMMARIZE (
            FactInternetSales,
            DimProduct[EnglishProductName],
            "Filtered result", 'FactInternetSales'[MyFilteredNumber],
            "Filtered result min", 'FactInternetSales'[MyFilteredNumberMin]
        ),
        'FactInternetSales'[MyFilteredNumber] > 0
    )
)

This query produces the correct answer. But is does not perform very well.

It took 32 MS.

Then I will change my filters to not filter the entire table. Only the column that we are interested in.
This is done by using VALUES(..) on the column name in the filter expression.

Sample query 2.

DEFINE
    MEASURE 'FactInternetSales'[MyFilteredNumber] =
        CALCULATE (
            SUM ( FactInternetSales[UnitPrice] ),
            FILTER (
                VALUES ( DimProduct[EnglishProductName] ),
                [EnglishProductName] = "AWC Logo Cap"
            )
        )
    MEASURE 'FactInternetSales'[MyFilteredNumberMin] =
        CALCULATE (
            MIN ( FactInternetSales[UnitPrice] ),
            FILTER (
                VALUES ( DimProduct[EnglishProductName] ),
                [EnglishProductName] = "AWC Logo Cap"
            )
        )
EVALUATE
 (
    FILTER (
        SUMMARIZE (
            FactInternetSales,
            DimProduct[EnglishProductName],
            "Filtered result", 'FactInternetSales'[MyFilteredNumber],
            "Filtered result min", 'FactInternetSales'[MyFilteredNumberMin]
        ),
        'FactInternetSales'[MyFilteredNumber] > 0
    )
)

The query took now 25 MS to execute

What if I move the filter above SUMMARIZE()?

Sample query 3.

DEFINE
    MEASURE 'FactInternetSales'[MyFilteredNumber] =
        CALCULATE ( SUM ( FactInternetSales[UnitPrice] ) )
    MEASURE 'FactInternetSales'[MyFilteredNumberMin] =
        CALCULATE ( MIN ( FactInternetSales[UnitPrice] ) )
EVALUATE
 (
    FILTER (
        SUMMARIZE (
            FactInternetSales,
            DimProduct[EnglishProductName],
            "Filtered result", 'FactInternetSales'[MyFilteredNumber],
            "Filtered result min", 'FactInternetSales'[MyFilteredNumberMin]
        ),
        'FactInternetSales'[MyFilteredNumber] > 0
            && [EnglishProductName] = "AWC Logo Cap"
    )
)

The query went down to 13 MS

What if I re-write my query to use ROW() instead?

Sample query 4.

EVALUATE
 (
    ROW (
        "Filtered result", CALCULATE (
            SUM ( FactInternetSales[UnitPrice] ),
            FILTER ( DimProduct,  [EnglishProductName] = "AWC Logo Cap" ) 
        ),
        "Count", CALCULATE (
            MIN ( FactInternetSales[UnitPrice] ),
            FILTER ( DimProduct, [EnglishProductName] = "AWC Logo Cap" ) 
        )
    )
 )

Then my query actually went down to 5 MS.

One last modification. Insert VALUES(..) to just apply the filter on the necessary column.

Sample query 5.

EVALUATE
 (
    ROW (
        "Filtered result", CALCULATE (
            SUM ( FactInternetSales[UnitPrice] ),
            FILTER (
                VALUES ( DimProduct[EnglishProductName] ),
                [EnglishProductName] = "AWC Logo Cap"
            )
        ),
        "Count", CALCULATE (
            MIN ( FactInternetSales[UnitPrice] ),
            FILTER (
                VALUES ( DimProduct[EnglishProductName] ),
                [EnglishProductName] = "AWC Logo Cap"
            )
        )
    )
)

And now we went down to about 4 MS.

The server timings for “Sample query 1” looks like this

server-timing-query-1

And the server timings for “Sample query 5” looks like this

server-timing-query-2

Indent on report parameter in SSRS (written in MDX)

In this post I will show you how you can make nice indented report parameters in SSRS. This might become handy if you want to display multiple levels in the same parameter list.

Lets start by writing a simple MDX to list years and week numbers.

WITH MEMBER [RSValue]
AS
[Date].[Production].CurrentMember.UniqueName

MEMBER [RSCaption]
AS
[Date].[Production].CurrentMember.NAME

SELECT {
	[Measures].[RSValue],
	[Measures].[RSCaption]
} ON 0,
DESCENDANTS(
	[Date].[Production].[Year], 
	[Date].[Production].[Week], 
	SELF_AND_BEFORE
) ON 1
FROM [Adventure Works Internet Sales Model]

Running this query will produce a result like this in SSMS

Result 1

To add this result to a parameter I will first have to create a dataset. Let’s call that “MyDate”. Then paste the MDX into it. Then assign the “Available Values” as in the screenshot below.

Result 2

During report preview the report parameter will look like this

Result 3

Not too nice and readable I guess. I really need to make som kind of indenting here. So I start to make some changes to my MDX like this

WITH MEMBER [RSValue]
AS
[Date].[Production].CurrentMember.UniqueName

MEMBER [Indent]
AS
[Date].[Production].CurrentMember.LEVEL_NUMBER - 1

MEMBER [RSCaption]
AS
SPACE([Measures].[Indent] * 5) + " " + [Date].[Production].CurrentMember.NAME

SELECT {
	[Measures].[RSValue],
	[Measures].[Indent],
	[Measures].[RSCaption]
} ON 0,
DESCENDANTS(
	[Date].[Production].[Year], 
	[Date].[Production].[Week], 
	SELF_AND_BEFORE
) ON 1
FROM [Adventure Works Internet Sales Model]

It’s not easy to see the result of this indenting if you run the query in SSMS. But if you replace the query in the dataset we called “MyDate” you will see a much nicer list.

Result 4

Make sure you trim your parameter when setting it against the dataset you want to filter

Result 5

Voila!

Creating a snapshot of your database (SQL)

I am working quite a lot with virtual machines in Hyper-V and VMWare. In these products you have a really useful feature called “snapshot”.

The snapshot feature is most useful when you want to preserve the state of the database so you can return to the same state repeatedly. For example if you are doing a scary update or something you have the possibility to “roll back” these changes.

Did you know that you could make snapshots even in SQL Server?

Lets start by creating a snapshot of the AdventureWorksDW2012 database and create a “super important” table.

-- change database
use [AdventureWorksDW2012]

-- Create a super important table
create table mySuperImportantTable ([ID] int, [Desc] varchar(2000))

-- Insert some values
insert into mySuperImportantTable values (1, 'Secret 1')
insert into mySuperImportantTable values (2, 'Secret 2')
insert into mySuperImportantTable values (3, 'Secret 3')
insert into mySuperImportantTable values (4, 'Secret 4')
insert into mySuperImportantTable values (5, 'Secret 5')

-- Create a snapshot of the database
CREATE DATABASE AdventureWorksDW2012_Mysnapshot ON
(
NAME = AdventureWorksDW2012_Data,
FILENAME = 'C:\db snapshotfile\AdventureWorksDW2012_Mysnapshot.ss' )
AS SNAPSHOT OF [AdventureWorksDW2012];
GO

Then the big mistake happens. We delete the table by accident.

-- Oh no I did a mistake
drop table mySuperImportantTable

Because snapshot files tend to get big I am really curious about the size of it.

Huge file
Ouch… It looks really big 😦 But lets look at the actual disk usage

disk usage

Ahh… only 960 kB. I can live with that.

Okey back to my snapshot. I can write some basic SQL to list all my snapshots.

select * from sys.databases where source_database_id is not null

Lets try to rescue the day by restoring from my snapshot and then get the data from the database

-- I want to restore my database from snapshot
USE MASTER
RESTORE DATABASE AdventureWorksDW2012
FROM DATABASE_SNAPSHOT = 'AdventureWorksDW2012_Mysnapshot'

-- Is my table there?
use AdventureWorksDW2012
select * from mySuperImportantTable
result

Cool. My snapshots saved my day. Then some cleaning. Delete the snapshot and then my sample table.

 -- Drop my snapshot database
DROP DATABASE AdventureWorksDW2012_Mysnapshot ;

-- Drop my table
drop table mySuperImportantTable

See you in Seattle?

Saturday I went from Norway to Seattle with one of my friends. A flight that took over 11 hours. It is good to have a few days up front to shop and so on. The hotel (Hyatt Olive) we are staying at even has a fitness center and swimming pool. So I might even get in shape during this week.

I am really looking forward to SQL Pass that starts on Tuesday. Good speakers and really interesting sessions.

Maybe I will see you there?

Kind regards,
Sindre

Sample queries for KPI in Datazen

When I started working with Datazen I could not find to many samples out there. So to help people getting started I will provide some 🙂

The first DAX query is for creating a KPI in Datazen.

EVALUATE
(
   ROW("Internet total sales", [Internet Total Sales])
)

And with some filtering it might look like something like this

EVALUATE
(
	ROW("Internet total sales", CALCULATE([Internet Total Sales], 
		FILTER('Date', 'Date'[Calendar Year] = 2008)
		)
	)
)

To make a query for trend you will have to use nested summarize. The reason why you have to do this is that you need to have the measure as first column.

A single summarize will produce this result

EVALUATE
(
	SUMMARIZE
	(
		'Date',
		'Date'[Month],
		"Internet Total Sales", 
		CALCULATE([Internet Total Sales], 
			FILTER('Date', 'Date'[Calendar Year] = 2007)
		)
	)
)

The first column is month number and the next column is the measure.

Summarize datazen 1

But we will have to change it so that the measure is the first column.

EVALUATE
(
	SUMMARIZE
	(
		SUMMARIZE
		(
			'Date',
			'Date'[Month],
			"Internet Total Sales", 
			CALCULATE([Internet Total Sales], 
				FILTER('Date', 'Date'[Calendar Year] = 2007)
			)
		),
		[Internet Total Sales],
		'Date'[Month]
	) 
) ORDER BY 'Date'[Month] DESC

And the result

Summarize datazen 2