Sunday 22 November 2009

SQL Bits - R2 day - Parallel Data Warehouse

Went to my first SQL Bits last Friday (given that it was practically in my backyard, how could I not?) and I have to say that for a not-for-profit, community-organised event it was easily as good as those organised by DevExpress, Microsoft et al. Leaving aside the opportunity to catch up with ex-colleagues and do a bit of networking, there were some excellent sessions. Given that this was 'R2 day', these were almost entirely devoted to the next release of SQL Server. Hence I attended sessions on Gemini (now PowerPivot), Madison (now Parallel Data Warehouse) and R2 Reporting Services.

Today I want to report on the Parallel Data Warehouse (PDW) session given by Thomas Kejser: a good, informative session that only lacked a demo for obvious reasons (although I was told they are working on that). Any inaccuracies are my own, however.

Firstly, Thomas went through the limitations of the usual SMP (think traditional big-iron) architecture. At around 10TB, designing the data warehouse database requires significant additional customisation, incurring extra administration and development cost. On top of that, moving beyond a 64-core SMP architecture starts becoming prohibitively expensive in terms of the number of additional cores gained for each pound spent (i.e. doubling the core count costs more than twice as much). None of this is, of course, news for those who operate at these sorts of scales.

So PDW is a true MPP solution, operating across multiple cores in multiple nodes. It comes as a pre-built appliance - but one made from standard commodity hardware from Dell/EMC, IBM, HP or Bull, the nodes being nothing more than standard servers or disk arrays. All of the cabling, power, disk configuration and software installation is already done, and in theory it would even be possible to mix racks from different hardware vendors. An appliance consists of one or more racks, each with 64 cores - the largest deployment at the moment being 5 racks, according to Thomas.

An appliance has two control nodes, a management node, a backup node, and a series of compute and storage nodes. It also has one 'spare' node per rack, in case of a whole node failure. There is also a landing zone for ETL tools to deposit text files, and a specialised data movement service to move data between nodes where necessary.

As in other MPP architectures, designers have to ensure that data is evenly spread across nodes by choosing atomic, evenly distributed columns as the distribution key. Datetime columns are often good candidates. Tables are created with the following syntax:


CREATE TABLE [table name] AS SELECT [select list]
WITH
(
    CLUSTER_ON [column name]
    DISTRIBUTE_ON [column name] | REPLICATE
    PARTITION_ON [column name]
)


As the keyword DISTRIBUTE_ON suggests, it identifies the column(s) used to distribute the data across the nodes. It's worth pointing out at this point that, like most other MPP products, PDW will only recognise a subset of T-SQL.
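
To make that concrete, here's a rough sketch of how the template might look in practice. The table and column names (FactSales, DimStore, SaleDate and so on) are my own invention, and the exact keywords may well change before release - but the idea is a large fact table distributed across the compute nodes on a date column, and a small dimension table replicated in full to every node (the REPLICATE option) so that joins to it never need to move data between nodes:

-- Hypothetical names and pre-release syntax, purely for illustration
-- Large fact table: spread the rows across the compute nodes by SaleDate
CREATE TABLE FactSales AS SELECT SaleDate, StoreKey, SalesAmount FROM StagingSales
WITH
(
    CLUSTER_ON SaleDate
    DISTRIBUTE_ON SaleDate
    PARTITION_ON SaleDate
)

-- Small dimension table: keep a full copy on every compute node
CREATE TABLE DimStore AS SELECT StoreKey, StoreName FROM StagingStore
WITH
(
    REPLICATE
)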

The control node receives queries from clients, breaks each query down and issues the pieces to the compute nodes based on the distribution of the data. Only one control node can be active at a time. Compute nodes work on their slice of the query, aggregating it and sending the results back to the control node; each compute node therefore only has to work on a small part of the total query. As the amount of data grows, additional racks can be added to increase the number of nodes and maintain or increase the performance of the appliance.
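
As an illustration (again using the invented FactSales table from the sketch above), the kind of aggregate query that parallelises well might be:

-- Each compute node scans only its own slice of FactSales and produces
-- partial sums per StoreKey; the control node merges these small partial
-- results and returns them to the client
SELECT StoreKey, SUM(SalesAmount) AS TotalSales
FROM FactSales
GROUP BY StoreKey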

The key to the appliance is the compute nodes, roughly analogous to Netezza SPUs, although unlike those a compute node is simply a standard commodity server running SQL Server 2008. The version of 2008 on the compute node is claimed to be entirely standard - and although the hardware configuration is optimised, it is only the sort of thing you could do yourself: each core has its own dedicated chunk of disk, with tempdb separated out.

Thomas was asked about the fact that all queries have to pass through one control node, the suspicion being that it could form a bottleneck. Thomas countered this by saying that most queries would never touch the disk of the control node and would be streamed through in memory. Aggregations carried out on the control node would be relatively trivial, as the compute nodes would pre-aggregate the data first.

Thomas also suggested some ETL tips (that work on 'normal' SQL) which I'll post later on. There were also intriguing possibilities raised around Analysis Services using PDW in ROLAP mode - leveraging the more powerful MDX query language whilst utilising the raw performance of PDW. MS have already posited PDW as the centre of a hub-and-spoke model.

So that's an overview of Madison/PDW - hopefully it's of interest. We were told 8-9 companies have it in production and it's scheduled for a first-half-of-2010 release. After the summer we should start hearing more about customer experiences and the future roadmap. Pricing is anticipated to be cheaper than Netezza on a per-TB basis, and it's likely that MS will discount the list price.

Tuesday 10 November 2009

November SQL Server 2008 R2 CTP out

Thanks to Jamie for posting about this yesterday. The November SQL Server 2008 R2 CTP is out for TechNet and MSDN subscribers now, and for everybody else tomorrow. Get the bits here.

Wednesday 4 November 2009

More SQL 2008 R2 news

Gemini (posted about previously) has been renamed 'PowerPivot', or more precisely 'PowerPivot for Excel 2010', continuing MS's penchant for long-winded product names. I've had a play with CTP2 and it pretty much works as advertised - giving the ability to slice and dice large, disconnected datasets with ease on modest hardware, with massive levels of data compression. There are also options to combine it with non-data-warehouse type data, such as industry trend analyses or high-level reports retrieved from outside organisations in Excel-ready formats. It also includes some useful GUI tricks pinched from QlikView, such as greying out slicers which have no data. Unfortunately CTP2 didn't include a connector for Analysis Services, so I await that with bated breath in the November CTP.

Also in the news was the renaming of Madison (also mentioned previously) to - wait for it - 'SQL Server 2008 R2 Parallel Data Warehouse', confirmed as coming out with R2 in the first half of next year. This is great news as it answers the frequent criticism that 'SQL doesn't scale'. Not only that, but unlike many other warehouse appliance providers (such as Netezza and Teradata) you have the choice of commodity hardware from 5 different vendors, and unlike Greenplum (a hardware-independent appliance vendor built on open-source PostgreSQL) MS is hardly a tiny company that few have heard of. More information on the sort of architecture MS propose for their Parallel Data Warehouse can be found here.