Sunday, March 06, 2011

Quick and Dirty Unpivoting with SQL Server’s XML Features

Let’s suppose you want to return some data in the following extremely unpivoted format:
SELECT  ID
   ,    ColumnName
   ,    Value
FROM    .. something ..

Your source table or query has (let’s say) about 30 different columns.  How can you quickly produce something in the format above?


Well, one way is a bunch of UNION statements, like so (using the AdventureWorks database):

select  ID = ProductId
    ,   ColumnName = 'Name'
    ,   Value = Name
from    Production.Product
union
select  ID = ProductId
    ,   ColumnName = 'ProductNumber'
    ,   Value = ProductNumber
from    Production.Product    
union
select  ID = ProductId
    ,   ColumnName = 'MakeFlag' 
    ,   Value = CONVERT(char(1), MakeFlag)
from    Production.Product

This works fine, if you have only a few columns you want to return in this format.  If you have 20-30, as with the Production.Product table, that’s a lot of UNION statements.  And copying the code to use for another table is basically a waste of time:  you’ll have to change so many things you might as well just start typing from scratch.


Another way is using the UNPIVOT statement.  It produces the same sort of result, and looks like this:

select  *
from    (
    Select  ProductId, 
            ProductLine, 
            Class, 
            Style 
    from Production.Product) as pv
UNPIVOT (Value for FieldName IN (ProductLine, Class, Style)) p

But all the fields you want to unpivot need to be of the exact same type.  For strings, that means the same type AND the same length.  (The ProductLine, Class, and Style fields above are are nchar(2)).  Plus, you need to list all the fields to which you want the UNPIVOT to apply.  If you have 20 or 30 fields, that’s a lot of fields to list.


So, what we need is a way to produce a list of this form, that can adapt to differing column types, can be applied to all fields or some fields in a table with a minimum of column-name-typing, and which can be copied-and-repurposed without requiring a complete retype.


Here’s one way to do it, using XML.  In terms of performance, it is the slowest of the three options, but in terms of programmer time, it’s definitely the fastest and most versatile.

with A
as (
    --  We're going to return the product ID, plus an XML version of the 
    --  entire record.
 select  ProductId
     ,   (
         Select  * 
         from    Production.Product 
         where   ProductID = pp.ProductId
         for xml auto, type) as X
 from    Production.Product pp
 )
, B
as (
    --  We're going to run an Xml query against the XML field, and transform it
    --  into a series of name-value pairs.  But X2 will still be a single XML
    --  field, associated with this product ID.
    select  ProductId
        ,   X.query(
         'for $f in Production.Product/@*
          return 
        <product name="{ local-name($f) }" value="{ data($f) }" />
      ') as X2
 from A
 )
, C
as (
    --  We're going to run the Nodes function against the X2 field, splitting 
    --  our list of "product" elements into individual nodes.  We will then use
    -- the Value function to extract the name and value.  
 select B.ProductId as ID
  , norm.product.value('@name', 'varchar(max)') as Name
  , norm.product.value('@value', 'varchar(max)') as Value
from B
 cross apply B.X2.nodes('/product') as norm(product)
)
-- Select our results.
select *
from C

Note a couple of things:


1.  We’re using (and returning) all the fields from Production.Product, but we didn’t have to list the field names.


2.  We’re using the term “Product” in C and referring to the XML node Production.Product, but those names are arbitrary:  we could have formed the XML in step A in such a way that it had a generic name.


3.  If you wanted to return fewer columns, you have a couple of choices:  you can specify the particular columns you want in the subquery in step A, or you could filter them out at the final step.  (Since the column names are in the field Name, you could even join to sys.all_columns or some other list of column names, so as to filter by type or some other criteria.) 


4.  We are returning the Value column as varchar(max); this is because anything can be expressed as a varchar.  But if we were returning, say, all ints, we could specify in the value ()function that the data should be returned as an type int. 


5.  Nulls are automatically filtered out.


Here’s what the results look like, by the way:
Capture Normalization
Not bad, I think.

0 Comments:

Post a Comment

<< Home