Read CSV Files with double quotes and comma into SQL DB

2012-6-12 13:04| 发布者: Test| 查看: 3008| 评论: 0|原作者: Brad Schulz|来自: SELECT Blog FROM Brad.Schulz

摘要: In this post, I’d like to talk about Comma-Separated Values in Files (i.e. CSV Files).I had a client with a need to read in a CSV file and do some subsequent processing with it. This was a very small ...

Use OPENROWSET(BULK)

One approach is to use OPENROWSET(BULK) instead. If we create a complex XML format file that describes our CSV File contents like so (without the /* … */ of course)…

/*
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" 
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="10"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="," MAX_LENGTH="20"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="20"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="TerritoryID" xsi:type="SQLINT"/>
    <COLUMN SOURCE="2" NAME="TotalSales" xsi:type="SQLDECIMAL"/>
    <COLUMN SOURCE="3" NAME="TotalCost" xsi:type="SQLDECIMAL"/>
  </ROW>
</BCPFORMAT>
*/

…then we can reference that format file as a parameter in a call to OPENROWSET(BULK):

select TerritoryID
      ,TotalSales
      ,TotalCost
from openrowset(bulk 'C:\CSV Files\Territory Data.CSV'
               ,formatfile='C:\CSV Files\Territory Data.FMT'
               ,firstrow=2) CSVInput
/*
TerritoryID  TotalSales   TotalCost
----------- ----------- -----------
          1  5767341.98  4746522.45
          2  3857163.63  3174445.67
          3  4677108.27  3849260.11
          4  8351296.74  6873117.22
          5  2851419.04  2346717.87
          6  6917270.88  5692913.94
          7  3899045.69  3208914.61
          8  2481039.18  2041895.24
          9  1977474.81  1627461.77
         10  3514865.91  2892734.64
*/

Note that we had to specify FIRSTROW=2 to skip over the header row in the CSV file that had the column names.

This approach works fine, but it’s kind of a pain to have to set up the format file. For any new CSV File that we wanted to process, we would have to create one of these format files. No thanks.

Another approach… much easier… is to use BULK INSERT, like so:

if object_id('tempdb..#TerrData','U') is not null drop table #TerrData
create table #TerrData 
(
   TerritoryID int
  ,TotalSales decimal(12,2)
  ,TotalCost decimal(12,2)
)
  
bulk insert #TerrData 
from 'C:\CSV Files\Territory Data.CSV'
with (fieldterminator=','
     ,rowterminator='\n'
     ,firstrow=2)
    
select * from #TerrData
/*
TerritoryID  TotalSales   TotalCost
----------- ----------- -----------
          1  5767341.98  4746522.45
          2  3857163.63  3174445.67
          3  4677108.27  3849260.11
          4  8351296.74  6873117.22
          5  2851419.04  2346717.87
          6  6917270.88  5692913.94
          7  3899045.69  3208914.61
          8  2481039.18  2041895.24
          9  1977474.81  1627461.77
         10  3514865.91  2892734.64
*/

It can’t get much more simple than that!

But, of course, there’s a catch…

The BULK INSERT and the OPENROWSET(BULK) approaches simply won’t work in certain cases. Let’s say we had data in Excel that looked like this:

If we save that as a CSV file, its contents would look like this (without the /* … */):

/*
CustomerID,CompanyName,Street
1,John's Lovely Meat Pies,"2838 ""A"" Street"
2,"Candies ""R"" Us",167 Broad Ave
3,"Fork & Spoon, Inc",28 Grover Blvd
4,"""Cakes,"" She Said","732 Arch St, Suite ""D"""
5,Soup Is On Me,93573 Vista Ave
*/

You can see that any string column that contained a double-quote character or a comma is surrounded (or delimited) with double-quote characters… and the embedded double-quote characters are doubled up.

If we try to read that in, we certainly don’t get what we want:

if object_id('tempdb..#CustData','U') is not null drop table #CustData
create table #CustData 
(
   CustomerID int
  ,CompanyName varchar(40)
  ,Street varchar(40)
)
  
bulk insert #CustData 
from 'C:\CSV Files\Customer Data.CSV'
with (fieldterminator=','
     ,rowterminator='\n'
     ,firstrow=2)
    
select * from #CustData
/*
CustomerID CompanyName             Street
---------- ----------------------- ---------------------------------------
         1 John's Lovely Meat Pies "2838 ""A"" Street"
         2 "Candies ""R"" Us"      167 Broad Ave
         3 "Fork & Spoon            Inc",28 Grover Blvd
         4 """Cakes                "" She Said","732 Arch St, Suite ""D"""
         5 Soup Is On Me           93573 Vista Ave
*/

We get the surrounding double-quotes and the embedded double double-quotes as part of our data… and the commas within our string columns are interpreted as being column separators.

Yuk!

I guess this is why tab-delimited files were invented. But, interestingly enough, Excel insists on doing the same double-quote delimiting when you save data as a tab-delimited file as well.

So what now?

123 4 5 / 5 页下一页