BC Morning V1806 › 门户 ›IT世界› 应用开发 › 查看内容

Read CSV Files with double quotes and comma into SQL DB

2012-6-12 13:04| 发布者: Test| 查看: 2941| 评论: 0|原作者: Brad Schulz|来自: SELECT Blog FROM Brad.Schulz

摘要: In this post, I’d like to talk about Comma-Separated Values in Files (i.e. CSV Files).I had a client with a need to read in a CSV file and do some subsequent processing with it. This was a very small ...

Directly access a CSV file using T-SQL

In this post, I’d like to talk about Comma-Separated Values in Files (i.e. CSV Files).

I had a client with a need to read in a CSV file and do some subsequent processing with it. This was a very small file that didn’t regularly populate any permanent table… Its data was simply used as input to do some calculations, so using a tool like SSIS to process the file was overkill. We just needed to access the file directly with T-SQL.

Let’s look at an example…

In the C:\Temp folder, we have a CSV File called TerritoryData.CSV with the following contents (Ignore the /*…*/ delimiters… They are there so that sites that syndicate this blog will render the output more reliably):

/*
TerritoryID,TotalSales,TotalCost
1,5767341.98,4746522.45
2,3857163.63,3174445.67
3,4677108.27,3849260.11
4,8351296.74,6873117.22
5,2851419.04,2346717.87
6,6917270.88,5692913.94
7,3899045.69,3208914.61
8,2481039.18,2041895.24
9,1977474.81,1627461.77
10,3514865.91,2892734.64
*/

One method we can use to directly access a CSV file using T-SQL (and the one most recommended on various online forums) is to do something like this:

select TerritoryID
      ,TotalSales
      ,TotalCost
from openrowset('MSDASQL'
               ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)}'
               ,'select * from C:\Temp\TerritoryData.CSV')
/*
TerritoryID  TotalSales   TotalCost
----------- ----------- -----------
          1  5767341.98  4746522.45
          2  3857163.63  3174445.67
          3  4677108.27  3849260.11
          4  8351296.74  6873117.22
          5  2851419.04  2346717.87
          6  6917270.88  5692913.94
          7  3899045.69  3208914.61
          8  2481039.18  2041895.24
          9  1977474.81  1627461.77
         10  3514865.91  2892734.64
*/

This uses the MSDASQL provider in concert with the Microsoft Access Text Driver to read the data from the CSV File.

Boy that was easy! We’re done, right?

Well… not really. There are some inherent gotchas and problems with this.

First of all, this approach with OPENROWSET will not work unless your server is configured to accept Ad Hoc Distributed Queries. That can easily be accomplished like so:

sp_configure 'Ad Hoc Distributed Queries',1
reconfigure

Note the name of the Driver in the query. My laptop has the Microsoft Access Text Driver installed on it… presumably because I’m on Windows 7 and Office 2010 is the only version of Office that I have installed (and, ironically, I did NOT choose to install Access). But on my old WinXP machine, there is no such driver. On that machine, I would have to use a different driver name:

select TerritoryID
      ,TotalSales
      ,TotalCost
from openrowset('MSDASQL'
               ,'Driver={Microsoft Text Driver (*.txt; *.csv)}'
               ,'select * from C:\Temp\TerritoryData.CSV')

By the way, note the subtle difference in that the older driver separates *.txt and *.csv with a semicolon; whereas, the newer (Access) driver separates them with a comma. You have to make sure the driver is spelled EXACTLY right.

So, in a nutshell, we have to know which driver is installed in order to make this work. To see the drivers installed in your system, go to Control Panel -> Administrative Tools -> Data Sources (ODBC) -> Drivers.

Note that our CSV file was in the C:\Temp folder. This path has to be visible to the server. In my case, on my laptop, it’s not a problem… my server is my machine. But if the server is on another machine somewhere out there on the network or in the cloud, then C:\Temp does not refer to a path on my local machine running SSMS, but to a path on that machine on which the server is installed. This is not a big problem, but it is something that you have to be aware of.

But the name of that path can be a problem. Let’s say instead of putting the CSV file in the C:\Temp folder, we decide to put it into a folder called C:\CSV Files. If we try to execute that, here’s what happens:

select TerritoryID
      ,TotalSales
      ,TotalCost
from openrowset('MSDASQL'
               ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)}'
               ,'select * from C:\CSV Files\TerritoryData.CSV')
/*
OLE DB provider "MSDASQL" for linked server "(null)" returned message 
  "[Microsoft][ODBC Text Driver] Syntax error in FROM clause.".
Msg 7321, Level 16, State 2, Line 2
An error occurred while preparing the query 
  "select * from C:\CSV Files\TerritoryData.CSV" 
  for execution against OLE DB provider "MSDASQL" for linked server "(null)". 
*/

It’s that darn space character embedded in the folder name that causes the problem. Even if we put double-quotes around the path and filename, it still doesn’t work, displaying a different error message:

select TerritoryID
      ,TotalSales
      ,TotalCost
from openrowset('MSDASQL'
               ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)}'
               ,'select * from "C:\CSV Files\TerritoryData.CSV"')
/*
OLE DB provider "MSDASQL" for linked server "(null)" returned message 
  "[Microsoft][ODBC Text Driver] The Microsoft Access database engine could 
  not find the object 'C:\CSV Files\TerritoryData.CSV'. 
  Make sure the object exists and that you spell its name and the path name correctly. 
  If 'C:\CSV Files\TerritoryData.CSV' is not a local object, check your network 
  connection or contact the server administrator.".
Msg 7350, Level 16, State 2, Line 1
Cannot get the column information from OLE DB provider "MSDASQL" 
  for linked server "(null)".
*/

Luckily this problem can be overcome by removing the path from the filename and specifying it as the DefaultDir in the Provider String parameter (the second parameter) of OPENROWSET:

select TerritoryID
      ,TotalSales
      ,TotalCost
from openrowset('MSDASQL'
               ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)}; 
                   DefaultDir=C:\CSV Files\'
               ,'select * from TerritoryData.CSV')
/*
TerritoryID  TotalSales   TotalCost
----------- ----------- -----------
          1  5767341.98  4746522.45
          2  3857163.63  3174445.67
          3  4677108.27  3849260.11
          4  8351296.74  6873117.22
          5  2851419.04  2346717.87
          6  6917270.88  5692913.94
          7  3899045.69  3208914.61
          8  2481039.18  2041895.24
          9  1977474.81  1627461.77
         10  3514865.91  2892734.64
*/

What if the filename itself has a space in it? If we rename the file to Territory Data.CSV, then we have a problem again:

select TerritoryID
      ,TotalSales
      ,TotalCost
from openrowset('MSDASQL'
               ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)}; 
                   DefaultDir=C:\CSV Files\'
               ,'select * from Territory Data.CSV')
/*
OLE DB provider "MSDASQL" for linked server "(null)" returned message 
  "[Microsoft][ODBC Text Driver] Syntax error in FROM clause.".
Msg 7321, Level 16, State 2, Line 1
An error occurred while preparing the query 
  "select * from Territory Data.CSV" 
  for execution against OLE DB provider "MSDASQL" for linked server "(null)". 
*/

But this problem can be fixed by putting double quotes around the filename:

select TerritoryID
      ,TotalSales
      ,TotalCost
from openrowset('MSDASQL'
               ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)}; 
                   DefaultDir=C:\CSV Files\'
               ,'select * from "Territory Data.CSV"')
/*
TerritoryID  TotalSales   TotalCost
----------- ----------- -----------
          1  5767341.98  4746522.45
          2  3857163.63  3174445.67
          3  4677108.27  3849260.11
          4  8351296.74  6873117.22
          5  2851419.04  2346717.87
          6  6917270.88  5692913.94
          7  3899045.69  3208914.61
          8  2481039.18  2041895.24
          9  1977474.81  1627461.77
         10  3514865.91  2892734.64
*/

By the way, you may have noticed that the TerritoryID output above is right-justified in the output, indicating that the Text Driver made a “best guess” (based on the data it saw in the column) that the column was numeric. Sometimes you don’t want the Text Driver to assume anything about the column datatypes. For example, if you are importing a CSV file that has Zip Codes or Social Security Numbers, for example, you want those imported as character data, but the Text Driver will only see digits in its sampling and will insist on making it numeric data (thereby losing leading zeroes). You can get around this by creating a SCHEMA.INI file to describe the contents of your CSV files. You can read more about that here.

Another problem with this OPENROWSET(‘MSDASQL’) approach is that it assumes the MSDASQL provider is installed. I naïvely thought that this was installed by default when SQL Server is installed. Apparently it’s not. And unfortunately, my client did NOT have this in his list of providers. To see which providers you have installed, go to the Object Explorer in SSMS and look under Server Objects -> Linked Servers -> Providers:

Finally, the OPENROWSET command expects only string literals for its parameters, so you cannot attempt to use variables like so:

declare @FilePath varchar(100) = 'C:\CSV Files\'
       ,@FileName varchar(100) = 'Territory Data.CSV'
select TerritoryID
      ,TotalSales
      ,TotalCost
from openrowset('MSDASQL'
               ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)}; 
                   DefaultDir='+@FilePath
               ,'select * from "'+@FileName+'"')
/*
Msg 102, Level 15, State 1, Line 8
Incorrect syntax near '+'.
*/

So, if you have the Path and Filename in variables, the only way you can accomplish reading in that file’s data is via Dynamic SQL, like so:

declare @FilePath varchar(100) = 'C:\CSV Files\'
       ,@FileName varchar(100) = 'Territory Data.CSV'
       ,@SqlStmt nvarchar(max) 
 
set @SqlStmt=N'
select TerritoryID
      ,TotalSales
      ,TotalCost
from openrowset(''MSDASQL''
               ,''Driver={Microsoft Access Text Driver (*.txt, *.csv)}; 
                    DefaultDir='+@FilePath+N'''
               ,''select * from "'+@FileName+N'"'')'
 
exec sp_executesql @SqlStmt
/*
TerritoryID  TotalSales   TotalCost
----------- ----------- -----------
          1  5767341.98  4746522.45
          2  3857163.63  3174445.67
          3  4677108.27  3849260.11
          4  8351296.74  6873117.22
          5  2851419.04  2346717.87
          6  6917270.88  5692913.94
          7  3899045.69  3208914.61
          8  2481039.18  2041895.24
          9  1977474.81  1627461.77
         10  3514865.91  2892734.64
*/

So, to recap, if you want to use the OPENROWSET(‘MSDASQL’) approach, use the following guidelines:

The server must be configured to allow Ad Hoc Distributed Queries
The MSDASQL Provider must be installed on the server
Reference the correct Text Driver name in the Provider String parameter
Make sure the file path is visible to the server
Specify the file path in the DefaultDir of the Provider String parameter
Surround the filename with double-quotes in the query parameter of OPENROWSET
Use a SCHEMA.INI file if you want more control over the datatypes of the columns
Either hard-code the path and filename or use Dynamic SQL if the path and filename are in variables

Whew!

Since my client didn’t have the MSDASQL provider at all, I had to come up with a completely different approach to reading in the CSV file.

12 3 4 5 / 5 页下一页