Talk:Comma-separated values/Archives/2022/September

Windows-Centric

This article is Windows-centric. I'll go so far as to say it's Windows-myopic. It defies the imagination why time and again people who have no clue whatsoever come in to write authoritative articles. And of course get everything horribly wrong. There should be expert review before anything of a technical nature is published here. You're misleading a lot of people and that is simply unforgivable.

And no, C does not stand for common. Seriously. Or maybe it stands for other four letter words not mentioned here? Seriously: who cares? Why don't you concentrate on getting the technical aspects of this article right instead? And leave your urban legends to another place, another time? And of course if you don't have the expert knowledge to correct this article then you shouldn't be here in the first place.

The Weird World of Wikipedia. Disgusting. —Preceding unsigned comment added by 81.50.44.156 (talk) 19:10, 20 March 2009 (UTC)

Encyclopedia written by the people. Have you found a more extensive free-content encyclopedia? Jasonxu98 (talk) 00:04, 22 January 2011 (UTC)

Or, even better, why don't you fix it? Jasonxu98 (talk) 00:06, 22 January 2011 (UTC)

Comma vs Character

The C in CSV usually refers to comma, (anonymous insertion: Actually this entire article is wrong, I have it on good authority that it actually stands for "Common" As in Commonly separated value) but I prefer character as in character separated values. The principle is the same and it is more universal in application. Where commas can be embedded in the values, the presence or absence of quotes is critical. The use of uncommon characters for delimiters (such as vertical bar |) reduces the need for quotes and is much safer to use in practice. [This is wrong - delimiting with a common-use character such as comma helps ensure correct use of escaping and encoding, thereby assuring safety. Use of an uncommon character like bar or caret is leaving a time bomb in the system, triggered when the uncommon character turns up in the business data at a later time. Use of uncommon characters for delimiters should always be avoided where possible. --Jaymax 16:50, 2 April 2007 (UTC)]
Most file format conversion programs allow a variety of delimiter characters to be used in place of comma. Tabs are common delimiters.
The character delimiter is widely known to experienced users, but novices are easily confounded when delimiters are not commas. [and really experienced users have dealt with trying to fix an interface where someone years ago decided that | or ¦ or ^ or similar would be 'safe' and therefore never noticed that the escaping logic was broken.]

When CSV is used for search, disambiguation is warranted for comma vs character separated values.

john 07:21, 26 Sep 2004 (UTC)
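
To make the escaping point in this thread concrete, here is a minimal sketch using Python's standard csv module. The sample row and the round_trip helper are invented for illustration. It suggests that once quoting is applied consistently, a comma and a vertical bar are equally safe as delimiters, because a field containing the delimiter is simply quoted.

```python
import csv
import io

def round_trip(row, delimiter):
    """Hypothetical helper: write one row with the given delimiter, then read it back."""
    buf = io.StringIO(newline="")
    csv.writer(buf, delimiter=delimiter).writerow(row)
    text = buf.getvalue()
    parsed = next(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    return text, parsed

# Invented sample data containing a bar, a comma, and double quotes.
row = ["ACME | Sons", "1,5", 'say "hi"']

comma_text, comma_row = round_trip(row, ",")
pipe_text, pipe_row = round_trip(row, "|")

# A naive split on the "safe" delimiter breaks as soon as the data contains it.
naive_pieces = "ACME | Sons|1,5".split("|")
```

Both round trips return the original row, while the naive split yields three pieces for two values, which is the "time bomb" described above.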

Sorry but are you on prescription medicine? What other type of delimiter is there today but the character? A character is a byte and a byte is about as far down as you can break things in this context, isn't it? Or don't you know even that? Good rule of thumb for technical research: avoid the mall rats at Wikipedia at all costs. Good grief what a pathetic circus. —Preceding unsigned comment added by 81.50.44.156 (talk) 19:13, 20 March 2009 (UTC)

Quite right. You'll notice that the "good authority" referred to by anonymous is not provided. The C in CSV is for comma, not for character or common. Good grief indeed; such suggestions are just making stuff up. Consider e.g. the SAS definition of CSV: http://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#a003103525.htm **However**, it is a courtesy to fix such nonsense in the articles. Please do so next time. This time, I did it. Cerberus (talk) 22:58, 26 May 2015 (UTC)

The article should be descriptive of usage, not prescriptive. The acronym CSV has obviously been interpreted as "Character Separated Values" in many circumstances, certainly in my experience, despite usually meaning "Comma Separated Values". Since there is no standard technical definition, I don't see how there can be an authority beyond evidence of that usage. I'm not sure what counts as a good source, but hopefully we can find something better than the following tidbits and then fix the ugly misinformation: http://acronyms.thefreedictionary.com/Character+Separated+Values http://www.acronymfinder.com/Character-Separated-Values-(data-format)-(CSV).html 129.67.45.74 (talk) 16:40, 8 June 2015 (UTC)

Making guidelines more formal

I think that the general guidelines of a CSV should be explained in the Formal Specifications section rather than in the Example section (see note 1 below), stating clearly that those are not the standard, but perhaps the most widely used ones. I propose the Creativyst guidelines to be used (already linked from mentioned section). It would also be good to note the differences between the last and the RFC 4180. Juan Loman. 00:44, 13 December 2005 (PST)

Note 1: The guidelines in the "Example" section are good for an example, but should not be the only ones in the entire document.

Example

Does anyone else think that the format shown in the example CSV data is a poor poor choice to show other people, with spaces between the comma and the text qualifier?

I don't think it's even valid CSV -- I'm not sure how the comma-space-quote should be parsed. Richard W.M. Jones 09:53, 3 November 2005 (UTC)

I agree. A good example would exercise the full range of valid input (note that spaces after the commas is valid input though). How about:

1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""",,4900.00
1996, Jeep, Grand Cherokee, "air, moon roof, loaded
MUST SELL!", 4799.00

Yes, this is a much better example. Perhaps also a table showing how it should be interpreted in a spreadsheet? Particularly the multi-line "air ... SELL!" cell which would be very well demonstrated by the table. Richard W.M. Jones 21:29, 3 November 2005 (UTC)

Small problem is that in both the text and the table, the CR looks like natural wrapping. A bit of tweaking could put the CR such that the first line is shorter than the first in the table, and the CR in the text truncates the line early. Also, would it be appropriate to include field labels at the top of the CSV text only, showing the optional convention of the first line being a comma delimited list of fieldnames? --195.92.40.49 14:57, 2 March 2006 (UTC)

Umm... The text says the example illustrates "a space before and after delimiter commas may not be trimmed". But I don't see how it illustrates that... RobertII (talk) 21:30, 30 December 2009 (UTC)
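
As a rough illustration of the comma-space-quote question raised in this thread, the sketch below feeds the three-record example proposed above to Python's standard csv module. This shows how one particular parser behaves, not what CSV "should" do. The parse helper and variable names are made up for the example.

```python
import csv
import io

# The example proposed in this thread, with the line break inside the last record.
EXAMPLE = (
    '1997,Ford,E350,"ac, abs, moon",3000.00\n'
    '1999,Chevy,"Venture ""Extended Edition""",,4900.00\n'
    '1996, Jeep, Grand Cherokee, "air, moon roof, loaded\n'
    'MUST SELL!", 4799.00\n'
)

def parse(text, **options):
    """Hypothetical helper: parse CSV text with the given reader options."""
    return list(csv.reader(io.StringIO(text, newline=""), **options))

# Default settings: a quote that follows a space is treated as ordinary data,
# so the third record is split at the line break and four rows come back.
default_rows = parse(EXAMPLE)

# skipinitialspace=True drops the space after each comma, so the quote opens
# a quoted field and the embedded line break stays inside one cell.
lenient_rows = parse(EXAMPLE, skipinitialspace=True)
```

With the default settings this parser returns four rows; with skipinitialspace=True it returns three, and the multi-line "air ... SELL!" cell stays in one field. That difference is one reason the spaced form is a risky choice for an example.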

Programming language tools

There was a huge unwieldy section which basically was a list of everyone (me included) promoting their own little CSV tool. I've changed this into a slightly less unwieldy table. Richard W.M. Jones 15:59, 21 October 2005 (UTC)

Man that table is ugly. The sectionated arrangement was much easier to understand. The Java section is just a run-on sentence of confusion. Is there a special way to roll back that one update without disturbing the two legitimate updates that followed? Mike

It's a lot better than it was before. If anything needs to be rolled back it's the decision to have a huge list of implementations, which are basically adverts for various products. It adds nothing at all to the article or to the understanding of the CSV format. Richard W.M. Jones 09:07, 29 October 2005 (UTC)

I disagree. After looking at it again I can see you stripped a lot of useful information about each tool. Unfortunately I can't figure out how to get to the original content. Also that SiMX thing should be moved into the Programming Tools section. You wouldn't happen to be affiliated with that product would you?

No. As you could find out with minimal research, I did the OCaml CSV tool. Before changing this list you should clearly understand Wikipedia:What_Wikipedia_is_not. Richard W.M. Jones 21:27, 1 November 2005 (UTC)

Then by your reasoning we would have to remove the whole section and leave nothing but the conceptual description, examples, and history. But I think that would be a mistake because people interested in CSV are invariably working with such files and need developer tools. So either take it out or leave it in. Don't leave some half baked chain of links. How about reformatting it without sub-sections so that it doesn't appear in the TOC. Just make one section with a list of bullets like the below? We MUST leave the descriptive text in.

As I said above, I think it needs to be removed. The change I made to a table was better than before when this section took up 2/3rds of the article; but the ideal would be to remove it altogether. This is an article about the CSV format, and not a list of links or promotions. If people want to find Java-based CSV implementations, they can easily do this on Google. If creators of these tools wish to advertise them, they can advertise them through paid search on Google. Richard W.M. Jones 09:21, 2 November 2005 (UTC)

Fine. Go ahead then. Remove it.

I'm actually looking for a suitable place to host the list. For example Wikibooks. If it was located there (or a suitable place) then we could just link to that list ([http://example.com/csv_implementations A list of CSV implementations]). Best of both worlds. Richard W.M. Jones 21:33, 3 November 2005 (UTC)

I would suggest condensing the application support, programming tools and utilities sections into a single paragraph that explains how CSV is an extremely widely supported and implemented file format. You can then point to another article for that section and title it something like "Comma-separated values (implementations)". That way, the information can stay in the encyclopedia, the list can grow and it stays out of the main article where most people probably don't want it anyway. I think this article could use a lot of cleanup. --MattWright (talk) 23:47, 28 January 2006 (UTC)

OK, I moved it to another article as no one objected for over 9 months. :) --MattWright (talk) 14:52, 19 October 2006 (UTC)

MattWright, that sounds like a good idea. Where did you move it to? CSV application support? --DavidCary (talk) 16:25, 5 October 2015 (UTC)

Limit on the number of records

Is there any maximum limit on the number of records that a CSV file can have? Does the max limit for Excel which is around 65536 apply for the CSV file also? Pls post a reply. Thanks

Not in the basic CSV file, no. However applications which process the CSV file can certainly have such limits. Excel and OpenOffice.org are examples of this. Richard W.M. Jones 23:54, 25 January 2006 (UTC)

Excel limits to 65 K rows, but I've worked with CSV files that were gigabytes. A lot (most?) of legacy apps can export data from their own proprietary format to CSV, including some huge transactional systems, like a Lucent (telecoms) switch. The only real limit is storage space. But then you would use something like SQL Server or Oracle to load/analyze the data, instead of Excel. ForrestCroce 02:52, 20 December 2006 (UTC)
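
To illustrate that the row limit belongs to the application and not to the format, here is a small sketch that counts records by streaming them one at a time with Python's standard csv module. The count_records helper and the generated data are assumptions for the example. A real multi-gigabyte export would be opened from disk with open(path, newline="") in place of the in-memory stand-in.

```python
import csv
import io

def count_records(lines):
    """Hypothetical helper: count CSV records without holding them all in memory."""
    return sum(1 for _ in csv.reader(lines))

# In-memory stand-in for a large export, deliberately longer than 65,536 rows.
big = io.StringIO(
    "".join(f"{i},value{i}\n" for i in range(100_000)),
    newline="",
)

total = count_records(big)
```

Because the reader yields one record at a time, memory use stays roughly constant however long the file is.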

Infobox

Maybe add an infobox?

Confusing?

"CSV file format is a delimited data format that has fields separated by the comma character and records separated by newlines." Maybe it would be more clear to say a CSV file is a "flat file" that contains tabular data? Probably flat file isn't recognized outside the relational database world, but everyone is familiar with the idea of tabular data. (Maybe some text along the lines of "... like an Excel table.") I'm not saying the article shouldn't explain how CSVs are structured, but the introduction shouldn't be intimidating, even to someone from outside the IT world. ForrestCroce 01:45, 20 December 2006 (UTC)

This Article is Getting Messy

This article used to be pretty simple and clear but it's not so simple or clear anymore.

First, someone pandering Delimiter-separated_values has taken an interest in this page. I've never heard of delimiter separated values and searching the web for it comes up with basically nothing. I think someone is trying to organize concepts at the expense of history. We know "comma separated values" is a misnomer but the fact is that's what people have been calling this format forever. You can't invent things on Wikipedia.

Second, the examples with bullets of notes is odd. It's in the Specification section and starts "The basic rules are". That whole sequence of examples with bullets of notes should be in the Example section. The Specification section should only cite specifications. I think the old simple example and "The basic rules are" sequence should be merged. Specifically the old example should be a quick and simple example at the top of the examples section. The bullets from the old example should be replaced with "The basic rules are" sequence.

--Miallen 02:39, 17 May 2007 (UTC)

I propose merging this article with Delimiter-separated_values and redirecting Comma-separated values to it. Maybe Delimited file format would be a better title for the article. MFNickster 05:15, 7 July 2007 (UTC)

There are several problems here:

1) The name "comma separated values" (hereinafter CSV) is well-recognized and well-established;
2) The specification for CSV is not well-recognized, and there is no "authoritative specification" [1];
3) The use of CSV is well-recognized, although there are variations because of the lack of a well-established specification;
4) The use of "delimiter separated values" (or whatever you want to call it) is well-recognized *and* well-established, but there is no singular "well-recognized" name: in fact there are several "names" (See e.g., [2], [3], [4], [5])
5) The fact that topic relates to something "well-used" but not "well-named" does not mean that a WP article on that topic is inappropriate, nor does that make it an "invention" established by WP. (See e.g., User:Dcoetzee/Named topic bias). Similarly, something that is not "well-specified" may still warrant independent treatment in WP if it is "well-used" and "well-named". Many people will come specifically looking for information on CSV, regardless of whether that name was poorly chosen or a misnomer. Since that is the case, there should be an article describing what CSV is.

Therefore, merging CSV and Delimiter separated values (or whatever you want to call it) sounds like a bad idea because the potential for confusion is already high, given that these articles talk about concepts with either poorly-chosen (but well-established) names, or no well-established name at all.

This article still could use some cleanup, no doubt about that, but that does not warrant a merge of loosely-related articles, especially since the potential for confusion is high. dr.ef.tymac 14:51, 7 July 2007 (UTC)

I concur with dr.ef.tymac's remarks. --Crath 21:05, 7 July 2007 (UTC)

Negative MFNickster. Please do not redirect to "Delimiter separated values". CSV is the term people recognise. Let's not get carried away with semantic details pls. --Miallen 21:23, 26 September 2007 (UTC)

So user --Miallen 02:39, 17 May 2007 (UTC)-- wrote: "You can't invent things on Wikipedia." Au contraire. It's done all the time. Even well-cited passages (with still functioning links!) get gaffled with alarming frequency. Which is why I now no longer contribute, except occasionally on the "Talk Tab" for cathartic purposes :) <sigh> I feel better now. — Preceding unsigned comment added by 159.121.119.134 (talk)

Trailing whitespaces

Our article says: "Leading and trailing spaces or tabs, adjacent to commas, are trimmed" opposed to RFC 4180 stating: "Spaces are considered part of a field and should not be ignored". This leads imho to the following question: What is the purpose of this article? To define 'CSV according to WP', 'CSV according to the creativyst article', 'CSV according to RFC' or -what I favor- information on all (notable) styles? Tierlieb 10:53, 12 June 2007 (UTC)

I agree, the WP article should document the various (notable) CSV styles in use. May I suggest adding a section to the article that documents the effect various major applications have had upon CSV; e.g., Excel's dominance as a spreadsheet has caused many to understand the CSV format only as Excel understands it. --Crath 21:11, 7 July 2007 (UTC)

Note that RFCs are only informational and must be evaluated for relevance. Unfortunately, in the case of RFC 4180 it sounds like it is not terribly relevant. The "specification" for CSV is defined by Microsoft Excel's CSV import / export code. I suspect that might be an unpopular idea to some but AFAIC any CSV emitted by an application MUST be completely compatible with Excel because of that application's long continuing history of support for CSV. If someone wants to do an RFC that's fine but I think it should go as far as to actually state that it is simply formalizing observed behavior of the Microsoft Excel spreadsheet application. --Miallen 21:43, 26 September 2007 (UTC)

US bias: I know that in Germany Excel uses ; rather than , by default. Maybe this is true for more local editions, since a lot of countries use , as a decimal separator; IIRC there is also an international agreement on that. The RFC is not very relevant for CSV; CSV has existed much longer. This notable separator issue was removed here http://en.wikipedia.org/w/index.php?title=Comma-separated_values&diff=54321682&oldid=54291230 UnLoCode (talk) 14:31, 2 April 2008 (UTC)
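
The two styles contrasted at the top of this thread (spaces kept, as in RFC 4180, versus spaces trimmed) can be sketched in a few lines of Python. The sample line is invented. Trimming is done here with a plain strip() after parsing, which is only an approximation of the "trim" style, because it cannot tell whether a space originally sat inside quotes.

```python
import csv
import io

# Invented sample line with spaces adjacent to the commas.
line = "alpha, beta ,gamma\n"

# RFC 4180 style: spaces are part of the field and are kept as written.
preserved = next(csv.reader(io.StringIO(line, newline="")))

# "Trim" style: leading and trailing whitespace is discarded after parsing.
# Caveat: this also strips whitespace that was protected by quotes.
trimmed = [field.strip() for field in preserved]
```

Here preserved keeps " beta " with its spaces and trimmed reduces it to "beta", so the same bytes yield different data depending on which convention the reader follows.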

Image

With all these discussions about allowing use of fair-use images or not, how to avoid costly lawsuits... WHY does anyone put a copyrighted image in the article (referring to screenshot of import window of MS Access) if there is so much free software around? --Ben ^T/_C 08:49, 27 September 2007 (UTC)

How to open CSV in OOo

IIRC OpenOffice Base 2.3 does _not_ support csv import. Not tested 2.4, but tired of that testing with every edition. UnLoCode (talk) 14:34, 2 April 2008 (UTC)

Of course it does. Do not look for "import" option, just try to create a new base as follows:

select "Connect to existing base", select "Text" from the pull-down list, click "Next"
select path for existing CSV files, mark CSV format, choose separators, click "Create"
name your new base, click "Save"
select "Tables" from the sidebar, choose any of CSV table and double-click to browse and edit that table.

I do not know exact names in English as I just installed OOo in Polish version. --Andrzej P. Wozniak (talk) 16:23, 2 April 2008 (UTC)

But what if I have an odb already and want to add a new table and import CSV data? Is that finally possible without the sometimes-described workaround via .ods? The thing is, I have to manage several tables, so I really need importing ... and exporting. And I need UTF-8. It would be great if OOo would finally support that. Export should please not need too many clicks. :-) Thanks for your help anyway. UnLoCode (talk) 22:58, 2 April 2008 (UTC)

This is not the right place for discussion about OOo or databases. Choose one of the support sites recommended by OOo. Here are just two hints:

You do not need to use import/export/convert explicitly as those operations are done on-the-fly if you specify proper parameters for any table you create or connect to.
You do not need to use database operations if you only want to convert text files from some codepage to UTF-8, there are many text codepage/charset converters. --Andrzej P. Wozniak (talk) 09:14, 3 April 2008 (UTC)

Comma versus Semicolon

I have noticed that, in Windows at least, if you have the comma set as your decimal separator (which many countries use) then Excel will export a "comma delimited" CSV file with semicolons instead of commas. This is something to watch out for if you are importing or exporting data from your own applications with the intention that people will be able to load it into Excel. I ran into this problem with a colleague in Europe. Does anybody know if this is common in other applications or if it's just an Excel thing? Also are there any common alternatives other than comma and semicolon? Wjousts (talk) 19:36, 3 April 2008 (UTC)

I ran into a similar issue with a European colleague and was told by him that CSV files always (in his experience, of course) used a semicolon (ie not because of Windows nor Excel) —Preceding unsigned comment added by 91.85.197.128 (talk) 11:13, 21 July 2008 (UTC)

Yes, this is because of the language settings and stupid Windows, which is not compatible with itself (tools). So Excel, for example, in Finland uses the country setting ',' (comma) and in Denmark it uses ';'. So that breaks the compatibility. That's intolerable, but that's Excel ;D 192.100.124.218 (talk) 11:44, 15 September 2008 (UTC)
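
For anyone importing files that may come from either kind of locale, here is a minimal defensive sketch in Python. The guess_delimiter and read_table helpers and the two sample exports are hypothetical. The heuristic just counts candidate delimiters in the header line, which is an assumption that fails if the header itself contains quoted delimiters.

```python
import csv
import io

def guess_delimiter(header_line, candidates=(",", ";", "\t")):
    """Hypothetical heuristic: pick the candidate that occurs most often in the header."""
    return max(candidates, key=header_line.count)

def read_table(text):
    """Parse CSV text using the delimiter guessed from its first line."""
    delimiter = guess_delimiter(text.splitlines()[0])
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    return delimiter, rows

# Invented samples: one export with a decimal point, one with a decimal comma.
point_locale_export = "item,price\nwidget,1.50\n"
comma_locale_export = "item;price\nwidget;1,50\n"

point_result = read_table(point_locale_export)
comma_result = read_table(comma_locale_export)

# The decimal comma still has to be converted before the value is a number.
price = float(comma_result[1][1][1].replace(",", "."))
```

The second export parses cleanly only once the semicolon is recognised as the delimiter, and the price field still carries its decimal comma until it is converted explicitly.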

Today's edit: Values to Volume

Does anyone besides me think that the edit applied to this entry today, where "values" was replaced with "volume", is inappropriate? I've been working in and around computers for 25 years and I have never before heard CSV referred to as Comma Separated Volume. Rather than simply revert the edit, I thought I'd see what others think of the edit. Christopher Rath (talk) 19:34, 17 April 2008 (UTC)

Yeah, google doesn't even turn up any real references to that name either, so it shouldn't be listed in the article without some sort of reference or usage basis... --MattWright (talk) 07:31, 18 April 2008 (UTC)

Line Breaks & white space

Article mentions:

"Each record is one line terminated by a line feed (ASCII/LF=0x0A) or a carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A), however, line-breaks can be embedded."

RFC4180:

"Each record is located on a separate line, delimited by a line break (CRLF)." but it then says

escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE

which (correctly?) implies any field that contains a CR, LF or CRLF should be all enclosed in a double quoted field.

Interestingly my version of Excel will accept either CR or CRLF to delimit a record but CRLF when it's embedded in a field enclosed in double quotes, results in a line break together with the "square" character representing a character that cannot be displayed, in this case the CR. ie only a single LF works correctly if part of a field.

Article then links to [[6]], which adds in a lone CR to the mix.

Similar problems in the article referring to:

"leading and trailing spaces or tabs, adjacent to commas, are trimmed. This practice is contentious and in fact is specifically prohibited by RFC 4180, which states, "Spaces are considered part of a field and should not be ignored.""

At severe danger of confusing spaces with whitespace here - e.g. does one need to enclose a field in double quotes if it has leading or trailing tabs?

Any one want to tackle these points? - I don't! —Preceding unsigned comment added by 124.191.116.29 (talk) 00:26, 13 June 2009 (UTC)

It has been my experience that some versions of some Microsoft products, as well as third-party products, will fail on a CSV file delimited by LF characters. They only work if the file is delimited by the CRLF sequence. --Jym (talk) 21:05, 23 March 2010 (UTC)
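
The record-terminator and embedded-line-break cases discussed in this thread can be checked against one concrete parser. The sketch below uses Python's standard csv module with invented sample data, so it says nothing about how Excel or other products behave. Passing newline="" keeps CR and LF untranslated so the csv module sees them exactly as written.

```python
import csv
import io

def parse(text):
    """Hypothetical helper: parse CSV text without translating CR or LF."""
    return list(csv.reader(io.StringIO(text, newline="")))

# The same two records, terminated with CRLF and with bare LF; the second
# record has a line break embedded in a double-quoted field.
crlf = 'id,note\r\n1,"first line\r\nsecond line"\r\n'
lf = 'id,note\n1,"first line\nsecond line"\n'

crlf_rows = parse(crlf)
lf_rows = parse(lf)

# On output, the module's default dialect ends each record with CRLF
# and quotes any field that contains a line break.
out = io.StringIO(newline="")
csv.writer(out).writerow(["1", "first line\nsecond line"])
written = out.getvalue()
```

This reader accepts both terminators and keeps the embedded break inside the quoted field, while the writer emits CRLF by default. That matches the advice elsewhere on this page to be liberal on input and conservative on output.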

Move Pilcrow support

The section on Pilcrow support in applications should be moved to CSV application support. —Preceding unsigned comment added by Paddy3118 (talk • contribs) 05:29, 31 July 2009 (UTC)

Incredibly confusing introduction

Sorry, but after reading the introduction, no "normal" person will know any more than before reading it. What's all this confusion about? A "table of lists form"? What does that even mean? And "where each associated item (member) in a group is in association with others also separated by the commas of its set." - wow, I've never read anything more confusing!

What about "A comma-separated values (CSV) file is used for the digital storage of tabular data, where each table row is stored as one line in a text file, with the individual columns separated by commas."??? Or is that not "programmer-like" enough? —Preceding unsigned comment added by Intrr (talk • contribs) 15:42, 23 May 2010 (UTC)

I felt the same and attempted a re-write. --Paddy (talk) 06:15, 31 January 2011 (UTC)

Good idea but somebody messed with Paddy's rewrite. I tried for something close to your language. Cerberus (talk) 23:01, 26 May 2015 (UTC)

Graphic is good, but ...

It would be better if lines didn't have the commas all aligned. This would accentuate the fact that it is the commas and not some horizontal index in the line, that is the field delimiter. --Paddy (talk) 06:18, 31 January 2011 (UTC)

ANSI only?

I recently discovered (and read about on the web) that .csv inherently supports ANSI and not Unicode. Should that be put in? http://support.microsoft.com/kb/172727 69.136.72.16 (talk) 02:58, 7 February 2011 (UTC)

But note this; and it seems Microsoft have other programs that generate Unicode csv files. --Paddy (talk) 06:14, 7 February 2011 (UTC)

It is not true that .csv inherently does not support Unicode. I am a retired programmer who has often read in Unicode .csv files with no problems. The example you give refers to Microsoft Project, which seemingly does not support Unicode when you save to particular formats. That is a matter of the application, not of the format. Jallan (talk) 01:49, 12 February 2012 (UTC)
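
A quick way to see that the encoding is a property of the application and not of the format is to round-trip non-ASCII data through UTF-8. The sketch below uses Python's standard csv module, and the sample rows are invented. Some spreadsheet programs are reported to need a byte order mark to detect UTF-8; if that applies, Python's "utf-8-sig" codec can be used in place of "utf-8". That is a workaround for those programs and not a requirement of CSV.

```python
import csv
import io

# Invented sample rows with characters outside ASCII and the Windows "ANSI" code pages.
rows = [["name", "city"], ["Zoë", "Kraków"], ["山田", "東京"]]

# Write the rows as CSV text, then encode that text as UTF-8 bytes.
text_out = io.StringIO(newline="")
csv.writer(text_out).writerows(rows)
payload = text_out.getvalue().encode("utf-8")

# Decode the same bytes and parse them back into rows.
text_in = io.StringIO(payload.decode("utf-8"), newline="")
round_tripped = list(csv.reader(text_in))
```

The rows come back unchanged, so nothing in the comma-and-quote structure itself prevents Unicode content.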

Tab

While I've heard of (and used) text data files divided by tabs, I've never seen them referred to as CSV files. Does anyone have any source for this terminology? 130.88.108.187 (talk) 13:20, 10 September 2013 (UTC)

Agreed. This article - about Comma-separated values - talks about other formats that are completely uninteresting to me. The whole section about how some people see CSV as being anything other than comma-separated values, and the seemingly meticulous avoidance of mentioning commas as being the separator, make this article awkward to read, and leave me puzzled. It reads as if the article used to be about other formats, but was trimmed down to be about C(omma)SV, but the smell of earlier text remains.

Just create a separate page describing various delimiter-separated values file formats, and include all of the confusion and (non-)controversy about the delimiter on that page. It makes sense for this page to mention: a) other formats are similar but use other delimiters, and b) some CSV applications use colons as separators when commas can cause confusion such as for some date formats. But that should be about all.

And the suggestion that CSV is anything but comma-separated values is ludicrous. I've located resources written as far back as 2004 (eg https://repositories.tdl.org/ttu-ir/bitstream/handle/2346/17115/31295019801124.pdf) that use the full term "Comma Separated Values". Anything else is a clumsy attempt at a backronym. Jlaidman (talk) 01:40, 22 October 2015 (UTC)

CSV standardisation

As is described in the article there is currently no real CSV standard. However, there are various moves afoot to change this, not least the W3C CSV on the web working group. Also today The National Archives has released a CSV Schema language and CSV Validator, more info at http://blog.nationalarchives.gov.uk/blog/csv-validator-new-digital-preservation-tool/ - by me hence why I'm only adding on the talk page, not into the article so others can decide on the significance so far as the article is concerned. David Underdown (talk) 11:25, 15 July 2014 (UTC)

Why does the CSV article have a huge disclaimer at the top? The article seems pretty good to me. You are causing people to wonder if the article is accurate. I think it is (and I didn't have anything to do with it).Dtaviation (talk) 14:47, 16 May 2015 (UTC)Dave

File extensions

All reputable sources that I can find say the common file extension for a comma-separated values format is .csv, so the infobox should reflect this generalisation. The .txt extension is widely viewed as a text file extension, which could contain comma-separated values, but it is not a common file extension for a comma-separated values file format. +mt 22:18, 16 March 2016 (UTC)

A major point of csv files is that they are text files and can be edited with a text editor. A .txt extension is completely appropriate.

I've used several application programs that would open .txt extensions but refuse to open .csv because they do not recognize the extension. Using a .txt extension gets around such narrowmindedness. If you want lcd, then use .txt.

Double clicking a .txt file will probably open a text editor. Double clicking a .csv file may fail for some systems (on my systems, it opens the file in Excel).

.txt is a common file extension for CSV. Many (and possibly the majority) of the CSV files that I use have a .txt extension. (Certainly the majority of text DB files that I use are .txt files.)

The article discusses using the extension .txt at Comma-separated values#Application support.

Microsoft Office will use either .csv or .txt.[7] (That article says .txt is typically tab-delimited, but the text driver is flexible.)

Microsoft JET (Joint Engine Technology) expects .csv or .txt.[8] See the figure with the dropdown box that says "Microsoft Text Driver (*.txt; *.csv)".

Microsoft ActiveX Data Objects take .csv or .txt.[9]

Microsoft OLE Text driver expects one of two extensions: *.csv and *.txt.[10]

There is no standard for CSV files -- despite this article's bias for RFC 4180. How many CSV implementations tolerate a line break within a string? The files are as one finds them.

Glrx (talk) 23:25, 16 March 2016 (UTC)

Glrx, I had a look at the article you quoted: Import-or-export-text-txt-or-csv-files

There are two commonly used text file formats: Delimited text files (.txt), in which the TAB character (ASCII character code 009) typically separates each field of text. Comma separated values text files (.csv), in which the comma character (,) typically separates each field of text.

This obsession with file-types only applies to Microsoft products today, *NIX doesn't have formal file types. It would therefore appear to be sensible to take MS's word, particularly when reinforced by RFC 4180. Any non-binary file can be called .txt, but that fails to note that there is a structure to CSVs. The infobox is not a place to discuss boundary cases, the text is the place to explain that the standard is not always followed. The quoted example of Unicode in Excel doesn't really address the CSV standard, it is suggesting a workaround for a regional deficiency in the program. It is also important to distinguish between input and output formats. Good coding practice is to accept as many input variants as possible but to output only according to standard. Martin of Sheffield (talk) 13:08, 17 March 2016 (UTC)

IP User 85.255.232.22 has reverted this claiming that "RFC 4180 specifically states that text files are supported as well as .csv". Reading RFC 4180 shows that this is not the case. Section 3 (MIME Type Registration of text/csv) lists:

   MIME media type name: text

   MIME subtype name: csv

but then as part of the text/csv registration this is to be expected! Later in the entry is the line:

      File extension(s): CSV

indeed, nowhere in the whole standard do the letters "txt" occur. Likewise, RFC 7111 which updates RFC 4180 also does not include the string "txt" within its body. I'm therefore going to revert and add a request for the IP user to see this talk page, but as with many infrequently used IPs there is a risk that it will not be read. Martin of Sheffield (talk) 09:47, 21 March 2016 (UTC)

Section about spaces potentially wrong?

The article currently says

"According to RFC 4180, spaces outside quotes in a field are not allowed".

I think that is wrong. The spec does not say that. Instead, the spec explicitly allows them with

record = field *(COMMA field)
field = (escaped / non-escaped)
non-escaped = *TEXTDATA
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

In there, %x20 is the space character. So the space character is allowed in fields that aren't quoted.
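
As an illustration only, the non-escaped production quoted above can be transcribed into a regular expression. This is a minimal Python sketch, assuming the ABNF lines are copied correctly from the RFC; the function name is_non_escaped is made up for this example and is not part of any library.

```python
import re

# TEXTDATA = %x20-21 / %x23-2B / %x2D-7E, transcribed from the ABNF quoted above.
TEXTDATA = r"[\x20-\x21\x23-\x2B\x2D-\x7E]"

# non-escaped = *TEXTDATA (zero or more TEXTDATA characters)
NON_ESCAPED = re.compile(rf"{TEXTDATA}*")


def is_non_escaped(field: str) -> bool:
    """Return True if the whole field matches the non-escaped production."""
    return NON_ESCAPED.fullmatch(field) is not None


if __name__ == "__main__":
    # Spaces (%x20) are TEXTDATA, so they are accepted anywhere in the field.
    print(is_non_escaped("foo bar"))
    print(is_non_escaped(" foo "))
    # DQUOTE (%x22) and COMMA (%x2C) are excluded from TEXTDATA.
    print(is_non_escaped('fo"o'))
    print(is_non_escaped("a,b"))
```

Under this reading, an unquoted field may contain leading, trailing, or embedded spaces, but never a double quote or a comma.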

If anything, we could interpret the current wording as "if the field is quoted, then there must not be spaces outside the quotes", but currently it reads much more like "if the field contains a space, it must be quoted". Even if the former was the intent, the wording should be clarified, and it should be explicitly stated that quotes are not required for fields that include spaces. Then we can also remove the current "however, the RFC also says ..." wording, which indicates that there's a contradiction when there isn't one.

Further references: My comment on stackoverflow, My comment on superuser, Issue for Haskell's CSV library

Nh2-wiki (talk) 21:55, 1 June 2016 (UTC)

I don't have time to look at this, but the post seems to confuse quoted ("escaped"?) and non-quoted ("non-escaped") fields.

In general, the RFC does not want any spurious spaces. Consequently, the quotation marks of a field must not have any spaces to the left or right.

Common practice does not follow the RFC. Common practice throws away spaces at the beginning or end of a field unless the spaces are inside quotation.

Glrx (talk) 14:29, 4 June 2016 (UTC)

I agree with Nh2-wiki. The current wording in the article about a conflict isn't correct. The RFC simply states that escaped fields begin with a quote and unescaped fields do not. There is no conflict. The article should be modified. Christopher Rath (talk) 22:46, 4 June 2016 (UTC)

The RFC sez escaped fields may not begin with a space and may not end with a space.

RFC page 3: "Spaces are considered part of a field and should not be ignored."

RFC page 3: escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE

the BNF is unambiguous: a field does not have any spaces outside the DQUOTEs

there are no productions that remove whitespace

RFC page 5: "Due to lack of a single specification, there are considerable differences among implementations. Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files. An attempt at a common definition can be found in Section 2."

Contrary to what the RFC says, most CSV implementations ignore whitespace characters adjacent to BOL, comma, and EOL.

Glrx (talk) 00:09, 5 June 2016 (UTC)

What are some examples of the "common practice" you mention? I'm not familiar with any.

Also, there's no ambiguity in the BNF: the BNF simply states that escaped fields begin with a DQUOTE. Those fields may contain spaces. A non-escaped field begins with any other character, which could be a space. If you implement the BNF in the RFC you get a well behaved program that always processes CSV files in a predictable manner and doesn't discard any spaces. That said, most programs/applications don't rigorously implement the RFC. Christopher Rath (talk) 06:35, 5 June 2016 (UTC)

Then you apparently agree with me and disagree with Nh2-wiki. The RFC does not permit spaces at the start of a quoted/escaped field.

The issue here is does the RFC permit spaces outside of the DQUOTE marks in an escaped field. The answer is no.
Look at http://stackoverflow.com/questions/22492244/is-whitespace-at-the-end-of-a-line-valid-csv#comment62645138_22514146 -- the link given in Nh2's OP.

Matt's answer of March 19, 2014, uses the input line
"foo", "bar"

That is, there is a space after the first comma and then a DQUOTE. The second field has a space outside the quotation mark. Matt showed that parsing the line raised an illegal quotation exception. The implication is that matt's example line violates the RFC spec (which it does).

Following matt's post, nh2 (here Nh2-wiki) posted a claim that matt was wrong -- with the same argument that Nh2-wiki gives above.

The problem is Nh2-wiki is looking at the BNF for a non-escaped field. *TEXTDATA does permit leading spaces (and makes them significant as field data), but DQUOTE is not an element of TEXTDATA, so a non-escaped field may not have a DQUOTE anywhere within. That means the premise of spaces outside the quotation marks cannot be met.

The proper context is escaped fields. Those fields begin and end with DQUOTE, so there are no spaces outside the quotation marks.
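
To make the argument concrete, here is a rough Python sketch that transcribes the record, field, escaped and non-escaped productions quoted in this thread into one regular expression and applies it to matt's example line. It is an illustration under the assumption that the productions are quoted accurately; the name is_rfc4180_record is hypothetical and line endings between records are not handled.

```python
import re

# Productions as quoted in this thread (RFC 4180, section 2).
TEXTDATA = r"[\x20-\x21\x23-\x2B\x2D-\x7E]"
# escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
ESCAPED = rf'"(?:{TEXTDATA}|,|\r|\n|"")*"'
# non-escaped = *TEXTDATA
NON_ESCAPED = rf"{TEXTDATA}*"
# field = (escaped / non-escaped)
FIELD = rf"(?:{ESCAPED}|{NON_ESCAPED})"
# record = field *(COMMA field)
RECORD = re.compile(rf"{FIELD}(?:,{FIELD})*")


def is_rfc4180_record(line: str) -> bool:
    """Return True if the line (without its line ending) matches `record`."""
    return RECORD.fullmatch(line) is not None


if __name__ == "__main__":
    print(is_rfc4180_record('"foo","bar"'))   # quotes directly next to the comma
    print(is_rfc4180_record('"foo", "bar"'))  # matt's example: space before DQUOTE
    print(is_rfc4180_record(' foo, bar'))     # unquoted fields with leading spaces
```

With this transcription the first and third lines match and matt's line does not, because a field that starts with a space is non-escaped and so cannot contain a DQUOTE.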
As an aside, the RFC BNF is very restrictive. The def for TEXTDATA is
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

The BNF does not permit arbitrary characters within an escaped field. The definition for TEXTDATA does not include any control characters (such as tab), the DQUOTE character (0x22), the COMMA character (0x2C), the DEL character (0x7F), or any character with bit 7 set (0x80 to 0xFF). Consequently, the BNF only explicitly permits the control characters CR and LF. That means that the RFC (written in 2005) does not support UTF-8 (Thompson 1992; recommended for internet mail in 1998). UTF-8 is compatible with ASCII and cannot collide with a comma or double quote character because all multibyte sequences set bit 7. (Overlong encoding could surreptitiously quote a comma.) Many DBCS charsets (such as Shift-JIS) have the second byte stay out of 0x00 to 0x3F, so those charsets will not collide either.
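
The two claims above (which byte values TEXTDATA covers, and that UTF-8 multibyte sequences never contain a comma or double quote byte) can be checked with a short Python sketch. The helper names in_textdata and multibyte_bytes_all_high are invented for this illustration, and the sample string is arbitrary.

```python
COMMA, DQUOTE = 0x2C, 0x22


def in_textdata(code: int) -> bool:
    """True if the byte value falls in TEXTDATA = %x20-21 / %x23-2B / %x2D-7E."""
    return 0x20 <= code <= 0x21 or 0x23 <= code <= 0x2B or 0x2D <= code <= 0x7E


def multibyte_bytes_all_high(text: str) -> bool:
    """True if every byte of every multibyte UTF-8 sequence has bit 7 set."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


if __name__ == "__main__":
    # Tab, DQUOTE, COMMA, DEL and 0x80 are all outside TEXTDATA.
    for code in (0x09, DQUOTE, COMMA, 0x7F, 0x80):
        print(hex(code), in_textdata(code))
    # Arbitrary non-ASCII sample: no encoded byte can equal COMMA or DQUOTE.
    print(multibyte_bytes_all_high("é日本語°"))
```

Because every byte of a multibyte sequence is 0x80 or above, a byte-oriented parser that only looks for 0x2C and 0x22 will not be confused by well-formed UTF-8 data.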

In fact, it would be simpler to specify a tab-delimited format because under the RFC's definition, field data may never contain a tab. The tab character would therefore make an ideal delimiter.
To your first question about common practice, many CSV implementations (e.g., Excel) ignore unquoted whitespace at the start and end of a field.[11][12] Consequently, those programs would process matt's example without incident. Sadly, the RFC cites both of those references (as refs 4 and 5) and then inexplicably keeps the whitespace significant. RFC ref 6 has a "trim" whitespace option.[13] RFC ref 7 does not address the whitespace issue (and does not understand that backslash is problematic as an escape character in the DBCS context).[14] From that survey, Shafranovich concludes most implementations do not trim. I believe Shafranovich did not understand the leading-trailing whitespace issue; if he had understood it, then he would have addressed it in the body of the RFC. If he had understood the issue, then his advice about being liberal in what is accepted should have suggested processing matt's example without raising an exception; I think that would have also led to trimming whitespace in non-escaped fields.

Many implementations allow DQUOTE processing to be turned off. Many GIS systems read fields such as 46°20'48"N. Some implementations treat DQUOTE as a normal character if the field did not start with a DQUOTE; those implementations allow quoted fields (e.g., "Acme Products, Inc.") and GIS coordinates.
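
As one data point, Python's standard csv module shows several of the behaviours described above. This sketch only documents what that one module does with its skipinitialspace and quoting options; it says nothing about Excel or the implementations surveyed in the RFC. The parse helper is a convenience wrapper written for this example.

```python
import csv


def parse(line: str, **options) -> list:
    """Parse a single CSV line with csv.reader and return its fields."""
    return next(csv.reader([line], **options))


if __name__ == "__main__":
    # Default dialect: the space starts an unquoted field, so the quotes
    # that follow are kept as ordinary data.
    print(parse('"foo", "bar"'))                         # ['foo', ' "bar"']
    # skipinitialspace=True drops the space after the delimiter, so the
    # second field is then treated as a quoted field.
    print(parse('"foo", "bar"', skipinitialspace=True))  # ['foo', 'bar']
    # A DQUOTE inside a field that did not start with one is a normal
    # character, so GIS coordinates and quoted fields can coexist.
    print(parse('46°20\'48"N,"Acme Products, Inc."'))
    # quoting=csv.QUOTE_NONE turns DQUOTE processing off entirely.
    print(parse('"a,b",c', quoting=csv.QUOTE_NONE))      # ['"a', 'b"', 'c']
```

In its default configuration this module accepts matt's example without raising an exception, but it keeps the leading space and the quotation marks as field data rather than trimming them.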

Glrx (talk) 19:16, 5 June 2016 (UTC)

Application Support Discussion

A heads up to those watching this article that may not also be tracking the linked CSV application support page. Some uninterested parties are proposing that the CSV application support page be deleted. To participate in that discussion, please see Wikipedia:Articles for deletion/CSV application support Christopher Rath (talk) 12:58, 30 December 2018 (UTC)

Specification

@Crath: – mine was a revert of an IP user, you should follow WP:BRD not simply re-revert. The opening of that paragraph now reads "RFC 4180 proposes a specification for the CSV format, and this is the definition commonly used. As a result, in practice ..."; "as a result" of what? It implies that because the RFC proposes a definition in practice it is ignored, something I'm sure we both agree is wrong. I'm recasting the whole opening of the paragraph to make the ambiguities clearer. Regards, Martin of Sheffield (talk) 15:13, 20 January 2019 (UTC)

@Martin of Sheffield:, my apologies for not reading the change log closely enough. Your rewording looks very good. Thanks! Christopher Rath (talk) 02:46, 22 January 2019 (UTC)

Supported by Unix-style utility programs?

The section "Application support" says, "Many utility programs on Unix-style systems can operate on CSV files", then lists as examples cut, paste, join, sort, uniq, emacs, awk. With the exception of emacs, I think that is incorrect. Those programs can of course split a string with a comma separator, but I believe they cannot natively handle commas within quotation marks, which for me is what qualifies it as a CSV parser. At the least it would need to be amended to "... can operate on some CSV files", but I would rather remove it altogether. Adpete (talk) 22:59, 3 March 2019 (UTC)

Rather than "some", I suggest the word "simple" be used. Christopher Rath (talk) 23:04, 3 March 2019 (UTC)

Good suggestion. I've reworked that section now. Obviously, others are welcome to improve it. Adpete (talk) 00:44, 4 March 2019 (UTC)
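
The distinction drawn in this thread, between splitting a line on commas and parsing quoted fields, can be sketched in Python. The naive split stands in for what a delimiter-only tool such as cut -d, does with a line; the sample record and the function names are made up for illustration.

```python
import csv


def naive_split(line: str) -> list:
    """Split on every comma, ignoring quotation marks (delimiter-only tools)."""
    return line.split(",")


def csv_parse(line: str) -> list:
    """Parse the line with a quote-aware CSV reader."""
    return next(csv.reader([line]))


if __name__ == "__main__":
    simple = "1,Acme,2019"
    quoted = '1,"Acme Products, Inc.",2019'

    # On a "simple" CSV line both approaches agree.
    print(naive_split(simple) == csv_parse(simple))
    # With a comma inside quotation marks the naive split yields four
    # pieces, while the CSV reader yields the intended three fields.
    print(naive_split(quoted))
    print(csv_parse(quoted))
```

This is why "simple" is a reasonable qualifier: the two approaches only diverge once a field contains a quoted delimiter (or a quoted line break).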

Odd sentence

There's this odd sentence in the section General functionality:

Similarly, CSV cannot naturally represent hierarchical or object-oriented databases or other data.

I think the highlighted words should be deleted, or is there some special meaning there? BroVic (talk) 08:22, 8 March 2019 (UTC)

Just looks like poor wording to me. But I am going to remove the words "databases or other", because databases are a subset of data. Adpete (talk) 04:03, 17 March 2019 (UTC)