Why is the comma a bad delimiter / record separator in CSV files?

32

I was reading this article, and I'm curious to find the right answer to this question.

The only thing that comes to mind is that in some countries the decimal separator is a comma, and that could be a problem when exchanging data in CSV, but I'm not entirely sure about my answer.

David Gasquez
6
Almost any delimiter is better than the comma. The reason is that when comma-delimited files are read by some data-parsing tools, commas can be confused with punctuation, breaking the "layout" of the fields or columns.
Mike Hunter
33
A cynic, noting that this article is a SAS piece, might suppose that perhaps SAS has trouble handling CSV files with commas :-).
whuber
3
@whuber - SAS (in my experience) can struggle with CSV files, whether they have commas or not, requiring huge amounts of hand coding for every weird thing that SAS doesn't like.
Jeremy Miles
8
There's a desperation in the search for ever-more-obscure delimiters - pipes, pilcrows, thorns - that suggests agreeing on & following a standard is really the only safe way for people to exchange data in delimited text files. And a universal standard has to allow any text string to be represented (as does RFC4180), rather than relying on the assumption that some won't need to be & can be put to other work.
Scortchi - Reinstate Monica
2
(a) I've often imported .csv files successfully. (b) I advise people not to use .csv if they have commas within their data. These don't contradict each other. It's unfortunate that (b) needs explanation in some quarters.
Nick Cox

Answers:

33

The CSV format is specified in RFC 4180. This specification was published because

there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files

Unfortunately, since 2005 (when the RFC was published), nothing has changed: we still have a wide variety of implementations. The general approach defined in RFC 4180 is to enclose fields containing characters such as commas in quotation marks; however, this recommendation is not always met by different software.

The problem is that in various European locales the comma serves as the decimal point, so you write 0,005 instead of 0.005. In other cases, commas are used instead of spaces to mark digit groups, e.g. 4,000,000.00 (see here). In both cases using commas as separators can lead to errors when reading data from csv files, because your software does not really know whether 0,005, 0,1 are two numbers or four different numbers (see example here).
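To see the ambiguity concretely, here is a quick sketch in R (the numbers are made up for illustration):

    # The same line could mean the two decimal-comma numbers 0,005 and 0,1,
    # or four separate fields; a comma-splitting reader has no way to tell.
    read.csv(text = "0,005,0,1", header = FALSE)
    #>   V1 V2 V3 V4
    #> 1  0  5  0  1

    # A plain comma split is just as blind to the intent
    strsplit("0,005,0,1", ",")[[1]]
    #> [1] "0"   "005" "0"   "1"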

Last but not least, if you store text in your data file, then commas are much more common in text than, for example, semicolons, so if your text is not enclosed in quotation marks, such data can also easily be read with errors.

Nothing makes commas better or worse field separators, as long as CSV files are used in accordance with recommendations such as RFC 4180, which guard against the problems described above. However, if there is a risk that a simplified CSV format is used that does not enclose fields in quotation marks, or that the recommendation is applied inconsistently, then other separators (e.g. the semicolon) seem to be the safer approach.
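As a sketch of that safer variant, base R's write.csv2 / read.csv2 pair uses the semicolon as separator and the comma as decimal point, so decimal commas and the field separator never collide (the toy data frame below is mine, not from the article):

    df <- data.frame(a = 0.005, b = 0.1)

    # Semicolon-separated output with decimal commas
    write.csv2(df, stdout(), row.names = FALSE)
    #> "a";"b"
    #> 0,005;0,1

    # read.csv2 reads it back as the two intended numbers
    read.csv2(text = '"a";"b"\n0,005;0,1')
    #>       a   b
    #> 1 0.005 0.1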

Tim
6
Well any software implementing the actual CSV standard as defined by RFC 4180 would certainly know exactly how to interpret any given string. The argument that using , instead of a rarer separator bloats the data because you have to escape it all the time is true though. And obviously there's all those people who think they know how CSV works but really don't.
Voo
2
@Voo Yes, but because "csv" files are used in such a chaotic manner, it is safer not to use commas and to use other separators instead, e.g. semicolons. This is the answer to the OP's question. There is nothing "better" about semicolons (or other non-comma separators) compared to commas; they are simply a safer choice in many cases.
Tim
2
@Voo +1 to your comment. However, anybody who is using CSV doesn't really care about bloated data files!
whuber
17

Technically, the comma is as good as any other character to use as a separator. The name of the format states directly that the values are comma-separated (Comma-Separated Values).

The description of the CSV format uses the comma as the separator.

Any field containing a comma should be double-quoted, so this does not cause a problem when reading the data in. See point 6 of the description:

  6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

For example, the R functions read.csv and write.csv use the comma as the separator by default.
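A minimal round trip in base R (the toy data is made up) shows how the default quoting keeps an embedded comma from breaking the field layout:

    df <- data.frame(name = "Doe, Jane", score = 10)

    # Character fields are quoted by default, so the comma inside the name is safe
    write.csv(df, stdout(), row.names = FALSE)
    #> "name","score"
    #> "Doe, Jane",10

    # Reading it back recovers the single field, comma included
    read.csv(text = '"name","score"\n"Doe, Jane",10')$name
    #> [1] "Doe, Jane"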

djhurio
4
This is the best answer, as it refers to the values being comma separated. Others allude to European formatting of numbers, but this is not an issue for the csv standard, as point 6 cited above shows. Divergences from "correct use" exist with any data format. The point is: know your data. Others mention tab- or ;-delimited files, however these can have the same issues as commas when you're dealing with data that is user-entered (perhaps via a form and captured by a database - I've had to wrangle with free-text entry fields that people have fat-fingered tabs into ... it sucks)
Adrian Torrie
Tim's answer has now been edited to include the information @djhurio provided.
Adrian Torrie
11

In addition to being a digit separator in numbers, the comma also forms part of addresses (such as customer addresses) in many countries. While some countries have short, well-defined addresses, many others have long, winding addresses, sometimes including two commas on the same line. Good CSV files enclose all such data in double quotes, but over-simplistic, poorly written parsers don't provide for reading and differentiating such fields. (Then there is the problem of using double quotes as part of the data, such as a quote from a poem.)
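A small illustration in R (the address is invented): with the default quoting the whole address survives as one field, while a naive split on commas shreds it:

    df <- data.frame(name = "A. Person", address = "12, Temple Street, Chennai")

    # Quoted output: the address stays a single field
    write.csv(df, stdout(), row.names = FALSE)
    #> "name","address"
    #> "A. Person","12, Temple Street, Chennai"

    # A simplistic parser that just splits on commas sees four fields, not two
    strsplit("A. Person,12, Temple Street, Chennai", ",")[[1]]
    #> [1] "A. Person"      "12"             " Temple Street" " Chennai"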

Whirl Mind
2
(+1) The standard provides for use of double quotes as part of the data by insisting on doubling them again: "Belloc", "Tarantella", """the fleas that tease in the High Pyrenees""". In England it's not uncommon to find address fields containing the name of a house in quotes, thus: "Chatsworth", Melton Road, Leamington. (It's not clear why: Fowler grumbled that "the implication seems to be: living in the house that sensible people call '164 Melton Road', but one fool likes to call 'Chatsworth'".)
Scortchi - Reinstate Monica
1
@Scortchi It seems that we learned the same poems at age 12 (+/- error). I fear that what I read as unfortunate early 20th century English snobbery of the upper middle-class for the habits of the lower middle-class obscures your last example, which will not be transparent beyond a small group.
Nick Cox
@NickCox: Twelve sounds about right. Funny that I can't remember whether I've read any poems this year, let alone recall any lines from them. Though Fowler's point was about the effect on the reader of unnecessary quotation marks (see unnecessaryquotes.com), I think you're right to see the influence of snobbery in his choice of example. At any rate, I hope the rather minor point that it's something to watch out for if you're ever sent a CSV file containing English addresses is clear to all despite my divagations.
Scortchi - Reinstate Monica
1
in India, it is common for people who build their first homes (not apartments) to give them an innovative, flowery name, often in a vernacular language or a Sanskrit phrase, and those are in double quotes, such as "Guru Kripa". Names like Genelia D'Souza and Derek O'Brien are common too. Then there are addresses that say "Old Door No. nnn / New Door No. mm/c", due to government renumbering, which complicate address storage even further by having slashes and single quotes in unexpected corners.
Whirl Mind
@WhirlMind: That's interesting - I've noticed a lot of - well, more than I'd expect - Scottish Gaelic & Welsh house names in England, which is perhaps the nearest equivalent to picking a vernacular language in which to name your home.
Scortchi - Reinstate Monica
9

While @Tim's answer is correct, I would like to add that "csv" as a whole has no common standard - especially the escaping rules are not defined at all, leading to "formats" which are readable in one program but not another. This is exacerbated by the fact that every "programmer" under the sun just thinks "oooh, csv - I will build my own parser!" and then misses all of the edge cases.

Moreover, csv totally lacks the ability to store metadata or even the data type of a column - leading to several additional documents which you must read to understand the data.

Christian Sauer
5
Yes, there is a standard, tools.ietf.org/html/rfc4180, and many other formats do not store any metadata either - csv is just not designed for storing metadata - .txt files also do not store metadata about text documents...
Tim
4
Tim, that standard is ignored more often than not, making it a non-standard,,,
Christian Sauer
8
The great thing about standards is that there are so many to choose from. (Variously mutated and attributed.)
Nick Cox
4

If you can ditch the comma delimiter and use a tab character, you will have much better success. You can leave the file named .csv, and importing it into most programs is usually not a problem: just specify tab-delimited rather than comma-delimited when you import your file. If there are commas in your data, you WILL have a problem when specifying comma-delimited, as you are well aware.
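For what it's worth, a quick sketch of this in base R (toy data, written to a temporary file):

    df <- data.frame(name = "Doe, Jane", city = "Leamington")

    f <- tempfile(fileext = ".csv")

    # Tab-separated output; the comma in the data no longer clashes with the delimiter
    write.table(df, f, sep = "\t", row.names = FALSE, quote = FALSE)

    # read.delim expects tab-separated input by default
    read.delim(f)
    #>        name       city
    #> 1 Doe, Jane Leamington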

Gorilla
5
If there are tabs in your data, the converse applies. It's just, at least in my experience, less likely.
Nick Cox
@Nick and Gorilla: I've had good results with | as a delimiter in home-brewed csv-like text files of records (with book titles and other document metadata). | never occurs in the data I work with, so I can just write perl scripts that simply split/join without checking for quoting of any kind. This was for a one-off project that just involves processing metadata saved from an MS Access database. For any larger project, or if you plan to keep data in this file-format long-term, pick something more robust! I could always tweak something if this month's batch broke something.
Peter Cordes
@PeterCordes I believe you, and whatever works. But clearly the cost of idiosyncratic separators may be the need to explain those to others and it is key that they can import such data files without difficulty. Faced with an unusual file format, it is necessary to have access to some routine, function or command that can split strings on arbitrary separators.
Nick Cox
@PeterCordes When I wrote a split command for Stata I looked at, among other things, the Perl equivalent to see what it did and didn't do. Not the source code, just the functionality offered.
Nick Cox
1
@NickCox: A lot of perl's functions are quite well designed, IMO. They get the job done without a lot of special limitations like you find in awk (which is often good), or esp. other Unix tools like cut, sort, and uniq.
Peter Cordes
4

ASCII provides us with four "separator" characters, as shown below in a snippet from the ascii(7) *nix man page:

   Oct   Dec   Hex   Char
   ----------------------
   034   28    1C    FS  (file separator)
   035   29    1D    GS  (group separator)
   036   30    1E    RS  (record separator)
   037   31    1F    US  (unit separator)

This answer provides a decent overview of their intended usage.

Of course, these control codes lack the human-friendliness (readability and input) of more popular delimiters, but are acceptable choices for internal and/or ephemeral exchange of data between programs.
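As a tiny proof of concept in R (assuming your tools let you pick an arbitrary single-character separator, which base R does; the data frame is made up):

    df <- data.frame(title = "Crime and Punishment, Vol. 1", year = 1866)

    f <- tempfile()

    # Use the ASCII unit separator (US, 0x1F) as the field delimiter
    write.table(df, f, sep = "\x1f", row.names = FALSE, quote = FALSE)

    # read.table accepts any single-byte separator character
    read.table(f, sep = "\x1f", header = TRUE)
    #>                          title year
    #> 1 Crime and Punishment, Vol. 1 1866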

Ronald Straight
2
Interesting. I don't think I've ever seen these used in the wild though...
Matt Krause
4

The problem is not the comma; the problem is quoting. Regardless of which record and field delimiters you use, you need to be prepared to meet them in the content, so you need a quoting mechanism. AND THEN you need a way for the quoting character(s) themselves to appear too.

Following the RFC 4180 standard makes everything simpler for everybody.
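A short illustration in R of what following the standard looks like for the quote character itself (the sentence is made up): with the default qmethod = "double", embedded double quotes are doubled on output, exactly as RFC 4180 prescribes, and come back intact on input:

    df <- data.frame(line = 'She said "hello", then left')

    write.csv(df, stdout(), row.names = FALSE)
    #> "line"
    #> "She said ""hello"", then left"

    read.csv(text = '"line"\n"She said ""hello"", then left"')$line
    #> [1] "She said \"hello\", then left"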

I have personally had to write a script to probably fix the output from a program that got this wrong, so I am a bit militant about it. "probably fix" means that it worked for MY data, but I can see situations where it would fail. (In that program's defense, it was written before the standard.)

Stig Hemmer