
Products of HFLabs in industrial volumes process data: addresses, full names, company details and another car in total. Naturally, testers deal with this data every day: update test cases, study the results of cleaning. Often, customers provide a “live” base for the tester to set up a service for it.
The first thing we learn about new QA is to keep the data in its original form. All the covenants: "Do no harm." In the article I will tell you how to work with CSV files in Excel and Open Office. Tips will help not to spoil anything, save the information after editing and generally feel more confident.
')
Material base, professionals absolutely bored.What are CSV files?
The CSV format is used to store tables in text files. Data is often packaged in tables, so CSV-files are very popular.
A CSV file consists of data lines and delimiters that indicate the boundaries of the columns.CSV stands for comma-separated values ​​- comma-separated values. But don't let the name fool you: the semicolons and tabs can serve as column delimiters in a CSV file. This will still be a CSV file.
CSV has a lot of advantages in front of the same Excel format: text files are as simple as a button, open quickly, read on any device and in any environment without additional tools.
Because of its advantages, CSV is a super-popular data exchange format, although it is already 40 years old. CSV has been using industrial application programs, it uploads data from databases.
One problem - a text editor to work with CSV is not enough. Nothing if the table is simple: in the first field the ID is of one length, in the second one is in one format, and in the third one is some address. But when the fields are of different lengths and there are more than three of them, the torment begins.
Follow the delimiters and columns - break eyesEven worse with data analysis - try Notepad to at least add up all the numbers in a column. I'm not talking about beautiful graphics.
Therefore, CSV-files are analyzed and edited in Excel and analogues:
Open Office ,
LibreOffice and others.
Veterans who still read it: guys, we know about the analysis directly in the database using SQL, we know about Tableau and Talend Open Studio . This article is for beginners, but at a basic level and a small amount of data Excel with analogs is enough.How Excel spoils data: from classics
Everything would be fine, but Excel, barely opening a CSV file, starts its crafty freaks. Without a demand, he changes the data so that they become unusable. And it does this completely unnoticed. Because of this, in due time, we grabbed a bunch of problems.
Most incidents are due to the fact that a program without demand converts strings with a set of numbers into numbers.
Rounds. For example, in the source cell two phones are stored separated by commas without spaces: "5235834,5235835". What will Excel do? Famously turn the numbers into one number and round up to two digits after the decimal point: "5235834,52". So we lose the second phone.
Leads to exponential form. Excel carefully converts "123456789012345" to the number "1.2E + 15". The original value will lose completely.
The problem is relevant for long, characters of fifteen, digital lines. For example, KLADR codes (this is such a state identifier of the address object: city, street, house).
Removes the leading advantages. Excel considers that a plus at the beginning of a line with numbers is a completely extra character. They say, and so it is clear that the number is positive, since it is not worth a minus. Therefore, the leading plus in the number “+74955235834” will be rejected as superfluous - you get “74955235834”. (In reality, the number will suffer even more, but for clarity, I will manage a plus).
The loss of a plus is critical, for example, if the data goes to a third-party system, and that when importing it rigidly checks the format.
Splits three digits. A numeric string longer than three characters of Excel, kind soul, carefully disassemble. For example, "8 495 5235834" will turn into "84 955 235 834".
Formatting is important at least for telephone numbers: spaces separate country and city codes from the rest of the number and from each other. Excel easily violates the correct division of the phone.
Removes leading zeros. Excel will turn the string "00523446" into "523446".
And in the TIN, for example, the first two digits are the region code. For the Republic of Altai it starts from scratch - “04”. Without zero, the meaning of the number will be distorted, and the TIN format check will not pass at all.
Changes dates for local settings. Excel will gladly fix the house number "1/2" to "01.fev." Because Windows has suggested that in this form it is more convenient for you to read the dates.
We defeat data corruption with correct import.
But seriously, it’s not Excel that’s entirely to blame for the trouble, but the unobvious way to import data into the program.
By default, Excel applies the “General” type to the data in the downloaded CSV file — common. Because of it, the program recognizes numeric strings as numbers. This order can be won using the built-in import tool.
I launch the import mechanism which is built in Excel. In the menu this is “Data → Get External Data → From Text”.
I choose a CSV file with data, a dialog opens. In the dialog, I click on the Delimited file type (with delimiters). Encoding - the one in the file, usually determined by the machine. If the first line of the file is a header, I mark “My Data Has Headers”.
I turn to the second step of the dialogue. I select the field separator (usually a semicolon - semicolon). Turning off “Treat consecutive delimiters as one”, and “Text qualifier” I put in “{none}”. (Text qualifier is a symbol of the beginning and end of the text. If the separator in CSV is a comma, then the text qualifier is needed to distinguish commas inside the text from commas-separators.)
In the third step, I choose the format of the fields , and everything was started for him. For all columns I set the type "Text". By the way, if you click on the first column, hold down the shift and click on the last one, all columns will be highlighted at once. Conveniently.
Then Excel will ask where to insert data from CSV - you can simply click "OK", and the data will appear in the open sheet.
Before importing, you will have to create a new workbook in Excel.But! If I plan to add data to CSV through Excel, I have to do one more thing.After importing, you need to force all cells on the sheet to the “Text” format. Otherwise, the new fields will get the same type of “General”.
- Press Ctrl + A twice, Excel selects all the cells on the sheet;
- I click the right mouse button;
- I select Format Cells in the context menu;
- In the dialog that opens, I select the “Text” data type on the left.
To select all the cells, press Ctrl + A twice. Just two, this is not a joke, tryAfter that, if you're lucky, Excel will leave the original data alone. But this is not the strongest guarantee, so after saving we will definitely check the file through a text viewer.
Alternative: Open Office Calc
I use Calc to work with CSV files. It is not that it does not consider digital data at all as strings, but at least it does not apply reformatting to them in accordance with the regional settings of Windows. And import is simpler.
Of course, you will need an Open Office (OO) package. During installation, it will offer to reassign MS Office files to itself. I do not recommend: although OO is quite functional, it does not fully understand the cunning microsoftware formatting of documents.
But assigning OO as the default program for CSV files is quite reasonable. You can do this after installing the package.
So, we start importing data from CSV. After double clicking on the file, Open Office displays a dialog.
Note that in OO you do not need to create a new workbook and force the import to run, all by itself- Encoding - as in the file.
- A “delimiter” is a semicolon. Naturally, if it is the separator in the file.
- The “text delimiter” is empty (all the same as in Excel).
- In the "Fields" section I click in the left-upper square of the table, all columns are highlighted. I specify the type "Text".
A thing that spoiled a lot of blood: if by mistake you select several field separators or the wrong text is separated, the file may open correctly, but it may not be saved correctly.
In addition to Calc, libreOffice is popular in HFLabs, especially under Linux. Both are used more for CSV than Excel.
Bonus Track: problems saving from Calc to .xlsx
If you save data from Calc to Excel format .xlsx, keep in mind - OO sometimes inexplicably losing data on a large scale.
The white wasteland in the middle is rich in data in the original CSV file.Therefore, after saving, I once again open the file and make sure that the data is in place.
If something is lost, treatment is resave from CSV to .xlsx. Or, if Windows is installed, import from CSV to Excel and save from there.
After re-saving, I must once again check that all data is in place and there are no extra empty lines.
If you are interested in working with data, look at our vacancies . HFLabs almost always need analysts, testers, implementation engineers, developers. We will provide data so that it will not find it :)