About a month ago, I downloaded an open source visualization tool for a quick evaluation. I had planned on using my trusty 276,000 record, several dozen attribute delimited file of census data to see how the tool performed with real world volumes. Alas, I ran into a problem right away. My data file, which consists of roughly equal numbers of numerics and string fields interpreted as category values, failed the load. It seems the tool could only handle commas as delimiters, whereas my file used semi-colons. And I couldn't simply change all semi-colons to commas, since the text fields had embedded commas that would subvert the load with a comma delimiter. What I needed to do was first surround each text field with quotes to protect the embedded commas, then change the semi-colon delimiter to a comma.

I could easily have solved my problem using OpenOffice Calc for a one-off solution with CSV files, but decided instead to dust off a favorite open source tool, the programming language Ruby. The short script I ended up writing first looked at 100 records to get a consensus of which fields were numeric and which were text, storing the text field column positions. It then read all records, prefixing and appending quotes to text fields. Finally, the script changed the delimiter from semi-colon to comma, writing it's results to a file that appeased the visualization tool loader. It took me a couple of hours, but I now have a tool that can generically address this type of data problem.

Register or login for access to this item and much more

All Information Management content is archived after seven days.

Community members receive:
  • All recent and archived articles
  • Conference offers and updates
  • A full menu of enewsletter options
  • Web seminars, white papers, ebooks

Don't have an account? Register for Free Unlimited Access