Storing timeseries data

What are your options for a file format if you want to store timeseries data?

CSV or JSON lines: The simple, stupid option, which should not be overlooked. It works as long as you don’t have too much data or too stringent performance requirements.
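As a minimal sketch of the JSON-lines approach using only the Python standard library (the field names are illustrative):

```python
import io
import json

# Each point is one JSON object per line -- trivially appendable.
points = [
    {"ts": "2024-01-01T00:00:00Z", "sensor": "a", "value": 1.5},
    {"ts": "2024-01-01T00:01:00Z", "sensor": "a", "value": 1.7},
]

buf = io.StringIO()  # stands in for a real file
for p in points:
    buf.write(json.dumps(p) + "\n")

# Reading back is just as simple: one json.loads per line.
decoded = [json.loads(line) for line in buf.getvalue().splitlines()]
```

The same pattern works with the `csv` module; JSON lines just handles schema changes more gracefully.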

Parquet: Apache Parquet is the industry standard for columnar data, used by data lakes and similar systems. ORC is a possible alternative, but its popularity seems to be waning.

Arrow: Apache Arrow is an in-memory format for columnar data. It’s great, but it’s not designed for efficient long-term storage.

DuckDB: DuckDB is an embedded database focused on columnar data. Like SQLite, it can be a relevant option for storing data. Since version 0.10, DuckDB promises backwards compatibility for its file format.

The next-generation columnar formats: Lance promises to be better suited for ML workloads than Parquet. Vortex simply promises to be all-around better than Parquet. BtrBlocks and FastLanes are more academic projects.

Roll your own: The fun option.
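In the roll-your-own spirit, a toy binary format is only a few lines: fixed-width records of (uint64 timestamp, float64 value), little-endian. This layout is purely illustrative, not a recommendation:

```python
import struct

# One record = 8-byte unsigned timestamp + 8-byte double, little-endian.
RECORD = struct.Struct("<Qd")

def encode(points):
    """Pack (ts, value) pairs into a flat byte string."""
    return b"".join(RECORD.pack(ts, v) for ts, v in points)

def decode(blob):
    """Unpack the byte string back into (ts, value) pairs."""
    return [(ts, v) for ts, v in RECORD.iter_unpack(blob)]

blob = encode([(1704067200, 1.5), (1704067260, 1.7)])
```

Fixed-width records make random access trivial (record i lives at offset 16 * i), which is part of the fun of rolling your own.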

Photo: A small isle on frozen Lake Bodom in Espoo.


About the author: My name is Miikka Koskinen. I'm an experienced software engineer and consultant focused on solving data storage problems in the cloud: ingesting the data, storing it efficiently, scaling the processing, and optimizing the costs.

Could you use help with that? Get in touch at miikka@jacksnipe.fi.

Want to get these articles to your inbox? Subscribe to the newsletter: