
What are your options for a file format if you want to store timeseries data?
CSV or JSON Lines: The simple, stupid option, which is not to be overlooked. It works as long as you don’t have too much data or too stringent performance requirements.
Parquet: Apache Parquet is the industry standard for columnar data used by data lakes and similar. Possible alternatives include ORC; however, it seems that ORC’s popularity is waning.
Arrow: Apache Arrow is an in-memory format for columnar data. It’s great, but it’s not designed for efficient long-term storage.
DuckDB: DuckDB is an embedded database focused on columnar data. Like SQLite, it can be a relevant option for storing data. Since version 0.10, DuckDB promises backwards compatibility for its database files.
The next-generation columnar formats: Lance promises to be better suited for ML than Parquet. Vortex just promises to be all-around better than Parquet. BtrBlocks and FastLanes are more academic projects.
Roll your own: The fun option.
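To make the simple option concrete, here is a minimal sketch of round-tripping a tiny timeseries through CSV and JSON Lines using only the Python standard library. The sample data, the `timestamp` and `value` column names, and the helper function names are all made up for illustration; a real dataset would likely have more columns and proper timestamp parsing.

```python
import csv
import io
import json

# Hypothetical sample: (ISO 8601 timestamp, sensor reading) pairs.
samples = [
    ("2024-01-01T00:00:00Z", 21.5),
    ("2024-01-01T00:01:00Z", 21.7),
    ("2024-01-01T00:02:00Z", 21.6),
]


def to_csv(rows):
    """Serialize rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["timestamp", "value"])
    writer.writerows(rows)
    return buf.getvalue()


def from_csv(text):
    """Parse CSV text back into (timestamp, value) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    # CSV has no types: every field comes back as a string.
    return [(row["timestamp"], float(row["value"])) for row in reader]


def to_jsonl(rows):
    """Serialize rows as JSON Lines: one JSON object per line."""
    return "".join(
        json.dumps({"timestamp": t, "value": v}) + "\n" for t, v in rows
    )


def from_jsonl(text):
    """Parse JSON Lines text back into (timestamp, value) tuples."""
    objects = (json.loads(line) for line in text.splitlines())
    return [(obj["timestamp"], obj["value"]) for obj in objects]


csv_text = to_csv(samples)
jsonl_text = to_jsonl(samples)
```

Note the trade-off even at this scale: CSV states the column names once in the header but loses the types, while JSON Lines keeps numbers as numbers but repeats the keys on every row.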
Photo: A small isle on frozen Lake Bodom in Espoo.