Data transformation is a classic problem in compute, but often underestimated in modern software development. Everybody working with persistent data will be involved somehow in restructuring existing data in databases or files, when systems and architectures evolve. A wide range of practices exist, ranging from ad-hoc scripts to sophisticated ETL processes. We were upgrading existing codebases at Schuberg Philis, moving from a mono-lithical application to a microservice architecture. At that moment we found ourselves writing ad-hoc python data migration scripts. Table by table the transformation was done, simply by exporting existing data in CSV’s. With some simple python scripts - a single read/print loop - new INSERT statements for the target database were generated.
How hard can it be? Just some basic processing and INSERT statement emitting, but soon we had to cleaning/fixing data, generating multiple records out of single rows, combine columns, lookup data from other sources. The scripts became larger, more exceptions: unreadable and hard to maintain. That was the moment when we came up with the idea of applying a more declarative approach. Being pretty charmed by the Django model, we adopted that as a base. Soon after, a standardized system based on a declarative definitions originated.
This package is a simple alternative to doing ad-hoc scripts. It is easy to learn, easy to extend and highly expressive in building somewhat more complex transformations. Think of for example:
- renaming, reordering columns
- changing types
- lookup data, advanced transformations
- generating permission records in separate tables for main data
There are many packages that offer parts to do this stuff, some packages we know:
- ruffus: A nice data transformation package, however more geared to long running computational processes (DNA Sequencing) and less to frictionless table transformation.
- petl: the name says it all extract, tranform and load. You can do cool stuff with it. Basically moving table manipulation to Python. We preferred a more direct map and transform approach.
- tablib: very handy if you need to open various sources, a nice extension to data-migrator
- pandas: A big one, if you need more data analysis and less of datamigration, this could be your favorite poison.
- Your average bulk of SQL scripts. Still a useful tool, we love the ease of repetiveness using the pipes-and-filters paradigm in data-migrator.
- R: R is not Python, but tremendously handy if you need large data set processing and statistics.