Extending Types

This is the next tutorial, see the previous

Once you have your base transformation up and running it is time to extend the fields. Out of the box we already offer various extensions, the full list of Field types can be found in the reference.

Translating IDs

One of the patterns we see in the evolution with microservices is a move away from fully normalized database schema’s. Modern systems rely more on auto-completion, data is cheaper and probably a lot more reasons to store full strings instead of IDs. In data-migrator this is easily supported for hardcoded values with a small python function:

M = {
  0: "NOT SET",
  1: "Hallo",
  2: "Hello",
  3: "Bonjour"
}

def parse_b(v):
  return M.get(int(v), M[0])

class Result(models.Model):
  id = models.IntField(pos=0) # keep id
  ....
  # replace ID with value
  b  = models.StringField(pos=2, parse=parse_b, default="How are you doing?")

Note the values are parsed as string from the CSV reader.``NULL`` is by default translated to None, which is replaced by the default value and will never see the parse function.

Merging columns

Another migration pattern is to merge separate (boolean) columns back to a single enumeration column. To support that use a row parser instead of a single value parser. If no pos is given, the parser will be row based instead of a value parsed and linked to a single column value:

def employment_type(row):
  if row[26] == "1":          # contractor
    return 'contractor'
  elif row[27] == "1":        # intern
    return 'intern'
  else:
    return 'perm'

class Result(models.Model):
  ....
  b  = models.StringField(parse=employment_type, default="perm")

Dynamic lookups

At moments one needs to lookup values in the target database. Do not be shy to generate dynamic lookups in the target database using SELECT statements that run during import into the target database.

class Result(models.Model):
  recruiter_uuid = models.StringField(pos=38,
    replacement=lambda x:'(SELECT uuid FROM `persons` WHERE `mail`="%s" limit 0,1)' % x)

This can off course be combined with python based transformations to fix deleted values:

def recruiter(v):
  if v is None or v in ['missing1@mail.com', 'missing2@mail.com']:
      return 'default_person@mail.com'
  else:
    return v

class Result(models.Model):
  recruiter_uuid = models.StringField(pos=38, parse=recruiter,
    replacement=lambda x:'(SELECT uuid FROM `persons` WHERE `mail`="%s" limit 0,1)' % x)

The output is a value for the target database as being the input for a query on that target database. BTW replacement by output string is considered a default. Therefore the replacement string in format type of transformation can be provided directly:

...

class Result(models.Model):
  recruiter_uuid = models.StringField(pos=38, parse=recruiter,
    replacement='(SELECT uuid FROM `persons` WHERE `mail`="{}" limit 0,1)')

Table lookups

For larger tables there is support for table driven lookups from external CSV files. It is also possible for the map (just a key,value) to be ad-hoc generated by other means. data-migrator offers a helper function read_map_from_csv() to read the csv.

from data_migrator.contrib.read import read_map_from_csv

class Result(models.Model):
  country = models.MappingField(pos=33, default='NLD',
    data_map=read_map_from_csv(f=open('data/country.csv'), delimiter=';', key='country_id', value='alpha3'))

Combining table lookups

The table lookup and column reduction can also be combined. Consider a multi state entity with specific lookup values to be merged in one lookup value:

from data_migrator.contrib.read import read_map_from_csv

LOOKUP1 = read_map_from_csv(f=open('data/state1.csv'), delimiter=';', key='id', value='name')
LOOKUP2 = read_map_from_csv(f=open('data/state2.csv'), delimiter=';', key='id', value='name')

def parse_lookup(row):
  return LOOKUP1.get(row[1], LOOKUP2.get(row[2], ''))

class Result(models.Model):
  ...
  state = models.StringField(parse=parse_lookup)

Flatten multi values

The most extensive many-2-many flattening is for example a tagging of multiple values to a main entity. This is mostly implemented in a 3 table structure, following the classic normalization approach:

  • A table with the main entity (for example persons)
  • a table with the attributes in a fixed id,value structure and last
  • a many-to-many table linking the attributes to the main entities.

A simple approach to flatten this is to encode this a JSON list, to transform the data use a four step approach:

  1. Extract the data from the old system fully expanded
  2. Read the CSV and flatten to a map of lists
  3. Link the values at read time replacing the main ID with lists
  4. Emit the whole as a JSON list

The first step relies on queries like:

SELECT
        P.person_id,
        S.name as skill
FROM person_skill_m2m P
INNER JOIN skills S
        ON S.id=P.skill_id;

After that, loading and emitting to JSON is simply using the MappingField

from data_migrator.contrib.read import read_map_from_csv

class Result(models.Model):
  skills = models.MappingField(pos=0, default=[], as_json=True,
    data_map=read_map_from_csv(f=open('results/skill.csv'), key='candidate_id', value='skill', as_list=True))

Now take these examples and mix your own. It is standard Python, we have no doubt you can come up with all kinds of amazing transformations.