# Import phases
As mentioned in the previous section, there are some technical limitations that prevent us from importing all the data in a single run. Mostly, these are caused by circular dependencies between some of the data classes. Because of the way Prisma handles relations, it is not possible to create a new record and connect it to a record that does not exist yet:
```ts
prisma.programme.create({
  data: {
    // ...
    programmes: {
      connect: {
        id: "related-programme-id", // if this programme does not exist yet, the call fails
      },
    },
    // ...
  },
});
```
To solve this problem, we split the import into two phases. In the first phase, we import all the data without linking it. In the second phase, we link the data together - at that point, all the records already exist in the database, so we are only populating the relation tables.
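Conceptually, the two phases can be sketched as follows. This is a simplified illustration (the record ids here are placeholders); the actual implementation is covered in the next sections.

```ts
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// Phase 1: create the record with its scalar values only - no relations yet.
await prisma.programme.create({
  data: { id: 'programme-id' },
});

// Phase 2: once every record exists, populate the relation tables.
await prisma.programme.update({
  where: { id: 'programme-id' },
  data: {
    programmes: {
      connect: { id: 'related-programme-id' }, // safe now: the target record exists
    },
  },
});
```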
## Creating the data
The first phase populates the actual entity tables. This is done by the `migrateSqliteToPrisma` function in the `feeder` package.
During this phase, the importer removes the `{ connect: ... }` clauses from the transformed rows and runs the `.create()` call with only the scalar values and the `{ create: ... }` directives.
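A minimal sketch of this preprocessing step, assuming a hypothetical `stripConnectClauses` helper (the actual feeder code may differ):

```ts
// Recursively drop every { connect: ... } clause from a transformed row,
// keeping scalar values and { create: ... } directives intact.
function stripConnectClauses(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(stripConnectClauses);
  }
  if (value !== null && typeof value === 'object') {
    const result: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      // A relation field whose only directive is `connect`
      // is dropped entirely during the creation phase.
      if (
        inner !== null &&
        typeof inner === 'object' &&
        !Array.isArray(inner) &&
        Object.keys(inner).length === 1 &&
        'connect' in inner
      ) {
        continue;
      }
      result[key] = stripConnectClauses(inner);
    }
    return result;
  }
  return value;
}
```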
> **Info**
>
> The `migrateSqliteToPrisma` function also optimizes the database accesses. While calling `createMany` in Prisma executes an optimal insert statement, it does not support nested writes (see the related issue). Calling the `create` method for every inserted row would be very inefficient, so the importer batches the rows, calls `createMany` on them, runs the nested imports separately, and then links the data together.

In the following snippet, we see an example transformation function for the `person` table. The `transformer` function is called for every row in the source database and returns an object that is then passed to the `createMany` method.
During the creation phase, we only create the `person` object and its scalar values. The `faculties` and `departments` relations are removed in the query preprocessing and are not connected yet.
```ts
// ...
transformer: (row) => ({
  id: row.PERSON_WHOIS_ID,
  private_id: row.PERSON_ID,
  names: {
    create: [{
      lang: 'en',
      value: row.PERSON_NAME,
    }],
  },
  faculties: { connect: { id: row.FACULTY_ID } },  // these lines are ignored
  departments: { connect: { id: row.DEPT_ID } },   // in the first phase
}),
// ...
```
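Under the hood, the batching described in the info box might look roughly like this. This is a sketch with assumed shapes (`SourceRow`, the batch size, and the per-row `update` fallback for nested writes are illustrative):

```ts
import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();
const BATCH_SIZE = 1000; // illustrative batch size

type SourceRow = Record<string, unknown>; // placeholder for the SQLite row shape

async function importPeople(
  rows: SourceRow[],
  transformer: (row: SourceRow) => any, // e.g. the transformer shown above
) {
  for (let i = 0; i < rows.length; i += BATCH_SIZE) {
    const batch = rows.slice(i, i + BATCH_SIZE).map(transformer);

    // 1. Insert the scalar part of every row with a single createMany call.
    await prisma.person.createMany({
      data: batch.map(({ names, faculties, departments, ...scalars }) => scalars),
      skipDuplicates: true,
    });

    // 2. Run the nested { create: ... } writes separately, since createMany
    //    does not support them. The { connect: ... } clauses wait for phase two.
    for (const row of batch) {
      await prisma.person.update({
        where: { id: row.id },
        data: { names: row.names },
      });
    }
  }
}
```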
## Linking the data
The second phase links the data together. This is done by the `connectPrismaEntities` function in the `feeder` package.

This time, the transformed row is stripped of the scalar values and the `{ create: ... }` directives. The remaining values (the `id` and the `{ connect: ... }` clauses) are then used to connect the data together.
```ts
// ...
transformer: (row) => ({
  id: row.PERSON_WHOIS_ID,
  private_id: row.PERSON_ID,  // these lines are ignored
  names: {                    // in the second phase
    create: [{
      lang: 'en',
      value: row.PERSON_NAME,
    }],
  },
  faculties: { connect: { id: row.FACULTY_ID } },
  departments: { connect: { id: row.DEPT_ID } },
}),
// ...
```
While it would be tempting to use the Prisma `update` function here, updating the tables with raw SQL queries proved to be much faster. Although we mentioned in the database schema section that we don't have control over the database schema, Prisma follows a few predictable rules when generating the tables - see the related documentation page.
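For illustration, assuming PostgreSQL and Prisma's default convention for implicit many-to-many relations (an underscore-prefixed join table with columns `A` and `B`, ordered alphabetically by model name), such a linking query might be sketched like this. The table name `_FacultyToPerson` is an assumption, not necessarily the real schema:

```ts
import { PrismaClient, Prisma } from '@prisma/client';

const prisma = new PrismaClient();

// Bulk-link people to faculties through the implicit relation table.
async function linkPeopleToFaculties(pairs: { facultyId: string; personId: string }[]) {
  const values = Prisma.join(
    pairs.map((p) => Prisma.sql`(${p.facultyId}, ${p.personId})`),
  );
  await prisma.$executeRaw(
    // "A" = faculty id, "B" = person id (assumed default Prisma naming)
    Prisma.sql`INSERT INTO "_FacultyToPerson" ("A", "B") VALUES ${values} ON CONFLICT DO NOTHING`,
  );
}
```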
### Linking people with faculties
Since faculty information is used for most of the visualisations in the Charles Explorer GUI, it is beneficial to have it for as many entities as possible. After the data linking phase, the importer therefore tries to infer the faculties of people whose faculty information is missing.
This logic is implemented in the `fixPeopleFaculties` function in the `feeder` package. For each person without faculty information, the function retrieves their classes, study programmes and publications, builds a histogram of the faculties appearing in these related entities, and picks the most frequent faculty as the person's own.
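A simplified sketch of this inference (the entity shape here is hypothetical):

```ts
type Entity = { facultyIds: string[] };

// Pick the most frequent faculty among a person's related entities.
function inferFaculty(
  classes: Entity[],
  programmes: Entity[],
  publications: Entity[],
): string | undefined {
  const histogram = new Map<string, number>();
  for (const entity of [...classes, ...programmes, ...publications]) {
    for (const facultyId of entity.facultyIds) {
      histogram.set(facultyId, (histogram.get(facultyId) ?? 0) + 1);
    }
  }
  let best: string | undefined;
  let bestCount = 0;
  for (const [facultyId, count] of histogram) {
    if (count > bestCount) {
      best = facultyId;
      bestCount = count;
    }
  }
  return best; // undefined if no related entity carries faculty information
}
```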
## Indexing data
After the data import is done, the importer feeds the Solr index with text data.
The mapping between the entity text fields and the hierarchy levels is described in the `index` file.
The `insertToSolr` function automatically retrieves the text data related to the given entity type and, using the supplied transformation function, maps it to the respective Solr core.
With the `loadConnected` option in the `insertToSolr` call, the function loads each entity together with its related entities. This is used, for example, when indexing the `person` entities, as there is no direct text data associated with people (aside from their names). Using the `loadConnected` option, we can reference their `.classes` and `.publications` and add that data to the `person` index.
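Put together, an indexing call for the `person` core might look roughly like this. The exact `insertToSolr` signature and the field names are simplified assumptions for illustration:

```ts
await insertToSolr({
  core: 'person',
  loadConnected: true, // load each person together with their related entities
  transformer: (person: any) => ({
    id: person.id,
    names: person.names.map((n: any) => n.value),
    // People have no text data of their own (besides names), so we index
    // the text of their connected classes and publications instead.
    text: [
      ...person.classes.flatMap((c: any) => c.names.map((n: any) => n.value)),
      ...person.publications.flatMap((p: any) => p.names.map((n: any) => n.value)),
    ],
  }),
});
```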