The purpose of this series of blog posts is to map out some very simple steps that allow an HPCC SALT beginner to start running Internal Linking iterations, based on a basic set of linking rules. It does not aim to cover the full breadth of features in SALT, nor does it claim that everything stated here constitutes “good practice”; for both, refer to the SALT documentation. It should enable a novice to generate ECL code through SALT that compiles, and to examine the output of the various internal linking iterations.
The guide is split over multiple posts for the sake of readability, starting from the prerequisites and continuing with the sample dataset preparation, configuration of SALT specification files and finally, the execution of the linking iteration(s):
- Part 1:
- Part 2:
- Part 3:
- Part 4:
- Reviewing the Iteration Output
- Writing Custom Functions in SALT
Before proceeding with the guide, a number of prerequisites should be met:
- You should have access to an HPCC Cluster, against which you can upload/spray data and execute ECL Workunits. This could be the HPCC Virtual Machine that can be found here
- You should have the latest version of SALT installed on your workstation. Please note that SALT is not an open source component and has to be purchased.
- You should have the latest version of the ECL IDE installed on your workstation.
The Use Case
A simple use case will be employed, based on a small dataset (~1000 rows). Such simplicity will allow you to easily follow the logistics of entity linking without having to trawl through massive datasets. As usual, the “People” dataset will be used, with each data row containing the following information:
- First Name
- Last Name
- Email Address
Whilst the dataset consists mainly of unique data entries, we have introduced a number of duplicate records with minor differences, for example:
We know that the three highlighted entries above probably refer to the same person. We will “feed” SALT with the appropriate rules so that it can quantify how likely they are to be the same. Of course, in the real world an identical email address would be enough, but for the purposes of this tutorial we will use more complicated rules.
Sample Dataset Preparation
SALT Entity Linking requires two fields of type INTEGER to be present in the file that is fed to the algorithm. Both are already present in the provided dataset:
- RecordID – an identifier that uniquely identifies each row in the dataset. In this example, this could simply be the number of the row.
- ID – will initially be the same as the RecordID, however as we proceed through the Entity Linking iterations, the ID will be updated by SALT as it starts clustering entries together.
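The starting state of these two fields can be sketched in plain Python. This is only an illustration, not ECL and not code generated by SALT: the `add_link_ids` helper and the sample rows are hypothetical, and the provided dataset already carries both fields.

```python
def add_link_ids(rows):
    """Return copies of the rows with RecordID and ID set to the 1-based row number."""
    enriched = []
    for row_number, row in enumerate(rows, start=1):
        # Before any linking iteration, ID is simply a copy of RecordID.
        enriched.append({"RecordID": row_number, "ID": row_number, **row})
    return enriched


# Hypothetical rows, for illustration only (not taken from the sample dataset).
people = [
    {"FirstName": "John", "LastName": "Adams", "Email": "john.adams@example.com"},
    {"FirstName": "Jon", "LastName": "Adams", "Email": "john.adams@example.com"},
    {"FirstName": "Mary", "LastName": "Smith", "Email": "mary.smith@example.com"},
]

for record in add_link_ids(people):
    print(record["RecordID"], record["ID"], record["FirstName"], record["LastName"])
```

Every row starts out as its own cluster of one, which is why RecordID and ID are identical at this point.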
For example, let’s assume that the Adams entries (shown in figure 1 above) are enriched with RecordID/IDs as shown below:
As can be seen, each row has a distinct RecordID/ID. Such uniqueness implies that all three entities are distinct and not related in any way. However (assuming that the relevant SALT entity linking iterations have been executed successfully) we will be left with something like the following:
SALT would update the ID of each entry to point to the RecordID that it has been clustered against. In the above scenario, it indicates that Records 1, 2 and 3 have been clustered together and all refer to the same entity as the first entry.
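The effect on the ID field can be sketched in plain Python. This is a minimal illustration under stated assumptions, not SALT's actual algorithm: the matched pairs are assumed to be given (how SALT scores and decides matches is not modelled), the `cluster_ids` helper is hypothetical, and the choice of the lowest RecordID as the cluster ID simply mirrors the scenario described above. Record 4 is an invented unmatched record added for contrast.

```python
def cluster_ids(record_ids, matched_pairs):
    """Map each RecordID to the lowest RecordID in its cluster."""
    # Every record starts as its own cluster (ID == RecordID).
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the cluster root, shortening the path along the way.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the higher root to the lower one, so the root
            # of a cluster is always its smallest RecordID.
            low, high = sorted((root_a, root_b))
            parent[high] = low

    return {rid: find(rid) for rid in record_ids}


# Records 1, 2 and 3 are assumed to have been matched; record 4 is unrelated.
ids = cluster_ids([1, 2, 3, 4], [(1, 2), (2, 3)])
for record_id, cluster_id in ids.items():
    print(f"RecordID={record_id} ID={cluster_id}")
```

Note that record 3 ends up in the same cluster as record 1 even though only the pairs (1, 2) and (2, 3) were supplied: in this sketch, clustering is transitive.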
Spraying the Sample Dataset
The sample data file can be downloaded here and sprayed as a Delimited File through the HPCC Landing Zone. For the purpose of this tutorial, the following logical file name will be used:
ECL IDE Setup
In the ECL IDE, before connecting to your HPCC Cluster, click on the Preferences button. Then navigate to the Compiler tab and ensure that the -legacy argument has been provided as follows:
This entire tutorial has been built with this argument in place, which ensures that the various module imports are handled in the correct way.