HPCC Internal Entity Linking through SALT – A Quick Start Guide – Pt. 2


This post is the second part of a series that aims to provide simple steps for a novice user to “get going” with HPCC’s SALT Internal Linking. Please see here for further information on the series. 

Having installed SALT, configured the IDE and sprayed the dataset into your cluster (as described in Part 1 of this series), you should now be in a position to:

  1. Start preparing the solution in the ECL IDE
  2. Create a draft SALT Specification File
  3. Generate ECL Code based on the draft SALT spec file


ECL IDE Solution Preparation

Layout and Input File

Create a folder titled People within your My Files folder. Within this folder, add an ECL Attribute titled Layout_People containing the record definition as follows:

EXPORT Layout_People := RECORD
   UNSIGNED6 id;
   UNSIGNED6 rid;
   STRING30 first_name;
   STRING30 last_name;
   STRING30 email;
   STRING10 gender;
   INTEGER age;
   STRING30 profession;
END;

Furthermore, add another ECL Attribute (file) titled In_People containing the following:

EXPORT In_People :=
DATASET('~salt::tutorial::input', Layout_People, CSV);
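
Before moving on, it can be worth confirming that the layout and the logical file name agree with the sprayed data. The following is a minimal sketch to run from a builder window, assuming the People folder can be imported as a module and that the file was sprayed as ~salt::tutorial::input in Part 1. The NAMED labels are arbitrary.

```ecl
IMPORT People;

// Peek at the first ten rows to check that the columns line up with Layout_People.
OUTPUT(CHOOSEN(People.In_People, 10), NAMED('Sample_Rows'));

// Total row count; useful later as the RECORDS estimate in the spec file.
OUTPUT(COUNT(People.In_People), NAMED('Total_Rows'));
```

If the sample rows show values shifted into the wrong columns, the usual suspects are the field order in Layout_People or the CSV options on the DATASET definition.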

Importing the SALT Modules

Along with the SALT installer, you should have been provided with three .mod files titled as follows:

  • SALTxx.mod
  • SALTTOOLSxx.mod
  • ut.mod

Here, xx refers to your version of SALT. Through the ECL IDE, perform the following steps:

  1. Go to the Open Menu
  2. Navigate to the mod files and select all three of them.
  3. Click on Open
  4. On the popup, select the My Files folder as your target
  5. Start the import

This will import all the SALT code into your solution, ready to be utilised for your iterations. It could take a few minutes since there are quite a few files to import. At the end of the import, your directory tree should look like the below:

Figure 1: Solution after SALT Modules import


Compiling a Draft SALT Specification File

Before we can start running our entity linking iterations, we need to perform some basic analysis in order to fine-tune our subsequent activities. For these analysis actions to take place, we need to create a SALT Specification file with the Specificity parameters set to their default values (please refer to the SALT Reference Document for more information on specificities). We will refer to this initial specification file as the “draft” spec file. Using this spec file, the SALT compiler will generate a set of ECL files that can subsequently be executed against our HPCC cluster.

Before we proceed, it makes sense to define a set of logical rules that indicate which entities should be clustered together. Within any business domain, these rules would have to be discussed and verified with a domain expert. For our example, we are using the following:

Rule #1 (Highest Weight)

If FirstName, LastName and Email are identical, then the likelihood of the two entities being the same is very high (almost certain).

Rule #2 (High Weight)

If the Email addresses are identical, then the likelihood of the two entities being the same is high (in real life this would probably indicate certainty).

Rule #3 (Low Weight)

If the gender is the same, then there is a slightly higher chance of the two entities being the same.

Rule #4 (Low Weight)

If the ages are within ±5 years, then there is a slightly higher chance of the two entities being the same.

Rule #5 (No Weight)

For the purposes of this example, we will assume that the profession should not be taken into consideration when determining the similarity of entities.
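
To make Rule #4 concrete, the ±5 year window can be written as a one-line ECL function. This is only an illustrative sketch: AgesWithinFive is a hypothetical helper name, it is not part of SALT, and it is not how the draft spec file treats the age field.

```ecl
// Hypothetical helper (not a SALT built-in):
// TRUE when the two ages differ by at most 5 years.
BOOLEAN AgesWithinFive(INTEGER l_age, INTEGER r_age) := ABS(l_age - r_age) <= 5;

OUTPUT(AgesWithinFive(30, 34), NAMED('Difference_Of_4'));
OUTPUT(AgesWithinFive(30, 36), NAMED('Difference_Of_6'));
```

Taking the absolute difference keeps the check symmetric, so it does not matter which of the two records is treated as the left-hand one.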

Note: This tutorial does not aim to teach the SALT Spec Syntax in any shape or form. That is covered by the various SALT reference material. However, the majority of the directives should be human readable.

Based on the Rules above, we have created a Draft Spec File as follows. The inline comments explain the various elements:

// The Module corresponds to the root folder under which we are operating
MODULE:People

// The Filename will be used to identify the data file (In_)
// and layout (Layout_). Please note that the generated ECL code
// follows these conventions, so it is better to stick with them.
FILENAME:People

// Indicates that our dataset has an ID field
// and assigns it accordingly.
IDFIELD:EXISTS:id

// Indicates the RecordID field in our dataset.
RIDFIELD:rid

// Indicates an approximation of the total number of records
// in the data file. It doesn't have to be exact - an estimate is fine.
RECORDS:1000

// Indicates the expected count of unique entities
// in our dataset. For example, if our people dataset
// has 1 Million rows, but we know that we should be
// expecting only 1000 unique people (after the linking),
// then the population is 1000
POPULATION:1000
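
// Indicates the precision required of the linking, expressed
// as a number of nines (2 corresponds to 99%).
NINES:2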

// Indicates that all Fields should be treated as of-type default.
// During linking, any left whitespace will be removed, whilst all the
// strings will be uppercased prior to comparisons being performed.
// Please note that these operations are not persisted in the dataset.
// They are only performed during the linking process.
FIELDTYPE:DEFAULT:LEFTTRIM:CAPS:NOQUOTES("'):ONFAIL(CLEAN):

// Using the CARRY directive, we are stating that we
// want the profession to be ignored during linking, but we want the
// profession to be included in the post-iteration output.
FIELD:profession:CARRY

// The following fields are being used for the
// entity linking, with a default Weight of 1.
// Specificities have been set to the default 10.
FIELD:first_name:10,0
FIELD:last_name:10,0
FIELD:gender:10,0
FIELD:age:10,0

// IMPORTANT - the "age" field definition above does not take into
// consideration any variations in age - it will simply check if two ages
// are the same - if not, it will mark it negatively. In later
// stages we can add custom functions to calculate age differences.

// We are assigning extra weight to emails being identical,
// by setting Weight = 5 as per rule #2
FIELD:email:WEIGHT(5):10,0

// Here we are defining a Concept - a group of fields that together
// form a concept. A concept itself can be assigned a weight.
// In this scenario, we are saying that first_name, last_name and
// email all together form a concept of very high weight (as per rule #1)
CONCEPT:name_email:first_name:last_name:email:WEIGHT(10):10,0
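
The RECORDS and POPULATION values in the spec are estimates, and RIDFIELD relies on rid identifying each row. A rough way to sanity-check all three from a builder window is sketched below. It assumes the People folder can be imported as a module, and using distinct emails as a stand-in for the number of unique people is purely an assumption based on Rule #2.

```ecl
IMPORT People;

ds := People.In_People;

// Basis for the RECORDS estimate.
OUTPUT(COUNT(ds), NAMED('Total_Rows'));

// Should equal Total_Rows if every row carries its own rid.
OUTPUT(COUNT(DEDUP(SORT(ds, rid), rid)), NAMED('Distinct_Rids'));

// Rough hint for POPULATION (assumption: one email per person).
// Note that emails are compared exactly as stored, with no trimming or case folding.
OUTPUT(COUNT(DEDUP(SORT(ds, email), email)), NAMED('Distinct_Emails'));
```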

Generating ECL Code based on the draft SALT spec file

The next step is to generate the ECL code that will allow us to calculate the Field Specificities, using the Draft SALT specification file above. These specificities will then be “fed” back to the SALT specification file, which will allow SALT to optimise the linking algorithms.

To generate the ECL Code we will be using the SALT.exe utility, which should be stored under your SALT installation directory (for example C:\Program Files (x86)\HPCCSystems\{version}\SALT). You are encouraged to execute SALT.exe and have a look at the various parameters available – the ones used by our commands will be explained sufficiently. Please follow the steps below:

  1. Save the Draft Specification file as “people_draft.salt”, in the same folder as SALT.exe
  2. Open the Command Line and navigate to the same folder.
  3. Execute the following command:
salt.exe -gs people_draft.salt > People_Draft.mod

The statement above translates as follows:

Using the SALT spec file people_draft.salt, generate the ECL files that will allow us to calculate the field specificities. This is indicated by the -gs parameter; alternatives would be gh for data hygiene, g for iterations etc. The generated ECL files will be wrapped in a file titled “People_Draft.mod”, which can then be imported into the ECL IDE.


Proceed to Part 3 of the Guide…
