Introduction

splinter

Splinter - Transactional Data Enrichment from monily.co

Splinter is a set of containerised services compatible with orchestration layers such as Kubernetes. It processes emails and matches them to bank transactions in three steps:

  1. Categorise the email.
  2. Parse out the pertinent details.
  3. Match to the corresponding bank transaction.

Obtaining the images

We will grant your Google account access to our container registry on Google Cloud. Please email info@monily.co to request access. Once you have access, follow the instructions on the Google Cloud website to pull the containers from the registry. If you haven’t installed the Google Cloud CLI, you will need to install it first and authorise your account. Follow these steps:

  1. Install the Google Cloud SDK
  2. Tell Docker to authenticate with Google
  3. Pull our containers

For example, once authenticated you will be able to use the docker pull command to obtain the image for the categoriser as follows:

docker pull eu.gcr.io/monily-splinter/categoriser:latest

Running the containers

Each image can be instantiated with the docker run command and exposes a REST interface with a single endpoint per container. For example, to run the categoriser image pulled above you could use:

docker run --name categoriser -p 5000:5000 eu.gcr.io/monily-splinter/categoriser:latest

This will forward port 5000 on your machine/instance to port 5000 on the container. Once the container is running the endpoint will be available at:

localhost:5000/categoriser

or

container.ip.address:5000/categoriser
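The same address pattern applies to all three services, so client code can build the endpoint from the service name. The sketch below is a minimal illustration: endpoint_url and SERVICE_PORTS are hypothetical names, and it assumes each container is published on the port documented in its section (5000, 5001 and 5002) and reached over plain http.

```python
# Ports as documented for each Splinter container in this guide.
SERVICE_PORTS = {"categoriser": 5000, "parser": 5001, "matcher": 5002}


def endpoint_url(service, host="localhost"):
    """Return the REST endpoint URL for one of the three services."""
    if service not in SERVICE_PORTS:
        raise ValueError(f"unknown service: {service!r}")
    # Each container exposes a single endpoint named after the service.
    return f"http://{host}:{SERVICE_PORTS[service]}/{service}"


print(endpoint_url("categoriser"))
```

If you forward the containers to different host ports with docker run -p, adjust the mapping accordingly.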

POST ip.address:5000/categoriser Categoriser

/categoriser

Parameter Value
Container registry address: eu.gcr.io/monily-splinter/categoriser:latest
Port: 5000
Endpoint: ip.address:5000/categoriser

The categoriser takes one or more html email documents as its input. It then employs our pre-trained machine learning models to infer the category of each email. First, the models predict whether or not the email is a retail email. Second, if the email is predicted to be a retail email, a further categorisation is made as to where in the retail journey the email belongs. This can be either Order, Delivery, Refund or Return.

Inputs

The endpoint is accessed via the POST method, with the input data supplied as JSON in the request body. This service requires an object with a single field, emails: an array of email objects, each containing the subject and the html formatted email body body_html as strings. For example:

    {
      "emails": [
        {
          "subject": "Take the bins out please",
          "body_html": "<p>This is a bogus <em>test</em> email</p>"
        },
        {
          "subject": "your order of splinter is on its way",
          "body_html": "<p>Congratulations. We've received your order of splinter, order number 001.</p>"
        }
      ]
    }

Please note that you will need additional fields for the subsequent modules. For your convenience the categoriser will also accept these additional fields and ignore them, so that you don’t need to manage separate objects. Additional fields that you can pass to the categoriser and that will be ignored include id, date and from; see the /parser endpoint.
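If your emails are stored as raw RFC 5322 messages (for example .eml files), the email objects can be assembled with Python's standard email package. This is a minimal sketch, not part of Splinter: email_to_object is a hypothetical helper, the id is whatever identifier your mail store provides, and note that the standard library normalises the formatting of the Date header.

```python
from email import message_from_string, policy


def _header(msg, name):
    """Return a header as a plain string, or None when it is absent."""
    value = msg[name]
    return None if value is None else str(value)


def email_to_object(raw_message, email_id=None):
    """Convert a raw message string into the email object used by the services."""
    msg = message_from_string(raw_message, policy=policy.default)
    # Pick the HTML alternative; plain-text-only messages are rejected here.
    html_part = msg.get_body(preferencelist=("html",))
    if html_part is None:
        raise ValueError("message has no HTML body")
    email_object = {
        "date": _header(msg, "Date"),
        "from": _header(msg, "From"),
        "subject": _header(msg, "Subject"),
        "body_html": html_part.get_content(),
    }
    if email_id is not None:
        email_object["id"] = email_id
    return email_object
```

A list of such objects can be wrapped as {"emails": [...]} for /categoriser and later passed as a bare array to /parser.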

Example Call

From the command line you could then do the following:

    curl http://localhost:5000/categoriser \
      --request POST \
      --header "Content-Type: application/json" \
      --data @test-input.json

Outputs

You will receive JSON formatted output in the body of the response. At the top level the response body has three fields detailing the outcome: success, message and details. The details field is an object with two fields, retail_detector and journey_mapper, presenting the results of the two ML models. Each of these is an object with two fields, class_labels and predictions. The mapping between predicted values and human interpretable labels is stored in the class_labels object, and the predicted values are stored in the predictions array, which has the same size as the input array emails. Note that the journey_mapper returns null for a given email if that email was not found to be a retail email by the retail_detector.

    {
      "details": {
        "journey_mapper": {
          "class_labels": {
            "transaction/delivery": "Delivery",
            "transaction/order": "Order",
            "transaction/refund": "Refund",
            "transaction/return": "Return"
          },
          "predictions": [null, "transaction/delivery"]
        },
        "retail_detector": {
          "class_labels": {
            "0": "Non-Retail",
            "1": "Retail"
          },
          "predictions": [0, 1]
        }
      },
      "message": "The algorithm ran successfully",
      "success": true
    }
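To turn the two predictions arrays into one readable label per email, the class_labels mappings can be applied as in the sketch below. label_predictions is a hypothetical helper, and it assumes the response layout described in this section, in particular that the retail_detector class_labels keys are strings while its predictions are integers.

```python
def label_predictions(details):
    """Return one human-readable label per input email."""
    retail = details["retail_detector"]
    journey = details["journey_mapper"]
    labels = []
    for is_retail, stage in zip(retail["predictions"], journey["predictions"]):
        if stage is not None:
            # Retail email: use the more specific journey label.
            labels.append(journey["class_labels"][stage])
        else:
            # No journey stage: fall back to the retail_detector label.
            labels.append(retail["class_labels"][str(is_retail)])
    return labels


example_details = {
    "journey_mapper": {
        "class_labels": {
            "transaction/delivery": "Delivery",
            "transaction/order": "Order",
            "transaction/refund": "Refund",
            "transaction/return": "Return",
        },
        "predictions": [None, "transaction/delivery"],
    },
    "retail_detector": {
        "class_labels": {"0": "Non-Retail", "1": "Retail"},
        "predictions": [0, 1],
    },
}
print(label_predictions(example_details))  # ['Non-Retail', 'Delivery']
```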

POST ip.address:5001/parser Parser

/parser

Parameter Value
Container registry address: eu.gcr.io/monily-splinter/parser:latest
Port: 5001
Endpoint: ip.address:5001/parser

The parser is a service that takes the sender address and html body of an email as input, parses the contents into a structure and returns that structured data. It is intended as a utility to extract common features from an email receipt, such as the vendor name and order number, as well as details about the individual items bought, such as item descriptions, image urls and item values. Currently the service only parses emails from vendors that it recognises; however, we are working on (a) expanding the number of supported vendors and (b) a feature to force parsing, which may produce imperfect results.

Inputs

The endpoint is accessed via the POST method, with the input data supplied as JSON in the request body. This service requires an array of email objects, each containing an optional id, the from address field, the subject and the html formatted email body body_html as strings.

For example (warning: this example may not run because the body_html has been truncated to fit into this document; please contact us if you would like some example input to run on your instance):

    [
      {
        "id": "169de6a5982fcac3",
        "date": "Tue, 2 Apr 2019 14:18:22 +0000 (GMT)",
        "from": "\"Amazon.co.uk\" <auto-confirm@amazon.co.uk>",
        "subject": "Your Amazon.co.uk order of \"Anker PowerCore+ 26800 PD...\"",
        "body_html": "<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n <head> \r\n <meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" /> \r\n <style type=\"text/css\">\r\nbody {\r\n background-color: #ffffff;\r\n\tmargin:0;\r\n\tfont:12px/16px Arial, sans-serif;\r\n}\r\n\r\na {\r\n\ttext-decoration:none;\r\n\tcolor:#006699;\r\n\tfont:12px/16px Arial, sans-serif;\r\n}\r\n\r\na img {\r\n\tborder:0;\r\n}\r\n\r\nh2 {\r\n\tfont-size:20px;\r\n\tline-height:24px;\r\n\tmargin:0;\r\n\tpadding:0;\r\n\tfont-weight:normal;\t\r\n\tcolor:#000 !important;\r\n}\r\n\r\nh3 {\r\n\tfont-size: 18px;\r\n\tcolor:#cc6600;\r\n\tmargin:15px 0 0 0;\r\n\tfont-weight: normal; ...TRUNCATED"
      },
      {
        "id": "169de4ff7b0256eb",
        "date": "Tue, 2 Apr 2019 13:49:19 +0000",
        "from": "Finery <hello@finery.com>",
        "subject": "Rainbut Make it Fashion ☔",
        "body_html": "<!doctype html>\r\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:v=\"urn:schemas-microsoft-com:vml\" xmlns:o=\"urn:schemas-microsoft-com:office:office\">\r\n <head>\r\n <!--[if gte mso 15]>\r\n <xml>\r\n <o:OfficeDocumentSettings>\r\n <o:AllowPNG/>\r\n <o:PixelsPerInch>96</o:PixelsPerInch>\r\n </o:OfficeDocumentSettings>\r\n </xml>\r\n <![endif]-->\r\n <meta charset=\"UTF-8\">\r\n <meta http-equiv=\"X-UA-Compatible\" ...TRUNCATED"
      }
    ]

The /parser endpoint accepts an optional query parameter, onlyorders. Set it to "true" to restrict the results to the emails that were successfully parsed by the algorithm. With filtering enabled, the results are output ready to pass directly into the inputs of the subsequent /matcher service (see the example call below).

Because you will need additional fields in the /matcher service, the /parser service will also accept these fields and propagate them. Fields that you may want to pass into /matcher and propagate for convenience include id and date.
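The request for /parser can be prepared as in the following sketch, which adds the onlyorders query parameter and serialises the emails as a bare array. build_parser_request is a hypothetical helper; the default host and port assume the container is running locally on port 5001 as described above.

```python
import json
from urllib.parse import urlencode


def build_parser_request(emails, host="localhost", port=5001, only_orders=True):
    """Return the (url, body) pair for a POST to /parser."""
    for position, email in enumerate(emails):
        for field in ("from", "subject", "body_html"):
            if not isinstance(email.get(field), str):
                raise ValueError(f"email {position}: '{field}' must be a string")
    url = f"http://{host}:{port}/parser"
    if only_orders:
        # The documented value is the string "true".
        url += "?" + urlencode({"onlyorders": "true"})
    # /parser expects a bare array, not the {"emails": [...]} wrapper
    # used by /categoriser. Extra fields such as id and date are kept.
    return url, json.dumps(list(emails))
```

The returned url and body can then be sent with any HTTP client, using a Content-Type of application/json.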

Example Call

From the command line you could then do the following:

    curl "http://localhost:5001/parser?onlyorders=true" \
      --request POST \
      --header "Content-Type: application/json" \
      --data @test-input.json

Outputs

You will receive JSON formatted output in the body of the response. At the top level the response body has three fields detailing the outcome: success, message and details. The details field holds the parsed orders. Each order is an object with the fields vendor_name, order_number and items, where vendor_name and order_number are the order level details and items contains the details of each individual item on the receipt. Each entry in the items array includes type, which can be one of "product", "shipping" or "total", and value, which is the item price. An entry can also contain description, a text description of the item, and image, the image url.

Additional optional fields such as id and date are passed through into the results for convenience.

Example output:

    {
      "success": true,
      "message": "Algorithm ran successfully",
      "details": [
        {
          "id": "1697272e9fae71cf",
          "date": "Tue, 12 Mar 2019 09:08:46 -0600",
          "vendor_name": "\"Amazon.co.uk\"",
          "order_number": "203-4351263-4100355",
          "items": [
            {
              "type": "product",
              "value": 102.99,
              "description": "Anker PowerCore+ 26800 PD, USB-C Portable Charger with Power Delivery, Type-C Port With Fast-Charge Input and 30W Output for iPhone X / 8/8 plus/MacBooks/ iPad Pro 2018 with Type-C Condition: New Sold by: AnkerDirect Fulfilled by Amazon",
              "image": "https://images-na.ssl-images-amazon.com/images/I/31gpvU6xzzL._SCLZZZZZZZ__SY115_SX115_.jpg"
            },
            {
              "type": "shipping",
              "value": 0,
              "description": "Postage & Packing:",
              "image": null
            },
            {
              "type": "total",
              "value": 102.99,
              "description": "Order Total:",
              "image": null
            }
          ]
        }
      ]
    }
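Before passing parsed orders on, it can be useful to check that the items of an order are internally consistent. The sketch below is an illustrative check only: order_total_consistent is a hypothetical helper, and it assumes that the "total" item equals the sum of the "product" and "shipping" items, which holds for the example above but may not hold for receipts with discounts or vouchers.

```python
def order_total_consistent(order, tolerance=0.005):
    """True when product and shipping values add up to the single 'total' item."""
    parts = sum(
        item["value"]
        for item in order["items"]
        if item["type"] in ("product", "shipping")
    )
    totals = [item["value"] for item in order["items"] if item["type"] == "total"]
    if len(totals) != 1:
        # No total, or more than one: nothing reliable to compare against.
        return False
    # Compare with a small tolerance to absorb floating point rounding.
    return abs(parts - totals[0]) <= tolerance


example_order = {
    "vendor_name": "\"Amazon.co.uk\"",
    "order_number": "203-4351263-4100355",
    "items": [
        {"type": "product", "value": 102.99},
        {"type": "shipping", "value": 0},
        {"type": "total", "value": 102.99},
    ],
}
print(order_total_consistent(example_order))  # True
```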

POST ip.address:5002/matcher Matcher

/matcher

Parameter Value
Container registry address: eu.gcr.io/monily-splinter/matcher:latest
Port: 5002
Endpoint: ip.address:5002/matcher

The matcher is a service that takes a set of results from the parser and a set of bank transactions as input. The algorithm then attempts to match the transactions to the purchases extracted by the parser service. For negative amounts (debits from the account), the matcher returns either a match to an order returned from the parser or an error stating that no match was found. For positive amounts (credits into the account), the matcher returns either a list of candidate orders or an error. When there are candidate matches for a credit (potential refunds), each matched order contains only those items that are candidates for the refund. We don’t assume a particular currency, but we do assume that all of your orders and transactions are in the same currency. We also assume that debits from the account are listed as negative amounts, with a minus sign, and that credits into the account, such as refunds, are listed as positive amounts.
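The sign and currency assumptions can be checked on your own transactions before calling the service. The following sketch is not part of the matcher; split_by_direction and currencies_used are hypothetical helpers that only illustrate the convention described above.

```python
def split_by_direction(transactions):
    """Split transactions into debits (negative) and credits (positive)."""
    debits = [t for t in transactions if t["amount"] < 0]
    credits = [t for t in transactions if t["amount"] > 0]
    # Zero-amount transactions fall into neither list.
    return debits, credits


def currencies_used(transactions):
    """Return the set of currency codes present (the field is optional)."""
    return {t["currency"] for t in transactions if "currency" in t}


transactions = [
    {"id": "tx_1", "amount": 60, "currency": "GBP"},
    {"id": "tx_2", "amount": -1},
    {"id": "tx_3", "amount": -2, "currency": "GBP"},
]
debits, credits = split_by_direction(transactions)
print(len(debits), len(credits))      # 2 1
print(currencies_used(transactions))  # {'GBP'}
```

More than one entry in the set returned by currencies_used would indicate that the single-currency assumption does not hold for the sample.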

Inputs

The endpoint is accessed via the POST method, with the input data supplied as JSON in the request body. At the top level your input data should have the orders and transactions fields.

Input Parameter      Definition
orders               An array of order objects, typically returned from /parser, where each order contains id, date, vendor_name, order_number and items.
  id                 A unique id for each order.
  date               The date of the order as a full text string.
  vendor_name        (optional) Name of the vendor.
  order_number       Order number unique to each order.
  items              An array of item objects, each containing type, value, description and image.
    type             The type of item, for example product or total.
    value            The monetary value.
    description      Text description found in the email.
transactions         An array of transactions, e.g. obtained from the open banking APIs, potentially containing the fields id, created, amount, currency and description.
  id                 Unique id for each transaction.
  created            Date and time of the transaction.
  amount             Monetary value of the transaction, negative for debits and positive for credits.
  currency           (optional) Text for the currency.
  description        (optional) Text for the description.
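A quick structural check of the input can catch missing fields before the request is sent. This is a minimal sketch with a hypothetical helper, find_input_problems. Which fields are strictly required is an assumption here: every field not marked optional in the table is treated as required, except the item description and image, which are treated as optional.

```python
REQUIRED_ORDER_FIELDS = ("id", "date", "order_number", "items")
REQUIRED_ITEM_FIELDS = ("type", "value")
REQUIRED_TRANSACTION_FIELDS = ("id", "created", "amount")


def find_input_problems(matcher_input):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key in ("orders", "transactions"):
        if key not in matcher_input:
            problems.append(f"top level is missing '{key}'")
    for i, order in enumerate(matcher_input.get("orders", [])):
        for field in REQUIRED_ORDER_FIELDS:
            if field not in order:
                problems.append(f"orders[{i}] is missing '{field}'")
        for j, item in enumerate(order.get("items", [])):
            for field in REQUIRED_ITEM_FIELDS:
                if field not in item:
                    problems.append(f"orders[{i}].items[{j}] is missing '{field}'")
    for i, transaction in enumerate(matcher_input.get("transactions", [])):
        for field in REQUIRED_TRANSACTION_FIELDS:
            if field not in transaction:
                problems.append(f"transactions[{i}] is missing '{field}'")
    return problems
```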

Example input:

    {
      "orders": [
        {
          "id": "16972f8e93a32bab",
          "date": "Tue, 12 Mar 2019 10:35:07 -0700",
          "vendor_name": "Matalan",
          "order_number": "30603910",
          "items": [
            {
              "type": "product",
              "value": 2,
              "description": "Father of the Bride Slogan Socks Colour: BlackSize: One Size Qty: 1"
            },
            {
              "type": "total",
              "value": 2,
              "description": "Father of the Groom Slogan Socks Colour: BlackSize: One Size Qty: 1"
            }
          ]
        },
        {
          "id": "1696c98b183e2826",
          "date": "Mon, 11 Mar 2019 11:52:19 +0000 (GMT)",
          "vendor_name": "confirmation@screwfix.com",
          "order_number": "A5110669215",
          "items": [
            {
              "type": "product",
              "value": 0.49,
              "description": "18690 x 1 Easyfix Wall Plugs 5 x 100 Pack"
            },
            {
              "type": "product",
              "value": 0.58,
              "description": "13209 x 1 Easyfix Wall Plugs 6 x 100 Pack"
            },
            {
              "type": "total",
              "value": 1.07,
              "description": "Total (inc. VAT)"
            }
          ]
        }
      ],
      "transactions": [
        {
          "id": "tx_00009geqKqHI4ZAdLQVkxt",
          "created": "2019-03-11T09:55:37Z",
          "amount": 60,
          "currency": "GBP",
          "description": "(Faster Payments)",
          "account": 1
        },
        {
          "id": "tx_00009ges0XQSedMnI43v3x",
          "created": "2019-03-11T10:14:22Z",
          "amount": -1,
          "description": "GOOGLE *SERVICES g.co/helppay# GBR",
          "account": 1
        },
        {
          "id": "tx_00009ges7vN61UwvHnC8fZ",
          "created": "2019-03-11T10:15:42Z",
          "amount": -2,
          "currency": "GBP",
          "description": "TESCO GROCERY",
          "account": 1
        }
      ]
    }

Example Call

From the command line you could then do the following:

    curl http://localhost:5002/matcher \
      --request POST \
      --header "Content-Type: application/json" \
      --data @test-input.json

Outputs

You will receive JSON formatted output in the body of the response. At the top level there are three fields: success, matches and match_count.

Output field         Definition
success              (Boolean) Whether or not the algorithm has run successfully.
match_count          (Integer) The count of successfully reconciled transactions in the sample provided.
matches              (Array) One entry per transaction provided, each containing transaction_id and matched and, depending on the outcome, either error or type and orders.
  transaction_id     (String) Transaction id provided as input.
  matched            (Boolean) Whether or not the transaction was matched.
  error              (String) Detail of the failure to match the transaction.
  type               (String) Either “purchase” or “potential_refund”, depending on whether we matched a debit or a credit.
  orders             (Object) In the case of a debit, the order object defined under /parser. (Array) In the case of a credit, the candidate orders which may have been refunded; these orders contain only those items which are candidates for the return.

Example output, as printed by Node's console.log, which abbreviates nested values as [Array] and [Object]:

    {
      success: true,
      matches: [
        { transaction_id: 'tx_00009gf1wavyfSW4FykLkP', matched: false, error: 'Error: No match for transaction tx_00009gf1wavyfSW4FykLkP' },
        { transaction_id: 'tx_00009ggaw3EDfyth0sJUIb', matched: true, type: 'potential_refund', orders: [Array] },
        { transaction_id: 'tx_00009ggfKDGoON5ggF9W8f', matched: true, type: 'potential_refund', orders: [Array] },
        { transaction_id: 'tx_00009ggynm7FoU5PWt4yOo', matched: false, error: 'Error: No match for transaction tx_00009ggynm7FoU5PWt4yOo' },
        { transaction_id: 'tx_00009ggyr28ehL4Ddz4JF3', matched: true, type: 'purchase', orders: [Object] },
        { transaction_id: 'tx_00009gh09CFZwRgyAgX12f', matched: false, error: 'Error: No match for transaction tx_00009gh09CFZwRgyAgX12f' }
      ],
      match_count: 11
    }
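The matches array can be grouped by outcome to see at a glance which transactions were reconciled. The sketch below uses a hypothetical helper, summarise_matches, and assumes only the fields described in the output table (transaction_id, matched and type).

```python
def summarise_matches(matcher_output):
    """Group transaction ids by outcome: match type, or 'unmatched'."""
    summary = {"purchase": [], "potential_refund": [], "unmatched": []}
    for match in matcher_output["matches"]:
        if match["matched"]:
            # setdefault keeps the helper working if a new type appears.
            summary.setdefault(match["type"], []).append(match["transaction_id"])
        else:
            summary["unmatched"].append(match["transaction_id"])
    return summary


example_output = {
    "success": True,
    "matches": [
        {"transaction_id": "tx_a", "matched": False, "error": "Error: No match for transaction tx_a"},
        {"transaction_id": "tx_b", "matched": True, "type": "potential_refund", "orders": []},
        {"transaction_id": "tx_c", "matched": True, "type": "purchase", "orders": {}},
    ],
}
print(summarise_matches(example_output))
# {'purchase': ['tx_c'], 'potential_refund': ['tx_b'], 'unmatched': ['tx_a']}
```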

Python

A worked example in Python 3.6.x

Assuming that you have your containers running on localhost:5000/categoriser, localhost:5001/parser and localhost:5002/matcher respectively and that your input emails and bank transactions are formatted correctly then you can follow the example below to pass data from one service to the next.

First we need to use the categoriser to find out if we have any retail emails in our sample. Assuming that test_emails is an array of objects containing the necessary information (see section on /categoriser) we can do this as follows.

    import json

    import requests

    # Wrap the emails in the object expected by /categoriser.
    categoriser_input = json.dumps({'emails': test_emails})
    header = {'Content-Type': 'application/json'}

    categoriser_response = requests.post(
        'http://localhost:5000/categoriser',
        data=categoriser_input,
        headers=header
    )
    categoriser_output = json.loads(categoriser_response.text)

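Now that we have the output from the categoriser, let's use it to filter our emails. The details.retail_detector.predictions array holds one value per input email, with 1 marking the emails that the model predicted to contain a retail transaction. From it we can build an index into our original email inputs.

    predictions = categoriser_output['details']['retail_detector']['predictions']
    # Positions of the emails predicted to be retail (prediction == 1).
    retail_index = [i for i, prediction in enumerate(predictions) if prediction == 1]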

We can use this index to filter our email inputs and create the input for the /parser service.

    # Keep only the retail emails and send them to /parser.
    parser_input = json.dumps([test_emails[i] for i in retail_index])
    parser_response = requests.post(
        'http://localhost:5001/parser?onlyorders=true',
        data=parser_input,
        headers=header
    )

Now that we have our results from the parser, all that is left to do is combine them with the bank transactions and call the /matcher service. We assume that test_transactions is an array of objects containing all the information needed (see the section on /matcher for details).

    # The filtered parser results are the orders for /matcher.
    orders = json.loads(parser_response.content)['details']
    matcher_input = json.dumps({
        'orders': orders,
        'transactions': test_transactions
    })
    matcher_response = requests.post(
        'http://localhost:5002/matcher',
        data=matcher_input,
        headers=header
    )
    print(matcher_response.content)
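The worked example assumes that every call succeeds. A more defensive variant could check the success field before reading details from a /categoriser or /parser response, as in this sketch. SplinterError and unwrap_details are hypothetical names, and the behaviour of the services on failure (for instance whether message is always present) is an assumption.

```python
import json


class SplinterError(RuntimeError):
    """Hypothetical exception raised when a service reports a failure."""


def unwrap_details(response_text):
    """Parse a /categoriser or /parser response and return its details."""
    body = json.loads(response_text)
    if not body.get("success"):
        # Fall back to a generic text if the service sent no message.
        raise SplinterError(body.get("message", "service reported a failure"))
    return body["details"]


ok = '{"success": true, "message": "Algorithm ran successfully", "details": []}'
print(unwrap_details(ok))  # []
```

For example, orders = unwrap_details(parser_response.text) could replace the direct json.loads call when building the matcher input.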

Node JS

A worked example in node 12.3.x

Assuming that you have your containers running on localhost:5000/categoriser, localhost:5001/parser and localhost:5002/matcher respectively and that your input emails and bank transactions are formatted correctly then you can follow the example below to pass data from one service to the next.

First of all, let's define a thin wrapper around the post method of the request package that we can reuse for each of the API calls.

    const request = require('request');

    // POST a JSON payload to `path` and pass the parsed response body to `callback`.
    function callAPI(path, payload, callback) {
      request.post({
        url: path,
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(payload)
      }, function (error, response, body) {
        if (error) {
          console.log(error);
        } else {
          callback(JSON.parse(body));
        }
      });
    }

Now we can make three nested calls to callAPI each time providing the results from the previous call. Assuming that test_emails is an array of objects containing the necessary information (see section on /categoriser) we can do this as follows.

    let categoriser_input = { "emails": test_emails };

    callAPI("http://localhost:5000/categoriser", categoriser_input, (categoriser_output) => {
      console.log(categoriser_output);

      // 1. keep only the emails predicted to be retail
      let retail_email = categoriser_output.details.retail_detector.predictions;
      let parser_input = categoriser_input.emails.filter((el, i) => retail_email[i] === 1);

      callAPI("http://localhost:5001/parser?onlyorders=true", parser_input, (parser_output) => {
        console.log(parser_output);

        // 2. combine orders and transactions for the matcher
        let matcher_input = {
          "orders": parser_output.details,
          "transactions": test_transactions
        };

        callAPI("http://localhost:5002/matcher", matcher_input, (matcher_output) => {
          console.log(matcher_output);
        });
      });
    });

Notice that we filtered our original email input array using the results of the categoriser. Specifically, we used details.retail_detector.predictions, an array of 0s and 1s indicating whether the model predicted that a given email contained a retail transaction.