While early tests with a sample set of data from JRUL were encouraging (see SALT – a demo), an overhaul of the methodology behind the recommender API was required once the full set of loan transactions was obtained.
It was feared that processing the data into the nborrowers table – containing, for each pair of items, a count of the unique library users to have borrowed both – might become too onerous with the anticipated 3 million records. That fear turned to blind panic when 8 million loan records actually arrived!
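For illustration only, that abandoned pre-computation would have looked something like the following sketch (plain Python standing in for the SQL used in the project; record shapes and names are invented). The cost is what made it onerous: every pair of items co-borrowed by any user has to be counted.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Illustrative loan records as (user, item) pairs -- not the actual schema
loans = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"), ("u2", "C")]

# Items borrowed by each user (unique per user)
by_user = defaultdict(set)
for user, item in loans:
    by_user[user].add(item)

# nborrowers: for each pair of items, the number of unique users
# who have borrowed both
nborrowers = Counter()
for borrowed in by_user.values():
    for pair in combinations(sorted(borrowed), 2):
        nborrowers[pair] += 1

print(nborrowers[("A", "B")])  # 2 -- both u1 and u2 borrowed A and B
```

With millions of loans, the number of pairs to count grows quadratically in the size of each user's borrowing history, which is why pre-computing the whole table was abandoned.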
The approach for processing the data for the API was thus re-jigged. As before, the data was loaded into two MySQL tables, items and loans, and some simple processing then pushed the total number of loans for each item into a further table, nloans. The remainder of the recommender logic was moved to run, on demand, in the API.
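As a sketch of that simpler pre-processing step, the per-item loan totals in nloans can be derived from the raw loan records like so (an in-memory Python analogue of the SQL; field names are illustrative):

```python
from collections import Counter

# Illustrative loan records as (user, item) pairs -- not the actual schema
loans = [
    ("user1", "isbnA"), ("user1", "isbnB"),
    ("user2", "isbnA"), ("user3", "isbnA"),
    ("user3", "isbnB"), ("user3", "isbnB"),
]

# nloans: total number of loans per item, regardless of borrower
nloans = Counter(item for _user, item in loans)

print(nloans["isbnA"])  # 3
print(nloans["isbnB"])  # 3 (one loan by user1, two by user3)
```

Note that nloans counts loans, not unique borrowers: a user re-borrowing the same item contributes each time.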
Given the ISBN of a certain item, let’s say ITEM A, and a threshold value, the PHP script for the API was coded to do the following:
1. Find the list of all users in the loans table who have borrowed ITEM A
2. For each user found in step 1, find the list of all items in the loans table that have been borrowed by that user
3. Sum across the lists of items found in step 2 to compile a single list of all possible suggested items, recording for each of these items the number of unique users to have borrowed both that item and ITEM A
4. From the list in step 3, remove ITEM A and any items for which the number of unique users falls below the given threshold
5. For each item remaining after step 4, divide the number of unique users of that item by the total number of times that item has been borrowed, taken from the nloans table
6. Rank the items by this ratio of unique users to total loans
7. Find the details of each item in the ranked list from the items table and return the list of suggestions
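The seven steps above can be sketched as follows, using the same in-memory stand-ins for the loans, nloans and items tables (the real API issues MySQL queries from a PHP script; the function and field names here are invented):

```python
from collections import Counter

def recommend(item_a, loans, nloans, items, threshold):
    """Suggest items for item_a following the seven steps above.

    loans: list of (user, item) borrowing records
    nloans: mapping item -> total number of loans of that item
    items: mapping item -> item details (e.g. title)
    """
    # 1. All users who have borrowed ITEM A
    users_a = {u for u, i in loans if i == item_a}

    # 2./3. For those users, count unique borrowers per co-borrowed item
    co_borrowers = Counter()
    for user in users_a:
        for item in {i for u, i in loans if u == user}:  # unique items per user
            co_borrowers[item] += 1

    # 4. Drop ITEM A itself and any item below the threshold
    candidates = {i: n for i, n in co_borrowers.items()
                  if i != item_a and n >= threshold}

    # 5./6. Rank by ratio of unique co-borrowers to total loans
    ranked = sorted(candidates, key=lambda i: candidates[i] / nloans[i],
                    reverse=True)

    # 7. Return the details of each suggested item
    return [(i, items[i]) for i in ranked]

loans = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"), ("u2", "C"),
         ("u3", "A"), ("u3", "C"), ("u4", "C")]
nloans = Counter(i for _u, i in loans)
items = {"A": "Item A", "B": "Item B", "C": "Item C"}

print(recommend("A", loans, nloans, items, threshold=2))
# → [('B', 'Item B'), ('C', 'Item C')]
```

In this toy run, B and C are each co-borrowed with A by 2 unique users, but B ranks higher because all of its loans (2 of 2) are by borrowers of A, versus 2 of 3 for C.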
Testing showed that certain queries of the MySQL database involved in the above process were time-consuming and affected the responsiveness of the API. The following extra pre-processing was thus performed:
- The items table was split into 10 smaller tables
- The loans table was split into 5 smaller tables
With the queries rewritten so that searches access each of these smaller tables in turn, rather than just looking at the original, large tables, there was a significant boost in API performance. The number of divisions for the above splits was somewhat arbitrary but proved sufficient to render the API usable for testing.
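The effect of the split can be sketched as follows: the same search runs against each of the smaller tables in turn and the results are combined. This is an illustrative in-memory analogue of the rewritten SQL; the partitioning scheme shown (round-robin) is an assumption, as the actual scheme is not documented here.

```python
def split_into_shards(records, n_shards):
    """Split one large table into n smaller ones.

    Round-robin assignment is used here for illustration only;
    the actual partitioning scheme in SALT may differ.
    """
    shards = [[] for _ in range(n_shards)]
    for idx, record in enumerate(records):
        shards[idx % n_shards].append(record)
    return shards

def users_who_borrowed(item, loan_shards):
    """Search each smaller loans table in turn and combine the results."""
    users = set()
    for shard in loan_shards:
        users |= {u for u, i in shard if i == item}
    return users

loans = [("u1", "A"), ("u2", "A"), ("u3", "B"), ("u4", "A")]
shards = split_into_shards(loans, 5)  # the loans table was split into 5

print(sorted(users_who_borrowed("A", shards)))  # ['u1', 'u2', 'u4']
```

The combined result is identical to a query against the single large table; the gain comes from each individual query touching a much smaller table.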
Further analysis would more than likely bring additional performance benefits, which is especially relevant as the amount of data is only going to grow (*). Also on the to-do list is expanding the range of output formats for the API; at present only XML and JSON are offered, though the developers implementing the API in Copac and in JRUL have both suggested that JSONP would be easier to work with.
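For reference, the gap between the two formats is small: a JSONP response is just the JSON payload wrapped in a caller-supplied callback function, which lets a browser load it cross-domain via a script tag. A minimal sketch, with an invented payload shape and callback name (not the actual SALT response):

```python
import json

# Hypothetical recommender response -- not the actual SALT payload shape
suggestions = {"suggestions": [{"isbn": "isbnB", "title": "Item B"}]}

as_json = json.dumps(suggestions)
# JSONP: the same JSON wrapped in a callback named by the calling page
as_jsonp = "handleSuggestions(" + as_json + ");"

print(as_json)   # {"suggestions": [{"isbn": "isbnB", "title": "Item B"}]}
print(as_jsonp)  # handleSuggestions({"suggestions": [{"isbn": "isbnB", "title": "Item B"}]});
```

In practice the API would read the callback name from a query-string parameter, which is what makes JSONP convenient for the client-side developers.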
(*) For reference, just over 8 million loan transactions are used for the current SALT recommender, covering all available records up to July 2011; these loans feature around 628,000 individual library items.