Technical Processes

1     Data

a     Format

Loan transaction data from the John Rylands University Library (JRUL) was provided via FTP onto the SALT server as CSV files that contained the quantities detailed below:

Name Description
CREATE_DATE The date of the loan transaction
LOAN_ID The ID of the loan transaction
BORROWER_ID The ID of the library user who made the loan
WORK_ID The ID of the item loaned
CONTROL_NUMBER The ISBN of the item where available, otherwise a local code
AUTHOR_DISPLAY Bibliographic details of the item loaned for citation
TITLE_DISPLAY
PUB_DATE
EDITION_MAIN

b     Number

Over 8 million loan transactions were received (8,070,319 to be precise) covering all available records up to 16:18 on 01/07/2011.  These loans featured 627,999 individual library items.

2     Loading

A new database, called ‘salt’, was created in the MySQL instance on the SALT server and two tables created as per Appendix A.  The data in each record in each CSV file was loaded into these two tables as follows:

a     Items

The relevant details of the item loaned for any given record are encoded as UTF-8, inserted in an array and then added to the database using the function detailed in Appendix B.

The relevant details for each item are:

  • WORK_ID
  • CONTROL_NUMBER
  • AUTHOR_DISPLAY
  • TITLE_DISPLAY
  • EDITION_MAIN
  • PUB_DATE

b     Loans

The relevant details of the loan transaction for any given record are encoded as UTF-8, inserted in an array and then added to the database using the function detailed in Appendix C.

The relevant details for each loan transaction are:

  • LOAN_ID
  • CREATE_DATE
  • BORROWER_ID
  • WORK_ID

3     Processing

To aid responsiveness of the API some processing of the loan and item details is performed.

a     Number of Loans

For each item, the total number of times that item has featured in a loan transaction is counted and stored in the database, as per Appendix D.

b     Split Items

The items are split into 10 files, as per Appendix E.  The rationale for this is as follows.  Searching for a given item by ISBN (or rather CONTROL_NUMBER) in the items table created for Loading was found to be costly, but the same search made when the item data was split into 10 smaller files was found to be significantly faster.

c     Split Loans

The loans are split into 2 sets of 5 files, as per Appendix F.  The rationale for this is as follows.  Searching for a BORROWER_IDs given a WORK_ID, or WORK_IDs given a BORROWER_ID in the loans table created for Loading was found to be costly, but the same searches made when the loan transaction data was split into the two sets of 5 smaller files was found to be significantly faster.

4     API

Given the ISBN or WORK_ID of a certain item (ITEM A) and a threshold value, the API returns a list of suggested items for further reading, based on what other items the previous users who borrowed ITEM A also borrowed.  The process behind this derivation is as follows:

(i)        Find the WORK_ID of ITEM A by looking in the items_X tables (created in section 3b) in turn until the item is found

(ii)        Find the list of all users (BORROWER_IDs) from the workid_borrowerid_X tables (created in section 3c) who feature in a loan transaction for ITEM A, given the WORK_ID found in (i)

(iii)        For each user found in (ii), find the list of all items (WORK_IDs) from the borrowerid_workid_X tables (created in section 3c) which feature in a loan transaction for that user

(iv)        Sum across the lists of items found in (iii) to compile a single list of all possible suggested items which includes, for each of these items, the number of unique users who have previously borrowed both that item and ITEM A

(v)        From the list created in (iv) remove ITEM A and any items for which the number of unique users falls below the threshold supplied to the API

(vi)        For each item remaining in the list derived in (v) divide the number of unique users of that item by the total number of times that item has been borrowed, which is taken from the nloans table (created in section 3a)

(vii)        Rank the items in the list in (vi) by the ratio of unique users to total loans

(viii)        Find the details for the items in the list in (vii) from the items table, and create a citation for each item by concatenating the relevant AUTHOR_DISPLAY, TITLE_DISPLAY, EDITION_MAIN and PUB_DATE quantities.

(ix)        Return the list of suggestions

5     Appendices

A    Loading – MySQL Tables

Two tables created as follows:

create table items

(

WORK_ID int unsigned not null,

CONTROL_NUMBER varchar(50) null,

AUTHOR_DISPLAY nvarchar(1000) null,

TITLE_DISPLAY nvarchar(1000) null,

EDITION_MAIN nvarchar(100) null,

PUB_DATE int not null default 0,

primary key (WORK_ID)

);

create table loans

(

LOAN_ID int unsigned not null,

CREATE_DATE datetime not null,

BORROWER_ID int unsigned not null,

WORK_ID int unsigned not null,

primary key (LOAN_ID),

index (BORROWER_ID),

index (WORK_ID)

);

B    Loading – Item insertion script

The following PHP function is used to add details of an item to the database:

function q_addItem( $in_dbConnection,

$in_item )

{

$variableString = ”;

$valueString = ”;

foreach( $in_item as $variable => $value )

{

if( ! empty( $value ) )

{

if( $variableString )

{

$variableString .= ‘, ‘;

}

$variableString .= $in_dbConnection->real_escape_string( $variable );

if( $valueString )

{

$valueString .= ‘, ‘;

}

$valueString .= “‘”.$in_dbConnection->real_escape_string( $value ).”‘”;

}

}

$queryString = ‘insert ignore into items ( ‘.$variableString.’ ) values ( ‘.$valueString.’ );’;

$status = $in_dbConnection->query( $queryString );

return( $in_dbConnection->error );

}

Note that the ‘insert ignore’ directive in the SQL query in this function ensures that only unique items are added to the database

C    Loading – Loan insertion script

The following PHP function is used to add details of a loan transaction to the database:

function q_addLoan( $in_dbConnection,

$in_loan )

{

$variableString = ”;

$valueString = ”;

foreach( $in_loan as $variable => $value )

{

if( ! empty( $value ) )

{

if( $variableString )

{

$variableString .= ‘, ‘;

}

$variableString .= $in_dbConnection->real_escape_string( $variable );

if( $valueString )

{

$valueString .= ‘, ‘;

}

$valueString .= “‘”.$in_dbConnection->real_escape_string( $value ).”‘”;

}

}

$queryString = ‘insert ignore into loans ( ‘.$variableString.’ ) values ( ‘.$valueString.’ );’;

$status = $in_dbConnection->query( $queryString );

return( $in_dbConnection->error );

}

Note that the ‘insert ignore’ directive in the SQL query in this function ensures that only unique loan transactions are added to the database

D    Processing – Number of Loans

The following table created, or dropped and re-created if already in existence,

create table nloans

(

WORK_ID int unsigned not null,

nLOANS int unsigned not null,

index (WORK_ID)

);

and populated via the following SQL query

insert into nloans

( WORK_ID, nLOANS )

select WORK_ID, count(*)

from loans

group by WORK_ID;

E     Processing – Split Items

Ten tables created as follows, or dropped and re-created if already in existence, where X ranges from 1 to 10

create table items_X

(

WORK_ID int unsigned not null,

CONTROL_NUMBER varchar(50) null,

AUTHOR_DISPLAY nvarchar(1000) null,

TITLE_DISPLAY nvarchar(1000) null,

EDITION_MAIN nvarchar(100) null,

PUB_DATE int not null default 0,

primary key (WORK_ID),

index (CONTROL_NUMBER)

);

and populated via the following SQL query

insert into items_X

( WORK_ID, CONTROL_NUMBER, AUTHOR_DISPLAY, TITLE_DISPLAY, EDITION_MAIN, PUB_DATE )

select WORK_ID, CONTROL_NUMBER, AUTHOR_DISPLAY, TITLE_DISPLAY, EDITION_MAIN, PUB_DATE

from items

order by WORK_ID

limit (X-1) * N, N;

where N is calculated as int(T/10) + 1 and T is found via the following SQL query

select count( WORK_ID ) from items;

F     Processing – Split Loans

Five tables created as follows, or dropped and re-created if already in existence, where X ranges from 1 to 5

create table borrowerid_workid_X

(

BORROWER_ID int unsigned not null,

WORK_ID int unsigned not null,

index (BORROWER_ID)

);

and populated via the following SQL query

insert into borrowerid_workid_X

( BORROWER_ID, WORK_ID )

select distinct BORROWER_ID, WORK_ID

from loans

order by BORROWER_ID, WORK_ID

limit (X-1) * N, N;

where N is calculated as int(T/5) + 1 and T is found via the following SQL query

select count( distinct BORROWER_ID, WORK_ID ) from loans;

Also, five tables created as follows, or dropped and re-created if already in existence, where X ranges from 1 to 5

create table worked_borrowerid_X

(

WORK_ID int unsigned not null,

BORROWER_ID int unsigned not null,

index (WORK_ID)

);

and populated via the following SQL query

insert into workid_borrowerid_X

( WORK_ID, BORROWER_ID )

select distinct WORK_ID, BORROWER_ID

from loans

order by WORK_ID, BORROWER_ID

limit (X-1) * N, N;

where N is calculated as int(T/5) + 1 and T is found via the following SQL query

select count( distinct BORROWER_ID, WORK_ID ) from loans;

2 Responses to Technical Processes

  1. Pingback: Final blog post | SALT – Surfacing the Academic Long Tail

  2. Pingback: Final blog post | Innovations

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s