Calculating a Levenshtein Distance

Imagine you’re writing a mobile app, and your user searches for the word kitten. Unfortunately, the only search terms you expected them to enter were from the following:

smitten
mitten
kitty
fitting
written

How do we figure out which word the user meant to type?

Levenshtein Distances

A Levenshtein distance is a distance between two sequences a and b. If a and b are strings, the Levenshtein distance is the minimum amount of character edits needed to change one of the strings into the other. There are three types of edits allowed:

  • Insertion: a character is added to a
  • Deletion: a character is removed from b
  • Substitution: a character is replaced in a or b

For example, if the first string a = 'abc' and the second string is b = 'abc', the Levenshtein distance between the two strings is 0 because a and b are equal. If a = 'abcd' and b = 'a', the distance is 3. If a = 'abcd' and b = 'aacc', the distance is 2.

The definition of the Levenshtein distance for a string a with a length i and a string b with a length j is:

This definition is a recursive function. The first portion, max(i, j) if min(i, j) = 0, is the base cases where either the first string or the second string is empty.

The function 1_(ai != bi) at the end of the third minimum element is the cost. If a[i] != b[i], the cost is 1, otherwise the cost is 0. The first minimum element is a deletion from a, the second is an insertion, and the third is a substitution.

A Naive Implementation

First, let’s implement a straightforward implementation in Swift. We’ll create a function named levenshtein_distance and write the base cases to check whether either of the strings are empty:

func levenshtein_distance(a: String, b: String) -> Int {
    // If either array is empty, return the length of the other array
    if (a.count == 0) {
        return b.count
    }
    if (b.count == 0) {
        return a.count
    }
}

Then we add the recursive portion. We calculate the cost for the substitution, then find the minimum distance between the three different possible edits (deletion, insertion, or substitution):

func levenshtein_distance(a: String, b: String) -> Int {
    // ...

    // Check whether the last items are the same before testing the other items
    let cost = (a.last == b.last) ? 0 : 1

    let a_dropped = String(a.dropLast())
    let b_dropped = String(b.dropLast())

    return min(
        // Find the distance if an item in a is removed
        levenshtein_distance(a: a_dropped, b: b) + 1,
        // Find the distance if an item is removed from b (i.e. added to a)
        levenshtein_distance(a: a, b: b_dropped) + 1,
        // Find the distance if an item is removed from a and b (i.e. substituted)
        levenshtein_distance(a: a_dropped, b: b_dropped) + cost
    )
}

Let’s test our distance function with a simple test case:

print(opti_leven_distance(a: "123", b: "12"))
1

More example test cases can be found below in the final files. And now we can compare the distances of our words to the string kitten to figure out which word the user probably meant to type:

// Print out the distances for our test case
let first_word = "kitten"
let test_words = ["smitten", "mitten", "kitty", "fitting", "written"]

for word in test_words {
    let dist = opti_leven_distance(a: first_word, b: word)
    print("Distance between \(first_word) and \(word): \(dist)")
}
Distance between kitten and smitten: 2
Distance between kitten and mitten: 1
Distance between kitten and kitty: 2
Distance between kitten and fitting: 3
Distance between kitten and written: 2

The user probably meant to type mitten instead of kitten!

An Improved Implementation

The recursive implementation of the Levenshtein distance above won’t scale very well for larger strings. What if we needed to find the distance between a thousand strings, each with hundreds of characters?

One improved way to calculate a Levenshtein distance is to use a matrix of distances to “remember” previously calculated distances. First, the distance function should check for empty strings. Then, we’ll create a matrix to hold the distance calculations:

func opti_leven_distance(a: String, b: String) -> Int {
    // Check for empty strings first
    if (a.count == 0) {
        return b.count
    }
    if (b.count == 0) {
        return a.count
    }

    // Create an empty distance matrix with dimensions len(a)+1 x len(b)+1
    var dists = Array(repeating: Array(repeating: 0, count: b.count+1), count: a.count+1)
}

The first column and first row of the distance matrix are zeros as an initialization step. The next column goes from 1 to the length of a to represent removing each character to get to an empty string, and the next row goes from 1 to the length of b to represent adding (or inserting) each character to get to the value of b:

func opti_leven_distance(a: String, b: String) -> Int {
    //...

    // a's default distances are calculated by removing each character
    for i in 1...(a.count) {
        dists[i][0] = i
    }
    // b's default distances are calulated by adding each character
    for j in 1...(b.count) {
        dists[0][j] = j
    }
}

Similar to our naive implementation, we’ll check the remaining indices in the distance matrix. This time, however, we’ll use the previous values stored in the matrix to calculate the minimum distance rather than recursively calling the distance function. The final distance is the last element in the distance matrix (at the bottom right):

func opti_leven_distance(a: String, b: String) -> Int {
    //...

    // Find the remaining distances using previous distances
    for i in 1...(a.count) {
        for j in 1...(b.count) {
            // Calculate the substitution cost
            let cost = (a[i-1] == b[j-1]) ? 0 : 1

            dists[i][j] = min(
                // Removing a character from a
                dists[i-1][j] + 1,
                // Adding a character to b
                dists[i][j-1] + 1,
                // Substituting a character from a to b
                dists[i-1][j-1] + cost
            )
        }
    }

    return dists.last!.last!
}

We can use our test cases again to verify that our improved implementation is correct:

print(opti_leven_distance(a: "123", b: "12"))

// Print out the distances for our test case
let first_word = "kitten"
let test_words = ["smitten", "mitten", "kitty", "fitting", "written"]

for word in test_words {
    let dist = opti_leven_distance(a: first_word, b: word)
    print("Distance between \(first_word) and \(word): \(dist)")
}
1
Distance between kitten and smitten: 2
Distance between kitten and mitten: 1
Distance between kitten and kitty: 2
Distance between kitten and fitting: 3
Distance between kitten and written: 2

Swift and Python Implementations

Distance.playground:


import Foundation


/// Calculates the Levenshtein distance between two strings
/// - Parameter a: The first string
/// - Parameter b: The second string
func levenshtein_distance(a: String, b: String) -> Int {
    // If either array is empty, return the length of the other array
    if (a.count == 0) {
        return b.count
    }
    if (b.count == 0) {
        return a.count
    }

    // Check whether the last items are the same before testing the other items
    let cost = (a.last == b.last) ? 0 : 1

    let a_dropped = String(a.dropLast())
    let b_dropped = String(b.dropLast())

    return min(
        // Find the distance if an item in a is removed
        levenshtein_distance(a: a_dropped, b: b) + 1,
        // Find the distance if an item is removed from b (i.e. added to a)
        levenshtein_distance(a: a, b: b_dropped) + 1,
        // Find the distance if an item is removed from a and b (i.e. substituted)
        levenshtein_distance(a: a_dropped, b: b_dropped) + cost
    )
}

/// String extension to add substring by Int (such as a[i-1])
extension String {
    subscript (i: Int) -> Character {
      return self[index(startIndex, offsetBy: i)]
    }
}

/// A more optimized version of the Levenshtein distance function using an array of previously calculated distances
/// - Parameter a: The first string
/// - Parameter b: The second string
func opti_leven_distance(a: String, b: String) -> Int {
    // Check for empty strings first
    if (a.count == 0) {
        return b.count
    }
    if (b.count == 0) {
        return a.count
    }

    // Create an empty distance matrix with dimensions len(a)+1 x len(b)+1
    var dists = Array(repeating: Array(repeating: 0, count: b.count+1), count: a.count+1)

    // a's default distances are calculated by removing each character
    for i in 1...(a.count) {
        dists[i][0] = i
    }
    // b's default distances are calulated by adding each character
    for j in 1...(b.count) {
        dists[0][j] = j
    }

    // Find the remaining distances using previous distances
    for i in 1...(a.count) {
        for j in 1...(b.count) {
            // Calculate the substitution cost
            let cost = (a[i-1] == b[j-1]) ? 0 : 1

            dists[i][j] = min(
                // Removing a character from a
                dists[i-1][j] + 1,
                // Adding a character to b
                dists[i][j-1] + 1,
                // Substituting a character from a to b
                dists[i-1][j-1] + cost
            )
        }
    }

    return dists.last!.last!
}

/// Function to test whether the distance function is working correctly
/// - Parameter a: The first test string
/// - Parameter b: The second test string
/// - Parameter answer: The expected answer to be returned by the distance function
func test_distance(a: String, b: String, answer: Int) -> Bool {
    let d = opti_leven_distance(a: a, b: b)

    if (d != answer) {
        print("a: \(a)")
        print("b: \(b)")
        print("expected: \(answer)")
        print("distance: \(d)")
        return false
    } else {
        return true
    }
}

// Test the distance function with many different examples
test_distance(a: "", b: "", answer: 0)
test_distance(a: "1", b: "1", answer: 0)
test_distance(a: "1", b: "2", answer: 1)
test_distance(a: "12", b: "12", answer: 0)
test_distance(a: "123", b: "12", answer: 1)
test_distance(a: "1234", b: "1", answer: 3)
test_distance(a: "1234", b: "1233", answer: 1)
test_distance(a: "1248", b: "1349", answer: 2)
test_distance(a: "", b: "12345", answer: 5)
test_distance(a: "5677", b: "1234", answer: 4)
test_distance(a: "123456", b: "12345", answer: 1)
test_distance(a: "13579", b: "12345", answer: 4)
test_distance(a: "123", b: "", answer: 3)
test_distance(a: "kitten", b: "mittens", answer: 2)

print(opti_leven_distance(a: "123", b: "12"))

// Print out the distances for our test case
let first_word = "kitten"
let test_words = ["smitten", "mitten", "kitty", "fitting", "written"]

for word in test_words {
    let dist = opti_leven_distance(a: first_word, b: word)
    print("Distance between \(first_word) and \(word): \(dist)")
}

Here’s a Python implementation of the Swift code above as distance.py. The Python version also can handle any list as well as any str

# Calculates the Levenshtein distance between two strings
def levenshtein_distance(a, b):
    # If either array is empty, return the length of the other array
    if not len(a):
        return len(b)
    if not len(b):
        return len(a)

    # Check whether the last items are the same before testing the other items
    if a[-1] == b[-1]:
        cost = 0
    else:
        cost = 1

    return min(
        # Find the distance if an item in a is removed
        levenshtein_distance(a[:-1], b) + 1,
        # Find the distance if an item is removed from b (i.e. added to a)
        levenshtein_distance(a, b[:-1]) + 1,
        # Find the distance if an item is removed from a and b (i.e. substituted)
        levenshtein_distance(a[:-1], b[:-1]) + cost
    )

# A more optimized version of the Levenshtein distance function using an array of previously calculated distances
def opti_leven_distance(a, b):
    # Create an empty distance matrix with dimensions len(a)+1 x len(b)+1
    dists = [ [0 for _ in range(len(b)+1)] for _ in range(len(a)+1) ]

    # a's default distances are calculated by removing each character
    for i in range(1, len(a)+1):
        dists[i][0] = i
    # b's default distances are calulated by adding each character
    for j in range(1, len(b)+1):
        dists[0][j] = j

    # Find the remaining distances using previous distances
    for i in range(1, len(a)+1):
        for j in range(1, len(b)+1):
            # Calculate the substitution cost
            if a[i-1] == b[j-1]:
                cost = 0
            else:
                cost = 1

            dists[i][j] = min(
                # Removing a character from a
                dists[i-1][j] + 1,
                # Adding a character to b
                dists[i][j-1] + 1,
                # Substituting a character from a to b
                dists[i-1][j-1] + cost
            )

    return dists[-1][-1]

# Function to test whether the distance function is working correctly
def test_distance(a, b, answer):
    dist = opti_leven_distance(a, b)

    if dist != answer:
        print('a:', a)
        print('b:', b)
        print('expected:', answer)
        print('distance:', dist)
        print()

if __name__ == '__main__':
    # Test the distance function with many different examples
    test_distance('', '', 0)
    test_distance('1', '1', 0)
    test_distance('1', '2', 1)
    test_distance('12', '12', 0)
    test_distance('123', '12', 1)
    test_distance('1234', '1', 3)
    test_distance('1234', '1233', 1)
    test_distance([1, 2, 4, 8], [1, 3, 4, 16], 2)
    test_distance('', '12345', 5)
    test_distance([5, 6, 7, 7], [1, 2, 3, 4], 4)
    test_distance([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5], 1)
    test_distance([1, 3, 5, 7, 9], [1, 2, 3, 4, 5], 4)
    test_distance([1, 2, 3], [], 3)
    test_distance('kitten', 'mittens', 2)



    first_word = 'kitten'
    test_words = ['smitten', 'mitten', 'kitty', 'fitting', 'written']
    for word in test_words:
        dist = opti_leven_distance(first_word, word)
        print(f'Distance between {first_word} and {word}: {dist}')

Announcing Darkscreen - A Dark App

I’m so excited to announce that my first iOS app, Darkscreen - A Dark App, has a public beta on Testflight! Ever since I was given my first iPod (all the way back in 7th grade!) I’ve dreamed of creating something that millions of people have the ability to enjoy, and I can’t express how excited I am. Here’s the official description:

Darkscreen allows you to use other iPad apps in Split View without any distractions, no hassle.

Darkscreen provides multiple themes, including:

  • Dark
  • Light
  • 80s
  • 90s
  • Outrun

Download using Testflight today!

Why Darkscreen?

I really love using Apollo for Reddit by Christian Selig, but he hasn’t gotten a chance to create a true iPad experience for his Reddit client yet. I use Darkscreen next to Apollo in Split View so that Apollo can be in an iPhone-sized container while keeping the rest of the screen black.

For example, posts shown in Apollo don’t quite look right when in full horizontal mode on iPad:

Apollo in full horizontal mode

Now with Darkscreen, I can browse Apollo in its intended view size without being distracted by other apps:

Apollo in Split View with Darkscreen

Switching to a new theme in Darkscreen automatically updates the table view as well as the root view underneath:

Darkscreen switching themes

My next goal, of course, is for Darkscreen to respond to the system-wide Dark Mode setting.

Why Open Source?

I found it an interesting challenge to modify the appearance of all of all views in the app immediately after a user selects a theme in a UITableView, and I hope this brief example can help other developers implement their own theme system.

Even though iOS 13 introduces system-wide Dark Mode, this example app can be helpful to support any custom themes that go beyond the default dark and light styles.

How to Update the Theme for a View

I’ve implemented the theme system using a Settings Bundle, so the BaseViewController can subscribe to settings (theme) changes:

func registerForSettingsChange() {
    NotificationCenter.default.addObserver(self,
                                            selector: #selector(BaseViewController.settingsChanged),
                                            name: UserDefaults.didChangeNotification,
                                            object: nil)
}

A Theme corresponds to UI styles and colors:

class Theme {

    // ...

    init(_ name: String, statusBar: UIStatusBarStyle, background: UIColor, primary: UIColor, secondary: UIColor) {
        self.name = name
        statusBarStyle = statusBar
        backgroundColor = background
        primaryColor = primary
        secondaryColor = secondary
    }
}

When a setting changes, BaseViewController updates its UI elements:

@objc func settingsChanged() {
    updateTheme()
}

func updateTheme() {
    // Status bar
    setNeedsStatusBarAppearanceUpdate()

    // Background color
    self.view.backgroundColor = Settings.shared.theme.backgroundColor

    // Navigation bar
    self.navigationController?.navigationBar.updateTheme()
}

And UINavigationBar is extended to support theme switching:

public extension UINavigationBar {
    func updateTheme() {
        // Background color
        barTintColor = Settings.shared.theme.backgroundColor

        // Bar item color
        tintColor = Settings.shared.theme.secondaryColor

        // Title text color
        titleTextAttributes = [NSAttributedString.Key.foregroundColor: Settings.shared.theme.secondaryColor]

        // Status bar style
        barStyle = Settings.shared.theme.navbarStyle

        // Tell the system to update it
        setNeedsDisplay()
    }
}

How to Build a Song Recommender Using Create ML MLRecommender

Beta Warning

This example was written using macOS Catalina Version 10.15 Beta and Xcode Version 11.0 beta 5. Changes may have been made to the MLRecommender constructor since this article was written (October 2019).

Objective

By the end of this post, we’ll learn how to use the Create ML MLRecommender to recommend a song to a user given their listening history. We’ll also learn how to parse and prepare an MLDataTable using Python and data from a third party.

Introduction to MLRecommender

A personalized recommendation system can be used in many different applications, such as a music player, video player, or social media site. A machine learning recommendation system compares a user’s past activity to a large library of activity from many other users. For example, if Spotify wanted to recommend you a new Daily Mix, their ML recommendation system might look at your listening history for the past few weeks and compare that history to your friends’ history. Our goal today is to create an MLRecommender to recommend songs to a user given their listening history.

The constructor for MLRecommender is:

init(trainingData: MLDataTable, userColumn: String, itemColumn: String, ratingColumn: String? = nil, parameters: MLRecommender.ModelParameters = ModelParameters()) throws

Creating the Data Tables

The first step is to create the trainingData in the form of an MLDataTable. In this case, our training data is the listening history of many different users from the Million Song Dataset, which holds the metadata for over a million songs and ratings provided by users.

We’ll use two files from the dataset. The first is 1000.txt, which contains the user id, song id, and listen time for 10000 records. We’ll call that history.txt from now on. The second is song_data.csv, which contains the song id, title, release date and artist name. We’ll call that songs.csv from now on. All of the complete files for this tutorial can be found at the end of the post.

Here’s what our input files look like. Notice that songs.csv has a header row while history.txt does not:

# history.txt

b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOAKIMP12A8C130995	1
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBBMDR12A8C13253B	2
b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBXHDL12A81C204C0	1
...
# songs.csv

song_id,title,release,artist_name,year
SOQMMHC12AB0180CB8,"Silent Night","Monster Ballads X-Mas","Faster Pussy cat",2003
SOVFVAK12A8C1350D9,"Tanssi vaan","Karkuteillä",Karkkiautomaatti,1995
SOGTUKN12AB017F4F1,"No One Could Ever",Butter,"Hudson Mohawke",2006
...

We’ll be using the pandas Python library to handle our CSV data. First, download the files above and name them history.txt and songs.csv, and we’ll load them:

import csv
import pandas as pd

history_file = 'history.txt' # 'https://static.turi.com/datasets/millionsong/10000.txt'
songs_metadata_file = 'songs.csv' # 'https://static.turi.com/datasets/millionsong/song_data.csv'

# Import the files
history_df = pd.read_table(history_file, header=None)
history_df.columns = ['user_id', 'song_id', 'listen_count']
metadata_df =  pd.read_csv(songs_metadata_file)

songs.csv already has the column headers in the file, so we didn’t need to add those like we did with history_df. This is what our dataframes now look like:

# history_df

                                    user_id             song_id  listen_count
0  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAKIMP12A8C130995             1
1  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBBMDR12A8C13253B             2
2  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBXHDL12A81C204C0             1
...
# metadata_df
# (The '\' means that the row continues onto the next lines)

              song_id              title                release  \
0  SOQMMHC12AB0180CB8       Silent Night  Monster Ballads X-Mas
1  SOVFVAK12A8C1350D9        Tanssi vaan            Karkuteillä
2  SOGTUKN12AB017F4F1  No One Could Ever                 Butter

        artist_name  year
0  Faster Pussy cat  2003
1  Karkkiautomaatti  1995
2    Hudson Mohawke  2006
...

Next, to create a single listening history for all users, we want to merge the song data in the metadata_df to the listening history in the history_df and create a CSV to use in Swift. Let’s also add a column that combines the song title with the artist name so that we can see both in our MLRecommender:

# Merge the files into a single csv
song_df = pd.merge(history_df, metadata_df.drop_duplicates(['song_id']), on="song_id", how="left")
song_df.to_csv('merged_listen_data.csv', quoting=csv.QUOTE_NONNUMERIC)

# Add a "Title - Name" column for easier printing later
song_df['song'] = song_df['title'] + ' - ' + song_df['artist_name']

Here’s what our combined song dataframe now looks like:

# song_df

                                    user_id             song_id  listen_count  \
0  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAKIMP12A8C130995             1
1  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBBMDR12A8C13253B             2
2  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBXHDL12A81C204C0             1

             title              release    artist_name  year  \
0         The Cove   Thicker Than Water   Jack Johnson     0
1  Entre Dos Aguas  Flamenco Para Niños  Paco De Lucia  1976
2         Stronger           Graduation     Kanye West  2007

                              song
0          The Cove - Jack Johnson
1  Entre Dos Aguas - Paco De Lucia
2            Stronger - Kanye West
...

As of the time of writing, MLRecommender requires that the item id column in trainingData go from 1 to the number of items. In other words, if our trainingData included only three songs, merged_listen_data.csv would have song ids like SOQMMHC12AB0180CB8, SOVFVAK12A8C1350D9, and SOGTUKN12AB017F4F1, but we need to have song ids of 0, 1, and 2. Let’s add a new column to the CSV that uses incremental song ids from 0 to N:

# Find the unique song ids
song_ids = metadata_df.song_id.unique()

# Create a new dataframe of the unique song ids and a new incremental
# id for each one
incremental_id_df = pd.DataFrame({'song_id': song_ids})
incremental_id_df['incremental_song_id'] = incremental_id_df.index

# Merge the original song metadata with the incremental ids
new_song_id_df = pd.merge(song_id_df, incremental_id_df, on='song_id', how='left')
new_song_id_df.to_csv('songs_incremental_id.csv', quoting=csv.QUOTE_NONNUMERIC)

# Create a new merged history and song metadata CSV with incremental ids
new_history_df = pd.merge(history_df, incremental_id_df, on='song_id', how='inner')
new_history_df.to_csv('merged_listen_data_incremental_song_id.csv', quoting=csv.QUOTE_NONNUMERIC)

Here’s what our new song CSV file looks like. Notice that there’s now an added column at the beginning with a song id from 0 to 999999:

# songs_incremental_id.csv

"","song_id","title","release","artist_name","year","incremental_song_id"
0,"SOQMMHC12AB0180CB8","Silent Night","Monster Ballads X-Mas","Faster Pussy cat",2003,0
1,"SOVFVAK12A8C1350D9","Tanssi vaan","Karkuteillä","Karkkiautomaatti",1995,1
2,"SOGTUKN12AB017F4F1","No One Could Ever","Butter","Hudson Mohawke",2006,2
...

And here’s what our final merged listening data looks like with the incremental ids, ready to be read by the MLRecommender:

# merged_listen_data_incremental_song_id.csv

"","Unnamed: 0","user_id","song_id","listen_count","title","release","artist_name","year","incremental_song_id"
0,0,"b80344d063b5ccb3212f76538f3d9e43d87dca9e","SOAKIMP12A8C130995",1,"The Cove","Thicker Than Water","Jack Johnson",0,397069
1,18887,"7c86176941718984fed11b7c0674ff04c029b480","SOAKIMP12A8C130995",1,"The Cove","Thicker Than Water","Jack Johnson",0,397069
2,21627,"76235885b32c4e8c82760c340dc54f9b608d7d7e","SOAKIMP12A8C130995",3,"The Cove","Thicker Than Water","Jack Johnson",0,397069
...

Now we’re ready to load it into the recommender!

Using MLRecommender

Create a new Swift Playground, and add the two CSVs merged_listen_data_incremental_song_id.csv and songs_incremental_id.csv as resources to your Playground. For help on adding resources to a Swift Playground, check out this post. Make sure your Swift Playground is a blank macOS Playground and not an iOS Playground. Because our MLRecommender will only give us the user id and incremental song id when generating recommendations, we’ll use the second CSV to view the song titles.

First, let’s load the merged listening history with incremental ids:

import Foundation
import CreateML

// Create an MLDataTable from the merged CSV data
let history_csv = Bundle.main.url(forResource: "merged_listen_data_incremental_song_id", withExtension: "csv")!
let history_table = try MLDataTable(contentsOf: history_csv)
print(history_table)
Columns:
    X1	string
    Unnamed: 0	integer
    user_id	string
    song_id	string
    listen_count	integer
    title	string
    release	string
    artist_name	string
    year	integer
    incremental_song_id	integer
Rows: 2000000
Data:
+----------------+----------------+----------------+----------------+----------------+
| X1             | Unnamed: 0     | user_id        | song_id        | listen_count   |
+----------------+----------------+----------------+----------------+----------------+
| 0              | 0              | b80344d063b5...| SOAKIMP12A8C...| 1              |
| 1              | 18887          | 7c8617694171...| SOAKIMP12A8C...| 1              |
| 2              | 21627          | 76235885b32c...| SOAKIMP12A8C...| 3              |
| 3              | 27714          | 250c0fa2a77b...| SOAKIMP12A8C...| 1              |
| 4              | 34428          | 3f73f44560e8...| SOAKIMP12A8C...| 6              |
| 5              | 34715          | 7a4b8e7d2905...| SOAKIMP12A8C...| 6              |
| 6              | 55885          | b4a678fb729b...| SOAKIMP12A8C...| 2              |
| 7              | 65683          | 33280fc74b16...| SOAKIMP12A8C...| 1              |
| 8              | 75029          | be21ec120193...| SOAKIMP12A8C...| 1              |
| 9              | 105313         | 6fbb9ff93663...| SOAKIMP12A8C...| 2              |
+----------------+----------------+----------------+----------------+----------------+
+----------------+----------------+----------------+----------------+---------------------+
| title          | release        | artist_name    | year           | incremental_song_id |
+----------------+----------------+----------------+----------------+---------------------+
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
+----------------+----------------+----------------+----------------+---------------------+
[2000000 rows x 10 columns]

From there, we can create an MLRecommender. Our trainingData is the data table format of the merged listening history CSV, the userColumn is the user_id column name and the itemColumn is the incremental_song_id column name. The user_id of b80344d063b5ccb3212f76538f3d9e43d87dca9e was randomly picked from the merged CSV data:=.

// Generate recommendations
let recommender = try MLRecommender(trainingData: history_table, userColumn: "user_id", itemColumn: "incremental_song_id")
let recs = try recommender.recommendations(fromUsers: ["b80344d063b5ccb3212f76538f3d9e43d87dca9e"])
print(recs)
Columns:
    user_id	string
    incremental_song_id	integer
    score	float
    rank	integer
Rows: 10
Data:
+----------------+---------------------+----------------+----------------+
| user_id        | incremental_song_id | score          | rank           |
+----------------+---------------------+----------------+----------------+
| b80344d063b5...| 114557              | 0.0461493      | 1              |
| b80344d063b5...| 834311              | 0.0436045      | 2              |
| b80344d063b5...| 939015              | 0.043068       | 3              |
| b80344d063b5...| 955047              | 0.0427589      | 4              |
| b80344d063b5...| 563380              | 0.0426116      | 5              |
| b80344d063b5...| 677759              | 0.0423951      | 6              |
| b80344d063b5...| 689170              | 0.0418951      | 7              |
| b80344d063b5...| 333053              | 0.041788       | 8              |
| b80344d063b5...| 381319              | 0.0403042      | 9              |
| b80344d063b5...| 117491              | 0.0400819      | 10             |
+----------------+---------------------+----------------+----------------+
[10 rows x 4 columns]

But we want to know the song metadata associated with each recommended incremental_song_id. Let’s load the song metadata table and join the recommendations with the song metadata using the incremental id:

// Use the songs data CSV to print the recommended song titles
let songs_csv = Bundle.main.url(forResource: "songs_incremental_id", withExtension: "csv")!
let songs_table = try MLDataTable(contentsOf: songs_csv)
print(songs_table)

let song_title_recs = recs.join(with: songs_table, on: "incremental_song_id")
print(song_title_recs)
Columns:
    X1	string
    song_id	string
    title	undefined
    release	string
    artist_name	string
    year	integer
    incremental_song_id	integer
Rows: 1000000
Data:
+----------------+----------------+----------------+----------------+----------------+
| X1             | song_id        | title          | release        | artist_name    |
+----------------+----------------+----------------+----------------+----------------+
| 0              | SOQMMHC12AB0...| Silent Night   | Monster Ball...| Faster Pussy...|
| 1              | SOVFVAK12A8C...| Tanssi vaan    | Karkuteillä   | Karkkiautoma...|
| 2              | SOGTUKN12AB0...| No One Could...| Butter         | Hudson Mohawke |
| 3              | SOBNYVR12A8C...| Si Vos Querés | De Culo        | Yerba Brava    |
| 4              | SOHSBXH12A8C...| Tangle Of As...| Rene Ablaze ...| Der Mystic     |
| 5              | SOZVAPQ12A8C...| Symphony No....| Berwald: Sym...| David Montgo...|
| 6              | SOQVRHI12A6D...| We Have Got ...| Strictly The...| Sasha / Turb...|
| 7              | SOEYRFT12AB0...| 2 Da Beat Ch...| Da Bomb        | Kris Kross     |
| 8              | SOPMIYT12A6D...| Goodbye        | Danny Boy      | Joseph Locke   |
| 9              | SOJCFMH12A8C...| Mama_ mama c...| March to cad...| The Sun Harb...|
+----------------+----------------+----------------+----------------+----------------+
+----------------+---------------------+
| year           | incremental_song_id |
+----------------+---------------------+
| 2003           | 0                   |
| 1995           | 1                   |
| 2006           | 2                   |
| 2003           | 3                   |
| 0              | 4                   |
| 0              | 5                   |
| 0              | 6                   |
| 1993           | 7                   |
| 0              | 8                   |
| 0              | 9                   |
+----------------+---------------------+
[1000000 rows x 7 columns]


Columns:
    user_id	string
    incremental_song_id	integer
    score	float
    rank	integer
    X1	string
    song_id	string
    title	undefined
    release	string
    artist_name	string
    year	integer
Rows: 11
Data:
+----------------+---------------------+----------------+----------------+----------------+
| user_id        | incremental_song_id | score          | rank           | X1             |
+----------------+---------------------+----------------+----------------+----------------+
| b80344d063b5...| 114557              | 0.0461493      | 1              | 114578         |
| b80344d063b5...| 117491              | 0.0400819      | 10             | 117512         |
| b80344d063b5...| 333053              | 0.041788       | 8              | 333174         |
| b80344d063b5...| 381319              | 0.0403042      | 9              | 381465         |
| b80344d063b5...| 381319              | 0.0403042      | 9              | 444615         |
| b80344d063b5...| 563380              | 0.0426116      | 5              | 563705         |
| b80344d063b5...| 677759              | 0.0423951      | 6              | 678222         |
| b80344d063b5...| 689170              | 0.0418951      | 7              | 689654         |
| b80344d063b5...| 834311              | 0.0436045      | 2              | 834983         |
| b80344d063b5...| 939015              | 0.043068       | 3              | 939863         |
+----------------+---------------------+----------------+----------------+----------------+
+----------------+----------------+----------------+----------------+----------------+
| song_id        | title          | release        | artist_name    | year           |
+----------------+----------------+----------------+----------------+----------------+
| SOHENSJ12AAF...| Great Indoors  | Room For Squ...| John Mayer     | 0              |
| SOOGZYY12A67...| Crying Shame   | In Between D...| Jack Johnson   | 2005           |
| SOGFKJE12A8C...| Sun It Rises   | Fleet Foxes    | Fleet Foxes    | 2008           |
| SOECLAD12AAF...| St. Patrick'...| Room For Squ...| John Mayer     | 0              |
| SOECLAD12AAF...| St. Patrick'...| Room For Squ...| John Mayer     | 0              |
| SOAYTRA12A8C...| All At Once    | Sleep Throug...| Jack Johnson   | 2008           |
| SOKLVUI12A67...| If I Could     | In Between D...| Jack Johnson   | 2005           |
| SOYIJIL12A67...| Posters        | Brushfire Fa...| Jack Johnson   | 2000           |
| SORKFWO12A8C...| Quiet Houses   | Fleet Foxes    | Fleet Foxes    | 2008           |
| SOJAMXH12A8C...| Meadowlarks    | Fleet Foxes    | Fleet Foxes    | 2008           |
+----------------+----------------+----------------+----------------+----------------+
[11 rows x 10 columns]

The last table printed has our recommended songs, and the first one is “Great Indoors”! We can now use our MLRecommender for other user ids.

Wrap Up

First, we took a look at the MLRecommender constructor. Then, we gathered song data from the Million Song Dataset. We modified the dataset to increase legibility and added incremental ids for the song metadata. We loaded the song metadata and listening history into a Swift Playground, created an MLRecommender from the listening history and generated recommended songs. Then, we used the song metadata to join the recommended songs to their titles and artists.

Source Files

Each of the files mentioned in this tutorial can be found here, including:

  • songs.csv: Metadata for one million songs
  • history.txt: Song listening history for multiple users
  • data-parser.py: Python code to manipulate the Million Song Dataset
  • merged_listed_data.csv: Merged dataset of song metadata and listening history
  • merged_listed_data_incremental_song_id.csv: merged_listed_data.csv with incremental ids added
  • songs_incremental_id.csv: songs.csv with incremental ids added
  • MusicRecommender.playground: Swift Playground for creating the MLRecommender