We first need to find our data. Luckily, Mac users have easy access via the chat.db file saved on their computer.
First, make sure that your iPhone's messages are backed up to the cloud. On your phone, go to settings>profile>iCloud and check that the switch next to Messages is on. On your Mac, open the Messages app, go to preferences>iMessage, and click "Enable Messages in iCloud". It's quite likely that both are already on from when you set up your devices.
If you search for chat.db on your Mac, you will find it elusive still. To open it, first go to System Preferences > Security & Privacy > Privacy. Now click 'Full Disk Access' and grant access to Terminal. chat.db is now available!
Programming in Python, I've found the sqlite3 library most effective for perusing the file. In your notebook or environment, run the function below.
db_filepath should be your path to chat.db. By default, this is likely /Users/yourname/Library/Messages/chat.db. chat_id is the arbitrary ID of the specific conversation you are hoping to access. The easiest way to find it is to try different values or to open the db file in a database browser (I use 'DB Browser for SQLite').
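If guessing at chat_id gets tedious, one option is to ask the database itself. The sketch below is not part of the original workflow: list_chats is a hypothetical helper, and it assumes chat.db has a chat table with chat_identifier and display_name columns alongside the chat_message_join table used in this post. Check those names in your database browser before relying on it.

```python
import sqlite3
from pathlib import Path

def list_chats(db_filepath):
    """Return (chat_id, chat_identifier, display_name, n_messages) rows, busiest chat first."""
    # Open read-only so the live Messages database cannot be modified by accident
    uri = Path(db_filepath).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # Assumed schema: chat(ROWID, chat_identifier, display_name)
        query = """
            SELECT c.ROWID, c.chat_identifier, c.display_name,
                   COUNT(j.message_id) AS n_messages
            FROM chat c
            LEFT JOIN chat_message_join j ON j.chat_id = c.ROWID
            GROUP BY c.ROWID
            ORDER BY n_messages DESC
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

The first value in each row is the chat_id to pass to get_messages, which follows.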
import pandas as pd
import sqlite3

def get_messages(db_filepath, chat_id):
    # Pull every message in one conversation, oldest first
    conn = sqlite3.connect(db_filepath)
    c = conn.cursor()
    cmd1 = '''SELECT T1.ROWID, text, handle_id, guid, associated_message_guid, associated_message_type, balloon_bundle_id,
              date as date_utc
              FROM message T1
              INNER JOIN chat_message_join T2
              ON T2.chat_id=?
              AND T1.ROWID=T2.message_id
              ORDER BY T1.date'''
    # Pass chat_id as a bound parameter instead of formatting it into the SQL string
    c.execute(cmd1, (chat_id,))
    df = pd.DataFrame(c.fetchall(), columns=['id', 'text', 'sender', 'guid', 'associated_message_guid', 'react_type', 'id_ext', 'time'])
    conn.close()
    return df
This function returns a pandas dataframe with messages and their metadata filling each row. We need to clean it further.
Here's a function to convert the time column into a datetime format we can work with:
import datetime
import time

def convert_time(time_column):
    # Converting the time column to a readable format
    time_column = pd.to_numeric(time_column, downcast="integer")
    m = time_column.tolist()
    # Nanoseconds to seconds, then shift from the 2001-01-01 epoch to the Unix epoch
    n = [((w/1000000000) + 978307200) for w in m]
    time1 = [time.localtime(i) for i in n]
    timestamp = [time.mktime(o) for o in time1]
    dt = [datetime.datetime.fromtimestamp(q) for q in timestamp]
    return dt
df["time"] = convert_time(df["time"])
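The constant 978307200 in convert_time can look like magic. It is the Unix timestamp of midnight UTC on 1 January 2001, the reference date the stored values count from. The sketch below shows the same conversion for a single value using only the standard library; apple_ns_to_datetime is a hypothetical name, and it assumes the value is in nanoseconds, as the division by 1000000000 above implies.

```python
from datetime import datetime, timedelta, timezone

# Reference date for the stored timestamps: midnight UTC on 1 January 2001
APPLE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)

def apple_ns_to_datetime(ns):
    """Convert nanoseconds since the 2001 epoch to a timezone-aware UTC datetime."""
    # datetime only resolves microseconds, so the last three digits are dropped
    return APPLE_EPOCH + timedelta(microseconds=ns // 1000)

# The offset used in convert_time is the 2001 epoch expressed as a Unix timestamp
assert APPLE_EPOCH.timestamp() == 978307200

example = apple_ns_to_datetime(600_000_000 * 10**9)
print(example.isoformat())               # the instant in UTC
print(example.astimezone().isoformat())  # the same instant in your local timezone
```

Unlike convert_time, this keeps the timezone attached to the result, which can help if you later compare chats across daylight saving changes.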
Here's a function that will add a column "msg_type" to your dataframe identifying the type of message being sent. iMessage has reactions, polls, stickers, and many more.
special_msg_types = {
    "com.apple.messages.URLBalloonProvider": "URL",
    "com.apple.messages.MSMessageExtensionBalloonPlugin:EWFNLB79LQ:com.gamerdelights.gamepigeon.ext": "game_pig",
    "com.apple.messages.MSMessageExtensionBalloonPlugin:H5DMREJLBF:com.nearfuturespecialists.imessagepoll.MessagesExtension": "poll",
    "com.apple.messages.MSMessageExtensionBalloonPlugin:Z7Z74TFKB7:com.efrac.poll.MessagesExtension": "poll",
    "com.apple.messages.MSMessageExtensionBalloonPlugin:HV6K4MJNS7:com.rapgenius.RapGenius.LyricCardMaker": "genius",
    "com.apple.DigitalTouchBalloonProvider": "draw",
    "com.apple.Handwriting.HandwritingProvider": "draw",
    "com.apple.messages.MSMessageExtensionBalloonPlugin:0000000000:com.apple.mobileslideshow.PhotosMessagesApp": "photo_share"
}

reacts = {
    0: "msg",
    3: "game_pig",
    3002: "rem_dislike",
    2002: "dislike",
    3001: "rem_like",
    2001: "like",
    2003: "laugh",
    3003: "rem_laugh",
    2000: "love",
    3000: "rem_love",
    2004: "emph",
    3004: "rem_emph",
    2005: "question",
    3005: "rem_question",
    1000: "sticker"
}
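def fill_types(df):
    # Labelling types of messages
    df["msg_type"] = df["id_ext"].replace(special_msg_types)
    del df["id_ext"]
    # Where no bundle ID was set, fall back to the reaction code
    df["msg_type"] = df["msg_type"].fillna(df["react_type"].replace(reacts))
    del df["react_type"]
    # Text consisting only of U+FFFC (the object replacement character) is labelled as an image
    df.loc[df.text == '\ufffc', 'msg_type'] = "img"
    return df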
df = fill_types(df)
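The reaction codes in the reacts table follow a visible pattern: 2000 through 2005 add a reaction, and the same code plus 1000 removes it. As a sketch of that pattern (inferred only from the table in this post, not from any official documentation), a hypothetical decode_reaction helper can split a code into the reaction name and whether it was removed:

```python
# Last digit of the code, as listed in the reacts table
REACTION_NAMES = {0: "love", 1: "like", 2: "dislike", 3: "laugh", 4: "emph", 5: "question"}

def decode_reaction(code):
    """Return (name, removed) for a reaction code, or None for any other code."""
    if not 2000 <= code <= 3005:
        return None
    thousands, kind = divmod(code, 1000)
    if kind not in REACTION_NAMES:
        return None
    # 2xxx adds the reaction, 3xxx takes it back
    return REACTION_NAMES[kind], thousands == 3

print(decode_reaction(2001))  # → ('like', False)
print(decode_reaction(3003))  # → ('laugh', True)
print(decode_reaction(1000))  # → None (a sticker, not a reaction)
```

Keeping the name and the removed flag separate makes it easier to net out reactions that were added and later withdrawn.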
In your df, "associated_message_guid" is used to identify the root message for reaction messages. Unfortunately, it does so using the convoluted GUIDs associated with each message. Here's a function that replaces those GUIDs with simple numbers via enumeration:
def enum(df):
    # Generating new, enumerated guid # and associated guid #
    mapping = {item: i for i, item in enumerate(df["guid"].unique(), start=1)}
    # Match on the last 36 characters; rows with no root message end up as -1
    df["associated_idx"] = df["associated_message_guid"].str[-36:].map(mapping).fillna(0).astype(int) - 1
    return df
df = enum(df)
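To see what the enumeration does to a single value, here is a plain-Python sketch with made-up GUIDs. It assumes that a reaction row stores the root message's GUID with a short prefix in front (something like `p:0/`, but check your own data), which would explain why only the last 36 characters are kept. root_index is a hypothetical helper, not part of the code above.

```python
def root_index(associated_guid, guid_to_idx):
    """Return the zero-based position of the root message, or -1 if there is none."""
    if not associated_guid:
        return -1
    # A GUID in the usual 8-4-4-4-12 layout is 36 characters long
    return guid_to_idx.get(associated_guid[-36:], -1)

# Made-up GUIDs standing in for the "guid" column, in message order
guids = [
    "11111111-2222-3333-4444-555555555555",
    "AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE",
]
guid_to_idx = {g: i for i, g in enumerate(guids)}

print(root_index("p:0/" + guids[1], guid_to_idx))  # → 1
print(root_index(None, guid_to_idx))               # → -1
```

The result lines up with the dataframe's default row index, so a reaction with associated_idx 1 points at the second message in the conversation.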
Now we can pull just the columns that we need:
df = df[['time', 'sender', 'text', 'msg_type', 'associated_idx']]
Voila! Dataframe wrangled.