
WhatsApp Chat Log Parsing With Regex

I'm trying to parse a WhatsApp chat log using regex. I have a solution that works for most cases and I'd like to improve it, but I don't know how, since I am quite new to regex.

Solution 1:

I would pre-process each line to remove the consecutive colons before applying the regex. For example:

# Example line with a stray "::" after the sender's name
line = "[06.12.16, 16:47:22] Person Two: ::"
line = line.replace("::", "")

which would give:

[06.12.16, 16:47:22] Person Two: 

You can then call your regex function on the pre-processed data.

Solution 2:

I encountered similar issues when building a tool to analyze WhatsApp chats.

The main issue is that the format of chat.txt depends on your system language. In German you get a 24-hour time such as 16:47, while in English the time may carry an AM/PM suffix, and the date order changes for American users.

The library I used has the four regexes below. So far they have covered every case I have come across (Latin languages).

General message filter:

const regexParser = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? (.+?): ([^]*)/i;
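As a rough sketch of how this pattern might be applied, the snippet below runs it over two made-up sample lines: one in the bracketed 24-hour style and one in a 12-hour style with a PM suffix. The `parseMessage` helper and its field names are hypothetical, chosen only for illustration; they are not taken from the library.

```javascript
// Pattern copied from the line above.
const regexParser = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? (.+?): ([^]*)/i;

// Hypothetical helper: returns the captured fields, or null when the line does not match.
function parseMessage(line) {
  const match = regexParser.exec(line);
  if (match === null) return null;
  const [, date, time, ampm, author, message] = match;
  // The AM/PM group is undefined for 24-hour exports; normalise it to null.
  return { date, time, ampm: ampm || null, author, message };
}

// Made-up sample lines in two export styles.
console.log(parseMessage("[06.12.16, 16:47:22] Person Two: Hello there"));
console.log(parseMessage("12/6/16, 4:47 PM - Person One: Hi"));
```

The five capture groups are, in order, the date, the time, the optional AM/PM marker, the author and the message text. A line that does not start with a date, such as the second line of a multi-line message, returns null here and would need separate handling.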

System message filter:

const regexParserSystem = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? ([^]+)/i;
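One way the two patterns could be combined is sketched below. The system pattern has no author group, so it also matches ordinary messages; the sketch therefore tries the author pattern first and only falls back to the system pattern. The `classifyLine` helper, its return shape and the sample lines are assumptions made for illustration.

```javascript
// Both patterns copied from the lines above.
const regexParser = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? (.+?): ([^]*)/i;
const regexParserSystem = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? ([^]+)/i;

// Hypothetical helper: author pattern first, system pattern as the fallback.
function classifyLine(line) {
  const user = regexParser.exec(line);
  if (user !== null) {
    return { type: "message", author: user[4], text: user[5] };
  }
  const system = regexParserSystem.exec(line);
  if (system !== null) {
    return { type: "system", author: null, text: system[4] };
  }
  // Neither pattern matched, e.g. a line that does not start with a date.
  return { type: "unmatched", author: null, text: line };
}

// Made-up sample lines.
console.log(classifyLine("[06.12.16, 16:47:22] Person Two: Hello there"));
console.log(classifyLine("[06.12.16, 16:47:22] Messages to this group are now secured with end-to-end encryption."));
```

With this ordering, a system line that itself contains a colon followed by a space would be picked up by the author pattern, so the result may need a sanity check on the captured author.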

Date:

const regexSplitDate = /[-/.] ?/;
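A minimal sketch of how the split pattern might be used on the captured date string follows. The `splitDate` helper and the sample dates are made up for illustration. The split only separates the three numeric parts; it cannot tell which part is the day and which the month, since that depends on the export locale.

```javascript
// Pattern copied from the line above: a separator (-, / or .) with an optional trailing space.
const regexSplitDate = /[-/.] ?/;

// Hypothetical helper: split a captured date string into its numeric parts, in source order.
function splitDate(date) {
  return date.split(regexSplitDate).map(Number);
}

console.log(splitDate("06.12.16"));   // [ 6, 12, 16 ]
console.log(splitDate("12/6/16"));    // [ 12, 6, 16 ]
console.log(splitDate("2016-12-06")); // [ 2016, 12, 6 ]
```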

Handle attachments, which are wrapped in "< >" even when you export the chat without attachments (e.g. <media omitted>):

const regexAttachment = /<.+:(.+)>/;
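The sketch below shows one way the attachment pattern might be applied to a message body. The `attachmentName` helper and the `<attached: photo.jpg>` sample are assumptions made for illustration; the exact placeholder text depends on the platform and language of the export.

```javascript
// Pattern copied from the line above: requires a colon inside the angle brackets.
const regexAttachment = /<.+:(.+)>/;

// Hypothetical helper: returns the text after the colon, or null when there is no match.
function attachmentName(message) {
  const match = regexAttachment.exec(message);
  // The capture keeps the space after the colon, so trim it.
  return match === null ? null : match[1].trim();
}

console.log(attachmentName("<attached: photo.jpg>")); // photo.jpg
console.log(attachmentName("Hello there"));           // null
```

Because the pattern requires a colon, a placeholder without one is not matched as written and would need its own check.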
