WhatsApp Chat Log Parsing With Regex
I'm trying to parse a WhatsApp chat log using regex. I have a solution that works for most cases, but I'd like to improve it and don't know how, since I am quite new to regex.
Solution 1:
I would pre-process the list of lines to remove the consecutive colons before applying the regex. So for each line, e.g.
line = "[06.12.16, 16:47:22] Person Two: ::"
line = line.replace("::", "")
which gives:
[06.12.16, 16:47:22] Person Two:
You can then call your regex function on the pre-processed data.
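The pre-process-then-match idea could look roughly like the sketch below, written in JavaScript. The question's own regex isn't shown, so `linePattern` and `parseLine` are hypothetical stand-ins that only cover the bracketed format from the example line. Note that in JavaScript, `replace` with a plain string only replaces the first occurrence, so the sketch uses a regex with the `g` flag instead.

```javascript
// Hypothetical pattern for lines like "[06.12.16, 16:47:22] Person Two: text".
// It is an assumption for illustration, not the regex from the question.
const linePattern = /^\[(\d{2}\.\d{2}\.\d{2}), (\d{2}:\d{2}:\d{2})\] ([^:]+): ?(.*)$/;

function parseLine(rawLine) {
  // Remove every pair of consecutive colons before matching.
  const line = rawLine.replace(/::/g, "");
  const match = linePattern.exec(line);
  if (match === null) {
    return null; // the line does not look like the start of a message
  }
  const [, date, time, author, message] = match;
  return { date, time, author, message: message.trim() };
}

console.log(parseLine("[06.12.16, 16:47:22] Person Two: ::"));
```

One thing to keep in mind with this approach: the replacement also removes any "::" that legitimately appears inside a message.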
Solution 2:
I encountered similar issues when building a tool to analyze WhatsApp chats.
The main issue is that the format of the chat.txt depends on your system language. With German settings you get 24-hour times such as 16:47, but with English settings the time may carry AM/PM, and the order of day and month changes for American users.
The library I used relies on the four regexes below. So far they have covered every case I ran into (languages written in Latin script).
Filtering general:
const regexParser = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? (.+?): ([^]*)/i;
Filter System Messages:
const regexParserSystem = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? ([^]+)/i;
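As a rough sketch of how the two patterns might be combined: try the author pattern first and fall back to the system pattern when there is no "author: " part. `parseEntry` and the sample lines in the usage are made up for illustration; only the two regexes come from the text above.

```javascript
// The two patterns, copied unchanged from the answer.
const regexParser = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? (.+?): ([^]*)/i;
const regexParserSystem = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? ([^]+)/i;

// Hypothetical helper: returns the captured fields, with author set to
// null for system messages, or null if the text matches neither pattern.
function parseEntry(text) {
  const user = regexParser.exec(text);
  if (user) {
    const [, date, time, ampm, author, message] = user;
    return { date, time, ampm: ampm || null, author, message };
  }
  const system = regexParserSystem.exec(text);
  if (system) {
    const [, date, time, ampm, message] = system;
    return { date, time, ampm: ampm || null, author: null, message };
  }
  return null; // e.g. a continuation line of a multi-line message
}

console.log(parseEntry("[06.12.16, 16:47:22] Person Two: Hello"));
console.log(parseEntry("12/6/16, 4:47 PM - Person One added Person Two"));
```

Because the fallback depends on the absence of "author: ", a system message whose text happens to contain a colon followed by a space would be reported as a user message under this sketch.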
Date:
const regexSplitDate = /[-/.] ?/;
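The split pattern only separates the three date fields; it cannot tell which one is the day and which the month. A minimal sketch of turning the pieces into an ISO-style date follows. `normalizeDate`, its `order` parameter, and the rule that two-digit years belong to the 2000s are all assumptions made for illustration.

```javascript
const regexSplitDate = /[-/.] ?/;

// Hypothetical helper: the caller states the field order ("dmy", "mdy"
// or "ymd"), because the split alone cannot infer it.
function normalizeDate(dateText, order = "dmy") {
  const parts = dateText.split(regexSplitDate);
  if (parts.length !== 3) {
    return null;
  }
  const [a, b, c] = parts;
  let day, month, year;
  if (order === "dmy") {
    [day, month, year] = [a, b, c];
  } else if (order === "mdy") {
    [month, day, year] = [a, b, c];
  } else {
    [year, month, day] = [a, b, c];
  }
  // Assumption: a two-digit year means 20xx.
  if (year.length === 2) {
    year = "20" + year;
  }
  return `${year}-${month.padStart(2, "0")}-${day.padStart(2, "0")}`;
}

console.log(normalizeDate("06.12.16"));       // German-style export
console.log(normalizeDate("12/6/16", "mdy")); // American-style export
```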
Handle attachments, which are wrapped in "< >" even when you export the chat without attachments (e.g. <media omitted>):
const regexAttachment = /<.+:(.+)>/;
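A short sketch of applying the attachment pattern to a message body. `attachmentName` is a hypothetical helper and the sample file name is made up. Since the pattern requires a colon inside the angle brackets, a placeholder without one is not matched by it and would need separate handling.

```javascript
const regexAttachment = /<.+:(.+)>/;

// Hypothetical helper: returns the text after the last colon inside
// "< >" (trimmed), or null when the message has no such marker.
function attachmentName(message) {
  const match = regexAttachment.exec(message);
  return match ? match[1].trim() : null;
}

console.log(attachmentName("<attached: 00000042-PHOTO-2016-12-06.jpg>"));
console.log(attachmentName("<Media omitted>"));
```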