PyCon Israel 2022

🇺🇸 Detecting anomalous sequences using text processing methods
06-28, 11:00–11:20 (Asia/Jerusalem), PyData

Hello wait you talk see to can’t my!
Sounds weird? Detecting abnormal sequences is a common problem.
Join my talk to see how this problem involves BERT, Word2vec & Autoencoders (in Python), and how you can apply it to information security problems.

Dealing with sequences can be challenging because each item has its own position in the sequence, and the items and their positions are correlated. One of the most common issues when working with sequences is handling anomalous sequences that don't fit the structure of regular sequences. Such sequences make no sense, add noise to the data, and disrupt the learning process.

The most common sequences are text sentences. One possible scenario for abnormal text sequences is speech-to-text transcription, where irrelevant noise is sometimes transcribed into nonsense sequences. If we want to build a model based on that data, we need a way to identify and clean out this irrelevant, anomalous data.

Anomalous sequence detection also applies to non-text sequences, such as sequences of actions or events. These scenarios can arise in information security problems. For example, many organizations keep logs of the actions performed on internal systems, and detecting suspicious sequences of actions can be crucial for detecting attacks on or misuse of those systems.
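To make the log scenario concrete, here is a minimal sketch of turning raw action records into per-user sequences. The record layout (user, timestamp, action) and the action names are hypothetical, chosen only for illustration; real audit logs will look different.

```python
from collections import defaultdict


def actions_to_sequences(events):
    """Group (user, timestamp, action) records into one time-ordered action sequence per user."""
    by_user = defaultdict(list)
    for user, timestamp, action in events:
        by_user[user].append((timestamp, action))
    # Sorting the (timestamp, action) tuples orders each user's actions by time.
    return {
        user: [action for _, action in sorted(records)]
        for user, records in by_user.items()
    }


# Hypothetical audit-log records: (user, seconds since start, action).
events = [
    ("alice", 30, "view_record"),
    ("bob", 5, "login"),
    ("alice", 10, "login"),
    ("bob", 7, "export_all"),
    ("alice", 55, "logout"),
    ("bob", 9, "delete_logs"),
]

sequences = actions_to_sequences(events)
print(sequences["alice"])
print(sequences["bob"])
```

Each resulting list can then be treated like a sentence whose "words" are actions.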

My proposed solution includes two phases.
In the first phase, we need to model the items in the sequence and understand their structure and correlations. To do so, we train a word embedding algorithm, such as BERT or Word2vec, to generate embedding vectors from the sequences.
The next phase, after creating the sequence embeddings, is detecting the anomalies. The algorithm we used for the anomaly detection phase is an Autoencoder, which you can train on normal data and then use to detect abnormal events.
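The two phases can be sketched end to end in a few lines of NumPy. This is a toy illustration under stated assumptions, not the implementation from the talk: an SVD of a co-occurrence matrix stands in for Word2vec or BERT, a linear autoencoder solved in closed form (equivalent to PCA) stands in for a trained neural Autoencoder, all sequences have the same length, and the action names, `dim`, `window`, and `bottleneck` values are made up.

```python
import numpy as np


def train_embeddings(sequences, dim=3, window=1):
    """Stand-in for Word2vec/BERT: SVD of a log-scaled co-occurrence matrix."""
    tokens = sorted({tok for seq in sequences for tok in seq})
    vocab = {tok: i for i, tok in enumerate(tokens)}
    counts = np.zeros((len(vocab), len(vocab)))
    for seq in sequences:
        ids = [vocab[tok] for tok in seq]
        for i, center in enumerate(ids):
            lo, hi = max(0, i - window), min(len(ids), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[center, ids[j]] += 1.0
    u, s, _ = np.linalg.svd(np.log1p(counts))
    return vocab, u[:, :dim] * s[:dim]


def embed_sequence(seq, vocab, vectors):
    """Concatenate token vectors in order so that position information is kept."""
    return np.concatenate([vectors[vocab[tok]] for tok in seq])


class LinearAutoencoder:
    """Linear autoencoder fitted in closed form (equivalent to PCA)."""

    def __init__(self, bottleneck):
        self.bottleneck = bottleneck

    def fit(self, x):
        self.mean_ = x.mean(axis=0)
        _, _, vt = np.linalg.svd(x - self.mean_, full_matrices=False)
        self.components_ = vt[: self.bottleneck]  # encoder rows; decoder is the transpose
        return self

    def reconstruction_error(self, x):
        centered = x - self.mean_
        decoded = centered @ self.components_.T @ self.components_
        return ((centered - decoded) ** 2).mean(axis=1)


# Hypothetical "normal" action sequences and one suspicious ordering.
normal = [
    ["login", "view", "edit", "logout"],
    ["login", "view", "view", "logout"],
    ["login", "view", "download", "logout"],
    ["login", "edit", "edit", "logout"],
]
suspicious = ["logout", "download", "login", "view"]

# Phase 1: learn token vectors and embed each sequence.
vocab, vectors = train_embeddings(normal)
x_train = np.stack([embed_sequence(seq, vocab, vectors) for seq in normal])

# Phase 2: fit on normal data only, then score by reconstruction error.
model = LinearAutoencoder(bottleneck=3).fit(x_train)
train_errors = model.reconstruction_error(x_train)
threshold = train_errors.max() + 1e-6
score = model.reconstruction_error(embed_sequence(suspicious, vocab, vectors)[None, :])[0]
print("flagged as anomalous:", score > threshold)
```

On this toy data the bottleneck is wide enough to reconstruct every training pattern, so any sequence that breaks the learned structure stands out. With real data the threshold would more likely be chosen from the distribution of errors on held-out normal sequences.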

This pipeline has some challenges. For example, sequences have different lengths, so both the word embedding algorithm and the Autoencoder must be trained to learn the right structure for all possible lengths.
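One common way to cope with differing lengths, shown here only as a possible approach and not necessarily the one used in the talk, is to pad or truncate every sequence to a fixed length and keep a mask of the real positions. The helper names, the padding id, and the tiny embedding table below are hypothetical.

```python
import numpy as np

PAD_ID = 0  # hypothetical id reserved for padding


def pad_sequences(id_sequences, max_len):
    """Right-pad (or truncate) id sequences to max_len; return the ids and a 0/1 mask."""
    ids = np.full((len(id_sequences), max_len), PAD_ID, dtype=np.int64)
    mask = np.zeros((len(id_sequences), max_len), dtype=np.float64)
    for row, seq in enumerate(id_sequences):
        kept = seq[:max_len]
        ids[row, : len(kept)] = kept
        mask[row, : len(kept)] = 1.0
    return ids, mask


def masked_mean_pool(ids, mask, vectors):
    """Average token vectors over real positions only: one fixed-size vector per sequence."""
    token_vectors = vectors[ids]  # shape: (batch, max_len, dim)
    summed = (token_vectors * mask[:, :, None]).sum(axis=1)
    lengths = np.maximum(mask.sum(axis=1, keepdims=True), 1.0)
    return summed / lengths


# Hypothetical embedding table: row 0 is the padding vector, rows 1 to 3 are tokens.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
ids, mask = pad_sequences([[1, 2], [3, 3, 1, 2, 2]], max_len=4)
pooled = masked_mean_pool(ids, mask, vectors)
print(ids)
print(pooled)
```

Mean pooling yields a fixed-size input for the Autoencoder but discards item order, while the padded ids plus mask keep positions and suit a model that reads the sequence step by step. Which trade-off is right depends on how much the anomaly signal lives in the ordering.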

Join my talk to see how I used Python to build this architecture, and learn how you can process your sequences with word embedding algorithms such as BERT and Word2vec and feed their output to Autoencoders to create an anomaly detection model for detecting suspicious sequences.

Session language


Target audience

Data Scientists

I have been a Data Scientist at PayPal for almost 5 years. I work on the enterprise threat management team in the information security organization. My focus is finding machine learning solutions for information security problems in general, and for insider fraud threats in particular.

I hold an MSc in software and information systems engineering. In my thesis I researched user verification on mobile devices using sequences of touch gestures. The full paper from this thesis was published at the PAKDD 2018 conference, and I also published an extended abstract of this research at the UMAP 2017 conference.