PyCon Israel 2022

🇺🇸 Detecting anomalous sequences using text processing methods
2022-06-28, 11:00–11:20, PyData

Hello wait you talk see to can’t my!
Sounds weird? Detecting abnormal sequences is a common problem.
Join my talk to see how this problem involves Bert, Word2vec & Autoencoders (in python), and how you can apply it to information security problems

Dealing with sequences can be challenging as each item has its unique position in the sequence and there’s a correlation between all items and their positions. One of the most common issues when working with sequences is dealing with anomalous sequences, that doesn’t fit with the regular sequences’ structure. Those sequences make no sense, create noise in the data and interrupt the learning process.

The most common sequences are text sentences, and possible scenario for abnormal text sequences could be when trying to translate sound to text, and sometimes there’re some irrelevant noise that can translate to nonsense sequences, and if we want to build a model based on that data, we need to find a way to identify and clean this irrelevant and anomalous data.

Detecting anomalous sequences could be also related to non-text sequences, such as sequence of action or events. Those scenarios could be related to information security problems. For example, in many organizations there’re logs of actions that has been made on internal systems and detecting suspicious sequences of actions on the system could be a crucial in detecting attacks or misusage of the systems.

My proposed solution includes two phases.
In the first step, we need to model the items in the sequence and understand its structure and correlations. We need to train a word embedding algorithm for generating the vectors embedding out of the sequences, such as Bert or Word2vec.
The next step, after creating the sequence embedding, is detecting the anomalies. The algorithm we used for the anomaly detection phase is Autoencoder, where you can train the model on normal data and detect the abnormal events.

This pipeline has some challenges. For example, each sequence has different length and there’s a need for training both the word embedding algorithm and the Autoencoder to know how to learn the right structure of all possible lengths.

Join my talk to see how I used python for building this architecture, and learn how you can process your sequences, using word embedding algorithms such as Bert and Word2vec and use their output for Autoencoders in order to create an anomaly detection model for detecting suspicious sequences.

Session language – English Target audience – Data Scientists