PyCon Israel 2024

Beyond KMeans - using LLMs to improve text clustering
09-16, 11:30–11:50 (Asia/Jerusalem), Hall 7
Language: English

Text clustering is a fundamental process in NLP, but what do you do when your clusters just aren’t right? I will share my journey where I ended up combining sklearn and langchain to reduce duplication and "Misc" clusters.


Text clustering can be used to organize text, analyze data, help extract topics or segment customers by interests. In addition, we are blessed with many cheap and high-quality text embedding APIs that should improve clustering. But what do you do when your clusters just aren’t right? Topics are duplicated, clustering is too sensitive to vocabulary, and there’s always that one giant “Misc” cluster.

As a newcomer to clustering, I gravitated towards KMeans since it seems to be the default clustering algorithm. I will share the pitfalls I encountered, the clustering algorithms I explored and how I incorporated LLMs into the process to achieve far better results.
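As a rough illustration of the baseline the talk starts from, here is a minimal sketch of embedding-then-KMeans clustering with sklearn. TF-IDF vectors stand in for an embedding API, and the sample texts and `n_clusters=3` are illustrative assumptions, not material from the talk:

```python
# Minimal baseline: vectorize texts, then cluster with KMeans.
# TF-IDF is a stand-in here; in practice an embedding API would
# produce the vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = [
    "refund my order", "I want my money back",
    "app crashes on startup", "the app keeps crashing",
    "how do I reset my password", "password reset link is broken",
]

# Turn texts into vectors (swap in an embedding service in practice).
vectors = TfidfVectorizer().fit_transform(texts)

# KMeans requires choosing k up front -- one of the pitfalls the talk
# covers, since a bad k yields duplicated topics or a "Misc" bucket.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
print(labels)
```

Because KMeans is sensitive to vocabulary overlap rather than meaning, paraphrases like "refund my order" and "I want my money back" can land in different clusters, which is the kind of failure the talk addresses.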


Expected experience level of participants

Intermediate

Target audience

Data Scientists

Noah has been contributing to Sefaria for the last 10 years, helping build its data science department. His contributions include citation recognition, search and topic modelling.

He has a master's in NLP from Cooper Union and has been focusing on the intersection of NLP with Hebrew.

Sefaria is a non-profit, open-source organization focused on uploading and enriching foundational texts of Judaism. We uniquely focus on giving third-party developers access to almost all of our underlying data through APIs and documentation.