PyCon Israel 2024

Beyond KMeans - using LLMs to improve text clustering
09-16, 11:30–11:50 (Asia/Jerusalem), Hall 7
Language: English

Text clustering is a fundamental process in NLP, but what do you do when your clusters just aren’t right? I will share my journey where I ended up combining sklearn and langchain to reduce duplication and "Misc" clusters.


Text clustering can be used to organize text, analyze data, help extract topics or segment customers by interests. In addition, we are blessed with many cheap and high-quality text embedding APIs that should improve clustering. But what do you do when your clusters just aren’t right? Topics are duplicated, clustering is too sensitive to vocabulary, and there’s always that one giant “Misc” cluster.

As a newcomer to clustering, I gravitated towards KMeans since it seems to be the default clustering algorithm. I will share the pitfalls I encountered, the clustering algorithms I explored and how I incorporated LLMs into the process to achieve far better results.
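As a rough illustration of the baseline the talk starts from, here is a minimal sketch of embedding-then-KMeans clustering with sklearn. TF-IDF vectors stand in for an embedding API, and the sample texts and `n_clusters=3` are illustrative assumptions, not material from the talk:

```python
# Minimal baseline: vectorize texts, then cluster with KMeans.
# TF-IDF is a stand-in here; in practice an embedding API would
# produce the vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = [
    "refund my order", "I want my money back",
    "app crashes on startup", "the app keeps crashing",
    "how do I reset my password", "password reset link is broken",
]

# Turn texts into vectors (swap in an embedding service in practice).
vectors = TfidfVectorizer().fit_transform(texts)

# KMeans requires choosing k up front -- one of the pitfalls the talk
# covers, since a bad k yields duplicated topics or a "Misc" bucket.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
print(labels)
```

Because KMeans is sensitive to vocabulary overlap rather than meaning, paraphrases like "refund my order" and "I want my money back" can land in different clusters, which is the kind of failure the talk addresses.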


Expected experience level of participants

Intermediate

Target audience

Data Scientists

Noah has been contributing to Sefaria for the last 10 years, helping build its data science department. His contributions include citation recognition, search and topic modelling.

He has a master's in NLP from Cooper Union and has been focusing on the intersection of NLP with Hebrew.

Sefaria is a non-profit, open-source organization focused on uploading and enriching foundational texts of Judaism. We uniquely focus on giving third-party developers access to almost all of our underlying data through APIs and documentation.