sourcegraph/dev/clustering
Chris Warwick 333edd3345
Dev tool: python script for text clustering based on local embeddings (#58691)
python script for text clustering based on local embeddings
2023-12-04 09:27:14 -05:00
..
cluster.py Dev tool: python script for text clustering based on local embeddings (#58691) 2023-12-04 09:27:14 -05:00
README.md Dev tool: python script for text clustering based on local embeddings (#58691) 2023-12-04 09:27:14 -05:00
requirements.txt Dev tool: python script for text clustering based on local embeddings (#58691) 2023-12-04 09:27:14 -05:00

Text Clustering

This directory contains Python code to cluster text data using sentence embeddings and KMeans clustering.

Overview

The cluster.py script takes in a TSV file with a text field, generates sentence embeddings using the SentenceTransformers library, clusters the embeddings with KMeans, and outputs a TSV file with cluster assignments.

The goal is to group similar text snippets together into a predefined number of clusters.

Usage

Ensure the required packges are installed:

pip install -r requirements.txt

The script accepts the following arguments:

Argument Description Default
--input Path to input TSV file Required
--text_field Name of text field in the tsv file to operate on "text"
--clusters Number of clusters to generate 4
--output Path for output TSV file with clusters Optional
--model Sentence transformer model to use Optional
--silent Whether to hide plots False

Example

python cluster.py --input data.tsv --text_field chat_message --clusters 5 --output out.tsv

Output

The output TSV file contains the original data plus a new "cluster" column with the assigned cluster IDs per row.

Code Overview

Libraries Used

  • pandas - for loading and manipulating data
  • SentenceTransformers - generating embeddings
  • sklearn - KMeans clustering
  • matplotlib - visualization