TF-IDF Vectorizer Simulator

Explore TF-IDF, the classic recipe for turning text into numeric vectors a machine-learning model can use. Change the corpus size, term count and document length to see the term frequency TF, inverse document frequency IDF and TF-IDF weight update in real time, and discover which words characterise a document.

Parameters

Corpus size N (docs): Total number of documents used for training

Term count in this document (times): How often the target word appears in the document

Document length (words): Total word count of the target document

Documents containing the word df (docs): How many corpus documents contain the word at least once

IDF formula: Switch the definition of inverse document frequency

Results

Term frequency TF

Inverse doc. frequency IDF

TF-IDF weight

Document frequency DF (%)

Discriminative power

Stop-word verdict

Corpus and term weight — vectorization view

Each icon is a document in the corpus. Documents that contain the target word are shaded and pulse gently. The bar below shows the size of the TF-IDF weight, split into the TF and IDF contributions.

TF-IDF vs document frequency df

TF-IDF vs term frequency (count)

Theory & Key Formulas

$$\text{TF-IDF}=\underbrace{\frac{f_{t,d}}{n_d}}_{\text{TF}}\times\underbrace{\ln\frac{N}{df_t}}_{\text{IDF}}$$

Weight of word t in document d. $f_{t,d}$: count of the word in the document, $n_d$: total words in the document, $N$: corpus size, $df_t$: number of documents containing the word. A word frequent in one document but rare in the corpus scores highest.
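
As a rough illustration of the formula above, here is a minimal Python sketch. The function name `tf_idf` and the example numbers are made up for this illustration and are not taken from the simulator itself.

```python
import math

def tf_idf(term_count, doc_length, corpus_size, doc_freq):
    """Standard TF-IDF: (f_td / n_d) * ln(N / df_t)."""
    if doc_length <= 0 or doc_freq <= 0:
        raise ValueError("doc_length and doc_freq must be positive")
    tf = term_count / doc_length            # share of the document taken by the word
    idf = math.log(corpus_size / doc_freq)  # natural log, as in the formula
    return tf * idf

# Illustrative numbers: a word used 5 times in a 100-word document,
# found in 10 of 1000 corpus documents.
weight = tf_idf(term_count=5, doc_length=100, corpus_size=1000, doc_freq=10)
print(round(weight, 4))
```

Here TF is 0.05 and IDF is ln(100), so the weight is their product. Setting `doc_freq` equal to `corpus_size` drives the result to zero, which mirrors what the sliders show.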

$$\text{idf}_{\text{smooth}}=\ln\frac{1+N}{1+df_t}+1$$

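Smooth IDF. The +1 inside the logarithm avoids division by zero when df is 0, and the trailing +1 keeps the weight from collapsing fully to zero for words present in every document. This is the IDF that scikit-learn's TfidfVectorizer uses by default (smooth_idf=True). Its overall weight still differs from the formula above, because scikit-learn's default TF is the raw count and each document vector is then L2-normalised.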
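
To make the two edge cases concrete, the sketch below compares the standard and smooth forms at df = N and df = 0. The function names are chosen for this example only.

```python
import math

def idf_standard(corpus_size, doc_freq):
    # ln(N / df); raises ZeroDivisionError when df is 0
    return math.log(corpus_size / doc_freq)

def idf_smooth(corpus_size, doc_freq):
    # ln((1 + N) / (1 + df)) + 1; defined for every df from 0 to N
    return math.log((1 + corpus_size) / (1 + doc_freq)) + 1

N = 1000
print(idf_standard(N, N))  # word present in every document
print(idf_smooth(N, N))    # same word under the smooth form
print(idf_smooth(N, 0))    # word absent from the corpus: still finite
```

With df = N the standard form gives exactly 0 while the smooth form gives exactly 1, so a word found everywhere keeps a small but non-zero weight.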

What is TF-IDF?

🙋

I keep seeing "TF-IDF" in machine-learning books. What does it actually do?

🎓

In short, it is a tool for turning text into a row of numbers. A machine-learning model cannot eat raw characters, so you have to convert a document into a numeric vector: "word A appears this many times, word B that many times...". The simplest approach is just to list raw counts, but then ubiquitous words like "the" or "is" get the biggest numbers. TF-IDF corrects that cleverly so that the words that genuinely characterise the document get the large weights instead.

🙋

What do the TF and IDF in the name stand for?

🎓

TF is Term Frequency — how many times the word appears in the document you are looking at. Longer documents inflate the count, so we usually normalise by the total word count of the document. IDF is Inverse Document Frequency — it looks at how many documents in the whole corpus contain the word, and ln(N/df) makes the value large when the word is rare. TF-IDF is those two multiplied. On the left, raise the "term count" and TF grows; lower the "documents containing the word" and IDF grows.

🙋

A multiplication, got it. So what does a word with a high TF-IDF look like?

🎓

A word that appears a lot in this document but rarely in the others. Picture a corpus of news articles. In a sports article the term "home run" appears repeatedly, but it almost never shows up in politics or business pieces. So TF is high and df is low, meaning IDF is high too. The product makes TF-IDF large, and that becomes the feature that tells the machine "this article is sports". By contrast, a word like "news" that appears in every article has df = N, so its standard IDF is ln(N/N) = 0 and its TF-IDF vanishes however high its TF is.
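
The contrast can be put into numbers. The corpus size, article length and counts below are invented purely for illustration; only the formula comes from the text.

```python
import math

N = 1000           # hypothetical corpus of news articles
doc_length = 400   # hypothetical length of the sports article

# term: (count in this article, number of articles containing it)
terms = {
    "home run": (8, 20),
    "news": (12, 1000),
}

weights = {}
for term, (count, df) in terms.items():
    tf = count / doc_length
    idf = math.log(N / df)   # standard IDF
    weights[term] = tf * idf
    print(f"{term!r}: tf={tf:.3f} idf={idf:.3f} tf-idf={weights[term]:.4f}")
```

Even though "news" occurs more often in the article than "home run", its weight ends up lower because its IDF is zero.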

🙋

Wait — does that mean I no longer need a stop-word list to remove words like "the"?

🎓

Largely, yes, and that is the satisfying part of TF-IDF. Common words appear in nearly every document, so their df approaches N. Then ln(N/df) falls toward zero and the TF-IDF falls with it. Without manually listing stop words, the statistics decide "ubiquitous word = low information" and down-weight it for you. One caveat: with the smooth formula the IDF bottoms out at 1 rather than 0, so common words are down-weighted relative to rare ones but never removed outright. Try pushing the "documents containing the word" slider up to the full corpus size on the left: the discriminative power switches to "low (common word)" and the stop-word verdict appears.
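
A quick sweep over df shows the same behaviour outside the simulator. This is only a sketch with the standard ln(N/df) formula; the fixed TF of 0.05 and the chosen df values are arbitrary.

```python
import math

N = 1000
tf = 0.05  # fixed term frequency, chosen only for illustration

rows = []
for df in (1, 10, 100, 500, 900, 1000):
    idf = math.log(N / df)
    rows.append((df, idf, tf * idf))
    print(f"df={df:>4}  idf={idf:.3f}  tf-idf={tf * idf:.4f}")
```

The IDF column shrinks steadily as df grows and reaches zero at df = N, taking the TF-IDF weight with it.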

🙋

If it is that handy, why do we still need newer methods like BERT?

🎓

Good question. TF-IDF only looks at word counts, so it cannot represent word order or meaning. "The dog bit the man" and "The man bit the dog" produce the same TF-IDF vector, and it has no idea that "car" and "automobile" mean similar things. Embeddings such as Word2Vec learn that kind of similarity from context, and contextual models such as BERT also take word order into account. But TF-IDF is cheap to compute, easy to interpret, and stable with little training data. So in practice the standard order is still "build a TF-IDF baseline for classification first, and move to embeddings only if it is not enough".
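
The word-order point is easy to verify. TF-IDF weights are derived from per-document word counts, so two sentences with identical counts must get identical vectors. The tokeniser below is deliberately naive and is an assumption of this sketch.

```python
from collections import Counter

def bag_of_words(text):
    # Lower-case and split on whitespace; a deliberately naive tokeniser.
    return Counter(text.lower().split())

a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")
print(a == b)  # True: the counts are identical, so the order is lost
```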

Frequently Asked Questions

TF-IDF (Term Frequency – Inverse Document Frequency) is the classic way to turn text into numeric feature vectors that a machine-learning model can use. It gives each word a weight equal to its term frequency in the document multiplied by its rarity across the corpus. A word with a high TF-IDF is frequent in this document yet uncommon elsewhere, so it is exactly the kind of word that characterises the document. TF-IDF powered search-engine ranking and document classification for decades and is still an interpretable, strong baseline.

TF (term frequency) is the count of the target word in this document divided by the total number of words, so tf = term count / document length. IDF (inverse document frequency) in the standard form is idf = ln(N / df), where N is the number of documents in the corpus and df is the number of documents containing the word. The smooth form idf = ln((1+N)/(1+df)) + 1 avoids division by zero when df is 0 and keeps the weight non-zero for words that appear in every document. TF-IDF = tf × idf.

Words such as "the", "is" and "and" appear in almost every document, so their document frequency df is high. Since IDF = ln(N/df) tends to zero as df approaches N, the TF-IDF of these words automatically collapses toward zero. In other words, TF-IDF down-weights ubiquitous words on its own, without a hand-built stop-word list. This automatic suppression of uninformative words is one of the biggest advantages of TF-IDF.

TF-IDF produces sparse vectors based only on word-occurrence statistics; it captures neither word order nor semantic similarity. Word embeddings such as Word2Vec and BERT produce dense vectors learned from context and capture meaning, so "car" and "automobile" end up close together. However, TF-IDF is cheap to compute, easy to interpret and stable even with little training data, so it remains a widely used, strong baseline for classification tasks.

Real-World Applications

Search engines and document ranking: Early web search and full-text search engines put the TF-IDF of query terms at the core of their scoring. The more often a query word appears in a document (high TF) and the rarer it is across the corpus (high IDF), the higher the document is ranked. BM25 extends TF-IDF with TF saturation and document-length normalisation, and is the default scoring function in Elasticsearch and Solr.
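
To show what TF saturation and length normalisation mean, here is a sketch of one term's score in a textbook Okapi BM25 form. The IDF variant and the parameter values k1 = 1.2 and b = 0.75 are commonly cited choices and are assumptions here; real search engines differ in the details.

```python
import math

def bm25_term_score(count, doc_length, avg_doc_length, corpus_size, doc_freq,
                    k1=1.2, b=0.75):
    """One term's contribution in a textbook Okapi BM25 form (illustrative)."""
    idf = math.log(1 + (corpus_size - doc_freq + 0.5) / (doc_freq + 0.5))
    # Documents longer than average get a larger denominator, hence a lower score.
    length_norm = 1 - b + b * doc_length / avg_doc_length
    return idf * count * (k1 + 1) / (count + k1 * length_norm)

# TF saturation: each additional occurrence adds less than the one before.
counts = (1, 2, 4, 8, 16)
scores = [bm25_term_score(c, 100, 100, 1000, 10) for c in counts]
for c, s in zip(counts, scores):
    print(c, round(s, 3))
```

Unlike plain TF-IDF, where doubling the count doubles the weight, the score here levels off as the count grows.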

Document classification and spam filtering: For news categorisation, e-mail spam detection or review sentiment analysis, the standard pipeline turns documents into TF-IDF vectors and feeds them to logistic regression, a linear SVM or naive Bayes. It trains fast with modest compute, and you can read which words drove which class straight from the coefficients — valuable in business systems that must explain their decisions.

Keyword extraction and summarisation: Pick the few words with the highest TF-IDF inside a single document and you get the keywords that represent it. This is used for automatic article tagging, related-document recommendation and word-cloud generation. Comparing top TF-IDF words across documents also quantifies how their topics differ.
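
A minimal sketch of this idea, using a tiny invented corpus and the standard formula. The helper name `top_keywords` is hypothetical, and ties are broken alphabetically so the result is deterministic.

```python
import math
from collections import Counter

# Tiny made-up corpus, already lower-cased and free of punctuation.
corpus = [
    "the pitcher threw and the batter hit a home run",
    "the senate passed the budget bill",
    "the market fell as the bank raised rates",
]
docs = [text.split() for text in corpus]
N = len(docs)

# Document frequency: number of documents containing each word.
df = Counter(word for doc in docs for word in set(doc))

def top_keywords(doc, k=3):
    counts = Counter(doc)
    scores = {w: (c / len(doc)) * math.log(N / df[w]) for w, c in counts.items()}
    # Highest score first; alphabetical order among equal scores.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

print(top_keywords(docs[1]))
```

The word "the" never makes the list because it occurs in all three documents and so scores zero. With a corpus this small many words tie, which is why real keyword extraction needs far more documents.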

Clustering and similarity: Once documents are TF-IDF vectors, cosine similarity measures how close two documents are. With it you can group documents into clusters via k-means or find similar articles. It is also widely used as a preprocessing step for de-duplicating search results and for the high-level visualisation of large document collections (building topic maps).
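
Cosine similarity itself takes only a few lines. In this sketch the vectors are stored as sparse `{term: weight}` dictionaries, and the weights are hypothetical values rather than the output of a real corpus.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention used here for an all-zero vector
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF vectors for three short documents.
sports_a = {"pitcher": 0.30, "inning": 0.20, "home": 0.10}
sports_b = {"pitcher": 0.25, "inning": 0.25}
politics = {"senate": 0.40, "budget": 0.30}

print(cosine_similarity(sports_a, sports_b))
print(cosine_similarity(sports_a, politics))
```

The two sports vectors share terms and score close to 1, while the sports and politics vectors share none and score 0.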

Common Misconceptions and Pitfalls

The most common mistake is computing IDF separately on the training and the evaluation data. IDF is a corpus-wide statistic, so you must store the df and N obtained at training time and apply the same IDF to test documents and production data. Re-computing IDF on the evaluation data alone shifts the feature scale away from training and hurts accuracy. With scikit-learn, call fit (or fit_transform) on the training data and only transform on everything else. This is the same data-leakage rule that applies to every preprocessing step.
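
The fit-once, transform-many pattern can be sketched without any library. The class `SimpleTfidf` below is a hypothetical stand-in written for this example, not scikit-learn's implementation; it uses the standard formula and simply skips words it did not see during fitting.

```python
import math
from collections import Counter

class SimpleTfidf:
    """Hypothetical minimal vectoriser: IDF is learned once, then reused."""

    def fit(self, docs):
        # Store N and df from the training corpus only.
        self.n_docs = len(docs)
        self.df = Counter(word for doc in docs for word in set(doc))
        return self

    def transform(self, docs):
        # Reuse the stored statistics; words unseen during fit are skipped.
        vectors = []
        for doc in docs:
            counts = Counter(doc)
            vectors.append({
                w: (c / len(doc)) * math.log(self.n_docs / self.df[w])
                for w, c in counts.items() if w in self.df
            })
        return vectors

train = [["cheap", "pills", "now"], ["meeting", "at", "noon"], ["cheap", "flights"]]
test = [["cheap", "meeting", "tomorrow"]]

vectoriser = SimpleTfidf().fit(train)
print(vectoriser.transform(test))
```

Calling `transform` on the test document leaves the stored N and df untouched, so training and test features stay on the same scale.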

Next, assuming that TF-IDF understands meaning. TF-IDF is purely a statistic of word counts; it knows neither the meaning of words nor their order. "The product is good" and "The product is not good" mean opposite things because of one negation word, yet their TF-IDF vectors are almost identical. Synonyms such as "car" and "automobile" become separate dimensions with no relation between them. For tasks where semantic closeness or context matters, keep TF-IDF as a baseline and bring in word embeddings or contextual models alongside it.

Finally, underrating preprocessing. The quality of TF-IDF depends heavily on the tokenisation and normalisation that precede it. Languages that are not space-delimited (Japanese, Chinese) need a tokeniser, and the choice of tokeniser and dictionary changes the result. Case folding, lemmatising inflected forms, and trimming extremely rare or extremely common words with document-frequency thresholds (min_df / max_df in scikit-learn's TfidfVectorizer) all directly affect the number of dimensions and the accuracy. The TF-IDF formula itself is simple, but careful design of the text processing that feeds it is what pays off most in practice.
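
The effect of document-frequency trimming can be imitated in a few lines. The function `prune_vocabulary` and its thresholds are assumptions made for this sketch: it keeps words found in at least `min_df` documents and in at most a `max_df_ratio` share of them, which only approximates the idea behind min_df / max_df.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2, max_df_ratio=0.8):
    """Keep words whose document frequency lies in [min_df, max_df_ratio * N]."""
    n_docs = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return sorted(w for w, d in df.items()
                  if d >= min_df and d <= max_df_ratio * n_docs)

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
    "the bird sang".split(),
]
print(prune_vocabulary(docs))
```

Here "the" is dropped for being in every document, and words seen in only one document are dropped as too rare, which shrinks the vocabulary from ten words to four.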
