Text algorithms allow analysts to extract insights from raw text, which is useful when a dataset has information in the form of notes or descriptions from people like doctors or loan officers. With text mining, analysts can identify which words or phrases in raw text are associated with certain outcomes, thereby gaining greater insight into the factors that relate to their target variable.

What is tokenization in text mining?

Tokenization is the process of breaking down a sequence of strings into parts, such as keywords, phrases, symbols, or other elements called tokens. Tokenization plays an important role in text mining by giving data scientists insight into the analysis of words or phrases that are important to the project.

What is stemming in text mining?

Stemming is the process of analyzing a word’s derivation to look at its root stem. Since related words usually are made up of the same stem, these words can be looked at as synonyms and, therefore, grouped together in text mining.

What are stop words in text mining?

Stop words are a set of commonly used words in any language. In text mining, these words are usually eliminated so that the applications can focus on the most relevant words.

Topics

Alphabetic

Wiki topics

Text Mining

What is Text Mining?

Text algorithms allow analysts to extract useful insights from raw text, which is useful when a dataset has information in the form of notes or descriptions from doctor visits or loan applications.

When data scientists build traditional machine learning models, they use numeric and categorical data as features, such as the requested loan amount (in dollars) or the borrower’s employment type (in a word or two). Text mining algorithms give analysts the ability to leverage information about the purpose of the loan, greatly improving the accuracy of the model. With text mining, analysts can identify which words or phrases in raw text are associated with certain outcomes, thereby gaining greater insight into the factors that relate to their target variable, or object of analysis.

Some common text mining algorithms include:

Sentiment Analysis. Determines how a writer feels and reacts to a particular topic or event. Often used in marketing to evaluate consumers’ responses to new products.
Named Entity Recognition. Locates and classifies specific references to people, organizations, places, and dates.
For example, in the sentence “DataRobot acquired Nutonian, another Boston-based company, in 2017,” the algorithm will recognize DataRobot and Nutonian as organizations, Boston as a location, and 2017 as a date.
Topic Modeling. Discovers the hidden semantic structure of a collection of raw text documents. Used to measure a topic’s prevalence and describe which terms are the most representative in each document.
Summarization and Keyphrase Extraction. Distills a large document down to a set of sentences or terms that summarize it without sacrificing important information.

Why is Text Mining Important?

The vast majority of data is unstructured in the form of images, audio, or video. Text data is ubiquitous in every industry. Claims investigator reports, medical examination notes, social network comments, and software logs contain vital information for predicting particular future events but are rarely formally structured. Text mining allows analysts to make the most of this data, leading to more practical models with higher accuracy.

Text Mining + DataRobot

The majority of the DataRobot AI platform models support text data right out of the box. If a particular combination of words or characters in the text is highly related to the target variable, DataRobot automatically captures the pattern and displays it along with other insights. DataRobot is also multilingual, using automatic language identification for text data and supporting different text mining algorithms, depending on the language it detects.

The process of feature engineering free text data in the traditional way is notoriously complex and difficult, and data scientists often avoid doing it manually. DataRobot automatically finds, tunes, and interprets the best text mining algorithms for a dataset, saving both time and resources.

Learn More About AI