# Text Processing Tools
Natural language processing tools for text analysis, tokenization, and feature extraction.
## Available Tools

- **Tokenization**: word, sentence, and subword tokenization
- **Text Cleaning**: normalization, stopword removal, stemming
- **Vectorization**: TF-IDF, word embeddings
- **Text Analysis**: sentiment, entities, keywords
- **Similarity**: document comparison, search
## Tokenization

```python
import cyxwiz.text as text

# Word tokenization
tokens = text.word_tokenize("Hello, world! How are you?")
# ['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']

# Sentence tokenization
sentences = text.sent_tokenize(paragraph)

# Subword tokenization (BPE)
tokenizer = text.BPETokenizer(vocab_size=10000)
tokenizer.fit(corpus)
tokens = tokenizer.encode("Hello world")
decoded = tokenizer.decode(tokens)
```
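`BPETokenizer` is used as a black box above. The merge-learning idea behind BPE can be sketched in a few lines of plain Python; `train_bpe` below is an illustrative helper, not part of cyxwiz:

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (toy sketch)."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair merged into one symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = train_bpe(["low", "lower", "lowest"], num_merges=2)
# [('l', 'o'), ('lo', 'w')]
```

Each merge turns the most frequent adjacent pair into a new vocabulary symbol, which is why frequent substrings like `lo` and `low` surface first.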
## Text Cleaning

```python
# Lowercase
cleaned = text.lower(text_data)

# Remove punctuation
cleaned = text.remove_punctuation(text_data)

# Remove stopwords
cleaned = text.remove_stopwords(text_data, language='english')

# Stemming
stemmer = text.PorterStemmer()
stemmed = stemmer.stem("running")  # 'run'

# Lemmatization
lemmatizer = text.WordNetLemmatizer()
lemma = lemmatizer.lemmatize("better", pos='a')  # 'good'

# Chain cleaning steps into a single pipeline
pipeline = text.Pipeline([
    text.lower,
    text.remove_punctuation,
    text.remove_stopwords,
    text.stem,
])
cleaned = pipeline(text_data)
```
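The pipeline pattern is plain function composition, each step's output feeding the next step's input. A stdlib-only sketch (`make_pipeline` and the tiny stopword list are illustrative, not cyxwiz APIs):

```python
import string

def make_pipeline(*steps):
    """Compose cleaning steps, applied left to right."""
    def run(s):
        for step in steps:
            s = step(s)
        return s
    return run

STOPWORDS = {"the", "is", "a", "an"}  # tiny illustrative list

clean = make_pipeline(
    str.lower,
    lambda s: s.translate(str.maketrans("", "", string.punctuation)),
    lambda s: " ".join(w for w in s.split() if w not in STOPWORDS),
)

print(clean("The Cat is on a MAT!"))  # cat on mat
```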
## Vectorization

### TF-IDF

```python
from cyxwiz.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2)
)
tfidf_matrix = vectorizer.fit_transform(documents)

# Get feature names
features = vectorizer.get_feature_names()
```
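As a rough illustration of what a TF-IDF vectorizer computes, here is a stdlib-only sketch over pre-tokenized documents. The `tfidf` helper and its `+1` smoothing are illustrative; real vectorizers differ in smoothing and normalization details:

```python
import math
from collections import Counter

def tfidf(docs):
    """Dense TF-IDF matrix for a list of tokenized documents."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # IDF: rarer terms get larger weights (+1 keeps weights positive).
    idf = {w: math.log(n / sum(w in d for d in docs)) + 1 for w in vocab}
    rows = []
    for d in docs:
        counts = Counter(d)
        # TF: term count normalized by document length.
        rows.append([counts[w] / len(d) * idf[w] for w in vocab])
    return vocab, rows

docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
vocab, matrix = tfidf(docs)
```

Here `vocab` plays the role of `get_feature_names()`: it fixes the column order of the matrix.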
### Word Embeddings

```python
from cyxwiz.text import Word2Vec

model = Word2Vec(
    sentences,
    vector_size=100,
    window=5,
    min_count=1
)

# Get word vector
vector = model.wv['machine']

# Find similar words
similar = model.wv.most_similar('king')
```
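Under the hood, `most_similar` amounts to ranking vocabulary words by cosine similarity against an embedding table. A toy sketch with made-up 3-dimensional vectors (the `most_similar` helper here is illustrative, not the library's implementation):

```python
import math

def most_similar(word, vectors, topn=3):
    """Rank words by cosine similarity to `word` in a toy vector table."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    query = vectors[word]
    scores = [(w, cos(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:topn]

vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}
print(most_similar("king", vectors, topn=1)[0][0])  # queen
```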
## Text Analysis

```python
# Sentiment analysis
sentiment = text.sentiment(text_data)
# {'positive': 0.8, 'negative': 0.1, 'neutral': 0.1}

# Keyword extraction
keywords = text.extract_keywords(text_data, top_n=10)

# Named entity recognition
entities = text.extract_entities(text_data)
# [('Apple', 'ORG'), ('Tim Cook', 'PERSON'), ('California', 'GPE')]

# N-grams
bigrams = text.ngrams(tokens, n=2)
trigrams = text.ngrams(tokens, n=3)
```
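N-gram extraction is just a sliding window over the token list. A minimal stdlib sketch (this `ngrams` helper is illustrative, not the cyxwiz function):

```python
def ngrams(tokens, n):
    """Return every window of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["the", "cat", "sat"], n=2)
# [('the', 'cat'), ('cat', 'sat')]
```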
## Text Similarity

```python
# Cosine similarity
similarity = text.cosine_similarity(vec1, vec2)

# Document similarity matrix
sim_matrix = text.pairwise_similarity(documents)

# Fuzzy string matching
ratio = text.fuzz_ratio("hello world", "hello word")
# 91

# Semantic similarity (using embeddings)
sim = text.semantic_similarity(
    "machine learning is great",
    "deep learning is awesome"
)
```
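A fuzzy ratio like the one above can be approximated with the standard library's `difflib`. Different libraries use different edit-distance formulas, so the score will not match cyxwiz's exactly (this `fuzz_ratio` helper is illustrative):

```python
from difflib import SequenceMatcher

def fuzz_ratio(a, b):
    """Similarity score on a 0-100 scale, difflib-style."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)

print(fuzz_ratio("hello world", "hello word"))  # 95
```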
## Node Editor Integration

| Node | Inputs | Outputs |
|---|---|---|
| Tokenize | Text | Token list |
| TF-IDF | Documents | Sparse matrix |
| Embed | Tokens | Embedding tensor |
| Sentiment | Text | Score dict |