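The task is to create a Python function whose length, counting code comments and the docstring, exceeds 1,000 words; a moderately complex function suits this well. The example below implements a simple text-analysis tool that runs several analyses on the input text, such as word-frequency analysis and key-phrase extraction.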
import re
from collections import Counter
from typing import Any, Dict, List


def text_analysis(text: str, stopwords: List[str]) -> Dict[str, Any]:
    """
    Analyze the given text, providing insights such as word frequency and key phrase extraction.

    This function performs several operations on the input text to provide a comprehensive
    analysis, including counting word frequencies, removing stopwords, extracting key phrases,
    and determining the length of the text. It is designed to help understand the composition
    of the text in terms of its most commonly used words and phrases, which can be especially
    useful in fields like natural language processing, data analytics, and content optimization.

    Parameters:
        text (str): The input text to be analyzed. It could be anything from a short paragraph
            to a lengthy article. The function handles regular textual content
            but may require preprocessing for very unstructured data.
        stopwords (List[str]): A list of stopwords to be excluded from the analysis. Stopwords
            are common words that do not contribute much meaning to a sentence
            and are often excluded from keyword extraction and frequency analysis.

    Returns:
        Dict[str, Any]: A dictionary containing the results of the analysis, including:
            - 'word_count'     : Total number of words in the text (stopwords included)
            - 'unique_words'   : Number of unique words in the text (stopwords included)
            - 'word_frequency' : A Counter mapping each non-stopword to its frequency
            - 'key_phrases'    : A list of extracted key phrases, which are runs of
                                 consecutive words containing no stopwords.
            - 'text_length'    : Total number of characters in the original text

    Raises:
        ValueError: If the input text is empty or only contains whitespace.

    Example:
    --------
        text = "Python is a great programming language. It is used in data science, AI and more."
        stopwords = ["is", "a", "in", "and"]
        analysis_results = text_analysis(text, stopwords)
        print(analysis_results)
    """
    # Ensure the text is not empty or only whitespace
    if not text.strip():
        raise ValueError("Input text cannot be empty or only whitespace.")

    # 1. Preprocess the text: convert to lowercase and remove punctuation (for simplicity)
    processed_text = re.sub(r'[^\w\s]', '', text.lower())

    # 2. Tokenize the text into individual words
    words = processed_text.split()

    # 3. Filter out stopwords. The text was lowercased above, so the stopwords are
    #    lowercased too; a set makes each membership test O(1).
    stopword_set = {stopword.lower() for stopword in stopwords}
    filtered_words = [word for word in words if word not in stopword_set]

    # 4. Word frequency analysis
    word_freq = Counter(filtered_words)

    # 5. Extract key phrases: this example simply takes runs of consecutive non-stopwords.
    #    Punctuation is already gone at this point, so a phrase can span a sentence boundary.
    #    For a more advanced approach, consider NLP libraries like spaCy or NLTK.
    key_phrases = []
    current_phrase = []
    for word in words:
        if word not in stopword_set:
            # Not a stopword: extend the current phrase
            current_phrase.append(word)
        elif current_phrase:
            # A stopword ends the current phrase; save it and start a new one
            key_phrases.append(' '.join(current_phrase))
            current_phrase = []

    # Add the final phrase if there is one
    if current_phrase:
        key_phrases.append(' '.join(current_phrase))

    # Compile the results into a dictionary
    analysis_results = {
        'word_count': len(words),
        'unique_words': len(set(words)),
        'word_frequency': word_freq,
        'key_phrases': key_phrases,
        'text_length': len(text)
    }
    return analysis_results


# Example usage
if __name__ == "__main__":
    sample_text = "Python is a great programming language. It is used in data science, AI and more."
    sample_stopwords = ["is", "a", "in", "and"]
    print(text_analysis(sample_text, sample_stopwords))
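
To make the sample run concrete, the sketch below retraces the preprocessing and phrase-grouping steps on the same sample text. It is an equivalent reformulation using `itertools.groupby`, not the function itself, and the names `tokens` and `phrases` are illustrative only.

```python
import re
from itertools import groupby

sample_text = "Python is a great programming language. It is used in data science, AI and more."
stopwords = {"is", "a", "in", "and"}

# Same preprocessing as text_analysis: lowercase, strip punctuation, split on whitespace.
tokens = re.sub(r"[^\w\s]", "", sample_text.lower()).split()

# groupby clusters consecutive tokens by whether they are stopwords;
# each run of non-stopwords becomes one phrase.
phrases = [
    " ".join(run)
    for is_stopword, run in groupby(tokens, key=lambda token: token in stopwords)
    if not is_stopword
]

print(len(tokens))  # 15
print(phrases)
# ['python', 'great programming language it', 'used', 'data science ai', 'more']
```

The token count of 15 corresponds to the `word_count` field, and the phrase list to `key_phrases`.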
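This Python code defines a function named text_analysis that performs simple analysis of an input text. It covers text preprocessing, word-frequency analysis, and key-phrase extraction, and its comments and docstring explain the implementation and purpose of each part. The function is suitable for basic text analysis and can serve as a starting point for natural language processing and data analysis work.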
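
Because text_analysis strips punctuation before grouping words, a key phrase can run across a sentence boundary; the sample text yields "great programming language it". One possible refinement, sketched here under the assumption that punctuation should end a phrase, splits the text on punctuation first. The helper name `sentence_aware_phrases` is hypothetical and not part of the original code.

```python
import re
from typing import Iterable, List


def sentence_aware_phrases(text: str, stopwords: Iterable[str]) -> List[str]:
    """Group consecutive non-stopwords into phrases without crossing punctuation."""
    stopword_set = {word.lower() for word in stopwords}
    phrases: List[str] = []
    # Assumption: any of . ! ? , ; : ends a phrase.
    for chunk in re.split(r"[.!?,;:]+", text.lower()):
        current: List[str] = []
        for word in re.findall(r"\w+", chunk):
            if word in stopword_set:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                    current = []
            else:
                current.append(word)
        # The end of a chunk also closes the phrase.
        if current:
            phrases.append(" ".join(current))
    return phrases


sample = "Python is a great programming language. It is used in data science, AI and more."
print(sentence_aware_phrases(sample, ["is", "a", "in", "and"]))
# ['python', 'great programming language', 'it', 'used', 'data science', 'ai', 'more']
```

Splitting on commas as well as sentence-ending marks is a design choice: it produces shorter phrases ("data science" and "ai" rather than "data science ai"), which may or may not be wanted for a given corpus.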