
25 Python Text Processing Examples Worth Bookmarking in 2022!

Contents
  • 1. Extract PDF content
  • 2. Extract Word content
  • 3. Extract web page content
  • 4. Read JSON data
  • 5. Read CSV data
  • 6. Remove punctuation from a string
  • 7. Remove stop words with NLTK
  • 8. Correct spelling with TextBlob
  • 9. Word tokenization with NLTK and TextBlob
  • 10. Stem the words of a sentence or phrase with NLTK
  • 11. Lemmatize a sentence or phrase with NLTK
  • 12. Find the frequency of each word in a text file with NLTK
  • 13. Create a word cloud from a corpus
  • 14. NLTK lexical dispersion plot
  • 15. Convert text to numbers with CountVectorizer
  • 16. Create a document-term matrix with TF-IDF
  • 17. Generate N-grams for a given sentence
  • 18. Specify a bigram vocabulary with sklearn CountVectorizer
  • 19. Extract noun phrases with TextBlob
  • 20. How to compute a word-word co-occurrence matrix
  • 21. Sentiment analysis with TextBlob
  • 22. Language translation with Goslate
  • 23. Language detection and translation with TextBlob
  • 24. Get definitions and synonyms with TextBlob
  • 25. Get a list of antonyms with TextBlob

1. Extract PDF Content

# pip install PyPDF2

import PyPDF2

# Create a PDF file object.
pdf = open("test.pdf", "rb")

# Create a PDF reader object.
pdf_reader = PyPDF2.PdfFileReader(pdf)

# Check the total number of pages in the PDF file.
print("Total number of pages:", pdf_reader.numPages)

# Create a page object (pages are 0-indexed).
page = pdf_reader.getPage(200)

# Extract the text of that specific page.
print(page.extractText())

# Close the file object.
pdf.close()
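The snippet above uses the legacy PyPDF2 1.x API (PdfFileReader, numPages, extractText). On PyPDF2 3.x those names were removed; the same steps look like the following minimal sketch, with test.pdf as a placeholder file:

from PyPDF2 import PdfReader

reader = PdfReader("test.pdf")
print("Total number of pages:", len(reader.pages))

page = reader.pages[0]      # pages are 0-indexed
print(page.extract_text())  # extract the text of that page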

2. Extract Word Content

# pip install python-docx

import docx


def main():
    try:
        # Create a Word reader object.
        doc = docx.Document('test.docx')
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return


if __name__ == '__main__':
    main()
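doc.paragraphs only covers running text; any tables in the document are exposed separately through doc.tables. A small sketch, again assuming a placeholder test.docx:

import docx

doc = docx.Document('test.docx')

# Walk every table and print each row as a list of cell texts.
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])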

3. Extract Web Page Content

# pip install bs4

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})

webpage = urlopen(req).read()

# Parse the page
soup = BeautifulSoup(webpage, 'html.parser')

# Pretty-print the parsed HTML
strhtm = soup.prettify()

# Print the first 500 characters
print(strhtm[:500])

# Extract meta tag values
print(soup.title.string)
print(soup.find('meta', attrs={'property': 'og:description'}))

# Extract anchor tag text
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag text
for x in soup.find_all('p'):
    print(x.text)
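The anchor loop above prints only the link text. The link targets are usually more useful; reusing the soup object from above:

for link in soup.find_all('a'):
    href = link.get('href')  # None when the tag has no href attribute
    if href:
        print(href)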

4. Read JSON Data

import requests
import json

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()

# Extract the content of a specific node.
print(res['quiz']['sport'])

# Dump the data back to a string
data = json.dumps(res)
print(data)
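json.dumps also takes an indent argument for human-readable output, and json.dump writes straight to a file. A short sketch continuing from res above (example.json is a placeholder path):

# Pretty-print the parsed structure with a 2-space indent.
print(json.dumps(res, indent=2))

# Write the same structure to disk.
with open('example.json', 'w') as f:
    json.dump(res, f, indent=2)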

5. Read CSV Data

import csv

with open('test.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip the header row
    for row in reader:
        print(row)
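If the file has a header row, csv.DictReader saves the manual next() call and gives you each row as a dict keyed by column name; a minimal sketch against the same placeholder test.csv:

import csv

with open('test.csv', 'r') as csv_file:
    for row in csv.DictReader(csv_file):  # the header row becomes the keys
        print(row)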

6. Remove Punctuation from a String

import re
import string

data = "Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!"

# Method 1: regex
# Remove the listed special characters from the string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)


# Method 2: translate()
# Build a translator object that deletes all punctuation.
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)
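Note that the regex in Method 1 only strips the characters listed in the class, while Method 2 removes everything in string.punctuation. If you want the regex route to cover the full punctuation set too, a small sketch:

import re
import string

# re.escape makes the punctuation characters safe inside a character class.
pattern = '[' + re.escape(string.punctuation) + ']'
print(re.sub(pattern, '', 'Hello, world! (testing)'))  # -> Hello world testing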

7. Remove Stop Words with NLTK

from nltk.corpus import stopwords

# Run nltk.download('stopwords') once if the corpus is not installed.

data = ['Stuning even for the non-gamer: This sound track was beautiful!\
It paints the senery in your mind so well I would recomend\
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate\
guitars and soulful orchestras.\
It would impress anyone who cares to listen!']

# Remove stop words
stopwords = set(stopwords.words('english'))

output = []
for sentence in data:
    temp_list = []
    for word in sentence.split():
        if word.lower() not in stopwords:
            temp_list.append(word)
    output.append(' '.join(temp_list))


print(output)
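Splitting on whitespace leaves punctuation glued to the words, so tokens like "was" survive as "was," and slip past the stop-word check. A hedged variant that tokenizes first (requires the 'punkt' NLTK data):

import nltk
from nltk.corpus import stopwords

# nltk.download('punkt')  # tokenizer models, needed once

stop_words = set(stopwords.words('english'))
tokens = nltk.word_tokenize("This sound track was beautiful!")
print([w for w in tokens if w.lower() not in stop_words])
# ['sound', 'track', 'beautiful', '!']  -- punctuation becomes its own token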

8. Correct Spelling with TextBlob

from textblob import TextBlob

data = "Natural language is a cantral part of our day to day life, and it's so antresting to work on any problem related to langages."

output = TextBlob(data).correct()
print(output)
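For a single word, TextBlob's Word class also exposes spellcheck(), which returns candidate corrections with confidence scores rather than silently picking one; a minimal sketch:

from textblob import Word

# Returns a list of (candidate, confidence) pairs, best guess first.
print(Word('langages').spellcheck())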

9. Word Tokenization with NLTK and TextBlob

import nltk
from textblob import TextBlob

# nltk.download('punkt')  # tokenizer models, needed once


data = "Natural language is a central part of our day to day life, and it's so interesting to work on any problem related to languages."

nltk_output = nltk.word_tokenize(data)
textblob_output = TextBlob(data).words

print(nltk_output)
print(textblob_output)

Output:

['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']
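As the output shows, NLTK keeps punctuation as separate tokens while TextBlob's .words drops it. Both libraries can also split text into sentences; a minimal sketch:

import nltk
from textblob import TextBlob

data = "Natural language is everywhere. It is fun to work on."

print(nltk.sent_tokenize(data))  # ['Natural language is everywhere.', 'It is fun to work on.']
print(TextBlob(data).sentences)  # a list of Sentence objects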

10. Stem the Words of a Sentence or Phrase with NLTK

from nltk.stem import PorterStemmer

st = PorterStemmer()
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))

Output:

where did he learn to danc like that?
hi eye were danc with humor.
she shook her head and danc away
alex wa an excel dancer.
--------------------------------------------------
jump jump jump
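The output shows Porter's trademark over-stemming ("hi eye", "wa"). NLTK also ships the Snowball ("Porter2") stemmer, which is often a better default; a hedged sketch of the same idea:

from nltk.stem import SnowballStemmer

snow = SnowballStemmer('english')
print(snow.stem('jumping'), snow.stem('jumps'), snow.stem('jumped'))
# jump jump jump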

11. Lemmatize a Sentence or Phrase with NLTK

from nltk.stem import WordNetLemmatizer

# Run nltk.download('wordnet') once if the corpus is not installed.

wnl = WordNetLemmatizer()
text = ['She gripped the armrest as he passed two cars at a time.',
        'Her car was in full view.',
        'A number of cars carried out of state license plates.']

output = []
for sentence in text:
    output.append(" ".join([wnl.lemmatize(i) for i in sentence.split()]))

for item in output:
    print(item)

print("*" * 10)
print(wnl.lemmatize('jumps', 'n'))
print(wnl.lemmatize('jumping', 'v'))
print(wnl.lemmatize('jumped', 'v'))

print("*" * 10)
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('happiest', 'a'))
print(wnl.lemmatize('easiest', 'a'))

Output:

She gripped the armrest a he passed two car at a time.
Her car wa in full view.
A number of car carried out of state license plates.
**********
jump
jump
jump
**********
sad
happy
easy
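lemmatize() treats every token as a noun unless told otherwise, which is why "was" becomes "wa" in the output above. Deriving the part of speech from nltk.pos_tag fixes most of these; a sketch using a hypothetical helper wordnet_pos:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# nltk.download('averaged_perceptron_tagger')  # needed once for pos_tag

def wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to a WordNet POS constant; default to noun.
    return {'J': wordnet.ADJ, 'V': wordnet.VERB, 'R': wordnet.ADV}.get(treebank_tag[0], wordnet.NOUN)

wnl = WordNetLemmatizer()
tokens = nltk.word_tokenize('Her car was in full view.')
print(' '.join(wnl.lemmatize(w, wordnet_pos(t)) for w, t in nltk.pos_tag(tokens)))
# 'was' is tagged as a verb, so it lemmatizes to 'be' instead of 'wa'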

12. Find the Frequency of Each Word in a Text File with NLTK

import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist

nltk.download('webtext')
wt_words = webtext.words('testing.txt')
data_analysis = FreqDist(wt_words)

# Keep only the words that are longer than 3 characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))

data_analysis = FreqDist(filter_words)

data_analysis.plot(25, cumulative=False)

Output:

[nltk_data] Downloading package webtext to
[nltk_data]     C:\Users\amit\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\webtext.zip.
1989: 1
Accessing: 1
Analysis: 1
Anyone: 1
Chapter: 1
Coding: 1
Data: 1
...
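FreqDist also gives you the top of the distribution directly, without sorting by hand; a minimal sketch on toy data:

from nltk.probability import FreqDist

fd = FreqDist('the cat sat on the mat with the cat'.split())
print(fd.most_common(3))  # [('the', 3), ('cat', 2), ...]
print(fd['cat'])          # 2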

13. Create a Word Cloud from a Corpus

# pip install wordcloud

import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data
data_analysis = FreqDist(wt_words)

filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

wcloud = WordCloud().generate_from_frequencies(filter_words)

# Plot the word cloud
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
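To keep the image instead of only displaying it, WordCloud can write a PNG directly; a one-line sketch continuing from wcloud above (cloud.png is a placeholder path):

wcloud.to_file("cloud.png")  # save the rendered cloud as a PNG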

