# Regular expressions in Python (DataCamp exercises)

#---------------------------------------------------------------------------------------------------------------#
# Chapter 1: Regular expressions & word tokenization
#---------------------------------------------------------------------------------------------------------------#
# Note: my_string, scene_one, tweets, german_text, and holy_grail are pre-loaded by the DataCamp
# exercise environment; they are not defined in this script.

# Practicing regular expressions: re.split() and re.findall()

# Import the regex module
import re

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

#---------------------------------------------------------------------------------------------------------------#
# Word tokenization with NLTK

# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

#---------------------------------------------------------------------------------------------------------------#
# More regex with re.search()

# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

# Find the script notation (e.g. "ARTHUR:") at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))

#---------------------------------------------------------------------------------------------------------------#
# Regex with NLTK tokenization

# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on the first tweet in the tweets list
print(regexp_tokenize(tweets[0], pattern1))

# Write a pattern that matches both mentions and hashtags: a character class of @ and #
# (note: "|" inside [] would match a literal pipe, not act as alternation)
pattern2 = r"[@#]\w+"

# Use the pattern on the last tweet in the tweets list
print(regexp_tokenize(tweets[-1], pattern2))

# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

#---------------------------------------------------------------------------------------------------------------#
# Non-ASCII tokenization

# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capitalized words (including words starting with the umlaut Ü)
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji: a single character class covering the main emoji and symbol ranges
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))
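#---------------------------------------------------------------------------------------------------------------#
# Illustrative, self-contained sketch (not part of the DataCamp exercises): since my_string, scene_one,
# tweets, and german_text exist only inside the DataCamp environment, the sample_* variables below are
# hypothetical stand-ins invented here so the same calls can be run end to end locally. sent_tokenize and
# word_tokenize above may also require the NLTK tokenizer models, e.g. nltk.download('punkt').
import re
from nltk.tokenize import regexp_tokenize

sample_string = "Let's write RegEx! Won't that be fun? I sure think so. There are 2 sentences here, maybe 4!"
sample_tweets = ["This is the best #nlp exercise ever!", "Thanks @datacamp :) #python #regex"]

# Same patterns as the exercise above, applied to the hypothetical sample string
print(re.split(r"[.?!]", sample_string))      # split on sentence endings
print(re.findall(r"[A-Z]\w+", sample_string)) # capitalized words
print(re.findall(r"\d+", sample_string))      # digits

# Hashtags from the first sample tweet, mentions and hashtags from the last one
print(regexp_tokenize(sample_tweets[0], r"#\w+"))
print(regexp_tokenize(sample_tweets[-1], r"[@#]\w+"))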
#---------------------------------------------------------------------------------------------------------------#
# Charting practice

# Import the plotting module
import matplotlib.pyplot as plt

# Split the script into lines: lines
lines = holy_grail.split('\n')

# Strip the speaker markers (e.g. "ARTHUR:" or "SOLDIER #1:") from each line
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in lines]

# Tokenize each line into words: tokenized_lines
tokenized_lines = [regexp_tokenize(s, r"\w+") for s in lines]

# Count the number of words in each line: line_num_words
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Plot a histogram of the line lengths
plt.hist(line_num_words)

# Show the plot
plt.show()
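#---------------------------------------------------------------------------------------------------------------#
# Illustrative, self-contained sketch (not part of the DataCamp exercises): holy_grail above is the full
# Monty Python script pre-loaded by DataCamp, so sample_script below is a tiny made-up stand-in used only
# to show the same speaker-stripping and word-count pipeline running end to end.
import re
import matplotlib.pyplot as plt
from nltk.tokenize import regexp_tokenize

sample_script = (
    "SOLDIER #1: Where did you get the coconuts?\n"
    "ARTHUR: We found them.\n"
    "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"
)

# Strip speaker markers, tokenize, and count words per line, exactly as in the exercise above
sample_lines = sample_script.split('\n')
sample_lines = [re.sub(r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:", '', l) for l in sample_lines]
sample_num_words = [len(regexp_tokenize(l, r"\w+")) for l in sample_lines]

# Histogram of words per line
plt.hist(sample_num_words)
plt.show()
#---------------------------------------------------------------------------------------------------------------#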