#---------------------------------------------------------------------------------------------------------------#
# Chapter 1: Regular expressions & word tokenization
#---------------------------------------------------------------------------------------------------------------#
## Practicing regular expressions: re.split() and re.findall()

# Import the regex module
import re

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

#---------------------------------------------------------------------------------------------------------------#
## Word tokenization with NLTK

# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

#---------------------------------------------------------------------------------------------------------------#
## More regex with re.search()

# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))

#---------------------------------------------------------------------------------------------------------------#
## Regex with NLTK tokenization

# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on the first tweet in the tweets list
regexp_tokenize(tweets[0], pattern1)

# Write a pattern that matches both mentions and hashtags
# (note: "|" is a literal character inside a character class, so [#@] is enough)
pattern2 = r"([#@]\w+)"

# Use the pattern on the last tweet in the tweets list
regexp_tokenize(tweets[-1], pattern2)

# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

#---------------------------------------------------------------------------------------------------------------#
## Non-ASCII tokenization

# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capitalized words (including the umlaut Ü)
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji
emoji = "['\U0001F300-\U0001F5FF'|'\U0001F600-\U0001F64F'|'\U0001F680-\U0001F6FF'|'\u2600-\u26FF\u2700-\u27BF']"
print(regexp_tokenize(german_text, emoji))
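#---------------------------------------------------------------------------------------------------------------#
# NOTE: my_string, scene_one, tweets, and german_text above are supplied by the
# exercise environment and are not defined in this file. A minimal stand-in for
# trying the regex snippets locally might look like this (the sample text below
# is a hypothetical assumption, not the course data):
import re

sample_string = "Let's write RegEx! Won't that be fun? I sure think so. Can you find 4 sentences, or all 19 words?"

print(re.split(r"[.?!]", sample_string))       # split on sentence endings
print(re.findall(r"[A-Z]\w+", sample_string))  # capitalized words
print(re.findall(r"\d+", sample_string))       # digits: ['4', '19']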
#---------------------------------------------------------------------------------------------------------------#
## Charting practice

# Import matplotlib for plotting
import matplotlib.pyplot as plt

# Split the script into lines: lines
lines = holy_grail.split('\n')

# Remove the speaker prefix (e.g. "ARTHUR:" or "SOLDIER #1:") from each line
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in lines]

# Tokenize each line: tokenized_lines
tokenized_lines = [regexp_tokenize(s, r"\w+") for s in lines]

# Make a frequency list of lengths: line_num_words
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Plot a histogram of the line lengths
plt.hist(line_num_words)

# Show the plot
plt.show()
#---------------------------------------------------------------------------------------------------------------#
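# NOTE: holy_grail above is the full script text provided by the course. A
# self-contained sketch of the same pipeline on a tiny stand-in excerpt (the
# excerpt is an assumption for illustration, not the course data):
import re
import matplotlib.pyplot as plt
from nltk.tokenize import regexp_tokenize

stand_in_script = "ARTHUR: It is I, Arthur, son of Uther Pendragon.\nSOLDIER #1: Pull the other one!"

# Strip speaker prefixes, tokenize each line, and count words per line
demo_lines = stand_in_script.split('\n')
demo_lines = [re.sub(r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:", '', l) for l in demo_lines]
demo_num_words = [len(regexp_tokenize(l, r"\w+")) for l in demo_lines]

# Histogram of words per spoken line
plt.hist(demo_num_words)
plt.show()
#---------------------------------------------------------------------------------------------------------------#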