Regular expressions in Python (DataCamp, GitHub)

#---------------------------------------------------------------------------------------------------------------#
#Chapter 1 Regular expressions & word tokenization
#---------------------------------------------------------------------------------------------------------------#
# Note: my_string, scene_one, tweets, german_text and holy_grail are not defined in this file;
# they are assumed to be preloaded by the exercise environment.

##Practicing regular expressions: re.split() and re.findall()

# Import the regex module
import re

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

#---------------------------------------------------------------------------------------------------------------#
#Word tokenization with NLTK

# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

#---------------------------------------------------------------------------------------------------------------#
#More regex with re.search()

# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

# Write a regular expression to search for anything in square brackets: pattern1
# Note: .* is greedy, so this matches up to the last "]" on the line;
# use r"\[.*?\]" to stop at the first closing bracket instead
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))

#---------------------------------------------------------------------------------------------------------------#
#Regex with NLTK tokenization

# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on the first tweet in the tweets list
print(regexp_tokenize(tweets[0], pattern1))

# Write a pattern that matches both mentions and hashtags
# (inside a character class "|" is a literal pipe, so it is left out)
pattern2 = r"[#@]\w+"

# Use the pattern on the last tweet in the tweets list
print(regexp_tokenize(tweets[-1], pattern2))

# Use the TweetTokenizer to tokenize every tweet (one token list per tweet)
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

#---------------------------------------------------------------------------------------------------------------#
#Non-ascii tokenization

# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji
# (a single character class of Unicode ranges; quotes and "|" would be matched literally)
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))

#---------------------------------------------------------------------------------------------------------------#
#Charting practice

# Import the plotting module
import matplotlib.pyplot as plt

# Split the script into lines: lines
lines = holy_grail.split('\n')

# Remove the speaker prefix (e.g. "ARTHUR:" or "SOLDIER #1:") from each line
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in lines]

# Tokenize each line: tokenized_lines
tokenized_lines = [regexp_tokenize(s, r"\w+") for s in lines]

# Make a frequency list of lengths: line_num_words
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Plot a histogram of the line lengths
plt.hist(line_num_words)

# Show the plot
plt.show()
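
#---------------------------------------------------------------------------------------------------------------#
#Self-contained sketch of the re calls used above
# The exercises above depend on variables that this file does not define, so the
# sketch below reruns the core re calls on small made-up strings. The names
# sample and script_line and their contents are hypothetical stand-ins, not the
# original exercise data. Only the standard-library re module is used.

```python
import re

# Hypothetical stand-in for the preloaded my_string
sample = ("Let's write RegEx! Won't that be fun? I bet it will. "
          "There are 4 sentences and 22 words.")

# Splitting on sentence endings leaves a trailing empty string,
# because the text itself ends with a delimiter
sentences = re.split(r"[.?!]", sample)

# One uppercase letter followed by at least one word character,
# so the single-letter word "I" is not matched
capitalized = re.findall(r"[A-Z]\w+", sample)

# Runs of one or more digits
digits = re.findall(r"\d+", sample)

# re.search scans the whole string; re.match only tries position 0
found = re.search(r"fun", sample)
at_start = re.match(r"fun", sample)

# Hypothetical script line with two bracketed stage directions
script_line = "[wind] [clop clop clop] KING ARTHUR: Whoa there!"

# Greedy .* runs to the last "]"; lazy .*? stops at the first one
greedy = re.search(r"\[.*\]", script_line)
lazy = re.search(r"\[.*?\]", script_line)

# Script notation (speaker name plus colon) at the start of a line
speaker = re.match(r"[\w\s]+:", "ARTHUR: It is I, Arthur.")

print(sentences)
print(capitalized)       # ['Let', 'RegEx', 'Won', 'There']
print(digits)            # ['4', '22']
print(found.group(), at_start)
print(greedy.group())    # [wind] [clop clop clop]
print(lazy.group())      # [wind]
print(speaker.group())   # ARTHUR:
```

# The greedy/lazy pair is the part most likely to matter in practice: on a line
# with several bracketed directions, only the lazy form returns the first one.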