Regular expressions in Python — DataCamp exercises


#---------------------------------------------------------------------------------------------------------------#
# Chapter 1: Regular expressions & word tokenization
#---------------------------------------------------------------------------------------------------------------#
## Practicing regular expressions: re.split() and re.findall()

# Import the regex module
import re

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

#---------------------------------------------------------------------------------------------------------------#
# Word tokenization with NLTK

# Import the necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

#---------------------------------------------------------------------------------------------------------------#
# More regex with re.search()

# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*\]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

# Find the script notation at the beginning of the fourth sentence and print it
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))

#---------------------------------------------------------------------------------------------------------------#
# Regex with NLTK tokenization

# Import the necessary modules
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

# Use the pattern on the first tweet in the tweets list
regexp_tokenize(tweets[0], pattern1)

# Write a pattern that matches both mentions and hashtags: pattern2
pattern2 = r"([#|@]\w+)"

# Use the pattern on the last tweet in the tweets list
regexp_tokenize(tweets[-1], pattern2)

# Use the TweetTokenizer to tokenize all tweets into one list
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

#---------------------------------------------------------------------------------------------------------------#
# Non-ASCII tokenization

# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capital words (including umlauts)
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji
emoji = "['\U0001F300-\U0001F5FF'|'\U0001F600-\U0001F64F'|'\U0001F680-\U0001F6FF'|'\u2600-\u26FF\u2700-\u27BF']"
print(regexp_tokenize(german_text, emoji))
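#---------------------------------------------------------------------------------------------------------------#
# NOTE: the exercises above rely on variables preloaded in the DataCamp session (my_string, scene_one, tweets,
# german_text). A minimal standalone sketch, assuming made-up sample values for my_string and tweets, could be:
import re
from nltk.tokenize import regexp_tokenize

my_string = "Let's write RegEx! Won't that be fun? I sure think so."           # hypothetical sample text
print(re.split(r"[.?!]", my_string))                                            # split on sentence endings
print(re.findall(r"[A-Z]\w+", my_string))                                       # find capitalized words

tweets = ["This is the best #nlp exercise ever", "Thanks @datacamp #python"]   # hypothetical sample tweets
print(regexp_tokenize(tweets[-1], r"([#|@]\w+)"))                               # mentions and hashtags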
#---------------------------------------------------------------------------------------------------------------#
# Charting practice

# Import matplotlib for plotting
import matplotlib.pyplot as plt

# Split the script into lines: lines
lines = holy_grail.split('\n')

# Replace the speaker markers (e.g. "ARTHUR:", "SOLDIER #1:") in every script line
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in lines]

# Tokenize each line: tokenized_lines
tokenized_lines = [regexp_tokenize(s, r"\w+") for s in lines]

# Make a frequency list of lengths: line_num_words
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Plot a histogram of the line lengths
plt.hist(line_num_words)

# Show the plot
plt.show()

#---------------------------------------------------------------------------------------------------------------#
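# NOTE: holy_grail (the full Monty Python script text) is preloaded by DataCamp. A minimal standalone sketch of
# the same pipeline, assuming a short made-up excerpt, could be:
import re
import matplotlib.pyplot as plt
from nltk.tokenize import regexp_tokenize

holy_grail = "ARTHUR: It is I, Arthur, son of Uther Pendragon.\nSOLDIER #1: Pull the other one!"  # hypothetical excerpt
lines = holy_grail.split('\n')
# Strip speaker markers such as "ARTHUR:" or "SOLDIER #1:" so only the spoken words remain
lines = [re.sub(r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:", '', l) for l in lines]
# Count the words per line and plot the distribution of line lengths
line_num_words = [len(regexp_tokenize(l, r"\w+")) for l in lines]
plt.hist(line_num_words)
plt.show()
#---------------------------------------------------------------------------------------------------------------#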
