How to Find Chinese And Japanese Character in a String in Python

Last Updated : 15 Sep, 2024

Detecting Chinese or Japanese characters in a string can be useful for a variety of applications, such as text preprocessing, language detection, and character classification. In this article, we’ll explore simple yet effective ways to identify Chinese or Japanese characters in a string using Python.

Understanding Unicode Character Ranges

Chinese, Japanese, and Korean (CJK) characters are part of the Unicode standard. These characters belong to specific Unicode ranges, and we can leverage these ranges to detect the presence of CJK characters in a string.

Here are the relevant Unicode ranges for Chinese and Japanese characters:

Chinese characters (CJK Unified Ideographs):

  • \u4E00 to \u9FFF (the main CJK Unified Ideographs block, which covers the characters in everyday use)
  • \u3400 to \u4DBF (CJK Unified Ideographs Extension A, which holds rarer characters)

Japanese characters:

  • Hiragana: \u3040 to \u309F
  • Katakana: \u30A0 to \u30FF
  • Kanji: Same as Chinese (CJK Unified Ideographs)
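To make the ranges above concrete, the following sketch maps a character to the range that contains it by comparing its code point (from ord()) against each range. The names CJK_RANGES and script_of are hypothetical, chosen only for this illustration.

```python
# Hypothetical lookup table built from the ranges listed above.
CJK_RANGES = {
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
    "CJK Unified Ideographs Extension A": (0x3400, 0x4DBF),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
}


def script_of(char):
    """Return the name of the listed range containing char, or None."""
    code_point = ord(char)
    for name, (start, end) in CJK_RANGES.items():
        if start <= code_point <= end:
            return name
    return None


# Print the code point and matching range for a few sample characters.
for ch in "中あカa":
    print(f"U+{ord(ch):04X} {ch!r} -> {script_of(ch)}")
```

A Latin letter such as 'a' falls outside every listed range, so script_of() returns None for it.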

Using Regular Expressions to Find Chinese or Japanese Characters

A concise way to find Chinese or Japanese characters in a string is to use regular expressions (regex) that match Unicode character ranges. By placing the ranges for Chinese or Japanese characters inside a character class, we can search for them in any given string.

1. Finding Chinese Characters

Chinese characters, both Simplified and Traditional, fall within the CJK Unified Ideographs block. In this example, we pass a character class covering that block (\u4E00 to \u9FFF) to re.compile(), which turns the pattern into a reusable regex object. Its findall() method then scans the whole string (text) and returns every matching character as a list. If rare characters matter, the Extension A range (\u3400 to \u4DBF) can be added to the same character class.

Python
import re

def find_chinese(text):
    # Regex pattern for Chinese characters (CJK Unified Ideographs)
    chinese_pattern = re.compile(r'[\u4E00-\u9FFF]')
    return chinese_pattern.findall(text)

# Example usage
text = "This is an example string 包含中文字符."
chinese_chars = find_chinese(text)
print("Chinese characters found:", chinese_chars)

Output
Chinese characters found: ['包', '含', '中', '文', '字', '符'] 
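The example above returns one list item per character. As a small variation, the sketch below adds a + quantifier so that consecutive characters come back as whole runs, and uses search() when only a yes/no answer is needed. The names HAN_RUN, find_chinese_runs, and has_chinese are hypothetical.

```python
import re

# Same range as the example above, with + so adjacent characters form one match.
HAN_RUN = re.compile(r'[\u4E00-\u9FFF]+')


def find_chinese_runs(text):
    """Return each run of consecutive Chinese characters as one string."""
    return HAN_RUN.findall(text)


def has_chinese(text):
    """Return True as soon as one Chinese character is found."""
    return HAN_RUN.search(text) is not None


print(find_chinese_runs("abc 中文 def 字符"))
print(has_chinese("plain ASCII text"))
```

search() stops at the first match, so it avoids building a full list when we only need to know whether any Chinese character is present.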

2. Finding Japanese Characters

To detect Japanese characters, we need to include the Unicode ranges for Hiragana, Katakana, and Kanji (CJK Ideographs). In this example, we pass a character class covering those ranges to re.compile(), which turns the pattern into a reusable regex object. Its findall() method then scans the whole string (text) and returns every matching character. Punctuation such as the full stop 。 lies outside these ranges, so it does not appear in the result.

Python
# import regular expression module
import re

def find_japanese(text):
    # Regex pattern for Hiragana, Katakana, and Kanji (CJK Ideographs)
    japanese_pattern = re.compile(r'[\u3040-\u30FF\u4E00-\u9FFF]')
    return japanese_pattern.findall(text)

# Example usage
text = "これは日本語の文字列です。"
japanese_chars = find_japanese(text)
print("Japanese characters found:", japanese_chars)

Output
Japanese characters found: ['こ', 'れ', 'は', '日', '本', '語', 'の', '文', '字', '列', 'で', 'す'] 
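Since Hiragana, Katakana, and Kanji occupy separate ranges, the matches can also be grouped by script instead of being returned as one list. The following is a minimal sketch of that idea; split_japanese and the three pattern names are hypothetical.

```python
import re

# One pattern per range from the list of Japanese ranges.
HIRAGANA = re.compile(r'[\u3040-\u309F]')
KATAKANA = re.compile(r'[\u30A0-\u30FF]')
KANJI = re.compile(r'[\u4E00-\u9FFF]')


def split_japanese(text):
    """Group the Japanese characters in text by script."""
    return {
        "hiragana": HIRAGANA.findall(text),
        "katakana": KATAKANA.findall(text),
        "kanji": KANJI.findall(text),
    }


for script, chars in split_japanese("これはテストの文字列").items():
    print(script, chars)
```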

3. Detecting Both Chinese and Japanese Characters

Since Chinese and Japanese share the CJK Ideographs blocks, we can write a single function that matches characters from either language by combining all of the ranges in one character class.

Python
import re

def find_cjk(text):
    # Regex pattern for CJK characters (Chinese, Japanese)
    cjk_pattern = re.compile(r'[\u3040-\u30FF\u3400-\u4DBF\u4E00-\u9FFF]')
    return cjk_pattern.findall(text)

# Example usage
text = "This string contains 漢字 and カタカナ."
cjk_chars = find_cjk(text)
print("CJK characters found:", cjk_chars)

Output
CJK characters found: ['漢', '字', 'カ', 'タ', 'カ', 'ナ'] 
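The ranges used so far all sit in the Basic Multilingual Plane. Very rare ideographs live in higher blocks such as CJK Unified Ideographs Extension B (U+20000 to U+2A6DF), which need the eight-digit \U escape in Python. The sketch below extends the pattern with that one block as an illustration; it is not a complete list of extension blocks, and the names EXTENDED_CJK and find_cjk_extended are hypothetical.

```python
import re

# The combined pattern from above plus Extension B (an illustrative addition).
EXTENDED_CJK = re.compile(
    r'[\u3040-\u30FF\u3400-\u4DBF\u4E00-\u9FFF\U00020000-\U0002A6DF]'
)


def find_cjk_extended(text):
    """Return CJK characters, including those in Extension B."""
    return EXTENDED_CJK.findall(text)


# \U00020000 is the first code point of Extension B.
sample = "rare: \U00020000 common: 漢"
print([f"U+{ord(c):04X}" for c in find_cjk_extended(sample)])
```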

Iterating Over the String to Check for CJK Characters

Another method involves manually iterating over the string and checking each character to see if it belongs to a specific Unicode range.

1. Checking for Chinese Characters

In this example, we use a simple for loop to visit each character in the string and an if condition to check whether it falls within one of the Chinese character ranges. The function returns True at the first match.

Python
def contains_chinese(text):
    for char in text:
        if '\u4E00' <= char <= '\u9FFF' or '\u3400' <= char <= '\u4DBF':
            return True
    return False

# Example usage
text = "This sentence contains 中文."
if contains_chinese(text):
    print("Chinese characters detected.")
else:
    print("No Chinese characters found.")

Output
Chinese characters detected. 
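The same range comparison can report where the characters occur rather than only whether they exist. Here is a minimal sketch using enumerate(); the function name chinese_positions is hypothetical.

```python
def chinese_positions(text):
    """Return (index, character) pairs for characters in the Chinese ranges."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if '\u4E00' <= ch <= '\u9FFF' or '\u3400' <= ch <= '\u4DBF'
    ]


print(chinese_positions("ab中c文"))
```

Comparing single-character strings with <= compares their code points, which is why the range check works without calling ord().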

2. Checking for Japanese Characters

In this example, we use a simple for loop to visit each character in the string and an if condition to check whether it falls within the Hiragana, Katakana, or Kanji ranges. The function returns True at the first match.

Python
def contains_japanese(text):
    for char in text:
        if '\u3040' <= char <= '\u30FF' or '\u4E00' <= char <= '\u9FFF':
            return True
    return False

# Example usage
text = "これはテストです。"
if contains_japanese(text):
    print("Japanese characters detected.")
else:
    print("No Japanese characters found.")

Output
Japanese characters detected. 

Detecting Multiple CJK Characters Simultaneously

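If we want to check a string for Chinese and Japanese characters at the same time, we can combine the approaches. Note that both checks include the shared ideograph range \u4E00 to \u9FFF, so any Han character sets both flags; only Hiragana and Katakana are specific to Japanese text.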

Python
def detect_cjk_languages(text):
    has_chinese = any('\u4E00' <= char <= '\u9FFF' or
                      '\u3400' <= char <= '\u4DBF' for char in text)
    has_japanese = any('\u3040' <= char <= '\u30FF' or
                       '\u4E00' <= char <= '\u9FFF' for char in text)

    if has_chinese:
        print("Chinese characters detected.")
    if has_japanese:
        print("Japanese characters detected.")
    if not has_chinese and not has_japanese:
        print("No CJK characters detected.")

# Example usage
text = "これはテストです and this contains 漢字."
detect_cjk_languages(text)

Output
Chinese characters detected.
Japanese characters detected.
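Because Kanji and Chinese characters share the same code points, range checks alone cannot say which language an ideograph belongs to. One possible heuristic, sketched below, treats kana as the signal for Japanese and reports ideographs without kana as ambiguous. The function name classify_cjk and its return labels are hypothetical, and the rule is an assumption for illustration rather than a reliable language detector (a short Japanese phrase written only in Kanji would be labelled ambiguous).

```python
def classify_cjk(text):
    """Heuristic: kana implies Japanese; ideographs without kana are ambiguous."""
    has_kana = any('\u3040' <= ch <= '\u30FF' for ch in text)
    has_han = any(
        '\u4E00' <= ch <= '\u9FFF' or '\u3400' <= ch <= '\u4DBF'
        for ch in text
    )
    if has_kana:
        return "japanese"
    if has_han:
        return "chinese-or-japanese"
    return "none"


for sample in ["これはテストです", "包含中文字符", "plain text"]:
    print(sample, "->", classify_cjk(sample))
```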

Real-World Use Cases

  • Text Preprocessing for NLP: Detecting and filtering out Chinese or Japanese characters can be useful in multilingual text processing and machine learning tasks.
  • Language Detection: Before analyzing a text, we might want to check for the presence of different languages and take actions accordingly (e.g., translating the text, applying specific language models).
  • Character Categorization: This method can be used to categorize and filter strings based on character types.
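As a rough illustration of the preprocessing and categorization use cases above, the sketch below removes CJK characters with sub() and measures what share of a string's non-space characters are CJK. The names CJK, strip_cjk, and cjk_ratio are hypothetical, and any threshold applied to the ratio would be an application-specific choice.

```python
import re

# Combined Hiragana, Katakana, Extension A, and main ideograph ranges.
CJK = re.compile(r'[\u3040-\u30FF\u3400-\u4DBF\u4E00-\u9FFF]')


def strip_cjk(text):
    """Remove every CJK character from text."""
    return CJK.sub('', text)


def cjk_ratio(text):
    """Return the fraction of non-whitespace characters that are CJK."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if CJK.match(ch)) / len(chars)


print(strip_cjk("abc漢字def"))
print(cjk_ratio("ab漢字"))
```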

Conclusion

Detecting Chinese or Japanese characters in Python can be easily achieved using regular expressions or iterating over characters in the string. Whether we’re looking to preprocess multilingual text, detect specific languages, or filter certain character types, these methods provide a straightforward approach to handling CJK characters in our Python applications.


monkserndp4