How To Extract Data From Common File Formats in Python?

Last Updated : 13 Jan, 2021

Most of us who work with datasets have mostly dealt with .csv (Comma Separated Values) files. They are a great starting point for applying Data Science techniques and algorithms. But sooner or later many of us will join Data Science firms or take up real-world Data Science projects, and in those projects the data is rarely available in a neat .csv file. Instead, we have to extract it from different sources such as images, PDF files, and Word documents. This article is a starting point for tackling those situations.

Below we will see how to extract relevant information from multiple such sources.

1. Multiple Sheet Excel Files

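Note that pd.read_csv() cannot read an Excel workbook; use pd.read_excel() instead (for .xlsx files pandas relies on the openpyxl package, which may need to be installed separately). By default, pd.read_excel('File.xlsx') returns only the first sheet, which is enough for a single-sheet file. It is not enough for a multiple-sheet file like the one used here, which has 3 sheets (Sheet1, Sheet2, Sheet3). For such files, the sheet_name parameter selects which sheet to read.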

Excel sheet used: Click Here.

Example: We will see how to read this excel-file.

Python3
# import Pandas library
import pandas as pd

# Read our file. Here sheet_name=1
# means we are reading the 2nd sheet or Sheet2
df = pd.read_excel('Sample1.xlsx', sheet_name = 1)
df.head()

Output:

Now let's read selected columns of the same sheet:

Python3
# Read only columns A, B, C out of
# the four columns A, B, C, D in Sheet2
df = pd.read_excel('Sample1.xlsx',
                   sheet_name = 1, usecols = 'A:C')
df.head()

Output:

Now let's read all the sheets together:

Sheet1 contains columns A, B, C; Sheet2 contains A, B, C, D; and Sheet3 contains B, D. The simple example below reads all 3 sheets and stacks them into one DataFrame with a common set of columns. Columns are matched by name, and a column that a sheet does not have is filled with NaN for that sheet's rows.

Python3
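# sheet_name=None reads every sheet and returns
# a dict of {sheet name: DataFrame}
df = pd.read_excel('Sample1.xlsx', sheet_name = None)

# Stack the sheets row-wise on their common columns
df2 = pd.concat([df[i] for i in df.keys()],
                axis = 0)

# display() is available in Jupyter notebooks;
# use print(df2) in a plain script
display(df2)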

Output:
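To see how the concatenation lines up sheets with different columns, here is a small sketch that uses in-memory DataFrames in place of the workbook. The values are made up for illustration; only the column layout (A, B, C / A, B, C, D / B, D) follows the description of the three sheets. Passing the dict itself to pd.concat is an optional variation, not the article's method, that also records which sheet each row came from.

```python
import pandas as pd

# Hypothetical stand-ins for the three sheets; the values are invented,
# only the column layout mirrors Sheet1, Sheet2 and Sheet3.
sheets = {
    "Sheet1": pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]}),
    "Sheet2": pd.DataFrame({"A": [7], "B": [8], "C": [9], "D": [10]}),
    "Sheet3": pd.DataFrame({"B": [11], "D": [12]}),
}

# Concatenating a dict stacks the frames row-wise and uses the keys
# (sheet names) as the outer level of the row index.
combined = pd.concat(sheets, axis=0, names=["sheet", "row"])

# Columns are matched by name; a column a sheet lacks is filled with NaN.
print(combined)

# Rows that came from one particular sheet
print(combined.loc["Sheet3"])
```

With a real workbook, the dict returned by pd.read_excel(..., sheet_name=None) could be passed to pd.concat in the same way.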

2. Extract Text From Images

Now we will discuss how to extract text from images.

To give our Python program character recognition (OCR) capabilities, we will use pytesseract, a Python wrapper for the Tesseract OCR engine. The library can be installed into our Python environment by running the following command in the OS command interpreter:

pip install pytesseract

pytesseract only wraps the Tesseract engine, so the Tesseract executable (tesseract.exe on Windows) must also be installed separately. During the installation of the executable, we are prompted to specify a path for it. Remember this path, as it is used later in the code. On Windows the path is typically C:\Program Files\Tesseract-OCR\tesseract.exe or, for some installers, C:\Program Files (x86)\Tesseract-OCR\tesseract.exe.
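Rather than hard-coding the path, the executable can be looked up at run time. The following is a minimal sketch using only the standard library; find_tesseract and WINDOWS_CANDIDATES are hypothetical names introduced for illustration, and the candidate paths are just the two typical Windows locations, which may not match your installation.

```python
import os
import shutil

def find_tesseract(candidates=(), executable="tesseract"):
    """Return the first usable path to the Tesseract binary, or None."""
    # Prefer whatever is already reachable through PATH
    on_path = shutil.which(executable)
    if on_path:
        return on_path
    # Otherwise fall back to the explicit candidate locations
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None

# Hypothetical candidate locations; adjust to your own installation.
WINDOWS_CANDIDATES = (
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
)

tesseract_path = find_tesseract(WINDOWS_CANDIDATES)
print(tesseract_path)
```

If the result is not None, it can be assigned to pytesseract.pytesseract.tesseract_cmd before calling image_to_string.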

Image for demonstration:

Python3
# We import necessary libraries.
# The PIL Library is used to read the images
from PIL import Image
import pytesseract

# Read the image
image = Image.open(r'pic.png')

# Perform the information extraction from images
# Note below, put the address where tesseract.exe
# file is located in your system
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

print(pytesseract.image_to_string(image))

Output:

GeeksforGeeks

3. Extracting text from Doc File

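Here we will extract text from a Word document using the docx module, which is provided by the python-docx package. It reads .docx files; the legacy .doc format is not supported.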

For installation:

pip install python-docx

Image for demonstration:  Aniket_Doc.docx 

Example 1: First we'll extract the title:

Python3
# Importing our library and reading the doc file
import docx
doc = docx.Document('csv/g.docx')

# Printing the title
print(doc.paragraphs[0].text)

Output:

My Name Aniket

Example 2: Then we'll extract all the paragraph text present (excluding the table).

Python3
# Getting all the text in the doc file
l = [p.text for p in doc.paragraphs]

# There might be many useless empty
# strings present so removing them
l = [i for i in l if len(i) != 0]
print(l)

Output:

['My Name Aniket', '               Hello I am Aniket', 'I am giving tutorial on how to extract text from MS Doc.', 'Please go through it carefully.']

Example 3: Now we'll extract the table:

Python3
# Since there is only one table in
# our doc file we are using 0. For multiple tables
# you can use a suitable for loop
table = doc.tables[0]

# List that will hold one list of cell texts per row
list2 = []

# Looping through each row of table
for row in table.rows:
    list1 = []

    # Looping through each cell of the row
    for cell in row.cells:

        # Extracting the required text
        list1.append(cell.paragraphs[0].text)

    list2.append(list1)

print(list2)

Output:

[['A', 'B', 'C'], ['12', 'aNIKET', '@@@'], ['3', 'SOM', '+12&']]
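The nested list is easier to work with once the first row is treated as a header. Here is a minimal sketch, assuming the first row holds the column names (true for this sample table, but an assumption in general). The rows variable is copied from the printed output; records, buffer and csv_text are names introduced only for this illustration.

```python
import csv
import io

# The table extracted above, copied from the printed output
rows = [['A', 'B', 'C'], ['12', 'aNIKET', '@@@'], ['3', 'SOM', '+12&']]

# Assumption: the first row is the header, the rest are data rows
header, *body = rows
records = [dict(zip(header, row)) for row in body]
print(records)

# Write the same data as CSV text (to an in-memory buffer here;
# an open file object would work the same way)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

If pandas is already in use, pd.DataFrame(body, columns=header) would give the same table as a DataFrame.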

4. Extracting Data From PDF File

The task is to extract data (text and images) from a PDF in Python using the PyMuPDF library. First, we install PyMuPDF together with Pillow:

pip install PyMuPDF Pillow

Example 1:

Now we will extract data from the pdf version of the same doc file.

Python3
# import module (PyMuPDF is imported as fitz)
import fitz

# Reading our pdf file
docu = fitz.open('file.pdf')

# Initializing an empty list where we will put all text
text_list = []

# Looping through all pages of the pdf file
# (recent PyMuPDF releases use snake_case names such as
# page_count, load_page and get_text)
for i in range(docu.page_count):

    # Loading each page
    pg = docu.load_page(i)

    # Extracting text from each page
    pg_txt = pg.get_text('text')

    # Appending text to the empty list
    text_list.append(pg_txt)

# Cleaning the text of every page by removing useless
# empty strings and unicode character '\u200b'
text_list = [line.replace('\u200b', '')
             for page_text in text_list
             for line in page_text.split('\n')
             if len(line.strip()) != 0]
print(text_list)

Output:

['My Name Aniket ', '               Hello I am Aniket ', 'I am giving tutorial on how to extract text from MS Doc. ', 'Please go through it carefully. ', 'A ', 'B ', 'C ', '12 ', 'aNIKET ', '@@@ ', '3 ', 'SOM ', '+12& ']
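The extracted lines still carry stray whitespace, and the table has been flattened into one cell per line. The sketch below tidies the list and regroups the table cells. It assumes we already know that the table has 3 columns and occupies the last 9 entries, which holds for this sample only; plain text extraction does not preserve table structure. clean and chunk are hypothetical helper names.

```python
# Lines as printed above (copied from the sample output)
text_list = ['My Name Aniket ', '               Hello I am Aniket ',
             'I am giving tutorial on how to extract text from MS Doc. ',
             'Please go through it carefully. ',
             'A ', 'B ', 'C ', '12 ', 'aNIKET ', '@@@ ', '3 ', 'SOM ', '+12& ']

def clean(lines):
    """Strip whitespace and zero-width spaces; drop lines left empty."""
    stripped = (line.replace('\u200b', '').strip() for line in lines)
    return [line for line in stripped if line]

def chunk(cells, width):
    """Group a flat list of cells into rows of `width` items."""
    return [cells[i:i + width] for i in range(0, len(cells), width)]

lines = clean(text_list)

# Assumption: the last 9 entries are the cells of a 3-column table
N_COLS, N_CELLS = 3, 9
paragraphs, table = lines[:-N_CELLS], chunk(lines[-N_CELLS:], N_COLS)

print(paragraphs)
print(table)
```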

Example 2: Extract image from PDF.

Python3
# Iterating through the pages
for current_page in range(len(docu)):

    # Getting the images in that page
    for image in docu.get_page_images(current_page):

        # get the XREF of the image. The XREF can be thought of as
        # a reference to the location of the image object in the PDF
        xref = image[0]

        # extract the object i.e,
        # the image in our pdf file at that XREF
        pix = fitz.Pixmap(docu, xref)

        # PNG cannot store CMYK data, so convert to RGB first
        if pix.n - pix.alpha > 3:
            pix = fitz.Pixmap(fitz.csRGB, pix)

        # Storing the image as .png
        pix.save('page %s - %s.png' % (current_page, xref))

The image is saved in the current working directory with a name of the form 'page <page no.> - <xref>.png'. In our case, its name is page 0 - 7.png.

Now let's view the image.

Python3
# Import necessary library
import matplotlib.pyplot as plt

# Read and display the image
img = plt.imread('page 0 - 7.png')
plt.imshow(img)

# Needed to open the figure window when run as a script
plt.show()

Output:



aniketmitra