Cannot import excel into a dataframe

nfmarquez · February 28, 2024, 10:22am

Hi everyone. I’m quite new to JupyterLab and am doing an introductory Python course. However I need to do something work-related that requires to import some data stored in an Excel file. I’ve dragged and dropped the .xlsx file into the working folder (just where the .ipynb is) and have done the following:

import pandas as pd

dataset = pd.read_excel(“XL_test.xlsx”)

(My Excel file is quite vasic, with 3 columns and a bit over 500 rows, starting on cell A1)

However I consistently get an error:

ValueError Traceback (most recent call last)
Cell In[51], line 1
----> 1 dataset = pd.read_excel(“XL_test.xlsx”)

File /lib/python3.11/site-packages/pandas/util/_decorators.py:211, in deprecate_kwarg.._deprecate_kwarg..wrapper(*args, **kwargs)
209 else:
210 kwargs[new_arg_name] = new_arg_value
→ 211 return func(*args, **kwargs)

File /lib/python3.11/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments..decorate..wrapper(*args, **kwargs)
325 if len(args) > num_allow_args:
326 warnings.warn(
327 msg.format(arguments=_format_argument_list(allow_args)),
328 FutureWarning,
329 stacklevel=find_stack_level(),
330 )
→ 331 return func(*args, **kwargs)

File /lib/python3.11/site-packages/pandas/io/excel/_base.py:482, in read_excel(io, sheet_name, header, names, index_col, usecols, squeeze, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, thousands, decimal, comment, skipfooter, convert_float, mangle_dupe_cols, storage_options)
480 if not isinstance(io, ExcelFile):
481 should_close = True
→ 482 io = ExcelFile(io, storage_options=storage_options, engine=engine)
483 elif engine and engine != io.engine:
484 raise ValueError(
485 "Engine should not be specified when passing "
486 “an ExcelFile - ExcelFile already has the engine set”
487 )

File /lib/python3.11/site-packages/pandas/io/excel/_base.py:1656, in ExcelFile.init(self, path_or_buffer, engine, storage_options)
1652 ext = inspect_excel_format(
1653 content_or_path=path_or_buffer, storage_options=storage_options
1654 )
1655 if ext is None:
→ 1656 raise ValueError(
1657 “Excel file format cannot be determined, you must specify "
1658 “an engine manually.”
1659 )
1661 engine = config.get_option(f"io.excel.{ext}.reader”, silent=True)
1662 if engine == “auto”:

ValueError: Excel file format cannot be determined, you must specify an engine manually.

It appears to return an error related to the extension. I guess I’ve read something about an Excel engine, but my attempts were unsuccessful.

Anyone can help, please?

fomightez · February 28, 2024, 1:27pm

If you just search the error, you’ll find lots of help pointing you in the right direction to deal with the error. Right now when I search that, in the top links lead me to here that says read the documentation on pandas.read_excel() and see that it says:

““xlrd” supports old-style Excel files (.xls). “openpyxl” supports newer Excel file formats.”

.xlsx file extension for the file you have indicates the newer format than .xls and so you want openpyxl as your engine. You may need to install it (see below).

The actual documentation for that [here](pandas.read_excel — pandas 2.2.1 documentation has recently changed since that post copied some info; however, you want to note it also says there now:

" engine {‘openpyxl’, ‘calamine’, ‘odf’, ‘pyxlsb’, ‘xlrd’}, default None"

That default None is important. It means you need to set it, which is what that error you are getting is trying to indicate, as welll.

And then you specify it like in this other example that came up towards the top of my search on the internet with that error as prompt just now here. In that there is a line like this specifying the engine:

df = pd.read_excel(file.read(), engine='openpyxl')

Translating that to your example, you’d run the code like this:

dataset = pd.read_excel(“XL_test.xlsx”, engine='openpyxl')

When you first try to run that, you may get an error if you haven’t yet installed openpyxl where your kernel is running. So you need to install that. You haven’t said if Anaconda/conda is your primary package manager or not? If it is then you can install it by making a new cell in the notebook where you want to use it and executing %conda install conda-forge::openpyxl based on here or %conda install anaconda::openpyxl based on here, and what version you want/package channel you typicall use. If you don’t use Anaconda/conda, you can use pip. So in a new cell to install from inside the running notebook, you’d run %pip install openpyxl, based on here. No matter what approach you take, make sure you restart the kernel after if you install within the running notebook.

This is not pertinent to this forum about Jupyter. If you ran this code directly in a Python console/interpreter or as part of a traditional Python script that you trigger running on the command line by pointing Python to execute it, then you’d get the same result. Meaning it is independent of Jupyter and comes from Python, in particular the pandas package you are trying to use. Try to always do this thought experiment when deciding where best to seek help.

Topic		Replies	Views
[Browser] No module xlrd General	2	7226	October 31, 2020
[Errno 2] No such file or directory, new JupyterLab	3	2433	December 4, 2023
Unable to import csv file in jupyter note book General how-to , help-wanted	6	3932	March 17, 2024
Cant load data into a dataframe General	2	807	November 9, 2020
Running pd.read_csv JupyterLab	5	4017	January 31, 2024

Cannot import excel into a dataframe

ValueError: Excel file format cannot be determined, you must specify an engine manually.

Related topics