So you have tried all the tools and libraries, but extracting data from
giant Excel sheets is still too slow ?
On the menu for this blog post we have:
- Quick dive into the structure of XLSX files
- The lxml library and the XSLT transformation
- How to use XSLT to parse XLSX sheets
- Full listing of the code
While the method presented here-under is fast, it cuts a lot of
corners, it is only meant to extract raw data.
XLSX file structure and the performance implications
Basically an XLSX file is a zip archive of a bunch of xml files. This is how a basic sheet looks in Excel:
And this is the content of the archive (seen inside emacs) corresponding
to the above screenshot:
Filemode Length Date Time File
- ---------- -------- ----------- -------- -----------------------------------
-rw-rw-rw- 1811 1-Jan-1980 00:00:00 [Content_Types].xml
-rw-rw-rw- 588 1-Jan-1980 00:00:00 _rels/.rels
-rw-rw-rw- 1266 1-Jan-1980 00:00:00 xl/_rels/workbook.xml.rels
-rw-rw-rw- 1883 1-Jan-1980 00:00:00 xl/workbook.xml
-rw-rw-rw- 199 1-Jan-1980 00:00:00 xl/sharedStrings.xml
-rw-rw-rw- 304 1-Jan-1980 00:00:00 xl/worksheets/_rels/sheet1.xml.rels
-rw-rw-rw- 8390 1-Jan-1980 00:00:00 xl/theme/theme1.xml
-rw-rw-rw- 1618 1-Jan-1980 00:00:00 xl/styles.xml
-rw-rw-rw- 1464 1-Jan-1980 00:00:00 xl/worksheets/sheet1.xml
-rw-rw-rw- 235 1-Jan-1980 00:00:00 customXml/itemProps2.xml
-rw-rw-rw- 72 1-Jan-1980 00:00:00 xl/customProperty1.bin
-rw-rw-rw- 296 1-Jan-1980 00:00:00 customXml/_rels/item1.xml.rels
-rw-rw-rw- 296 1-Jan-1980 00:00:00 customXml/_rels/item2.xml.rels
-rw-rw-rw- 296 1-Jan-1980 00:00:00 customXml/_rels/item3.xml.rels
-rw-rw-rw- 296 1-Jan-1980 00:00:00 customXml/_rels/item4.xml.rels
-rw-rw-rw- 41 1-Jan-1980 00:00:00 customXml/item1.xml
-rw-rw-rw- 235 1-Jan-1980 00:00:00 customXml/itemProps1.xml
-rw-rw-rw- 991 1-Jan-1980 00:00:00 docProps/app.xml
-rw-rw-rw- 49 1-Jan-1980 00:00:00 customXml/item3.xml
-rw-rw-rw- 235 1-Jan-1980 00:00:00 customXml/itemProps3.xml
-rw-rw-rw- 49 1-Jan-1980 00:00:00 customXml/item4.xml
-rw-rw-rw- 235 1-Jan-1980 00:00:00 customXml/itemProps4.xml
-rw-rw-rw- 623 1-Jan-1980 00:00:00 docProps/core.xml
-rw-rw-rw- 41 1-Jan-1980 00:00:00 customXml/item2.xml
- ---------- -------- ----------- -------- -----------------------------------
21513 24 files
Workbook
The usual entry point is the xl/workbook.xml
file which contains a
mapping between the sheet names and the corresponding file. In our
minimal example the exercise is pointless since we only have one sheet,
but let’s see have a look anyway:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" ... >
...
<sheets>
<sheet name="Sheet1" sheetId="1" r:id="rId1"/>
</sheets>
...
</workbook>
For the sake of brevity, most of the content has been replaced by dots.
So to extract this information we use XPath
, but because this file
is usually rather small any other method would fit the bill.
ns = {
'ns': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main',
}
fh = zipfile.ZipFile(file_path)
name = 'xl/workbook.xml'
root = etree.parse(fh.open(name))
workbook = {}
for el in etree.XPath("//ns:sheet", namespaces=ns)(root):
workbook[el.attrib['name']] = el.attrib['sheetId']
So after running the above code, workbook
is a dictionary containing
a mapping between a sheet name and the corresponding id. In our case:
{'Sheet1': '1'}
.
Shared strings
Similarly, we have to also extract data out of
xl/sharedStrings.xml
. It contains all the strings that will appear
in the sheets. So when a cell contains a string, Excel actually store
an id that refers to the position of the corresponding string in the
sharedStrings.xml
file. This is the full content for our example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="2" uniqueCount="2">
<si><t>Ham</t></si>
<si><t>Spam</t></si>
</sst>
So parsing this xml is also done with XPath:
name = 'xl/sharedStrings.xml'
root = etree.parse(fh.open(name))
res = etree.XPath("/ns:sst/ns:si/ns:t", namespaces=ns)(root)
shared = {
str(pos): el.text
for pos, el in enumerate(res)
}
This shared string structure is one of the indirections (but not the
only one) you have to solve when parsing XLSX files. There is no easy
way to extract information as-is, hence the raw representation has to
be post-processed, that is why Python suffers when parsing XLSX.
La pièce de résistance
Now that we have collected the workbook and shared string data, we can
move our attention to the sheets themselves. This is where performance
matters because those are the parts that growth when sheets content
gets larger.
For our example, xl/worksheets/sheet1.xml
looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" ... >
<dimension ref="A1:B4"/>
<sheetViews>
<sheetView tabSelected="1" workbookViewId="0">
<selection activeCell="C4" sqref="C4"/>
</sheetView>
</sheetViews>
<sheetFormatPr defaultRowHeight="15" x14ac:dyDescent="0.25"/>
<sheetData>
<row r="1" spans="1:2" x14ac:dyDescent="0.25">
<c r="A1" t="s">
<v>0</v>
</c>
<c r="B1" t="s">
<v>1</v>
</c>
</row>
<row r="2" spans="1:2" x14ac:dyDescent="0.25">
<c r="A2">
<v>1</v>
</c>
<c r="B2">
<v>2</v>
</c>
</row>
<row r="3" spans="1:2" x14ac:dyDescent="0.25">
<c r="A3">
<v>3</v>
</c>
<c r="B3">
<v>4</v>
</c>
</row>
<row r="4" spans="1:2" x14ac:dyDescent="0.25">
<c r="A4">
<v>5</v>
</c>
<c r="B4">
<v>6</v>
</c>
</row>
</sheetData>
</worksheet>
To be able to extract sheet content we have to collect cells values,
their types and their positions.
- A cell is of type string when there is
t="s"
on the cell element (c
). - The cell position is given by the
r
attribute (ex:r="B3"
). We
have to rely on this attribute because Excel sometimes likes to
shuffle thoses. - The cell value is in the
v
element (which can be missing if the
actual cell is empty)
So you can see that cell A1 contains the shared string 0
which is
“Ham”. B2 is 1
and “Spam”.
Parsing this info is not doable with XPath, even in several passes (as
we would loose relations between different bits of info). Another way
to do that would be to use a DOM parser and then walk across the
structure to gather what we want, it works but it tend to be slow and
memory heavy on large files.
The ugly duck saves the day
XSLT is an XML to transform XML into XML (and it’s Turing-complete!), so
nobody wants to deal with it. But:
- It’s available in lxml and it’s fast, at least faster than a
pure-python equivalent. - It can spit out any text content (not only XML).
The strategy here is to use XSLT to generate a CSV (it’s fast) and
parse it again with pandas’ read_csv
(it’s fast too). As long as
pandas does not provide a read_xml
method, it’s the best way to plug
those two together.
This is how it looks:
sheet_xslt = etree.XML('''
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:sp="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
>
<xsl:output method="text"/>
<xsl:template match="sp:row">
<xsl:for-each select="sp:c">
<xsl:value-of select="parent::*/@r"/> <!-- ROW -->
<xsl:text>,</xsl:text>
<xsl:value-of select="@r"/> <!--CELL-->
<xsl:text>,</xsl:text>
<xsl:value-of select="@t"/> <!-- TYPE -->
<xsl:text>,</xsl:text>
<xsl:value-of select="sp:v/text()"/> <!-- VALUE -->
<xsl:text>\n</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
''')
It’s ugly and it took a lot of trial and error to get right. But
the most difficult part was to figure out which combination of
namespaces would appease the gods of XML.
What happens here:
- The line
<xsl:output method="text"/>
ask the transformer to output
us a simple test file. <xsl:template match="sp:row">
will select allrow
elements.<xsl:for-each select="sp:c"> ... </xsl:for-each>
is a loop onc
elements.<xsl:text>something</xsl:text>
will outputsomething
<xsl:value-of select="@t"/>
extract thet
attribute of the
currentc
tag.
The result of this transformation is a csv that contains one line per
excel cell and whose columns are ‘row’, ‘cell’, ‘type’ and ‘value’.
Once we have the correct XSLT, it’s straightforward to plug the output
of the transformation to read_csv
:
sheet_name = 'Sheet1'
sheet_id = workbook[sheet_name]
sheet_path = 'xl/worksheets/sheet%s.xml' % sheet_id
root = etree.parse(fh.open(sheet_path))
transform = etree.XSLT(sheet_xslt)
result = transform(root)
df = read_csv(BuffIO(str(result)),
header=None, dtype=str,
names=['row', 'cell', 'type', 'value'],
)
So at this point in the code, you have a DataFrame which reflects the
xml content of the sheet, something like this:
(Pdb) print(df)
row cell type value
0 1 A1 s 0
1 1 B1 s 1
2 2 A2 NaN 1
3 2 B2 NaN 2
4 3 A3 NaN 3
5 3 B3 NaN 4
6 4 A4 NaN 5
7 4 B4 NaN 6
The next steps are:
# First row numbers are sometimes filled with nan
df['row'] = to_numeric(df['row'].fillna(0))
# Translate string contents
cond = (df.type == 's') & (~df.value.isnull())
df.loc[cond, 'value'] = df[cond]['value'].map(shared)
# Add column label ('AA99' becomes 'AA')
df['col'] = df.cell.str.replace(r'[0-9]+', '')
df = df.sort_values(by='row')
# Pivot everything
df = df.pivot(
index='row', columns='col', values='value'
).reset_index(drop=True).reset_index(drop=True)
df.columns.name = None # pivot adds a name to the "columns" array
# Sort columns (pivot will put AA before B)
cols = sorted(df.columns, key=lambda x: (len(x), x))
df = df[cols]
df = df.dropna(how='all') # Ignore empty lines
df = df.dropna(how='all', axis=1) # Ignore empty cols
As you can see, there is no for-loop in Python, every low-level
operation is done through vectorized pandas method. This is the
resulting DataFrame (yay!):
A B
0 Ham Spam
1 1 2
2 3 4
3 5 6
Benchmark & Closing words
The XSLT method is twice as fast as pandas’ read_excel
. The
benchmarks were run on a large sheet of 537 lines and 341 columns:
$ python fast_xlsx_parsing.py
xslt: 3.311
pandas: 6.248
Incidentally, googling for "xslt" "pandas" "csv"
returns this
unanswered StackOverflow question with the same approach,
as usual it’s easier to find a solution when you already know the
answer. So I’m not the first to combine XSLT
and read_csv
, but I
probably am in the context of XLSX file parsing.
Full code
import io
import zipfile
from lxml import etree
from pandas import read_csv, to_numeric
class XLSX:
sheet_xslt = etree.XML('''
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:sp="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
>
<xsl:output method="text"/>
<xsl:template match="sp:row">
<xsl:for-each select="sp:c">
<xsl:value-of select="parent::*/@r"/> <!-- ROW -->
<xsl:text>,</xsl:text>
<xsl:value-of select="@r"/> <!--REMOVEME-->
<xsl:text>,</xsl:text>
<xsl:value-of select="@t"/> <!-- TYPE -->
<xsl:text>,</xsl:text>
<xsl:value-of select="sp:v/text()"/> <!-- VALUE -->
<xsl:text>\n</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
''')
def __init__(self, file_path):
self.ns = {
'ns': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main',
}
self.fh = zipfile.ZipFile(file_path)
self.shared = self.load_shared()
self.workbook = self.load_workbook()
def load_workbook(self):
# Load workbook
name = 'xl/workbook.xml'
root = etree.parse(self.fh.open(name))
res = {}
for el in etree.XPath("//ns:sheet", namespaces=self.ns)(root):
res[el.attrib['name']] = el.attrib['sheetId']
return res
def load_shared(self):
# Load shared strings
name = 'xl/sharedStrings.xml'
root = etree.parse(self.fh.open(name))
res = etree.XPath("/ns:sst/ns:si/ns:t", namespaces=self.ns)(root)
return {
str(pos): el.text
for pos, el in enumerate(res)
}
def _parse_sheet(self, root):
transform = etree.XSLT(self.sheet_xslt)
result = transform(root)
df = read_csv(io.StringIO(str(result)),
header=None, dtype=str,
names=['row', 'cell', 'type', 'value'],
)
return df
def read(self, sheet_name):
sheet_id = self.workbook[sheet_name]
sheet_path = 'xl/worksheets/sheet%s.xml' % sheet_id
root = etree.parse(self.fh.open(sheet_path))
df = self._parse_sheet(root)
# First row numbers are filled with nan
df['row'] = to_numeric(df['row'].fillna(0))
# Translate string contents
cond = (df.type == 's') & (~df.value.isnull())
df.loc[cond, 'value'] = df[cond]['value'].map(self.shared)
# Add column number and sort rows
df['col'] = df.cell.str.replace(r'[0-9]+', '')
df = df.sort_values(by='row')
# Pivot everything
df = df.pivot(
index='row', columns='col', values='value'
).reset_index(drop=True).reset_index(drop=True)
df.columns.name = None # pivot adds a name to the "columns" array
# Sort columns (pivot will put AA before B)
cols = sorted(df.columns, key=lambda x: (len(x), x))
df = df[cols]
df = df.dropna(how='all') # Ignore empty lines
df = df.dropna(how='all', axis=1) # Ignore empty cols
return df
if __name__ == '__main__':
xlsx = XLSX('example.xlsx')
xlsx.read('Sheet1')