600字范文,内容丰富有趣,生活中的好帮手!
600字范文 > Apache POI 用eventmodel 读取大文件Excel 【项目已开源】

Apache POI 用eventmodel 读取大文件Excel 【项目已开源】

时间:2023-06-21 19:23:37

相关推荐

Apache POI 用eventmodel 读取大文件Excel 【项目已开源】

文章目录

代码介绍1,初识XLSX格式前言Excel 之 Xlsx格式详解了解一下 XLSX, (Office Open XML, Spreadsheet ML)看看它的结构:_relsdocProps[Content_Types].xml 2,Xlsx格式详解Package StructureContent TypesRelationshipsParts Specific to SpreadsheetML DocumentsParts Shared by Other OOXML Documents Spreadsheet Content OverviewThe Grid(主要用到这些)TablesPivot TablespivotCacheDefinitionpivotCacheRecordspivotTable Spreadsheet Styles(也会用到一部分)Text-Level FormattingCell-Level FormattingTable-Level FormattingConditional Formatting 3,Xlsx格式内容补充3.2.9.2.1 Shared Formulas3.2.9 Cell3.2.9.1 Cell Values 4,代码简单讲解前提条件第一步,参考一下官方文档的示例代码第二步,改吧改吧变成自己的代码

代码介绍

此代码来源于我开发的一个小工具jar包,项目名称为ExcelToDB,顾名思义,从各种文件系统读取Excel提取数据插入到各种数据库。

特点:轻量级jar包(小jar包灵活性强,而不是kettle那么笨重),配置好后一键运行,可搭载到Job调度中一键调用,甚至可以把它集成到你们公司的管理系统后端🤭。

内存消耗小,速度快,灵活性强,编写代码自动化配置既可实现一键导入,否则需要简单的手工配置。

读取几十万行的.xlsx文件,并将数据转换为标准SQL并执行的一个小jar包,打包后17MB左右,优化后可以更小,可配置日志记录内容,可写到多种目标数据库,从多种文件系统读取Excel,只需要简单的XML配置即可。

转换并执行的速度:

实际测试中,8172行,8MB,每行60个字段的Excel里的真实业务数据,8700ms即可导入到数据库,初版未优化,预计优化后速度可以更快。

/airdest/ExcelToPostgreDB

认准v1.0.4版本哦!项目已开源,拿走不谢哈哈哈哈哈哈

下面就是研发过程需要的知识了,进行了详细整理(●ˇ∀ˇ●),不想浪费时间再造一遍轮子的,可以去我项目里把写好的代码拿来改吧改吧,就是你的了嘿嘿嘿嘿。

1,初识XLSX格式

前言

最近被要求用Java读取大文件Xlsx(超过10MB的Excel文档),并导入数据库,百度搜到了POI。

一开始对POI了解不够深入,于是轻松地用POI的usermodel开发出了第一版,然鹅,usermodel对于内存的要求太高,只要文件超过5MB就开始GC overhead limit exceeded了。

(而且读取6MB的xlsx文件居然能占用3个G的内存??)

于是我决定使用eventmodel模式,然鹅,由于excel里带有公式的缘故,需要开发的代码量非常大,而且在百度很难找到使用event model的POI的案例,百度,谷歌找了一整天都没发现合适的案例,最终决定自己从底层开始,一点一点啃掉这个知识点。顺便整理个文档,输入输出,福曼学习法

也不是没考虑过市面上其它的工具,比如:

easyexcelhutool操作poiexcel-streaming-readergridexceljxl

然鹅这些工具都不能同时满足以下几个特性:

低内存消耗(至少不能导致GC overhead limit exceeded,或者OOM)自动处理复杂公式(最难受的是把/12/20,ABC/BCS-DS/BUC这样的日期字符串读取为公式,总之现有的类库对公式的支持都不够多)读取超大文件(增量数据20MB以上,全量数据将近500MB)

也许easyexcel可以做到,然鹅黑盒子不敢用呀,而且easyexcel抽象级别过高,不够灵活,没有文档,学习成本过高的同时无法满足项目需求。

Excel 之 Xlsx格式详解

以下内容部分来自于以下链接,并对其内容做了部分修改。

https://www.loc.gov/preservation/digital/formats/fdd/fdd000398.shtml

了解一下 XLSX, (Office Open XML, Spreadsheet ML)

Office Open XML简称 OOXML,是xlsx格式的标准,它在ECMA-376中制定,

下面这个链接是ECMA-376的文档下载地址。

http://www.ecma-/publications/standards/Ecma-376.htm

还有关于OOXML,wiki上有详细的介绍

/wiki/Office_Open_XML

The Open Office XML-based spreadsheet format using .xlsx as a file extension has been the default format produced for new documents by versions of Microsoft Excel since Excel .

The format was designed to be equivalent to the binary .xls format produced by earlier versions of Microsoft Excel (see MS-XLS).

For convenience, this format description uses XLSX to identify the corresponding format.

The primary content of a XLSX file is marked up in SpreadsheetML, which is specified in parts 1 and 4 of ISO/IEC 29500, Information technology – Document description and processing languages –Office Open XML File Formats (OOXML).

This description focuses on the specification in ISO/IEC 29500: and represents the format variant known as “Transitional.”

Although editions of ISO 29500 were published in , , , and , the specification has had very few changes other than clarifications and corrections to match actual usage in documents sinceSpreadsheetMLwas first standardized in ECMA-376, Part 1 in .

This description can be read as applying to all SpreadsheetML versions published by ECMA International and by ISO/IEC through .

See Notes below for more detail on the chronological versions and differences.

The XLSX format uses the SpreadsheetML markup language and schema to represent a spreadsheet “document.”

Conceptually, using the terminology of the Spreadsheet ML specification in ISO/IEC 29500-1, the document comprises one or more worksheets in a workbook.

A worksheet typically consists of a rectangular grid of cells.

Each cell can contain a value or a formula, which will be used to calculate a value, with a cached value usually stored pending the next recalculation.

A single spreadsheet document may serve several purposes: as a container for data values;

as program code (based on the formulas in cells) to perform analyses on those values; and as one or more formatted reports (including charts) of the analyses.

Beyond basics, spreadsheet applications have introduced support for more advanced features over time.

These include mechanisms to extract data dynamically from external sources, to support collaborative work, and to perform an increasing number of functions that would have required a database application in the past, such as sorting and filtering of entries in a table to display a temporary subset.

The markup specification must support both basic and more advanced functionalities in a structure that supports the robust performance expected by users.

看看它的结构:

An XLSX file is packaged using the Open Packaging Conventions (OPC/OOXML_, itself based on ZIP_6_2_0).

The package can be explored, by opening with ZIP software, typically by changing the file extension to.zip.

The top level of a minimal package will typically have three folders (_rels,docProps, andxl) and one file part ([Content_Types].xml).

Thexlfolder holds the primary content of the document including the file partworkbook.xmland aworksheetsfolder containing a file for each worksheet, as well as other files and folders that support functionality (such as controlling calculation order) and presentation (such as formatting styles for cells) for the spreadsheet.

Any embedded graphics are also stored in thexlfolder as additional parts.

The other folders and parts at the top level of the package support efficient navigation and manipulation of the package:

_rels

_relsis a Relationships folder, containing a single file.rels(which may be hidden from file listings, depending on operating system and settings).

It lists and links to the key parts in the package, usingURIsto identify the type of relationship of each key part to the package.

In particular it specifies a relationship to the primary officeDocument (typically named/xl/workbook.xml) and typically to parts withindocPropsas core and extended properties.

docProps

docPropsis a folder that contains properties for the document as a whole, typically including a set of core properties, a set of extended or application-specific properties, and a thumbnail preview for the document.

[Content_Types].xml

[Content_Types].xmlis a file part, a mandatory part in anyOPC package, that lists the content types (usingMIME Internet Media Typesas defined in RFC 6838) for parts within the package.

The standards documents that specify this format run toover six thousand pages.

Useful introductions to the XLSX format can be found at:

Anatomy of a SpreadsheetML File by Daniel Dick of Reuters.

Structure of a SpreadsheetML document from Open XML SDK documentation. Includes diagram showing typical spreadsheet document parts.

下一篇:Apache POI 用eventmodel 读取大文件Excel (2) Xlsx格式内容详解

2,Xlsx格式详解

此节部分内容来自于:

/anatomyofOOXML-xlsx.php

/SScontentOverview.php

/SSstyles.php

Package Structure

A SpreadsheetML or.xlsxfile is a zip file (a package) containing a number of “parts” (typically UTF-8 or UTF-16 encoded) or XML files.

The package may also contain other media files such as images.

The structure is organized according to the Open Packaging Conventions as outlined in Part 2 of the OOXML standard ECMA-376.

You can look at the file structure and the files that comprise a SpreadsheetML file by simply unzipping the.xlsxfile.

The number and types of parts will vary based on what is in the spreadsheet, but there will always be a[Content_Types].xml, one or more relationship parts, a workbook part , and at least one worksheet.

The core data of the spreadsheet is contained within the worksheet part(s), discussed in more detail at xsxl Content Overview.

Content Types

Every package must have a[Content_Types].xml, found at the root of the package.

This file contains a list of all of the content types of the parts in the package.

Every part and its type must be listed in[Content_Types].xml.

The following is a content type for the main content part:

<Override PartName="/xl/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>

It’s important to keep this in mind when adding new parts to the package.

Relationships

Every package contains a relationships part that defines the relationships between the other parts and to resources outside of the package.

This separates the relationships from content and makes it easy to change relationships without changing the sources that reference targets.

For anOOXMLpackage, there is always a relationships part (.rels) within the_relsfolder that identifies the starting parts of the package, or the package relationships.

For example, the following defines the identity of the start part for the content:

<Relationship Id="rId1" Type="/officeDocument//relationships/officeDocument" Target="xl/workbook.xml"/>.

There are also typically relationships within.relsforapp.xmlandcore.xml.

In addition to the relationships part for the package, each part that is the source of one or more relationships will have its own relationships part.

Each such relationship part is found within a_relssub-folder of the part and is named by appending'.rels'to the name of the part.

Typically the main content part (workbook.xml) has its own relationships part (workbook.xml.rels).

It will contain relationships to the other parts of the content, such assheet1.xml,sharedStrings.xml,styles.xml,theme1.xml, as well as theURIsfor external links.

A relationship can be either explicit or implicit.

For an explicit relationship, a resource is referenced using the Id attribute of a<Relationship>element.

That is, the Id in the source maps directly to an Id of a relationship item, with an explicit reference to the target.

For example, a worksheet might contain a hyperlink such as this:

<w:hyperlink ref="A11" r:id="rId4">

The r:id=“rId4” references the following relationship within the relationships part for the worksheet (worksheet1.xml.rels).

<Relationship Id="rId4" Type="http://. . ./hyperlink" Target="/" TargetMode="External"/>

For an implicit relationship, there is no such direct reference to a<Relationship>Id.

Instead, the reference is understood.

Parts Specific to SpreadsheetML Documents

Below is a list of the possible parts of a SpreadsheetML package that are specific to SpreadsheetML spreadsheets.

Keep in mind that a spreadsheet may only have a few of these parts.

For example, if a spreadsheet has no pivot table, then a pivot table part will not be included in the package.

Parts Shared by Other OOXML Documents

There are a number of part types that may appear in any OOXML package.

Below are some of the more relevant parts for SpreadsheetML documents.

Spreadsheet Content Overview

A SpreadsheetML document is a package containing a number of different parts, mostly XML files.

However, most of the actual content is found within one or more worksheet parts (one for each worksheet), and one sharedStrings part.

For Microsoft Excel, the content is found within an xl folder, and the worksheets are within a worksheet sub-folder.

The workbook part contains no actual content but merely some properties of the spreadsheet, with references to the separate worksheet parts which contain the data.

<workbook . . .>. . .<workbookPr . . ./><sheets><sheet name="sheet1" r:id="rId1"><sheet name="sheet2" r:id="rId2"><sheet name="sheet3" r:id="rId3"></sheets>. . .</workbook>

A worksheet can be either a grid, a chart, or a dialog sheet.

The Grid(主要用到这些)

A grid of cells (or a “cell table”) is the most common type or worksheet.

Cells can contain text, booleans, numbers, dates, and formulas.

It is important to understand from the outset that most text values are not stored within a worksheept part.

In an effort to minimize duplication of values, a cell value that is a string is stored separately in the shareStrings part.(There is an exception to this generalization, however. A cell can be of type inlineStr, in which case the string is stored in the cell itself, within an is element.)

All other cell values–booleans, numbers, dates, and formulas (as well as the values of formulas) are stored within the cell.

Some properties for the sheet are at the beginning of the root<worksheet>element.

The number and sizes of the columns of the grid are defined within a<cols>.

And then the core data of the worksheet follows within the<sheetData>element.

The sheet data is divided into rows (<row>), and within each row are cells (<c>).

Rowsare numbered or indexed, beginning with 1, with therattribute (e.g., rowr="1").

Each cell in the row also has a reference attribute which combines the row number with the column to make the reference attribute (e.g.,<c r="D3">).

If a cell within a row has no content, then the cell is omitted from the row definition.

<worksheet . . .>. . .<cols><col min="1" max="1" width="26.140625" customWidth="1"/>. . .</cols><sheetData><row r="1"><c r="A1" s="1" t="s"><v>0</v>. . .</c></row>. . .</sheetData>. . .<mergeCells count="1"><mergeCell ref="B12:J16"/></mergeCells><pageMargins . . ./><pageSetup . . ./><tableParts ccount="1"><tableParts count="1"></tablePart r:id="rId2"/></worksheet>

The make-up of a cell is important in understanding the overall architecture of the spreadsheet content.

Each cell specifies its type with thetattribute.

Possible values include:

bfor booleandfor dateefor errorinlineStrfor an inline string (i.e., not stored in the shared strings part, but directly in the cell)nfor numbersfor shared string (so stored in the shared strings part and not in the cell)strfor a formula (a string representing the formula)

When a cell is a number, then the value is stored in the<v>element as a child of<c>(the cell element).

<c r="B2" s="5" t="n"><v>400</v></c>

A date is the same, though the date is stored as a value in theISO 8601format.

For inline strings, the value is within an<is>element.

But of course the actual text is further nested within atsince the text can be formatted.

<c r="C4" s="2" t="inlineStr"><is><t>my string</t></is></c>

For a formula, the formula itself is stored within anfelement as a child element of<c>.

Following the formula is the actual calculated value within a<v>element.

<c r="B9" s="3" t="str"><f>SUM(B2:B8)</f><v>2105</v></c>

When the data type of the cell is s for shared string, then the string is stored in the shared strings part.

However, the cell still contains a value within a<v>element, and that value is the index (zero-based) of the stored string in the shared strings part.

So, for example, in the example below, the actual string is the 9th occurrence of the<si>element within the shared strings part.

<c r="C1" s="4" t="s"><v>8</v></c>

The shared string part may look like this:

(感觉shared string应该是类似于常量的东西)

<sst xmls="/spreadsheetml//main" count="19" uniqueCount="13"><si><t>Expenses</t></si><si><t>Amount</t></si><si><t>Food</t></si><si><t>Totals</t></si><si><t>Entertainment</t></si><si><t>Car Payment</t></si><si><t>Rent</t></si><si><t>Utilities</t></si><si><t>Insurance</t></si><si><t>Date Paid</t></si>. . .</sst>

Tables

Data on a worksheet can be organized into tables.

Tables help provide structure and formatting to the data by having clearly labeled columns, rows, and data regions.

Rows and columns can be added easily, and filter and sort abilities are automatically added with the drop down arrows.

The actual table data for the cells is usually stored in the worksheet part as any other data, but the definition of the table is stored in a separate table part which is referenced from the worksheet in which the table appears.

<worksheet . . .>. . .<sheetData>. . .</sheetData>. . .<tableParts count="1"><tableParts count="1"></tablePart r:id="rId2"/></worksheet>

Within the rels part for the worksheet is the following:

<Relationship Id="rId2" Type="/officeDocument//relationships/table" Target=".. /tables/table1.xml"/>

The table part is shown below.

The content of the table part is below.

<table xmlns="/spreadsheetml//main" id="1" name="Table1" displayName="Table1" ref="A18:C22" totalRowShown="0"><autoFilter ref="A18:C22"/><tableColumns count="3">tableColumn id="1" name="Expenses"tableColumn id="2" name="Amount"tableColumn id="3" name="Date Paid"</tableColumns><tableStyleInfo name="TableStyleMedium9" showFirstColumn="0" showLastColumn="0" showRowStripes="1" showColumnStripes="0"/></table>

The ref attribute in red above defines the range of cells within the worksheet that comprise the table.

Pivot Tables

Pivot tables are used to aggregate data, and to summarize and display it in an understandable layout.

For example, suppose I have a large spreasheet which captures the sales of four products in four cities.

I may have a column for the product, date, quantity sold, city, and state.

Each day has an entry for each product in each city, or 16 entries per day.

So even with only 4 products in 4 cities, I could have 5840 rows of data for a year.

What if wanted to determine what city had the most sales in the spring months?

What product was improving?

What city had the greatest sales of red widgets?

Pivot tables help to summarize the data and quickly provide the answers to these questions.

Pivot tables have a row axis, a column axis, a values area, and a report filter area.

Each table also has a field list from which users can select which fields to include in the pivot table.

Below is a pivot table that summarizes the sales and revenue by product.

A pivot table is comprised of the following components.

There is the underlying data that the pivot table summarizes. This data may be on the same worksheet as the pivot table, on a different worksheet, or it may be from an external source.A cache or copy of that data is created in a part called the pivotCacheRecords part; a cache is needed when, e.g., the external data source is unavailable.There is a pivotCacheDefinition part that defines each field in the pivot table and contains shared items, much like the sharedStrings part contains strings to remove redundancy in a worksheet.The pivotTable part defines the layout of the pivot table itself, specifying what fields are on the row axix, the column axix, the report filter, and the values area.

The workbook points to and owns the pivotCacheDefinition part.

There is the reference in the workbook to the cache of data for the pivot table, following the references to the worksheets:

<pivotCaches><pivotCache cacheId="13" r:id="rId4"/></pivotCaches>

The rels part for the workbook contains that reference:

<Relationship Id="rId4" Type="/officeDocument//relationships/pivotCacheDefinition" Target="pivotCache/pivotCacheDefinition1.xml"/>

The pivotCacheDefinition part in turn points to the pivotCacheRecords part.

<pivotCacheDefinition xmlns="/spreadsheetml//main" xmlns:r="/spreadsheetml//relationships" r:id="rId1" refreshBy="XXXX" refreshedDate="41059.666109143516" createdVersion="1" refreshedVersion="3" recordCount="32" upgradeOnRefresh="1">. . .</pivotCacheDefinition>

The rels part for the pivotCacheDefinition contains that reference:

<Relationship Id="rId1" Type="/officeDocument//relationships/pivotCacheRecord" Target="pivotCacheRecords1.xml"/>

The pivotCacheDefinition part also references the source data in its

<cacheSource> element:<cacheSource type="worksheet"><worksheetSource ref="A1:F33" sheet="Sheet1"/></cacheSource>

The worksheet that contains the pivot table references the pivotTable part.

(There may be more than one, since a worksheet can have more than one pivot table.)

The rels part for the worksheet contains that reference:

<Relationship Id="rId1" Type="/officeDocument//relationships/pivotTable" Target="../pivotTables/pivotTable1.xml"/>

The pivotTable part references the pivotCacheDefinitions part.

The rels part for the pivotTable part contains that reference:

<Relationship Id="rId1" Type="/officeDocument//relationships/pivotCacheDefinition" Target="../pivotCache/pivotCacheDefinition1.xml"/>

pivotCacheDefinition

Now let’s look briefly at these parts and try to make sense out of them.

Let’s begin with the pivotCacheDefinition.

As mentioned above, it specifies the location of the source data.

It also defines each field (such as data type and formatting to be used) in the source data, including those not used in the pivot table.

(What fields are actually used is specified in the pivot table part.) And it is used as a cache for shared strings, just as the SharedStrings part is used to store strings that appear in worksheets.

The definition of the six fields in our example worksheet is below.

<pivotCacheDefinition . . .><cacheSource type="worksheet">>worksheetSource ref="A1:F33" sheet="Sheet1"/></cacheSource><cacheFields count="6"><cacheField name="Product" numFmtId="0"><sharedItems count="4"><s v="Green Widget"/><s v="Red Widget"/><s v="Grey Widget"/><s v="Blue Widget"/></sharedItems></cacheField><cacheField name="Quantity Sold" numFmtId="0"><sharedItems containsSemiMixedTypes="0" containsString="0" containsNumber="1" containsInteger="1" minValue="1" maxValue="9"/></cacheField><cacheField name="Date" numFmtId="14"><sharedItems containsSemiMixedTypes="0" containsNoDate="0" containsDate="1" containsString="0" minDate="-03-04T00:00:00" maxDate="-03-06T00:00:00 count=2"><d v="-03-04T00:00:00"/><d v="-03-05T00:00:00"/></sharedItems></cacheField><cacheField name="Revenue" numFmtId="165"><sharedItems containsSemiMixedTypes="0" containsString="0" containsNumber="1" containsInteger="1" minValue="1" maxValue="9"/></cacheField><cacheField name="City" numFmtId="0"><sharedItems count="4"><s v="Rochester"/><s v="Albany"/><s v="Pittsburgh"/><s v="Philadelphia"/></sharedItems></cacheField><cacheField name="State" numFmtId="0"><sharedItems count=2"><s v="NY"/><s v="PA"/></sharedItems></cacheField></cacheFields></pivotCacheDefinition>

The first field defined above is the product field.

It consists of shared string values.

If the field does not have shared string values (such as the second field defined above–the Quantity Sold field), then the values are stored directly in the pivotCacheRecords part.

pivotCacheRecords

Let’s look at the pivotCacheRecords part to see how the field definitions relate to the cached data.

Below are the first two rows of data in the cache.

<pivotCacheRecords . . .><r><x v="0"/><n v="2"/><x v="0"/><n v="2"/><x v="0"/><x v="0"/></r><r><x v="1"/><n v="3"/><x v="0"/><n v="3"/><x v="0"/><x v="0"/></r>. . .</pivotCacheRecords>

This corresponds to the data from the worksheet shown below.

Note first that each record (<r>) of the cached data has the same number of values as are defined in the pivotCacheDefinition–in our case, six.

Within each record are the following possible elements:

<x>- indicating the index value referencing an item for the field as defined in the pivotCacheDefinition

<s>- indicating a string value is being expressed inline in the record

<n>- indicating a numeric value is being expressed inline in the record

Looking at the two sample records from the pivotCacheRecords above, we know from the pivotCacheDefinition that the six values are product, quantity, date, revenue, city, and state in that order.

The Product field in the first record is<x v="0"/>, so the value (0) is an index into the items listed in the product field.

The first one listed (index 0) is Green Widget.

The second or quantity field value is<n v="2"/>, so the value (2) is a numeric value expressed inline.

The third or date field value is<x v="0"/>, so the value (0) is an index into the items listed in the date field(-03-04T00:00:00 or 3/4). Etc.

pivotTable

Now let’s look at the pivotTable part.

The root element is the<pivotTableDefinition>element.

There are several components within this.

First, the location of the pivot table on the worksheet is specified.

The location is straightforward.

Note that both the first header and data columns are specified.

<pivotTableDefinition . . .><location ref="B37:G43 firstHeaderRow="1" firstDataRow="2" firstDataCol="4"/></cacheSource>

The order of items for fields and other field information for each field is then specified by<pivotField>elements within a<pivotFields>.

<pivotFields count="6"><pivotField axis="axisRow" compact="0" outline="0" subtotalTop="0" showAll="0" includeNewItemsFilter="1"><items count="5"><item sd="0" x="3"/><item sd="0" x="0"/><item sd="0" x="2"/><item sd="0" x="1"/><item t="default"/></items></pivotField><pivotField dataField="1" compact="0" outline="0" subtotalTop="0" showAll="0" includeNewItemsFilter="1"/><pivotField axis="axisRow" compact="0" numFmtId="14" outline="0" subtotalTop="0" showAll="0" includeNewItemsFilter="1"><items count="3"><item sd="0" x="0"/><item sd="0" x="1"/><item t="default"/></items></pivotField>. . .</pivotFields count="6">

From the pivotCacheDefinition we know that the first<pivotField>above is the product. It has 5 items listed.

The first one is<item sd="0" x="3"/>.

Thesdattribute indicates whether the item is hidden.

A value of 0 means the item is not hidden.

Thexattribute is the index for the items in the<cacheField>for the product in the pivotCacheDefinition.

The<cacheField>is shown below.

Note that the value of the item at index 3 is Blue Widget, so Blue Widget should appear first in the pivot table if and where the product field is shown.

<cacheField name="Product" numFmtId="0"><sharedItems count="4"><s v="Green Widget"/><s v="Red Widget"/><s v="Grey Widget"/><s v="Blue Widget"/></sharedItems></cacheField>

The second item has an index of 0, or “Green Widget”, the third is 2 or “Grey Widget,” and the fourth is 1 or “Red Widget.”

Note that<item t="default"/>indicates a subtotal or total.

Following the<pivotFields>collection is the<rowFields>collection.

This collection specifies what fields are actually in the pivot table on the row axis, and in what order.

In our example, when we fully expand the first row, we see that a row consists of first a product, then a city, followed by a state, and then a date.

These are the row fields.

Following the index order in the<pivotFields>collection, this is 0, 4, 5, 2. The corresponding<rowFields>looks like this.

<rowFields count="4"><field x="0"/><field x="4"/><field x="5"/><field x="2"/></rowFields>

After the<rowFields>collection is the<rowItems>collection.

This is a collection of all the values in the row axis.

There is an<i>element for each row in the pivot table.

And for each<i>there are as many<x>elements as there are item values in the row.

Thevattribute is a zero-based index referencing a<pivotField>item value.

If there is novthen the value is assumed to be 0.

The value of grand fortindicates a grand total as the last row item value.

<rowItems count="5"><i><x/></i><i><x v="1"/></i><i><x v="2"/></i><i><x v="3"/></i><i t="grand"><x/></i></rowItems>

The<colFields>collection follows, indicating which fields are on the column axis of the pivot table.

Here again<x>is an index into the<pivotField>collection.

<colFields count="1"><field x="-2"/></colFields>

The<colItems>collection follows, listing all of the values on the column axis.

<colItems count="2"><i><x/></i><i i="1"><x v="1"/></i></colItems>

There may also be a<pageFields>collection which describes which fields are found in the report filter area.

Finally, there is a<dataFields>collection, which describes what fields are found in the values area of the pivot table.

In our example, there are two fields in the values area – sum of quantity sold and sum of revenue.

Below is the collection.

Thefldattribute is the index of the field being summarized.

<dataFields count="2"><dataField name="Sum of Quantity Sold" fld="1" baseField="0" baseItem="0"/><dataField name="Sum of Revenue" fld="3" baseField="0" baseItem="0"/></dataFields>

Spreadsheet Styles(也会用到一部分)

Spreadsheets can be styled using styles, themes, and direct formatting.

There are cell styles, table styles, and pivot styles.

However, unlike in WordprocessingML, styling XML never appears with the content in a worksheet.

The formatting is always stored separately within a single styles part for the workbook.

There is also a single theme part for the entire workbook.

A cell style can specify number format, cell alignment, font information, cell borders, and background/foreground fills.

Table styles specify formatting for regions of a table, such as, e.g., headers are bold or a gray fill should be applied to alternating rows.

Pivot table styles specify formatting for regions of a pivot table, such as colors for totals or for the row axis.

Themese define a set of colors, font information, and effects on shapes.

A style or formatting element can define a color, font, or effect by referencing a theme, but of course that format may change if the thme is changed.

Text-Level Formatting

Before getting to the styles applied to a worksheet, however, let’s first cover formatting at the text level, that is, not formatting applied to the entire cell, but formatting that might change from word to word, such as different colors or effects.

For example, see cell A13 below, with blue color for the first word and orange underline for the second.

Obviously this cannot be accomplished with a cell style.

This formatting is done within the shared string part where the text of the cell is stored.

Let’s look at the XML for the first cell in row 13 of the worksheet part.

We know from the type attribute for the cellt="s"that the text is stored in the shared strings part, and from<v="25"/>we know that string is the 26th string or . (Remember that it is a zero-based index.)

<row r="13"><c r="A13" t="s"><v>25</v></c>. . .</row>

The XML for the string is below.

Note that the formatting is applied directly within the string item, just as direct formatting is applied to text runs (<r>) using run properties (<rPr>) within wordprocessingML (docx) documents.

<si><r><rPr><sz val="11"/><color theme="4"/><rFont val="Calibri"/><family val="2"/><scheme val="minor"/></rPr><t>Blue</t></r><r><rPr><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor"/></rPr><t xml:space="preserve"> </t></r><r><rPr><u/><sz val="11"/><color theme="9"/><rFont val="Calibri"/><family val="2"/><scheme val="minor"/></rPr><t>Widget</t></r></si>

Cell-Level Formatting

Now let’s return to cell styles.

Styles within spreadsheetML are implemented to minimize repetition, and this is done with collections.

Within the styles part there are the collections shown below.

<stylesheet xmls="/spreadsheetml//main"><numFmts/><fonts/><fills/><borders/><cellStyleXfs/><cellXfs/><cellStyles/><dxfs/><tableStyles/></stylesheet>

Most of the collections above (except for<dxfs>and<tableStyles>) relate to cells.

And the first four–numFmts, fonts, fills, and borders–contain all of the possible charateristics for every cell in the workbook.

Each may have many elements, each one defining the characteristics for a set of cells that have the same such characteristics.

For example, below is a sample of the<fills>for a workbook.

Every cell in the workbook will use one of these fill definitions.

<fills count="5"><fill><patternFill patternType="none"/></fill><fill><patternFill patternType="gray125"/></fill><fill><patternFill patternType="solid"><fgColor rgb="FFFFEB9C"/></patternFill/></fill><fill><patternFill patternType="solid"><fgColor theme="5" tint="0.39997558519241921"/><bgColor indexed="65"/></patternFill/></fill><fill><patternFill patternType="solid"><fgColor rgb="FFC6EFCE"/></patternFill/></fill></fills>

The<fill>for a particular cell is specified with a zero-based index into the above fills collection.

The same is true of the font for the cell, the number format, and the borders.

So the formatting for a cell can be specified with a list or collection of indices into these four collections.

And in fact, that is what the<cellXfs>is.

It contains a collection of groups of indices, one group for every combination of cell formatting characteristics found in the workbook.

Below is one such grouping.

<cellXfs count="14"><xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/>. . .</cellXfs>

Every cell will have a reference to one<xf>in the<cellXfs>collection.

This is direct formatting for the cell.

To apply a style to the cell, the<xf>references the style using thexfIdattribute.

ThexfIdattribute is an index into the<cellStyleXFs>collection, which collects the cell styles available to the user.

The<cellStyleXFs>contains one<xf>for each style.

Each such<xf>is tied to its name via an index (in itsxfIdattribute) from the<cellStyles>collection.

Let’s try and tie it all together by looking at a sample.

Consider row 10 in the sample below.

The first cell A10 has a cell style applied.

The XML for the cell in the worksheet is below.

<row r="10"><c r="A10" s="12" t="s"><v>6</v></c>. . .</row>

From the attributes="12"we know that the cell’s formatting is stored at the 13th (zero-based index)<xf>within the<cellXfs>collection in the styles part.

The 13th<xf>is below.

<xf numFmtId="0" fontId="8" fillId="4" borderId="0" xfId="3"/>

So for this cell, the number format is the first (index value is 0) within the<numFmts>collection.

The cell uses the font format found within the 9th<font>in the<fonts>collection, the 5th<fill>within the<fills>collection (which references a theme for the green), and the first<border>within the<borders>collection.

This cell also applies a style (xfId="3")–the 4th<xf>within the<cellStyleXfs>collection.

The style is shown below.

<xf numFmtId="0" fontId="8" fillId="4" borderId="0" applyNumberFormat="0" applyBorder="0" applyAlignment="0" applyProtection="0"/>

The formatting of the style is same as the direct formatting, and the attributes applyNumberFormat, applyBorder, applyAlignment, and applyProtection, each with values of 0, tell us not to apply the corresponding values of the style but instead apply the values for the direct formatting.

In this case they are the same, so there is no difference anyway.

Table-Level Formatting

A table applies a table style by specifying a<tableStyleInfo>element within the table definition in the tables part.

For example, the following sample table definition specifies the TableStyleMedium9 style.

Note that it is specified by name.

Note also that not only is the style specified, but the specification also tells us which aspects of the style are turned on (e.g., showRowStripes=“1”) and which are turned off (e.g., showLastColumn=“0”).

Each table style is made up of a collection of formatting definitions, each of which corresponds to a particular region of the table–e.g., whole table, first column stripe, first row stripe, first column, header column, first header cell, etc.

Each of these formatting definitions can be turned on or off.

<table xmlns="/spreadsheetml//main" id="1" name="Table1" displayName="Table1" ref="A18:C22" totalRowShown="0"><autoFilter ref="A18:C22"/><tableColumns count="3">tableColumn id="1" name="Expenses"tableColumn id="2" name="Amount"tableColumn id="3" name="Date Paid"</tableColumns><tableStyleInfo name="TableStyleMedium9" showFirstColumn="0" showLastColumn="0" showRowStripes="1" showColumnStripes="0"/></table>

Annex G of the ECMA-376, 3rd Edition (June, ) OOXMLspecification defines built-in styles for cells, tables, and pivot tables, and style TableStyleMedium9 is among the built-in table styles.

The built-in table and pivot table styles are not stored in the styles part–only custom styles are.

Below is a custom style defined in the styles part, based on theTableStyleMedium9style.

<tableStyles count="1" defaultTableStyle="TableStyleMedium9" defaultPivotStyle="PivotStyleLight16"><tableStyle name="My Custom Table Style" pivot="0" count="3"><tableStyleElement type="wholeTable" dxfId="2"><tableStyleElement type="headerRow" dxfId="1"><tableStyleElement type="firstColumn" dxfId="0"></tableStyle></tableStyles>

The style looks like this:

The style definition above uses differential formatting records (<dxf>elements referenced from the dxfId attribute), which enables subsets of formatting to be specified instead of specifying all formatting.

Looking at the sample above, we see that the default style is theTableStyleMedium9style.

From that we are altering three aspects – the wholeTable, headerRow, and firstColumn.

Each of these elements references (again using a zero-based index) the<dxfs>collection within the styles part.

For example, the<headerRow>element references the second<dxf>(dxfId=“1”).

It applies bold and a fill background color to the default table styledefaultTableStyle="TableStyleMedium9".

<dxfs count="3"><dxf><font><b val="0"/><i/><strike/></font><fill><patternFill><bgColor theme="2" tint="-0.2499465926081701"/></patternFill></fill></dxf><dxf><font><b/><i val="0"/><strike val="0"/></font><fill><patternFill><bgColor theme="8" tint="0.59996337778862885"/></patternFill></fill></dxf><dxf><fill><patternFill><bgColor theme="5" tint="0.59996337778862885"/></patternFill></fill><border><left style="hair"><color auto="1"/></left><right style="hair"><color auto="1"/></right><top style="hair"><color auto="1"/></top><bottom style="hair"><color auto="1"/></bottom><vertical style="hair"><color auto="1"/></vertical><horizontal style="hair"><color auto="1"/></horizontal></border></dxf></dxfs>

Conditional Formatting

Conditional formatting is a format such as cell shading or font color that a spreadsheet can apply automatically to cells if a specified condition is true.

For example, you can specify that a cell fill color should be red if the value in the cell is above 50.

It can be a very effective tool for visually highlighting important aspects of the data in a worksheet.

Conditional formatting rules are stored in the worksheet part, within a<conditionalFormatting>element after the<sheetData>element.

The range of cells to which the formatting applies is specified with thesqrefattribute.

Each condition is within a<cfRule>element.

Multiple rules can be set, each with a different priority.

There are several different types of rules to specify different conditions.

For example,type="cellIs"will determine a cell format based on whether a cell value is greater than or less than a specified value, or between two values.

Atype="dataBar"will display a bar of varying length within a cell based on the value in the cell.

Theses types are set with the type attribute on<cfRule>. Atype="iconSet"will display an icon in the cell based upon the value in a cell.

Below is a sample table which applies two conditions to the cells B2:B7 - one which applies a pink color fill if the value of the cell is greater than 500 and the other which applies a green fill if the value is less than 300.

The XML for the conditions is below.

<conditionalFormatting sqref="B2:B7"><cfRule type="cellIs" dxfId="0" priority="2" operator="greaterThan"><formula>500</formula></cfRule><cfRule type="cellIs" dxfId="1" priority="1" operator="lessThan"><formula>300</formula></cfRule></conditionalFormatting>

一口气看到这里,说明你对Xlsx的格式已经有了基本的了解,加油,后面还有呐!

3,Xlsx格式内容补充

在上一节中,已经了解了OOXML大部分的构成,然鹅在实际操作中,遇到了t="shared"这样的共享公式,是上一篇中没有出现过的内容,stackOverflow上查到是共享公式shared formula但是更详细的内容就没了。

在百度,谷歌,stackOverflow上搜到的描述,讲解OOXML的信息很少,导致最后不得不追根溯源,去wiki上看了一下,发现OOXML是ECMA-376制定的。

.xlsx格式出现的时间是,且它遵守了这个标准,那这个标准诞生的时间必须在以及之前,经过查询发现ECMA-376 1st 第一版就是发布的,所以把它下载下来,看看内容。

经过查询发现,关于.xlsx的XML的cell里的字段定义是在SpreadsheetML里定义的,以下是ECMA原文内容:

8.4 SpreadsheetML

This subclause introduces the overall form of a SpreadsheetML package, and identifies some of its main element types.

(See Part 3 for a more detailed introduction.) A SpreadsheetML package

has a relationship of type officeDocument, which specifies the

location of the main part in the package. For a SpreadsheetML

document, that part contains the workbook definition.

这里它说详细的内容在第三部分,于是我找到第三部分,发现果然是有的:

经过查找,补充一部分需要使用的内容,这些都来自于ECMA-376 1st part3

3.2.9.2.1 Shared Formulas

<row r="7" spans="4:8"><c r="H7" s="1"><f t="shared" ref="H7:H11" ce="1" si="0">SUM(E7:G7)</f><v>1.0246225028914113</v></c></row><row r="8" spans="4:8"><c r="H8" s="1"><f t="shared" ce="1" si="0">SUM(E8:G8)</f><v>0.9063376048733931</v></c></row>

Just as strings in cells can be extremely pervasive and redundant in a sheet (and therefore must be optimized), formulas are also extremely pervasive in a sheet, and often can be optimized.

Consider the table in the above example, where column H contains a formula that sums the numbers in columns E through G, for each row.

The only difference between the formulas in H6:H12 is that the reference increases by 1 row from one row to the next.

Therefore, an optimization is created where only the formula in H6 needs to be written out, with some additional information indicating how far to propagate the formula once loaded.

This enables the loading application to load and parse only the first of the shared formulas, and then more quickly apply the necessary transforms to produce the additional related formulas in subsequent cells.

Note that while formulas can be shared, it is desirable to enable easy access to the contents of a cell.

Therefore, it is allowed that all formulas may be written out, but only the primary formula in a shared formula need be loaded and parsed.

3.2.9 Cell

<c r="B3"><f>B2+1</f><v>2</v></c>

The cell itself is expressed by theccollection.

Each cell indicates it’s location in the grid using A1-style reference notation.

A cell can also indicate a style identifier (attributes) and a data type (attributet).

The cell types include string, number, and Boolean.

In order to optimize load/save operations, default data values are not written out.

3.2.9.1 Cell Values

Cells contain values, whether the values were directly typed in (e.g., cell A2 in our example has the value External Link:) or are the result of a calculation (e.g., cell B3 in our example has the formula B2+1).

String values in a cell are not stored in the cell table unless they are the result of a calculation.

Therefore, instead of seeing External Link: as the content of the cell’svnode, instead you see a zero-based index into the shared string table where that string is stored uniquely.

This is done to optimize load/save performance and to reduce duplication of information.

To determine whether the 0 invis a number or an index to a string, the cell’s data type must be examined.

When the data type indicates string, then it is an index and not a numeric value.

在ECMA part4中,终于找到了关于公式的定义

还有普通类型的定义

当前就补充到这里,如果后续有需要再返回来补充

4,代码简单讲解

前提条件

使用场景:

1,大文件Excel,比如100MB,几十万行

2,仅限于.xlsx格式的Excel

3,要求低内存消耗

<dependency><groupId>org.apache.poi</groupId><artifactId>poi-ooxml</artifactId><version>4.1.2</version></dependency>

第一步,参考一下官方文档的示例代码

相信在看到这篇博客之前,你已经了解了POI官网对于Event API的代码示例:

/components/spreadsheet/how-to.html#xssf_sax_api

import java.io.InputStream;import java.util.Iterator;import org.apache.poi.util.XMLHelper;import org.apache.poi.openxml4j.opc.OPCPackage;import org.apache.poi.xssf.eventusermodel.XSSFReader;import org.apache.poi.xssf.model.SharedStringsTable;import org.xml.sax.Attributes;import org.xml.sax.ContentHandler;import org.xml.sax.InputSource;import org.xml.sax.SAXException;import org.xml.sax.XMLReader;import org.xml.sax.helpers.DefaultHandler;import javax.xml.parsers.ParserConfigurationException;public class ExampleEventUserModel {public void processOneSheet(String filename) throws Exception {OPCPackage pkg = OPCPackage.open(filename);XSSFReader r = new XSSFReader( pkg );SharedStringsTable sst = r.getSharedStringsTable();XMLReader parser = fetchSheetParser(sst);// To look up the Sheet Name / Sheet Order / rID,// you need to process the core Workbook stream.// Normally it's of the form rId# or rSheet#InputStream sheet2 = r.getSheet("rId2");InputSource sheetSource = new InputSource(sheet2);parser.parse(sheetSource);sheet2.close();}public void processAllSheets(String filename) throws Exception {OPCPackage pkg = OPCPackage.open(filename);XSSFReader r = new XSSFReader( pkg );SharedStringsTable sst = r.getSharedStringsTable();XMLReader parser = fetchSheetParser(sst);Iterator<InputStream> sheets = r.getSheetsData();while(sheets.hasNext()) {System.out.println("Processing new sheet:\n");InputStream sheet = sheets.next();InputSource sheetSource = new InputSource(sheet);parser.parse(sheetSource);sheet.close();System.out.println("");}}public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException, ParserConfigurationException {XMLReader parser = XMLHelper.newXMLReader();ContentHandler handler = new SheetHandler(sst);parser.setContentHandler(handler);return parser;}/*** See org.xml.sax.helpers.DefaultHandler javadocs*/private static class SheetHandler extends DefaultHandler {private SharedStringsTable sst;private String lastContents;private boolean nextIsString;private SheetHandler(SharedStringsTable sst) {this.sst = sst;}public void startElement(String uri, String localName, String name,Attributes attributes) throws SAXException {// c => cellif(name.equals("c")) {// Print the cell referenceSystem.out.print(attributes.getValue("r") + " - ");// Figure out if the value is an index in the SSTString cellType = attributes.getValue("t");if(cellType != null && cellType.equals("s")) {nextIsString = true;} else {nextIsString = false;}}// Clear contents cachelastContents = "";}public void endElement(String uri, String localName, String name)throws SAXException {// Process the last contents as required.// Do now, as characters() may be called more than onceif(nextIsString) {int idx = Integer.parseInt(lastContents);lastContents = sst.getItemAt(idx).getString();nextIsString = false;}// v => contents of a cell// Output after we've seen the string contentsif(name.equals("v")) {System.out.println(lastContents);}}public void characters(char[] ch, int start, int length) {lastContents += new String(ch, start, length);}}public static void main(String[] args) throws Exception {ExampleEventUserModel example = new ExampleEventUserModel();example.processOneSheet(args[0]);example.processAllSheets(args[0]);}}

第二步,改吧改吧变成自己的代码

如下代码处理每一个Element的数据,并将其拼接成为SQL

很抱歉,暂时为了保密,只能展示部分代码,希望这些代码能对你有所帮助。

SheetHandler

package com.airde.handler;/*** Author : Airde* Date: /12/7 14:36*/import com.airde.dto.Constants;import com.airde.dto.DBInfo;import com.airde.pojo.Column;import com.airde.pojo.JDBC;import com.airde.util.JdbcUtil;import com.airde.util.SqlUtil;import lombok.extern.slf4j.Slf4j;import org.apache.poi.xssf.model.SharedStringsTable;import org.xml.sax.Attributes;import org.xml.sax.SAXException;import org.xml.sax.helpers.DefaultHandler;import java.util.*;/*** @ClassName XlsxHandler* @Description 采用SAX方式依次处理每一个element* 使用map限定元素数量,避免VALUES长度不一致的问题。* @Author airde* @Date /12/7 14:36* @Version 1.0**/@Slf4jpublic class SheetHandler extends DefaultHandler {//共享区private SharedStringsTable sharedStringsTable;private String nowCellColumnNo;//内容区private String lastContents;private boolean nextIsSharedString;//拼接SQLprivate StringBuilder insertSql;private String nowRowNum;//通用属性区private JDBC jdbc;private DBInfo dbInfo;private String insertSqlHead;//计数器private int valuesNum;private int commitNum;private int valuesNumCount = 0;private int commitNumCount = 0;private int dataStartNum;private int dataStopNum;private int skipNumCount = 0;//拼接区private List<String> columnNoList = new ArrayList<>();private Map<String, String> valuesMap = new HashMap<>();//对ColumnList的每一个Column,都产生对应的值//在每一个columnNo都生成对应的值之后,检查每一对值是否为空//如果出现为空的值,那就填补为空字符串,最后再进行拼接(endEle遇到row进行整理拼接)public void initValuesMap() {for (String columnNo : columnNoList) {valuesMap.put(columnNo, "");}}@Overridepublic void startDocument() throws SAXException {super.startDocument();//初始化this.insertSqlHead = SqlUtil.generateInsertSqlHead(dbInfo);insertSql = new StringBuilder("");insertSql.append(insertSqlHead);initValuesMap();JdbcUtil.openFullConnection(jdbc);}@Overridepublic void endDocument() throws SAXException {//防止没到指定次数的剩余记录未提交,只要不为Null或空字符串,就执行提交if (insertSql.indexOf(",") != -1) {if (insertSql != null && !"".equals(insertSql.toString())) {//最后一次拼接与执行insertSql.delete(insertSql.lastIndexOf(","), insertSql.length());insertSql.append(";");//执行try {JdbcUtil.executeByFullConnection(insertSql.toString());} catch (Exception e) {e.printStackTrace();log.error("execute sql error!" + nowRowNum);}}}//用完连接就关上mitFullConnection();JdbcUtil.closeFullConnection();super.endDocument();}public SheetHandler(SharedStringsTable sst, DBInfo dbInfo, JDBC jdbc, List<Column> columnList, String dataStartNum, String dataStopNum) {//默认一条insert 包含50个记录this.valuesNum = 50;//默认每1000个insert commit一次mitNum = 1000;//默认起始数据行this.dataStartNum = Integer.parseInt(dataStartNum) - 1;this.dataStopNum = Integer.parseInt(dataStopNum) - 2;for (Column column : columnList) {columnNoList.add(column.getNo());}this.dbInfo = dbInfo;this.jdbc = jdbc;this.sharedStringsTable = sst;}/*** 遇到一个元素之前,进行什么动作** @param uri XML命名空间标识符* @param localName 不带前缀的元素名* @param qName当前Element的元素名* @param attributes Cell里对应的属性列表* @throws SAXException SAX解析异常*/@Overridepublic void startElement(String uri, String localName, String qName,Attributes attributes) throws SAXException {if ("row".equals(qName)) {nowRowNum = attributes.getValue("r");//新的row对应新的一条插入语句} else if ("c".equals(qName)) {// c => cellString cellType = attributes.getValue("t");nowCellColumnNo = attributes.getValue("r");//去掉结尾的数字nowCellColumnNo = nowCellColumnNo.substring(0, nowCellColumnNo.lastIndexOf(nowRowNum));//下一个元素是否为sharedString?nextIsSharedString = "s".equals(cellType);}// Clear contents cachelastContents = "";}/*** 遇到一个元素之后,进行什么动作** @param uri XML命名空间标识符* @param localName 不带前缀的元素名* @param qName当前Element的元素名* @throws SAXException SAX解析异常*/@Overridepublic void endElement(String uri, String localName, String qName) throws SAXException {if ("row".equals(qName)) {//如果当前已积累skipNumCount不等于skipNum,说明当前读取的行还不是数据行,那就不拼接if (skipNumCount < dataStartNum || skipNumCount >= dataStopNum) {//log.error("skip this row!");//每个row完了,就+1skipNumCount += 1;}else {//map形式拼接StringBuilder singleMapSql = new StringBuilder("");singleMapSql.append("(");for (String columnNo : columnNoList) {if ("-1".equals(columnNo)){singleMapSql.append("'',");}else {singleMapSql.append("'");singleMapSql.append(valuesMap.get(columnNo));singleMapSql.append("',");}}//删除多出来的逗号,闭合singleMapSql.delete(singleMapSql.lastIndexOf(","), singleMapSql.length());singleMapSql.append("),");//每处理完一行,就要初始化mapinitValuesMap();//拼接完毕,就把它摁到InsertSql里insertSql.append(singleMapSql);//每一行结束,都得把值+1commitNumCount += 1;valuesNumCount += 1;}//如果达到了对应执行的阈值,就进行执行处理if (valuesNumCount == valuesNum) {insertSql.delete(insertSql.lastIndexOf(","), insertSql.length());insertSql.append(";");//执行try {//log.info(insertSql.toString());JdbcUtil.executeByFullConnection(insertSql.toString());} catch (Exception e) {e.printStackTrace();log.error("execute sql error" + nowRowNum);}//执行完初始化insertSqlinsertSql = new StringBuilder("");insertSql.append(insertSqlHead);//初始化valuesNumCountvaluesNumCount = 0;}//如果达到了对应commit的阈值,就进行commit处理if (commitNumCount == commitNum) {mitFullConnection();//执行完初始化commitNumCountcommitNumCount = 0;}} else {// TODO 这里否则之后默认为cell的情况?//处理SharedString的情况,可能也可以换成//if(Constants.CELL_TYPE_SHAREDSTRING.equals(nowCellType)){//}if (nextIsSharedString) {int idx = Integer.parseInt(lastContents);lastContents = sharedStringsTable.getItemAt(idx).getString();nextIsSharedString = false;}//如果nowCellNo不在columnNoList里,则不拼接这个属性,否则才拼接这个属性boolean ifInColumnNoList = false;for (String columnNo : columnNoList) {if (nowCellColumnNo != null && nowCellColumnNo.equals(columnNo)) {ifInColumnNoList = true;break;}}// v => contents of a cell// Output after we've seen the string contents// 拼接SQLif ("v".equals(qName)) {if (ifInColumnNoList) {valuesMap.put(nowCellColumnNo, lastContents);//同时检查put进去的是否为Null,不为null就行为null就替换为""空字符串if (valuesMap.get(nowCellColumnNo) == null) {valuesMap.put(nowCellColumnNo, "");}}}}}@Overridepublic void characters(char[] ch, int start, int length) {lastContents += new String(ch, start, length);}}

SheetHandlerWrapper

package com.airde.handler;/*** Author : Airde* Date: /12/9 9:52*/import com.airde.pojo.Task;import com.airde.util.SqlUtil;import lombok.extern.slf4j.Slf4j;import org.apache.poi.openxml4j.opc.OPCPackage;import org.apache.poi.xssf.eventusermodel.XSSFReader;import org.apache.poi.xssf.model.SharedStringsTable;import org.xml.sax.ContentHandler;import org.xml.sax.InputSource;import org.xml.sax.SAXException;import org.xml.sax.XMLReader;import org.xml.sax.helpers.XMLReaderFactory;import javax.xml.parsers.ParserConfigurationException;import java.io.InputStream;/*** @ClassName SheetHandlerWrapper* @Description 用于触发sheethandler的执行* @Author airde* @Date /12/9 9:52* @Version 1.0**/@Slf4jpublic class SheetHandlerWrapper {public static void generateInsertSqlAndExecute(Task task){try {if ("false".equals(task.getIsFtp() )){OPCPackage opcPackage = OPCPackage.open(task.getSrcPath());XSSFReader xssfReader = new XSSFReader( opcPackage );SharedStringsTable sharedStringsTable = xssfReader.getSharedStringsTable();XMLReader xmlReader = getXmlReader(sharedStringsTable, task);//虽然现在必须使用rId1,rId2这种形式,但是可以通过workbookdata根据sheetname查找sheetid,InputStream sheet = xssfReader.getSheet("rId"+task.getSheet().getSheetId());InputSource sheetSource = new InputSource(sheet);//智能产生插入SQL并执行xmlReader.parse(sheetSource);sheet.close();}else {log.error("This is a ftp srcPath ,this version can't handle it, skip this Task");return;}}catch (Exception e){e.printStackTrace();log.error("GenerateInsertSqlAndExecute ERROR!");}}public static XMLReader getXmlReader(SharedStringsTable sst,Task task) throws SAXException, ParserConfigurationException {try {XMLReader xmlReader = XMLReaderFactory.createXMLReader();ContentHandler handler = new SheetHandler(sst,SqlUtil.getDbInfoFromTask(task),task.getJdbc(),task.getColumnList(),task.getSheet().getDataLine(),task.getSheet().getDataEndLine());xmlReader.setContentHandler(handler);return xmlReader;}catch (Exception e){e.printStackTrace();log.error("The function getXmlReader Error,it's weired to find this problem.");}return null;}}

你居然有耐心看完!相信你一定收获颇丰!一键三连!拜托了!(●’◡’●)

本内容不代表本网观点和政治立场,如有侵犯你的权益请联系我们处理。
网友评论
网友评论仅供其表达个人看法,并不表明网站立场。