http://zorba.io/modules/csv

Description

Before using any of the functions below please remember to import the module namespace:

import module namespace csv = "http://zorba.io/modules/csv";

Function library providing converters from CSV/TXT to XML and back. The functions are optimized to work with large amounts of data, in a streaming way.

Module code

Here is the actual XQuery module code.

Imported Schemas

http://zorba.io/modules/csv-options

Please note that the schemas are not automatically imported in the modules that import this module.

In order to import and use the schemas, please add:

import schema namespace csv-options =  "http://zorba.io/modules/csv-options";

Imported modules

http://zorba.io/modules/schema

Authors

Daniel Turcanu

Version Declaration

xquery version "3.0" encoding "utf-8";

Namespaces

csv	http://zorba.io/modules/csv
csv-options	http://zorba.io/modules/csv-options
schemaOptions	http://zorba.io/modules/schema
ver	http://zorba.io/options/versioning

Function Summary

parse($csv as xs:string, $options as element(csv-options:options)?) as element() Parse a CSV or fixed size text and convert to XML.
parse-internal($csv as xs:string, $options as element(csv-options:options,csv-options:optionsType)?) as element() external
serialize($xml as element(), $options as element(csv-options:options)?) as xs:string Convert XML into CSV or fixed size text.
serialize-internal($xml as element(), $options as element(csv-options:options,csv-options:optionsType)?) as xs:string external

Functions

parse#2

declare function csv:parse(
    $csv as xs:string,
    $options as element(csv-options:options)?
) as element(*)*

Parse a CSV or fixed size text and convert to XML.
By default each line is converted to a <row> element, and each field to a <column> element inside <row>.
The format of the param $options is:

    <csv-options:options>
        <csv  [separator="default comma ,"] ?
          [quote-char="default double quotes &quote;"]?
          [quote-escape="default double double quotes &quote;&quote;"]? />
        or
        <column-widths>
          <column-width>[column fixed width, unsigned int]<column-width>*
        </column-widths>
        or
        <column-positions>
          <column-position>[column position on line, unsigned int]<column-position>*
        </column-positions>
        <first-row-is-header [line="first_line[-last_line]?"]?/>?
        <start-from-row line="first_line[-last_line]?"/>?
        <add-last-void-columns/>?
        <xml-nodes>
          [<row-name>
            [<column-name/>]?
          </row-name>]?
        </xml-nodes>?
    </csv-options:options>

All the parameters are optional and can appear in any order.
All the parameters are case sensitive. The namespace used is "http://zorba.io/modules/csv-options".
All strings must have UTF-8 encoding.
Parameters csv, column-widths, column-positions are mutually exclusive. If none is specified, the input string is assumed to be csv.
Description of parameters:

csv

Specifies the parameters for parsing a csv string.

separator: The character or group of characters used to separating fields in a row. If it is not specified, it defaults to comma ','.
quote-char: The character or group of characters used for quoting the fields that may contain special characters, like separator, new line or this quote char. The default value is double quote ".
quote-escape: The group of characters used for escaping the quote char inside a field. The whole quote escape group is translated to a quote char during parsing. The default value is double double quotes "".

column-widths

Specifies the column widths for fixed size text. It contains multiple column-width child elements specifying the fixed width of each column, from left to right.
If the line has more fields than specified, they are ignored.

column-positions

This is an alternative to column-widths, and specifies instead the starting position of each column. Column positions are 1 based, and are specified in order from left to right. The last column is read until end of line. The first column position can be greater than 1, if you want to parse only a part of the input text.

first-row-is-header

The presence of this element indicates that the first row is to be treated as the name of the columns. If it is not present, then each field is enclosed in a <column> element, or how it is specified in <xml-nodes> parameter.
If the first row is the header, then each field is enclosed in an element with the corresponding name from the header.
For example, the csv:

        ID,Name,Occupation
        1,John,student

is parsed into:

        <row>
        <ID>1</ID>
        <Name>John</Name>
        <Occupation>student</Occupation>
        </row>

If the header names contain characters that cannot be used in a QName, they are replaced with underscore '_'.
The namespace for the header QNames is taken from the column name specified in xml-nodes parameter, or from the row name, or if that doesn't exist either then empty namespace is used.
If the header is not the first line in the input string, the starting line can be specified in the line attribute.
If a column does not have a name, a new name is constructed in the form columnN where N is the position of the column, starting from 1.
Subheaders
If the header consists of more than one line, this can be specified in the line attribute in the form "first_line - last_line". Having more lines as the header translates into a hierarchy of elements in the xml.
For example, the csv:

        ID,Name,,Occupation
        ,First Name,Last Name,
        1,John,Howard,student

is parsed into:

        <row>
        <ID>1</ID>
        <Name>
          <First_Name>John</First_Name>
          <Last_Name>Howard</Last_Name>
        </Name>
        <Occupation>student</Occupation>
        </row>

This element can have an attribute "accept-all-lines" with values "false" or "true" (default "false"). When set to true it tells the parser to not report lines that do not have the same number of items as the header. If set to false, the parser will raise a csv:WrongInput error for these lines.

start-from-row

If the data does not start from line 1 or immediately after the header, you can specify the starting line in the line attribute.
Also you can use this attribute in the form "first_line - last_line" to specify also the last line if you don't want the whole csv to be parsed.

add-last-void-columns

In the case when using headers and some data lines are shorter than the header, by default the excess columns are ignored for those lines. You can set the add-last-void-columns parameter to make all the columns appear in xml even if they are void.

xml-nodes

With this parameter you can specify the names for the row element and for the column element if there is no header.
The first element child of this element specifies the desired QName of the row element in the output xml. The name of this element will be used as the name of the row element.
The element child of this row element is the column element, and its name will be used as the name of the column elements that enclose the fields in the output xml if there is no header.
If the csv has a header, only the namespace is used from the column element.
For example, with parameter:

        <xml-nodes>
        <r>
          <c/>
        </r>
        </xml-nodes>

the output for each line will look like:

        <r>
          <c>field1</c>
          <c>field2</c>
          .......
        </r>

Parameters

$csv
the string containing the csv or fixed size text.
$options
this parameter is validated against "http://zorba.io/modules/csv-options" schema. If this parameter is not specified, the row name is by default "row" and the column name is by default "column".

Returns

element(*)*
a sequence of row elements, one for each line in csv

Errors

csv:CSV001
if the input string is streamable string and cannot be rewinded
csv:WrongInput
if the input string has lines with variable number of items, and the csv has headers and the options do not specify the ignore-foreign-input attribute
err:XQDY0027
if $options can not be validated against the csv-options schema
err:XQDY0084
if the options parameter doesn't have the name "csv-options:options".

Examples

parse-internal#2

declare %:private function csv:parse-internal(
    $csv as xs:string,
    $options as element(csv-options:options,csv-options:optionsType)?
) as element(*)* external

Returns

element(*)*

serialize#2

declare function csv:serialize(
    $xml as element(*)*,
    $options as element(csv-options:options)?
) as xs:string

Convert XML into CSV or fixed size text. Note: if you want to serialize out the result, make sure that the serializer method is set to "text". For example, in zorba command line, you have to set the param --serialize-text. When using the

file:write(...)

function, you have to set the method serialization parameter to "text":

 <output:serialization-parameters<
   <output:method value="text"/<
 </output:serialization-parameters<

The

$options

parameter must have the following format:

    <csv-options:options>

        <csv  [separator="default comma ,"] ? 

          [quote-char="default double quotes &quote;"]? 

          [quote-escape="default double double quotes &quote;&quote;"]? /> 

        

        or

        <column-widths [align="left|right"]?>

          <column-width [align="left|right"]?>[column fixed width, unsigned int]<column-width>*

        </column-widths>

        

        or

        <column-positions [align="left|right"]?>

          <column-position [align="left|right"]?>[column position on line, unsigned int]<column-position>*

        </column-positions>

        

        <first-row-is-header/>?

    </csv-options:options>

All the parameters are optional and can appear in any order.
All the parameters are case sensitive. The namespace used is "http://zorba.io/modules/csv-options".
All strings must have UTF-8 encoding.
Parameters csv, column-widths, column-positions are mutually exclusive. If none is specified, the xml is converted to csv. Description of parameters:

csv

Specifies the parameters for converting to csv.

separator: The character or group of characters used to separating fields in a row. If it is not specified, it defaults to comma ','.
quote-char: The character or group of characters used for quoting the fields that may contain special characters, like separator, new line or this quote char. The default value is double quote ".
quote-escape: The group of characters used for escaping the quote char inside a field. The whole quote escape group is translated to a quote char during parsing. The default value is double double quotes "".

column-widths

Specifies the column widths for fixed size text. It contains multiple column-width child elements specifying the fixed width of each column, from left to right.
With the attribute align you can specify how to align fields that are smaller than the column width. The default alignment is left.

column-positions

This is an alternative to column-widths, and specifies instead the starting position of each column. Column positions are 1 based, and are specified in order from left to right. The last column has a variable length.
With the attribute align you can specify how to align fields that are smaller than the column width. The default alignment is left. The last column does not need alignment.

first-row-is-header

The presence of this element indicates that the first row will contain the header, that is, the names of the column elements. Only the column names from the first row element are taken into account.
For example, the row xml:
<row>
<ID>1</ID>
<Name>John</Name>
<Occupation>student</Occupation>
</row>

is converted to
ID,Name,Occupation
1,John,student

The header names are the localnames of the column elements, and the namespace is ignored.
Subheaders
If the row-column hierarchy is more complex, then subheaders are also generated on subsequent lines. The number of subheaders depends on the depth of the column hierarchy.
When generating the subheaders, the non-whitespace text nodes are also taken into account, and a separate column is generated for them too.
For example, the xml row element:
<row>
<ID>1</ID>
<Name>
Mr.
<First_Name>John</First_Name>
<Last_Name>Howard</Last_Name>
</Name>
<Occupation>student</Occupation>
</row>
is converted to
ID,Name,,Occupation
,,First Name,Last Name,
1,Mr.,John,Howard,student

If first-row-is-header is not specified and the columns have a deeper hierarchy, only the first layer of columns is processed, and the fields are the string values of each column.
This element can have an attribute "ignore-foreign-input" with values "false" or "true" (default "false"). When set to true it tells the serializer to ignore elements that to not match the header names. If set to false, the serializer will raise a csv:ForeignInput error for these elements.

Parameters

$xml
a sequence of elements, each element representing a row. The name of each row element is ignored. The childs of each row are the column fields.
$options
The options parameter. See the function description for details. This parameter is validated against "http://zorba.io/modules/csv-options" schema.

Returns

xs:string
the csv or fixed size text as string containing all the lines

Errors

csv:CSV003
if the serialize output is streamable string and cannot be reset
csv:ForeignInput
if there are input elements in subsequent rows that do not match the headers, and the options specify first-row-is-header and do not specify the ignore-foreign-input attribute
err:XQDY0027
if $options can not be validated against csv-options schema
err:XQDY0084
if the options parameter doesn't have the name "csv-options:options".

Examples

serialize-internal#2

declare %:private function csv:serialize-internal(
    $xml as element(*)*,
    $options as element(csv-options:options,csv-options:optionsType)?
) as xs:string external

Returns

xs:string