Michael Rys
Principal Program Manager, Big Data @ Microsoft
@MikeDoesBigData, {mrys, usql}@microsoft.com
U-SQL Reading & Writing Files
EXTRACT Expression
@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv(encoding: Encoding.Unicode);
• Built-in Extractors: Csv, Tsv, Text with lots of options
• Custom Extractors: e.g., JSON, XML, etc.
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in Outputters: Csv, Tsv, Text
• Custom Outputters: e.g., JSON, XML, etc.
Filepath URIs
• Relative URI to default ADL Storage account: "filepath/file.csv"
• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account.blob.core.windows.net/filepath/file.csv"
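As a rough sketch of how the relative and absolute path forms combine in one script: the account name `account` and both file paths below are placeholders carried over from the slide, not real resources.

```usql
// Read from the default ADL Storage account using a relative path.
@s =
    EXTRACT a string,
            b int
    FROM "filepath/file.csv"
    USING Extractors.Csv();

// Write to an explicitly named ADLS account using an absolute URI.
// "account" and "copy.csv" are hypothetical placeholders.
OUTPUT @s
TO "adl://account.azuredatalakestore.net/filepath/copy.csv"
USING Outputters.Csv();
```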
U-SQL Reading & Writing Files (SQLBits 2016)
Built-In Extractors and Outputters
• Extractors.Csv(), Extractors.Tsv(), Extractors.Text()
• Outputters.Csv(), Outputters.Tsv(), Outputters.Text()
Parallel Execution Extractors
• Every file is stored in Extents of about 250MB
• One Extract Vertex gets 4 extract processes each working on one extent
• Today:
• Upload Data as row-oriented files
• Use CR/LF as row-delimiters
• This will align row boundaries to extent boundaries
• Otherwise: you can get data corruption or errors
Parallel Outputters
• Writes parallel extents
• Supports ORDER BY
• Stitching of extents to files
• Metadata operation for adl:// files
• Expensive copy operation for wasb:// files!!!
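The ORDER BY support mentioned above belongs to the OUTPUT statement itself. A minimal sketch, assuming the same hypothetical two-column file used earlier and a made-up output path:

```usql
@s =
    EXTRACT a string,
            b int
    FROM "filepath/file.csv"
    USING Extractors.Csv();

// ORDER BY sits between the TO and USING clauses of OUTPUT.
// Writing to an adl:// (or relative) path keeps the final stitching cheap;
// a wasb:// target would incur the copy noted above.
OUTPUT @s
TO "filepath/sorted.csv"
ORDER BY b DESC, a ASC
USING Outputters.Csv();
```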
Limits
• row size: 4MB
• String column: 128kB; byte[]: up to 4MB
• SQL.MAP, SQL.ARRAY not supported (transform needed)
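One possible form of the "transform needed" step for SQL.ARRAY is to flatten the array into a string before OUTPUT. This is only a sketch: the input file, the column names, and the `;` separator are assumptions for illustration.

```usql
@input =
    EXTRACT name string,
            tags string
    FROM "filepath/tags.csv"
    USING Extractors.Csv();

// Build a SQL.ARRAY column; the built-in outputters cannot write this type.
@withArray =
    SELECT name,
           new SQL.ARRAY<string>((tags ?? "").Split(';')) AS tagList
    FROM @input;

// Transform the array back into a plain string column before OUTPUT.
@flat =
    SELECT name,
           string.Join(";", tagList) AS tags
    FROM @withArray;

OUTPUT @flat
TO "filepath/tags_flat.csv"
USING Outputters.Csv();
```

Keep the string column limit above in mind: a flattened array still has to fit into a single string value.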
• delimiter: column delimiter (char; Text() only)
• encoding: file encoding (System.Text.Encoding)
• Encoding.[ASCII] (7-bit)
• Encoding.BigEndianUnicode
• Encoding.Unicode
• Encoding.UTF7
• Encoding.UTF8 (This is the default)
• Encoding.UTF32
• CAVEAT: No ANSI support yet!
• escapeCharacter: escaping of delimiters (including CR/LF)
• nullEscape: allows surrogate for null value
• quoting: quoted column using "
• Default is on
• Does NOT guard row delimiter!!! (use escapeCharacter)
• rowDelimiter: row delimiter
• Default: CR LF
• silent: allows skipping rows with an invalid number of columns
and replaces data type conversion errors with null (Extractors only)
• CAVEAT: Does not skip encoding errors
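The parameters above are passed as named arguments. The following sketch combines several of them; the pipe delimiter, the `#NULL#` surrogate, the file paths, and the column names are illustrative assumptions, not values from the deck.

```usql
@rows =
    EXTRACT id int?,
            name string,
            score double?
    FROM "filepath/data.txt"
    USING Extractors.Text(
              delimiter: '|',          // column delimiter
              encoding: Encoding.UTF8, // the default, spelled out here
              quoting: false,          // fields are not wrapped in "
              escapeCharacter: '\\',   // escapes delimiters, including CR/LF
              nullEscape: "#NULL#",    // surrogate that stands for null
              rowDelimiter: "\n",      // override the CR LF default
              silent: true);           // skip rows with a wrong column count

// The outputter accepts the same formatting options, except silent.
OUTPUT @rows
TO "filepath/data_clean.txt"
USING Outputters.Text(delimiter: '|', nullEscape: "#NULL#", quoting: false);
```

Nullable column types (`int?`, `double?`) are used here so that a conversion error under `silent: true` has a null to fall back to.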
E_RUNTIME_USER_EXTRACT_INVALID_CHARACTER Invalid character for UTF8 encoding in input stream.
Message: Invalid character for UTF8 encoding in input record at around line 0
Resolution: Correct the invalid character in the input file or correct encoding in extractor and try again.
Details: 0xFF 0xFE 0x31 0x0 0x9 0x0 0x4D 0x0
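The Details bytes hint at the cause: 0xFF 0xFE is the UTF-16 little-endian byte order mark, and the pairs that follow (0x31 0x0, 0x9 0x0, 0x4D 0x0) decode as "1", a tab, and "M". That pattern suggests a tab-separated UTF-16 file being read with the UTF-8 default. A sketch of the likely fix, with a hypothetical file path and column names:

```usql
// Declare the file as UTF-16 LE instead of relying on the UTF-8 default.
@rows =
    EXTRACT id int,
            code string
    FROM "filepath/file.tsv"
    USING Extractors.Tsv(encoding: Encoding.Unicode);

OUTPUT @rows
TO "filepath/file_utf8.tsv"
USING Outputters.Tsv();
```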
Simple pattern language on filename and path
DECLARE @pattern string =
"/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
• Binds two columns date and suffix
• Wildcards the filename
• Today: Limits on number of files (between 800 and 3000)
Virtual columns
@rs = EXTRACT
name string
, suffix string // virtual column
, date DateTime // virtual column
FROM @pattern
USING Extractors.Csv();
• Refer to virtual columns in query to get partition elimination
• Virtual columns must be referenced in the query for DateTime columns and
when no wildcard has been given
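To make the partition elimination point concrete, here is a sketch that filters on both virtual columns. The date range, the `csv` suffix, and the output path are assumptions; whether files are actually skipped depends on the predicate being something the compiler can evaluate up front.

```usql
DECLARE @pattern string =
    "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";

@data =
    EXTRACT name string,
            suffix string,   // virtual column bound from the file extension
            date DateTime    // virtual column bound from the folder names
    FROM @pattern
    USING Extractors.Csv();

// Predicates on the virtual columns restrict which files are read at all.
@january =
    SELECT name,
           suffix,
           date
    FROM @data
    WHERE date >= DateTime.Parse("2016-01-01")
          AND date < DateTime.Parse("2016-02-01")
          AND suffix == "csv";

OUTPUT @january
TO "/output/january.csv"
USING Outputters.Csv();
```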
OUTPUT
OUTPUT @rs TO "/output/file_{*}.csv" USING Outputters.Csv();
• One file per outputter invocation; * is replaced by a unique GUID
Additional Resources
Documentation
Built-in Extractors: https://msdn.microsoft.com/en-us/library/azure/mt621366.aspx
Built-in Outputters: https://msdn.microsoft.com/en-us/library/azure/mt621345.aspx
FileSet: https://msdn.microsoft.com/en-us/library/azure/mt621294.aspx
Sample Data
https://github.com/Azure/usql/blob/master/Examples/Samples/Data/AmbulanceData/Drivers.txt
Sample Project
https://github.com/Azure/usql/tree/master/Examples/Builtin-UDOs/
http://aka.ms/AzureDataLake


Editor's Notes

  1. Shows a simple EXTRACT and OUTPUT, then simple extensibility with string functions.
  2. https://github.com/Azure/usql/tree/master/Examples/Builtin-UDOs/
  3. Add file sets.