UTF8Filer v1.4

UTF8Filer is an easy-to-use class that reads and writes UTF-8 (mixed byte) files, whether raw text, delimited (CSV, tab, etc.) or fixed width data files.
It includes functions to convert to and from Unicode as well as some useful mixed byte text handling functions.
ASP (pre .NET) does not support mixed byte files or text streams, only pure single byte and Unicode, so we need to use this class to help us out.
Full Microsoft Excel delimited rules are also adhered to, such as quoted fields, delimiters and line breaks inside quoted fields, and doubled quotes (see the example below).

The VBScript function Split does not handle the CSV (Comma Separated Values) format correctly. There is more to CSV files than simply being comma separated or delimited. This class contains a function, SplitDelimiter, which works just like Split except that it also applies these extra (standard) rules.

If you are only using single byte (English) files, then use my TextFiler class instead. It is smaller and uses less memory with large files.

For example, take this record from a CSV file (generated by MS Excel or any other program). Its last field contains a line break, so the record spans two lines. See how the two functions interpret it differently:

Original lines from csv file

LNG,"Language Code",123,"Text, and Text","and ""this""","first line
second line of same field"

Split (fields shown separated by " | ")

LNG | "Language Code" | 123 | "Text |  and Text" | "and ""this""" | "first line

SplitDelimiter (fields shown separated by " | ")

LNG | Language Code | 123 | Text, and Text | and "this" | first line
second line of same field
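
The difference comes down to tracking whether the parser is currently inside a quoted field. The following is a minimal standalone sketch of that rule, not the class's own SplitDelimiter code. The name SplitQuoted is made up for illustration, and the sketch assumes a single character delimiter and a record string that already contains any embedded line breaks.

```vbscript
' Hypothetical helper, not part of UTF8Filer: splits one record on Delim,
' honouring double-quoted fields and doubled ("") quotes inside them.
Function SplitQuoted(Record, Delim)
    Dim fields(), count, i, ch, current, inQuotes
    ReDim fields(0)
    count = 0
    current = ""
    inQuotes = False
    i = 1
    Do While i <= Len(Record)
        ch = Mid(Record, i, 1)
        If inQuotes Then
            If ch = """" Then
                If Mid(Record, i + 1, 1) = """" Then
                    current = current & """"   ' "" inside quotes is a literal quote
                    i = i + 1
                Else
                    inQuotes = False           ' closing quote
                End If
            Else
                current = current & ch         ' delimiters and line breaks are kept
            End If
        ElseIf ch = """" Then
            inQuotes = True                    ' opening quote
        ElseIf ch = Delim Then
            ReDim Preserve fields(count + 1)   ' make room for the next field
            fields(count) = current
            count = count + 1
            current = ""
        Else
            current = current & ch
        End If
        i = i + 1
    Loop
    fields(count) = current                    ' last field has no trailing delimiter
    SplitQuoted = fields
End Function

' Usage: q holds a double quote to keep the literal readable
Dim q, parts
q = Chr(34)
parts = SplitQuoted("LNG," & q & "Text, and Text" & q & "," & q & "and " & q & q & "this" & q & q & q, ",")
' parts(0) = LNG, parts(1) = Text, and Text, parts(2) = and "this"
```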

See UTF8FilerDemo.asp for a demo of reading/writing text, or UTF8FilerDataDemo.asp for structured data.

Usage

A simplistic version of your ASP could look something like this:

<!-- #include file="UTF8Filer.asp" -->
<%

' Initialise the class
Dim MyUTF8File
Set MyUTF8File= New UTF8Filer

'Initialise the charset we are going to use
Session.CodePage = 932
MyUTF8File.UnicodeCharset = "shift_jis"

'Load the UTF-8 file (LoadFile is the method documented under Methods)
MyUTF8File.LoadFile("demo.htm")

'Convert it to Unicode so ASP functions can handle it
MyUTF8File.cTextBuffer2Unicode

' Do something with the file
Response.Write(MyUTF8File.TextBuffer)

'Clean up
set MyUTF8File = nothing

%>

Properties

ErrorText
String. Error description, set when a method returns False.

VirtualFileName
String. Contains the virtual path and file name.

AbsoluteFileName
String. Contains the physical path and file name.

Big5Space
Big5 (Traditional Chinese) space (pseudo-constant). Can be used when setting FieldPadding.

CharNumber
Long. Character index (place marker) from the ReadLine method.
Can be used to determine how far through the file we are.

Delimiter
Character. Only applicable to delimited files. "," = comma (default), vbTab = tab, etc.
Setting this property instructs the class to run in Delimited mode. Set FieldWidths to swap to Fixed Width mode.

FieldWidths
String. Only applicable to Fixed Width files. Widths are in characters (not bytes) and are comma separated, e.g. "10,5,20,8".
When read, this property returns the widths as an array of Integers.
Setting this property will instruct the class to run in Fixed Width mode. Set Delimiter to swap to Delimited mode.

FieldPadding
String. Only applicable to Fixed Width files. A comma separated list with one entry per field: L (left) or R (right) followed by a single byte or Unicode (non UTF-8) padding character, e.g. "R ,,L0,R-,R" & chrw(&H3000).
The default is "R " (pad on the right with spaces). The other common setting is "L0" (pad on the left with zeros).
When reading this property, it returns the padding converted to L/R and char in an array.
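
To make the L/R plus padding character notation concrete, here is a small standalone sketch of what one padding entry means. PadField is a hypothetical helper, not a method of the class, and truncating over-long values is an assumption of this sketch rather than documented behaviour.

```vbscript
' Hypothetical helper, not part of UTF8Filer: pads Value to Width characters.
' Side is "L" or "R"; PadChar is the single padding character.
Function PadField(Value, Width, Side, PadChar)
    Dim fill
    If Len(Value) >= Width Then
        PadField = Left(Value, Width)          ' assumption: over-long values are cut
        Exit Function
    End If
    fill = String(Width - Len(Value), PadChar) ' repeat the padding character
    If UCase(Side) = "L" Then
        PadField = fill & Value                ' "L0" style: 42 -> 00042
    Else
        PadField = Value & fill                ' "R " style: abc -> abc plus two spaces
    End If
End Function

' Usage
Dim padded
padded = PadField("42", 5, "L", "0")           ' 00042
```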

Fields
Array. Array of fields read by ReadLine method.

LineNumber
Long. Line index (pseudo place marker) for the ReadLine and WriteLine methods.
You can see how many lines have been read / written.

LineDelimiter
String. vbCRLF = carriage return & line feed (default), vbLF = Line feed, etc

TextBuffer
String. Contains the file opened by the LoadFile method, or new text you add.
If you do not load a file, but place data into this buffer from another source, then it is assumed that you are placing Unicode data into it.

TextBufferType
Integer. Type of the text in the TextBuffer. 1 = Single Byte, 2 = Unicode/Double Byte, 3 = Mixed Byte

UnicodeCharset
String. Name of the character set the data / file is in.
Note: Session.CodePage must be set to the equivalent value.
These are some common names: Windows-1252, X-ANSI, big5, gb2312, shift_jis, EUC-KR, UTF-8, UTF-7, ASCII, etc

Methods

LoadFile
Returns: True if the file opened successfully
Parameters: Absolute or Virtual path and file name. Must not be a relative path (i.e. one starting with "../")
Syntax: LoadFile(FileName)
Example: if not LoadFile("myfile.htm") then 'do error handling
Loads (reads) the entire file into the TextBuffer string. At this point the string holds the unconverted UTF-8 characters.
cTextBuffer2Unicode must then be run so that most other ASP functions can handle the text without corrupting it.

SaveFile
Returns: True if the file saved successfully
Parameters: Absolute or Virtual path and file name. Must not be a relative path (i.e. one starting with "../")
Syntax: SaveFile(FileName)
Example: if not SaveFile("myfile.htm") then 'do error handling
Saves the TextBuffer string to a UTF-8 file. The TextBuffer can be in either Unicode or UTF-8 at this point.
The class converts to UTF-8 and saves in a single step, so there is no point running the cTextBuffer2UTF8 method before you save the file.

cTextBuffer2UTF8
Returns: nothing
Parameters: none
Syntax: cTextBuffer2UTF8
Example: .cTextBuffer2UTF8
Converts the TextBuffer from Unicode to UTF-8. If it already is UTF-8 then it does nothing.

cUnicode2UTF8
Returns: nothing
Parameters: Unicode string
Syntax: cUnicode2UTF8(MyString)
Example: .cUnicode2UTF8(MyString)
Converts the string from Unicode to UTF-8.

cTextBuffer2Unicode
Returns: nothing
Parameters: none
Syntax: cTextBuffer2Unicode
Example: .cTextBuffer2Unicode
Converts the TextBuffer from UTF-8 to Unicode. If it already is Unicode then it does nothing.

cUTF82Unicode
Returns: nothing
Parameters: UTF-8 string
Syntax: cUTF82Unicode(MyString)
Example: .cUTF82Unicode(MyString)
Converts the string from UTF-8 to Unicode.

EOF
Returns: True if ReadLine (CharNumber) is at the End of the File (TextBuffer)
Parameters: none
Syntax: EOF
Example: while not .EOF 'do .ReadLine etc

ReadLine
Returns: If neither Delimiter nor FieldWidths has been set, returns the next line of text from TextBuffer. If either has been set, returns an array of the fields in the next line of data from TextBuffer.
Also updates the Fields array if Delimiter or FieldWidths has been set.
Parameters: none
Syntax: ReadLine
Example: x = .ReadLine
Reads one line (up to the next line feed) from the TextBuffer and returns the text or field data, depending on the configuration options set.
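
Putting LoadFile, EOF and ReadLine together, a read loop over a delimited file might look roughly like the sketch below. It uses only the members documented here, but it is untested against the class itself. "data.csv" is a hypothetical file name, and the sketch assumes that the buffer should be converted before reading, as in the Usage section, and that Session.CodePage and UnicodeCharset have already been set as shown there.

```vbscript
' Assumes UTF8Filer.asp is already included and this runs inside <% %>.
' "data.csv" is a hypothetical comma delimited UTF-8 file.
Dim csv, row
Set csv = New UTF8Filer
csv.Delimiter = ","                  ' switch the class to Delimited mode

If csv.LoadFile("data.csv") Then
    csv.cTextBuffer2Unicode          ' convert the buffer, as in the Usage section
    Do While Not csv.EOF
        row = csv.ReadLine           ' in Delimited mode this is an array of fields
        If IsArray(row) Then
            Response.Write csv.LineNumber & ": " & _
                Server.HTMLEncode(Join(row, " | ")) & "<br>"
        End If
    Loop
Else
    Response.Write Server.HTMLEncode(csv.ErrorText)
End If

Set csv = Nothing
```

The same fields are also available through the Fields property after each ReadLine call.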

WriteLine
Returns: True if the line was written successfully
Parameters: line or array of data to write
Syntax: WriteLine(myLine)
Example: if not WriteLine(myLine) then 'do error handling
Writes a line to the end of the TextBuffer using the mode previously configured. If Delimiter or FieldWidths has been set, the fields are written in that format; otherwise the plain text is written. The line is then terminated with the string specified in the LineDelimiter property.
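
As a rough sketch of the write path, the following builds a tab delimited buffer with WriteLine and then saves it. It relies only on the members documented here and is untested against the class itself. "export.txt" is a hypothetical file name, and the field values are passed as strings as a precaution.

```vbscript
' Assumes UTF8Filer.asp is already included and this runs inside <% %>.
' "export.txt" is a hypothetical output file name.
Dim tsv
Set tsv = New UTF8Filer
tsv.Delimiter = vbTab                ' Delimited mode, tab separated
tsv.LineDelimiter = vbCrLf           ' the documented default, set here for clarity

' WriteLine appends to TextBuffer; SaveFile converts to UTF-8 and writes the file
If Not tsv.WriteLine(Array("LNG", "Language Code", "123")) Then
    Response.Write Server.HTMLEncode(tsv.ErrorText)
ElseIf Not tsv.SaveFile("export.txt") Then
    Response.Write Server.HTMLEncode(tsv.ErrorText)
End If

Set tsv = Nothing
```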

SplitDelimiter
Returns: populates Fields property
Parameters: String of field data
Syntax: SplitDelimiter(LineString)
Example: SplitDelimiter("1243,abcd,4321,dcba")
Converts a string to an array. This is not normally used directly, but is exposed for you to use if you have the need.

SplitFixed
Returns: populates Fields property
Parameters: String of field data. Reads FieldWidths property
Syntax: SplitFixed(LineString)
Example: SplitFixed("1243abcd   4321           dcba")
Converts a string to an array. This is not normally used directly, but is exposed for you to use if you have the need.
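
For comparison with SplitDelimiter, fixed width splitting is just slicing by character count. The sketch below is a standalone illustration, not the class's SplitFixed code. SliceFixed is a made-up name, and it takes the widths as a parameter instead of reading the FieldWidths property.

```vbscript
' Hypothetical helper, not part of UTF8Filer: cuts LineString into fields whose
' character widths are given as a comma separated string such as "4,7,4".
Function SliceFixed(LineString, Widths)
    Dim w, result(), i, pos
    w = Split(Widths, ",")
    ReDim result(UBound(w))
    pos = 1
    For i = 0 To UBound(w)
        ' Mid counts characters, not bytes, so Unicode text slices correctly
        result(i) = Mid(LineString, pos, CLng(w(i)))
        pos = pos + CLng(w(i))
    Next
    SliceFixed = result
End Function

' Usage: three fields of 4, 7 and 4 characters; padding is still attached
Dim cols
cols = SliceFixed("1243" & "abcd" & Space(3) & "4321", "4,7,4")
' RTrim(cols(1)) = abcd
```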

Below are some bonus functions. The first two are used internally by the class. I wrote and used the others a while ago, but they are not really needed with the way this class works.
I am including them just in case you find a need for them:

CountChar
Returns: Number of times SearchChar appears in SourceString
Parameters: String, Character
Syntax: CountChar(SourceString,SearchChar)
Counts the number of times SearchChar occurs in SourceString.

InstrMB
Returns: Position of SearchChar in SourceString
Parameters: Binary String, Character
Syntax: InstrMB(SourceString,SearchChar)
Searches for a character in a mixed byte string. Instr in binary mode doesn't seem to work with a binary string array.

LeftMB
Same input/output as LeftB
Return left # of UTF-8 (mixed byte) chars in a Unicode stream.

LenMB
Same input/output as LenB
Count UTF-8 (mixed byte) chars in a Unicode stream.
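
The idea behind counting mixed byte characters can be sketched as follows: in UTF-8 every character has exactly one byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx). This is a standalone illustration, not the class's LenMB code. CountUtf8Chars is a made-up name, and it assumes the input is a byte string of the kind LenB and MidB operate on.

```vbscript
' Hypothetical helper, not part of UTF8Filer: counts UTF-8 characters in a
' byte string by skipping continuation bytes (10xxxxxx).
Function CountUtf8Chars(Bytes)
    Dim i, b, n
    n = 0
    For i = 1 To LenB(Bytes)
        b = AscB(MidB(Bytes, i, 1))
        If (b And &HC0) <> &H80 Then n = n + 1   ' lead byte or plain ASCII
    Next
    CountUtf8Chars = n
End Function

' Usage: "A" (1 byte) followed by e-acute (2 bytes: C3 A9) is 2 characters
Dim total
total = CountUtf8Chars(ChrB(&H41) & ChrB(&HC3) & ChrB(&HA9))
```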

Important Notes

See UTF8FilerDemo.asp for a demo of reading/writing text, or UTF8FilerDataDemo.asp for structured data.

If you improve this code, please send me a copy! Thanks!
Special thanks to Lewis Moten and Cakkie (see Planet Source Code) for their techniques on UTF-8 conversion.

Hunter Beanland
hunter @ beanland.net.au
http://www.beanland.net.au/programming/

Version History
1.4 Slight optimisations
1.3 Added Padding support
1.2 Fixed 3 bugs in the Delimited Unicode mode of ReadLine and WriteLine.
1.1 Added functionality from my TextFiler class to handle Delimited and Fixed Width data files
1.0 First version.