- Convert different types of JavaScript `String` to/from `Uint8Array`.
- Check for `String` encoding.
The main target of this library is the browser, where there is no `Buffer` type.
Node.js is supported too, except for `toString('base64')`, which depends on `btoa`.
See the Node.js equivalents table below.
```sh
npm i -S string-encode
```

Or load it directly in the browser:

```html
<script src="https://unpkg.com/string-encode"></script>
<script>
    const { str2buffer, buffer2str /* ... */ } = stringEncode;
    // ...
</script>
```
The most important functions of this library are `str2buffer(str, asUtf8)` and `buffer2str(buf, asUtf8)`, which convert any `String`, including multibyte ones, to and from `Uint8Array`.
```js
import { str2buffer, buffer2str } from 'string-encode';

// When you know your string doesn't contain multibyte characters:
let buffer = str2buffer(binaryString, false);
// ... do something with buffer ...
let processedString = buffer2str(buffer, false);

// When you know your string might contain multibyte characters:
buffer = str2buffer(mbString, true);
// ...
let processedMbString = buffer2str(buffer, true);

// Let it guess whether to utf8 encode/decode or not - not recommended:
buffer = str2buffer(anyStr);
// ...
let processedAnyStr = buffer2str(buffer);
```
A simple `sha1` function for the browser, built on `crypto.subtle`, that works with `String` and is compatible with its PHP counterpart:
```js
import { str2buffer, toString } from 'string-encode';

// Pick whichever crypto implementation the browser provides
const crypto = window.crypto || window.msCrypto || window.webkitCrypto;
const subtle = crypto.subtle || crypto.webkitSubtle;

async function sha1(str, enc = 'hex') {
    let buf = str2buffer(str, true);         // String -> Uint8Array (utf8)
    buf = await subtle.digest('SHA-1', buf); // digest -> ArrayBuffer
    buf = new Uint8Array(buf);
    return toString.call(buf, enc);          // Uint8Array -> hex/base64/binary String
}
```
How to use this `sha1` function:

```js
await sha1('something');        // "1af17e73721dbe0c40011b82ed4bb1a7dbe3ce29"
await sha1('something', false); // "\u001añ~sr\u001d¾\f@\u0001\u001b\u0082íK±§ÛãÎ)"
await sha1('что-то');           // "991fe0590dfec23402d71c0e817bc7a7ab217e2b"
await sha1('что-то', 'base64'); // "mR/gWQ3+wjQC1xwOgXvHp6shfis="
```
Base64 encode/decode a multibyte string:
```js
import { utf8Encode, utf8Decode } from 'string-encode';

btoa(utf8Encode('⚔ или 😄'));             // "4pqUINC40LvQuCDwn5iE"
utf8Decode(atob('4pqUINC40LvQuCDwn5iE')); // "⚔ или 😄"
```
Node.js equivalents:

| string-encode in Browser | Buffer in Node.js |
|---|---|
| `str2buffer(str, false)` | `Buffer.from(str, 'binary')` |
| `str2buffer(str, true)` | `Buffer.from(str, 'utf8')` |
| `hex2buffer(str)` | `Buffer.from(str, 'hex')` |
| `str2buffer(atob(str), false)` | `Buffer.from(str, 'base64')` |
| `buffer2str(buf, false)` | `buf.toString('binary')` |
| `buffer2str(buf, true)` | `buf.toString('utf8')` |
| `buffer2hex(buf)` | `buf.toString('hex')` |
| `btoa(buffer2str(buf, false))` | `buf.toString('base64')` |
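For instance, here is how the base64 rows line up in practice (a minimal sketch; `Buffer.from(uint8arr)` copies the bytes into a Node.js `Buffer`):

```js
import { buffer2str } from 'string-encode';

const buf = Uint8Array.from([72, 105, 33]); // "Hi!"

// Browser: bytes -> binary string -> base64
btoa(buffer2str(buf, false)); // "SGkh"

// Node.js equivalent, no btoa needed:
// Buffer.from(buf).toString('base64'); // "SGkh"
```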
If you want your `Uint8Array` to be one step closer to Node.js's `Buffer`, just add the `.toString()` method to it:
```js
import { toString } from 'string-encode';

let buf = Uint8Array.from([65, 108, 111, 104, 97, 44]); // "Aloha,"
buf.toString = toString; // the magic method

console.log(buf + ' world!'); // "Aloha, world!"
buf.toString('hex');    // "416c6f68612c"
buf.toString('base64'); // "QWxvaGEs"
```
Besides encoding/decoding, there are a few more functions for testing string encoding.
A JavaScript `String` is a unicode string, which means it is a list of unicode characters, not a list of bytes! And it does not map one-to-one to an array of bytes without some encoding either, because a single unicode character may require up to 3 bytes to address any of the growing list of about 144,000 defined symbols. Thus `String` is not the best data type for working with binary data.
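A quick illustration of the mismatch, in plain JS (`'€'` is code point U+20AC):

```js
const s = '€';
s.length;        // 1 - a single character...
s.charCodeAt(0); // 8364 - ...whose code does not fit into a single byte
```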
This is the main reason why the Node.js devs came up with the `Buffer` type. Later, the TypedArray standard came to the rescue, and the Node.js devs adopted one of its new types, namely `Uint8Array`, as the parent type of the existing `Buffer` type, starting with Node.js v4. Meanwhile, many libraries had been written to encode, encrypt, hash or otherwise transform data, all using the plain `String` type that had been available to the community since the beginning of JS. Even some browser built-in functions that predate the `TypedArray` standard rely on the `String` type to do their encoding (e.g. `btoa` == "binary to ASCII").
Today, if you want to manipulate bytes in JavaScript, you most likely need a `Uint8Array` instead of a `String`, for best performance and compatibility with other environments and tools.
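Modern browsers and Node.js also ship the standard `TextEncoder`/`TextDecoder` pair, which converts between `String` and `Uint8Array` using UTF-8 only (so it covers the `asUtf8 = true` cases of this library, but not raw binary strings):

```js
const bytes = new TextEncoder().encode('⚔'); // Uint8Array([226, 154, 148])
new TextDecoder().decode(bytes);             // "⚔"
```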
Judging by their content, there are a few kinds of JS `String`s used in almost all applications.

Any `String` that does not contain multibyte characters can be considered a binary string. In other words, each character's code is in the range [0..255]. These strings can be mapped one-to-one to arrays of bytes, which is basically what `Uint8Array`s are.
```js
const binStr = '© is ®';

isBinary(binStr);     // true
hasMultibyte(binStr); // false
btoa(binStr);         // "qSBpcyCu"
str2buffer(binStr);   // Uint8Array([169, 32, 105, 115, 32, 174])
```
Most old-fashioned encoding functions accept only this type of string (e.g. `btoa`).
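The mapping itself is trivial to do by hand. Roughly speaking, the binary case of `str2buffer(str, false)` boils down to something like the sketch below (`binStr2bytes` is a hypothetical helper for illustration, not part of the library's API):

```js
// Hypothetical one-to-one mapping; only valid when
// every character code is in the range [0..255]
function binStr2bytes(str) {
    const buf = new Uint8Array(str.length);
    for (let i = 0; i < str.length; ++i) {
        buf[i] = str.charCodeAt(i);
    }
    return buf;
}

binStr2bytes('© is ®'); // Uint8Array([169, 32, 105, 115, 32, 174])
```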
In JS the most common string is a multibyte string, one that contains unicode characters which require more than one byte of memory.
```js
const mbStr = '$ ⚔ ₽ 😄 € ™';

isBinary(mbStr);     // false
hasMultibyte(mbStr); // '⚔'
ord(mbStr[2]);       // 9876
```
Most encoding algorithms would not accept a multibyte `String`. If you try to run `btoa('€')`, you'll get an error like:

```
Uncaught DOMException:
Failed to execute 'btoa' on 'Window':
The string to be encoded contains characters outside of the Latin1 range.
```

Because `€` is a multibyte character.
The solution is to somehow encode the multibyte string into a single-byte string.
UTF8 is the most widely used byte encoding of unicode/multibyte strings in computers today. It is the default encoding of web pages that travel over the wire (`content-type: text/html; charset=UTF-8`) and the default in many programming languages. An important feature of UTF8 is that it is fully backward compatible with ASCII, which means any ASCII string is also a valid UTF8-encoded string. Unless you need symbols outside the ASCII table, this encoding is very compact, and uses more than one byte per character only where needed. In fact, UTF8 should be your default choice of encoding in a program.
```js
const mbStr = '₽ ⚔ $ 😄 € ™';
const utf8Str = utf8Encode(mbStr);

isBinary(utf8Str); // true
isUTF8(utf8Str);   // true
isUTF8(asciiStr);  // true (asciiStr is defined in the ASCII section below)

btoa(utf8Str);       // '4oK9IOKalCAkIPCfmIQg4oKsIOKEog=='
str2buffer(utf8Str); // Uint8Array([226, 130, 189, 32, 226, 154, 148, 32, 36, 32, 240, 159, 152, 132, 32, 226, 130, 172, 32, 226, 132, 162])
```
Even though `utf8Str` is still of type `String`, it is no longer a multibyte string, and thus can be manipulated as an array of bytes.
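Two of these properties are easy to check with the library itself: ASCII text survives the encoding untouched, and only the non-ASCII characters grow (the byte counts follow from the UTF8 spec):

```js
import { utf8Encode } from 'string-encode';

utf8Encode('abc') === 'abc'; // true - ASCII passes through as-is
utf8Encode('$').length;  // 1 byte
utf8Encode('€').length;  // 3 bytes
utf8Encode('😄').length; // 4 bytes
```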
A subset of binary strings is ASCII-only strings, the class of strings with character codes in the range [0..127]. Each ASCII character can be represented with only 7 bits.
```js
const asciiStr = 'Any text using the 26 English letters, digits and punctuation!';

isASCII(asciiStr); // true
isASCII(binStr);   // false
isASCII(utf8Str);  // false
```
Except for the String column, all column headings below are functions exported by this library.

| String | guessEncoding | hasMultibyte | isBinary | isASCII | isUTF8 | utf8bytes |
|---|---|---|---|---|---|---|
| `""` | hex | false | true | true | true | 0 |
| `"English alphabet is 26"` | ascii | false | true | true | true | 0 |
| `"$ ⚔ ₽ 😄 € ™"` | mb | "⚔" | false | false | false | false |
| `utf8Encode("$ ⚔ ₽ 😄 € ™")` | utf8 | false | true | false | true | 16 |
| `"when © × ® = ?"` | binary | false | true | false | false | false |
| `"×©"` | utf8 | false | true | false | true | 2 |
| `utf8Decode("×©")` | mb | "ש" | false | false | false | false |
| `"© binary? ×"` | ~utf8 | false | true | false | false | 2 |
I did not add an `isHEX` column, because hex is a trivial format; you can't confuse it with the others.
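A few of the rows above, reproduced as code (the expected results are taken straight from the table):

```js
import { guessEncoding, utf8Encode } from 'string-encode';

guessEncoding('English alphabet is 26');   // "ascii"
guessEncoding('$ ⚔ ₽ 😄 € ™');             // "mb"
guessEncoding(utf8Encode('$ ⚔ ₽ 😄 € ™')); // "utf8"
guessEncoding('when © × ® = ?');           // "binary"
```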
Note 1: Sometimes you can't tell whether a string has been `utf8Encode`d or it is just a unicode string that by coincidence is also a valid utf8 string. In the table above, `"×©"` could be the original string, or it could be the utf8-encoded form of `"ש"`.
Note 2: When slicing utf8-encoded strings, you might cut a multibyte character in half. The result could still be considered an (almost) valid utf8 string, with incomplete utf8 sequences at the edges. In the table above `"© binary? ×"` is such a slice: the `"©"` could be the last byte of a utf8-encoded character, and `"×"` the first of the two bytes of another character.
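Such a slice is easy to produce (a sketch using this library's exports; the exact detection result depends on where the cut lands, but something like `~utf8` is to be expected):

```js
import { utf8Encode, isUTF8 } from 'string-encode';

const enc = utf8Encode('⚔ или 😄'); // valid utf8, as a binary string
const cut = enc.slice(1, -1);       // cuts '⚔' and '😄' in half

isUTF8(enc); // true
isUTF8(cut); // false - orphan bytes at both edges
```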
To be continued...
Further reading: