FoCS 03 - Data Storage

The story of cover: the first hard drive in the world

Data types in Computer

Today, we can store any data we have in our computer, including: numbers, text, audio, image, video, etc.
Most of mordern computer are based on binary, in other words, the computer can only store and manipulate 0 and 1, so how did computer store the data?

It’s no doubt that we store all data as binary in our computer, there must be some methods of representation to convert 0 and 1 to what we can see.

What is bit?

Before talking about bit pattern, we need to know what is bit?
Simple, bit is the smallest unit in computer, which only means a 0 or 1.

What is bit pattern?

Bit pattern is a sequence of bits, and sometimes we call it bit stream as well.

For different use, we can define the binary numbers’ meaning, and let the computer know the mapping rules, so that we can use binary to represent the content. In theory, if the binary number is big enough, it could represent anything, even the universe.

Fixed point number storage

Fixed point number is usually used for integer, and the point is on the rightest position of number, so we call it fixed point number as well.

1
2


storage content point(which is fixed, but not in storage)
101011010101011 .           

For saving memory(make the length of binary shorter), we have two ways to present an integer: unsigned notation and signed notation.

Unsigned notation

Its range includes [0, 2^n) , n is the length of binary number.

How it works?

If I want to store a decimal number 7 into a 8-length binary unit:

1
2
3
4
5


convert the 8 to binary:
7 => 111

put it in a 7-length binary unit:
111 => 00000111

By the way, computers are used to use 8-length binary unit, and call it byte.

How to read?

Just like convert a binary number to deicmal number

If I want to read the number I just stored from a 8-length binary unit:

1
2


just convert it to decimal number:
00000111 => 7

Overflow

If you want to store a number which is bigger than the binary unit allowed, then the overflow will happen.

For example, if you want to store a decimal number 21 into a 4-length binary unit:

1
2
3
4


convert 21 into binary:
21 => 10101
put it in a 4-length binary unit:
10101 => 0101 

As you can see, the first 1 is dropped, because the 4-length binary unit only allows 4 numbers, the number 21 becomes 5.

When we use it?

When you need number which is >= 0. Here are some common use:

when we use number to present the quantity
when presenting a logical address in computer
when presenting other data types(text, image, audio, video, etc)

Signed absolute value

Because of its own problem, it’s not common in computer, and I will introduce what the problem is after a while.

Its range includes (-2^(n-1), 2^(n-1)) , n is the length of binary number.

How it works?

As we mentioned above, computer are used to use byte(8-length binary unit). For signed absolute value notation, the first bit in byte is the sign, 0 for positive, 1 for negative.

If I want to store decimal numbers +7 and -12

Decimal number	Sign	Binary data	Result
7	0	0000111	00000111
-12	1	0001100	10001100

How to read?

Here’s the steps:

read the first number, is the sign
convert remaining content to decimal number
get the result

For example, read the binary number 11001101 in signed absolute value

1
2
3
4
5
6
7


the first number for 11001101 is 1, which means it's a negative number

convert the remaining content to decimal number:
1001101 => 77

get the result:
-77

Overflow

Just as unsigned notation, if you put a number which is bigger than the binary unit allowed, the overflow will happen.

If I store a decimal number 130 to a 8-length bit sequence:

1
2
3
4
5


convert 130 into binary
130 => 10000010

put it in a byte
then you will get 10000010 

The number 130 becomes -2, because of the overflow.

The problem

For number 0, the data part is 0000000, but sign part could be 0 and 1, so there have two 0: +0 and -0.

Two’s complement

Two’s complement solves the problem of double zero, and it’s the standard method for integer storage in computer.
If you want to get the opposite number of a specific number, just calculating the two’s complement of the number.

Its range includes [-2^(n-1), 2^(n-1)) , n is the length of binary number.

How it works?

It is based on Signed absolute value, so they are similiar.

For positive number, they are the same.
For negative number, number needs to be converted to two’s complement form.

How to calculate two’s complement?

check whether the sign bit of number is 1, if yes, continue

from right to left, and all numbers (excepting the sign number) which is on the left of the first 1 needs to be reversed (1->0, 0->1)

How does it store a decimal number -125:

	Sign	Data
Signed absolute value	1	1111101
After two’s complement	1	0000011

So the result is: 10000011

How to read?

For positive numbers: the same as signed absolute value
For negative numbers: you need to get the two’s complement

We use the result from the previous step: converting 10000011 to a decimal number

	Sign	Data
The original number	1	0000011
After two’s complement	1	1111101

So the result is 11111101
As you can see, when you use two’s complement to a number for two times, it will be restored.

Overflow

Just as other notations, when you put a number which is not in the range of the notation, it will overflow.

If I store a decimal number -130 to a 8-length bit sequence:

	Sign	Data
Convert -130 to binary	1	10000010
After two’s complement	1	01111110
Stored in bit sequence	0	1111110

The number -130 becomes 126

Floating point number storage

Floating point number storage includes a floating point. By moving the point, we can change the number size easily.
It’s used for presenting numbers which includes very huge integer part or very small decimal part.

How it works

For any number uses floating number notation includes 3 parts:

Sign(S): it’s used to identify the number is positive or negative
Exponent(E): it’s used to express the position of point
Fraction(F): it’s used to present the number content

Scientific notation

It’s widely know that we can use scientific notation to shorten the length of number.
In floating point number notation, we will use the scientific notation.

Using scientific notation to convert a binary number 0.000000101 (As you can see, it has very small decimal part)

1
2
3
4
5
6
7


counting how many numbers between the point and the first number 1
number:      0.000000101
 count:        1234567
So the point needs to move right for 9 times to get behind first number 1

because the based number is 2, and we move to right for 7 times
the result is 1.01 * 2^(-7)

Standardization

In a non-zero binary number, the starting number is always 1, so we can ignore it.
If we do so, we could save memory(The less content you save, the less storage you use), we call this step as standardization.

Then the result becomes: .01 * 2^(-7)

Sign and Fraction

Ok, from the previous step, we get .01 * 2^(-7), and now we can identify the sign and the fraction parts.

Just as fixed point numbers, positive number’s sign is 0, negative is 1.
To get the fraction, the only thing we need to do is remove the point

So here it is:

Sign	Exponent	Fraction
0	unkown	01

Exponent

Storing a zero or positive number is easier than negative number, because we don’t need to care about sign. But the truth is the point can move to left or right, so the exponent could be positive or negative, which is annoying.

So we make an offset to make a negative number becomes a zero or positive number.
For a storage unit includes 4 bit size for exponent, we can store 16 (2^4) numbers in it:

Before offset	-7	-6	-5	-4	-3	-2	-1	0	1	2	3	4	5	6	7	8
After offset	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15

Why is the range [-7,8] rather than [-8,7]?

In short, it’s designed by IEEE.
But they said “There is no standard for offset binary, but most often…”.
You can check more details from here and here.

Let’s continue! As you can see, if you want to offset the range to non-negative, you need to offset for 2^(n-1)-1, which n is the bit size of storage unit.

We got 2^(-7) from the standardization. From the offset table, we can get 0 after offset.

Sign	Exponent	Fraction
0	0	01

So the storage content is depended on the storage unit.

Float32 & Float64

Float32 and Float64 are two standards made by IEEE, which is commonly used in floating number presenting.

Float32

Float32 has 32 bits totally, which includes:

Sign: 1 bit
Exponent: 8 bits
Fraction: 23 bits

Range calculating:

Min absolute value:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


E                       F
00000000(8 totally)     1000...000(23 totally)
   |                       |
   V                       V
 2^0                    1-2^(-1)
   | remove offset         |
   V                       V
 2^(-127)               1-2^(-1)

max absolute value: (1-2^(-1))*2^(-127)

Max absolute value:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


E                       F
11111111(8 totally)     1111...111(23 totally)
   |                       |
   V                       V
 2^255                  1-2^(-24)
   | remove offset         |
   V                       V
 2^128                  1-2^(-24)

max absolute value: (1-2^(-24))*2^128

The range of float32 is:
[-(1-2^(-24))*2^128, -(1-2^(-1))*2^(-127)] ∪ [(1-2^(-1))*2^(-127), (1-2^(-24))*2^128]

Just like others, if you store a number which is not in the range, it will overflow.

Float64

Float64 has 64 bits totally, which includes:

Sign: 1 bit
Exponent: 11 bits
Fraction: 52 bits

Range calculating:

Min absolute value:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


E                       F
000...00(11 totally)    1000...000(52 totally)
   |                       |
   V                       V
 2^0                    1-2^(-1)
   | remove offset         |
   V                       V
 2^(-1023)              1-2^(-1)

min absolute value: (1-2^(-1))*2^(-1023)

Max absolute value:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


E                       F
111...11(11 totally)    1111...111(52 totally)
   |                       |
   V                       V
 2^2047                 1-2^(-53)
   | remove offset         |
   V                       V
 2^1024                 1-2^(-53)

max absolute value: (1-2^(-53))*2^1024

The range of float64 is:
[-(1-2^(-53))*2^1024, -(1-2^(-1))*2^(-1023)] ∪ [(1-2^(-1))*2^(-1023), (1-2^(-53))*2^1024]

Float64 can present a huger number range than Float32, but it still might overflow.

Zero

You might noticed that whatever float32 or float64, their range don’t include number 0. Zero is the special occasion for floating numbers, it still could be stored in float32 and float64 in fact.

When sign, exponent, fraction are all 0, the number is 0.0

Text storage

Text is composed by characters, so it’s a character problem elemently. In other words, how we handle characters is how we handle texts.

Encoding method is the method handles characters. According to the method, we could map the character and a binary series.

ASCII

ASCII is the first mapping rule, and every character is presented in a 7-length binary sequence, so it can only present 127(2^7) types of character, which is very limited.

Unicode

It’s the most common character set we used today. It includes all character from different languages and marks and every character is presented in a 32-length binary sequence. ASCII is a part of Unicode, and because of its overwhelming convenience, other encodings become unpopular anymore.

Audio storage

Unlike numbers and characters, audio is not defined clearly, which means it cannot be stored lossless, we can only emulate how the audio generated.

Sampling rate

The sampling rate determines how many details can we emulate.
Just like the calculus, we cut the audio to n parts averagely. When n is larger, which means the sampling rate is higher and the more details we emulated. In contrast, when the n is smaller, the sampling rate is lower and less details we emulated.

Normally, 40000 times/second of sampling rate is good enough.

Quantification

We can using a number to present every sample we get, and this process is callled quantification.

For simplifing the process, we usually use integer to do the quantification.
For example, when the sample value is 17.8, we will use 18 instead of 17.8.

Encoding

In this step, the audio will become a binary sequence which could be stored in our computer.

Bits per sample

It’s the index to measure the quality of the sample. It defines how many bits we could use to present the quantification of sample. The bits per sample higher, the quality of emulation of every sample higher.

Code rate

Code rate = Sampling rate * Bits per sample
File size = Code rate * Audio duration

For 1-minute music with 40000 sampling rate and 16 bits per sample:

1
2
3
4
5


the code rate:
40000 times/second * 16 bits/times = 640000 bits/second = 640 Kb/s

the file size:
60 second * 640 Kb/s = 38400 Kb = 3.84 Mb 

Image storage

There has two types of images: Bitmap graphics and Vector graphics, and they use different principles to illustrate images.

Bitmap graphics

In this way, images are composed by pixels, so all things needed are describing the pixels:

How many pixels it has
What the color of the pixel

Resolution

Resolution is used for illustrate how many pixels the image has. Usually, the higher resolution, the more clear the image is.

Resolution is composed by pixels on width and pixels on height, like: 1920 * 1080, and here are some common resolutions:

1080p: 1920 * 1080, 1920 pixels on width, 1080 pixels on height
1440p: 2560 * 1440, 2560 pixels on width, 1440 pixels on height
4K: 3840 * 2160, 3840 pixels on width, 2160 pixels on height

Color depth

Also, we needs a binary sequence to describe the color of pixel, and the length of binary sequence is called color depth. So the larger color depth, the more color you can use, the larger the image.

Every color in computer is made up of red, green, blue(you can mix them in different ratio to generate new colors), so the color code is composed by the code of red, green, blue as well. The larger color code of specific color means the more specific color in the result color(the mixed color).

Here are some sample of 3-bits color:

Code	R	G	B	Color
111000000	111	000	000	red
000111000	000	111	000	green
000000111	000	000	111	blue
000000000	000	000	000	black
111111111	111	111	111	white

Usually the color depth is 8 bit, which could present 16777216 colors.

Pros and cons

Pros:

The image could present complex content
It’s commonly used in our daliy life(photography)

Cons:

The resolution is fixed. Once you zoom in, the image will be blurred.

Vector graphics

Just like programs, the vector graphics is composed by instructions which tells the computer how to draw it.

If I want to draw a circle, I need:

Where is the center of the circle
How many is the circle’s radius
What’s the color of the border of the circle
What’s the color in the circle

Vector graphics are generated by professional software.

Pros and cons

Pros:

When you zoom in, the graphic will be re-rendered, so it won’t be blurred.
The size is smaller than bitmaps graphics

Cons:

You can only generate it by using professional software
It cannot be used for illustrate complex things

Video storage

Videos is made up of a series of images and played rapidly. In other words, video file is a file which includes infomation of what the specific image on specific time. But if we do so, the video file must be very huge, so today, our video files are all compressed.