Programming from the Ground Up

Jonathan Bartlett
Edited by Dominick Bruno, Jr.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is included in Appendix H. In addition, you are granted full rights to use the code examples for any purpose without even having to credit the authors.

To receive a copy of this book in electronic form, please visit the website http://savannah.nongnu.org/projects/pgubook/ This site contains the instructions for downloading a transparent copy of this book as defined by the GNU Free Documentation License.

All trademarks are property of their respective owners.

ISBN 0-9752838-4-7

Published by Bartlett Publishing in Broken Arrow, Oklahoma

Library of Congress Control Number: 2004091465

Bartlett Publishing Cataloging-in-Publication Data

Bartlett, Jonathan, 1977-
Programming from the ground up / Jonathan Bartlett ; edited by Dominick
Bruno.
p. cm.
Includes index.

ISBN 0-9752838-4-7


1. Linux. 2. Operating systems (Computers) 3. Computer programming. I. Bruno, Dominick. II. Title.

QA76.76.O63 2004

005.268—dc22 2004091465

This book can be purchased at http://www.bartlettpublishing.com/

This book is an introductory book, not a reference book. It is therefore not suitable by itself for learning how to program professionally in x86 assembly language, as some details have been left out to make the learning process smoother. The point of the book is to help the student understand how assembly language and computer programming work, not to be a reference on the subject. Reference information about a particular processor can be obtained by contacting the company which makes it.


Appendix I: Personal Dedication

There are so many people I could thank. I will name here but a few of the people who have brought me to where I am today. The many family members, Sunday School teachers, youth pastors, school teachers, friends, and other relationships that God has brought into my life to lead me, help me, and teach me are too many to count. This book is dedicated to you all.

There are some people, however, that I would like to thank specifically.

First of all, I want to thank the members of the Vineyard Christian Fellowship Church in Champaign, Illinois for everything that you have done to help me and my family in our times of crisis. It's been a long time since I've seen or heard from any of you, but I think about you always. You have been such a blessing to me, my wife, and Daniel, and I could never thank you enough for showing us Christ's love when we needed it most. I thank God every time I think of you - I thank Him for bringing you all to us in our deepest times of need. Even out in the middle of Illinois with no friends or family, God showed that He was still watching after us. Thank you for being His hands on Earth. Specifically, I'd like to thank Joe and Rhonda, Pam and Dell, and Herschel and Vicki. There were many, many others, too - so many people helped us that it would be impossible to list them all.

I also want to thank my parents, who gave me the example of perseverance and strength in hard times. Your example has helped me be a good father to my children, and a good husband to my wife.

I also want to thank my wife, who even from when we first started dating encouraged me to seek God in everything. Thank you for your support in writing this book, and more importantly, for your support in being obedient to God.

I also want to thank the Little Light House school. My entire family is continually blessed by the help you give to our son.

I also want to thank Joe and D.A. Thank you for taking a chance on me in ministry. Being able to be a part of God's ministry again has helped me in so many ways.

You all have given me the strength I needed to write this book over the last few years. Without your support, I would have been too overwhelmed by personal crises to even think about anything more than getting through a day, much less putting this book together. You have all been a great blessing to me, and I will keep you in my prayers always.


Appendix G: Document History

  • 12/17/2002 - Version 0.5 - Initial posting of book under GNU FDL

  • 07/18/2003 - Version 0.6 - Added ASCII appendix, finished the discussion of the CPU in the Memory chapter, reworked exercises into a new format, corrected several errors. Thanks to Harald Korneliussen for the many suggestions and the ASCII table.

  • 01/11/2004 - Version 0.7 - Added C translation appendix, added the beginnings of an appendix of x86 instructions, added the beginnings of a GDB appendix, finished out the files chapter, finished out the counting chapter, added a records chapter, created a source file of common linux definitions, corrected several errors, and lots of other fixes

  • 01/22/2004 - Version 0.8 - Finished GDB appendix, mostly finished w/appendix of x86 instructions, added section on planning programs, added lots of review questions, and got everything to a completed, initial draft state.

  • 01/29/2004 - Version 0.9 - Lots of editing of all chapters. Made code more consistent and made explanations clearer. Added some illustrations.

  • 01/31/2004 - Version 1.0 - Rewrote chapter 9. Added full index. Lots of minor corrections.

  • 04/18/2004 - Version 1.1 - Lots of minor updates based on reader comments. Made clearer distinction between dynamic and shared libraries.


Appendix F: Using the GDB Debugger

Overview

By the time you read this appendix, you will likely have written at least one program with an error in it. In assembly language, even minor errors usually have results such as the whole program crashing with a segmentation fault error. In most programming languages, you can simply print out the values in your variables as you go along, and use that output to find out where you went wrong. In assembly language, calling output functions is not so easy. Therefore, to aid in determining the source of errors, you must use a source debugger.

A debugger is a program that helps you find bugs by stepping through the program one step at a time, letting you examine memory and register contents along the way. A source debugger is a debugger that allows you to tie the debugging operation directly to the source code of a program. This means that the debugger allows you to look at the source code as you typed it in - complete with symbols, labels, and comments.

The debugger we will be looking at is GDB - the GNU Debugger. This application is present on almost all GNU/Linux distributions. It can debug programs in multiple programming languages, including assembly language.

An Example Debugging Session

The best way to explain how a debugger works is by using it. The program we will be using the debugger on is the maximum program used in Chapter 3. Let's say that you entered the program perfectly, except that you left out the line:

 incl %edi

When you run the program, it just goes in an infinite loop - it never exits. To determine the cause, you need to run the program under GDB. However, to do this, you need to have the assembler include debugging information in the executable. All you need to do to enable this is to add the --gstabs option to the as command. So, you would assemble it like this:

as --gstabs maximum.s -o maximum.o

Linking would be the same as normal. "stabs" is the debugging format used by GDB. Now, to run the program under the debugger, you would type in gdb ./maximum. Be sure that the source files are in the current directory. The output should look similar to this:

GNU gdb Red Hat Linux (5.2.1-4)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public
License, and you are welcome to change it and/or
distribute copies of it under certain conditions. Type
"show copying" to see the conditions. There is
absolutely no warranty for GDB. Type "show warranty"
for details.
This GDB was configured as "i386-redhat-linux"...
(gdb)

Depending on which version of GDB you are running, this output may vary slightly. At this point, the program is loaded, but is not running yet. The debugger is awaiting your command. To run your program, just type in run. This will not return, because the program is running in an infinite loop. To stop the program, hit control-c. The screen will then say this:

Starting program: /home/johnnyb/maximum

Program received signal SIGINT, Interrupt.
start_loop () at maximum.s:34
34 movl data_items(,%edi,4), %eax
Current language: auto; currently asm
(gdb)

This tells you that the program was interrupted by the SIGINT signal (from your control-c), and was within the section labelled start_loop, and was executing on line 34 when it stopped. It gives you the code that it is about to execute.

Depending on exactly when you hit control-c, it may have stopped on a different line or a different instruction than the example.

One of the best ways to find bugs in a program is to follow the flow of the program to see where it is branching incorrectly. To follow the flow of this program, keep on entering stepi (for "step instruction"), which will cause the computer to execute one instruction at a time. If you do this several times, your output will look something like this:

(gdb) stepi
35 cmpl %ebx, %eax
(gdb) stepi
36 jle start_loop
(gdb) stepi
32 cmpl $0, %eax
(gdb) stepi
33 je loop_exit
(gdb) stepi
34 movl data_items(,%edi,4), %eax
(gdb) stepi
35 cmpl %ebx, %eax
(gdb) stepi
36 jle start_loop
(gdb) step
32 cmpl $0, %eax

As you can tell, it has looped. In general, this is good, since we wrote it to loop. However, the problem is that it is never stopping. Therefore, to find out what the problem is, let's look at the point in our code where we should be exiting the loop:

cmpl  $0, %eax
je loop_exit

Basically, it is checking to see if %eax hits zero. If so, it should exit the loop. There are several things to check here. First of all, you may have left this piece out altogether. It is not uncommon for a programmer to forget to include a way to exit a loop. However, this is not the case here. Second, you should make sure that loop_exit actually is outside the loop. If we put the label in the wrong place, strange things would happen. However, again, this is not the case.

Neither of those potential problems is the culprit. So, the next option is that perhaps %eax has the wrong value. There are two ways to check the contents of a register in GDB. The first one is the command info registers. This will display the contents of all registers in hexadecimal. However, we are only interested in %eax at this point. To just display %eax we can do print/x $eax to print it in hexadecimal, or do print/d $eax to print it in decimal. Notice that in GDB, registers are prefixed with dollar signs rather than percent signs. Your screen should have this on it:

(gdb) print/d $eax
$1 = 3
(gdb)

This means that the result of your first inquiry is 3. Every inquiry you make will be assigned a number prefixed with a dollar sign. Now, if you look back into the code, you will find that 3 is the first number in the list of numbers to search through. If you step through the loop a few more times, you will find that in every loop iteration %eax has the number 3. This is not what should be happening. %eax should go to the next value in the list in every iteration.

Okay, now we know that %eax is being loaded with the same value over and over again. Let's search to see where %eax is being loaded from. The line of code is this:

 movl data_items(,%edi,4), %eax

So, step until this line of code is ready to execute. Now, this code depends on two values - data_items and %edi. data_items is a symbol, and therefore constant. It's a good idea to check your source code to make sure the label is in front of the right data, but in our case it is. Therefore, we need to look at %edi. So, we need to print it out. It will look like this:

(gdb) print/d $edi
$2 = 0
(gdb)

This indicates that %edi is set to zero, which is why it keeps on loading the first element of the array. This should cause you to ask yourself two questions - what is the purpose of %edi, and how should its value be changed? To answer the first question, we just need to look in the comments. %edi is holding the current index of data_items. Since our search is a sequential search through the list of numbers in data_items, it would make sense that %edi should be incremented with every loop iteration.

Scanning the code, there is no code which alters %edi at all. Therefore, we should add a line to increment %edi at the beginning of every loop iteration. This happens to be exactly the line we tossed out at the beginning. Assembling, linking, and running the program again will show that it now works correctly.
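For readers who find C easier to follow, the corrected loop can be sketched roughly as below. This is an illustration rather than the book's listing: the function name find_max and the variable names are assumptions, and the only details taken from the session are that a value of 0 ends the search and that the index must advance on every pass.

```c
#include <stddef.h>

/* Returns the largest value in a zero-terminated list of ints.
 * The names are hypothetical; the comments map each variable to the
 * register it stands in for in the debugging session. */
int find_max(const int *data_items)
{
    size_t index = 0;                  /* %edi */
    int current = data_items[index];   /* %eax */
    int largest = current;             /* %ebx */

    while (current != 0) {             /* cmpl $0, %eax / je loop_exit */
        index++;                       /* incl %edi, the line that was left out */
        current = data_items[index];   /* movl data_items(,%edi,4), %eax */
        if (current > largest)         /* cmpl %ebx, %eax / jle start_loop */
            largest = current;
    }
    return largest;
}
```

Deleting the index++ line reproduces the bug from the session: current keeps being reloaded from the first element, so the loop never sees the terminating 0.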

Hopefully this exercise provided some insight into using GDB to help you find errors in your programs.

Breakpoints and Other GDB Features

The program we entered in the last section had an infinite loop, and could be easily stopped using control-c. Other programs may simply abort or finish with errors. In these cases, control-c doesn't help, because by the time you press control-c, the program is already finished. To fix this, you need to set breakpoints. A breakpoint is a place in the source code that you have marked to indicate to the debugger that it should stop the program when it hits that point.

You have to set up breakpoints before you run the program. Before issuing the run command, you can set them using the break command. For example, to break on line 27, issue the command break 27. Then, when the program crosses line 27, it will stop running, and print out the current line and instruction. You can then step through the program from that point and examine registers and memory. To look at the lines and line numbers of your program, you can simply use the command l (a lowercase L, short for list). This will print out your program with line numbers a screen at a time.

When dealing with functions, you can also break on the function names. For example, in the factorial program in Chapter 4, we could set a breakpoint for the factorial function by typing in break factorial. This will cause the debugger to break immediately after the function call and the function setup (it skips the pushing of %ebp and the copying of %esp).

When stepping through code, you often don't want to have to step through every instruction of every function. Well-tested functions are usually a waste of time to step through except on rare occasion. Therefore, if you use the nexti command instead of the stepi command, GDB will wait until completion of the function before going on. Otherwise, with stepi, GDB would step you through every instruction within every called function.


Warning

One problem that GDB has is with handling interrupts. Oftentimes GDB will miss the instruction that immediately follows an interrupt. The instruction is actually executed, but GDB doesn't step through it. This should not be a problem - just be aware that it may happen.


GDB Quick-Reference

This quick-reference table is copyright 2002 Robert M. Dondero, Jr., and is used by permission in this book. Parameters listed in brackets are optional.

Table F-1: Common GDB Debugging Commands

Miscellaneous

  • quit - Exit GDB.
  • help [cmd] - Print description of debugger command cmd. Without cmd, prints a list of topics.
  • directory [dir1] [dir2] ... - Add directories dir1, dir2, etc. to the list of directories searched for source files.

Running the Program

  • run [arg1] [arg2] ... - Run the program with command-line arguments arg1, arg2, etc.
  • set args arg1 [arg2] ... - Set the program's command-line arguments to arg1, arg2, etc.
  • show args - Print the program's command-line arguments.

Using Breakpoints

  • info breakpoints - Print a list of all breakpoints and their numbers (breakpoint numbers are used for other breakpoint commands).
  • break linenum - Set a breakpoint at line number linenum.
  • break *addr - Set a breakpoint at memory address addr.
  • break fn - Set a breakpoint at the beginning of function fn.
  • condition bpnum expr - Break at breakpoint bpnum only if expression expr is non-zero.
  • command [bpnum] cmd1 [cmd2] ... - Execute commands cmd1, cmd2, etc. whenever breakpoint bpnum (or the current breakpoint) is hit.
  • continue - Continue executing the program.
  • kill - Stop executing the program.
  • delete [bpnum1] [bpnum2] ... - Delete breakpoints bpnum1, bpnum2, etc., or all breakpoints if none specified.
  • clear *addr - Clear the breakpoint at memory address addr.
  • clear [fn] - Clear the breakpoint at function fn, or the current breakpoint.
  • clear linenum - Clear the breakpoint at line number linenum.
  • disable [bpnum1] [bpnum2] ... - Disable breakpoints bpnum1, bpnum2, etc., or all breakpoints if none specified.
  • enable [bpnum1] [bpnum2] ... - Enable breakpoints bpnum1, bpnum2, etc., or all breakpoints if none specified.

Stepping through the Program

  • nexti - "Step over" the next instruction (doesn't follow function calls).
  • stepi - "Step into" the next instruction (follows function calls).
  • finish - "Step out" of the current function.

Examining Registers and Memory

  • info registers - Print the contents of all registers.
  • print/f $reg - Print the contents of register reg using format f. The format can be x (hexadecimal), d (signed decimal), u (unsigned decimal), o (octal), a (address), c (character), or f (floating point).
  • x/rsf addr - Print the contents of memory address addr using repeat count r, size s, and format f. Repeat count defaults to 1 if not specified. Size can be b (byte), h (halfword), w (word), or g (double word). Size defaults to word if not specified. Format is the same as for print, with the additions of s (string) and i (instruction).
  • info display - Shows a numbered list of expressions set up to display automatically at each break.
  • display/f $reg - At each break, print the contents of register reg using format f.
  • display/si addr - At each break, print the contents of memory address addr using size s (same options as for the x command).
  • display/ss addr - At each break, print the string of size s that begins in memory address addr.
  • undisplay displaynum - Remove displaynum from the display list.

Examining the Call Stack

  • where - Print the call stack.
  • backtrace - Print the call stack.
  • frame - Print the top of the call stack.
  • up - Move the context toward the bottom of the call stack.
  • down - Move the context toward the top of the call stack.




Appendix E: C Idioms in Assembly Language

This appendix is for C programmers learning assembly language. It is meant to give a general idea about how C constructs can be implemented in assembly language.

If Statement

In C, an if statement consists of three parts - the condition, the true branch, and the false branch. However, since assembly language is not a block structured language, you have to work a little to implement the block-like nature of C. For example, look at the following C code:

if (a == b)
{
/* True Branch Code Here */
}
else
{
/* False Branch Code Here */
}

/* At This Point, Reconverge */

In assembly language, this can be rendered as:

 #Move a and b into registers for comparison
movl a, %eax
movl b, %ebx

#Compare
cmpl %eax, %ebx

#If True, go to true branch
je true_branch
false_branch: #This label is unnecessary,
#only here for documentation
#False Branch Code Here

#Jump to reconvergence point
jmp reconverge


true_branch:
#True Branch Code Here


reconverge:
#Both branches reconverge to this point

As you can see, since assembly language is linear, the blocks have to jump around each other. Reconvergence is handled by the programmer, not the system.

A case statement is written just like a sequence of if statements.
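To make that equivalence concrete, here is a small C sketch with the same logic written both ways. The function names and the values returned are made up for illustration; the point is only that the second form is the shape an assembly version would take, one compare-and-jump per case.

```c
/* Hypothetical example: look up a price for an item number. */
int price_with_switch(int item)
{
    switch (item) {
    case 1:  return 10;
    case 2:  return 20;
    default: return -1;   /* unknown item */
    }
}

/* The same logic as a chain of if statements: compare, take the branch
 * if equal, otherwise fall through to the next comparison. */
int price_with_ifs(int item)
{
    if (item == 1)
        return 10;
    else if (item == 2)
        return 20;
    else
        return -1;        /* unknown item */
}
```

Each if in the second function corresponds to one cmpl followed by a je in assembly language, with the final else playing the role of the default case.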

Function Call

A function call in assembly language simply requires pushing the arguments to the function onto the stack in reverse order, and issuing a call instruction. After calling, the arguments are then popped back off of the stack. For example, consider the C code:

 printf("The number is %d", 88);

In assembly language, this would be rendered as:

 .section .data
text_string:
.ascii "The number is %d\0"
.section .text
pushl $88
pushl $text_string
call printf
popl %eax
popl %eax #%eax is just a dummy variable,
#nothing is actually being done
#with the value. You can also
#directly re-adjust %esp to the
#proper location.

Variables and Assignment

Global and static variables are declared using .data or .bss entries. Local variables are declared by reserving space on the stack at the beginning of the function. This space is given back at the end of the function.

Interestingly, global variables are accessed differently than local variables in assembly language. Global variables are accessed using direct addressing, while local variables are accessed using base pointer addressing. For example, consider the following C code:

int my_global_var;

int foo()
{
int my_local_var;

my_local_var = 1;
my_global_var = 2;

return 0;
}

This would be rendered in assembly language as:

 .section .data
.lcomm my_global_var, 4

.type foo, @function
foo:
pushl %ebp #Save old base pointer
movl %esp, %ebp #make stack pointer base pointer
subl $4, %esp #Make room for my_local_var
.equ my_local_var, -4 #Can now use my_local_var to
#find the local variable


movl $1, my_local_var(%ebp)
movl $2, my_global_var

movl %ebp, %esp #Clean up function and return
popl %ebp
ret

What may not be obvious is that accessing the global variable takes fewer machine cycles than accessing the local variable. However, that may not matter because the stack is more likely to be in physical memory (instead of swap) than the global variable is.

Also note that in the C programming language, after the compiler loads a value into a register, that value will likely stay in that register until that register is needed for something else. The value may also move from one register to another. For example, if you have a variable foo, it may start on the stack, but the compiler will eventually move it into registers for processing. If there aren't many variables in use, the value may simply stay in the register until it is needed again. Otherwise, when that register is needed for something else, the value, if it's changed, is copied back to its corresponding memory location. In C, you can use the keyword volatile to make sure all modifications and references to the variable are done to the memory location itself, rather than a register copy of it, in case other processes, threads, or hardware may be modifying the value while your function is running.
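As a rough illustration of where volatile matters, the sketch below uses a flag that is changed by a signal handler, that is, outside the normal flow of the function that reads it. The names are hypothetical, and whether a given compiler would really keep a non-volatile flag in a register depends on the compiler and its optimization settings.

```c
#include <signal.h>

/* volatile tells the compiler to reload the flag from memory on every
 * read instead of keeping a copy in a register. */
static volatile sig_atomic_t interrupted = 0;

static void on_interrupt(int signo)
{
    (void)signo;
    interrupted = 1;      /* changed outside the normal flow of control */
}

/* Installs the handler, raises SIGINT at ourselves, and waits for the flag.
 * Returns 1 once the handler has run, or -1 if setup failed. */
int wait_for_interrupt(void)
{
    interrupted = 0;
    if (signal(SIGINT, on_interrupt) == SIG_ERR)
        return -1;
    if (raise(SIGINT) != 0)
        return -1;
    while (!interrupted) {
        /* spin until the handler sets the flag */
    }
    return 1;
}
```

Without volatile, an optimizing compiler would be allowed to read interrupted once, keep it in a register, and never notice the handler's write.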

Loops

Loops work a lot like if statements in assembly language - the blocks are formed by jumping around. In C, a while loop consists of a loop body, and a test to determine whether or not it is time to exit the loop. A for loop is exactly the same, with optional initialization and counter-increment sections. These can simply be moved around to make a while loop.

In C, a while loop looks like this:

while(a < b)
{
/* Do stuff here */
}

/* Finished Looping */

This can be rendered in assembly language like this:

loop_begin:
movl a, %eax
movl b, %ebx
cmpl %ebx, %eax #Compare a with b
jge loop_end #Exit the loop once a >= b

loop_body:
#Do stuff here

jmp loop_begin

loop_end:
#Finished looping
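The same structure can be seen without leaving C. The sketch below, using made-up function names, writes one loop three ways: as a for loop, as the equivalent while loop, and with goto statements that mirror the loop_begin and loop_end labels of the assembly version. Note that the exit test is the opposite of the while condition: the loop leaves as soon as a < b is no longer true.

```c
/* Three equivalent ways to count how many times the body runs
 * while a < b. All names here are illustrative. */

int count_with_for(int a, int b)
{
    int count = 0;
    for (; a < b; a++)
        count++;
    return count;
}

int count_with_while(int a, int b)
{
    int count = 0;
    while (a < b) {
        count++;
        a++;              /* the for loop's increment section, moved inside */
    }
    return count;
}

int count_with_goto(int a, int b)
{
    int count = 0;
loop_begin:
    if (a >= b)           /* exit test: leave once a < b is false */
        goto loop_end;
    count++;              /* loop body */
    a++;
    goto loop_begin;      /* jmp loop_begin */
loop_end:
    return count;
}
```

All three return the same count for the same arguments; only the last one shows the jumps that the compiler has to generate for the first two.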

The x86 assembly language has some direct support for looping as well. The %ecx register can be used as a counter that ends with zero. The loop instruction will decrement %ecx and jump to a specified address unless %ecx is zero. For example, if you wanted to execute a statement 100 times, you would do this in C:

 for(i=0; i < 100; i++)
{
/* Do process here */
}

In assembly language it would be written like this:

loop_initialize:
movl $100, %ecx
loop_begin:
#
#Do Process Here
#

#Decrements %ecx and loops if the result is not zero
loop loop_begin

rest_of_program:
#Continues on to here

One thing to notice is that the loop instruction requires you to be counting backwards to zero. If you need to count forwards or use another ending number, you should use the loop form which does not include the loop instruction.
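The behavior of the loop instruction can be modelled in C as a do/while loop that counts down. This is only a sketch of the semantics, with a hypothetical function name and a parameter standing in for %ecx.

```c
/* Models "movl $100, %ecx" followed by a body and "loop loop_begin".
 * Returns how many times the body ran. */
int run_countdown(unsigned int ecx)
{
    int times_run = 0;
    do {
        times_run++;          /* loop body */
    } while (--ecx != 0);     /* loop: decrement %ecx, jump back if nonzero */
    return times_run;
}
```

Because the test comes after the body, the body always runs at least once. Starting with a counter of 0 does not skip the loop; the decrement wraps around and the loop runs a very large number of times, so the counter should be checked before entering a loop like this if it might be zero.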

For really tight loops of character string operations, there is also the rep instruction, but we will leave learning about that as an exercise to the reader.


Structs

Structs are simply descriptions of memory blocks. For example, in C you can say:

struct person {
char firstname[40];
char lastname[40];
int age;
};

This doesn't do anything by itself, except give you ways of intelligently using 84 bytes of data. You can do basically the same thing using .equ directives in assembly language. Like this:

 .equ PERSON_SIZE, 84
.equ PERSON_FIRSTNAME_OFFSET, 0
.equ PERSON_LASTNAME_OFFSET, 40
.equ PERSON_AGE_OFFSET, 80

When you declare a variable of this type, all you are doing is reserving 84 bytes of space. So, if you have this in C:

void foo()
{
struct person p;

/* Do stuff here */
}

In assembly language you would have:

foo:
#Standard header beginning
pushl %ebp
movl %esp, %ebp

#Reserve our local variable
subl $PERSON_SIZE, %esp
#This is the variable's offset from %ebp
.equ P_VAR, 0 - PERSON_SIZE

#Do Stuff Here

#Standard function ending
movl %ebp, %esp
popl %ebp
ret

To access structure members, you just have to use base pointer addressing with the offsets defined above. For example, in C you could set the person's age like this:

 p.age  =  30;

In assembly language it would look like this:

 movl $30, P_VAR + PERSON_AGE_OFFSET(%ebp)
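C can report the same offsets through the standard offsetof macro, which makes it possible to check the .equ values against what a compiler actually lays out. The sketch below assumes a platform where int is 4 bytes and the struct needs no padding, as on typical x86 Linux; the helper name set_age_by_offset is made up for illustration.

```c
#include <stddef.h>   /* offsetof */

struct person {
    char firstname[40];
    char lastname[40];
    int age;
};

/* Sets the age by computing "base address + offset" by hand,
 * the same arithmetic as P_VAR + PERSON_AGE_OFFSET(%ebp). */
void set_age_by_offset(struct person *p, int age)
{
    char *base = (char *)p;
    int *age_ptr = (int *)(base + offsetof(struct person, age));
    *age_ptr = age;
}
```

On such a platform, offsetof(struct person, lastname) is 40, offsetof(struct person, age) is 80, and sizeof(struct person) is 84, matching the .equ directives. Writing p.age = 30 is shorthand for exactly the address computation done in set_age_by_offset.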

Pointers

Pointers are very easy. Remember, pointers are simply the address that a value resides at. Let's start by taking a look at global variables. For example:

int global_data = 30;

In assembly language, this would be:

 .section .data
global_data:
.long 30

Taking the address of this data in C:

 a = &global_data;

Taking the address of this data in assembly language:

 movl $global_data, %eax

You see, with assembly language, you are almost always accessing memory through pointers. That's what direct addressing is. To get the pointer itself, you just have to go with immediate mode addressing.

Local variables are a little more difficult, but not much. Here is how you take the address of a local variable in C:

void foo()
{
int a;
int *b;

a = 30;

b = &a;

*b = 44;
}

The same code in assembly language:

foo:
#Standard opening
pushl %ebp
movl %esp, %ebp

#Reserve two words of memory
subl $8, %esp
.equ A_VAR, -4
.equ B_VAR, -8

#a = 30
movl $30, A_VAR(%ebp)

#b = &a
movl $A_VAR, B_VAR(%ebp)
addl %ebp, B_VAR(%ebp)

#*b = 44
movl B_VAR(%ebp), %eax
movl $44, (%eax)

#Standard closing
movl %ebp, %esp
popl %ebp
ret

As you can see, to take the address of a local variable, the address has to be computed the same way the computer computes the addresses in base pointer addressing. There is an easier way - the processor provides the instruction leal, which stands for "load effective address". This lets the computer compute the address, and then load it wherever you want. So, we could just say:

 #b = &a
leal A_VAR(%ebp), %eax
movl %eax, B_VAR(%ebp)

It's the same number of lines, but a little cleaner. Then, to use this value, you simply have to move it to a general-purpose register and use indirect addressing, as shown in the example above.

Getting GCC to Help

One of the nice things about GCC is its ability to spit out assembly language code. To convert a C language file to assembly, you can simply do:

gcc -S file.c

The output will be in file.s. It's not the most readable output - most of the variable names have been removed and replaced either with numeric stack locations or references to automatically-generated labels. To start with, you probably want to turn off optimizations with -O0 so that the assembly language output will follow your source code better.

Something else you might notice is that GCC reserves more stack space for local variables than we do, and then ANDs %esp with a mask. [1] This is to increase memory and cache efficiency by double-word aligning variables.
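The effect of ANDing an address with a mask can be sketched in C. The function name is hypothetical, and the alignment values used in the comments are only example values, since the mask GCC emits varies by version and target.

```c
#include <stdint.h>

/* Rounds addr down to a multiple of alignment, which must be a power
 * of two. ANDing with ~(alignment - 1) clears the low bits; for an
 * alignment of 16 that is the low four bits. */
uintptr_t align_down(uintptr_t addr, uintptr_t alignment)
{
    return addr & ~(alignment - 1);
}
```

Rounding down is the right direction for a stack pointer on x86, because the stack grows toward lower addresses: the aligned value still leaves all previously reserved space intact.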

Finally, at the end of functions, we usually do the following instructions to clean up the stack before issuing a ret instruction:

 movl %ebp, %esp
popl %ebp

However, GCC output will usually just include the instruction leave. This instruction is simply the combination of the above two instructions. We do not use leave in this text because we want to be clear about exactly what is happening at the processor level.

I encourage you to take a C program you have written and compile it to assembly language and trace the logic. Then, add in optimizations and try again. See how the compiler chose to rearrange your program to be more optimized, and try to figure out why it chose the arrangement and instructions it did.

[1] Note that different versions of GCC do this differently.





Appendix D: Table of ASCII Codes

To use this table, simply find the character or escape that you want the code for, and add the number on the left and the top.

Table D-1: Table of ASCII codes in decimal

      +0    +1    +2    +3    +4    +5    +6    +7
  0   NUL   SOH   STX   ETX   EOT   ENQ   ACK   BEL
  8   BS    HT    LF    VT    FF    CR    SO    SI
 16   DLE   DC1   DC2   DC3   DC4   NAK   SYN   ETB
 24   CAN   EM    SUB   ESC   FS    GS    RS    US
 32   SP    !     "     #     $     %     &     '
 40   (     )     *     +     ,     -     .     /
 48   0     1     2     3     4     5     6     7
 56   8     9     :     ;     <     =     >     ?
 64   @     A     B     C     D     E     F     G
 72   H     I     J     K     L     M     N     O
 80   P     Q     R     S     T     U     V     W
 88   X     Y     Z     [     \     ]     ^     _
 96   `     a     b     c     d     e     f     g
104   h     i     j     k     l     m     n     o
112   p     q     r     s     t     u     v     w
120   x     y     z     {     |     }     ~     DEL

SP at code 32 is the space character.

ASCII is actually being phased out in favor of an international standard known as Unicode, which allows you to display any character from any known writing system in the world. As you may have noticed, ASCII only has support for English characters. Unicode is much more complicated, however, because it can require more than one byte to encode a single character. There are several different methods for encoding Unicode characters. The most common are UTF-8 and UTF-32. UTF-8 is somewhat backwards-compatible with ASCII (it is stored the same for English characters, but expands into multiple bytes for international characters). UTF-32 simply requires four bytes for each character rather than one. Windows® uses UTF-16, which is a variable-length encoding which requires at least 2 bytes per character, so it is not backwards-compatible with ASCII.
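To show what "stored the same for English characters, but expands into multiple bytes" means in practice, here is a small C sketch of a UTF-8 encoder. The function name and interface are made up for illustration; the bit layout follows the standard UTF-8 scheme of one to four bytes per character.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point as UTF-8 into out, which must have
 * room for at least 4 bytes. Returns the number of bytes written, or 0
 * if the value is a surrogate or lies outside the Unicode range. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* ASCII range: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding the letter A (code 65 in the table above) produces the single byte 65, exactly as in ASCII, while the euro sign (code point 0x20AC) becomes the three bytes 0xE2 0x82 0xAC.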

A good tutorial on internationalization issues, fonts, and Unicode is available in a great article by Joel Spolsky, called "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", available online at http://www.joelonsoftware.com/articles/Unicode.html


Appendix C: Important System Calls

These are some of the more important system calls to use when dealing with Linux. For most cases, however, it is best to use library functions rather than direct system calls, because the system calls were designed to be minimalistic while the library functions were designed to be easy to program with. For information about the Linux C library, see the manual at http://www.gnu.org/software/libc/manual/

Remember that %eax holds the system call numbers, and that the return values and error codes are also stored in %eax.

Table C-1: Important Linux System Calls

%eax  Name    %ebx                            %ecx             %edx               Notes
1     exit    return value (int)              -                -                  Exits the program.
3     read    file descriptor                 buffer start     buffer size (int)  Reads into the given buffer.
4     write   file descriptor                 buffer start     buffer size (int)  Writes the buffer to the file descriptor.
5     open    null-terminated file name       option list      permission mode    Opens the given file. Returns the file descriptor or an error number.
6     close   file descriptor                 -                -                  Closes the given file descriptor.
12    chdir   null-terminated directory name  -                -                  Changes the current directory of your program.
19    lseek   file descriptor                 offset           mode               Repositions where you are in the given file. The mode (called the "whence") should be 0 for absolute positioning, and 1 for relative positioning.
20    getpid  -                               -                -                  Returns the process ID of the current process.
39    mkdir   null-terminated directory name  permission mode  -                  Creates the given directory. Assumes all directories leading up to it already exist.
40    rmdir   null-terminated directory name  -                -                  Removes the given directory.
41    dup     file descriptor                 -                -                  Returns a new file descriptor that works just like the existing file descriptor.
42    pipe    pipe array                      -                -                  Creates two file descriptors, where data written to one can be read from the other. %ebx is a pointer to two words of storage to hold the file descriptors.
45    brk     new system break                -                -                  Sets the system break (i.e., the end of the data section). If the system break is 0, it simply returns the current system break.
54    ioctl   file descriptor                 request          arguments          This is used to set parameters on device files. Its actual usage varies based on the type of file or device your descriptor references.

A dash means the register is not used by that call.

A more complete listing of system calls, along with additional information, is available at http://www.lxhp.in-berlin.de/lhpsyscal.html You can also get more information about a system call by typing man 2 SYSCALLNAME, which displays the entry for that system call from section 2 of the UNIX manual. However, the manual describes how the system call is used from the C programming language, so it may or may not be directly helpful.
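Several of the calls in Table C-1 can be tried out without writing any assembly. The following Python sketch uses the standard os module, whose functions of the same names are wrappers that eventually reach these system calls through the C library (so this shows the behavior of the calls, not the %eax calling convention). The function names pipe_round_trip and seek_demo are made up for this example.

```python
import os
import tempfile


def pipe_round_trip(message):
    """Write bytes into a pipe and read them back (pipe, write, read, close)."""
    read_fd, write_fd = os.pipe()      # pipe: two descriptors, read end first
    try:
        os.write(write_fd, message)    # a short message fits in the pipe buffer
        return os.read(read_fd, len(message))
    finally:
        os.close(read_fd)
        os.close(write_fd)


def seek_demo():
    """Show lseek's absolute (0) and relative (1) modes, and dup."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.lseek(fd, 2, os.SEEK_SET)   # mode 0: go to absolute offset 2
        first = os.read(fd, 1)         # reads b"2", position is now 3
        os.lseek(fd, 4, os.SEEK_CUR)   # mode 1: move 4 bytes past the current position
        second = os.read(fd, 1)        # reads b"7", position is now 8
        dup_fd = os.dup(fd)            # dup: new descriptor sharing the same position
        third = os.read(dup_fd, 1)     # reads b"8"
        os.close(dup_fd)
        return first, second, third
    finally:
        os.close(fd)
        os.remove(path)


if __name__ == "__main__":
    print(os.getpid())
    print(pipe_round_trip(b"hello"))
    print(seek_demo())
```

Note that the duplicated descriptor continues reading where the original left off, which is what "works just like the existing file descriptor" means in practice.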

For information on how system calls are implemented on Linux, see the corresponding section of Linux Kernel 2.4 Internals at http://www.faqs.org/docs/kernel_2_4/lki-2.html#ss2.11


Appendix B: Common x86 Instructions

Reading the Tables

The tables of instructions presented in this appendix include:

  • The instruction code

  • The operands used

  • The flags used

  • A brief description of what the instruction does

The operands section lists the types of operands the instruction takes. If it takes more than one operand, the operands are separated by commas. Each operand has a list of codes which tell whether the operand can be an immediate-mode value (I), a register (R), or a memory address (M). For example, the movl instruction is listed as I/R/M, R/M. This means that the first operand can be any kind of value, while the second operand must be a register or memory location. Note, however, that in x86 assembly language you cannot have more than one operand be a memory location.

The flags section lists the flags in the %eflags register affected by the instruction. The following flags are mentioned:

  • O - Overflow flag. This is set to true if the destination operand was not large enough to hold the result of the instruction.

  • S - Sign flag. This is set to the sign of the last result.

  • Z - Zero flag. This flag is set to true if the result of the instruction is zero.

  • A - Auxiliary carry flag. This flag is set for carries and borrows between the third and fourth bit. It is not often used.

  • P - Parity flag. This flag is set to true if the low byte of the last result had an even number of 1 bits.

  • C - Carry flag. Used in arithmetic to say whether or not the result should be carried over to an additional byte. If the carry flag is set, that usually means that the destination register could not hold the full result. It is up to the programmer to decide on what action to take (e.g. propagate the result to another byte, signal an error, or ignore it entirely).

Other flags exist, but they are much less important.
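To make the flag descriptions above concrete, here is a rough Python model of how the six flags come out of a 32-bit addition. It is only a sketch for illustration: the names add32_flags and sign_bit are invented here, and the overflow flag is computed with the usual signed rule (both inputs have the same sign, but the result does not).

```python
MASK32 = 0xFFFFFFFF


def sign_bit(value):
    """Return the top (sign) bit of a 32-bit value."""
    return (value >> 31) & 1


def add32_flags(a, b):
    """Model a 32-bit addition and return (result, flags)."""
    a &= MASK32
    b &= MASK32
    full = a + b                 # the mathematically exact sum
    result = full & MASK32       # what actually fits in 32 bits
    flags = {
        # Signed overflow: inputs share a sign that the result lacks.
        "O": sign_bit(a) == sign_bit(b) and sign_bit(result) != sign_bit(a),
        "S": sign_bit(result) == 1,
        "Z": result == 0,
        # Carry from the low four bits into the next bit.
        "A": (a & 0xF) + (b & 0xF) > 0xF,
        # Even number of 1 bits in the low byte of the result.
        "P": bin(result & 0xFF).count("1") % 2 == 0,
        # The unsigned sum did not fit in 32 bits.
        "C": full > MASK32,
    }
    return result, flags


if __name__ == "__main__":
    print(add32_flags(0xFFFFFFFF, 1))   # unsigned wrap-around: carry set
    print(add32_flags(0x7FFFFFFF, 1))   # signed wrap-around: overflow set
```

The two sample calls show why carry and overflow are separate flags: adding 1 to 0xFFFFFFFF sets carry but not overflow, while adding 1 to 0x7FFFFFFF sets overflow but not carry.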

Data Transfer Instructions

These instructions perform little, if any, computation. Instead, they are mostly used for moving data from one place to another.

Table B-1: Data Transfer Instructions

| Instruction | Operands | Affected Flags | Description |
|---|---|---|---|
| movl | I/R/M, R/M | None | This copies a word of data from one location to another. movl %eax, %ebx copies the contents of %eax to %ebx. |
| movb | I/R/M, R/M | None | Same as movl, but operates on individual bytes. |
| leal | M, R | None | This takes a memory location given in the standard format, and, instead of loading the contents of the memory location, loads the computed address. For example, leal 5(%ebp,%ecx,1), %eax loads the address computed by 5 + %ebp + 1*%ecx and stores that in %eax. |
| popl | R/M | None (except popfl) | Pops the top of the stack into the given location. This is equivalent to performing movl (%esp), R/M followed by addl $4, %esp. popfl is a variant which pops the top of the stack into the %eflags register. |
| pushl | I/R/M | None | Pushes the given value onto the stack. This is equivalent to performing subl $4, %esp followed by movl I/R/M, (%esp). pushfl is a variant which pushes the current contents of the %eflags register onto the top of the stack. |
| xchgl | R/M, R/M | None | Exchanges the values of the given operands. |
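The equivalences given for pushl and popl, and the address arithmetic done by leal, can be modeled in a few lines. The Python sketch below is a toy model, not how a real processor is implemented: the class Stack32, the starting value of %esp, and the function leal are all assumptions made for this illustration.

```python
MASK32 = 0xFFFFFFFF


class Stack32:
    """Toy model of the stack: memory is a dict of address -> 32-bit word."""

    def __init__(self, esp=0x1000):
        self.esp = esp          # hypothetical starting stack pointer
        self.memory = {}

    def pushl(self, value):
        self.esp -= 4                            # subl $4, %esp
        self.memory[self.esp] = value & MASK32   # movl value, (%esp)

    def popl(self):
        value = self.memory[self.esp]            # movl (%esp), destination
        self.esp += 4                            # addl $4, %esp
        return value


def leal(displacement, base, index, scale):
    """Compute displacement + base + scale*index without touching memory."""
    return (displacement + base + scale * index) & MASK32


if __name__ == "__main__":
    stack = Stack32()
    stack.pushl(1)
    stack.pushl(2)
    print(hex(stack.esp), stack.popl(), stack.popl(), hex(stack.esp))
    print(leal(5, 100, 3, 1))
```

The stack grows toward lower addresses, so each push lowers %esp by 4 and each pop raises it by 4, returning values in the reverse of the order they were pushed.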


Integer Instructions

These are basic calculating instructions that operate on signed or unsigned integers.

Table B-2: Integer Instructions

| Instruction | Operands | Affected Flags | Description |
|---|---|---|---|
| adcl | I/R/M, R/M | O/S/Z/A/P/C | Add with carry. Adds the carry bit and the first operand to the second, and, if there is an overflow, sets overflow and carry to true. This is usually used for operations larger than a machine word. The addition on the least-significant word would take place using addl, while additions to the other words would use the adcl instruction to take the carry from the previous add into account. For the usual case, this is not used, and addl is used instead. |
| addl | I/R/M, R/M | O/S/Z/A/P/C | Addition. Adds the first operand to the second, storing the result in the second. If the result is larger than the destination register, the overflow and carry bits are set to true. This instruction operates on both signed and unsigned integers. |
| cdq | | None | Converts the %eax word into the double-word consisting of %edx:%eax with sign extension. The q signifies that it is a quad-word. It's actually a double-word, but it's called a quad-word because of the terminology used in the 16-bit days. This is usually used before issuing an idivl instruction. |
| cmpl | I/R/M, R/M | O/S/Z/A/P/C | Compares two integers. It does this by subtracting the first operand from the second. It discards the results, but sets the flags accordingly. Usually used before a conditional jump. |
| decl | R/M | O/S/Z/A/P | Decrements the register or memory location. Use decb to decrement a byte instead of a word. |
| divl | R/M | O/S/Z/A/P | Performs unsigned division. Divides the contents of the double-word contained in the combined %edx:%eax registers by the value in the register or memory location specified. The %eax register contains the resulting quotient, and the %edx register contains the resulting remainder. If the quotient is too large to fit in %eax, it triggers a type 0 interrupt. |
| idivl | R/M | O/S/Z/A/P | Performs signed division. Operates just like divl above. |
| imull | R/M/I, R | O/S/Z/A/P/C | Performs signed multiplication and stores the result in the second operand. If the second operand is left out, it is assumed to be %eax, and the full result is stored in the double-word %edx:%eax. |
| incl | R/M | O/S/Z/A/P | Increments the given register or memory location. Use incb to increment a byte instead of a word. |
| mull | R/M | O/S/Z/A/P/C | Performs unsigned multiplication. Unlike imull, it only has the one-operand form: %eax is multiplied by the operand, and the full result is stored in the double-word %edx:%eax. |
| negl | R/M | O/S/Z/A/P/C | Negates (gives the two's complement inversion of) the given register or memory location. |
| sbbl | I/R/M, R/M | O/S/Z/A/P/C | Subtract with borrowing. This is used in the same way that adcl is, except for subtraction. Normally only subl is used. |
| subl | I/R/M, R/M | O/S/Z/A/P/C | Subtracts the first operand from the second, and stores the result in the second operand. This instruction can be used on both signed and unsigned numbers. |
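The addl/adcl pairing for numbers larger than a machine word, and the sign extension done by cdq, may be easier to follow as a model. The Python sketch below is an illustration only: the function names are made up here, and all inputs are assumed to already be unsigned 32-bit values.

```python
MASK32 = 0xFFFFFFFF


def add64_with_adcl(a_low, a_high, b_low, b_high):
    """Add two 64-bit numbers stored as (low word, high word) pairs.

    Returns (low word, high word, final carry).
    """
    low_sum = a_low + b_low                # addl on the least-significant words
    carry = low_sum >> 32                  # carry flag left behind by addl
    high_sum = a_high + b_high + carry     # adcl: adds the carry as well
    return low_sum & MASK32, high_sum & MASK32, high_sum >> 32


def cdq(eax):
    """Sign-extend a 32-bit %eax into the pair (%edx, %eax)."""
    edx = MASK32 if eax & 0x80000000 else 0
    return edx, eax


if __name__ == "__main__":
    # 0x00000000FFFFFFFF + 1 carries into the high word.
    print(add64_with_adcl(0xFFFFFFFF, 0, 1, 0))
    # -5 as a 32-bit value fills %edx with 1 bits; 5 fills it with 0 bits.
    print(cdq(0xFFFFFFFB), cdq(5))
```

Without the carry from the first addition, the high word of the first example would stay 0 and the 64-bit result would be wrong, which is the reason adcl exists.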


Logic Instructions

These instructions operate on memory as bits instead of words.

Table B-3: Logic Instructions

| Instruction | Operands | Affected Flags | Description |
|---|---|---|---|
| andl | I/R/M, R/M | O/S/Z/P/C | Performs a logical and of the contents of the two operands, and stores the result in the second operand. Sets the overflow and carry flags to false. |
| notl | R/M | None | Performs a logical not on each bit in the operand. Also known as a one's complement. |
| orl | I/R/M, R/M | O/S/Z/A/P/C | Performs a logical or between the two operands, and stores the result in the second operand. Sets the overflow and carry flags to false. |
| rcll | I/%cl, R/M | O/C | Rotates the given location's bits to the left the number of times in the first operand, which is either an immediate-mode value or the register %cl. The carry flag is included in the rotation, making it use 33 bits instead of 32. Also sets the overflow flag. |
| rcrl | I/%cl, R/M | O/C | Same as rcll, but rotates right. |
| roll | I/%cl, R/M | O/C | Rotates bits to the left. It sets the overflow and carry flags, but does not count the carry flag as part of the rotation. The number of bits to roll is either specified in immediate mode or is contained in the %cl register. |
| rorl | I/%cl, R/M | O/C | Same as roll, but rotates right. |
| sall | I/%cl, R/M | C | Arithmetic shift left. The sign bit is shifted out to the carry flag, and a zero bit is placed in the least significant bit. Other bits are simply shifted to the left. This is the same as the regular shift left. The number of bits to shift is either specified in immediate mode or is contained in the %cl register. |
| sarl | I/%cl, R/M | C | Arithmetic shift right. The least significant bit is shifted out to the carry flag. The sign bit is shifted in, and kept as the sign bit. Other bits are simply shifted to the right. The number of bits to shift is either specified in immediate mode or is contained in the %cl register. |
| shll | I/%cl, R/M | C | Logical shift left. This shifts all bits to the left (sign bit is not treated specially). The leftmost bit is pushed to the carry flag. The number of bits to shift is either specified in immediate mode or is contained in the %cl register. |
| shrl | I/%cl, R/M | C | Logical shift right. This shifts all bits in the register to the right (sign bit is not treated specially). The rightmost bit is pushed to the carry flag. The number of bits to shift is either specified in immediate mode or is contained in the %cl register. |
| testl | I/R/M, R/M | O/S/Z/A/P/C | Does a logical and of both operands and discards the results, but sets the flags accordingly. |
| xorl | I/R/M, R/M | O/S/Z/A/P/C | Does an exclusive or on the two operands, and stores the result in the second operand. Sets the overflow and carry flags to false. |
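The difference between rotating with and without the carry flag, and between the arithmetic and logical right shifts, can be seen in a short model. The Python sketch below is illustrative only: the function names mirror the instructions but are written for this example, values are treated as unsigned 32-bit numbers, and shift counts are assumed to be between 0 and 31.

```python
MASK32 = 0xFFFFFFFF


def roll(value, count):
    """Rotate left; bits leaving the top re-enter at the bottom."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & MASK32


def rcll(value, carry, count):
    """Rotate left through the carry flag (a 33-bit rotation).

    Returns (new value, new carry).
    """
    for _ in range(count):
        new_carry = (value >> 31) & 1            # top bit moves into carry
        value = ((value << 1) | carry) & MASK32  # old carry enters at bit 0
        carry = new_carry
    return value, carry


def shrl(value, count):
    """Logical shift right: zero bits are shifted in at the top."""
    return (value & MASK32) >> count


def sarl(value, count):
    """Arithmetic shift right: copies of the sign bit are shifted in."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32           # reinterpret as a negative number
    return (value >> count) & MASK32


if __name__ == "__main__":
    print(hex(roll(0x80000000, 1)), rcll(0x80000000, 0, 1))
    print(hex(shrl(0x80000000, 4)), hex(sarl(0x80000000, 4)))
```

With roll the top bit wraps straight around to bit 0, whereas with rcll it first lands in the carry flag and only reaches bit 0 on the next rotation.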


Flow Control Instructions

These instructions may alter the flow of the program.

Table B-4: Flow Control Instructions

| Instruction | Operands | Affected Flags | Description |
|---|---|---|---|
| call | destination address | None | This pushes what would be the next value for %eip onto the stack, and jumps to the destination address. Used for function calls. Alternatively, the destination address can be an asterisk followed by a register for an indirect function call. For example, call *%eax will call the function at the address in %eax. |
| int | I | None | Causes an interrupt of the given number. This is usually used for system calls and other kernel interfaces. |
| Jcc | destination address | None | Conditional branch. cc is the condition code. Jumps to the given address if the condition code is true (set from the previous instruction, probably a comparison). Otherwise, goes to the next instruction. The condition codes are listed below the table. |
| jmp | destination address | None | An unconditional jump. This simply sets %eip to the destination address. Alternatively, the destination address can be an asterisk followed by a register for an indirect jump. For example, jmp *%eax will jump to the address in %eax. |
| ret | | None | Pops a value off of the stack and then sets %eip to that value. Used to return from function calls. |

The condition codes for Jcc are:

  • [n]a[e] - above (unsigned greater than). An n can be added for "not" and an e can be added for "or equal to"

  • [n]b[e] - below (unsigned less than)

  • [n]e - equal to

  • [n]z - zero

  • [n]g[e] - greater than (signed comparison)

  • [n]l[e] - less than (signed comparison)

  • [n]c - carry flag set

  • [n]o - overflow flag set

  • [n]p - parity flag set

  • [n]s - sign flag set

  • ecxz - %ecx is zero
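The distinction between the unsigned codes (a, b) and the signed codes (g, l) matters whenever the top bit of a value is set. The Python sketch below models cmpl followed by a conditional jump. It is an illustration under stated assumptions: the function names are invented for this example, and the conditions use the flag combinations documented for the x86 (for instance, "above" means carry and zero are both clear, and "greater" means zero is clear and sign equals overflow).

```python
MASK32 = 0xFFFFFFFF


def sign_bit(value):
    return (value >> 31) & 1


def cmpl(first, second):
    """Model `cmpl first, second`: flags of second - first, result discarded."""
    first &= MASK32
    second &= MASK32
    result = (second - first) & MASK32
    return {
        "Z": result == 0,
        "S": sign_bit(result) == 1,
        "C": second < first,   # an unsigned borrow was needed
        # Signed overflow: operands differ in sign and the result's sign
        # differs from that of the value being subtracted from.
        "O": sign_bit(second) != sign_bit(first)
             and sign_bit(result) != sign_bit(second),
    }


def ja(flags):   # unsigned: second > first
    return not flags["C"] and not flags["Z"]


def jb(flags):   # unsigned: second < first
    return flags["C"]


def jg(flags):   # signed: second > first
    return not flags["Z"] and flags["S"] == flags["O"]


def jl(flags):   # signed: second < first
    return flags["S"] != flags["O"]


if __name__ == "__main__":
    # 0xFFFFFFFF is 4294967295 when unsigned, but -1 when signed.
    flags = cmpl(1, 0xFFFFFFFF)
    print("ja:", ja(flags), "jg:", jg(flags), "jl:", jl(flags))
```

After comparing 0xFFFFFFFF with 1, ja would jump (the value is above 1 as an unsigned number) while jg would not (it is -1 as a signed number), so picking the wrong family of codes silently changes the program's behavior.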


Assembler Directives

These are instructions to the assembler and linker, instead of instructions to the processor. These are used to help the assembler put your code together properly, and make it easier to use.

Table B-5: Assembler Directives

| Directive | Operands | Description |
|---|---|---|
| .ascii | QUOTED STRING | Takes the given quoted string and converts it into byte data. |
| .byte | VALUES | Takes a comma-separated list of values and inserts them right there in the program as data. |
| .endr | | Ends a repeating section defined with .rept. |
| .equ | LABEL, VALUE | Sets the given label equivalent to the given value. The value can be a number, a character, or a constant expression that evaluates to a number or character. From that point on, uses of the label will be replaced by the given value. |
| .globl | LABEL | Sets the given label as global, meaning that it can be used from separately-compiled object files. |
| .include | FILE | Includes the given file just as if it were typed in right there. |
| .lcomm | SYMBOL, SIZE | This is used in the .bss section to specify storage that should be allocated when the program is executed. Defines the symbol with the address where the storage will be located, and makes sure that it is the given number of bytes long. |
| .long | VALUES | Takes a sequence of numbers separated by commas, and inserts those numbers as 4-byte words right where they are in the program. |
| .rept | COUNT | Repeats everything between this directive and the .endr directive the number of times specified. |
| .section | SECTION NAME | Switches the section that is being worked on. Common sections include .text (for code), .data (for data embedded in the program itself), and .bss (for uninitialized global data). |
| .type | SYMBOL, @function | Tells the linker that the given symbol is a function. |
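It can help to see exactly which bytes the data directives place in the program. The Python sketch below mimics .ascii, .byte, .long, and .rept for an x86 target. It is an approximation written for this illustration: the function names are made up, and the byte order shown for .long assumes the little-endian layout used by x86 processors.

```python
import struct


def ascii_directive(text):
    """Bytes for .ascii "text" (no terminating zero byte is added)."""
    return text.encode("ascii")


def byte_directive(*values):
    """Bytes for .byte v1, v2, ... (one byte per value)."""
    return bytes(v & 0xFF for v in values)


def long_directive(*values):
    """Bytes for .long v1, v2, ... (4 bytes per value, low byte first)."""
    return b"".join(struct.pack("<I", v & 0xFFFFFFFF) for v in values)


def rept(count, body):
    """Bytes for a .rept COUNT ... .endr block whose body emits `body`."""
    return body * count


if __name__ == "__main__":
    print(ascii_directive("Hi\n"))
    print(long_directive(1, 0x12345678))
    print(rept(3, byte_directive(0)))
```

Because .ascii does not add a terminating zero, a null-terminated string such as a file name for the open system call needs an explicit \0 at the end of the quoted string.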


Differences in Other Syntaxes and Terminology

The syntax for assembly language used in this book is known as the AT&T syntax. It is the one supported by the GNU tool chain that comes standard with every Linux distribution. However, the official syntax for x86 assembly language (known as the Intel® syntax) is different. It is the same assembly language for the same platform, but it looks different. Some of the differences include:

  • In Intel syntax, the operands of instructions are often reversed. The destination operand is listed before the source operand.

  • In Intel syntax, registers are not prefixed with the percent sign (%).

  • In Intel syntax, a dollar-sign ($) is not required to do immediate-mode addressing. Instead, non-immediate addressing is accomplished by surrounding the address with brackets ([]).

  • In Intel syntax, the instruction name does not include the size of data being moved. If that is ambiguous, it is explicitly stated as BYTE, WORD, or DWORD immediately after the instruction name.

  • The way that memory addresses are represented in Intel assembly language is much different (shown below).

  • Because the x86 processor line originally started out as a 16-bit processor, most literature about x86 processors refers to words as 16-bit values, and calls 32-bit values double words. However, we use the term "word" to refer to the standard register size on a processor, which is 32 bits on an x86 processor. The Intel syntax keeps the older naming convention - DWORD stands for "double word" and is used for standard-sized registers, which we would simply call a "word".

  • Intel assembly language has the ability to address memory as a segment/offset pair. We do not mention this because Linux does not support segmented memory, so it is irrelevant to normal Linux programming.

Other differences exist, but they are small in comparison. To show some of the differences, consider the following instruction:

movl %eax, 8(%ebx,%edi,4)

In Intel syntax, this would be written as:

mov  [8 + ebx + 4 * edi], eax

The memory reference is a bit easier to read than its AT&T counterpart because it spells out exactly how the address will be computed. However, the order of operands in Intel syntax can be confusing.
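The mapping between the two ways of writing a memory reference is mechanical, which the following Python sketch tries to show. It is a hypothetical helper written for this illustration, not part of any assembler: att_memory_to_intel handles only operands of the form displacement(base,index,scale) and ignores everything else about the two syntaxes.

```python
import re

# Pattern for an AT&T memory operand such as 8(%ebx,%edi,4).
_ATT_MEMORY = re.compile(r"""
    ^\s*
    (?P<disp>-?\w+)?                   # optional displacement
    \(\s*
    (?P<base>%\w+)?                    # optional base register
    (?:\s*,\s*(?P<index>%\w+)          # optional index register
       (?:\s*,\s*(?P<scale>[1248]))?   # optional scale, defaults to 1
    )?
    \s*\)\s*$
""", re.VERBOSE)


def att_memory_to_intel(operand):
    """Rewrite disp(base,index,scale) as [disp + base + scale * index]."""
    match = _ATT_MEMORY.match(operand)
    if match is None:
        raise ValueError("not a disp(base,index,scale) operand: %r" % operand)
    parts = []
    if match.group("disp"):
        parts.append(match.group("disp"))
    if match.group("base"):
        parts.append(match.group("base").lstrip("%"))   # no % prefix in Intel syntax
    if match.group("index"):
        scale = match.group("scale") or "1"
        parts.append("%s * %s" % (scale, match.group("index").lstrip("%")))
    return "[" + " + ".join(parts) + "]"


if __name__ == "__main__":
    print(att_memory_to_intel("8(%ebx,%edi,4)"))
    print(att_memory_to_intel("(%eax)"))
```

For the operand 8(%ebx,%edi,4) this produces [8 + ebx + 4 * edi], the bracketed form in which Intel syntax spells out the address computation.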

Where to Go for More Information

Intel has a set of comprehensive guides to their processors. These are available at http://www.intel.com/design/pentium/manuals/ Note that all of these use the Intel syntax, not the AT&T syntax. The most important ones are the three volumes of their IA-32 Intel Architecture Software Developer's Manual.

In addition, you can find a lot of information in the manual for the GNU assembler, available online at http://www.gnu.org/software/binutils/manual/gas-2.9.1/as.html. Similarly, the manual for the GNU linker is available online at http://www.gnu.org/software/binutils/manual/ld-2.9.1/ld.html.

