First of all, what is an assembler? [1] [2]:
“Typically a modern assembler creates object code by translating assembly instruction mnemonics into opcodes, and by resolving symbolic names for memory locations and other entities. The use of symbolic references is a key feature of assemblers, saving tedious calculations and manual address updates after program modifications.”
Types of assembler:
There are two types of assemblers based on how many passes through the source are needed to produce the executable program.
- One-pass assemblers go through the source code once and assume that all symbols will be defined before any instruction that references them.
- Two-pass assemblers create a table with all symbols and their values in the first pass, then use the table in a second pass to generate code. The assembler must at least be able to determine the length of each instruction on the first pass so that the addresses of symbols can be calculated.
The advantage of the two-pass assembler is that symbols can be defined anywhere in program source code, allowing programs to be defined in more logical and meaningful ways, making two-pass assembler programs easier to read and maintain.
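To make the two-pass idea concrete, here is a minimal sketch in C. It is not part of the assembler built in this post: the names stmt, entry, collect_labels and resolve_operands are made up for illustration, and every statement is assumed to occupy exactly one word. The first function records the address of each label, and the second resolves operands against that table, so a reference may appear before the label it names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature program: a statement may define a label
   and may reference a symbol as its operand. */
struct stmt  { const char *label; const char *operand; };
struct entry { const char *name;  int addr; };

/* Pass one: record the address of every label. Returns the label count. */
int collect_labels(const struct stmt *prog, int n, int start, struct entry *tab)
{
    int i, count = 0;
    assert(prog != NULL && tab != NULL);
    for (i = 0; i < n; i++) {
        if (prog[i].label != NULL) {
            tab[count].name = prog[i].label;
            tab[count].addr = start + i;   /* one word per statement */
            count++;
        }
    }
    return count;
}

/* Return the address of name, or -1 when it was never defined. */
static int lookup(const struct entry *tab, int count, const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        if (strcmp(tab[i].name, name) == 0)
            return tab[i].addr;
    }
    return -1;
}

/* Pass two: out[i] receives the operand address of statement i,
   or -1 when the statement has no operand or the symbol is undefined. */
void resolve_operands(const struct stmt *prog, int n,
                      const struct entry *tab, int count, int *out)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = prog[i].operand ? lookup(tab, count, prog[i].operand) : -1;
}
```

With a four-statement program starting at address 101 whose first statement refers to a label defined on the last one, the second pass still finds the address, because the table was completed before any operand was resolved.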
To build an assembler, one first has to define its input and output form, that is, what the assembler takes as input and what kind of output it produces. Here I am building an assembler based on the book “Systems Programming and Operating Systems” by D M Dhamdhere, so I assume the following input form:
start 101
read n
mover breg,one
movem breg,term
again: mult breg,term
add creg,one
movem creg,term
comp creg,four
bc le,again
div breg,two
movem breg,result
print result
stop
n ds 2
result ds 1
one dc ‘1’
term ds 1
two dc ‘2’
four dc ‘4’
end
And output form:
101) 90113
102) 42116
103) 52117
104) 32117
105) 13116
106) 53117
107) 63119
108) 72104
109) 82118
110) 52115
111) 100115
112) 0 000
113)
115)
116)00 0 001
117)
118)00 0 001
119)00 0 001
120)
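Each instruction line of this output can be read as the address, followed by the opcode, the register (or condition) code and the operand address written one after another. For example, "mover breg,one" at address 102 becomes "102) 42116": opcode 4, register code 2, operand address 116. The helper below is a small sketch of that layout only; format_word is a hypothetical name and does not appear in the assembler's code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: build one output line in the form
   "<address>) <opcode><register code><operand address>".
   Returns the number of characters written (as snprintf does). */
int format_word(char *out, size_t size, int lc, int opcode, int reg, int addr)
{
    assert(out != NULL && size > 0);
    return snprintf(out, size, "%d) %d%d%d", lc, opcode, reg, addr);
}
```

Calling format_word(buf, sizeof buf, 111, 10, 0, 115) gives "111) 100115", the line for "print result": an instruction without a register operand gets register code 0.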
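First, declare the fixed tables that hold the assembler’s built-in information.
/* Assembler directives and their codes */
struct directive{
    char symbol[10];
    int code;
}dir[5]={{"start",1},{"end",0},{"origin",1},{"ltorg",2}};

/* Mnemonics with their opcode and instruction length */
struct mnemtab{
    char mnem[20];
    int opcode;
    int len;
}l[12]={{"stop",0,1},{"add",1,1},{"sub",2,1},{"mult",3,1},
        {"mover",4,1},{"movem",5,1},{"comp",6,1},{"bc",7,1},
        {"div",8,1},{"read",9,1},{"print",10,1}};

/* Register names and their codes */
struct regtab{
    char regsym[10];
    int code;
}reg[4]={{"areg",1},{"breg",2},{"creg",3},{"dreg",4}};

/* Declaration statements: ds reserves storage, dc declares a constant */
struct dclcode{
    char mne[5];
    int address;
}dcl[2]={{"ds",1},{"dc",2}};

/* Condition codes used by the bc instruction */
struct compcode{
    char sym[5];
    int code;
}ccode[7]={{"lt",1},{"le",2},{"eq",3},{"gt",4},{"ge",5},{"any",6}};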
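The code above declares all the structures that hold the fixed information used by the program. The first, “directive”, holds the predefined directives and their codes. The second, “mnemtab”, holds each mnemonic with its opcode and length. The third, “regtab”, holds the register names and their codes. The fourth, “dclcode”, holds the declaration statements and their codes. The last, “compcode”, holds the comparison symbols and their codes. Besides these, create a few more structures that store information while the input file is processed:
/* Fields of the source line currently being processed */
struct buffer{
    char lbl[10];   /* label, including the trailing ':' */
    char m[10];     /* mnemonic, directive or declared symbol */
    char op1[10];
    char op2[10];
}buf;

/* Symbol table: name and address (0 means not defined yet) */
struct symtab{
    char symbol[10];
    int add;
}sym[50];

/* Literal table */
struct littab{
    char lit[10];
    int address;
}literal[20];

/* One entry per operand reference whose address pass2 fills in */
struct tab_inc{
    int index;      /* index into sym[] or literal[] */
    long fpos;      /* position in the output file where the address goes */
    int type;       /* 0 = literal, 1 = symbol */
}t_inc[50];         /* the sample program alone has 11 references, so 10 entries are too few */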
Of these four structures, the first is a buffer that holds the fields of the line being processed at the moment. The second is the symbol table, which stores every symbol with its address. The third, “littab”, stores literals with their addresses. The fourth, “tab_inc”, records for each operand reference its table index, the position in the output file where its address belongs, and whether it is a literal (type 0) or a symbol (type 1), so that pass 2 can fill the address in later. Now create some getter, setter and update functions for these structures. First, a function that returns the code of a directive:
int get_dir(char a[10]){
    int i;
    for(i=0;i<4;i++){
        if(!strcmp(a,dir[i].symbol))
            return dir[i].code;
    }
    return -1;      /* not a directive */
}
Next, update the symbol table with the address of a declared symbol; a symbol that is not in the table yet is added as a new entry. This function also checks for duplicate declarations of a symbol. If a duplicate is found, it reports “Multiple declaration of [symbolName]”.
void update_symtab(FILE *fe){
    int i;
    for(i=0;i<sc;i++){
        if(!strcmp(sym[i].symbol,buf.m)){
            if(sym[i].add==0)
                sym[i].add=lc;      /* referenced earlier, defined now */
            else{
                printf("\nMultiple declaration of %s",buf.m);
                fprintf(fe,"\nMultiple declaration of %s",buf.m);
            }
            return;
        }
    }
    /* never referenced so far: add a new entry */
    strcpy(sym[sc].symbol,buf.m);
    sym[sc].add=lc;
    sc++;
}
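Get a symbol from the symbol table. A name that starts with “=” is treated as a literal, a name that ends with “:” is a label definition, and anything else is an ordinary symbol reference. A name that is not in its table yet is added, and the function returns its index.
int get_sym(char s[10]){
    int i,len=strlen(s);
    if(s[0]=='='){
        /* literal: reuse the entry if it exists, otherwise add it */
        for(i=0;i<ltc;i++){
            if(!strcmp(literal[i].lit,s)){
                t_inc[tc].type=0;
                return i;
            }
        }
        strcpy(literal[ltc].lit,s);
        ltc++;
        t_inc[tc].type=0;
        return (ltc-1);
    }
    else if(len>0 && s[len-1]==':'){
        /* label definition: strip the ':' and record the current address */
        s[len-1]='\0';
        for(i=0;i<sc;i++){
            if(!strcmp(s,sym[i].symbol)){
                if(sym[i].add==0)
                    sym[i].add=lc;  /* label was referenced before its definition */
                return i;
            }
        }
        strcpy(sym[sc].symbol,s);
        sym[sc].add=lc;
        sc++;
        return (sc-1);
    }
    else{
        /* symbol reference: address 0 marks it as not defined yet */
        for(i=0;i<sc;i++){
            if(!strcmp(s,sym[i].symbol)){
                t_inc[tc].type=1;
                return i;
            }
        }
        strcpy(sym[sc].symbol,s);
        t_inc[tc].type=1;
        sym[sc].add=0;
        sc++;
        return (sc-1);
    }
}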
Next, get a register’s code from its name.
int get_reg(char a[])
{
    int i;
    for(i=0;i<4;i++)        /* reg[] has four entries, including dreg */
    {
        if(!strcmp(a,reg[i].regsym))
            return reg[i].code;
    }
    return 0;               /* not a register */
}
Get a mnemonic’s opcode from its name.
int get_btcode(char a[20])
{
    int i;
    for(i=0;i<11;i++)
    {
        if(strcmp(l[i].mnem,a)==0)
            return l[i].opcode;
    }
    return -1;              /* unknown mnemonic */
}
Get the code of a declaration statement.
int get_dcl(char a[10])
{
    int i;
    for(i=0;i<2;i++)        /* dcl[] has only two entries */
    {
        if(!strcmp(dcl[i].mne,a))
            return dcl[i].address;
    }
    return -1;
}
Clear the buffer after a line has been processed.
void clean_buf()
{
    strcpy(buf.lbl,"");
    strcpy(buf.m,"");
    strcpy(buf.op1,"");
    strcpy(buf.op2,"");
}
Get the code of a comparison condition from its name.
int get_ccode(char a[5]){
    int i;
    for(i=0;i<6;i++)        /* ccode[] has six initialized entries */
    {
        if(!strcmp(ccode[i].sym,a))
            return ccode[i].code;
    }
    return -1;
}
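The lookup helpers above all walk a fixed table with a hand-written loop limit, which has to be kept in step with the table size by hand. One way to avoid a mismatch is to let the compiler compute the element count. The following is only a sketch of that idea for the register table; COUNT_OF and reg_code are names introduced here for illustration and are not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct regtab { char regsym[10]; int code; };

static const struct regtab reg[] = {
    {"areg", 1}, {"breg", 2}, {"creg", 3}, {"dreg", 4}
};

/* Number of elements in an array, computed at compile time. */
#define COUNT_OF(t) (sizeof(t) / sizeof((t)[0]))

/* Return the register code, or 0 when the name is not a register. */
int reg_code(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < COUNT_OF(reg); i++) {
        if (strcmp(name, reg[i].regsym) == 0)
            return reg[i].code;
    }
    return 0;
}
```

Adding a register then only means adding a row to the table; the loop picks up the new size on its own.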
So far I have declared the data structures for the assembler; now let’s build the processing part. First, the input file has to be parsed to get the required information. For this I use the following function, which tokenizes the file one line per call and stores each token in the buffer.
void parser(FILE *fi)
{
    char a[20];
    int i;
    int ctr=0;      /* 0 = expecting mnemonic, 1 = op1, 2 = op2 */
    /* c is a global variable; it should be an int so that EOF can be detected */
    while((c=getc(fi))!=EOF && (c!='\n'))
    {
        if((c!=' ') && (c!='\t'))
        {
            /* collect one token up to the next space, tab, comma or line end */
            i=0;
            a[i++]=c;
            while(((c=getc(fi))!=' ') && (c!='\t') && (c!='\n') && (c!=EOF) && (c!=','))
            {
                if(i<9)             /* every field of buf holds at most 9 characters */
                    a[i++]=c;
            }
            a[i]='\0';
            if(ctr==0 && a[i-1]==':'){
                strcpy(buf.lbl,a);
                printf("\nlbl:%s",buf.lbl);
            }
            else if(ctr==0){
                strcpy(buf.m,a);
                printf("\nm:%s",buf.m);
                ctr++;
            }
            else if(ctr==1){
                strcpy(buf.op1,a);
                printf("\nop1:%s",buf.op1);
                ctr++;
            }
            else if(ctr==2){
                strcpy(buf.op2,a);
                printf("\nop2:%s",buf.op2);
                ctr=0;
            }
            if(c=='\n'){
                lnctr++;
                return;
            }
        }
    }
    if(c=='\n')
        lnctr++;    /* line ended after trailing blanks, or was empty */
}
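The same splitting rule can also be applied to a line that is already in memory, which makes it easy to try out on its own. The sketch below is an alternative written for illustration, not the parser used in this post: split_line, copy_token and the fields struct are hypothetical names, and it assumes tokens are separated by spaces, tabs or commas and that a first token ending in ':' is a label.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct fields { char lbl[10]; char m[10]; char op1[10]; char op2[10]; };

/* Copy at most 9 characters so the 10-byte fields never overflow. */
static void copy_token(char *dst, const char *src, size_t len)
{
    if (len > 9)
        len = 9;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split one source line into label, mnemonic and up to two operands.
   Tokens beyond the second operand are ignored. */
void split_line(const char *line, struct fields *f)
{
    static const char seps[] = " \t,\r\n";
    int slot = 0;                       /* 0 = mnemonic, 1 = op1, 2 = op2 */
    size_t len;

    assert(line != NULL && f != NULL);
    memset(f, 0, sizeof *f);
    while (*line != '\0') {
        line += strspn(line, seps);     /* skip separators */
        len = strcspn(line, seps);      /* length of the next token */
        if (len == 0)
            break;
        if (slot == 0 && line[len - 1] == ':' && f->lbl[0] == '\0') {
            copy_token(f->lbl, line, len);
        } else if (slot == 0) {
            copy_token(f->m, line, len);
            slot = 1;
        } else if (slot == 1) {
            copy_token(f->op1, line, len);
            slot = 2;
        } else if (slot == 2) {
            copy_token(f->op2, line, len);
            slot = 3;
        }
        line += len;
    }
}
```

For the line "again: mult breg,term" this fills lbl with "again:", m with "mult", op1 with "breg" and op2 with "term". For "n ds 2" the declared symbol "n" lands in m and the keyword "ds" in op1, which is the layout pass 1 relies on for declarations.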
Now that a line has been parsed from the file, let’s process it. In pass 1 the assembler creates a table of all symbols and their values. It must at least be able to determine the length of each instruction in this pass so that the addresses of symbols can be calculated. I have written the following function for pass 1 of the assembler.
void pass1(FILE *fe,FILE *fo){
    int code,opcode,dcode;
    static int fstop=1,fend=1;  /* 1 until "stop" / "end" has been seen */
    long fpos;
    if(fstop){
        /* instruction part of the program */
        if(strcmp(buf.lbl,""))
            get_sym(buf.lbl);
        opcode=get_btcode(buf.m);
        if(opcode==0){              /* stop */
            fstop=0;
            fprintf(fo,"%d) 0 000 \n",lc);
            lc++;
            clean_buf();
            return;
        }
        if(opcode==7){              /* bc: condition code, then target address */
            fprintf(fo,"%d) %d",lc,opcode);
            code=get_ccode(buf.op1);
            fprintf(fo,"%d",code);
            code=get_sym(buf.op2);
            fpos=ftell(fo);
            t_inc[tc].index=code;
            t_inc[tc].fpos=fpos;
            tc++;
            fprintf(fo,"%6s\n",""); /* gap wide enough for the address pass2 writes here */
            lc++;
            clean_buf();
            return;
        }
        if(opcode!=-1){
            fprintf(fo,"%d) %d",lc,opcode);
            code=get_reg(buf.op1);
            if(code==0){
                strcpy(buf.op2,buf.op1);    /* no register: op1 is the symbol */
            }
            fprintf(fo,"%d",code);
            code=get_sym(buf.op2);
            fpos=ftell(fo);
            t_inc[tc].index=code;
            t_inc[tc].fpos=fpos;
            tc++;
            lc++;
            fprintf(fo,"%6s\n","");
        }
        else{
            dcode=get_dir(buf.m);
            if(dcode!=-1)
            {
                switch(dcode){
                case 1:             /* start / origin: set the location counter */
                    lc=atoi(buf.op1);
                    break;
                }
            }
            else{
                /* unknown mnemonic: emit *** in place of the opcode */
                fprintf(fo,"%d) ***",lc);
                code=get_reg(buf.op1);
                if(code==0)
                {
                    if(!strcmp(buf.op2,""))
                        strcpy(buf.op2,buf.op1);
                    else{
                        fprintf(fo," *** ");
                        fprintf(fe,"\n Error at line %d :: Illegal register %s",lnctr-1,buf.op1);
                    }
                }
                fprintf(fo," %d",code);
                code=get_sym(buf.op2);
                fpos=ftell(fo);
                t_inc[tc].index=code;
                t_inc[tc].fpos=fpos;
                tc++;
                lc++;
                fprintf(fo,"%6s\n","");
                printf("\n Error at line %d :: Unknown mnemonics: %s",lnctr-1,buf.m);
                fprintf(fe,"\n Error at line %d :: Unknown mnemonics: %s",lnctr-1,buf.m);
            }
        }
    }
    else{
        if(fend)
        {
            /* declaration part, between "stop" and "end" */
            if(!strcmp(buf.m,"end")){
                fprintf(fo,"%d) \n",lc);
                lc++;
                fend=0;
            }
            else{
                code=get_dcl(buf.op1);
                switch(code)
                {
                case 1:             /* ds: reserve op2 words */
                    fprintf(fo,"%d) \n",lc);
                    update_symtab(fe);
                    lc+=atoi(buf.op2);
                    break;
                case 2:             /* dc: one word */
                    fprintf(fo,"%d)00 0 001 \n",lc);
                    update_symtab(fe);
                    lc++;
                    break;
                }
            }
        }
        else if(strcmp(buf.m,""))
        {
            /* literals after "end"; decode_lit() is not listed in this article */
            code=decode_lit(buf.m,fe);
            fprintf(fo,"%d) 00 0 %d \n",lc,code);
            lc++;
        }
    }
    clean_buf();
    return;
}
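The part of pass 1 that decides every symbol address is the location counter lc: it is set by "start", grows by one for each instruction and each "dc", and grows by the requested size for each "ds". The helper below isolates just that rule as a sketch; advance_lc is a hypothetical name and the function is not part of the code above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the location counter after one statement.
   decl is the declaration keyword ("ds" or "dc"), or NULL for an instruction.
   operand is the text after the keyword, for example "2" in "n ds 2". */
int advance_lc(int lc, const char *decl, const char *operand)
{
    if (decl != NULL && strcmp(decl, "ds") == 0) {
        assert(operand != NULL);
        return lc + atoi(operand);  /* ds reserves as many words as requested */
    }
    return lc + 1;                  /* instructions and dc occupy one word */
}
```

In the sample program "stop" sits at 112, so n starts at 113. Because "n ds 2" reserves two words, result lands at 115 and one at 116, which is why no line for 114 appears in the output form.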
Finally, pass 2 of the assembler. In pass 2 the assembler uses the tables created in the first pass to fill in the addresses that were left open in the output file. I have written the following function for it:
void pass2(FILE *fop){
    int i,adr;
    long pos;
    for(i=0;i<tc;i++){
        pos=t_inc[i].fpos;
        adr=t_inc[i].index;
        fseek(fop,pos,SEEK_SET);    /* jump to the gap left by pass1 */
        if(t_inc[i].type==0)
        {
            /* literal reference */
            if(literal[adr].address!=0)
                fprintf(fop,"%d \t",literal[adr].address);
            else{
                fprintf(fop,"***");
                /* report on screen only: writing the message at this file
                   position would overwrite the lines that follow */
                printf("\n Undeclared literal: %s",literal[adr].lit);
            }
        }
        else{
            /* symbol reference */
            if(sym[adr].add!=0)
                fprintf(fop,"%d\t",sym[adr].add);
            else{
                fprintf(fop,"***");
                printf("\n Undeclared symbol: %s",sym[adr].symbol);
            }
        }
    }
}
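The technique behind pass 2 is often called backpatching: leave a fixed-width gap in the output, remember its file position with ftell, and return to it with fseek once the address is known. The sketch below shows the mechanism on its own. The names emit_with_hole, patch_hole and HOLE_WIDTH are made up for this example, and the gap is assumed to be wide enough for any address the program can produce.

```c
#include <assert.h>
#include <stdio.h>

#define HOLE_WIDTH 6    /* assumed maximum width of an address, in characters */

/* Write "<lc>) <opcode><reg>" followed by a blank gap.
   Returns the file position of the gap. */
long emit_with_hole(FILE *fo, int lc, int opcode, int reg)
{
    long pos;
    assert(fo != NULL);
    fprintf(fo, "%d) %d%d", lc, opcode, reg);
    pos = ftell(fo);
    fprintf(fo, "%*s\n", HOLE_WIDTH, "");   /* HOLE_WIDTH spaces, then newline */
    return pos;
}

/* Overwrite the gap at pos with the address, padded to the gap width
   so the newline that follows stays intact. */
void patch_hole(FILE *fo, long pos, int addr)
{
    assert(fo != NULL);
    assert(addr >= 0 && addr < 1000000);    /* must fit into HOLE_WIDTH digits */
    fseek(fo, pos, SEEK_SET);
    fprintf(fo, "%-*d", HOLE_WIDTH, addr);
    fseek(fo, 0L, SEEK_END);                /* continue appending afterwards */
}
```

The important detail is that the patch never writes more characters than the gap holds. If the gap were narrower than the address, the write would run over the newline and into the next line of the output file.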
And it’s done. The assembler is ready to run. Go on, try it, and share your feedback.
References:-
- http://en.wikipedia.org/wiki/Assembly_language#Assembler
- http://searchdatacenter.techtarget.com/definition/assembler
- http://www.cse.iitb.ac.in/~dmd/
Note: This is not a perfect assembler and may have some bugs. It is meant only to illustrate basic assembler concepts. The code above is also not optimized.
Let’s see your automated tests.
I have not done automated tests. I have tested this code manually by providing some different input files and analyzing the created output file.
what about the main() function ?
In the main function you have to call the three main functions, parser, pass1 and pass2, in order. First open the input file and loop over its contents, calling parser and pass1 for each line. After the loop has finished, call pass2. Sample main method will look like this:
The global variables used above are declared as follows:
was searching for main() only thnx…
In pass1() it is giving error to decode_lit()….. Please help soon…..
Here it is:
Hope this helps.
Lets see….. I will try this and hope it helps me in solving my problem. I have to develop a generalized assembler which will catch all errors and if there are no errors then creates target code. Hope this helps me. And does this program checks for mnemonics?
You are at the right place; this code does the same, and yes, it stores the mnemonics in “mnemtab”.
And thanks a lot for decode_lit().
sorry fr late reply bt thnx alot for d code……it helped alot
You are welcome. You can appreciate my work by including my name and this website address, in the comments/documents/credits, where you use this code.
does this assembler work??? i havent seen yet…
…..reply plzz
Yes, it works. Have you tried running it? Did you face any issues running it?
ohh….no..i didnt check it……….. can u tell me how to run it in ubuntu….i mean what is the process to run……and also i need assembler for 8085 processors…will it work for 8085 processor also…
??
Hi,
You can run it the same way you run any C/C++ program on Ubuntu. The steps for running a C program on Ubuntu are easy to find on the internet. I will get back to you soon on the 8085 processor question.
Thanks.
can i know where “location counter” is used?? reply plzz…urgent
and another thing is that what are these “sc”,”lc” , “ltc”..and many more things that are used in this code???..like i<sc…what does this sc and other things refer to??
mr. harryjoy when i run your code, my result is kinda bit different from your output. I was wondering why does all the number 5 at the beginning turn to *** in my output ?
101) 90113
102) 42116
103) *** 2117
104) 32117
105) 13116
106) *** 3117
107) 63119
108) 72104
109) 82118
110) *** 2115
111) 100115
112) 0 000
113)
115)
116)00 0 001
117)
118)00 0 001
119)00 0 001
120)
Hi Maria,
In normal scenarios it should print the stars if the symbol it is trying to parse is invalid or unknown. So have a look at your configuration and input file; you might have a symbol or variable that is unknown to your assembler.
Thanks.
Dear,
Thanx for such a step by step analysis, yet after running, I have seen a error:
“Error at line 5 :: Unknown mnemonics: again:”. How can I resolve this issue?
Can u pls tell me, how can i generate internediate code from your pass 1
Dear,
Thank you very much for ur such a wealthy post and analysis. Yet, I found error after running the code which is :” Error at line 5 :: Unknown mnemonics: again:” and object code is not generated. Can you pls tell me how to resolve this? MOreover, I willbe glad if u advice me how to generate intermediate code from your pass1.
Thank You
same error with me reply if you have corrected it.
The basic reason could be that a mnemonic used in the input file is missing from, or misspelled in, the mnemonics struct in the code. The first step would be to check for that.
To get the intermediate output of pass 1, skip the call to pass2, close the files, and check the output file.
Regards