MOSA compiler questions

Topics: General
Jul 20, 2009 at 6:00 AM

I'm working with the Cosmos team, and we are looking at a new compiler. We have a few questions about the MOSA compiler that I hope you can answer.

 

1. Why the IR representation? It seems to complicate things quite a lot. Why not work directly with CIL (MSIL) and convert to a platform-dependent type?

2. How do you plan to do cross-assembly inlining?

3. Do you stream the data? E.g., create the instructions for each method, build the x86 asm or machine code, and then discard the CIL/IR instructions?

4. Do you have some way of compacting the way labels are stored?

5. Do you envisage multithreading the compile?

The biggest issue we have run into is memory: compiling a large assembly like mscorlib hits about 1.5 GB, and generics require a recompile. We have considered an intermediate representation, but ours would basically be a platform-specific machine instruction that uses string labels for addresses. The main purposes are low memory usage and inlining.

Regards,

Ben Kloosterman

Jul 20, 2009 at 6:21 AM

Hi Ben,

I'm on my way to work, but I'll try to answer your questions.

1. IR

When I started the groundwork for the compiler more than two years ago, I looked at all the available compilers out there and tried to learn something. One of the things I quickly learned is that they all operated on a RISC-style instruction set, typically in a three-address format. This format makes optimizations a whole lot easier than any other format. So the primary reason for the IR was to enable optimizations in a target-platform-neutral way. Another reason was to support more than one source language besides CIL. My original goal was to be able to reuse most of the compiler for things like scripting languages, regular expression compilation etc. Having a common way to do this dramatically reduces the amount of testing needed.
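
To illustrate why a three-address format helps, here is a minimal Python sketch of constant folding over straight-line three-address code. It is not MOSA code: the `Instr` class, the opcode names and `fold_constants` are hypothetical, and the pass ignores branches entirely.

```python
from dataclasses import dataclass
from typing import Union

Operand = Union[int, str]  # an int is a constant, a str names a temporary


@dataclass(frozen=True)
class Instr:
    dest: str
    op: str          # "mov", "add" or "mul" (made-up opcode names)
    a: Operand
    b: Operand = 0   # unused for "mov"


def fold_constants(code):
    """Replace operations whose operands are known constants with a 'mov'."""
    known = {}  # temporary name -> constant value
    out = []
    for ins in code:
        # Substitute operands that are already known to be constant.
        a = known.get(ins.a, ins.a)
        b = known.get(ins.b, ins.b)
        if ins.op == "mov" and isinstance(a, int):
            known[ins.dest] = a
            out.append(Instr(ins.dest, "mov", a))
        elif ins.op in ("add", "mul") and isinstance(a, int) and isinstance(b, int):
            value = a + b if ins.op == "add" else a * b
            known[ins.dest] = value
            out.append(Instr(ins.dest, "mov", value))
        else:
            # The destination is no longer a known constant.
            known.pop(ins.dest, None)
            out.append(Instr(ins.dest, ins.op, a, b))
    return out
```

Because every instruction names its operands and its destination explicitly, the pass is a single loop over a list with a dictionary lookup per operand. Doing the same on stack code would first require working out which stack slot each instruction reads.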

I have specific plans in my head to optimize and reduce memory further, but my personal time is limited right now and I have stepped back a bit from this project.

2. Cross assembly inlining

We have a mosacl command-line compiler (I sure hope it's still there) which will accept more than one input assembly and generate one (optionally bootable) output assembly containing the native code for all input assemblies.

3. Streaming

That was a plan. Unfortunately I quickly realised that this is not really possible as optimization stages will need the instructions in some sort of tree or list model to be able to reorder, replace or modify them. Of course you could achieve the same by storing modification information, but we kept the design simple here on purpose. Get things running first, optimize later after we know where the bottlenecks are. There are plans to reduce the number of lists though.

4. Compacting the way labels are stored

Ben, I'm not sure I understand the question correctly. Could you please elaborate on this? Our labels in the IR are the original addresses in the CIL assembly, so they are integers. The addresses are attached to basic blocks and not to the instructions themselves, to reduce weight some more. Currently the addresses also hold an offset, but this is almost unused and easy to remove.
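
A minimal Python sketch of that layout, assuming a made-up instruction tuple of `(offset, opcode, branch_target_or_None)`. The `BasicBlock` class and `split_into_blocks` are hypothetical names, not MOSA types, and only branches are treated as block terminators here.

```python
from dataclasses import dataclass, field


@dataclass
class BasicBlock:
    label: int  # CIL offset of the block's first instruction
    instructions: list = field(default_factory=list)


def split_into_blocks(instrs):
    """Group instructions into blocks; only the block keeps an integer label."""
    offsets = [off for off, _, _ in instrs]
    leaders = {offsets[0]} if offsets else set()
    for i, (_, _, target) in enumerate(instrs):
        if target is not None:
            leaders.add(target)              # a branch target starts a block
            if i + 1 < len(instrs):
                leaders.add(offsets[i + 1])  # so does the fall-through path
    blocks = []
    for off, opcode, target in instrs:
        if off in leaders:
            blocks.append(BasicBlock(label=off))
        # Instructions are stored without their own offset.
        blocks[-1].instructions.append((opcode, target))
    return blocks
```

With this shape the per-instruction cost of a label disappears: a method with hundreds of instructions but a handful of branches carries only a handful of integers.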

5. Multithreading

Yes, MT is possible. Methods can be compiled in parallel. However, I don't think compiling more than 4 methods at once is reasonable.

Mike

 

 

Jul 20, 2009 at 7:07 AM

Thanks, Mike.

From: __grover


2. Cross assembly inlining

> We have a mosacl command-line compiler (I sure hope it's still there) which will accept more than one input assembly and generate one (optionally bootable) output assembly containing the native code for all input assemblies.

This may combine them, but does it inline methods?

3. Streaming

> That was a plan. Unfortunately I quickly realised that this is not really possible as optimization stages will need the instructions in some sort of tree or list model to be able to reorder, replace or modify them. Of course you could achieve the same by storing modification information, but we kept the design simple here on purpose. Get things running first, optimize later after we know where the bottlenecks are. There are plans to reduce the number of lists though.

We have just been through that; mscorlib was pretty good for finding them :) The question is, do any of these instructions need to be cross-method? Cross-method optimizations seem pretty hard to me.

4. Compacting the way labels are stored

> Ben, I'm not sure I understand the question correctly. Could you please elaborate on this? Our labels in the IR are the original addresses in the CIL assembly, so these are Integers. Finally the addresses are attached to basic blocks and not to the instruction itself, to reduce weight some more. Currently the addresses hold an offset, but this is almost unused and easy to remove.

"Labels" was a bad word; I should have said method names (we use the same field as the label). For something like mscorlib there are 30K methods, but it's the calls that kill you. Changing method names to use string.Intern was a big win, and we were working on an ElfHash.
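
As a rough illustration of both ideas in Python rather than C#: `sys.intern` plays the role of `string.Intern`, so every call site that names the same method shares one string object, and `elf_hash` is the classic 32-bit ELF symbol hash. The `CallSite` class is hypothetical, and how Cosmos planned to use the hash is not stated in this thread.

```python
import sys


def elf_hash(name: str) -> int:
    """Classic 32-bit ELF symbol hash of the UTF-8 bytes of `name`."""
    h = 0
    for byte in name.encode("utf-8"):
        h = ((h << 4) + byte) & 0xFFFFFFFF  # keep the arithmetic in 32 bits
        g = h & 0xF0000000
        if g:
            h ^= g >> 24
        h &= ~g & 0xFFFFFFFF
    return h


class CallSite:
    """One call instruction; the target name is stored as an interned string."""

    __slots__ = ("target",)

    def __init__(self, target: str):
        # sys.intern returns a single shared object per distinct name.
        self.target = sys.intern(target)
```

With 30K methods and many more calls, interning means the name is stored once and each call holds only a reference to it. Replacing the name with an integer hash would shrink that further, at the cost of having to handle collisions.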

5. Multithreading

> Yes, MT is possible. Methods can be compiled in parallel. However I don't think more than 4 methods is reasonable.

Yes, 2-8. Just throw them on a WorkItem queue and let the OS work it out.
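
The work-item approach can be sketched in Python with a thread pool. `compile_method` is a hypothetical stand-in for real per-method compilation, and the default of 4 workers simply echoes the figure mentioned above. Note that in CPython a thread pool does not speed up CPU-bound work, so this only shows the queueing structure.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_method(name: str) -> str:
    """Stand-in for compiling one method (hypothetical)."""
    return f"compiled:{name}"


def compile_all(method_names, workers=4):
    """Compile each method as an independent work item on a shared pool."""
    names = list(method_names)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with names.
        return dict(zip(names, pool.map(compile_method, names)))
```

This only works cleanly if methods really are independent units, which ties back to the question of whether any optimization needs to look across method boundaries.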

Regards,

Ben

Jul 20, 2009 at 8:47 AM

Hi Ben,

1. Why the IR representation? It seems to complicate things quite a lot, why not work directly with CIL (MSIL) and convert to a platform dependent type?

CIL is not well suited to compiler optimizations because it is stack-based. Modern processors have internal CPU registers, in addition to random access memory and stack storage space, which significantly improve execution speed. A stack-based analysis would be unable to use these registers efficiently. A lower-level representation is required to exploit them. This is just one of many reasons to translate from CIL to IR. In general, you can perform more types of optimization, and better optimization, on IR than on CIL.
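
A minimal Python sketch of the translation step, assuming a made-up five-opcode subset of CIL. `stack_to_three_address` and the tuple layout are hypothetical and not MOSA's actual IR. The evaluation stack is simulated at compile time, and each stack result becomes a named temporary that a register allocator could later place in a register.

```python
def stack_to_three_address(cil):
    """Translate (opcode, argument) pairs into (dest, op, left, right) tuples.

    Supported made-up opcodes: ("ldarg", name), ("ldc", constant),
    ("add", None), ("mul", None), ("ret", None).
    """
    stack = []    # simulated evaluation stack of operand names/constants
    code = []     # emitted three-address instructions
    counter = 0   # next temporary number
    for opcode, arg in cil:
        if opcode in ("ldarg", "ldc"):
            stack.append(arg)
        elif opcode in ("add", "mul"):
            right = stack.pop()
            left = stack.pop()
            dest = f"t{counter}"
            counter += 1
            code.append((dest, opcode, left, right))
            stack.append(dest)
        elif opcode == "ret":
            code.append((None, "ret", stack.pop(), None))
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
    return code
```

After this step the implicit stack positions are gone and every value has a name, which is what later passes such as register allocation need.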

2. How do you plan to do cross-assembly inlining?

As Mike (aka grover) stated, multiple assembly compiling is available today.

3. Do you stream the data? E.g., create the instructions for each method, build the x86 asm or machine code, and then discard the CIL/IR instructions?

Just to add to Mike's comments, MOSA does not generate x86 ASM as an intermediate step. It directly outputs native x86 binary machine code. Right now, it's written to disk as a compiled PE/COFF file.

4. Do you have some way of compacting the way labels are stored?

I don’t think there is a need yet to compact them further than an integer attribute on a block structure.

5. Do you envisage multithreading the compile?

Yes. The MOSA compiler framework is properly designed to support compiling assemblies and methods in parallel.

- Phil

 

Jul 20, 2009 at 9:24 AM

> This may combine them but does it inline methods?

The MOSA compiler does not implement any inlining optimizations yet, let alone cross-assembly (or whole-program) optimization. Hopefully in the future. Intrinsic methods, however, would be fully inlined.

> Labels was a bad word I should say method name (we use the same field as label), for something with mscorlib there are 30K methods but it’s the calls that kill you .

That's something we'll have to keep in mind. For the moment, I think corlib should be AOT-compiled for performance and cached (with a secure hash). We want to avoid unnecessarily recompiling corlib. No flames; I know some people may prefer to JIT it instead (and perform a slow AOT in the background).

> Yes 2-8 , Just throw them on a WorkItem queue and let the OS work it out.

Exactly!

 

Jul 20, 2009 at 10:46 AM

Thanks for the reply.

> This may combine them but does it inline methods?

> The MOSA compiler does not implement any inlining optimizations yet, let alone cross-assembly (or whole-program) optimization. Hopefully in the future. Intrinsic methods, however, would be fully inlined.

Cross-assembly inlining is tricky, as you may need to get the method from an already compiled assembly. So do you use reflection, disassemble, or recompile the relevant method?

> Labels was a bad word I should say method name (we use the same field as label), for something with mscorlib there are 30K methods but it’s the calls that kill you .

> That's something we'll have to keep in mind. For the moment, I think corlib should be AOT-compiled for performance and cached (with a secure hash). We want to avoid unnecessarily recompiling corlib. No flames; I know some people may prefer to JIT it instead (and perform a slow AOT in the background).

That's pretty much why we are considering a new compiler. Corlib with caching needing 1.5 GB is a killer. By the way, String.Intern helps a lot for method names and is trivial to implement.

Also, there will probably be no JIT for Cosmos. If you want to run an app, you need to install it; the kernel then compiles it and, if allowed and given appropriate rights, creates a persistent executable security token.

Regards,

Ben

Jul 20, 2009 at 1:45 PM

Hi,

Two comments:

- Cross assembly inlining

mosacl basically loads all input assemblies into memory and keeps a working list of methods to compile, independent of each method's origin. We use our own internal APIs to access the raw CIL stream of the methods in question. The compiled executable ends up containing the code of all input assemblies. The order is currently not defined, but that's on the todo list. My thought for inlining in the initial stages was to look at the IL size of a method and automatically inline every method that is 64 bytes or less. Most property getters are below this size, so the benefit would be huge. The idea was to replace the call statement with the IL statements of the target method and rewrite those to use new temporaries or the parameters directly, allowing for additional savings. I don't think this is a lot of work, and it is easily done with the current design. Essentially this causes the inlined method to be compiled at every use, since caching would mostly negate the positive effects of inlining. An option would be to cache the intermediate representation of the inlined method, but it would still need to be rewritten for the destination anyway.
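
The size-based heuristic described above can be sketched in Python over a toy three-address form. The 64-byte limit comes from the post; everything else (the `Method` class, the tuple layout, the `call` and `ret` conventions, the `_inl` suffix) is hypothetical and not MOSA's design. A call is encoded as `(dest, "call", callee_name, args)`.

```python
from dataclasses import dataclass

INLINE_LIMIT = 64  # bytes of IL, the threshold suggested in the post


@dataclass
class Method:
    il_size: int
    params: tuple
    body: list  # (dest, op, a, b) tuples; op "ret" returns operand a


def inline_calls(caller_body, methods):
    """Replace calls to small methods with a renamed copy of their body."""
    out = []
    site = 0  # counts inlined call sites so temporaries stay unique
    for dest, op, a, b in caller_body:
        callee = methods.get(a) if op == "call" else None
        if callee is None or callee.il_size > INLINE_LIMIT:
            out.append((dest, op, a, b))  # keep the instruction as is
            continue
        site += 1
        # Parameters map directly onto the actual arguments.
        rename = dict(zip(callee.params, b))

        def fix(x):
            # Any other name is a callee temporary and gets a fresh name.
            if isinstance(x, str):
                return rename.setdefault(x, f"{x}_inl{site}")
            return x

        for d, o, x, y in callee.body:
            if o == "ret":
                out.append((dest, "mov", fix(x), None))  # result -> call dest
            else:
                out.append((fix(d), o, fix(x), fix(y)))
    return out
```

Because the callee body is rewritten against the caller's arguments and destination at each call site, the inlined copy is specific to that site, which matches the point above that a cached version would still need rewriting for its destination.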

- Labels

I'm not sure how Cosmos wants to implement reflection. The precompiled MOSA binaries will have full metadata alongside them and, in some cases, debug information too. In this situation you can't remove the method names from the binary, but there's no need to keep them in memory. Note that metadata and debug info are not yet supported and will be optional in any case.

Mike

Jul 21, 2009 at 8:52 AM

Hi, is your IMetadataModule binary-compatible with an MS-compiled assembly file, or just Mono?

Regards,

Ben

Jul 21, 2009 at 9:55 AM

Our AssemblyLoader is able to load any CLI-compliant PE file. This can be generated by csc, gmcs, vbc or any other compliant compiler. Our goal is for mosacl to emit CLI-compliant native code PE files (similar to ngen), which are optionally bootable. Other binary formats (ELF) are in the works; however, the PE format is the only "stable" one at the moment. The generated files will also be loadable by the assembly loader and represented as IMetadataModules in memory. Think of this interface as a lower layer beneath the standard Assembly and Module classes in the System and System.Reflection namespaces.

I'm personally working on emitting metadata and debug information into the generated executables, so that we'll have full VS integration, including stepping through the original source code and disassembly at the same time. This will be the foundation for our kernel debugger support through windbg and the Win32 kernel debugging protocol, at least until someone writes the gdb stub and emits gdb-compliant symbol files.