lichao posted on 2024-8-26 14:41:25

OLLVM Study, Part 1: Introduction to LLVM and Modularizing Hikari

Last edited by lichao on 2024-9-23 08:14



# LLVM

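LLVM is an open-source compiler framework that bundles Clang and a range of other compilation tools. The name originally stood for "Low Level Virtual Machine". The project was started around 2000 at the University of Illinois, and Apple later became one of its main backers; Clang has been Xcode's default compiler since Xcode 4.2.
Front ends built on LLVM exist for the following languages: C, C++, Objective-C, Swift, Ruby, Python, Haskell, Rust, D, PHP, Pure, Lua, Julia. <https://llvm.org>   

Compared with GCC, which is just as large but more monolithic, LLVM's modular design is easier to extend and maintain. In my view LLVM replacing GCC is therefore an inevitable trend, one that incompatibilities and bugs in special environments, or anyone's personal preference, will not change.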


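LLVM includes the following components:

* Clang, the compiler front end for C/C++/Objective-C
* LLDB, the debugger
* libc++, the C++ standard library
* compiler-rt
* MLIR
* OpenMP
* libclc
* klee
* LLD, the linker
* BOLT

Third-party:
* rustc, the compiler front end for Rust
* swiftc, the compiler front end for Swift
* codon, a compiler front end for Python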

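## Release history highlights

* Created by Chris Lattner in 2000
* LLVM 1.0 (2003): first public release
* LLVM 3.0 (2011): introduced a new JIT compiler, C++11 support, SSA-based memory-safety transformations, and a global ISel refactoring
* LLVM 3.7 (2015): OpenMP 3.1 support, Clang Static Analyzer improvements, AArch64 support
* LLVM 5.0 (2017): C++14 support, new code analysis and optimization techniques
* LLVM 9.0 (2019): C++17 support, JIT support for WebAssembly, RISC-V improvements, IR improvements
* LLVM 12.0 (2021): C++20 support, LTO optimizations, arm64e support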

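## Xcode and LLVM version mapping

The version reported by Xcode's clang is Apple's own number: it follows the Xcode major version and does not map one-to-one onto upstream LLVM releases.

|Xcode|Apple clang (LLVM)|
|-----|----|
|11.x |11|
|12.x |12|
|13.x |13|
|14.x |14|
|15.x |15|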

# Modularizing Hikari

https://github.com/lich4/indep_hikari

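Original repository: `https://github.com/61bcdefg/Hikari-LLVM15`   
This project is the foundation of my OLLVM research and groundwork for later projects.   

## Background

* The LLVM code base is huge and a single build takes hours (the original Hikari author ran into the same problem). Developing, debugging, and testing OLLVM under those conditions is painful, so I split Hikari out into a standalone dynamic library to speed up builds.
* Why not use LLVM-Pass? Hikari is not compatible with it, so it cannot be done that way, although others are working on LLVM-Pass-based OLLVM ports.
* If you only want the resulting binaries and are not doing research, just use GitHub Actions.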

## Modifying LLVM

```cpp
// Based on LLVM 19.0.0: llvm/lib/Passes/PassBuilderPipelines.cpp
...
#include <dlfcn.h>
...
ModulePassManager PassBuilder::buildO0DefaultPipeline(OptimizationLevel Level, bool LTOPreLink) {
...
invokeOptimizerLastEPCallbacks(MPM, Level);

// Added: load hikari.dylib and let it append its passes to MPM.
if (!LTOPreLink) {
    // If the library or the symbol is missing, skip instead of calling a null pointer.
    if (void* handle = dlopen("hikari.dylib", RTLD_NOW)) {
        // Mangled name of hikariAddPass(llvm::ModulePassManager&)
        void* sym = dlsym(handle, "_Z13hikariAddPassRN4llvm11PassManagerINS_6ModuleENS_15AnalysisManagerIS1_JEEEJEEE");
        auto hikariAddPass_ = reinterpret_cast<void (*)(ModulePassManager&)>(sym);
        if (hikariAddPass_)
            hikariAddPass_(MPM);
    }
}

if (LTOPreLink)
    addRequiredLTOPreLinkPasses(MPM);

MPM.addPass(createModuleToFunctionPassAdaptor(AnnotationRemarksPass()));

return MPM;
}
```

## Building

```bash
# Build LLVM first
cmake -S llvm -B Build -DLLVM_ENABLE_PROJECTS=clang -G Ninja -DCMAKE_BUILD_TYPE=Release -DLLVM_CREATE_XCODE_TOOLCHAIN=On -DCMAKE_INSTALL_PREFIX="$PWD/install"
cmake --build Build -j3
# Then build Hikari
cd Obfuscation
make -j3
cp -f hikari.dylib ../Build/lib/
```

Note: to build other components, set `LLVM_ENABLE_PROJECTS`. Currently supported values are `clang;clang-tools-extra;lld;compiler-rt;lldb;openmp;mlir`, for example `-DLLVM_ENABLE_PROJECTS="clang;lld"`.   
Note: to build the runtimes, set `LLVM_ENABLE_RUNTIMES`. Currently supported values are `libc;libunwind;libcxxabi;pstl;libcxx;compiler-rt;openmp;llvm-libgcc`.   
Note: the standalone Hikari does not yet support `-mllvm` options.

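## A First Look at LLVM IR

What is IR? IR (Intermediate Representation) is a language defined by LLVM that sits between source code and assembly, with an assembly-like syntax. Because front ends and back ends meet at the IR, it mainly serves cross-platform compilation, and it is also the level at which optimization, obfuscation, and extensions operate.   

IR reference manual: <https://llvm.org/docs/LangRef.html>

* llc: compiles bitcode to asm/obj
* lld: links multiple bitcode/obj files into a binary
* lli: bitcode interpreter
* opt: optimizes bitcode
* llvm-ar: manipulates archives
* llvm-as: converts ll to bitcode (ll is the human-readable text form of IR)
* llvm-cxxfilt: demangles C++ symbol names
* llvm-dis: converts bitcode to ll
* llvm-extract: extracts functions from bitcode
* llvm-link: merges multiple bitcode files into one
* clang -emit-llvm -c: compiles source to bitcode
* clang -emit-llvm -S: compiles source to ll

Third-party:
* `swiftc -emit-assembly /tmp/1.swift -o /tmp/1.s` compiles Swift source to assembly
* `swiftc -emit-bc /tmp/1.swift -o /tmp/1.bc` compiles Swift source to bitcode
* `swiftc -emit-ir /tmp/1.swift -o /tmp/1.ll` compiles Swift source to ll
* `cargo rustc -- --emit=asm` or `rustc --emit=asm 1.rs` compiles Rust source to assembly
* `cargo rustc -- --emit=llvm-bc` or `rustc --emit=llvm-bc 1.rs` compiles Rust source to bitcode
* `cargo rustc -- --emit=llvm-ir` or `rustc --emit=llvm-ir 1.rs` compiles Rust source to ll
* `codon build -llvm 1.py` compiles Python source to ll

Test case: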
```cpp
// 1.cpp
#include <stdio.h>
int main(int argc, char** argv) {
    printf("Hello World!\n");
    return 0;
}
```

### Cross-compiling source to bitcode/ll

```bash
# for macOS x86_64
./clang -isysroot `xcrun --sdk macosx --show-sdk-path` -arch x86_64 -emit-llvm -c /tmp/1.cpp --output=/tmp/1.bc
./clang -isysroot `xcrun --sdk macosx --show-sdk-path` -arch x86_64 -emit-llvm -S /tmp/1.cpp --output=/tmp/1.ll
# To use the clang bundled with Xcode, go through xcrun (same below). This is not recommended: different versions of clang/llc/lld/lli are incompatible with each other, and Xcode does not ship llc/lld/lli
xcrun --sdk macosx clang -arch x86_64 -emit-llvm -c /tmp/1.cpp --output=/tmp/1.bc
```

```IR
; ModuleID = '/tmp/1.cpp'
source_filename = "/tmp/1.cpp"
target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx11.3.0"

@.str = private unnamed_addr constant [14 x i8] c"Hello World!\0A\00", align 1

; Function Attrs: mustprogress noinline norecurse optnone ssp uwtable
define noundef i32 @main(i32 noundef %0, ptr noundef %1) #0 {
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  %5 = alloca ptr, align 8
  store i32 0, ptr %3, align 4
  store i32 %0, ptr %4, align 4
  store ptr %1, ptr %5, align 8
  %6 = call i32 (ptr, ...) @printf(ptr noundef @.str)
  ret i32 0
}

declare i32 @printf(ptr noundef, ...) #1

attributes #0 = { mustprogress noinline norecurse optnone ssp uwtable "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cmov,+cx16,+cx8,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "tune-cpu"="generic" }
attributes #1 = { "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="penryn" "target-features"="+cmov,+cx16,+cx8,+fxsr,+mmx,+sahf,+sse,+sse2,+sse3,+sse4.1,+ssse3,+x87" "tune-cpu"="generic" }

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.ident = !{!5}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 11, i32 3]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 8, !"PIC Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 2}
!4 = !{i32 7, !"frame-pointer", i32 2}
!5 = !{!"clang version 19.0.0git"}
```

```bash
# for iOS arm64
./clang -isysroot `xcrun --sdk iphoneos --show-sdk-path` -arch arm64 -emit-llvm -c /tmp/1.cpp --output=/tmp/1.bc
./clang -isysroot `xcrun --sdk iphoneos --show-sdk-path` -arch arm64 -emit-llvm -S /tmp/1.cpp --output=/tmp/1.ll
```

```IR
; ModuleID = '/tmp/1.cpp'
source_filename = "/tmp/1.cpp"
target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128-Fn32"
target triple = "arm64-apple-ios14.5.0"

@.str = private unnamed_addr constant [14 x i8] c"Hello World!\0A\00", align 1

; Function Attrs: mustprogress noinline norecurse optnone ssp uwtable(sync)
define noundef i32 @main(i32 noundef %0, ptr noundef %1) #0 {
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  %5 = alloca ptr, align 8
  store i32 0, ptr %3, align 4
  store i32 %0, ptr %4, align 4
  store ptr %1, ptr %5, align 8
  %6 = call i32 (ptr, ...) @printf(ptr noundef @.str)
  ret i32 0
}

declare i32 @printf(ptr noundef, ...) #1

attributes #0 = { mustprogress noinline norecurse optnone ssp uwtable(sync) "frame-pointer"="non-leaf" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="apple-a7" "target-features"="+aes,+fp-armv8,+neon,+perfmon,+sha2,+v8a,+zcm,+zcz" }
attributes #1 = { "frame-pointer"="non-leaf" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="apple-a7" "target-features"="+aes,+fp-armv8,+neon,+perfmon,+sha2,+v8a,+zcm,+zcz" }

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.ident = !{!5}

!0 = !{i32 2, !"SDK Version", [2 x i32] [i32 14, i32 5]}
!1 = !{i32 1, !"wchar_size", i32 4}
!2 = !{i32 8, !"PIC Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 1}
!4 = !{i32 7, !"frame-pointer", i32 1}
!5 = !{!"clang version 19.0.0git"}
```

### Compiling bitcode/ll to asm/obj

```bash
./llc --filetype=asm /tmp/1.ll -o /tmp/1.asm
./llc --filetype=obj /tmp/1.ll -o /tmp/1.obj
./llc --filetype=asm /tmp/1.bc -o /tmp/1.asm
./llc --filetype=obj /tmp/1.bc -o /tmp/1.obj
```

```asm
        .section        __TEXT,__text,regular,pure_instructions
        .build_version ios, 14, 5        sdk_version 14, 5
        .globl        _main                           ; -- Begin function main
        .p2align        2
_main:                                  ; @main
        .cfi_startproc
; %bb.0:
        sub        sp, sp, #32
        stp        x29, x30, [sp, #16]             ; 16-byte Folded Spill
        add        x29, sp, #16
        .cfi_def_cfa w29, 16
        .cfi_offset w30, -8
        .cfi_offset w29, -16
        stur        wzr, [x29, #-4]
        str        w0, [sp, #8]
        str        x1, [sp]
        adrp        x0, l_.str@PAGE
        add        x0, x0, l_.str@PAGEOFF
        bl        _printf
        mov        w0, #0                        ; =0x0
        ldp        x29, x30, [sp, #16]             ; 16-byte Folded Reload
        add        sp, sp, #32
        ret
        .cfi_endproc
                                        ; -- End function
        .section        __TEXT,__cstring,cstring_literals
l_.str:                                 ; @.str
        .asciz        "Hello World!\n"

.subsections_via_symbols
```

### Linking bitcode into an executable

lld is a generic linker; each platform is served by a differently named binary:

* Unix: ld.lld
* macOS: ld64.lld
* Windows: lld-link
* WebAssembly: wasm-ld

```bash
./ld64.lld -arch arm64 -platform_version ios 12.0 14.5 -dylib /tmp/1.bc -o /tmp/1.exe
```

### Running bitcode

```bash
./lli /tmp/1.ll
./lli /tmp/1.bc
# both print "Hello World!"
```

## IR instructions
```txt
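Instruction
  UnaryInstruction          unary instruction
    UnaryOperator           unary operator
    CastInst                cast
      PossiblyNonNegInst    cast that may carry the nneg (non-negative) flag
  BinaryOperator            binary operator
    PossiblyDisjointInst
  CmpInst                   comparison
  CallBase                  call
  FuncletPadInst

Class               Super               Description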
AllocaInst          UnaryInstruction    An instruction to allocate memory on the stack.
LoadInst            UnaryInstruction    An instruction for reading from memory. This uses the SubclassData
                                        field in Value to store whether or not the load is volatile.
StoreInst           Instruction         An instruction for storing to memory.
FenceInst           Instruction         An instruction for ordering other memory operations.
AtomicCmpXchgInst   Instruction         An instruction that atomically checks whether a specified value
                                        is in a memory location, and, if it is, stores a new value there.
                                        The value returned by this instruction is a pair containing the
                                        original value as first element, and an i1 indicating success
                                        (true) or failure (false) as second element.
AtomicRMWInst       Instruction         An instruction that atomically reads a memory location, combines
                                        it with another value, and then stores the result back. Returns
                                        the old value.
GetElementPtrInst   Instruction         An instruction for type-safe pointer arithmetic to access elements
                                        of arrays and structs
ICmpInst            CmpInst             This instruction compares its operands according to the predicate
                                        given to the constructor. It only operates on integers or pointers.
                                        The operands must be identical types. Represent an integer comparison
                                        operator.
FCmpInst            CmpInst             This instruction compares its operands according to the predicate
                                        given to the constructor. It only operates on floating point values
                                        or packed vectors of floating point values. The operands must be
                                        identical types. Represents a floating point comparison operator.
CallInst            CallBase            This class represents a function call, abstracting a target machine's
                                        calling convention. This class uses low bit of the SubClassData
                                        field to indicate whether or not this is a tail call. The rest
                                        of the bits hold the calling convention of the call.
SelectInst          Instruction         This class represents the LLVM 'select' instruction.
VAArgInst           UnaryInstruction    This class represents the va_arg llvm instruction, which returns
                                        an argument of the specified type given a va_list and increments
                                        that list
ExtractElementInst  Instruction         This instruction extracts a single (scalar) element from a VectorType value
InsertElementInst   Instruction         This instruction inserts a single (scalar) element into a VectorType value
ShuffleVectorInst   Instruction         This instruction constructs a fixed permutation of two input vectors.
                                        For each element of the result vector, the shuffle mask selects an
                                        element from one of the input vectors to copy to the result.
                                        Non-negative elements in the mask represent an index into the
                                        concatenated pair of input vectors. PoisonMaskElem (-1) specifies
                                        that the result element is poison. For scalable vectors, all the
                                        elements of the mask must be 0 or -1. This requirement may be
                                        relaxed in the future.
ExtractValueInst    UnaryInstruction    This instruction extracts a struct member or array element value
                                        from an aggregate value.
InsertValueInst     Instruction         This instruction inserts a struct field or array element value
                                        into an aggregate value.
PHINode             Instruction         PHINode - The PHINode class is used to represent the magical mystical
                                        PHI node, that can not exist in nature, but can be synthesized in a
                                        computer scientist's overactive imagination.
LandingPadInst      Instruction         The landingpad instruction holds all of the information necessary
                                        to generate correct exception handling. The landingpad instruction
                                        cannot be moved from the top of a landing pad block, which itself
                                        is accessible only from the 'unwind' edge of an invoke. This uses
                                        the SubclassData field in Value to store whether or not the landingpad
                                        is a cleanup.
ReturnInst          Instruction         Return a value (possibly void), from a function. Execution does
                                        not continue in this function any longer.
BranchInst          Instruction         Conditional or Unconditional Branch instruction.
SwitchInst          Instruction         Multiway switch.
IndirectBrInst      Instruction         Indirect Branch Instruction.
InvokeInst          CallBase            Invoke instruction. The SubclassData field is used to hold the
                                        calling convention of the call.
CallBrInst          CallBase            CallBr instruction, tracking function calls that may not return
                                        control but instead transfer it to a third location. The SubclassData
                                        field is used to hold the calling convention of the call.
ResumeInst          Instruction         Resume the propagation of an exception.
CatchSwitchInst     Instruction
CleanupPadInst      FuncletPadInst
CatchPadInst        FuncletPadInst
CatchReturnInst     Instruction
CleanupReturnInst   Instruction
UnreachableInst     Instruction         This function has undefined behavior. In particular, the presence
                                        of this instruction indicates some higher level knowledge that
                                        the end of the block cannot be reached.
TruncInst           CastInst            This class represents a truncation of integer types.
ZExtInst            CastInst            This class represents zero extension of integer types.
SExtInst            CastInst            This class represents a sign extension of integer types.
FPTruncInst         CastInst            This class represents a truncation of floating point types.
FPExtInst           CastInst            This class represents an extension of floating point types.
UIToFPInst          CastInst            This class represents a cast from unsigned integer to floating point.
SIToFPInst          CastInst            This class represents a cast from signed integer to floating point.
FPToUIInst          CastInst            This class represents a cast from floating point to unsigned integer.
FPToSIInst          CastInst            This class represents a cast from floating point to signed integer.
IntToPtrInst        CastInst            This class represents a cast from an integer to a pointer.
PtrToIntInst        CastInst            This class represents a cast from a pointer to an integer.
BitCastInst         CastInst            This class represents a no-op cast from one type to another.
AddrSpaceCastInst   CastInst            This class represents a conversion between pointers from one address
                                        space to another.
FreezeInst          UnaryInstruction    This class represents a freeze function that returns random concrete
                                        value if an operand is either a poison value or an undef value
```
