USYD - COMP 5046

Lecture 1

N-Gram LMs

The four ways to deal with unseen word sequences:

smoothing:


discounting:


Interpolation: combining probabilities from multiple N-gram orders

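For reference, a sketch of the standard linear-interpolation formula for a trigram LM (the λ weights are assumed to sum to 1):

$$P_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3\, P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, P(w_i \mid w_{i-1}) + \lambda_1\, P(w_i)$$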

Kneser-Ney smoothing:

Lecture 2

Lecture 3

Lecture 4

Note the formulas for the macro and micro F1 scores — assignment 2
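For reference, one standard set of definitions over K classes (macro averages the per-class F1; micro pools the counts):

$$F1_{\text{macro}} = \frac{1}{K}\sum_{k=1}^{K}\frac{2 P_k R_k}{P_k + R_k}, \qquad F1_{\text{micro}} = \frac{2 P_{\mu} R_{\mu}}{P_{\mu} + R_{\mu}}, \quad P_{\mu} = \frac{\sum_k TP_k}{\sum_k (TP_k + FP_k)}, \quad R_{\mu} = \frac{\sum_k TP_k}{\sum_k (TP_k + FN_k)}$$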

Note: the initial value in find_best_score should be negative (acting like negative infinity) — assignment 2

Directly averaging word vectors can make some features cancel out (positive and negative values), and the word-order information is lost

In the slides: NER, page 66

The number of labels × length × length means:

each token in the sequence can be a span start or a span end, which gives length × length possible spans

each span can then be scored against every label

so the output size is (number of labels × length × length)

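A minimal sketch of where that shape comes from (a hypothetical concatenation-based span scorer, not necessarily the exact model in the slides):

```python
import torch

# Score every possible (start, end) span against every label.
batch, length, hidden, num_labels = 2, 10, 16, 5
token_repr = torch.randn(batch, length, hidden)

# pair every start token with every end token -> (batch, length, length, 2*hidden)
start = token_repr.unsqueeze(2).expand(-1, -1, length, -1)
end = token_repr.unsqueeze(1).expand(-1, length, -1, -1)
span_repr = torch.cat([start, end], dim=-1)

classifier = torch.nn.Linear(2 * hidden, num_labels)
scores = classifier(span_repr)
print(scores.shape)  # (2, 10, 10, 5): labels * length * length scores per sentence
```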

Lecture 5

beam search

By nature, greedy search picks the locally optimal choice at each step; beam search instead keeps the top-k partial hypotheses at every step
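A toy sketch of the difference (the per-step scores are assumed independent here for simplicity; a real decoder conditions each step on the chosen prefix):

```python
import math

def beam_search(step_scores, beam_size=2):
    """step_scores[t] maps token -> log-probability at step t."""
    beams = [([], 0.0)]                      # (prefix, total log-probability)
    for scores in step_scores:
        candidates = [(prefix + [tok], logp + tok_logp)
                      for prefix, logp in beams
                      for tok, tok_logp in scores.items()]
        # greedy search would keep only the single best candidate here
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]

steps = [{"a": math.log(0.6), "b": math.log(0.4)},
         {"a": math.log(0.5), "b": math.log(0.5)}]
print(beam_search(steps))
```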

Lecture 6

The label probabilities for the current word come purely from the LSTM: B/I/O probabilities (summing to 1)

score = current label probability × best(previous score × transition probability)

This comes from:

score = current label probability × best(DP value × transition model)

DP refers to the optimal route found so far by dynamic programming

the transition model refers to the Markov model (transition probabilities between labels)
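A toy numeric sketch of that recurrence (made-up emission and transition probabilities, three labels):

```python
import numpy as np

# score[t][j] = emission[t][j] * max_i(score[t-1][i] * transition[i][j])
labels = ["B", "I", "O"]
emission = np.array([[0.7, 0.1, 0.2],    # per-token label probabilities from the LSTM
                     [0.2, 0.6, 0.2],
                     [0.1, 0.1, 0.8]])
transition = np.array([[0.1, 0.8, 0.1],  # Markov transition probabilities between labels
                       [0.1, 0.5, 0.4],
                       [0.4, 0.1, 0.5]])

score = emission[0].copy()
back = []
for t in range(1, len(emission)):
    cand = score[:, None] * transition      # previous score * transition probability
    back.append(cand.argmax(axis=0))        # best previous label for each current label
    score = emission[t] * cand.max(axis=0)  # current label prob * best(previous * transition)

# follow the back-pointers to recover the best label sequence
best = [int(score.argmax())]
for bp in reversed(back):
    best.append(int(bp[best[-1]]))
print([labels[i] for i in reversed(best)])
```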

coreference

dependency parsing

Lecture 7

Lecture 8

disadvantages of self-attention

problem:

lack of non-linearity

lack of word-order information

solutions:

feedforward layer (adds non-linearity)

additional position vectors (positional encoding)

issue: the number of position vectors is fixed, so it is not flexible for longer sequences
The mask is for ignoring future words during training; it lets the model know what is already known at this step and what is to be predicted at this step. Mathematically, future positions are set to negative infinity before the softmax, so they get zero weight in the attention over the current word.
Multiple layers of self-attention are stacked to build better representations of each word.

query: what the current word is looking for

key: the embedding each word offers for matching against the query

value: the vector that gets passed on, weighted by the attention scores
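A minimal sketch of masked scaled dot-product attention in the standard form (toy sizes, single head):

```python
import torch
import torch.nn.functional as F

seq_len, d = 4, 8
Q = torch.randn(seq_len, d)   # queries: what each position is looking for
K = torch.randn(seq_len, d)   # keys: what each position offers for matching
V = torch.randn(seq_len, d)   # values: what is passed on, weighted by attention

scores = Q @ K.T / d ** 0.5
# causal mask: future positions get -inf so the softmax gives them zero weight
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))
weights = F.softmax(scores, dim=-1)
output = weights @ V          # each position attends only to itself and earlier positions
```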

Lecture 10

Data can come from:

FineWeb

Common Crawl

Lecture 11

Review

The content below covers the week with missing slides.

Parsing (syntactic analysis): identify the grammatical structure via sub-phrases

A span in the sentence is called a phrase

treebanks

Method: Dependency Parsing / Grammar

compute every possible dependency to find the best score

Or, choose a way to get the best overall score

co-reference

USYD - COMP 5329

COMP 5329

Lecture 2

P9 XOR

P31 Activation

P36 CE

P37 KL, Entropy

P38 SoftMax

p36, p37 two attributes cross-entropy loss issue?
The disadvantage of batch gradient descent: when N is too large, the computation is very expensive. Single-example SGD uses only one example per update, so the update direction may not be the best because of the randomness.

Mini-batch SGD: divide the data into mini-batches and compute the gradient on each mini-batch within every epoch
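A toy sketch of that loop (made-up linear-regression data and sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w, lr, batch_size = np.zeros(5), 0.1, 32

for epoch in range(10):
    order = rng.permutation(len(X))              # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * grad                           # gradient step on the mini-batch
```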

extension content:

Sensitivity? Writes backpropagation in a form similar to the feedforward pass

Lecture 3 GD

Different gradient descent.

Challenges: choosing a proper learning rate, using the same learning rate for all parameters, saddle points

P9-10 Momentum

P11 NAG

P17 Adagrad

P21 Adadelta

P25 RMSprop

P26, 27 Adam

P34 Initialization

Batch Gradient descent: Accuracy

Stochastic gradient descent: efficiency

Mini-batch gradient descent: trade-off of accuracy and efficiency

Momentum modifies SGD to accelerate convergence and dampen oscillation

the momentum term increases for dimensions whose gradients keep the same direction, and decreases for dimensions whose gradients change direction

NAG:

The big jump in the standard momentum is too aggressive

So Nesterov accelerated gradient uses the previously accumulated gradient to make the big jump first, then applies a correction


θ − γ·v_{t−1} means updating according to the previous step first: the big jump based on the previously accumulated gradient.

Then the gradient is computed at that look-ahead point and combined with the previously accumulated gradient to update θ.
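In the standard notation (γ is the momentum coefficient, η the learning rate):

$$\text{Momentum: } v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta), \qquad \theta \leftarrow \theta - v_t$$

$$\text{NAG: } v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1}), \qquad \theta \leftarrow \theta - v_t$$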

Adagrad:

achieves a different learning rate for each parameter dimension

i means dimension

t means iteration

suitable for sparse data

but a global learning rate is still needed
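The standard Adagrad update for dimension i at iteration t (G accumulates the squared gradients):

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}}\, g_{t,i}, \qquad G_{t,ii} = \sum_{\tau \le t} g_{\tau,i}^2$$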

Adadelta:


To deal with the problem of Adagrad (the accumulated squared gradients keep growing, so the effective learning rate eventually becomes infinitesimally small), the g_t accumulation is modified into an exponentially decaying average: a weighted combination of the previous running average of squared gradients and the current squared gradient.

H is the second-order gradient (the Hessian)

which gives the update consistent units
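The standard Adadelta form (a decaying average of squared gradients, with an RMS of past updates in the numerator to keep the units consistent):

$$E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho)\, g_t^2, \qquad \Delta\theta_t = -\frac{\mathrm{RMS}[\Delta\theta]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t$$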

Lecture 4 Norm

P10 weight decay

To do: write out the structure of the normalization methods

inverted dropout, DropConnect

BN (reduces internal covariate shift)

R denotes the regularisation function; r on page 6 is the upper bound on parameter complexity

Page 24 in the slides: the bottom two are independent methods

the group in the top-right corner is the normal training process (i.e. training only on the training set)

Dropout

scale down

during training, each unit is kept or dropped with some probability:

present with probability p

dropped with probability 1 − p

at test time, all units are always present

the layer weights w are multiplied by p (scaled down)

Inverted dropout

see slide

M in slide 36 is a binary mask matrix (entries 1 or 0)
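A minimal numpy sketch of the two variants (here p is the keep probability; toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
p, h = 0.8, rng.normal(size=(4, 16))               # activations of one layer

# standard dropout: drop at train time, scale by p at test time
mask = (rng.random(h.shape) < p).astype(h.dtype)   # binary mask M of 0/1
h_train = h * mask
h_test = h * p

# inverted dropout: scale by 1/p at train time, do nothing at test time
h_train_inv = h * mask / p
h_test_inv = h
```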

Batch norm adds an extra linear (affine) transform after normalizing each layer's output: gamma is the scale parameter, beta is the shift parameter

Batch norm: normalizes over all examples in the batch, separately for each channel

Layer norm: normalizes within each individual sample (over all of its channels/features)

Instance norm: normalizes within one example and one channel

Group norm: normalizes over a group of channels within one example (the channels are split into multiple groups)
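A sketch using PyTorch's built-in layers on a (batch, channels, height, width) tensor (toy sizes):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 32, 14, 14)

bn = nn.BatchNorm2d(32)          # statistics over (batch, H, W), per channel
ln = nn.LayerNorm([32, 14, 14])  # statistics over (C, H, W), per sample
inorm = nn.InstanceNorm2d(32)    # statistics over (H, W), per sample and per channel
gn = nn.GroupNorm(4, 32)         # statistics over (channels in a group, H, W), per sample

for layer in (bn, ln, inorm, gn):
    assert layer(x).shape == x.shape
```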

Lecture 5 CNN

P46 different sorts of pooling

Spectral pooling?

Un-pooling uses the (max) location information to reverse the pooling, keeping 0 at positions where information is missing

Earlier layers capture simpler information (lines, colour blocks)

Higher layers capture more meaningful (semantic) information

Transposed convolution, also called deconvolution, enlarges the feature map until it matches the input size

In the deconvolution process the output becomes larger; outputs overlap and the overlapping regions are summed. It also has a stride and a crop (like padding, but it crops data from the output). Output size = (N − 1)·S + K − 2C
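A quick check of that formula with PyTorch (hypothetical sizes; ConvTranspose2d's padding plays the role of the crop C):

```python
import torch
import torch.nn as nn

# Check the output-size formula (N - 1) * S + K - 2C with a toy example.
N, S, K, C = 5, 2, 3, 1                        # input size, stride, kernel, crop/padding
deconv = nn.ConvTranspose2d(1, 1, kernel_size=K, stride=S, padding=C)
out = deconv(torch.randn(1, 1, N, N))
print(out.shape[-1], (N - 1) * S + K - 2 * C)  # both 9
```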

In PyTorch, use nn.Sequential in __init__ rather than defining every layer separately, as in the tutorial notebook.
The kernel in a CNN works like a template (e.g. for a digit), tut 6 notebook.

It is like the mouse-template example in COMP 5318: the kernel output gives a bigger score where the pattern matches.

Lecture 7

Dead ReLU means the ReLU activation outputs zero (its gradient is then also zero, so the unit stops updating)

Data augmentation manipulates the data with methods such as rotation, distortion, and colour changes.

In local response normalization, a is the center of the region

Overlapping pooling: the pooling stride is less than the pooling kernel size

Smaller filters mean a deeper network, more non-linearity, and fewer parameters.

The padding p changes to keep the output size the same in GoogLeNet

P28: the right side is the GoogLeNet (Inception) structure and the left side is the naive way; the right method helps decrease the depth of the feature maps.

P37: label smoothing is used to avoid the model becoming overconfident

A 1×1 conv layer is used to exchange information among channels (?)

ShuffleNet is another way to exchange information across channels

Extension

Ghost filter?

Lego filter?

adder filter?

Lecture 8

Lecture 9

The masked LM in BERT is used to predict the masked word via a linear layer followed by a softmax over the vocabulary

Classification in BERT uses a binary classifier to identify whether the second sentence follows the first (next sentence prediction)

P49: the dot product is between the classifier output and the token representation; there are multiple numbers because the model has multiple classifiers, and the results are passed into a softmax to make the final prediction

In the code, the gate computation splits one big matrix into multiple chunks and passes one chunk into each gate.
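A minimal sketch of that chunking pattern for an LSTM cell (hypothetical sizes, not the lecture's code):

```python
import torch
import torch.nn as nn

hidden = 16
x, h, c = torch.randn(1, 10), torch.zeros(1, hidden), torch.zeros(1, hidden)
proj = nn.Linear(10 + hidden, 4 * hidden)   # one matrix for all four gates

gates = proj(torch.cat([x, h], dim=-1))
i, f, g, o = gates.chunk(4, dim=-1)         # split into input, forget, cell, output gates
i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
c = f * c + i * g                           # new cell state
h = o * torch.tanh(c)                       # new hidden state
```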

Lecture 10

P26: if the degree of node j is large, the information passed from node j to node i is very small, because the denominator is √(D_jj · D_ii).

The rule: D^{-1/2} A D^{-1/2} (the connections) · H (the value/features of each neighbour) · W (the parameters)
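Written out in the standard GCN layer form (with self-loops added, so Ã = A + I and D̃ is its degree matrix):

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\, H^{(l)}\, W^{(l)}\right)$$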

The Laplacian is used to reduce the shape of the matrix; in the spectral approach the matrix is quite large

Each row and each column of the Laplacian matrix sums to zero


The eigenvector matrix is used to transform the original matrix, similar to PCA, focusing on the important parts of the matrix

Lecture 11

The detected bounding boxes have many different sizes, so resizing the image regions is necessary for the CNN. P16.

The proposals may overlap, which duplicates computation. P17

Fast R-CNN: instead of using image-level input as on P16, the fast way uses the CNN features as input to the bounding-box task. P20

But the problem of different sizes still exists; Fast R-CNN uses a (max) pooling approach to fix the size of the output.

Faster R-CNN: like SPP-net, it improves on Fast R-CNN at the extra pooling layer. In this model, the classification, detection, bounding-box, etc. tasks are all handled by different networks.

In other words, the essence of "faster" here is replacing object detection on the original image with object detection on the feature map, which reduces the computational cost.

Mask R-CNN: determines which label each pixel in the image belongs to.

RoIAlign makes the pooling layer more accurate: it divides the region evenly without rounding to whole cells, and uses the distances to each sampled point (bilinear interpolation) to decide the values that go into the max pooling.

Besides predicting x, y, h, w, the offsets dx, dy, dh, dw (differences from a reference box) are also included in the prediction; the reference box is called an anchor.

Lecture 12

The discriminator wants a high value of D(x) and to maximize (1 − D(G(z)))
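The standard minimax objective this corresponds to:

$$\min_G \max_D \; \mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z\sim p_z}[\log(1 - D(G(z)))]$$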

Lecture 13

diffusion

GAN: the input is random noise

f_θ(X) = Y

Y is the image

X is the random input

f(·) is the neural network; it maps the transformation between X and Y, from a normal distribution to the distribution of images (a special set)

The reason MSE is not used: there is no meaningful relationship between the noise and the image, so there would be no useful gradient; also the model would only generate images it has seen, never unseen ones

Generative adversarial model: two neural networks, a generator (produces fake images) and a classifier/discriminator (checks real vs. fake)

The generator learns from the discriminator how to generate realistic images

Training is finished when the discriminator can no longer tell real from fake

VAE: encode the image, then decode it

After decoding, compute the loss against the original

Reparameterization (trick)

Diffusion: a denoising model, analogous to the law of cooling

X → X_T (pure Gaussian noise)

X → X + θ → X_T (Gaussian noise)

X → … → X + θ·n → X_T (Gaussian noise)

Add a little noise each time, until it becomes pure noise

The noise added each time is Gaussian

The loss is to detect (predict) the amount of noise in the image

Reversed, this means predicting the image from the noise

Diffusion uses only one neural network; the other models use several
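In the standard DDPM notation, the forward (noising) step and the noise-prediction loss look like:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right), \qquad \mathcal{L} = \mathbb{E}_{x_0,\epsilon,t}\,\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2$$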

USYD - DATA 5207

DATA 5207

Lecture 2

presenting data to non-experts (visualization)

they have less technical knowledge

making the data engaging

conveying the pattern

Data Graphics

3 considerations:

what information you want to communicate

who the target audience is

why this design feature is relevant

Lecture 3

Confounding factors: earnings by height may actually be driven by gender

select the topic this weekend

RMD template

Lecture 4

A higher R-squared means the model does a better job, i.e. the model fits better

maximize the variables in the model

do not only use technical methods to fit the data; also add theory to increase its usefulness in practice

observational data cannot support causal inference (confounding factors are included)

model error

random errors (precision limitation - sample number)

systematic error (error in research design - non-sampling error)

difference between observed and actual

response, instrument, interviewer

sample design error

selection, frame

explore dataset graphs

variables

model choice

correlation plot of all variables

variable selection methods - stepwise, lasso, or

Lecture 5

a limitation of linear regression is the assumption that the relationship is linear

logits?

ordinal logistic regression (agree, strongly agree, etc.)

For the lab 5 material, the last image can be redrawn so that, for each year, the importance of each variable (independent factor) is plotted in a single panel.

Lecture 6

fuzzyjoin is an R package that does operations similar to joins in SQL, but with inexact matching

Research Plan Format

Format

Hide the R chunks; the template has the code to hide them

key feature should be identified

why use the LR

Literature

theory from Literature

the hypothesis is for testing; is it what the previous section provided?

literature: informs and communicates the hypothesis you provide

inform the things you need to do

underpin the thing you want to explain

there may be errors in the literature section that falsify the idea

Data

api missing

operation

Limitation

can be deleted

Lecture 7 Quiz week

Data can come from:

consumer data

social media

AB testing (to decide the better version in different versions)

for instance, different colours and sizes of a button may be sent to users, and the number of clicks decides which version is better

census

surveys of individuals

web scraping

Lecture 8

survey

systematic error - not random

younger people may be more likely to respond to a phone survey - nonresponse bias

The census is not like a survey: it does not sample the data, it covers the entire resident population of the country

random error refers to sampling

Assignment-1

  1. Economic: Q288, 50 (income)
    1. 1
  2. Occupation: Q281, 282, 283
  3. Education: Q275
    1. https://d1wqtxts1xzle7.cloudfront.net/49101438/18.01.053.20160304-libre.pdf?1474797472=&response-content-disposition=inline%3B+filename%3DEffect_of_Education_on_Quality_of_Life_a.pdf&Expires=1713509378&Signature=bTvJ0cklHa83ixDEhUTW02gYB4KW0iex7Mx6etlJqBNha-f0l-gvirWcVjlpbtaXdn5SsFoSsWtjeay-18z5De6i3e2wRtZvtx5cuzyJe2RLJHKYPPXrkiEORhb9c35JK-WjFa7T8c8OIQj5RxD11Gj3W7wCsC3jJwVOewTDYwkVBKXC1-7BjpWcbOSrkZnazJwulzVzLIERo0l6iO51LIqFi6wY8TSiTTdFGhiHctf9bu2Y7IapgVAwDLKXbpYTdXd3c4nVMPqQryYQ5iOjKVEmcCdMQwn0HUGe837Dn38-7ttCIbNASUOgpjEGQEjmNlznMsOW9jG~X9VHjw__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA
    2. https://www.sciencedirect.com/science/article/pii/S2214804314001153?via%3Dihub
  4. Societal Wellbeing: Q47 (health)
  5. Security: Q131-138, 52
  6. Social capital, trust: Q57-61
    1. https://link.springer.com/article/10.1007/s00148-007-0146-7 (neighbourhood only)

PCA to combine multiple variables into one feature
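A minimal sketch of that idea (shown in Python with scikit-learn for illustration, with hypothetical variable names; in R, prcomp does the same):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# hypothetical survey items to be combined into a single index (e.g. Q57-Q61 trust items)
X = np.random.default_rng(0).normal(size=(200, 5))

X_std = StandardScaler().fit_transform(X)        # standardise before PCA
pca = PCA(n_components=1)
trust_index = pca.fit_transform(X_std).ravel()   # first principal component as the combined feature
print(pca.explained_variance_ratio_)
```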

Lecture 10

kable library in R

Lecture 11 - Causality

Lecture 12 - Journalism

datasplash platform

Final Project

the plots and tables can be included in the report

use a table for the regression results (the kable function)

USYD - INFO 5992

INFO 5992

Lecture 1

Three advantages and three disadvantages of 5G wireless connections over 4G:

  1. More capacity for simultaneous device connections

  2. Higher transmission speed compared to 4G

  3. Lower latency

  4. The price of a 5G IoT module is almost double that of 4G, an economic challenge despite the benefits in other areas

  5. 5G coverage is weaker than 4G, and much time is still needed to build out an extensive coverage area

  6. Compatibility may become a problem in the future, because out-of-date devices may not support the new standard (5G)

Q1: Yes.

  1. The benefits of 5G (speed, network capacity, lower latency, etc.) are highly suitable for many organizations, especially IT-based ones.
  2. The development of 5G is unstoppable; the cost, coverage and other weaknesses will be solved in the near future. Therefore, end users, organizations and governments will embrace the network evolution (i.e. 5G).

Q2: The price will be lower.

  1. The cost of the 5G network will fall as the infrastructure and technology mature, and this will be reflected in the market price.
  2. The number of 5G users will increase gradually, which means the cost of each 5G station will be spread across more users; this also reduces the cost of using the 5G network.

Q3: Yes.

  1. 5G breaks many physical limitations, for instance latency. In clinical practice, one of the biggest limitations of remote operations is network delay; the low-latency 5G network can reduce this limitation and increase feasibility.
  2. Also, the high speed of 5G makes large-capacity meetings possible.

Q4: No.

In my view, 5G is faster than 4G, but it is a development of the network rather than a fundamental change to how the network works.

Lecture 2

  1. Diffusion of innovation
    1. Innovation development process
  2. Technology adoption lifecycle model
  3. Dominant Design

Lecture 3

  1. Disruptive innovation
  2. Innovator’s dilemma

Lecture 4

API business model

API as product

API promoting means making the main business more popular

API enhancing means making the functionality better

An API may be promoting but not enhancing, but it cannot be enhancing without being promoting

Lecture 5

Types of Crowdsourcing P14


  1. put information and data on the platform
  2. can compare with others’ solutions (10% better etc.)
  3. creative things
  4. basic level of human intelligence

Lecture 6

user innovation: innovations that come from users or customers (companies, B2B), driven by unfulfilled requirements

Lecture 7

customer pivoting

solve a problem for a certain customer segment, then solve another problem for the same people

business pivoting

solve different problems

Lecture 8

value proposition